This writeup was written right after I took the exam on 2025-05-14, so judge its continued relevance for yourself.
Preface
Although the delta lake 101 series only got started because I was going to take the Databricks Certified Data Engineer Professional exam, I haven't updated the blog since posting in early April 😅.
The respectable excuse is certification prep, but honestly I also had too many personal activities outside of studying to find time to write 🫠. Today I finally passed, though, which at least redeems the guilt of not posting, so while the feeling is fresh, let me share how to prepare!
Pre-exam preparation
The lazy version: just diligently work through the questions on ExamTopics – Databricks Certified Data Engineer Professional Exam!
Because Databricks is built on delta lake plus Spark, I believe the exam will be very painful to read through without hands-on experience with these two tools. I haven't written much, but you can still take a look at my earlier Spark articles; I also still owe a post on the Spark Structured Streaming topics that came up in the questions.
For delta lake and Databricks, I suggest working through the key documents listed in Amrit-Hub's GitHub repo to learn the relevant concepts, such as VACUUM and Auto Loader (see the sketch below).
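To make those two concepts concrete, here's a minimal sketch of my own (not from any official doc). Auto Loader's cloudFiles source only exists on Databricks, so this assumes a Databricks notebook where `spark` is predefined; every path and table name below is hypothetical:

```python
# Minimal Auto Loader sketch ("cloudFiles" is a Databricks-proprietary source,
# so this runs only on Databricks). All paths and table names are hypothetical.
query = (spark.readStream
    .format("cloudFiles")                                  # Auto Loader source
    .option("cloudFiles.format", "json")                   # format of incoming files
    .option("cloudFiles.schemaLocation", "/tmp/_schemas")  # where the inferred schema is tracked
    .load("/mnt/raw/events")                               # cloud storage landing directory
    .writeStream
    .option("checkpointLocation", "/tmp/_checkpoints/events")
    .trigger(availableNow=True)                            # process the backlog, then stop
    .toTable("bronze_events"))
query.awaitTermination()

# VACUUM physically deletes data files no longer referenced by the table's
# recent versions; 168 hours (7 days) is the default retention window.
spark.sql("VACUUM bronze_events RETAIN 168 HOURS")
```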
Finally, I strongly suggest successfully running a Spark + delta lake program on your local machine at least once; if you never run it yourself, how will you ever get a feel for Spark crunching big data? Databricks itself is another story, since the execution environment is a bit hard to come by XD, but you still need to know the Databricks basics: Job, Task, DAG, notebook, DLT, and so on. A local quickstart sketch follows below.
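For reference, this is roughly the smallest local setup I'd start from (my own sketch, assuming `pip install delta-spark` with a version matching your PySpark; the paths are placeholders):

```python
# Local Spark + Delta Lake quickstart. Requires: pip install delta-spark
# (it declares a compatible pyspark). The /tmp path is just an example.
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

builder = (
    SparkSession.builder.appName("delta-local")
    .master("local[*]")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a Delta table, read it back, and inspect its transaction history.
spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta/demo")
spark.read.format("delta").load("/tmp/delta/demo").show()
spark.sql("DESCRIBE HISTORY delta.`/tmp/delta/demo`").show(truncate=False)
```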
The exam on 2025-05-14
There were 65 questions in total, 5 of which are unscored (no telling which 5). Only 1 question did not appear on ExamTopics, and roughly 9 questions were the kind where the ExamTopics answers are disputed, like question 131:
An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.
For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.
Which solution meets these requirements?
- A. Iterate through an ordered set of changes to the table, applying each in turn to create the current state of the table; log the change type (insert, update, or delete), timestamp of change, and the values.
- B. Use merge into to insert, update, or delete the most recent entry for each pk_id into a table, then propagate all changes throughout the system.
- C. Deduplicate records in each batch by pk_id and overwrite the target table.
- D. Use Delta Lake’s change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.

The majority on this question votes D, but I think the minority answer B is right; my reading is that Delta Lake's change data feed emits changes from Delta tables themselves, so it can't "automatically process CDC data from an external system", whereas MERGE INTO handles exactly this upsert-by-pk_id pattern.
When you hit this kind of disputed question, I recommend digging through the official delta lake, Databricks, or Spark docs to settle it yourself. In the end I picked B on the exam (though I have no idea whether it was marked correct 🤣).
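For what it's worth, here's how I understand the pattern behind answer B (my own sketch; every table, column, and variable name is made up). Each hourly batch is appended in full to a history table for the audit requirement, then deduplicated to the latest change per pk_id and merged into the current-state table for analytics:

```python
# Sketch of answer B's pattern (hypothetical names throughout). Assumes `spark`
# is a Delta-enabled SparkSession, `cdc_batch` is the hour's raw CDC records
# with columns pk_id, change_type, changed_at, ..., and the target table
# `silver_current` already exists.
from delta.tables import DeltaTable
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1) Audit requirement: keep every value that has ever been valid.
cdc_batch.write.format("delta").mode("append").saveAsTable("bronze_cdc_history")

# 2) Analytics requirement: keep only the most recent change per pk_id,
#    since a record may have changed multiple times within the hour.
w = Window.partitionBy("pk_id").orderBy(F.col("changed_at").desc())
latest = (cdc_batch
          .withColumn("rn", F.row_number().over(w))
          .filter("rn = 1")
          .drop("rn"))

# 3) MERGE the deduplicated batch into the current-state table.
(DeltaTable.forName(spark, "silver_current")
    .alias("t")
    .merge(latest.alias("s"), "t.pk_id = s.pk_id")
    .whenMatchedDelete(condition="s.change_type = 'delete'")
    .whenMatchedUpdateAll(condition="s.change_type != 'delete'")
    .whenNotMatchedInsertAll(condition="s.change_type != 'delete'")
    .execute())
```

Appending the raw batch first is what satisfies "a full record of all values that have ever been valid", while the MERGE keeps only the most recent value per record.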
After submitting your answers
You learn your result immediately. One thing Databricks does better than the Google Data Engineer exam: instead of a bare pass or fail, you get a per-topic score breakdown. For example, here's mine.
Topic Level Scoring:
- Tooling: 91%
- Data Processing: 100%
- Data Modeling: 91%
- Security: 100%
- Monitoring: 83%
- Testing and Deployment: 100%
Result: PASS
Conclusion
Read ExamTopics – Databricks Certified Data Engineer Professional Exam, practice Spark + delta lake, use Amrit-Hub's GitHub to learn the Databricks basics, and finally, believe in yourself!
I hope this article helps anyone planning to take the Databricks Certified Data Engineer Professional certification!