
Databricks-Certified-Professional-Data-Engineer Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

This table is partitioned by the date column. A query is run with the following filter:

longitude < 20 and longitude > -20

Which statement describes how data will be filtered?

Options:

A.

Statistics in the Delta Log will be used to identify partitions that might include files in the filtered range.

B.

No file skipping will occur because the optimizer does not know the relationship between the partition column and the longitude.

C.

The Delta Engine will use row-level statistics in the transaction log to identify the files that meet the filter criteria.

D.

Statistics in the Delta Log will be used to identify data files that might include records in the filtered range.

E.

The Delta Engine will scan the parquet file footers to identify each row that meets the filter criteria.

Questions 5

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in that directory, and automatically evolve the schema of the table when new fields are detected.

The function is displayed below with a blank:

Which response correctly fills in the blank to meet the specified requirements?

Options:

A.

Option A

B.

Option B

C.

Option C

D.

Option D

E.

Option E
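For context, a minimal Auto Loader helper of the kind described in this question might look like the sketch below. The path, table name, and checkpoint location are illustrative assumptions, not the code hidden behind the blank in the exam item.

def ingest_json(source_dir, table_name, checkpoint_path):
    # Incrementally ingest JSON with Auto Loader, inferring and tracking the schema
    return (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is stored
        .load(source_dir)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        .option("mergeSchema", "true")  # evolve the target table when new fields appear
        .table(table_name))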

Questions 6

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and the team wants to develop and test against data that is as similar to production data as possible.

A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.

Which statement captures best practices for this situation?

Options:

A.

Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.

B.

All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

C.

In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

D.

Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.

Questions 7

Where in the Spark UI can one diagnose a performance problem induced by not leveraging predicate push-down?

Options:

A.

In the Executor's log file, by grepping for "predicate push-down"

B.

In the Stage's Detail screen, in the Completed Stages table, by noting the size of data read from the Input column

C.

In the Storage Detail screen, by noting which RDDs are not stored on disk

D.

In the Delta Lake transaction log, by noting the column statistics

E.

In the Query Detail screen, by interpreting the Physical Plan

Questions 8

A data engineer needs to install the PyYAML Python package within an air-gapped Databricks environment. The workspace has no direct internet access to PyPI. The engineer has downloaded the .whl file locally and wants it available automatically on all new clusters.

Which approach should the data engineer use?

Options:

A.

Upload the PyYAML .whl file to the user home directory and create a cluster-scoped init script to install it.

B.

Upload the PyYAML .whl file to a Unity Catalog Volume, ensure it’s allow-listed, and create a cluster-scoped init script that installs it from that path.

C.

Set up a private PyPI repository and install via pip index URL.

D.

Add the .whl file to Databricks Git Repos and assume automatic installation.

Questions 9

A platform team lead is responsible for automating SQL Warehouse usage attribution across business units. They need to identify warehouse usage at the individual user level and share a daily usage report with an executive team that includes business leaders from multiple departments.

How should the platform lead generate an automated report that can be shared daily?

Options:

A.

Use system tables to capture audit and billing usage data and share the queries with the executive team for manual execution.

B.

Use system tables to capture audit and billing usage data and create a dashboard with a daily refresh schedule shared with the executive team.

C.

Restrict users from running SQL queries unless they provide query details for attribution tracking.

D.

Let users run queries normally and have individual teams manually report usage to the executive team.

Questions 10

A data engineering team needs to implement a tagging system for their tables as part of an automated ETL process, and needs to apply tags programmatically to tables in Unity Catalog.

Which SQL command adds tags to a table programmatically?

Options:

A.

ALTER TABLE table_name SET TAGS ('key1' = 'value1', 'key2' = 'value2');

B.

APPLY TAGS ON table_name VALUES ('key1' = 'value1', 'key2' = 'value2');

C.

COMMENT ON TABLE table_name TAGS ('key1' = 'value1', 'key2' = 'value2');

D.

SET TAGS FOR table_name AS ('key1' = 'value1', 'key2' = 'value2');
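For reference, Unity Catalog documents tag assignment on tables through ALTER TABLE ... SET TAGS; a minimal sketch (the table name and tag values are illustrative):

spark.sql("""
    ALTER TABLE main.sales.orders
    SET TAGS ('quality' = 'gold', 'owner' = 'data-eng')
""")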

Questions 11

The view updates represents an incremental batch of all newly ingested data to be inserted or updated in the customers table.

The following logic is used to process these records.

MERGE INTO customers
USING (
  SELECT updates.customer_id AS merge_key, updates.*
  FROM updates
  UNION ALL
  SELECT NULL AS merge_key, updates.*
  FROM updates JOIN customers
  ON updates.customer_id = customers.customer_id
  WHERE customers.current = true AND updates.address <> customers.address
) staged_updates
ON customers.customer_id = merge_key
WHEN MATCHED AND customers.current = true AND customers.address <> staged_updates.address THEN
  UPDATE SET current = false, end_date = staged_updates.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, address, current, effective_date, end_date)
  VALUES (staged_updates.customer_id, staged_updates.address, true, staged_updates.effective_date, null)

Which statement describes this implementation?

Options:

A.

The customers table is implemented as a Type 2 table; old values are overwritten and new customers are appended.

B.

The customers table is implemented as a Type 1 table; old values are overwritten by new values and no history is maintained.

C.

The customers table is implemented as a Type 2 table; old values are maintained but marked as no longer current and new values are inserted.

D.

The customers table is implemented as a Type 0 table; all writes are append only with no changes to existing values.

Questions 12

A production workload incrementally applies updates from an external Change Data Capture feed to a Delta Lake table as an always-on Structured Stream job. When data was initially migrated for this table, OPTIMIZE was executed and most data files were resized to 1 GB. Auto Optimize and Auto Compaction were both turned on for the streaming production job. Recent review of data files shows that most data files are under 64 MB, although each partition in the table contains at least 1 GB of data and the total table size is over 10 TB.

Which of the following likely explains these smaller file sizes?

Options:

A.

Databricks has autotuned to a smaller target file size to reduce duration of MERGE operations

B.

Z-order indices calculated on the table are preventing file compaction

C.

Bloom filter indices calculated on the table are preventing file compaction

D.

Databricks has autotuned to a smaller target file size based on the overall size of data in the table

E.

Databricks has autotuned to a smaller target file size based on the amount of data in each partition

Questions 13

An upstream source writes Parquet data as hourly batches to directories named with the current date. A nightly batch job runs the following code to ingest all data from the previous day as indicated by the date variable:

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order.

If the upstream system is known to occasionally produce duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

Each write to the orders table will only contain unique records, and only those records without duplicates in the target table will be written.

B.

Each write to the orders table will only contain unique records, but newly written records may have duplicates already present in the target table.

C.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, these records will be overwritten.

D.

Each write to the orders table will only contain unique records; if existing records with the same key are present in the target table, the operation will fail.

E.

Each write to the orders table will run deduplication over the union of new and existing records, ensuring no duplicate records are present.

Questions 14

A streaming video analytics team ingests billions of events daily into a Unity Catalog-managed Delta table video_events. Analysts run ad-hoc point-lookup queries on columns like user_id, campaign_id, and region. The team manually runs OPTIMIZE video_events ZORDER BY (user_id, campaign_id, region), but still sees poor performance on recent data and dislikes the operational overhead. The team wants a hands-off way to keep hot columns co-located as query patterns evolve. Which approach should the team use?

Options:

A.

Schedule OPTIMIZE/ZORDER to run after each job to improve recent file performance.

B.

Enable Delta caching.

C.

Utilize Liquid Clustering (CLUSTER BY AUTO) and Predictive Optimization.

D.

Enable auto-compaction (optimizeWrite and autoCompact).
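For background, automatic liquid clustering is enabled declaratively on the table rather than through scheduled OPTIMIZE/ZORDER jobs; a hedged sketch, reusing the table name from the question:

# Let Databricks choose and evolve clustering keys based on observed query patterns
spark.sql("ALTER TABLE video_events CLUSTER BY AUTO")

Predictive Optimization then schedules the maintenance operations (such as OPTIMIZE) in the background, so no manual compaction jobs are needed.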

Questions 15

A task orchestrator has been configured to run two hourly tasks. First, an outside system writes Parquet data to a directory mounted at /mnt/raw_orders/. After this data is written, a Databricks job containing the following code is executed:

(spark.readStream
    .format("parquet")
    .load("/mnt/raw_orders/")
    .withWatermark("time", "2 hours")
    .dropDuplicates(["customer_id", "order_id"])
    .writeStream
    .trigger(once=True)
    .table("orders")
)

Assume that the fields customer_id and order_id serve as a composite key to uniquely identify each order, and that the time field indicates when the record was queued in the source system. If the upstream system is known to occasionally enqueue duplicate entries for a single order hours apart, which statement is correct?

Options:

A.

The orders table will not contain duplicates, but records arriving more than 2 hours late will be ignored and missing from the table.

B.

The orders table will contain only the most recent 2 hours of records and no duplicates will be present.

C.

All records will be held in the state store for 2 hours before being deduplicated and committed to the orders table.

D.

Duplicate records enqueued more than 2 hours apart may be retained and the orders table may contain duplicate records with the same customer_id and order_id.

Questions 16

Which statement describes the correct use of pyspark.sql.functions.broadcast?

Options:

A.

It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.

B.

It marks a column as small enough to store in memory on all executors, allowing a broadcast join.

C.

It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.

D.

It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

E.

It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.
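As background, broadcast() is applied to a small DataFrame in a join so that the large side is not shuffled; a minimal sketch in which fact_df and dim_df are illustrative DataFrames assumed to already exist:

from pyspark.sql.functions import broadcast

# dim_df is small enough to replicate to every executor; fact_df stays partitioned
joined = fact_df.join(broadcast(dim_df), on="item_id", how="left")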

Questions 17

The data engineering team maintains the following code:

Assuming that this code produces logically correct results and the data in the source tables has been de-duplicated and validated, which statement describes what will occur when this code is executed?

Options:

A.

A batch job will update the enriched_itemized_orders_by_account table, replacing only those rows that have different values than the current version of the table, using accountID as the primary key.

B.

The enriched_itemized_orders_by_account table will be overwritten using the current valid version of data in each of the three tables referenced in the join logic.

C.

An incremental job will leverage information in the state store to identify unjoined rows in the source tables and write these rows to the enriched_itemized_orders_by_account table.

D.

An incremental job will detect if new rows have been written to any of the source tables; if new rows are detected, all results will be recalculated and used to overwrite the enriched_itemized_orders_by_account table.

E.

No computation will occur until enriched_itemized_orders_by_account is queried; upon query materialization, results will be calculated using the current valid version of data in each of the three tables referenced in the join logic.

Questions 18

Which REST API call can be used to review the notebooks configured to run as tasks in a multi-task job?

Options:

A.

/jobs/runs/list

B.

/jobs/runs/get-output

C.

/jobs/runs/get

D.

/jobs/get

E.

/jobs/list

Questions 19

Given the following error traceback:

AnalysisException: cannot resolve 'heartrateheartrateheartrate' given input columns:

[spark_catalog.database.table.device_id, spark_catalog.database.table.heartrate,

spark_catalog.database.table.mrn, spark_catalog.database.table.time]

The code snippet was:

display(df.select(3 * "heartrate"))

Which statement describes the error being raised?

Options:

A.

There is a type error because a DataFrame object cannot be multiplied.

B.

There is a syntax error because the heartrate column is not correctly identified as a column.

C.

There is no column in the table named heartrateheartrateheartrate.

D.

There is a type error because a column object cannot be multiplied.
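As background, Python's * operator repeats a string, which is where the tripled column name in the traceback comes from; a small illustrative snippet:

col_name = 3 * "heartrate"
print(col_name)  # heartrateheartrateheartrate
# df.select("heartrateheartrateheartrate") then fails because no such column exists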

Questions 20

A developer has successfully configured their credentials for Databricks Repos and cloned a remote Git repository. They do not have privileges to make changes to the main branch, which is the only branch currently visible in their workspace.

Which approach should they use to commit and push their code changes back to the remote repository?

Options:

A.

Use Repos to merge all differences and make a pull request back to the remote repository.

B.

Use Repos to pull changes from the remote Git repository; commit and push changes to a branch that appeared as changes were pulled.

C.

Use Repos to create a new branch, commit all changes, and push changes to the remote Git repository.

D.

Use Repos to create a fork of the remote repository, commit all changes, and make a pull request on the source repository.

Questions 21

A data engineer is using Lakeflow Declarative Pipelines Expectations feature to track the data quality of their incoming sensor data. Periodically, sensors send bad readings that are out of range, and they are currently flagging those rows with a warning and writing them to the silver table along with the good data. They’ve been given a new requirement – the bad rows need to be quarantined in a separate quarantine table and no longer included in the silver table.

This is the existing code for their silver table:

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

What code will satisfy the requirements?

Options:

A.

@dlt.table
@dlt.expect("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

B.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading < 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

C.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect_or_drop("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

D.

@dlt.table
@dlt.expect_or_drop("valid_sensor_reading", "reading < 120")
def silver_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

@dlt.table
@dlt.expect("invalid_sensor_reading", "reading >= 120")
def quarantine_sensor_readings():
    return spark.readStream.table("bronze_sensor_readings")

Questions 22

Spill occurs as a result of executing various wide transformations. However, diagnosing spill requires one to proactively look for key indicators.

Where in the Spark UI are two of the primary indicators that a partition is spilling to disk?

Options:

A.

Stage’s detail screen and Executor’s files

B.

Stage’s detail screen and Query’s detail screen

C.

Driver’s and Executor’s log files

D.

Executor’s detail screen and Executor’s log files

Questions 23

A data engineer is designing a Lakeflow Spark Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped. Which Lakeflow Spark Declarative Pipelines configuration implements this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL") \
        .expect_or_drop("valid_amount", "amount > 0")

B.

@dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect("valid_customer", "customer_id IS NOT NULL") \
        .expect("valid_amount", "amount > 0")

C.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 24

Which method can be used to determine the total wall-clock time it took to execute a query?

Options:

A.

In the Spark UI, take the job duration of the longest-running job associated with that query.

B.

In the Spark UI, take the sum of all task durations that ran across all stages for all jobs associated with that query.

C.

Open the Query Profiler associated with that query and use the Total wall-clock duration metric.

D.

Open the Query Profiler associated with that query and use the Aggregated task time metric.

Questions 25

A team of data engineers is adding tables to a DLT pipeline that contain repetitive expectations for many of the same data quality checks.

One member of the team suggests reusing these data quality rules across all tables defined for this pipeline.

What approach would allow them to do this?

Options:

A.

Maintain data quality rules in a Delta table outside of this pipeline’s target schema, providing the schema name as a pipeline parameter.

B.

Use global Python variables to make expectations visible across DLT notebooks included in the same pipeline.

C.

Add data quality constraints to tables in this pipeline using an external job with access to pipeline configuration files.

D.

Maintain data quality rules in a separate Databricks notebook that each DLT notebook or file imports.

Questions 26

A data engineer is designing a pipeline in Databricks that processes records from a Kafka stream where late-arriving data is common.

Which approach should the data engineer use?

Options:

A.

Implement a custom solution using Databricks Jobs to periodically reprocess all historical data.

B.

Use batch processing and overwrite the entire output table each time to ensure late data is incorporated correctly.

C.

Use an Auto CDC pipeline with batch tables to simplify late data handling.

D.

Use a watermark to specify the allowed lateness to accommodate records that arrive after their expected window, ensuring correct aggregation and state management.
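For context, a watermark bounds how late data may arrive while state is retained for aggregations or deduplication; a minimal sketch in which events_df is assumed to be a streaming DataFrame read from Kafka, and the column names and 30-minute threshold are illustrative:

from pyspark.sql.functions import window

late_tolerant_counts = (events_df
    .withWatermark("event_time", "30 minutes")                 # accept events up to 30 minutes late
    .groupBy(window("event_time", "5 minutes"), "device_id")   # aggregate per 5-minute window
    .count())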

Questions 27

The business reporting team requires that data for their dashboards be updated every hour. The pipeline that extracts, transforms, and loads the data for their dashboards runs in 10 minutes.

Assuming normal operating conditions, which configuration will meet their service-level agreement requirements with the lowest cost?

Options:

A.

Schedule a job to execute the pipeline once an hour on a dedicated interactive cluster.

B.

Schedule a Structured Streaming job with a trigger interval of 60 minutes.

C.

Schedule a job to execute the pipeline once an hour on a new job cluster.

D.

Configure a job that executes every time new data lands in a given directory.

Questions 28

Although the Databricks Utilities Secrets module provides tools to store sensitive credentials and avoid accidentally displaying them in plain text, users should still be careful about which credentials are stored here and which users have access to these secrets.

Which statement describes a limitation of Databricks Secrets?

Options:

A.

Because the SHA256 hash is used to obfuscate stored secrets, reversing this hash will display the value in plain text.

B.

Account administrators can see all secrets in plain text by logging on to the Databricks Accounts console.

C.

Secrets are stored in an administrators-only table within the Hive Metastore; database administrators have permission to query this table by default.

D.

Iterating through a stored secret and printing each character will display secret contents in plain text.

E.

The Databricks REST API can be used to list secrets in plain text if the personal access token has proper credentials.

Questions 29

The security team is exploring whether or not the Databricks secrets module can be leveraged for connecting to an external database.

After testing the code with all Python variables being defined with strings, they upload the password to the secrets module and configure the correct permissions for the currently active user. They then modify their code to the following (leaving all other variables unchanged).

Which statement describes what will happen when the above code is executed?

Options:

A.

The connection to the external table will fail; the string "redacted" will be printed.

B.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the encoded password will be saved to DBFS.

C.

An interactive input box will appear in the notebook; if the right password is provided, the connection will succeed and the password will be printed in plain text.

D.

The connection to the external table will succeed; the string value of password will be printed in plain text.

E.

The connection to the external table will succeed; the string "redacted" will be printed.
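For background, secret values retrieved with dbutils.secrets.get() are usable in code (for example, in a JDBC connection string) but are redacted in notebook output; a minimal sketch in which the scope and key names are illustrative:

password = dbutils.secrets.get(scope="db-creds", key="jdbc-password")
print(password)  # notebook output shows [REDACTED] rather than the secret value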

Questions 30

A small company based in the United States has recently contracted a consulting firm in India to implement several new data engineering pipelines to power artificial intelligence applications. All the company's data is stored in regional cloud storage in the United States.

The workspace administrator at the company is uncertain about where the Databricks workspace used by the contractors should be deployed.

Assuming that all data governance considerations are accounted for, which statement accurately informs this decision?

Options:

A.

Databricks runs HDFS on cloud volume storage; as such, cloud virtual machines must be deployed in the region where the data is stored.

B.

Databricks workspaces do not rely on any regional infrastructure; as such, the decision should be made based upon what is most convenient for the workspace administrator.

C.

Cross-region reads and writes can incur significant costs and latency; whenever possible, compute should be deployed in the same region the data is stored.

D.

Databricks leverages user workstations as the driver during interactive development; as such, users should always use a workspace deployed in a region they are physically near.

E.

Databricks notebooks send all executable code from the user's browser to virtual machines over the open internet; whenever possible, choosing a workspace region near the end users is the most secure.

Questions 31

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone.

A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before.

Why are the cloned tables no longer working?

Options:

A.

The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.

B.

Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.

C.

The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command

D.

Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.
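For reference, a shallow clone copies only the Delta metadata and keeps pointing at the source table's data files, which is why VACUUM on the source can break it; a hedged sketch in which the database and table names are illustrative:

spark.sql("""
    CREATE TABLE dev.customers_clone
    SHALLOW CLONE prod.customers
""")
# The clone references prod.customers' data files; if VACUUM later purges files the
# clone still points at, reads against the clone fail.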

Questions 32

A junior developer complains that the code in their notebook isn't producing the correct results in the development environment. A shared screenshot reveals that while they're using a notebook versioned with Databricks Repos, they're using a personal branch that contains old logic. The desired branch named dev-2.3.9 is not available from the branch selection dropdown.

Which approach will allow this developer to review the current logic for this notebook?

Options:

A.

Use Repos to make a pull request, then use the Databricks REST API to update the current branch to dev-2.3.9.

B.

Use Repos to pull changes from the remote Git repository and select the dev-2.3.9 branch.

C.

Use Repos to checkout the dev-2.3.9 branch and auto-resolve conflicts with the current branch

D.

Merge all changes back to the main branch in the remote Git repository and clone the repo again

E.

Use Repos to merge the current branch and the dev-2.3.9 branch, then make a pull request to sync with the remote repository

Questions 33

Which statement describes Delta Lake Auto Compaction?

Options:

A.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 1 GB.

B.

Before a Jobs cluster terminates, optimize is executed on all tables modified during the most recent job.

C.

Optimized writes use logical partitions instead of directory partitions; because partition boundaries are only represented in metadata, fewer small files are written.

D.

Data is queued in a messaging bus instead of committing data directly to memory; all data is committed from the messaging bus in one batch once the job is complete.

E.

An asynchronous job runs after the write completes to detect if files could be further compacted; if yes, an optimize job is executed toward a default of 128 MB.
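For reference, optimized writes and auto compaction are commonly switched on through Delta table properties; a minimal sketch in which the table name is illustrative:

spark.sql("""
    ALTER TABLE orders SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")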

Questions 34

A company has a task management system that tracks the most recent status of tasks. The system takes task events as input and processes events in near real-time using Lakeflow Declarative Pipelines. A new task event is ingested into the system when a task is created or the task status is changed. Lakeflow Declarative Pipelines provides a streaming table (tasks_status) for BI users to query.

The table represents the latest status of all tasks and includes 5 columns:

    task_id (unique for each task)

    task_name

    task_owner

    task_status

    task_event_time

The table enables three properties: deletion vectors, row tracking, and change data feed (CDF).

A data engineer is asked to create a new Lakeflow Declarative Pipeline to enrich the tasks_status table in near real-time by adding one additional column representing task_owner’s department, which can be looked up from a static dimension table (employee).

How should this enrichment be implemented?

Options:

A.

Create a new Lakeflow Declarative Pipeline: use the readStream() function to read tasks_status table; enrich with the employee table; store the result in a new streaming table.

B.

Create a new Lakeflow Declarative Pipeline: use readStream() function with option readChangeFeed to read tasks_status table CDF; enrich with the employee table; create a new streaming table as the result table and use apply_changes() function to process the changes from the enriched CDF.

C.

Create a new Lakeflow Declarative Pipeline: use the read() function to read tasks_status table; enrich with employee table; store the result in a materialized view.

D.

Create a new Lakeflow Declarative Pipeline: use the readStream() function with the option skipChangeCommits to read the tasks_status table; enrich with the employee table; store the result in a new streaming table.

Questions 35

A Data Engineer is building a simple data pipeline using Lakeflow Declarative Pipelines (LDP) in Databricks to ingest customer data. The raw customer data is stored in a cloud storage location in JSON format. The task is to create Lakeflow Declarative Pipelines that read the raw JSON data and write it into a Delta table for further processing.

Which code snippet will correctly ingest the raw JSON data and create a Delta table using LDP?

Options:

A.

import dlt

@dlt.table
def raw_customers():
    return spark.read.format("csv").load("s3://my-bucket/raw-customers/")

B.

import dlt

@dlt.table
def raw_customers():
    return spark.read.json("s3://my-bucket/raw-customers/")

C.

import dlt

@dlt.table
def raw_customers():
    return spark.read.format("parquet").load("s3://my-bucket/raw-customers/")

D.

import dlt

@dlt.view
def raw_customers():
    return spark.format.json("s3://my-bucket/raw-customers/")

Questions 36

A data engineering team is migrating off its legacy Hadoop platform. As part of the process, they are evaluating storage formats for performance comparison. The legacy platform uses ORC and RCFile formats. After converting a subset of data to Delta Lake, they noticed significantly better query performance. Upon investigation, they discovered that queries reading from Delta tables leveraged a Shuffle Hash Join, whereas queries on legacy formats used Sort Merge Joins. The queries reading Delta Lake data also scanned less data.

Which reason could be attributed to the difference in query performance?

Options:

A.

Delta Lake enables data skipping and file pruning using a vectorized Parquet reader.

B.

The queries against the Delta Lake tables were able to leverage the dynamic file pruning optimization.

C.

Shuffle Hash Joins are always more efficient than Sort Merge Joins.

D.

The queries against the ORC tables leveraged the dynamic data skipping optimization but not the dynamic file pruning optimization.

Questions 37

A data engineer is designing a Lakeflow Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped.

Which Lakeflow Declarative Pipelines configurations implement this requirement using Python?

Options:

A.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL")
        .expect_or_drop("valid_amount", "amount > 0")
    )

B.

@dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

C.

@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .expect("valid_customer", "customer_id IS NOT NULL")
        .expect("valid_amount", "amount > 0")
    )

D.

@dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Questions 38

The data science team has requested assistance in accelerating queries on free form text from user reviews. The data is currently stored in Parquet with the below schema:

item_id INT, user_id INT, review_id INT, rating FLOAT, review STRING

The review column contains the full text of the review left by the user. Specifically, the data science team is looking to identify if any of 30 key words exist in this field.

A junior data engineer suggests converting this data to Delta Lake will improve query performance.

Which response to the junior data engineer's suggestion is correct?

Options:

A.

Delta Lake statistics are not optimized for free text fields with high cardinality.

B.

Text data cannot be stored with Delta Lake.

C.

ZORDER ON review will need to be run to see performance gains.

D.

The Delta log creates a term matrix for free text fields to support selective filtering.

E.

Delta Lake statistics are only collected on the first 4 columns in a table.

Questions 39

The data governance team is reviewing code used for deleting records for compliance with GDPR. They note the following logic is used to delete records from the Delta Lake table named users.

Assuming that user_id is a unique identifying key and that delete_requests contains all users that have requested deletion, which statement describes whether successfully executing the above logic guarantees that the records to be deleted are no longer accessible and why?

Options:

A.

Yes; Delta Lake ACID guarantees provide assurance that the delete command succeeded fully and permanently purged these records.

B.

No; the Delta cache may return records from previous versions of the table until the cluster is restarted.

C.

Yes; the Delta cache immediately updates to reflect the latest data files recorded to disk.

D.

No; the Delta Lake delete command only provides ACID guarantees when combined with the merge into command.

E.

No; files containing deleted records may still be accessible with time travel until a vacuum command is used to remove invalidated data files.

Questions 40

A data engineer has a Delta table orders with deletion vectors enabled. The engineer executes the following command:

DELETE FROM orders WHERE status = 'cancelled';

What should be the behavior of deletion vectors when the command is executed?

Options:

A.

Rows are marked as deleted both in metadata and in files.

B.

Delta automatically removes all cancelled orders permanently.

C.

Files are physically rewritten without the deleted rows.

D.

Rows are marked as deleted in metadata, not in files.

Questions 41

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id.

Which statement describes what the number alongside this field represents?

Options:

A.

The job_id is returned in this field.

B.

The job_id and the number of times the job has been run are concatenated and returned.

C.

The number of times the job definition has been run in the workspace.

D.

The globally unique ID of the newly triggered run.

Questions 42

A data engineer is designing a Pandas UDF to process financial time-series data with complex calculations that require maintaining state across rows within each stock symbol group. The function must be efficient and scalable.

Which approach will solve the problem with minimum overhead while preserving data integrity?

Options:

A.

Use a SCALAR_ITER Pandas UDF with iterator-based processing, implementing state management through persistent storage (Delta tables) that gets updated after each batch to maintain continuity across iterator chunks.

B.

Use a SCALAR Pandas UDF that processes the entire dataset at once, implementing custom partitioning logic within the UDF to group by stock symbol and maintain state using global variables shared across all executor processes.

C.

Use applyInPandas() on a Spark DataFrame that receives all rows for each stock symbol as a Pandas DataFrame, allowing processing within each group while maintaining state variables local to each group’s processing function.

D.

Use a grouped_agg Pandas UDF that processes each stock symbol group independently, maintaining state through intermediate aggregation results that get passed between successive UDF calls via broadcast variables.
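For context, applyInPandas() hands each group to an ordinary Python function as a Pandas DataFrame, so per-symbol state can live in plain local variables; a minimal sketch in which trades_df, the column names, and the running-total logic are illustrative assumptions:

import pandas as pd

def running_total(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds every row for one stock symbol; state stays local to this call
    pdf = pdf.sort_values("trade_time")
    pdf["cumulative_volume"] = pdf["volume"].cumsum()
    return pdf

result = (trades_df
    .groupBy("symbol")
    .applyInPandas(
        running_total,
        schema="symbol string, trade_time timestamp, volume double, cumulative_volume double"))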

Questions 43

An analytics team wants to run a short-term experiment in Databricks SQL on the customer transactions Delta table (about 20 billion records) created by the data engineering team. Which strategy should the data engineering team use to ensure minimal downtime and no impact on the ongoing ETL processes?

Options:

A.

Create a new table for the analytics team using a CTAS statement.

B.

Deep clone the table for the analytics team.

C.

Give the analytics team direct access to the production table.

D.

Shallow clone the table for the analytics team.

Questions 44

A production cluster has 3 executor nodes and uses the same virtual machine type for the driver and executor.

When evaluating the Ganglia Metrics for this cluster, which indicator would signal a bottleneck caused by code executing on the driver?

Options:

A.

The five Minute Load Average remains consistent/flat

B.

Bytes Received never exceeds 80 million bytes per second

C.

Total Disk Space remains constant

D.

Network I/O never spikes

E.

Overall cluster CPU utilization is around 25%

Questions 45

Review the following error traceback:

Which statement describes the error being raised?

Options:

A.

The code executed was PySpark but was executed in a Scala notebook.

B.

There is no column in the table named heartrateheartrateheartrate

C.

There is a type error because a column object cannot be multiplied.

D.

There is a type error because a DataFrame object cannot be multiplied.

E.

There is a syntax error because the heartrate column is not correctly identified as a column.

Questions 46

A table is registered with the following code:

Both users and orders are Delta Lake tables. Which statement describes the results of querying recent_orders?

Options:

A.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query finishes.

B.

All logic will execute when the table is defined and store the result of joining tables to the DBFS; this stored data will be returned when the table is queried.

C.

Results will be computed and cached when the table is defined; these cached results will incrementally update as new records are inserted into source tables.

D.

All logic will execute at query time and return the result of joining the valid versions of the source tables at the time the query began.

E.

The versions of each source table will be stored in the table transaction log; query results will be saved to DBFS with each query.

Questions 47

A data engineer is building a Lakeflow Declarative Pipelines pipeline to process healthcare claims data. A metadata JSON file defines data quality rules for multiple tables, including:

{
  "claims": [
    { "name": "valid_patient_id", "constraint": "patient_id IS NOT NULL" },
    { "name": "non_negative_amount", "constraint": "claim_amount >= 0" }
  ]
}

The pipeline must dynamically apply these rules to the claims table without hardcoding the rules.

How should the data engineer achieve this?

Options:

A.

Load the JSON metadata, loop through its entries, and apply expectations using dlt.expect_all.

B.

Invoke an external API to validate records against the metadata rules.

C.

Reference each expectation with @dlt.expect decorators in the table declaration.

D.

Use a SQL CONSTRAINT block referencing the JSON file path.
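For context, a metadata-driven pattern loads the JSON once and passes the rules to dlt.expect_all as a dict; a hedged sketch in which the file path, the source table name, and the warn-versus-drop choice are assumptions:

import dlt
import json

with open("/Volumes/catalog/schema/rules/claims_rules.json") as f:
    # Build {rule_name: constraint} from the metadata file shown above
    rules = {r["name"]: r["constraint"] for r in json.load(f)["claims"]}

@dlt.table
@dlt.expect_all(rules)  # apply every rule loaded from metadata without hardcoding
def claims():
    return spark.readStream.table("bronze_claims")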

Questions 48

The following code has been migrated to a Databricks notebook from a legacy workload:

The code executes successfully and provides the logically correct results; however, it takes over 20 minutes to extract and load around 1 GB of data.

Which statement is a possible explanation for this behavior?

Options:

A.

%sh triggers a cluster restart to collect and install Git. Most of the latency is related to cluster startup time.

B.

Instead of cloning, the code should use %sh pip install so that the Python code can get executed in parallel across all nodes in a cluster.

C.

%sh does not distribute file moving operations; the final line of code should be updated to use %fs instead.

D.

Python will always execute slower than Scala on Databricks. The run.py script should be refactored to Scala.

E.

%sh executes shell code on the driver node. The code does not take advantage of the worker nodes or Databricks optimized Spark.

Questions 49

The data engineering team is migrating an enterprise system with thousands of tables and views into the Lakehouse. They plan to implement the target architecture using a series of bronze, silver, and gold tables. Bronze tables will almost exclusively be used by production data engineering workloads, while silver tables will be used to support both data engineering and machine learning workloads. Gold tables will largely serve business intelligence and reporting purposes. While personal identifying information (PII) exists in all tiers of data, pseudonymization and anonymization rules are in place for all data at the silver and gold levels.

The organization is interested in reducing security concerns while maximizing the ability to collaborate across diverse teams.

Which statement exemplifies best practices for implementing this system?

Options:

A.

Isolating tables in separate databases based on data quality tiers allows for easy permissions management through database ACLs and allows physical separation of default storage locations for managed tables.

B.

Because databases on Databricks are merely a logical construct, choices around database organization do not impact security or discoverability in the Lakehouse.

C.

Storing all production tables in a single database provides a unified view of all data assets available throughout the Lakehouse, simplifying discoverability by granting all users view privileges on this database.

D.

Working in the default Databricks database provides the greatest security when working with managed tables, as these will be created in the DBFS root.

E.

Because all tables must live in the same storage containers used for the database they're created in, organizations should be prepared to create between dozens and thousands of databases depending on their data isolation requirements.

Questions 50

The data science team has created and logged a production model using MLflow. The model accepts a list of column names and returns a new column of type DOUBLE.

The following code correctly imports the production model, loads the customers table containing the customer_id key column into a DataFrame, and defines the feature columns needed for the model.

Which code block will output a DataFrame with the schema "customer_id LONG, predictions DOUBLE"?

Options:

A.

model.predict(df, columns)

B.

df.map(lambda x: model(x[columns])).select("customer_id", "predictions")

C.

df.select("customer_id", model(*columns).alias("predictions"))

D.

df.apply(model, columns).select("customer_id", "predictions")
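For background, a logged MLflow model is usually applied to a Spark DataFrame through a pyfunc UDF; a hedged sketch in which the model URI is illustrative and columns is the predefined list of feature column names mentioned in the question:

import mlflow.pyfunc

predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri="models:/churn_model/Production",  # illustrative registry URI
    result_type="double")

predictions_df = df.select("customer_id", predict_udf(*columns).alias("predictions"))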

Questions 51

A data engineering team uses Databricks Lakehouse Monitoring to track the percent_null metric for a critical column in their Delta table.

The profile metrics table (prod_catalog.prod_schema.customer_data_profile_metrics) stores hourly percent_null values.

The team wants to:

    Trigger an alert when the daily average of percent_null exceeds 5% for three consecutive days .

    Ensure that notifications are not spammed during sustained issues.

Options:

A.

SELECT percent_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '1' DAY

Alert Condition: percent_null > 5

Notification Frequency: At most every 24 hours

B.

WITH daily_avg AS (
  SELECT DATE_TRUNC('DAY', window.end) AS day,
         AVG(percent_null) AS avg_null
  FROM prod_catalog.prod_schema.customer_data_profile_metrics
  GROUP BY DATE_TRUNC('DAY', window.end)
)
SELECT day, avg_null
FROM daily_avg
ORDER BY day DESC
LIMIT 3

Alert Condition: ALL avg_null > 5 for the latest 3 rows

Notification Frequency: Just once

C.

SELECT AVG(percent_null) AS daily_avg
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY

Alert Condition: daily_avg > 5

Notification Frequency: Each time alert is evaluated

D.

SELECT SUM(CASE WHEN percent_null > 5 THEN 1 ELSE 0 END) AS violation_days
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY

Alert Condition: violation_days >= 3

Notification Frequency: Just once

Questions 52

The data engineer is using Spark's MEMORY_ONLY storage level.

Which indicators should the data engineer look for in the Spark UI's Storage tab to signal that a cached table is not performing optimally?

Options:

A.

Size on Disk is > 0

B.

The number of Cached Partitions > the number of Spark Partitions

C.

The RDD Block Name includes an annotation signaling a failure to cache

D.

On Heap Memory Usage is within 75% of off Heap Memory usage

Questions 53

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write.

Which consideration will impact the decisions made by the engineer while migrating this workload?

Options:

A.

All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

B.

Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.

C.

Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.

D.

Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.

Questions 54

A data engineer is configuring a Databricks Asset Bundle to deploy a job with granular permissions. The requirements are:

• Grant the data-engineers group CAN_MANAGE access to the job.

• Ensure the auditors’ group can view the job but not modify/run it.

• Avoid granting unintended permissions to other users/groups.

How should the data engineer deploy the job while meeting the requirements?

Options:

A.

resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW
        - group_name: admin-team
          level: IS_OWNER

B.

resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
      permissions:
        - group_name: auditors
          level: CAN_VIEW

C.

permissions:
  - group_name: data-engineers
    level: CAN_MANAGE
  - group_name: auditors
    level: CAN_VIEW

resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]

D.

resources:
  jobs:
    my-job:
      name: data-pipeline
      tasks: [...]
      job_clusters: [...]
      permissions:
        - group_name: data-engineers
          level: CAN_MANAGE
        - group_name: auditors
          level: CAN_VIEW

Questions 55

A data governance team at a large enterprise is improving data discoverability across its organization. The team has hundreds of tables in their Databricks Lakehouse with thousands of columns that lack proper documentation. Many of these tables were created by different teams over several years, with missing context about column meanings and business logic. The data governance team needs to quickly generate comprehensive column descriptions for all existing tables to meet compliance requirements and improve data literacy across the organization. They want to leverage modern capabilities to automatically generate meaningful descriptions rather than manually documenting each column, which would take months to complete.

Which approach should the team use in Databricks to automatically generate column comments and descriptions for existing tables?

Options:

A.

Navigate to the table in Databricks Catalog Explorer, select the table schema view, and use the AI Generate option which leverages artificial intelligence to automatically create meaningful column descriptions based on column names, data types, sample values, and data patterns.

B.

Use Delta Lake’s DESCRIBE HISTORY command to analyze table evolution and infer column purposes from historical changes.

C.

Use the DESCRIBE TABLE command to extract existing schema information and manually write descriptions based on column names and data types.

D.

Write custom PySpark code using df.describe() and df.schema to programmatically generate basic statistical descriptions for each column.

Questions 56

What is true for Delta Lake?

Options:

A.

Views in the Lakehouse maintain a valid cache of the most recent versions of source tables at all times.

B.

Delta Lake automatically collects statistics on the first 32 columns of each table, which are leveraged in data skipping based on query filters.

C.

Z-ORDER can only be applied to numeric values stored in Delta Lake tables.

D.

Primary and foreign key constraints can be leveraged to ensure duplicate values are never entered into a dimension table.

Questions 57

An upstream system is emitting change data capture (CDC) logs that are being written to a cloud object storage directory. Each record in the log indicates the change type (insert, update, or delete) and the values for each field after the change. The source table has a primary key identified by the field pk_id.

For auditing purposes, the data governance team wishes to maintain a full record of all values that have ever been valid in the source system. For analytical purposes, only the most recent value for each record needs to be recorded. The Databricks job to ingest these records occurs once per hour, but each individual record may have changed multiple times over the course of an hour.

Which solution meets these requirements?

Options:

A.

Create a separate history table for each pk_id; resolve the current state of the table by running a union all and filtering the history tables for the most recent state.

B.

Use merge into to insert, update, or delete the most recent entry for each pk_id into a bronze table, then propagate all changes throughout the system.

C.

Iterate through an ordered set of changes to the table, applying each in turn; rely on Delta Lake's versioning ability to create an audit log.

D.

Use Delta Lake's change data feed to automatically process CDC data from an external system, propagating all changes to all dependent tables in the Lakehouse.

E.

Ingest all log information into a bronze table; use merge into to insert, update, or delete the most recent entry for each pk_id into a silver table to recreate the current table state.

Questions 58

A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A.

If task A fails during a scheduled run, which statement describes the results of this run?

Options:

A.

Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.

B.

Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.

C.

Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.

D.

Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

E.

Tasks B and C will be skipped; task A will not commit any changes because of stage failure.

Questions 59

A Delta Lake table representing metadata about content posts from users has the following schema:

user_id LONG, post_text STRING, post_id STRING, longitude FLOAT, latitude FLOAT, post_time TIMESTAMP, date DATE

Based on the above schema, which column is a good candidate for partitioning the Delta Table?

Options:

A.

date

B.

post_id

C.

user_id

D.

post_time

Questions 60

A workspace admin has created a new catalog called finance_data and wants to delegate permission management to a finance team lead without giving them full admin rights.

Which privilege should be granted to the finance team lead?

Options:

A.

ALL PRIVILEGES on the finance_data catalog.

B.

Make the finance team lead a metastore admin.

C.

GRANT OPTION privilege on the finance_data catalog.

D.

MANAGE privilege on the finance_data catalog.
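For reference, the MANAGE privilege lets a principal grant and revoke permissions on a securable without being a metastore admin; a hedged sketch in which the group name is illustrative:

spark.sql("GRANT MANAGE ON CATALOG finance_data TO `finance-team-leads`")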
