
Databricks-Certified-Professional-Data-Engineer: Databricks Certified Data Engineer Professional Exam Questions and Answers

Questions 4

Which of the following SQL commands can be used to insert, update, or delete rows based on a condition that checks whether a row exists?

Options:

A.

MERGE INTO table_name

B.

COPY INTO table_name

C.

UPDATE table_name

D.

INSERT INTO OVERWRITE table_name

E.

INSERT IF EXISTS table_name

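For reference, a minimal sketch of how a MERGE INTO statement can insert, update, and delete in a single conditional statement. The table and column names (customers, customer_updates, customer_id, email, is_deleted) are hypothetical and not part of the question.

MERGE INTO customers AS t
USING customer_updates AS s
ON t.customer_id = s.customer_id
WHEN MATCHED AND s.is_deleted = true THEN DELETE          -- delete rows that exist and are flagged
WHEN MATCHED THEN UPDATE SET t.email = s.email            -- update rows that already exist
WHEN NOT MATCHED THEN INSERT (customer_id, email) VALUES (s.customer_id, s.email);  -- insert new rows
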
Questions 5

What steps need to be taken to set up a Delta Live Tables pipeline as a job using the workspace UI?

Options:

A.

DELTA LIVE TABLES do not support job cluster

B.

Select Workflows UI and Delta live tables tab, under task type select Delta live tables pipeline and select the notebook

C.

Select Workflows UI and Delta live tables tab, under task type select Delta live tables pipeline and select the pipeline JSON file

D.

Use Pipeline creation UI, select a new pipeline and job cluster

Questions 6

What could be the expected output of the query SELECT COUNT(DISTINCT *) FROM user on this table? (An illustrative example follows the options.)

Options:

A.

3

B.

2

(Correct)

C.

1

D.

0

E.

NULL

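The table the question refers to is not reproduced above. As a hedged illustration of the behaviour being tested, assume a hypothetical three-row user table in which one row contains a NULL; COUNT(DISTINCT *) expands to all columns (as the question presumes) and, like other multi-argument COUNT(DISTINCT ...) calls, skips rows in which any column is NULL, so the query below would return 2.

-- hypothetical data; the real table in the question is not shown
CREATE OR REPLACE TEMP VIEW user AS
SELECT * FROM VALUES (1, 'alice'), (2, 'bob'), (3, NULL) AS t(id, name);

-- the row with the NULL name is skipped, so this returns 2
SELECT COUNT(DISTINCT *) FROM user;
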
Questions 7

You are currently working on a notebook that will populate a reporting table for downstream process consumption. This process needs to run on a schedule every hour; what type of cluster are you going to use to set up this job?

Options:

A.

Since it’s just a single job and we need to run every hour, we can use an all-purpose cluster

B.

The job cluster is best suited for this purpose.

C.

Use Azure VM to read and write delta tables in Python

D.

Use delta live table pipeline to run in continuous mode

Questions 8

The data engineering team is looking to add a new column to the table, but the QA team would like to test the change before implementing it in production. Which of the below options allows you to quickly copy the table from Prod to the QA environment, modify it, and run the tests?

Options:

A.

DEEP CLONE

B.

SHADOW CLONE

C.

ZERO COPY CLONE

D.

SHALLOW CLONE

E.

METADATA CLONE

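A hedged sketch of the two documented Delta clone flavours named in the options: a shallow clone copies only the table metadata, so the copy is near-instant, while a deep clone also copies the underlying data files. The prod_db.orders and qa_db.orders table names and the added column are hypothetical.

-- fast, metadata-only copy for testing
CREATE TABLE IF NOT EXISTS qa_db.orders SHALLOW CLONE prod_db.orders;

-- the clone can now be modified and tested without touching production data files
ALTER TABLE qa_db.orders ADD COLUMNS (discount_pct DOUBLE);
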
Questions 9

A dataset has been defined using Delta Live Tables and includes an expectations clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')

What is the expected behavior when a batch of data containing data that violates these constraints is processed?

Options:

A.

Records that violate the expectation are added to the target dataset and recorded as invalid in the event log.

B.

Records that violate the expectation are dropped from the target dataset and recorded as invalid in the event log.

C.

Records that violate the expectation cause the job to fail.

D.

Records that violate the expectation are added to the target dataset and flagged as invalid in a field added to the target dataset.

E.

Records that violate the expectation are dropped from the target dataset and loaded into a quarantine table.

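A hedged sketch of where such an expectation sits in a Delta Live Tables SQL definition; the dataset names events_clean and LIVE.events_raw are hypothetical. Per the Delta Live Tables documentation, an EXPECT constraint with no ON VIOLATION clause retains the violating records in the target and records the violation in the pipeline metrics/event log, while ON VIOLATION DROP ROW or ON VIOLATION FAIL UPDATE change that behaviour.

-- hypothetical DLT dataset using the question's expectation
CREATE OR REFRESH LIVE TABLE events_clean (
  CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
)
AS SELECT * FROM LIVE.events_raw;
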
Questions 10

The operations team is using a centralized data quality monitoring system to which a user can publish data quality metrics through a webhook. You were asked to develop a process to send messages using a webhook if there is at least one duplicate record; which of the following approaches can be taken to integrate an alert with the current data quality monitoring system?

Options:

A.

Use notebook and Jobs to use python to publish DQ metrics

B.

Setup an alert to send an email, use python to parse email, and publish a webhook message

C.

Setup an alert with custom template

D.

Setup an alert with custom Webhook destination

E.

Setup an alert with dynamic template

Questions 11

Your colleague was walking you through how a job was set up, but you noticed a warning message that said, "Jobs running on all-purpose cluster are considered all purpose compute". The colleague was not sure why he was getting the warning message; how do you best explain this warning message?

Options:

A.

All-purpose clusters cannot be used for Job clusters, due to performance issues.

B.

All-purpose clusters take longer to start the cluster vs a job cluster

C.

All-purpose clusters are less expensive than the job clusters

D.

All-purpose clusters are more expensive than the job clusters

E.

All-purpose cluster provide interactive messages that can not be viewed in a job

Questions 12

What is the purpose of the bronze layer in a Multi-hop Medallion architecture?

Options:

A.

Copy of raw data, easy to query and ingest data for downstream processes.

B.

Powers ML applications

C.

Data quality checks, corrupt data quarantined

D.

Contain aggregated data that is to be consumed into Silver

E.

Reduces data storage by compressing the data

Questions 13

Which of the following statements are true about a lakehouse?

Options:

A.

Lakehouse only supports Machine learning workloads and Data warehouses support BI workloads

B.

Lakehouse only supports end-to-end streaming workloads and Data warehouses support Batch workloads

C.

Lakehouse does not support ACID

D.

Lakehouse do not support SQL

E.

Lakehouse supports Transactions

Questions 14

If you create a database sample_db with the statement CREATE DATABASE sample_db, what will be the default location of the database in DBFS?

Options:

A.

Default location, DBFS:/user/

B.

Default location, /user/db/

C.

Default Storage account

D.

Statement fails “Unable to create database without location”

E.

Default Location, dbfs:/user/hive/warehouse

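A hedged sketch for checking the default location yourself; the database name follows the question, and the path in the comment assumes the legacy hive_metastore default warehouse directory.

CREATE DATABASE IF NOT EXISTS sample_db;   -- no LOCATION clause supplied
DESCRIBE DATABASE EXTENDED sample_db;      -- the Location row resolves under the default warehouse,
                                           -- typically dbfs:/user/hive/warehouse/sample_db.db
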
Questions 15

What is the purpose of the silver layer in a Multi hop architecture?

Options:

A.

Replaces a traditional data lake

B.

Efficient storage and querying of full, unprocessed history of data

C.

Eliminates duplicate data, quarantines bad data

D.

Refined views with aggregated data

E.

Optimized query performance for business-critical data

Questions 16

Which of the following techniques does Structured Streaming use to ensure recovery from failures during stream processing?

Options:

A.

Checkpointing and Watermarking

B.

Write ahead logging and watermarking

C.

Checkpointing and write-ahead logging

D.

Delta time travel

E.

The stream will failover to available nodes in the cluster

F.

Checkpointing and Idempotent sinks

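As background, the Structured Streaming programming guide attributes end-to-end recovery to checkpointing and write-ahead logs: progress (offsets) is written to the checkpoint location before data is processed, so a restarted query resumes from the last committed micro-batch. A minimal Python sketch, with hypothetical table names and checkpoint path:

(spark.readStream
    .table("bronze_events")                                             # hypothetical source table
    .writeStream
    .option("checkpointLocation", "dbfs:/checkpoints/bronze_events/")   # offsets and WAL are kept here
    .toTable("silver_events"))                                          # hypothetical target table
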
Questions 17

Which of the below SQL commands creates a session-scoped temporary view?

Options:

A.

CREATE OR REPLACE TEMPORARY VIEW view_name
AS SELECT * FROM table_name

(Correct)

B.

CREATE OR REPLACE LOCAL TEMPORARY VIEW view_name
AS SELECT * FROM table_name

C.

CREATE OR REPLACE GLOBAL TEMPORARY VIEW view_name
AS SELECT * FROM table_name

D.

CREATE OR REPLACE VIEW view_name
AS SELECT * FROM table_name

E.

CREATE OR REPLACE LOCAL VIEW view_name
AS SELECT * FROM table_name

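A hedged sketch of the two scopes behind these options; the view and table names are hypothetical. A TEMPORARY VIEW is visible only to the session that created it, whereas a GLOBAL TEMPORARY VIEW is registered in the global_temp schema and remains visible to other sessions on the same cluster until it terminates.

CREATE OR REPLACE TEMPORARY VIEW session_orders AS SELECT * FROM orders;         -- session-scoped
CREATE OR REPLACE GLOBAL TEMPORARY VIEW shared_orders AS SELECT * FROM orders;   -- cluster-scoped
SELECT * FROM global_temp.shared_orders;   -- global temp views are qualified with global_temp
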
Questions 18

At the end of the inventory process a file gets uploaded to the cloud object storage, and you are asked to build a process to ingest the data incrementally. The schema of the file is expected to change over time, and the ingestion process should handle these changes automatically. Below is the Auto Loader command to load the data; fill in the blanks for successful execution of the code (a completed sketch follows the options).

spark.readStream
  .format("cloudfiles")
  .option("_______", "csv")
  .option("_______", "dbfs:/location/checkpoint/")
  .load(data_source)
  .writeStream
  .option("_______", "dbfs:/location/checkpoint/")
  .option("_______", "true")
  .table(table_name)

Options:

A.

format, checkpointlocation, schemalocation, overwrite

B.

cloudfiles.format, checkpointlocation, cloudfiles.schemalocation, overwrite

C.

cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, mergeSchema

D.

cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, append

E.

cloudfiles.format, cloudfiles.schemalocation, checkpointlocation, overwrite

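For comparison, a hedged sketch of the usual Auto Loader pattern with schema evolution enabled, using the option names from the Auto Loader documentation (cloudFiles.format, cloudFiles.schemaLocation, checkpointLocation, mergeSchema). The landing path and target table name are hypothetical, and the final write uses .toTable, the PySpark 3.1+ writer method, where the question's snippet writes .table.

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")                               # file format of the source
    .option("cloudFiles.schemaLocation", "dbfs:/location/checkpoint/")  # where the inferred schema is tracked
    .load("dbfs:/landing/inventory/")                                 # hypothetical landing path
    .writeStream
    .option("checkpointLocation", "dbfs:/location/checkpoint/")
    .option("mergeSchema", "true")                                    # allow new columns to evolve the target
    .toTable("bronze_inventory"))                                     # hypothetical target table
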
Questions 19

How are Delta tables stored?

Options:

A.

A directory where parquet data files are stored, with a sub-directory _delta_log where the metadata and the transaction log are stored as JSON files.

B.

A Directory where parquet data files are stored, all of the meta data is stored in memory

C.

A directory where parquet data files are stored in the data plane, with a sub-directory _delta_log where the metadata, history, and log are stored in the control plane.

D.

A Directory where parquet data files are stored, all of the metadata is stored in parquet files

E.

Data is stored in the data plane, and metadata and the delta log are stored in the control plane

Questions 20

What is the output of the below function when executed with input parameters 1 and 3?

def check_input(x, y):
    if x < y:
        x = x + 1
    if x > y:
        x = x + 1
    if x
        x = x + 1
    return x

Options:

A.

1

B.

2

C.

3

D.

4

E.

5

Questions 21

You have noticed that the data science team is using the notebook versioning feature with Git integration, and you have recommended that they switch to using Databricks Repos. Which of the below reasons could be why the team needs to switch to Databricks Repos?

Options:

A.

Databricks Repos allows multiple users to make changes

B.

Databricks Repos allows merge and conflict resolution

C.

Databricks Repos has a built-in version control system

D.

Databricks Repos automatically saves changes

E.

Databricks Repos allow you to add comments and select the changes you want to commit.

Questions 22

The data engineering team noticed that one of the jobs normally finishes in 15 minutes but gets stuck randomly when reading remote databases due to network packet drops. Which of the following steps can be used to improve the stability of the job?

Options:

A.

Use Databrick REST API to monitor long running jobs and issue a kill command

B.

Use Jobs runs, active runs UI section to monitor and kill long running job

C.

Modify the task, to include a timeout to kill the job if it runs more than 15 mins.

D.

Use Spark job time out setting in the Spark UI

E.

Use Cluster timeout setting in the Job cluster UI

Questions 23

Which of the following tools provides Data Access Control, Access Audit, Data Lineage, and Data Discovery?

Options:

A.

DELTA LIVE Pipelines

B.

Unity Catalog

C.

Data Governance

D.

DELTA lake

E.

Lakehouse

Questions 24

You are working on a table called orders which contains data for 2021, and you have a second table called orders_archive which contains data for 2020. You need to combine the data from the two tables, and there could be identical rows in both. Which of the following SQL statements helps you combine the results from both tables and eliminate the duplicate rows?

Options:

A.

SELECT * FROM orders UNION SELECT * FROM orders_archive

(Correct)

B.

SELECT * FROM orders INTERSECT SELECT * FROM orders_archive

C.

SELECT * FROM orders UNION ALL SELECT * FROM orders_archive

D.

SELECT * FROM orders_archive MINUS SELECT * FROM orders

E.

SELECT distinct * FROM orders JOIN orders_archive on order.id = orders_archive.id

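A hedged reminder of the distinction the options hinge on, using the question's own table names: UNION removes duplicate rows across the two inputs, while UNION ALL keeps every row from both.

SELECT * FROM orders
UNION             -- deduplicates rows that appear in both tables
SELECT * FROM orders_archive;

SELECT * FROM orders
UNION ALL         -- keeps duplicates; every row from both tables is returned
SELECT * FROM orders_archive;
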
Questions 25

The marketing team is launching a new campaign and wants to monitor its performance for the first two weeks. They would like to set up a dashboard with a refresh schedule to run every 5 minutes; which of the below steps can be taken to reduce the cost of this refresh over time?

Options:

A.

Reduce the size of the SQL Cluster size

B.

Reduce the max size of auto scaling from 10 to 5

C.

Setup the dashboard refresh schedule to end in two weeks

D.

Change the spot instance policy from reliability optimized to cost optimized

E.

Always use X-small cluster

Questions 26

Which of the following Auto Loader Structured Streaming commands successfully performs a hop from the landing area into Bronze?

Options:

A.

spark\
  .readStream\
  .format("csv")\
  .option("cloudFiles.schemaLocation", checkpoint_directory)\
  .load("landing")\
  .writeStream.option("checkpointLocation", checkpoint_directory)\
  .table(raw)

B.

spark\
  .readStream\
  .format("cloudFiles")\
  .option("cloudFiles.format", "csv")\
  .option("cloudFiles.schemaLocation", checkpoint_directory)\
  .load("landing")\
  .writeStream.option("checkpointLocation", checkpoint_directory)\
  .table(raw)

(Correct)

C.

spark\
  .read\
  .format("cloudFiles")\
  .option("cloudFiles.format", "csv")\
  .option("cloudFiles.schemaLocation", checkpoint_directory)\
  .load("landing")\
  .writeStream.option("checkpointLocation", checkpoint_directory)\
  .table(raw)

D.

spark\
  .readStream\
  .load(rawSalesLocation)\
  .writeStream\
  .option("checkpointLocation", checkpointPath).outputMode("append")\
  .table("uncleanedSales")

E.

spark\
  .read\
  .load(rawSalesLocation)\
  .writeStream\
  .option("checkpointLocation", checkpointPath)\
  .outputMode("append")\
  .table("uncleanedSales")

Questions 27

When working with Auto Loader you noticed that most of the columns inferred during loading are string data types, including columns that were supposed to be integers. How can we fix this?

Options:

A.

Provide the schema of the source table in the cloudfiles.schemalocation

B.

Provide the schema of the target table in the cloudfiles.schemalocation

C.

Provide schema hints

D.

Update the checkpoint location

E.

Correct the incoming data by explicitly casting the data types

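A hedged sketch of the schema-hints approach: cloudFiles.schemaHints is the documented Auto Loader option for overriding the inferred type of individual columns while the rest of the schema is still inferred. The column names, paths, and table name below are hypothetical, and the final write uses .toTable, the PySpark 3.1+ writer method.

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "dbfs:/checkpoints/inventory_schema/")
    .option("cloudFiles.schemaHints", "quantity INT, unit_price DOUBLE")   # force these columns to numeric types
    .load("dbfs:/landing/inventory/")
    .writeStream
    .option("checkpointLocation", "dbfs:/checkpoints/inventory/")
    .toTable("bronze_inventory"))
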
Questions 28

The below Spark command is looking to create a summary table based on customerId and the number of times the customerId is present in the events_log Delta table, writing a one-time micro-batch to a summary table. Fill in the blanks to complete the query (a filled-in sketch follows the options).

spark._________
  .format("delta")
  .table("events_log")
  .groupBy("customerId")
  .count()
  ._______
  .format("delta")
  .outputMode("complete")
  .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
  .trigger(______)
  .table("target_table")

Options:

A.

writeStream, readStream, once

B.

readStream, writeStream, once

C.

writeStream, processingTime = once

D.

writeStream, readStream, once = True

E.

readStream, writeStream, once = True

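A hedged, filled-in sketch of the one-time micro-batch pattern the snippet points at, reusing the table names and checkpoint path from the question; the writer call is shown here as .toTable, the PySpark 3.1+ method name, where the question's snippet writes .table.

(spark.readStream
    .format("delta")
    .table("events_log")
    .groupBy("customerId")
    .count()
    .writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
    .trigger(once=True)           # run a single micro-batch and stop
    .toTable("target_table"))
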
Questions 29

You were asked to write Python code to stop all running streams. Which of the following commands can be used to get a list of all active streams currently running so we can stop them? Fill in the blank.

for s in _______________:
    s.stop()

Options:

A.

Spark.getActiveStreams()

B.

spark.streams.active

C.

activeStreams()

D.

getActiveStreams()

E.

spark.streams.getActive

Questions 30

Which of the following two options are supported in identifying the arrival of new files and incremental data from cloud object storage using Auto Loader?

Options:

A.

Directory listing, File notification

B.

Checking pointing, watermarking

C.

Writing ahead logging, read head logging

D.

File hashing, Dynamic file lookup

E.

Checkpointing and Write ahead logging

Questions 31

You are trying to calculate the total sales made by all the employees by parsing a complex struct data type that stores employee and sales data. How would you approach this in SQL?

Table definition:

batchId INT, performance ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>, insertDate TIMESTAMP

Sample data of performance column:

[
  { "employeeId": 1234, "sales": 10000 },
  { "employeeId": 3232, "sales": 30000 }
]

Calculate the total sales made by all the employees.

Sample data with create table syntax for the data:

create or replace table sales as
select 1 as batchId,
       from_json('[{ "employeeId":1234,"sales" : 10000 },{ "employeeId":3232,"sales" : 30000 }]',
                 'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
       current_timestamp() as insertDate
union all
select 2 as batchId,
       from_json('[{ "employeeId":1235,"sales" : 10500 },{ "employeeId":3233,"sales" : 32000 }]',
                 'ARRAY<STRUCT<employeeId: BIGINT, sales: INT>>') as performance,
       current_timestamp() as insertDate

Options:

A.

WITH CTE as (SELECT EXPLODE (performance) FROM table_name)
SELECT SUM (performance.sales) FROM CTE

B.

WITH CTE as (SELECT FLATTEN (performance) FROM table_name)
SELECT SUM (sales) FROM CTE

C.

select aggregate(flatten(collect_list(performance.sales)), 0, (x, y) -> x + y)
as total_sales from sales

D.

SELECT SUM(SLICE (performance, sales)) FROM employee

E.

select reduce(flatten(collect_list(performance:sales)), 0, (x, y) -> x + y)
as total_sales from sales

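A hedged worked example of one way to total the nested sales values, assuming the sales table created above: explode the array of structs into one row per employee entry, then sum the sales field.

WITH per_employee AS (
  SELECT explode(performance) AS perf FROM sales   -- one row per array element (struct)
)
SELECT sum(perf.sales) AS total_sales FROM per_employee;
-- with the sample rows above this returns 10000 + 30000 + 10500 + 32000 = 82500
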
Questions 32

You have written a notebook to generate a summary data set for reporting. The notebook was scheduled using a job cluster, but you realized it takes 8 minutes to start the cluster; what feature can be used to start the cluster in a timely fashion so your job can run immediately?

Options:

A.

Set up an additional job to run ahead of the actual job so the cluster is running when the second job starts

B.

Use the Databricks cluster pools feature to reduce the startup time

C.

Use Databricks Premium edition instead of Databricks standard edition

D.

Pin the cluster in the cluster UI page so it is always available to the jobs

E.

Disable auto termination so the cluster is always running

Questions 33

Which of the following operations are not supported on a streaming dataset view?

spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_view")

Options:

A.

SELECT sum(unitssold) FROM streaming_view

B.

SELECT max(unitssold) FROM streaming_view

C.

SELECT id, sum(unitssold) FROM streaming_view GROUP BY id ORDER BY id

D.

SELECT id, count(*) FROM streaming_view GROUP BY id

E.

SELECT * FROM streaming_view ORDER BY id

Exam Name: Databricks Certified Data Engineer Professional Exam
Last Update: Mar 21, 2023
Questions: 222