Which of the following SQL commands can be used to insert, update, or delete rows based on a condition that checks whether a row (or rows) exists?
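For reference, this conditional insert/update/delete pattern is typically expressed with MERGE INTO in Delta Lake; a minimal sketch, assuming hypothetical target and updates tables that share an id column and carry a status column:

# Minimal sketch: MERGE INTO inserts, updates, or deletes depending on whether a
# matching row exists. Table and column names here are hypothetical.
spark.sql("""
  MERGE INTO target t
  USING updates u
  ON t.id = u.id                                      -- existence check
  WHEN MATCHED AND u.status = 'deleted' THEN DELETE   -- conditional delete
  WHEN MATCHED THEN UPDATE SET *                      -- update existing rows
  WHEN NOT MATCHED THEN INSERT *                      -- insert new rows
""")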
What steps need to be taken to set up a Delta Live Tables pipeline as a job using the workspace UI?
What could be the expected output of the query SELECT COUNT(DISTINCT *) FROM user on this table?
You are currently working on a notebook that will populate a reporting table for downstream process consumption. This process needs to run on a schedule every hour; what type of cluster are you going to use to set up this job?
The data engineering team is looking to add a new column to a table, but the QA team would like to test the change before implementing it in production. Which of the below options allows you to quickly copy the table from Prod to the QA environment, modify it, and run the tests?
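One way to stage such a copy is a Delta clone; a minimal sketch, assuming hypothetical prod_db and qa_db schemas:

# Minimal sketch: DEEP CLONE copies the table's data and metadata so the QA copy
# can be modified and tested independently of production; SHALLOW CLONE copies
# only metadata and keeps referencing the source data files. Names are hypothetical.
spark.sql("CREATE OR REPLACE TABLE qa_db.orders DEEP CLONE prod_db.orders")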
A dataset has been defined using Delta Live Tables and includes an expectation clause: CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01')
What is the expected behavior when a batch of data containing data that violates these constraints is processed?
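For context, the same expectation can also be declared in a Python Delta Live Tables definition; a minimal sketch with hypothetical dataset names:

import dlt

# Minimal sketch: an EXPECT constraint with no ON VIOLATION clause records
# violations in the pipeline's data quality metrics while keeping the offending
# rows; expect_or_drop and expect_or_fail change that behavior.
@dlt.table
@dlt.expect("valid_timestamp", "timestamp > '2020-01-01'")
def events_clean():
    return dlt.read_stream("events_raw")  # source dataset name is hypothetical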
The operations team is using a centralized data quality monitoring system where a user can publish data quality metrics through a webhook. You were asked to develop a process to send messages using a webhook if there is at least one duplicate record. Which of the following approaches can be taken to integrate an alert with the current data quality monitoring system?
Your colleague was walking you through how a job was set up, but you noticed a warning message that said, "Jobs running on an all-purpose cluster are considered all-purpose compute." The colleague was not sure why he was getting the warning message; how do you best explain this warning message?
If you create a database sample_db with the statement CREATE DATABASE sample_db, what will be the default location of the database in DBFS?
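One way to check the resolved location after creation, as a hedged sketch (the warehouse path in the comment is the typical default when no LOCATION clause is given, though it can differ per workspace configuration):

# Minimal sketch: DESCRIBE DATABASE EXTENDED prints the Location property;
# without a LOCATION clause it typically resolves under the Hive warehouse
# directory, e.g. dbfs:/user/hive/warehouse/sample_db.db.
spark.sql("DESCRIBE DATABASE EXTENDED sample_db").show(truncate=False)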
Which of the following techniques does Structured Streaming use to ensure recovery from failures during stream processing?
At the end of the inventory process a file gets uploaded to cloud object storage, and you are asked to build a process to ingest the data. Which of the following methods can be used to ingest the data incrementally? The schema of the file is expected to change over time, and the ingestion process should be able to handle these changes automatically. Below is the Auto Loader command to load the data; fill in the blanks for successful execution of the code below.
spark.readStream
    .format("cloudFiles")
    .option("_______", "csv")
    .option("_______", "dbfs:/location/checkpoint/")
    .load(data_source)
    .writeStream
    .option("_______", "dbfs:/location/checkpoint/")
    .option("_______", "true")
    .table(table_name)
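One way the blanks are commonly filled in, shown as a hedged sketch (data_source and table_name are assumed to be defined elsewhere; the option names are standard Auto Loader / Structured Streaming options):

# Minimal sketch: cloudFiles.format names the incoming file format,
# cloudFiles.schemaLocation lets Auto Loader store and evolve the inferred schema,
# checkpointLocation enables failure recovery, and mergeSchema lets the target
# table's schema evolve along with the incoming data.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "dbfs:/location/checkpoint/")
    .load(data_source)
    .writeStream
    .option("checkpointLocation", "dbfs:/location/checkpoint/")
    .option("mergeSchema", "true")
    .table(table_name))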
What is the output of the below function when executed with input parameters 1, 3?
def check_input(x, y):
    if x < y:
        x = x + 1
    if x > y:
        x = x + 1
    if x < y:
        x = x + 1
    return x
You have noticed the data science team is using the notebook versioning feature with Git integration, and you have recommended they switch to Databricks Repos. Which of the below reasons could be why the team needs to switch to Databricks Repos?
The data engineering team noticed that one of the jobs normally finishes in 15 minutes but gets stuck randomly when reading remote databases due to network packet drops. Which of the following steps can be used to improve the stability of the job?
Which of the following tools provides data access control, access auditing, data lineage, and data discovery?
You are working on a table called orders which contains data for 2021, and you have a second table called orders_archive which contains data for 2020. You need to combine the data from the two tables, and there could be identical rows in both tables. You are looking to combine the results from both tables and eliminate the duplicate rows; which of the following SQL statements helps you accomplish this?
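A hedged sketch of the combine-and-deduplicate pattern the question describes, using the table names from the question:

# Minimal sketch: UNION (unlike UNION ALL) removes duplicate rows when
# combining the two result sets.
combined = spark.sql("""
  SELECT * FROM orders
  UNION
  SELECT * FROM orders_archive
""")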
The marketing team is launching a new campaign and wants to monitor its performance for the first two weeks. They would like to set up a dashboard with a refresh schedule that runs every 5 minutes; which of the below steps can be taken to reduce the cost of this refresh over time?
Which of the following Auto Loader structured streaming commands successfully performs a hop from the landing area into Bronze?
When working with Auto Loader, you noticed that most of the columns were inferred as string data types during loading, including columns that were supposed to be integers. How can we fix this?
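One common fix is to enable typed schema inference; a minimal sketch of the read side, assuming a CSV source and hypothetical paths:

# Minimal sketch: for text-based formats Auto Loader infers all columns as strings
# by default; setting cloudFiles.inferColumnTypes to true asks it to sample the
# data and infer numeric and other types instead. Paths are hypothetical.
(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "dbfs:/tmp/inventory/_schema/")
    .option("cloudFiles.inferColumnTypes", "true")
    .load("dbfs:/landing/inventory/"))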
The below Spark command is looking to create a summary table based on customerId and the number of times the customerId is present in the events_log Delta table, and write a one-time micro-batch to a summary table. Fill in the blanks to complete the query.
spark._________
    .format("delta")
    .table("events_log")
    .groupBy("customerId")
    .count()
    ._______
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/delta/eventsByCustomer/_checkpoints/")
    .trigger(______)
    .table("target_table")
You were asked to write Python code to stop all running streams. Which of the following commands can be used to get a list of all active streams currently running so we can stop them? Fill in the blank.
for s in _______________:
    s.stop()
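A hedged completion using the StreamingQueryManager's list of active queries:

# Minimal sketch: spark.streams.active returns the currently active
# StreamingQuery objects on this SparkSession; stop() halts each one.
for s in spark.streams.active:
    s.stop()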
Which of the following two options are supported for identifying the arrival of new files and incremental data from cloud object storage using Auto Loader?
You are trying to calculate the total sales made by all employees by parsing a complex struct data type that stores employee and sales data. How would you approach this in SQL?
Table definition:
batchId INT, performance ARRAY<STRUCT<employeeId: INT, sales: INT>>
Sample data of the performance column:
[
  { "employeeId": 1234, "sales": 10000 },
  { "employeeId": 3232, "sales": 30000 }
]
Calculate the total sales made by all the employees.
Sample data with create table syntax for the data:
create or replace table sales as
select 1 as batchId,
       from_json('[{ "employeeId":1234,"sales" : 10000 },{ "employeeId":3232,"sales" : 30000 }]',
                 'ARRAY<STRUCT<employeeId: INT, sales: INT>>') as performance,
       current_timestamp() as insertDate
union all
select 2 as batchId,
       from_json('[{ "employeeId":1235,"sales" : 10500 },{ "employeeId":3233,"sales" : 32000 }]',
                 'ARRAY<STRUCT<employeeId: INT, sales: INT>>') as performance,
       current_timestamp() as insertDate
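One way to answer the question, as a hedged sketch relying only on the sales table defined above: explode the performance array and sum the nested sales field.

# Minimal sketch: explode() turns every array element into its own row, after
# which the struct field sales can be summed across all employees and batches.
spark.sql("""
  SELECT sum(employee.sales) AS total_sales
  FROM (
    SELECT explode(performance) AS employee
    FROM sales
  ) AS exploded
""").show()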
You have written a notebook to generate a summary data set for reporting. The notebook was scheduled using a job cluster, but you realized it takes 8 minutes to start the cluster. What feature can be used to start the cluster in a timely fashion so your job can run immediately?
Which of the following operations are not supported on a streaming dataset view?
spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_view")