
Databricks Certified Associate Developer for Apache Spark 3.5 – Python Questions and Answers

Questions 4

4 of 55.

A developer is working on a Spark application that processes a large dataset using SQL queries. Despite having a large cluster, the developer notices that the job is underutilizing the available resources. Executors remain idle for most of the time, and logs reveal that the number of tasks per stage is very low. The developer suspects that this is causing suboptimal cluster performance.

Which action should the developer take to improve cluster utilization?

Options:

A. Increase the value of spark.sql.shuffle.partitions
B. Reduce the value of spark.sql.shuffle.partitions
C. Enable dynamic resource allocation to scale resources as needed
D. Increase the size of the dataset to create more partitions

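Note: for reference, a minimal sketch (not tied to any particular option) of how the shuffle-partition setting named in the options can be changed at runtime; the value 400 and the toy aggregation are purely illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-partitions-demo").getOrCreate()

# spark.sql.shuffle.partitions controls how many tasks each shuffle stage produces;
# more partitions mean more, smaller tasks that can keep otherwise idle executors busy.
spark.conf.set("spark.sql.shuffle.partitions", "400")   # illustrative value

df = spark.range(1_000_000)
df.groupBy((df.id % 10).alias("bucket")).count().show()
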
Questions 5

35 of 55.

A data engineer is building a Structured Streaming pipeline and wants it to recover from failures or intentional shutdowns by continuing where it left off.

How can this be achieved?

Options:

A. By configuring the option recoveryLocation during SparkSession initialization.
B. By configuring the option checkpointLocation during readStream.
C. By configuring the option checkpointLocation during writeStream.
D. By configuring the option recoveryLocation during writeStream.

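Note: for context, a minimal sketch of a streaming write that uses a checkpoint location so the query can resume where it left off after a restart; the rate source and the paths are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkpoint-demo").getOrCreate()

# The rate source just generates rows; any streaming source works the same way.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

query = (stream_df.writeStream
         .format("parquet")
         .option("path", "/tmp/demo/output")                    # illustrative path
         .option("checkpointLocation", "/tmp/demo/checkpoint")  # state used to recover on restart
         .outputMode("append")
         .start())
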
Questions 6

A data engineer wants to process a streaming DataFrame that receives sensor readings every second with columns sensor_id, temperature, and timestamp. The engineer needs to calculate the average temperature for each sensor over the last 5 minutes while the data is streaming.

Which code implementation achieves the requirement?

(The four candidate code implementations were provided as images and are not reproduced here.)

Options:

A. Option A
B. Option B
C. Option C
D. Option D

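Note: because the original answer options are images, the following is only an illustrative sketch of the pattern the question describes: a per-sensor average temperature over a 5-minute window on a streaming DataFrame. The column names follow the question; the rate-source stand-in data and the watermark duration are assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-window-demo").getOrCreate()

# Synthetic stand-in for the sensor stream described in the question:
# the rate source provides a timestamp; sensor_id and temperature are derived for illustration.
sensor_df = (spark.readStream.format("rate").option("rowsPerSecond", 5).load()
             .select((F.col("value") % 3).alias("sensor_id"),
                     (F.rand() * 40).alias("temperature"),
                     F.col("timestamp")))

windowed_avg = (sensor_df
                .withWatermark("timestamp", "5 minutes")       # assumption: 5 minutes of allowed lateness
                .groupBy(F.window("timestamp", "5 minutes"),   # 5-minute tumbling window
                         "sensor_id")
                .agg(F.avg("temperature").alias("avg_temperature")))

query = (windowed_avg.writeStream
         .outputMode("update")
         .format("console")
         .start())
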
Questions 7

29 of 55.

A Spark application is experiencing performance issues in client mode due to the driver being resource-constrained.

How should this issue be resolved?

Options:

A. Switch the deployment mode to cluster mode.
B. Add more executor instances to the cluster.
C. Increase the driver memory on the client machine.
D. Switch the deployment mode to local mode.

Questions 8

A data engineer observes that an upstream streaming source sends duplicate records, where duplicates share the same key and have at most a 30-minute difference in event_timestamp. The engineer adds:

dropDuplicatesWithinWatermark("event_timestamp", "30 minutes")

What is the result?

Options:

A. It is not able to handle deduplication in this scenario
B. It removes duplicates that arrive within the 30-minute window specified by the watermark
C. It removes all duplicates regardless of when they arrive
D. It accepts watermarks in seconds and the code results in an error

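Note: for context, a minimal sketch of watermark-based deduplication in Spark 3.5. dropDuplicatesWithinWatermark takes the key columns as its argument and relies on a separate withWatermark call for the time bound; the rate source and the key/column names here are illustrative assumptions.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

# Illustrative stream: the rate source's timestamp stands in for event_timestamp,
# and value % 100 stands in for the duplicate-prone key.
events = (spark.readStream.format("rate").option("rowsPerSecond", 10).load()
          .select((F.col("value") % 100).alias("key"),
                  F.col("timestamp").alias("event_timestamp")))

deduped = (events
           .withWatermark("event_timestamp", "30 minutes")   # bound how late duplicates may arrive
           .dropDuplicatesWithinWatermark(["key"]))          # drop records sharing the same key within the watermark

query = deduped.writeStream.outputMode("append").format("console").start()
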
Questions 9

17 of 55.

A data engineer has noticed that upgrading the Spark version in their applications from Spark 3.0 to Spark 3.5 has improved the runtime of some scheduled Spark applications.

Looking further, the data engineer realizes that Adaptive Query Execution (AQE) is now enabled.

Which operation should AQE be implementing to automatically improve the Spark application performance?

Options:

A. Dynamically switching join strategies
B. Collecting persistent table statistics and storing them in the metastore for future use
C. Improving the performance of single-stage Spark jobs
D. Optimizing the layout of Delta files on disk

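Note: for reference, a minimal sketch of the Adaptive Query Execution settings involved. spark.sql.adaptive.enabled is on by default in recent Spark releases; the explicit values below simply restate the usual defaults for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# AQE re-plans queries at runtime using shuffle statistics, e.g. switching a
# sort-merge join to a broadcast join when one side turns out to be small.
print(spark.conf.get("spark.sql.adaptive.enabled"))

# The flags can also be set explicitly (values shown are the usual defaults):
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
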
Questions 10

A data engineer noticed improved performance after upgrading from Spark 3.0 to Spark 3.5. The engineer found that Adaptive Query Execution (AQE) was enabled.

Which operation is AQE implementing to improve performance?

Options:

A. Dynamically switching join strategies
B. Collecting persistent table statistics and storing them in the metastore for future use
C. Improving the performance of single-stage Spark jobs
D. Optimizing the layout of Delta files on disk

Questions 11

A Spark engineer must select an appropriate deployment mode for the Spark jobs.

What is the benefit of using cluster mode in Apache Spark™?

Options:

A. In cluster mode, resources are allocated from a resource manager on the cluster, enabling better performance and scalability for large jobs.
B. In cluster mode, the driver is responsible for executing all tasks locally without distributing them across the worker nodes.
C. In cluster mode, the driver runs on the client machine, which can limit the application's ability to handle large datasets efficiently.
D. In cluster mode, the driver program runs on one of the worker nodes, allowing the application to fully utilize the distributed resources of the cluster.

Questions 12

15 of 55.

A data engineer is working on a Streaming DataFrame (streaming_df) with the following streaming data:

id | name       | count | timestamp
-----------------------------------
1  | Delhi      | 20    | 2024-09-19T10:11
1  | Delhi      | 50    | 2024-09-19T10:12
2  | London     | 50    | 2024-09-19T10:15
3  | Paris      | 30    | 2024-09-19T10:18
3  | Paris      | 20    | 2024-09-19T10:20
4  | Washington | 10    | 2024-09-19T10:22

Which operation is supported with streaming_df?

Options:

A. streaming_df.count()
B. streaming_df.filter("count < 30")
C. streaming_df.select(countDistinct("name"))
D. streaming_df.show()

Questions 13

40 of 55.

A developer wants to refactor older Spark code to take advantage of built-in functions introduced in Spark 3.5.

The original code:

from pyspark.sql import functions as F

min_price = 110.50
result_df = prices_df.filter(F.col("price") > min_price).agg(F.count("*"))

Which code block should the developer use to refactor the code?

Options:

A. result_df = prices_df.filter(F.col("price") > F.lit(min_price)).agg(F.count("*"))
B. result_df = prices_df.where(F.lit("price") > min_price).groupBy().count()
C. result_df = prices_df.withColumn("valid_price", when(col("price") > F.lit(min_price), True))
D. result_df = prices_df.filter(F.lit(min_price) > F.col("price")).count()

Questions 14

41 of 55.

A data engineer is working on the DataFrame df1 and wants the Name with the highest count to appear first (descending order by count), followed by the next highest, and so on.

The DataFrame has columns:

id | Name | count | timestamp
---------------------------------
1 | USA | 10
2 | India | 20
3 | England | 50
4 | India | 50
5 | France | 20
6 | India | 10
7 | USA | 30
8 | USA | 40

Which code fragment should the engineer use to sort the data in the Name and count columns?

Options:

A. df1.orderBy(col("count").desc(), col("Name").asc())
B. df1.sort("Name", "count")
C. df1.orderBy("Name", "count")
D. df1.orderBy(col("Name").desc(), col("count").asc())

Questions 15

A data engineer wants to write a Spark job that creates a new managed table. If the table already exists, the job should fail and not modify anything.

Which save mode and method should be used?

Options:

A. saveAsTable with mode ErrorIfExists
B. saveAsTable with mode Overwrite
C. save with mode Ignore
D. save with mode ErrorIfExists

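Note: for context, a minimal sketch of creating a managed table with a save mode that fails if the table already exists; the table name, columns, and data are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("save-mode-demo").getOrCreate()

df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

# "errorifexists" (the default write mode) raises an error if the table already exists,
# leaving any existing table untouched; saveAsTable without a path creates a managed table.
df.write.mode("errorifexists").saveAsTable("demo_managed_table")
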
Questions 16

44 of 55.

A data engineer is working on a real-time analytics pipeline using Spark Structured Streaming.

They want the system to process incoming data in micro-batches at a fixed interval of 5 seconds.

Which code snippet fulfills this requirement?

Options:

A. query = df.writeStream.outputMode("append").trigger(processingTime="5 seconds").start()
B. query = df.writeStream.outputMode("append").trigger(continuous="5 seconds").start()
C. query = df.writeStream.outputMode("append").trigger(once=True).start()
D. query = df.writeStream.outputMode("append").start()

Questions 17

A data engineer is running a batch processing job on a Spark cluster with the following configuration:

10 worker nodes

16 CPU cores per worker node

64 GB RAM per node

The data engineer wants to allocate four executors per node, each executor using four cores.

What is the total number of CPU cores used by the application?

Options:

A. 160
B. 64
C. 80
D. 40

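Note: for reference, the arithmetic behind this configuration: 10 worker nodes × 4 executors per node × 4 cores per executor = 160 cores requested by the application. Each node uses 4 × 4 = 16 cores, which matches the 16 cores available per node.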
Questions 18

A data scientist is working on a large dataset in Apache Spark using PySpark. The data scientist has a DataFrame df with columns user_id, product_id, and purchase_amount and needs to perform some operations on this data efficiently.

Which sequence of operations results in transformations that require a shuffle followed by transformations that do not?

Options:

A. df.filter(df.purchase_amount > 100).groupBy("user_id").sum("purchase_amount")
B. df.withColumn("discount", df.purchase_amount * 0.1).select("discount")
C. df.withColumn("purchase_date", current_date()).where("total_purchase > 50")
D. df.groupBy("user_id").agg(sum("purchase_amount").alias("total_purchase")).repartition(10)

Questions 19

A developer wants to test Spark Connect with an existing Spark application.

What are the two alternative ways the developer can start a local Spark Connect server without changing their existing application code? (Choose 2 answers)

Options:

A. Execute their pyspark shell with the option --remote "https://localhost"
B. Execute their pyspark shell with the option --remote "sc://localhost"
C. Set the environment variable SPARK_REMOTE="sc://localhost" before starting the pyspark shell
D. Add .remote("sc://localhost") to their SparkSession.builder calls in their Spark code
E. Ensure the Spark property spark.connect.grpc.binding.port is set to 15002 in the application code

Questions 20

Which feature of Spark Connect is considered when designing an application to enable remote interaction with the Spark cluster?

Options:

A. It provides a way to run Spark applications remotely in any programming language
B. It can be used to interact with any remote cluster using the REST API
C. It allows for remote execution of Spark jobs
D. It is primarily used for data ingestion into Spark from external sources

Questions 21

Given the following code snippet in my_spark_app.py:

What is the role of the driver node?

Options:

A. The driver node orchestrates the execution by transforming actions into tasks and distributing them to worker nodes
B. The driver node only provides the user interface for monitoring the application
C. The driver node holds the DataFrame data and performs all computations locally
D. The driver node stores the final result after computations are completed by worker nodes

Questions 22

A data engineer wants to create a Streaming DataFrame that reads from a Kafka topic called feed.

Which code fragment should be inserted in line 5 to meet the requirement?

Code context:

spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
    .[LINE 5] \
    .load()

Options:

A. .option("subscribe", "feed")
B. .option("subscribe.topic", "feed")
C. .option("kafka.topic", "feed")
D. .option("topic", "feed")

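Note: for context, a minimal sketch of a complete Kafka streaming read; the broker addresses and the topic name follow the question, everything else is illustrative. Running it requires the Spark Kafka connector package to be available on the cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-read-demo").getOrCreate()

kafka_df = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
            .option("subscribe", "feed")   # subscribe to the Kafka topic named "feed"
            .load())

# Kafka delivers key/value as binary; cast the payload to a string for downstream processing.
messages = kafka_df.selectExpr("CAST(value AS STRING) AS message")
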
Questions 23

A Spark developer is building an app to monitor task performance. They need to track the maximum task processing time per worker node and consolidate it on the driver for analysis.

Which technique should be used?

Options:

A. Use an RDD action like reduce() to compute the maximum time
B. Use an accumulator to record the maximum time on the driver
C. Broadcast a variable to share the maximum time among workers
D. Configure the Spark UI to automatically collect maximum times

Questions 24

A Spark application developer wants to identify which operations cause shuffling, leading to a new stage in the Spark execution plan.

Which operation results in a shuffle and a new stage?

Options:

A. DataFrame.groupBy().agg()
B. DataFrame.filter()
C. DataFrame.withColumn()
D. DataFrame.select()

Questions 25

In the code block below, aggDF contains aggregations on a streaming DataFrame:

Which output mode at line 3 ensures that the entire result table is written to the console during each trigger execution?

Options:

A.

complete

B.

append

C.

replace

D.

aggregate

Buy Now
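Note: the original code block is not reproduced above, so the following is only an illustrative sketch of where an output mode is specified when writing a streaming aggregation to the console; the definition of aggDF here is an assumption.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("output-mode-demo").getOrCreate()

# Illustrative aggregation on a streaming DataFrame.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
aggDF = stream_df.groupBy((F.col("value") % 3).alias("bucket")).count()

query = (aggDF.writeStream
         .format("console")
         .outputMode("complete")   # "complete" rewrites the entire result table on every trigger
         .start())
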
Questions 26

A data analyst builds a Spark application to analyze finance data and performs the following operations: filter, select, groupBy, and coalesce.

Which operation results in a shuffle?

Options:

A. groupBy
B. filter
C. select
D. coalesce

Questions 27

A data scientist at a financial services company is working with a Spark DataFrame containing transaction records. The DataFrame has millions of rows and includes columns for transaction_id, account_number, transaction_amount, and timestamp. Due to an issue with the source system, some transactions were accidentally recorded multiple times with identical information across all fields. The data scientist needs to remove rows with duplicates across all fields to ensure accurate financial reporting.

Which approach should the data scientist use to deduplicate these transaction records using PySpark?

Options:

A. df = df.dropDuplicates()
B. df = df.groupBy("transaction_id").agg(F.first("account_number"), F.first("transaction_amount"), F.first("timestamp"))
C. df = df.filter(F.col("transaction_id").isNotNull())
D. df = df.dropDuplicates(["transaction_amount"])

Questions 28

13 of 55.

A developer needs to produce a Python dictionary using data stored in a small Parquet table, which looks like this:

region_id | region_name
------------------------
10        | North
12        | East
14        | West

The resulting Python dictionary must contain a mapping of region_id to region_name, containing the smallest 3 region_id values.

Which code fragment meets the requirements?

Options:

A. regions_dict = dict(regions.take(3))
B. regions_dict = regions.select("region_id", "region_name").take(3)
C. regions_dict = dict(regions.select("region_id", "region_name").rdd.collect())
D. regions_dict = dict(regions.orderBy("region_id").limit(3).rdd.map(lambda x: (x.region_id, x.region_name)).collect())

Questions 29

34 of 55.

A data engineer is investigating a Spark cluster that is experiencing underutilization during scheduled batch jobs.

After checking the Spark logs, they noticed that tasks are often getting killed due to timeout errors, and there are several warnings about insufficient resources in the logs.

Which action should the engineer take to resolve the underutilization issue?

Options:

A. Set the spark.network.timeout property to allow tasks more time to complete without being killed.
B. Increase the executor memory allocation in the Spark configuration.
C. Reduce the size of the data partitions to improve task scheduling.
D. Increase the number of executor instances to handle more concurrent tasks.

Questions 30

A data analyst wants to add a column named date, derived from an existing timestamp column. Which code fragment achieves this?

Options:

A. dates_df.withColumn("date", f.unix_timestamp("timestamp")).show()
B. dates_df.withColumn("date", f.to_date("timestamp")).show()
C. dates_df.withColumn("date", f.date_format("timestamp", "yyyy-MM-dd")).show()
D. dates_df.withColumn("date", f.from_unixtime("timestamp")).show()

Questions 31

What is the risk associated with converting a large Pandas API on Spark DataFrame back to a Pandas DataFrame?

Options:

A. The conversion will automatically distribute the data across worker nodes
B. The operation will fail if the Pandas DataFrame exceeds 1000 rows
C. Data will be lost during conversion
D. The operation will load all data into the driver's memory, potentially causing memory overflow

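Note: for context, a minimal sketch of the conversion the question refers to; the tiny example data is illustrative only.

import pyspark.pandas as ps

# A pandas-API-on-Spark DataFrame is distributed across the cluster.
psdf = ps.DataFrame({"user_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# to_pandas() collects every row into the driver's memory as a plain pandas DataFrame,
# which can cause out-of-memory errors when the dataset is large.
pdf = psdf.to_pandas()
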
Questions 32

A data scientist is working on a project that requires processing large amounts of structured data, performing SQL queries, and applying machine learning algorithms. The data scientist is considering using Apache Spark for this task.

Which combination of Apache Spark modules should the data scientist use in this scenario?

Options:

A. Spark DataFrames, Structured Streaming, and GraphX
B. Spark SQL, Pandas API on Spark, and Structured Streaming
C. Spark Streaming, GraphX, and Pandas API on Spark
D. Spark DataFrames, Spark SQL, and MLlib

Questions 33

A developer is running Spark SQL queries and notices underutilization of resources. Executors are idle, and the number of tasks per stage is low.

What should the developer do to improve cluster utilization?

Options:

A. Increase the value of spark.sql.shuffle.partitions
B. Reduce the value of spark.sql.shuffle.partitions
C. Increase the size of the dataset to create more partitions
D. Enable dynamic resource allocation to scale resources as needed

Questions 34

32 of 55.

A developer is creating a Spark application that performs multiple DataFrame transformations and actions. The developer wants to maintain optimal performance by properly managing the SparkSession.

How should the developer handle the SparkSession throughout the application?

Options:

A.

Use a single SparkSession instance for the entire application.

B.

Avoid using a SparkSession and rely on SparkContext only.

C.

Create a new SparkSession instance before each transformation.

D.

Stop and restart the SparkSession after each action.

Buy Now
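Note: for context, a minimal sketch of the common pattern of sharing one SparkSession across an application via getOrCreate(), which returns the already-running session instead of building a new one; the helper function and app name are illustrative.

from pyspark.sql import SparkSession

def get_spark():
    # getOrCreate() returns the existing active session if there is one,
    # so repeated calls throughout the application reuse a single SparkSession.
    return SparkSession.builder.appName("my-app").getOrCreate()

spark = get_spark()
df = spark.range(10)
assert get_spark() is spark   # later calls hand back the same session object
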
Questions 35

An engineer has two DataFrames: df1 (small) and df2 (large). A broadcast join is used:

from pyspark.sql.functions import broadcast

result = df2.join(broadcast(df1), on='id', how='inner')

What is the purpose of using broadcast() in this scenario?

Options:

A. It filters the id values before performing the join.
B. It increases the partition size for df1 and df2.
C. It reduces the number of shuffle operations by replicating the smaller DataFrame to all nodes.
D. It ensures that the join happens only when the id values are identical.

Questions 36

36 of 55.

What is the main advantage of partitioning the data when persisting tables?

Options:

A. It compresses the data to save disk space.
B. It automatically cleans up unused partitions to optimize storage.
C. It ensures that data is loaded into memory all at once for faster query execution.
D. It optimizes by reading only the relevant subset of data from fewer partitions.

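Note: for context, a minimal sketch of persisting a partitioned table so that queries filtering on the partition column only read the matching partitions; the table name, columns, and data are illustrative assumptions.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-demo").getOrCreate()

sales = spark.createDataFrame(
    [("2024-01-01", "US", 100.0), ("2024-01-02", "DE", 80.0)],
    ["sale_date", "country", "amount"])

# Each distinct country value becomes its own directory of files on disk.
sales.write.partitionBy("country").mode("overwrite").saveAsTable("sales_partitioned")

# A filter on the partition column lets Spark prune directories and read only that subset.
spark.table("sales_partitioned").where("country = 'US'").show()
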
Questions 37

A data engineer is working on the DataFrame:

(Referring to the table image: it has columns Id, Name, count, and timestamp.)

Which code fragment should the engineer use to extract the unique values in the Name column into an alphabetically ordered list?

Options:

A. df.select("Name").orderBy(df["Name"].asc())
B. df.select("Name").distinct().orderBy(df["Name"])
C. df.select("Name").distinct()
D. df.select("Name").distinct().orderBy(df["Name"].desc())

Questions 38

Given a CSV file with the content:

And the following code:

from pyspark.sql.types import *

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

spark.read.schema(schema).csv(path).collect()

What is the resulting output?

Options:

A. [Row(name='bambi'), Row(name='alladin', age=20)]
B. [Row(name='alladin', age=20)]
C. [Row(name='bambi', age=None), Row(name='alladin', age=20)]
D. The code throws an error due to a schema mismatch.

Questions 39

A Spark engineer is troubleshooting a Spark application that has been encountering out-of-memory errors during execution. By reviewing the Spark driver logs, the engineer notices multiple "GC overhead limit exceeded" messages.

Which action should the engineer take to resolve this issue?

Options:

A. Optimize the data processing logic by repartitioning the DataFrame.
B. Modify the Spark configuration to disable garbage collection.
C. Increase the memory allocated to the Spark Driver.
D. Cache large DataFrames to persist them in memory.

Questions 40

25 of 55.

A Data Analyst is working on employees_df and needs to add a new column where a 10% tax is calculated on the salary.

Additionally, the DataFrame contains the column age, which is not needed.

Which code fragment adds the tax column and removes the age column?

Options:

A. employees_df = employees_df.withColumn("tax", col("salary") * 0.1).drop("age")
B. employees_df = employees_df.withColumn("tax", lit(0.1)).drop("age")
C. employees_df = employees_df.dropField("age").withColumn("tax", col("salary") * 0.1)
D. employees_df = employees_df.withColumn("tax", col("salary") + 0.1).drop("age")

Exam Name: Databricks Certified Associate Developer for Apache Spark 3.5 – Python
Last Update: Oct 19, 2025
Questions: 136