Demo: Running Spark on Databricks
Lesson objectives
In this lesson, we will cover the following topics:
- Demonstrate the process of running Spark on Databricks.
- Understand the benefits of using Databricks for Spark workloads.
- Explore practical examples of Spark applications running on Databricks.
Introduction to RDDs
An RDD (Resilient Distributed Dataset) is Spark's core abstraction: an immutable, partitioned collection of elements that can be processed in parallel.
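The snippets below assume a Databricks notebook, where `spark` (the SparkSession) and `sc` (the SparkContext) are created for you. To run them elsewhere, a minimal local setup might look like this (the app name is purely illustrative):
from pyspark.sql import SparkSession
# In a Databricks notebook, `spark` and `sc` already exist.
# Outside Databricks, create them explicitly; local[*] runs Spark on all local cores.
spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext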
# Create an RDD from a list of numbers
numbers = [1, 2, 3, 4, 5]
numbers_rdd = sc.parallelize(numbers)
# Print the RDD object (this shows the RDD's description, not its elements)
print(numbers_rdd)
# Apply a transformation: multiply each number by 2
doubled_rdd = numbers_rdd.map(lambda x: x * 2)
# Perform an action: collect the results to a list
result = doubled_rdd.collect()
# Print the result
print(result) # Output: [2, 4, 6, 8, 10]
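As one more hedged sketch (reusing `numbers_rdd` from above; the variable names are illustrative), a transformation can be chained with a different action:
# Keep only even numbers (transformation), then sum them (action)
evens_rdd = numbers_rdd.filter(lambda x: x % 2 == 0)
total = evens_rdd.reduce(lambda a, b: a + b)
print(total) # Output: 6 (2 + 4)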
Spark Lazy Evaluation
Spark evaluates transformations lazily: operations such as map and filter only record a lineage of work to perform, and nothing executes until an action (such as collect) is called.
# Create an RDD
rdd = sc.parallelize([
("John", 28),
("Smith", 44),
("Adam", 65),
("Henry", 23)
])
# Apply a map transformation: build a (name, age, flag) tuple, where the flag
# marks whether the person is older than 30
mapped_rdd = rdd.map(lambda x: (x[0], x[1], x[1] > 30))
# Filter the RDD to include only people older than 30
filtered_rdd = mapped_rdd.filter(lambda x: x[2])
# Convert the filtered RDD to a DataFrame
df = spark.createDataFrame(filtered_rdd, ["Name", "Age", "OlderThan30"])
# Select only the name and age columns
final_df = df.select("Name", "Age")
# Collect the results, which triggers execution of all the transformations above
results = final_df.collect()
print(results)
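To make the laziness itself visible, here is a minimal sketch (assuming the same `sc`; `traced_double` is an illustrative helper, not part of the lesson's notebook). In local mode the trace prints to the console; on a cluster it would appear in the executor logs:
# A traced function: the print reveals when the work actually runs
def traced_double(x):
    print(f"processing {x}")
    return x * 2

lazy_rdd = sc.parallelize([1, 2, 3]).map(traced_double)
print(lazy_rdd.toDebugString().decode()) # lineage recorded, nothing computed yet
result = lazy_rdd.collect() # the action: only now does traced_double run
print(result) # Output: [2, 4, 6]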
Watch on YouTube
Watch on our Servers
Download the video from our Servers
You can download the video from the link below; right-click it and choose "Save link as": Download Video
Download the code
You can download the Jupyter notebook, Databricks Notebook, or the Python source code using the following links: