# Module 6 Homework

In this homework we'll put what we learned about Spark in practice.

For this homework we will be using the Yellow 2025-11 data from the official website:

```
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
```

## Question 1: Install Spark and PySpark

* Install Spark
* Run PySpark
* Create a local spark session
* Execute `spark.version`. What's the output?

**Answer:**

Following the updated guidelines for the 2026 cohort, the setup process for Spark 4.x on macOS has been greatly simplified. Since PySpark 4.x bundles its own Spark distribution, there is no longer a need to manually download Spark binaries or set up the `SPARK_HOME` environment variable.

I installed Java 17 via Homebrew and used `uv` (an extremely fast Python package manager) to handle the dependencies:

```bash
# Install Java 17
brew install openjdk@17

# Initialize uv and add pyspark
uv init
uv add pyspark
```

After setting up the environment, I created a local Spark session to verify the installation:

```python
import pyspark
from pyspark.sql import SparkSession

# Initialize a local Spark session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

# Check the Spark version
print(f"Spark version: {spark.version}")
```

Output: **'4.0.1'**

The output confirms that Spark 4.x is successfully installed and running locally.

## Question 2: Yellow November 2025

Read the November 2025 Yellow into a Spark Dataframe.

Repartition the Dataframe to 4 partitions and save it to parquet.

What is the average size of the Parquet (ending with .parquet extension) files that were created (in MB)?

First, I downloaded the requested dataset using `wget`:

```bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet
```

Then, I used PySpark to read the file, repartition the dataframe into exactly 4 partitions, and write the output to a new directory.
Repartitioning is a crucial operation in Spark to optimize parallel processing by distributing data evenly across executors.

```python
import pyspark
from pyspark.sql import SparkSession

# Initialize session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('Zoomcamp_Q2') \
    .getOrCreate()

# Read the original parquet file
df = spark.read.parquet('yellow_tripdata_2025-11.parquet')

# Repartition and save
output_path = 'data/pq/yellow/2025/11/'
df.repartition(4).write.parquet(output_path, mode='overwrite')
```

After the job completed, I checked the size of the generated `.parquet` files inside the output directory using the terminal:

```bash
ls -lh data/pq/yellow/2025/11/
```

Answer: **25 MB**

## Question 3: Count records

How many taxi trips were there on the 15th of November?

Consider only trips that started on the 15th of November.

Answer: **162,604**

### Methodology

To find the total number of trips that started on a specific day, I needed to filter the dataframe based on the pickup datetime column (`tpep_pickup_datetime`). Since the column contains timestamp data (both date and time), I imported PySpark's SQL functions to cast the timestamp to a simple date using `F.to_date()`. This ensures that trips occurring at any time during that specific day are correctly counted.

```python
from pyspark.sql import functions as F

# Filter the dataframe for trips starting exactly on 2025-11-15 and count them
trips_nov_15 = df_trips.filter(
    F.to_date(df_trips.tpep_pickup_datetime) == '2025-11-15'
).count()

print(f"Total trips on November 15, 2025: {trips_nov_15}")
```

## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

Answer: **90.6**

### Methodology

To find the longest trip, I needed to calculate the time difference between the dropoff and pickup datetime columns.
Since Spark handles these columns as timestamps, a robust approach is to convert them into Unix timestamps (seconds since the Unix epoch), subtract the pickup time from the dropoff time, and divide the result by 3600 to convert seconds into hours.

I solved this using PySpark's DataFrame API:

```python
from pyspark.sql import functions as F

# Calculate duration in hours
df_with_duration = df_trips.withColumn(
    'duration_hours',
    (F.unix_timestamp('tpep_dropoff_datetime')
     - F.unix_timestamp('tpep_pickup_datetime')) / 3600
)

# Get the maximum duration
longest_trip = df_with_duration.select(F.max('duration_hours')).first()[0]

print(f"Longest trip in hours: {longest_trip}")
```

## Question 5: User Interface

Spark's User Interface which shows the application's dashboard runs on which local port?

* 80
* 443
* 4040
* 8080

Answer: **4040**

When a `SparkSession` is initialized, Apache Spark automatically launches a web UI to monitor and inspect the application's execution (including jobs, stages, tasks, memory usage, and the DAG). By default, this application dashboard is hosted on port **4040**. (Note: if port 4040 is already in use by another active Spark context, Spark increments the port number to 4041, 4042, and so on.)

We can verify the exact URL of the active session programmatically by running:

```python
print(spark.sparkContext.uiWebUrl)
```

## Question 6: Least frequent pickup location zone

Load the zone lookup data into a temp view in Spark.

Using the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone?

Answer: **Governor's Island/Ellis Island/Liberty Island** or **Arden Heights**

To identify the least frequent pickup zone, I needed to join the main trips dataset with the provided zone lookup CSV. As suggested by the instructions, I used Spark SQL to perform the aggregation and join.
First, I downloaded the lookup table into my raw data directory:

```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```

Next, I read the CSV into a Spark DataFrame, created temporary views for both datasets, and executed an SQL query to count the pickups per zone, ordering the results in ascending order to find the minimum:

```python
# Load the zone lookup data
df_zones = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv('/data/homework/raw/taxi_zone_lookup.csv')

# Create temporary views
df_zones.createOrReplaceTempView("zones")
df_trips.createOrReplaceTempView("trips")

# Query to find the least frequent pickup location zone
spark.sql("""
    SELECT
        z.Zone,
        COUNT(1) AS pickup_count
    FROM trips t
    JOIN zones z
        ON t.PULocationID = z.LocationID
    GROUP BY z.Zone
    ORDER BY pickup_count ASC
    LIMIT 1
""").show(truncate=False)
```