# Module 6 Homework ## Question 1: Install Spark and PySpark > Install PySpark follow this [guide](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/06-batch/setup/) > - Install Spark > - Run PySpark > - Create a local spark session > - Execute spark.version. > What's the output? ```python import pyspark from pyspark.sql import SparkSession spark = SparkSession.builder \ .master("local[*]") \ .appName('homework') \ .getOrCreate() print(f"Spark version: {spark.version}") ``` ``` Spark version: 3.5.8 ``` The output is **3.5.8**. ## Question 2: Yellow November 2025 > Read the November 2025 Yellow into a Spark Dataframe. > > Repartition the Dataframe to 4 partitions and save it to parquet. > > What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? ```python !mkdir -p /data/homework/raw !mkdir -p /data/homework/parquet !wget -nc https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2025-11.parquet \ -O /data/homework/raw/yellow_tripdata_2025-11.parquet ``` ```python # Read the November 2025 Yellow into a Spark Dataframe. df_trips = spark.read.parquet('/data/homework/raw/*.parquet') # Repartition the Dataframe to 4 partitions and save it to parquet. df_trips.repartition(4).write.parquet('/data/homework/parquet', mode='overwrite') ``` ```python # What is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)? !ls -lh /data/homework/parquet/*.parquet ``` ``` -rw-r--r-- 1 root root 25M Mar 7 11:24 /data/homework/parquet/part-00000-7ef2f8c4-626d-4401-8cc4-97eb59ad5841-c000.snappy.parquet -rw-r--r-- 1 root root 25M Mar 7 11:24 /data/homework/parquet/part-00001-7ef2f8c4-626d-4401-8cc4-97eb59ad5841-c000.snappy.parquet -rw-r--r-- 1 root root 25M Mar 7 11:24 /data/homework/parquet/part-00002-7ef2f8c4-626d-4401-8cc4-97eb59ad5841-c000.snappy.parquet -rw-r--r-- 1 root root 25M Mar 7 11:24 /data/homework/parquet/part-00003-7ef2f8c4-626d-4401-8cc4-97eb59ad5841-c000.snappy.parquet ``` > Select the answer which most closely matches. > > - 6MB > - 25MB > - 75MB > - 100MB The average size is **25MB**. ## Question 3: Count records > How many taxi trips were there on the 15th of November? > > Consider only trips that started on the 15th of November. ```python from pyspark.sql import functions as F df_trips = df_trips.withColumn('pickup_date', F.to_date(df_trips.tpep_pickup_datetime)) df_trips.filter(df_trips.pickup_date == '2025-11-15').count() ``` ``` 162604 ``` The number of trips started on the 15th of November is: - **162,604** ## Question 4: Longest trip > What is the length of the longest trip in the dataset in hours? - 22.7 - 58.2 - 90.6 - 134.5 ```python from pyspark.sql import types from datetime import datetime def duration(start: datetime, end: datetime) -> int: # Difference in minutes (absolute value to support inverted values) minutes = abs((end - start).total_seconds()) / 60 # Conversion to hours hours = minutes / 60 return hours duration_udf = F.udf(duration, returnType=types.FloatType()) df_trips \ .withColumn('duration_hours', duration_udf(df_trips.tpep_pickup_datetime, df_trips.tpep_dropoff_datetime)) \ .groupBy() \ .max('duration_hours') \ .show() ``` ``` +-------------------+ |max(duration_hours)| +-------------------+ | 90.64667| +-------------------+ ``` The answer is **90.6**. ## Question 5: User Interface > Spark's User Interface which shows the application's dashboard runs on which local port? It runs on port **4040**. ## Question 6: Least frequent pickup location zone > Load the zone lookup data into a temp view in Spark. ```python !wget -nc https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv \ -O /data/homework/raw/taxi_zone_lookup.csv ``` ```python df_zones = spark.read.csv('/data/homework/raw/taxi_zone_lookup.csv', header=True) df_trips.createOrReplaceTempView('trips') df_zones.createOrReplaceTempView('zones') spark.sql(""" WITH trips_by_pickup_location AS ( SELECT PULocationID, COUNT(*) AS trip_count FROM trips GROUP BY PULocationID ORDER BY trip_count ) SELECT z.Zone, SUM(t.trip_count) AS trip_count FROM zones AS z LEFT JOIN trips_by_pickup_location AS t ON t.PULocationID = z.LocationID GROUP BY z.Zone ORDER BY trip_count """).show(truncate=False) ``` ``` +---------------------------------------------+----------+ |Zone |trip_count| +---------------------------------------------+----------+ |Charleston/Tottenville |NULL | |Freshkills Park |NULL | |Great Kills Park |NULL | |Governor's Island/Ellis Island/Liberty Island|1 | |Eltingville/Annadale/Prince's Bay |1 | |Arden Heights |1 | |Port Richmond |3 | |Rikers Island |4 | |Rossville/Woodrow |4 | |Great Kills |4 | |Green-Wood Cemetery |4 | |Jamaica Bay |5 | |Westerleigh |12 | |West Brighton |14 | |New Dorp/Midland Beach |14 | |Oakwood |14 | |Crotona Park |14 | |Willets Point |15 | |Breezy Point/Fort Tilden/Riis Beach |16 | |Saint George/New Brighton |17 | +---------------------------------------------+----------+ only showing top 20 rows ``` > Using the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone? The 3 zones with the least number of trips are these zones, which had no trip recorded during that period: * **Charleston/Tottenville** * **Freshkills Park** * **Great Kills Park** Then, these 3 zones had only one trip recorded: * **Governor's Island/Ellis Island/Liberty Island** * **Eltingville/Annadale/Prince's Bay** * **Arden Heights** For **Rikers Island** 4 trips were recorded and for **Jamaica Bay** there were 5. ## Submitting the solutions - Form for submitting: https://courses.datatalks.club/de-zoomcamp-2026/homework/hw6 - Deadline: See the website