## Question 1: Install Spark and PySpark

- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.

What's the output?

I ran the provided `test_pyspark.py` file and got:

>- 4.1.1

## Question 2: Yellow November 2025

Read the November 2025 Yellow trip data into a Spark DataFrame.

Repartition the DataFrame to 4 partitions and save it to parquet.

```Python
# Parquet files carry their own schema, so no "header" option is needed
df = spark.read.parquet("yellow_tripdata_2025-11.parquet")

# Redistribute into 4 partitions so the write produces 4 part files
df = df.repartition(4)

df.write.parquet('yellowtrip/november')
```

What is the average size of the Parquet (ending with .parquet extension) files that were created (in MB)? Select the answer which most closely matches.

>- 25MB

## Question 3: Count records

How many taxi trips were there on the 15th of November?

Consider only trips that started on the 15th of November.

```Python
from pyspark.sql import functions as F

# Truncate the pickup timestamp to a date, then keep only 15 November
df.withColumn('pickup_date', F.to_date(df.tpep_pickup_datetime)) \
    .filter(F.col('pickup_date') == '2025-11-15') \
    .select(F.count('pickup_date')) \
    .show()
```

>- 162,604

## Question 4: Longest trip

What is the length of the longest trip in the dataset in hours?

```Python
from pyspark.sql import functions as F

# Duration in hours = (dropoff - pickup) in seconds / 3600
df.withColumn('duration_hours',
              (F.unix_timestamp('tpep_dropoff_datetime') - F.unix_timestamp('tpep_pickup_datetime')) / 3600) \
    .select('duration_hours', 'trip_distance') \
    .orderBy(F.col('duration_hours').desc()) \
    .limit(1) \
    .show()
```

>- 90.6

## Question 5: User Interface

Spark's User Interface, which shows the application's dashboard, runs on which local port?

>- 4040

## Question 6: Least frequent pickup location zone

Load the zone lookup data into a temp view in Spark:

```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```

Using the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone?
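```Python
from pyspark.sql import functions as F

# df_zones was not defined in the original snippet; the view name 'zones' is an assumption
df_zones = spark.read.option("header", "true").csv('taxi_zone_lookup.csv')
df_zones.createOrReplaceTempView('zones')

df_join = df.join(df_zones, df.PULocationID == df_zones.LocationID)
df_join.groupBy('Zone') \
    .agg(F.count('Zone').alias('n_trips')) \
    .orderBy(F.col('n_trips').asc()) \
    .show()
```

>- Governor's Island/Ellis Island/Liberty Island
>- Arden Heights

If multiple answers are correct, select any.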
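
Both zones above can be correct when they tie for the lowest trip count, since a sorted `show()` puts tied rows in no guaranteed order. A minimal sketch of how such ties could be surfaced, using only the Python standard library on a made-up sample (the trip list and the helper name `least_frequent` are hypothetical and not taken from the dataset):

```python
from collections import Counter


def least_frequent(zones):
    """Return (min_count, sorted list of zones tied at that count)."""
    counts = Counter(zones)
    if not counts:
        return 0, []
    min_count = min(counts.values())
    # Keep every zone whose count equals the minimum, not just the first one
    tied = sorted(zone for zone, n in counts.items() if n == min_count)
    return min_count, tied


# Made-up pickup zones, one entry per trip (illustrative only, not real data)
sample_pickups = [
    "JFK Airport", "JFK Airport", "JFK Airport",
    "Midtown Center", "Midtown Center",
    "Arden Heights",
    "Governor's Island/Ellis Island/Liberty Island",
]

min_count, tied = least_frequent(sample_pickups)
print(min_count, tied)
```

In Spark, the equivalent idea would be to compute the minimum `n_trips` and keep every row that matches it, rather than reading only the first row of the sorted output.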