## Question 1: Install Spark and PySpark
- Install Spark
- Run PySpark
- Create a local spark session
- Execute spark.version.
What's the output?
Running the provided `test_pyspark.py` file gives:
>- 4.1.1
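
The steps above can be sketched roughly as follows. This is not the course's `test_pyspark.py`; the function name `local_spark_version` and the app name `"test"` are placeholders chosen for illustration, and the sketch assumes PySpark is already installed.

```Python
def local_spark_version(app_name: str = "test") -> str:
    """Start (or reuse) a local Spark session and return its version string."""
    # Imported lazily so the module still loads where PySpark is not installed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")  # run Spark locally on all available cores
        .appName(app_name)   # hypothetical application name
        .getOrCreate()
    )
    return spark.version
```

Calling `print(local_spark_version())` should print the installed Spark version.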
## Question 2: Yellow November 2025
Read the November 2025 Yellow taxi trip data into a Spark DataFrame.
Repartition the DataFrame into 4 partitions and save it as Parquet.
```Python
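# Parquet files carry their own schema, so no reader options (such as "header") are needed.
df = spark.read.parquet("yellow_tripdata_2025-11.parquet")

# Shuffle into 4 partitions so the write produces 4 part files.
df = df.repartition(4)
df.write.parquet('yellowtrip/november')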
```
What is the average size, in MB, of the Parquet files (those ending with the `.parquet` extension) that were created? Select the answer that most closely matches.
>- 25MB
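
One way to check the average file size is with the standard library alone. This is a minimal sketch: the helper name `average_parquet_size_mb` is hypothetical, and it assumes 1 MB means 1024 * 1024 bytes, so the result shifts slightly if you count 1 MB as 1,000,000 bytes.

```Python
from pathlib import Path


def average_parquet_size_mb(output_dir: str) -> float:
    """Return the mean size in MB of the *.parquet files directly inside output_dir."""
    # Only names ending in .parquet match, so _SUCCESS and .crc files are ignored.
    sizes = [p.stat().st_size for p in Path(output_dir).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {output_dir!r}")
    # Assumption: 1 MB = 1024 * 1024 bytes.
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

For the output written above, this would be called as `average_parquet_size_mb("yellowtrip/november")`.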
## Question 3: Count records
How many taxi trips were there on the 15th of November?
Consider only trips that started on the 15th of November.
```Python
from pyspark.sql import functions as F

# Truncate the pickup timestamp to a date, then keep only trips that started on 15 November.
df.withColumn('pickup_date', F.to_date(df.tpep_pickup_datetime)) \
    .filter(F.col('pickup_date') == '2025-11-15') \
    .select(F.count('pickup_date')) \
    .show()
```
>- 162,604
## Question 4: Longest trip
What is the length of the longest trip in the dataset in hours?
```Python
from pyspark.sql import functions as F

# Trip duration = dropoff minus pickup, in seconds, converted to hours.
df.withColumn('duration_hours',
              (F.unix_timestamp('tpep_dropoff_datetime') - F.unix_timestamp('tpep_pickup_datetime')) / 3600) \
    .select('duration_hours', 'trip_distance') \
    .orderBy(F.col('duration_hours').desc()) \
    .limit(1) \
    .show()
```
>- 90.6
## Question 5: User Interface
On which local port does Spark's user interface, which shows the application's dashboard, run?
>- 4040
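
Port 4040 is the default; if it is already taken (for example by another running Spark application), Spark tries 4041, 4042, and so on. The sketch below reads the actual address from a running session through `SparkContext.uiWebUrl`. The helper name `spark_ui_port` is hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```Python
from urllib.parse import urlparse


def spark_ui_port(spark) -> int:
    """Return the port of the web UI for the given Spark session."""
    url = spark.sparkContext.uiWebUrl  # a URL such as http://<host>:<port>
    if url is None:
        raise RuntimeError("the Spark UI does not appear to be running")
    return urlparse(url).port
```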
## Question 6: Least frequent pickup location zone
Load the zone lookup data into a temp view in Spark:
```bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv
```
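
The join in this question uses a `df_zones` DataFrame, so the downloaded CSV has to be read first. A minimal sketch, assuming the file sits in the working directory; the helper name `load_zones` and the view name `zones` are placeholders, and `spark` is an existing `SparkSession`.

```Python
def load_zones(spark, csv_path: str = "taxi_zone_lookup.csv", view_name: str = "zones"):
    """Read the zone lookup CSV and register it as a temp view."""
    # header=True uses the first row as column names;
    # inferSchema=True lets Spark read LocationID as an integer.
    df_zones = spark.read.csv(csv_path, header=True, inferSchema=True)
    # The view name is a hypothetical choice; it makes the data queryable from spark.sql().
    df_zones.createOrReplaceTempView(view_name)
    return df_zones
```

Usage would be `df_zones = load_zones(spark)`.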
Using the zone lookup data and the Yellow November 2025 data, what is the name of the LEAST frequent pickup location Zone?
```Python
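from pyspark.sql import functions as F

# df_zones is the zone lookup CSV loaded as a DataFrame.
df_join = df.join(df_zones, df.PULocationID == df_zones.LocationID)

# Count trips per pickup zone and list the least frequent zones first.
df_join.groupBy('Zone') \
    .agg(F.count('Zone').alias('n_trips')) \
    .orderBy(F.col('n_trips').asc()) \
    .show()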
```
>- Governor's Island/Ellis Island/Liberty Island
>- Arden Heights

If multiple answers are correct, select any of them.