# Module 6: Batch Processing with Spark ## Overview This module covers batch processing using Apache Spark and PySpark — distributed computing for processing large datasets at scale. ## Setup ```bash sudo apt install -y openjdk-17-jre-headless pip install pyspark ``` **Versions:** Java 17, PySpark 4.1.1 ## Dataset - Yellow Taxi November 2025 (~4M rows, Parquet) - Taxi Zone Lookup CSV ## What I Did - Created a local Spark session with `local[*]` (all cores) - Read Parquet file into a Spark DataFrame - Repartitioned to 4 partitions and wrote back to Parquet - Filtered trips by date using `tpep_pickup_datetime` - Calculated trip duration in hours using `unix_timestamp` - Joined trips with zone lookup to find least frequent pickup zone - Used Spark UI (port 4040) to monitor job execution ## Running ```bash cd 06-batch-spark python3 homework.py ``` ## Homework Answers | Q | Answer | |---|--------| | Q1 - Spark version | `4.1.1` | | Q2 - Avg parquet file size | `25 MB` | | Q3 - Trips on Nov 15 | `162,604` | | Q4 - Longest trip | `90.6 hours` | | Q5 - Spark UI port | `4040` | | Q6 - Least frequent pickup zone | `Governor's Island/Ellis Island/Liberty Island` | ## Key Learnings - Spark processes data in partitions across workers — `repartition(4)` controls parallelism - `local[*]` uses all available CPU cores for local mode - Parquet is Spark's native format — reads/writes are highly optimized - `unix_timestamp` difference divided by 3600 gives duration in hours - Spark UI on port 4040 shows stages, tasks, DAG, and execution metrics - Joins in Spark are distributed — zone lookup was broadcast automatically