## Module 6: Batch Processing

Introduction to Spark and PySpark for data lake creation using NYC Taxi data

---

### Contents

```
06-batch
├── README.md       # this file
└── Week6_HW.ipynb  # Python notebook with Python, PySpark, and SQL needed to explore and answer the homework questions
```

---

### Prerequisites

For this module and homework, I set up a local instance of Spark on a Windows 11 machine

- [Spark setup on Windows](https://github.com/DataTalksClub/data-engineering-zoomcamp/blob/main/06-batch/setup/windows.md)
- 📢 I ran into several issues during this configuration due to previous installations of Java, Python, and Jupyter Notebook
  - I enlisted assistance from Claude to resolve these configuration issues; ultimately, these were the problems encountered:
    - JAVA_HOME was not set to the correct Java version
    - SPARK_HOME was set to a manually installed location rather than the pip-installed PySpark location in AppData
    - PYSPARK_PYTHON was not set to the correct Python version
    - Variables were not persisting between sessions
- Upon completion of these steps, I confirmed that Spark had been installed, completing the 1st bullet point of Homework Question 1.

---

### Instructions

#### To view the contents of the notebook 📓

- Clone or download [Week6_HW.ipynb](https://github.com/lemieuxjm/data-engineering-zoomcamp/blob/main/06-batch/Week6_HW.ipynb)
- Open the notebook

#### To run the notebook 📓

- Be sure you are running Jupyter Notebook or another IDE that is configured in alignment with your Spark/PySpark configuration
- No additional installations should be needed

#### Notes:

- Cell 1 in the notebook runs PySpark, completing the 2nd bullet point of Homework Question 1.
- Cell 1 in the notebook also creates a local Spark session
- Executing `spark.version` was performed outside of the notebook
- Answer to Homework Question 1 - What is the output of executing `spark.version`?
- 4.1.1
- Notebook/code not needed to answer Question 5:
  - Homework Question 5 asks what local port is used to run Spark's application dashboard
  - Port 4040

---

### Notebook Steps, Explanation, and Homework

Cell 1:
- Prepare the notebook for PySpark use by importing pyspark and SparkSession
- Create the spark object

Cell 2:
- Retrieve Yellow taxi data for November 2025 in parquet format from the official NYC data site using wget
- Output displays batches retrieved, percent complete, size of batches, and duration of batch retrieval
- Indicates the data was saved to 'yellow_tripdata_2025-11.parquet'

Cell 3:
- Create a dataframe from the Yellow 2025-11 data

Cell 4:
- Display the Yellow 2025-11 dataset to explore the data
- Output displays the header and first 20 records of the dataset

Cell 5:
- Repartition the dataset into 4 partitions

Cell 6:
- Write the repartitioned data to disk
- (outside of the notebook) Use Windows Explorer to view the saved files, noting the size of the files
- Answer to Homework Question 2 - What is the avg. size of the Parquet files created by this cell?
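The file-size check done in Windows Explorer can also be computed programmatically; a minimal sketch (the directory argument is whatever path you passed to `df.write.parquet(...)` — no path is assumed from the notebook):

```python
from pathlib import Path

def avg_parquet_size_mb(directory: str) -> float:
    """Average size in MB of the part files Spark wrote to `directory`."""
    # Spark names its output files part-<id>-<uuid>...parquet
    files = list(Path(directory).glob("part-*.parquet"))
    return sum(f.stat().st_size for f in files) / len(files) / 1024**2
```

Rounding the result to the nearest of the answer choices gives the Question 2 answer.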
- 25 MB

Cell 7:
- Display the schema of the Yellow 2025-11 dataset

Cell 8:
- Import SQL functions for use with SQL statements

Cell 9:
- Identify the 2 timestamp columns
- Create 2 new columns
- Populate the new columns with data from the timestamp columns reformatted into date-only format
- Create a new expanded Yellow taxi dataset with the new columns plus the existing data

Cell 10:
- Display the expanded Yellow taxi dataset to explore the results
- Output displays the header and first 20 records of the dataset

Cell 11:
- Create a temp table from the expanded Yellow taxi dataset

Cell 12:
- Explore the first 10 records of the temp table using SQL

  ```sql
  SELECT *
  FROM yellow_nov_data
  LIMIT 10;
  ```

- Output displays the header and first 10 records of the dataset

Cell 13:
- Run the SQL query to answer Homework Question 3

  ```sql
  SELECT COUNT(*)
  FROM yellow_nov_data
  WHERE pickup_date = '2025-11-15';
  ```

- Display the results of the SQL query
- Answer to Homework Question 3 - How many taxi trips were started on November 15th?
  - 162604

Cell 14:
- Run the SQL query to answer Homework Question 4

  ```sql
  SELECT MAX(timestampdiff(SECOND, tpep_pickup_datetime, tpep_dropoff_datetime)) / 3600 AS longest_trip_hour
  FROM yellow_nov_data;
  ```

- Display the results of the SQL query
- Answer to Homework Question 4 - What is the length of the longest trip (in hours) in the dataset?
  - 90.65

Cell 15:
- Retrieve the zone lookup data from the official NYC data site using wget
- Output displays the batch retrieved, percent complete, size of the batch, and duration of the batch retrieval
- Indicates the data was saved to 'taxi_zone_lookup.csv'

Cell 16:
- Create a dataframe from the taxi zone lookup data

Cell 17:
- Display the zone lookup dataset to explore the data
- Output displays the header and first 20 records of the dataset

Cell 18:
- Create a temp view from the zone lookup dataset

Cell 19:
- Run the SQL query to answer Homework Question 6

  ```sql
  SELECT y.PULocationID, z.Zone, COUNT(*) AS frequency
  FROM yellow_nov_data AS y
  JOIN zones AS z
    ON z.LocationID = y.PULocationID
  GROUP BY y.PULocationID, z.Zone
  ORDER BY frequency ASC;
  ```

- Display the results of the SQL query
- Answer to Homework Question 6 - What is the name of the pickup location zone with the fewest records?
  - (there were three possible answers; 1 chosen) Arden Heights