{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## This notebook is part of Hadoop and Spark training delivered by CERN IT\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### SPARK DataFrame Hands-On Lab\n", "Contact: Luca.Canali@cern.ch\n", "\n", "### Objective: Perform Basic DataFrame Operations\n", "1. Creating DataFrames\n", "2. Select columns\n", "3. Add, rename and drop columns\n", "4. Filtering rows\n", "5. Aggregations" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercises and solutions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reminder: documentation at \n", "https://spark.apache.org/docs/latest/api/python/index.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create the SparkSession\n", "# and read the dataset\n", "\n", "from pyspark.sql import SparkSession\n", "\n", "spark = SparkSession.builder \\\n", " .master(\"local[*]\") \\\n", " .appName(\"DataFrame HandsOn 1\") \\\n", " .config(\"spark.ui.showConsoleProgress\",\"false\") \\\n", " .getOrCreate()\n", "\n", "online_retail_schema=\"InvoiceNo int, StockCode string, Description string, Quantity int,\\\n", "InvoiceDate timestamp,UnitPrice float,CustomerId int, Country string\"\n", "\n", "df = spark.read \\\n", " .option(\"header\", \"true\") \\\n", " .option(\"timestampFormat\", \"M/d/yyyy H:m\")\\\n", " .csv(\"../data/online-retail-dataset.csv.gz\",\n", " schema=online_retail_schema)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "

SparkSession - in-memory

\n", " \n", "
\n", "

SparkContext

\n", "\n", "

Spark UI

\n", "\n", "
\n", "
Version
\n", "
v3.3.1
\n", "
Master
\n", "
local[*]
\n", "
AppName
\n", "
DataFrame HandsOn 1
\n", "
\n", "
\n", " \n", "
\n", " " ], "text/plain": [ "" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "spark" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
InvoiceNoStockCodeDescriptionQuantityInvoiceDateUnitPriceCustomerIdCountry
053636585123AWHITE HANGING HEART T-LIGHT HOLDER62010-12-01 08:26:002.5517850United Kingdom
153636571053WHITE METAL LANTERN62010-12-01 08:26:003.3917850United Kingdom
253636584406BCREAM CUPID HEARTS COAT HANGER82010-12-01 08:26:002.7517850United Kingdom
353636584029GKNITTED UNION FLAG HOT WATER BOTTLE62010-12-01 08:26:003.3917850United Kingdom
453636584029ERED WOOLLY HOTTIE WHITE HEART.62010-12-01 08:26:003.3917850United Kingdom
\n", "
" ], "text/plain": [ " InvoiceNo StockCode Description Quantity \\\n", "0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \n", "1 536365 71053 WHITE METAL LANTERN 6 \n", "2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 \n", "3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 \n", "4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 \n", "\n", " InvoiceDate UnitPrice CustomerId Country \n", "0 2010-12-01 08:26:00 2.55 17850 United Kingdom \n", "1 2010-12-01 08:26:00 3.39 17850 United Kingdom \n", "2 2010-12-01 08:26:00 2.75 17850 United Kingdom \n", "3 2010-12-01 08:26:00 3.39 17850 United Kingdom \n", "4 2010-12-01 08:26:00 3.39 17850 United Kingdom " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.limit(5).toPandas()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Task: Show 5 lines of the \"description\" column " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-----------------------------------+\n", "|description |\n", "+-----------------------------------+\n", "|WHITE HANGING HEART T-LIGHT HOLDER |\n", "|WHITE METAL LANTERN |\n", "|CREAM CUPID HEARTS COAT HANGER |\n", "|KNITTED UNION FLAG HOT WATER BOTTLE|\n", "|RED WOOLLY HOTTIE WHITE HEART. 
|\n", "+-----------------------------------+\n", "only showing top 5 rows\n", "\n" ] } ], "source": [ "df.select(\"description\").show(5,truncate=False)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Task: Count the number of distinct invoices in the dataframe" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+-------------------------+\n", "|count(DISTINCT InvoiceNo)|\n", "+-------------------------+\n", "| 22061|\n", "+-------------------------+\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import countDistinct\n", "\n", "df.select(countDistinct(\"InvoiceNo\")).show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Task: Find out in which month most invoices have been processed" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+-----+\n", "|month(InvoiceDate)|count|\n", "+------------------+-----+\n", "| 2|27707|\n", "| 4|29916|\n", "| 1|35147|\n", "| 8|35284|\n", "| 3|36748|\n", "| 6|36874|\n", "| 5|37030|\n", "| 7|39518|\n", "| 9|50226|\n", "| 10|60742|\n", "| 12|68006|\n", "| 11|84711|\n", "+------------------+-----+\n", "\n" ] } ], "source": [ "# This shows how many line items have been processed per month\n", "\n", "from pyspark.sql.functions import month\n", "df.groupby(month(\"InvoiceDate\")).count().sort(\"count\").show()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+------------------+----------------+\n", "|month(InvoiceDate)|DistinctInvoices|\n", "+------------------+----------------+\n", "| 11| 3021|\n", "| 12| 2568|\n", "| 10| 2275|\n", "| 9| 1994|\n", "| 5| 1848|\n", "| 6| 1683|\n", "| 3| 1665|\n", "| 7| 1657|\n", "| 4| 1504|\n", "| 8| 1456|\n", "| 
1| 1216|\n", "| 2| 1174|\n", "+------------------+----------------+\n", "\n" ] } ], "source": [ "# This shows how many distinct invoices have been processed per month\n", "\n", "from pyspark.sql.functions import col, month, countDistinct\n", "\n", "(df\n", " .groupBy(month('InvoiceDate'))\n", " .agg(countDistinct('InvoiceNo').alias('DistinctInvoices'))\n", " .orderBy(col('DistinctInvoices').desc())\n", " .show()\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Task: Filter the rows where Quantity is greater than 30" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+\n", "|InvoiceNo|StockCode| Description|Quantity| InvoiceDate|UnitPrice|CustomerId| Country|\n", "+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+\n", "| 536367| 84879|ASSORTED COLOUR B...| 32|2010-12-01 08:34:00| 1.69| 13047|United Kingdom|\n", "| 536370| 10002|INFLATABLE POLITI...| 48|2010-12-01 08:45:00| 0.85| 12583| France|\n", "| 536370| 22492|MINI PAINT SET VI...| 36|2010-12-01 08:45:00| 0.65| 12583| France|\n", "| 536371| 22086|PAPER CHAIN KIT 5...| 80|2010-12-01 09:00:00| 2.55| 13748|United Kingdom|\n", "| 536374| 21258|VICTORIAN SEWING ...| 32|2010-12-01 09:09:00| 10.95| 15100|United Kingdom|\n", "| 536376| 22114|HOT WATER BOTTLE ...| 48|2010-12-01 09:32:00| 3.45| 15291|United Kingdom|\n", "| 536376| 21733|RED HANGING HEART...| 64|2010-12-01 09:32:00| 2.55| 15291|United Kingdom|\n", "| 536378| 21212|PACK OF 72 RETROS...| 120|2010-12-01 09:37:00| 0.42| 14688|United Kingdom|\n", "| 536378| 85183B|CHARLIE & LOLA WA...| 48|2010-12-01 09:37:00| 1.25| 14688|United Kingdom|\n", "| 536378| 85071B|RED CHARLIE+LOLA ...| 96|2010-12-01 09:37:00| 0.38| 14688|United Kingdom|\n", "| 536381| 22719|GUMBALL MONOCHROM...| 36|2010-12-01 09:41:00| 1.06| 
15311|United Kingdom|\n", "| 536382| 22381|TOY TIDY PINK POL...| 50|2010-12-01 09:45:00| 1.85| 16098|United Kingdom|\n", "| 536384| 84755|COLOUR GLASS T-LI...| 48|2010-12-01 09:53:00| 0.65| 18074|United Kingdom|\n", "| 536384| 22469|HEART OF WICKER S...| 40|2010-12-01 09:53:00| 1.45| 18074|United Kingdom|\n", "| 536384| 22470|HEART OF WICKER L...| 40|2010-12-01 09:53:00| 2.55| 18074|United Kingdom|\n", "| 536386| 84880|WHITE WIRE EGG HO...| 36|2010-12-01 09:57:00| 4.95| 16029|United Kingdom|\n", "| 536386| 85099C|JUMBO BAG BAROQU...| 100|2010-12-01 09:57:00| 1.65| 16029|United Kingdom|\n", "| 536386| 85099B|JUMBO BAG RED RET...| 100|2010-12-01 09:57:00| 1.65| 16029|United Kingdom|\n", "| 536387| 79321| CHILLI LIGHTS| 192|2010-12-01 09:58:00| 3.82| 16029|United Kingdom|\n", "| 536387| 22780|LIGHT GARLAND BUT...| 192|2010-12-01 09:58:00| 3.37| 16029|United Kingdom|\n", "+---------+---------+--------------------+--------+-------------------+---------+----------+--------------+\n", "only showing top 20 rows\n", "\n" ] } ], "source": [ "df.where(\"Quantity > 30\").show()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Task: Show the four most sold items (by quantity)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "+--------------------+-------------+\n", "| Description|totalQuantity|\n", "+--------------------+-------------+\n", "|WORLD WAR 2 GLIDE...| 53847|\n", "|JUMBO BAG RED RET...| 47363|\n", "|ASSORTED COLOUR B...| 36381|\n", "| POPCORN HOLDER| 36334|\n", "+--------------------+-------------+\n", "only showing top 4 rows\n", "\n" ] } ], "source": [ "from pyspark.sql.functions import desc, asc, expr\n", "(df.groupBy(\"Description\")\n", " .agg(expr(\"sum(Quantity) as totalQuantity\"))\n", " .sort(\"totalQuantity\", ascending=False)\n", " .show(4))\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "subslide" } }, "source": [ "Bonus question: why do these two operations return different results? Hint: look at the documentation " ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "22062\n", "+-------------------------+\n", "|count(DISTINCT InvoiceNo)|\n", "+-------------------------+\n", "| 22061|\n", "+-------------------------+\n", "\n" ] } ], "source": [ "print(df.select(\"InvoiceNo\").distinct().count())\n", "from pyspark.sql.functions import countDistinct\n", "df.select(countDistinct(\"InvoiceNo\")).show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see from the output of `countDistinct`, internally it runs `count(DISTINCT`, which excludes `null`s.\n", "\n", "https://spark.apache.org/docs/latest/api/sql/#count\n", "\n", "* `count()` Returns the total number of retrieved rows, including rows containing null\n", "\n", "* `count(DISTINCT expr[, expr...])` - Returns the number of rows for which the supplied expression(s) are unique and non-null." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "sparkconnect": { "bundled_options": [], "list_of_options": [] } }, "nbformat": 4, "nbformat_minor": 1 }