{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data analysis with Python, Apache Spark, and PixieDust\n",
"***\n",
"\n",
"In this notebook you will:\n",
"\n",
"* analyze customer demographics, such as, age, gender, income, and location\n",
"* combine that data with sales data to examine trends for product categories, transaction types, and product popularity\n",
"* load data from GitHub as well as from a public open data set\n",
"* cleanse, shape, and enrich the data, and then visualize the data with the PixieDust library\n",
"\n",
"Don't worry! PixieDust graphs don't require coding. \n",
"\n",
"By the end of the notebook, you will understand how to combine data to gain insights about which customers you might target to increase sales.\n",
"\n",
"This notebook runs on Python 2 with Spark 2.1, and PixieDust 1.1.10."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Table of contents\n",
"\n",
"#### [Setup](#Setup)\n",
"[Load data into the notebook](#Load-data-into-the-notebook)\n",
"#### [Explore customer demographics](#part1)\n",
"[Prepare the customer data set](#Prepare-the-customer-data-set)
\n",
"[Visualize customer demographics and locations](#Visualize-customer-demographics-and-locations)
\n",
"[Enrich demographic information with open data](#Enrich-demographic-information-with-open-data)
\n",
"\n",
"#### [Summary and next steps](#summary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Setup\n",
"You need to import libraries and load the customer data into this notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the necessary libraries:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pixiedust\n",
"import pyspark.sql.functions as func\n",
"import pyspark.sql.types as types\n",
"import re\n",
"import json\n",
"import os\n",
"import requests "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**If you get any errors or if a package is out of date:**\n",
"\n",
"* uncomment the lines in the next cell (remove the `#`)\n",
"* restart the kernel (from the Kernel menu at the top of the notebook)\n",
"* reload the browser page\n",
"* run the cell above, and continue with the notebook"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install jinja2 --user --upgrade\n",
"#!pip install pixiedust --user --upgrade\n",
"#!pip install -U --no-deps bokeh"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load data into the notebook\n",
"\n",
"The data file contains both the customer demographic data that you'll analyzed in Part 1, and the sales transaction data for Part 2.\n",
"\n",
"With `pixiedust.sampleData()` you can load csv data from any url. The below loads the data in a Spark DataFrame. \n",
"\n",
"> In case you wondered, this works with Pandas as well, just add `forcePandas = True` to load data in a Pandas DataFrame. *But do not add this to the below cell as in this notebook you will use Spark.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw_df = pixiedust.sampleData('https://raw.githubusercontent.com/IBM/analyze-customer-data-spark-pixiedust/master/data/customers_orders1_opt.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"raw_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"\n",
"# Explore customer demographics \n",
"In this part of the notebook, you will prepare the customer data and then start learning about your customers by creating multiple charts and maps. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Prepare the customer data set\n",
"Create a new Spark DataFrame with only the data you need and then cleanse and enrich the data.\n",
"\n",
"Extract the columns that you are interested in, remove duplicate customers, and add a column for aggregations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Extract the customer information from the data set\n",
"customer_df = raw_df.select(\"CUST_ID\", \n",
" \"CUSTNAME\", \n",
" \"ADDRESS1\", \n",
" \"ADDRESS2\", \n",
" \"CITY\", \n",
" \"POSTAL_CODE\", \n",
" \"POSTAL_CODE_PLUS4\", \n",
" \"STATE\", \n",
" \"COUNTRY_CODE\", \n",
" \"EMAIL_ADDRESS\", \n",
" \"PHONE_NUMBER\",\n",
" \"AGE\",\n",
" \"GenderCode\",\n",
" \"GENERATION\",\n",
" \"NATIONALITY\", \n",
" \"NATIONAL_ID\", \n",
" \"DRIVER_LICENSE\").dropDuplicates()\n",
"\n",
"customer_df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Notice that the data type of the AGE column is currently a string. Convert the AGE column to a numeric data type so you can run calculations on customer age."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ---------------------------------------\n",
"# Cleanse age (enforce numeric data type) \n",
"# ---------------------------------------\n",
"\n",
"def getNumericVal(col):\n",
" \"\"\"\n",
" input: pyspark.sql.types.Column\n",
" output: the numeric value represented by col or None\n",
" \"\"\"\n",
" try:\n",
" return int(col)\n",
" except ValueError:\n",
" # age-33\n",
" match = re.match('^age\\-(\\d+)$', col)\n",
" if match:\n",
" try:\n",
" return int(match.group(1))\n",
" except ValueError: \n",
" return None\n",
" return None \n",
"\n",
"toNumericValUDF = func.udf(lambda c: getNumericVal(c), types.IntegerType())\n",
"customer_df = customer_df.withColumn(\"AGE\", toNumericValUDF(customer_df[\"AGE\"]))\n",
"customer_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"customer_df.show(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The GenderCode column contains salutations instead of gender values. Derive the gender information for each customer based on the salutation and rename the GenderCode column to GENDER."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ------------------------------\n",
"# Derive gender from salutation\n",
"# ------------------------------\n",
"def deriveGender(col):\n",
" \"\"\" input: pyspark.sql.types.Column\n",
" output: \"male\", \"female\" or \"unknown\"\n",
" \"\"\" \n",
" if col in ['Mr.', 'Master.']:\n",
" return 'male'\n",
" elif col in ['Mrs.', 'Miss.']:\n",
" return 'female'\n",
" else:\n",
" return 'unknown';\n",
" \n",
"deriveGenderUDF = func.udf(lambda c: deriveGender(c), types.StringType())\n",
"customer_df = customer_df.withColumn(\"GENDER\", deriveGenderUDF(customer_df[\"GenderCode\"]))\n",
"customer_df.cache()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Explore the customer data set\n",
"\n",
"Instead of exploring the data with `.printSchema()` and `.show()` you can quickly explore data sets using PixieDust'. Invoke the `display()` command and click the table icon to review the schema and preview the data. Customize the options to display only a subset of the fields or rows or apply a filter (by clicking the funnel icon)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"filter": "{\"regex\": \"False\", \"field\": \"AGE\", \"case_matter\": \"False\", \"value\": \"30\", \"constraint\": \"less_than\"}",
"handlerId": "tableView",
"no_margin": "true",
"rowCount": "200"
}
}
},
"outputs": [],
"source": [
"display(customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"## Visualize customer demographics and locations\n",
"\n",
"Now you are ready to explore the customer base. Using simple charts, you can quickly see these characteristics:\n",
" * Customer demographics (gender and age)\n",
" * Customer locations (city, state, and country)\n",
"\n",
"You will create charts with the PixieDust library:\n",
"\n",
" - [View customers by gender in a pie chart](#View-customers-by-gender-in-a-pie-chart)\n",
" - [View customers by generation in a bar chart](#View-customers-by-generation-in-a-bar-chart)\n",
" - [View customers by age in a histogram chart](#View-customers-by-age-in-a-histogram-chart)\n",
" - [View specific information with a filter function](#View-specific-information-with-a-filter-function)\n",
" - [View customer density by location with a map](#View-customer-density-by-location-with-a-map)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### View customers by gender in a pie chart\n",
"\n",
"Run the `display()` command and then configure the graph to show the percentages of male and female customers:\n",
"\n",
"1. Run the next cell. The PixieDust interactive widget appears. \n",
"1. Click the chart button and choose **Pie Chart**. The chart options tool appears.\n",
"1. In the chart options, drag `GENDER` into the **Keys** box. \n",
"1. In the **Aggregation** field, choose **COUNT**. \n",
"1. Increase the **# of Rows to Display** to a very large number to display all data.\n",
"1. Click **OK**. The pie chart appears.\n",
"\n",
"If you want to make further changes, click **Options** to return to the chart options tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "COUNT",
"chartsize": "78",
"handlerId": "pieChart",
"keyFields": "GENDER",
"rowCount": "100000"
}
}
},
"outputs": [],
"source": [
"display(customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"### View customers by generation in a bar chart\n",
"Look at how many customers you have per \"generation.\"\n",
"\n",
"Run the next cell and configure the graph: \n",
"1. Choose **Bar Chart** as the chart type and configure the chart options as instructed below.\n",
"2. Put `GENERATION` into the **Keys** box.\n",
"3. Set **aggregation** to `COUNT`.\n",
"1. Increase the **# of Rows to Display** to a very large number to display all data.\n",
"4. Click **OK**\n",
"4. Change the **Renderer** at the top right of the chart to explore different visualisations. \n",
"4. You can use clustering to group customers, for example by geographic location. To group generations by country, select `COUNTRY_CODE` from the **Cluster by** list from the menu on the left of the chart. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "COUNT",
"chartsize": "73",
"handlerId": "barChart",
"keyFields": "GENERATION",
"rendererId": "brunel",
"rowCount": "10000000"
}
}
},
"outputs": [],
"source": [
"display(customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"### View customers by age in a histogram chart\n",
"A generation is a broad age range. You can look at a smaller age range with a histogram chart. A histogram is like a bar chart except each bar represents a range of numbers, called a bin. You can customize the size of the age range by adjusting the bin size. The more bins you specify, the smaller the age range.\n",
"\n",
"Run the next cell and configure the graph:\n",
"1. Choose **Histogram** as the chart type. \n",
"2. Put `AGE` into the **Values** box.\n",
"1. Increase the **# of Rows to Display** to a very large number to display all data.\n",
"1. Click **OK**.\n",
"3. Use the **Bin count** slider to specify the number of the bins. Try starting with 40."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"handlerId": "histogram",
"rowCount": "1000000",
"valueFields": "AGE"
}
}
},
"outputs": [],
"source": [
"display(customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"### View specific information with a filter function\n",
"\n",
"You can filter records to restrict analysis by using the [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame) `filter()` function.\n",
"\n",
"If you want to view the age distribution for a specific generation, uncomment the desired filter condition and run the next cell:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"GENERATION": "Baby_Boomers",
"chartsize": "60",
"handlerId": "histogram",
"rowCount": "500",
"valueFields": "AGE"
}
},
"scrolled": false
},
"outputs": [],
"source": [
"# Data subsetting: display age distribution for a specific generation\n",
"# (Chart type: histogram, Chart Options > Values: AGE)\n",
"# to change the filter condition remove the # sign \n",
"condition = \"GENERATION = 'Baby_Boomers'\"\n",
"#condition = \"GENERATION = 'Gen_X'\"\n",
"#condition = \"GENERATION = 'Gen_Y'\"\n",
"#condition = \"GENERATION = 'Gen_Z'\"\n",
"boomers_df = customer_df.filter(condition)\n",
"display(boomers_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PixieDust supports basic filtering to make it easy to analyse data subsets. For example, to view the age distribution for a specific gender configure the chart as follows:\n",
"\n",
" 1. Choose `Histogram` as the chart type.\n",
" 2. Put `AGE` into the **Values** box and click OK.\n",
" 3. Click the filter button (looking like a funnel), and choose **GENDER** as field and `female` as value.\n",
" \n",
"The filter is only applied to the working data set and does not modify the input `customer_df`.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"filter": "{\"regex\": \"false\", \"field\": \"GENDER\", \"case_matter\": \"false\", \"value\": \"female\", \"constraint\": \"None\"}",
"handlerId": "histogram",
"no_margin": "true",
"rowCount": "100000",
"valueFields": "AGE"
}
}
},
"outputs": [],
"source": [
"display(customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also filter by location. For example, the following command creates a new DataFrame that filters for customers from the USA:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"condition = \"COUNTRY_CODE = 'US'\"\n",
"us_customer_df = customer_df.filter(condition)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can pivot your analysis perspective based on aspects that are of interest to you by choosing different keys and clusters.\n",
"\n",
"Create a bar chart and cluster the data.\n",
"\n",
"Run the next cell and configure the graph:\n",
"1. Choose **Bar chart** as the chart type.\n",
"2. Put `COUNTRY_CODE` into the **Keys** box.\n",
"4. Set Aggregation to **COUNT**.\n",
"5. Click **OK**. The chart displays the number of US customers.\n",
"6. From the **Cluster By** list, choose **GENDER**. The chart shows the number of customers by gender."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "COUNT",
"clusterby": "GENDER",
"handlerId": "barChart",
"keyFields": "COUNTRY_CODE",
"rowCount": "100000"
}
}
},
"outputs": [],
"source": [
"display(us_customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try to cluster the customers by state.\n",
"\n",
"A bar chart isn't the best way to show geographic location!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"### View customer density by location with a map\n",
"Maps are a much better way to view location data than other chart types. \n",
"\n",
"Visualize customer density by US state with a map.\n",
"\n",
"Run the next cell and configure the graph:\n",
"1. Choose **Map** as the chart type.\n",
"2. Put `STATE` into the **Keys** box.\n",
"4. Set Aggregation to **COUNT**.\n",
"5. Click **OK**. The map displays the number of US customers.\n",
"6. From the **Renderer** list, choose **brunel**.\n",
"\n",
" > PixieDust supports three map renderers: brunel, [mapbox](https://www.mapbox.com/) and Google. Note that the Mapbox renderer and the Google renderer require an API key or access token and supported features vary by renderer.\n",
"\n",
"7. You can explore more about customers in each state by changing the aggregation method, for example look at customer age ranges (avg, minimum, and maximum) by state. Simply Change the aggregation function to `AVG`, `MIN`, or `MAX` and choose `AGE` as value. \n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"aggregation": "COUNT",
"chartsize": "90",
"googlemapapikey": "1234",
"handlerId": "mapView",
"keyFields": "STATE",
"mapDisplayMode": "markers",
"mapboxtoken": "5678",
"rendererId": "google",
"rowCount": "100000"
}
},
"scrolled": true
},
"outputs": [],
"source": [
"display(us_customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Back to Table of Contents](#toc)\n",
"## Enrich demographic information with open data\n",
"You can easily combine other sources of data with your existing data. There is a lot of publicly available open data sets that can be very helpful. For example, knowing the approximate income level of your customers might help you target your marketing campaigns.\n",
"\n",
"Run the next cell to load [this data set](https://apsportal.ibm.com/exchange/public/entry/view/beb8c30a3f559e58716d983671b70337) from the United States Census Bureau into your notebook. The data set contains US household income statistics compiled at the zip code geography level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load median income information for all US ZIP codes from a public source\n",
"income_df = pixiedust.sampleData('https://raw.githubusercontent.com/IBM/analyze-customer-data-spark-pixiedust/master/data/x19_income_select.csv')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"income_df.printSchema()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now cleanse the income data set to remove the data that you don't need. Create a new DataFrame for this data:\n",
" - The zip code, extracted from the GEOID column.\n",
" - The column B19049e1, which contains the median household income for 2013."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ------------------------------\n",
"# Helper: Extract ZIP code\n",
"# ------------------------------\n",
"def extractZIPCode(col):\n",
" \"\"\" input: pyspark.sql.types.Column containing a geo code, like '86000US01001'\n",
" output: ZIP code\n",
" \"\"\"\n",
" m = re.match('^\\d+US(\\d\\d\\d\\d\\d)$',col)\n",
" if m:\n",
" return m.group(1)\n",
" else:\n",
" return None \n",
" \n",
"getZIPCodeUDF = func.udf(lambda c: extractZIPCode(c), types.StringType())\n",
"income_df = income_df.select('GEOID', 'B19049e1').withColumnRenamed('B19049e1', 'MEDIAN_INCOME_IN_ZIP').withColumn(\"ZIP\", getZIPCodeUDF(income_df['GEOID']))\n",
"income_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Perform a left outer join on the customer data set with the income data set, using the zip code as the join condition. For the complete syntax of joins, go to the pyspark DataFrame documentation and scroll down to the `join` syntax. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"us_customer_df = us_customer_df.join(income_df, us_customer_df.POSTAL_CODE == income_df.ZIP, 'left_outer').drop('GEOID').drop('ZIP')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pixiedust": {
"displayParams": {
"handlerId": "histogram",
"rowCount": "100000",
"valueFields": "MEDIAN_INCOME_IN_ZIP"
}
}
},
"outputs": [],
"source": [
"display(us_customer_df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can visualize the income distribution of your customers by zip code.\n",
" Visualize income distribution for our customers.\n",
"Run the next cell and configure the graph:\n",
"1. Choose **Histogram** as the chart type.\n",
"2. Put `MEDIAN_INCOME_IN_ZIP` into the **Values** box and click **OK**."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The majority of your customers live in zip codes where the median income is around 40,000 USD. "
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"[Back to Table of Contents](#toc)\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"Copyright © 2017, 2018 IBM. This notebook and its source code are released under the terms of the MIT License."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3.6 with Spark",
"language": "python3",
"name": "python36"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.8"
}
},
"nbformat": 4,
"nbformat_minor": 1
}