{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "lvnKTXzOO5p_"
},
"source": [
"# Apache Spark - Analyzing Food Insecurity in NYC using KeyFoods Price Catalogs"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "4D1Sn_NiL9Ls"
},
"source": [
"The goal of this notebook is to study the food insecurity problem by looking at the listed prices of various food items across neighborhoods in NYC. Our hypothesis is that *people living in areas with higher food insecurity problems would pay more for the same items compared to those in more secured areas*. For the scope of work, we will only assess food products from KeyFoods supermarkets, one of the top 4 Supermarket Leaders in Metro New York (according [Food Trade News 2021 report](https://www.foodtradenews.com/2021/06/29/food-trade-news-2021-market-study-issue/)). In particular, we will use the following datasets:"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "GQ_nVumH-5Yw"
},
"source": [
"### **`keyfood_products.csv`**\n",
"\n",
"This CSV file contains the price information about 2 million food items listed on KeyFoods stores in NYC.\n",
"\n",
"|store|department|upc|product|size|price|\n",
"|--|--|--|--|--|--|\n",
"102|bakery|102-28556000000|Store Prepared - Challah Egg|1 EA|\\$4.99 each|\n",
"102|bakery|102-28781600000|Store Prepared - fw Cheesecake Plain 7 Inch|1 EA|\\$27.99 each|\n",
"|...|...|...|...|...|...|\n",
"\n",
"The details of the columns are as follows:\n",
"\n",
"|Column|Description|\n",
"|--|--|\n",
"|**store** | The unique id of each store |\n",
"|**department**| The department (or aisle) that the food item belongs to. Possible values are: `'bakery'`,`'beverages'`,`'breakfast'`,`'deli'`,`'frozen'`,`'international'`, `'meatandseafood'`,`'pantry'`,`'produce'`,`'refrigerated'`, and `'snacks'`|\n",
"|**upc**|The unique id for each item in a food store. It is often in the format of `SID-XXXXXXXXXX`, where `SID` is a store id if it's specific to a store, `UPC` if it's a general product, or `'KEY'` if it's a KeyFoodsproduct. If an item doesn't have any UPC code, this field will be `N/A`.|\n",
"|**product**|This is the listed name of the product|\n",
"|**size**|The unit that the product is being sold in|\n",
"|**price**|The price text of the product as shown on their websites. This is not a number but have been verified to start with the price mark`$XX.XX`. Note that for items without price information, this field could be listed as `Price not Available`|\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "QzPdViLm-88u"
},
"source": [
"### **`keyfood_nyc_stores.json`**\n",
"\n",
"This JSON file contains information for all KeyFoods stores in NYC. There are a lot of details about each store, however, we are only interested in the following fields:\n",
"\n",
"|Field|Description|\n",
"|--|--|\n",
"|**name** | This is the unique id of each store, which could be crosswalk with the **store** field above |\n",
"|**communityDistrict**|The community district code that the store belongs to. It's simply a larger geographical unit comparing to a zip code. More information can be found [here](https://communityprofiles.planning.nyc.gov/).|\n",
"|**foodInsecurity**|A food insecurity score computed for the community district that the stores belong to. This value has the range of 0 to 1 with 0 being without any food insecurity rish, and 1 has the most food insecure risk.|\n",
"\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "yZCnuyPH-_7-"
},
"source": [
"### **`keyfood_sample_items.csv`**\n",
"\n",
"This data contains the list of 22 food items that we would like to study initially to assess our hypothesis. For each item, we have the UPC code (which needs to be generalized across store) and the item name. Here is the list:\n",
"\n",
"|UPC code|Item Name|\n",
"|--|--|\n",
"|SID-20308200000|Broccoli Crowns|\n",
"|KEY-000000004094|Fresh Produce - Carrot Bunch|\n",
"|KEY-000000004062|Fresh Produce - Cucumbers|\n",
"|SID-00000004072|Fresh Produce - Potatoes Russet|\n",
"|SID-00000004131|Fresh Produce - Apples Fuji Large|\n",
"|KEY-00000004013|Produce - Orange Navel 113|\n",
"|UPC-048500001004|Tropicana - Juice Orange Pure Prem Orig|\n",
"|UPC-017400108700|Carolina - Whole Grain Brown Rice|\n",
"|UPC-016000487697|General Mills - Cherrios Multi Grain Cereal|\n",
"|UPC-073296027686|Urban Meadow - 100 Whole Wheat Bread|\n",
"|UPC-048000013651|Chicken of the Sea - Solid Wht Albacore Tuna in Oil|\n",
"|SID-20115200000|Beef - Beef Semi Bnls Chuck Stk|\n",
"|SID-28080600000|Perdue - Split Chicken Breast Fam Pack|\n",
"|UPC-073296057461|Urban Meadow - Plain Low Fat Yogurt|\n",
"|UPC-041757021443|Laughing Cow - White Cheddar Wedges|\n",
"|UPC-073296069280|Urban Meadow - Large White Eggs|\n",
"|UPC-088365000347|Cream O Land - Gallon 2% Milk|\n",
"|UPC-072940744016|Redpack - Tomato Crushed|\n",
"|UPC-051500255162|Jif - Creamy Peanut Butter|\n",
"|UPC-073296025903|Urban Meadow - Canola Oil|\n",
"|UPC-041331124461|Goya - Beans Cannelini Can|\n",
"|UPC-014500001894|Birds Eye - Spinach Leaf|\n",
"\n",
"where `SID` should be replaced with the store id.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"id": "Yx1oVB3NgKXr"
},
"source": [
"## Environment Setup\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "6ObjDDE1P4qW",
"outputId": "8f6ebb5d-f582-45ae-8597-6b2eec7ca1dd"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
"Collecting pyspark\n",
" Downloading pyspark-3.4.0.tar.gz (310.8 MB)\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m310.8/310.8 MB\u001b[0m \u001b[31m4.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h Preparing metadata (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
"Requirement already satisfied: py4j==0.10.9.7 in /usr/local/lib/python3.10/dist-packages (from pyspark) (0.10.9.7)\n",
"Building wheels for collected packages: pyspark\n",
" Building wheel for pyspark (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
" Created wheel for pyspark: filename=pyspark-3.4.0-py2.py3-none-any.whl size=311317145 sha256=579502e3a029e6ec1d489f0024d9bcc80cf690db6eaab46fc67629f7b8c97d39\n",
" Stored in directory: /root/.cache/pip/wheels/7b/1b/4b/3363a1d04368e7ff0d408e57ff57966fcdf00583774e761327\n",
"Successfully built pyspark\n",
"Installing collected packages: pyspark\n",
"Successfully installed pyspark-3.4.0\n"
]
},
{
"data": {
"text/plain": []
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%%shell\n",
"gdown --quiet 1O1U_t-cpmValVK2mjdTzcFxIbGw05vOw\n",
"gdown --quiet 1YUBKrtNV3QUz1RutMnMbJdQj7rv-Lkd5\n",
"gdown --quiet 1f79oETtvN3NQLYPnVGhurE1UBDP4IQP-\n",
"pip install pyspark"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 270
},
"id": "xj4RxOyXH3oC",
"outputId": "b7d179bd-1b8b-4443-892b-265f8c1e6e8e"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
":9: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`\n",
" IPython.display.set_matplotlib_formats('svg')\n"
]
},
{
"data": {
"text/html": [
"\n",
"