{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Big Mart Sales Prediction\n", "By Patrick Kong\n", "\n", "[Website](pkong97.github.io) | [Github](https://github.com/pkong97) | [LinkedIn](https://www.linkedin.com/in/pkong97/)\n", "\n", "\n", "## Contents:\n", "\n", "### Section 1. Introduction\n", "1. Background\n", "2. Questions\n", "\n", "### Section 2. Data\n", "1. Data source\n", "2. Data cleaning\n", "3. Replace missing and impossible values\n", "\n", "### Section 3. Exploratory Data Analysis\n", "1. Correlation matrix and heatmap\n", "2. Important variables\n", "\n", "### Section 4. Model Fitting\n", "1. Change categorical variables into numerical variables\n", "2. Mean-based model\n", "3. Linear regression model\n", "4. Ridge regression model\n", "5. Decision tree model\n", "6. Random forest model\n", "\n", "### Section 5. Conclusion" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###
|   | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.300 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.1380 |
| 1 | DRC01 | 5.920 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.500 | Low Fat | 0.016760 | Meat | 141.6180 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.2700 |
| 3 | FDX07 | 19.200 | Regular | 0.000000 | Fruits and Vegetables | 182.0950 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.3800 |
| 4 | NCD19 | 8.930 | Low Fat | 0.000000 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |
| 5 | FDP36 | 10.395 | Regular | 0.000000 | Baking Goods | 51.4008 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 556.6088 |
| 6 | FDO10 | 13.650 | Regular | 0.012741 | Snack Foods | 57.6588 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 343.5528 |
| 7 | FDP10 | NaN | Low Fat | 0.127470 | Snack Foods | 107.7622 | OUT027 | 1985 | Medium | Tier 3 | Supermarket Type3 | 4022.7636 |
| 8 | FDH17 | 16.200 | Regular | 0.016687 | Frozen Foods | 96.9726 | OUT045 | 2002 | NaN | Tier 2 | Supermarket Type1 | 1076.5986 |
| 9 | FDU28 | 19.200 | Regular | 0.094450 | Frozen Foods | 187.8214 | OUT017 | 2007 | NaN | Tier 2 | Supermarket Type1 | 4710.5350 |
|   | Item_Weight | Item_Visibility | Item_MRP | Outlet_Establishment_Year | Item_Outlet_Sales |
|---|---|---|---|---|---|
| count | 7060.000000 | 8523.000000 | 8523.000000 | 8523.000000 | 8523.000000 |
| mean | 12.857645 | 0.066132 | 140.992782 | 1997.831867 | 2181.288914 |
| std | 4.643456 | 0.051598 | 62.275067 | 8.371760 | 1706.499616 |
| min | 4.555000 | 0.000000 | 31.290000 | 1985.000000 | 33.290000 |
| 25% | 8.773750 | 0.026989 | 93.826500 | 1987.000000 | 834.247400 |
| 50% | 12.600000 | 0.053931 | 143.012800 | 1999.000000 | 1794.331000 |
| 75% | 16.850000 | 0.094585 | 185.643700 | 2004.000000 | 3101.296400 |
| max | 21.350000 | 0.328391 | 266.888400 | 2009.000000 | 13086.964800 |
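
The summary above hints at two data problems: `Item_Weight` has a count of 7060 against 8523 rows, and `Item_Visibility` has a minimum of exactly 0. A minimal sketch of how such values could be counted is shown here. The frame `toy` holds only the first eight preview rows, and the names `missing_per_column` and `zero_visibility` are illustrative assumptions, not the notebook's own code.

```python
import pandas as pd

# First eight rows of the preview table (two columns only); not the full dataset.
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5, 19.2, 8.93, 10.395, 13.65, float("nan")],
    "Item_Visibility": [0.016047, 0.019278, 0.016760, 0.0, 0.0, 0.0, 0.012741, 0.127470],
})

# Number of missing entries in each column.
missing_per_column = toy.isna().sum()

# Number of rows whose visibility is exactly zero (a candidate "impossible" value).
zero_visibility = int((toy["Item_Visibility"] == 0).sum())

print(missing_per_column)
print("Zero-visibility rows:", zero_visibility)
```

How the flagged values are then replaced is a separate modelling choice that this sketch does not make.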
| Outlet_Size / Outlet_Type | Grocery Store | Supermarket Type1 | Supermarket Type2 | Supermarket Type3 |
|---|---|---|---|---|
| High | NaN | 932.0 | NaN | NaN |
| Medium | NaN | 930.0 | 928.0 | 935.0 |
| Small | 528.0 | 1860.0 | NaN | NaN |
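
A count table of this shape can be built with a cross-tabulation. The sketch below uses a handful of made-up rows, and `toy` and `counts` are hypothetical names; it is not the notebook's own code. Note that `pd.crosstab` reports empty combinations as 0, whereas the table above shows `NaN` for them.

```python
import pandas as pd

# Made-up rows for illustration; not the Big Mart data.
toy = pd.DataFrame({
    "Outlet_Size": ["Small", "Small", "Medium", "High", "Medium"],
    "Outlet_Type": ["Grocery Store", "Supermarket Type1", "Supermarket Type2",
                    "Supermarket Type1", "Supermarket Type3"],
})

# Count the rows for every (Outlet_Size, Outlet_Type) pair.
counts = pd.crosstab(toy["Outlet_Size"], toy["Outlet_Type"])
print(counts)
```

Reading across a row shows which outlet types occur with each size, which is the kind of information that can guide filling in a missing `Outlet_Size`.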
|   | Item_Weight | Item_Visibility | Item_MRP | Item_Outlet_Sales | Outlet_Age |
|---|---|---|---|---|---|
| Item_Weight | 1.000000 | -0.018053 | 0.025821 | 0.012088 | 0.008376 |
| Item_Visibility | -0.018053 | 1.000000 | -0.004525 | -0.128449 | 0.075175 |
| Item_MRP | 0.025821 | -0.004525 | 1.000000 | 0.567574 | -0.005020 |
| Item_Outlet_Sales | 0.012088 | -0.128449 | 0.567574 | 1.000000 | 0.049135 |
| Outlet_Age | 0.008376 | 0.075175 | -0.005020 | 0.049135 | 1.000000 |
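
A correlation matrix like this one can be sketched as follows. The rows are rounded from the first five preview rows, so the resulting numbers will not match the table. `REFERENCE_YEAR` is an assumption: the year used to derive `Outlet_Age` from `Outlet_Establishment_Year` is not stated here, and any constant gives the same correlations because shifting a column does not change them.

```python
import pandas as pd

REFERENCE_YEAR = 2013  # assumption: the actual reference year is not given

# Values rounded from the first five preview rows; not the full dataset.
toy = pd.DataFrame({
    "Item_MRP": [249.8, 48.3, 141.6, 182.1, 53.9],
    "Outlet_Establishment_Year": [1999, 2009, 1999, 1998, 1987],
    "Item_Outlet_Sales": [3735.1, 443.4, 2097.3, 732.4, 994.7],
})

# Derive the outlet's age, then drop the raw year before correlating.
toy["Outlet_Age"] = REFERENCE_YEAR - toy["Outlet_Establishment_Year"]
corr = toy.drop(columns="Outlet_Establishment_Year").corr()
print(corr.round(3))
```

`DataFrame.corr` computes Pearson correlation by default, so the diagonal is 1 and the matrix is symmetric, as in the table above.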
| Outlet_Type | Item_Outlet_Sales |
|---|---|
| Grocery Store | 339.828500 |
| Supermarket Type1 | 2316.181148 |
| Supermarket Type2 | 1995.498739 |
| Supermarket Type3 | 3694.038558 |
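
A per-outlet-type summary of `Item_Outlet_Sales` such as the one above can be sketched with a group-by. The aggregation is assumed here to be the mean, the rows are the first few from the preview table, and `toy` and `mean_sales` are hypothetical names rather than the notebook's own.

```python
import pandas as pd

# Five rows taken from the preview table; not the full dataset.
toy = pd.DataFrame({
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type2"],
    "Item_Outlet_Sales": [3735.1380, 443.4228, 2097.2700, 732.3800, 556.6088],
})

# Average sales within each outlet type (assumed aggregation: mean).
mean_sales = toy.groupby("Outlet_Type")["Item_Outlet_Sales"].mean()
print(mean_sales)
```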
Public LB Score = 1271.42
" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV Score = 1205.6279644382307\n" ] }, { "data": { "text/plain": [ "Public LB Score = 1273.24
" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV Score = 1207.268600984918\n" ] }, { "data": { "text/plain": [ "Public LB Score = 1167.33 (Rank 965)
\n", "\n", "We see that \"Item_MRP\", \"Outlet_Type\", and \"Outlet_Age\" are the most important features for the Decision tree model." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV Score = 1088.7693871484626\n" ] }, { "data": { "text/plain": [ "Public LB Score = 1154.16 (Rank 657 out of 2149)
\n", "\n", "We see that \"Item_MRP\", \"Outlet_Type\", and \"Outlet_Age\" are the most important features for the Random Forest Model." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CV Score = 1082.8669304067994\n" ] }, { "data": { "text/plain": [ "