Multiple Linear Regression¶

Predicting Food Prices in Nigreia¶

  • The Economics of Eating: Predicting Food Price Trends in Nigeria
  • This dataset contains Food Prices data for Nigeria, sourced from the World Food Programme Price Database. The World Food Programme Price Database covers foods such as maize, rice, beans, fish, and sugar for 98 countries and some 3000 markets. It is updated weekly but contains to a large extent monthly data. The data goes back as far as 1992 for a few countries, although many countries started reporting from 2003 or thereafter.

Exploratory Data Analysis: Unveiling Patterns and Insights¶

1. Introduction
    Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.

2. Objectives of Exploratory Data Analysis
    1. Understand Data Characteristics:
        Gain insights into the distribution, central tendency, and variability of the data.
        Identify the presence of missing values, outliers, and anomalies.

    2. Explore Relationships:
        Examine correlations and dependencies between different variables.
        Uncover potential patterns and trends within the dataset.

    3. Visualize Data Distributions:
        Utilize graphical representations to visualize the distribution of data.
        Choose appropriate plots such as histograms, box plots, and scatter plots.

    4. Identify Patterns and Anomalies:
        Uncover hidden patterns that may not be apparent in raw data.
        Detect outliers and anomalies that could impact analysis outcomes.


3. Techniques and Tools
    1. Descriptive Statistics:
        Calculate measures such as mean, median, and standard deviation.
        Utilize summary statistics to provide an overview of the dataset.
        Data Visualization:

        Employ graphical representations like histograms, box plots, and scatter plots.
        Create visualizations to illustrate trends, patterns, and relationships.
        Correlation Analysis:

        Use correlation matrices to quantify the relationships between variables.
        Identify strong positive/negative correlations and potential multicollinearity.
        Outlier Detection:

        Apply statistical methods or visual inspection to identify outliers.
        Assess the impact of outliers on the analysis and consider appropriate handling.

4. Steps in Exploratory Data Analysis
    1. Data Collection:
        Gather the raw dataset from reliable sources.

    2. Data Cleaning:
        Handle missing values, duplicate entries, and inconsistencies.
        Ensure data is in a suitable format for analysis.

    3. Descriptive Statistics:
        Compute basic statistics to describe the central tendency and dispersion.

    4. Visualization:
        Generate visualizations to explore data distributions and relationships.

    5. Correlation Analysis:
        Investigate correlations between variables.

    6. Outlier Detection:
        Identify and analyze outliers to understand their impact.

5. Case Study: Applying EDA to Real-World Data
    Provide a practical example where EDA is applied to a specific dataset, showcasing the step-by-step process and the insights gained.

6. Conclusion
    Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.

7. References
    Include references to any tools, libraries, or methodologies used in the EDA process.

Import Libraries¶

In [1]:
import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Documentation
import handcalcs.render

# Plot
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm # color map
import seaborn as sns
import plotly.express as px


from sympy import Sum, symbols, Indexed, lambdify, diff
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
In [2]:
# Path
data_path = './Data/'

Import Dataset¶

In [3]:
raw_data = pd.read_csv(data_path+"wfp_food_prices_nga.csv",  low_memory=False).reset_index(drop=True)
raw_data.shape
Out[3]:
(80982, 14)
In [4]:
raw_data
Out[4]:
date admin1 admin2 market latitude longitude category commodity unit priceflag pricetype currency price usdprice
0 #date #adm1+name #adm2+name #loc+market+name #geo+lat #geo+lon #item+type #item+name #item+unit #item+price+flag #item+price+type #currency #value #value+usd
1 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Maize KG actual Wholesale NGN 175.92 1.5525
2 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Millet KG actual Wholesale NGN 150.18 1.3254
3 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Rice (imported) KG actual Wholesale NGN 358.7 3.1656
4 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Sorghum KG actual Wholesale NGN 155.61 1.3733
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80977 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Onions 400 G forecast Retail NGN 0.0 0.0
80978 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Oranges 400 G forecast Retail NGN 0.0 0.0
80979 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Spinach 300 G forecast Retail NGN 0.0 0.0
80980 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Tomatoes 0.5 KG forecast Retail NGN 0.0 0.0
80981 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Watermelons 2.1 KG forecast Retail NGN 0.0 0.0

80982 rows × 14 columns

In [5]:
column_info = raw_data.iloc[0]
column_info
Out[5]:
date                    #date
admin1             #adm1+name
admin2             #adm2+name
market       #loc+market+name
latitude             #geo+lat
longitude            #geo+lon
category           #item+type
commodity          #item+name
unit               #item+unit
priceflag    #item+price+flag
pricetype    #item+price+type
currency            #currency
price                  #value
usdprice           #value+usd
Name: 0, dtype: object
In [6]:
raw_data = raw_data.drop(0).reset_index(drop=True)
raw_data
Out[6]:
date admin1 admin2 market latitude longitude category commodity unit priceflag pricetype currency price usdprice
0 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Maize KG actual Wholesale NGN 175.92 1.5525
1 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Millet KG actual Wholesale NGN 150.18 1.3254
2 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Rice (imported) KG actual Wholesale NGN 358.7 3.1656
3 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 cereals and tubers Sorghum KG actual Wholesale NGN 155.61 1.3733
4 2002-01-15 Katsina Jibia Jibia (CBM) 13.08 7.24 pulses and nuts Beans (niebe) KG actual Wholesale NGN 196.87 1.7374
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80976 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Onions 400 G forecast Retail NGN 0.0 0.0
80977 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Oranges 400 G forecast Retail NGN 0.0 0.0
80978 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Spinach 300 G forecast Retail NGN 0.0 0.0
80979 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Tomatoes 0.5 KG forecast Retail NGN 0.0 0.0
80980 2023-12-15 Yobe Yusufari Yusufari 13.062664985656738 11.171370506286621 vegetables and fruits Watermelons 2.1 KG forecast Retail NGN 0.0 0.0

80981 rows × 14 columns

In [7]:
# Rename the columns
raw_data.rename(columns={ 'date': 'Date', 'admin1': 'Admin1', 'admin2': 'Admin2', 'market': 'Market', 'latitude': 'Latitude', 'longitude': 'Longitude', 'category': 'Category', 
                     'commodity': 'Commodity', 'unit': 'Unit', 'priceflag': 'Price_Flag', 'pricetype': 'Price_Type', 'currency': 'Currency', 'price': 'Price', 'usdprice':'USD_Price' }, inplace=True)
raw_data.columns
Out[7]:
Index(['Date', 'Admin1', 'Admin2', 'Market', 'Latitude', 'Longitude',
       'Category', 'Commodity', 'Unit', 'Price_Flag', 'Price_Type', 'Currency',
       'Price', 'USD_Price'],
      dtype='object')
In [8]:
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80981 entries, 0 to 80980
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Date        80981 non-null  object
 1   Admin1      80981 non-null  object
 2   Admin2      80981 non-null  object
 3   Market      80981 non-null  object
 4   Latitude    80981 non-null  object
 5   Longitude   80981 non-null  object
 6   Category    80981 non-null  object
 7   Commodity   80981 non-null  object
 8   Unit        80981 non-null  object
 9   Price_Flag  80981 non-null  object
 10  Price_Type  80981 non-null  object
 11  Currency    80981 non-null  object
 12  Price       80981 non-null  object
 13  USD_Price   80981 non-null  object
dtypes: object(14)
memory usage: 8.6+ MB
In [9]:
raw_data['Date'] = raw_data['Date'].astype('datetime64[ns]')
raw_data['Latitude'] = raw_data['Latitude'].astype(float)
raw_data['Longitude'] = raw_data['Longitude'].astype(float)
raw_data['Price'] = raw_data['Price'].astype(float)
raw_data['USD_Price'] = raw_data['USD_Price'].astype(float)
In [10]:
raw_data
Out[10]:
Date Admin1 Admin2 Market Latitude Longitude Category Commodity Unit Price_Flag Price_Type Currency Price USD_Price
0 2002-01-15 Katsina Jibia Jibia (CBM) 13.080000 7.240000 cereals and tubers Maize KG actual Wholesale NGN 175.92 1.5525
1 2002-01-15 Katsina Jibia Jibia (CBM) 13.080000 7.240000 cereals and tubers Millet KG actual Wholesale NGN 150.18 1.3254
2 2002-01-15 Katsina Jibia Jibia (CBM) 13.080000 7.240000 cereals and tubers Rice (imported) KG actual Wholesale NGN 358.70 3.1656
3 2002-01-15 Katsina Jibia Jibia (CBM) 13.080000 7.240000 cereals and tubers Sorghum KG actual Wholesale NGN 155.61 1.3733
4 2002-01-15 Katsina Jibia Jibia (CBM) 13.080000 7.240000 pulses and nuts Beans (niebe) KG actual Wholesale NGN 196.87 1.7374
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80976 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Onions 400 G forecast Retail NGN 0.00 0.0000
80977 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Oranges 400 G forecast Retail NGN 0.00 0.0000
80978 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Spinach 300 G forecast Retail NGN 0.00 0.0000
80979 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Tomatoes 0.5 KG forecast Retail NGN 0.00 0.0000
80980 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Watermelons 2.1 KG forecast Retail NGN 0.00 0.0000

80981 rows × 14 columns

In [11]:
raw_data.describe()
Out[11]:
Latitude Longitude Price USD_Price
count 80981.000000 80981.000000 80981.000000 80981.000000
mean 11.396802 10.278874 4887.086852 29.488854
std 1.699392 2.926734 10640.517897 329.652934
min 5.147640 3.401897 0.000000 0.000000
25% 11.280160 8.429120 100.000000 0.279300
50% 11.831100 11.166670 314.280000 0.968800
75% 12.362300 13.092646 1254.000000 3.909100
max 13.645000 13.264360 138000.000000 28662.419400
In [12]:
raw_data.nunique()
Out[12]:
Date            264
Admin1           14
Admin2           31
Market           40
Latitude         40
Longitude        40
Category          8
Commodity        42
Unit             22
Price_Flag        4
Price_Type        2
Currency          1
Price         20163
USD_Price     29982
dtype: int64

Finding for missing values¶

In [13]:
raw_data.isnull().sum()
Out[13]:
Date          0
Admin1        0
Admin2        0
Market        0
Latitude      0
Longitude     0
Category      0
Commodity     0
Unit          0
Price_Flag    0
Price_Type    0
Currency      0
Price         0
USD_Price     0
dtype: int64
In [14]:
for column in raw_data.columns:
    if column not in ['Date', 'Latitude', 'Longitude', 'Price', 'USD_Price']:
        print("-------------------------------------------------",column," - ",len(raw_data[column].unique()),"---------------------------------------------------")
        print(raw_data[column].unique())
        print("--------------------------------------------------------------------------------------------------------------")
        
------------------------------------------------- Admin1  -  14 ---------------------------------------------------
['Katsina' 'Sokoto' 'Borno' 'Kano' 'Jigawa' 'Oyo' 'Lagos' 'Kaduna'
 'Zamfara' 'Abia' 'Gombe' 'Kebbi' 'Adamawa' 'Yobe']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Admin2  -  31 ---------------------------------------------------
['Jibia' 'Gada' "Mai'Adua" 'Mobbar' 'Dawakin Tofa' 'Maigatari'
 'Ibadan North' 'Maiduguri' 'Kosofe' 'Giwa' 'Kaura Namoda' 'Oboma Ngwa'
 'Akko' 'Kaugama' 'Lere' 'Dandume' 'Gwandu' 'Hong' 'Biu' 'Damaturu'
 'Potiskum' 'Konduga' 'Geidam' 'Gujba' 'Jakusko' 'Karasuwa' 'Nguru'
 'Yunusari' 'Yusufari' 'Borsari' 'Gulani']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Market  -  40 ---------------------------------------------------
['Jibia (CBM)' 'Illela (CBM)' 'Mai Adoua (CBM)' 'Damassack (CBM)'
 'Dawanau' 'Mai Gatari (CBM)' 'Ibadan' 'Maiduguri' 'Lagos' 'Giwa'
 'Kaura Namoda' 'Aba' 'Gombe' 'Gujungu' 'Saminaka' 'Dandume' 'Gwandu'
 'Mubi' 'Biu' 'Damaturu' 'Potiskum' 'Abba Gamaram' 'Baga Road'
 'Bullunkutu' 'Budum' 'Custom' 'Kusawam Shanu' 'Monday' 'Tashan Bama'
 'Bolori Stores' 'Damaturu (Sunday Market)' 'Geidam' 'Gujba (Buni Yadi)'
 'Jakusko' 'Bade (Gashua)' 'Nguru' 'Yunusari' 'Yusufari' 'Bursari'
 'Gulani (Tettaba)']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Category  -  8 ---------------------------------------------------
['cereals and tubers' 'pulses and nuts' 'non-food' 'oil and fats'
 'meat, fish and eggs' 'milk and dairy' 'vegetables and fruits'
 'miscellaneous food']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Commodity  -  42 ---------------------------------------------------
['Maize' 'Millet' 'Rice (imported)' 'Sorghum' 'Beans (niebe)' 'Wheat'
 'Maize (white)' 'Sorghum (white)' 'Rice (milled, local)' 'Bread'
 'Cassava meal (gari, yellow)' 'Gari (white)' 'Maize (yellow)'
 'Rice (local)' 'Sorghum (brown)' 'Yam (Abuja)' 'Fuel (diesel)'
 'Fuel (petrol-gasoline)' 'Oil (palm)' 'Cowpeas (brown)' 'Cowpeas (white)'
 'Yam' 'Groundnuts (shelled)' 'Maize flour' 'Meat (beef)' 'Meat (goat)'
 'Milk' 'Oil (vegetable)' 'Beans (red)' 'Beans (white)' 'Groundnuts'
 'Onions' 'Fish' 'Eggs' 'Bananas' 'Oranges' 'Spinach' 'Watermelons'
 'Cowpeas' 'Tomatoes' 'Salt' 'Sugar']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Unit  -  22 ---------------------------------------------------
['KG' '100 KG' '50 KG' 'Unit' 'L' '100 L' '1.3 KG' '1.4 KG' '3.4 KG'
 '20 G' '750 ML' '1.1 KG' '0.5 KG' '1.5 KG' '30 pcs' '400 G' '300 G'
 '2.1 KG' '3.1 KG' '1.2 KG' '250 G' '100 Tubers']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Price_Flag  -  4 ---------------------------------------------------
['actual' 'actual,aggregate' 'aggregate' 'forecast']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Price_Type  -  2 ---------------------------------------------------
['Wholesale' 'Retail']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Currency  -  1 ---------------------------------------------------
['NGN']
--------------------------------------------------------------------------------------------------------------
In [15]:
raw_data.Category.value_counts()
Out[15]:
cereals and tubers       39537
pulses and nuts          13841
vegetables and fruits     8794
meat, fish and eggs       6177
oil and fats              4928
non-food                  3436
miscellaneous food        2721
milk and dairy            1547
Name: Category, dtype: int64
In [16]:
raw_data.Commodity.value_counts()
Out[16]:
Millet                         4972
Rice (imported)                4537
Rice (local)                   4193
Yam                            4002
Oil (palm)                     3368
Maize (white)                  2946
Sorghum (white)                2904
Cowpeas (white)                2699
Groundnuts (shelled)           2663
Sorghum (brown)                2660
Gari (white)                   2599
Maize (yellow)                 2560
Cowpeas (brown)                2526
Cassava meal (gari, yellow)    2310
Sorghum                        2109
Fuel (petrol-gasoline)         1721
Fuel (diesel)                  1715
Groundnuts                     1561
Oil (vegetable)                1560
Maize flour                    1557
Beans (red)                    1554
Meat (beef)                    1552
Meat (goat)                    1552
Milk                           1547
Beans (white)                  1547
Onions                         1546
Fish                           1543
Eggs                           1530
Watermelons                    1521
Oranges                        1520
Bananas                        1505
Tomatoes                       1415
Sugar                          1369
Salt                           1352
Spinach                        1287
Cowpeas                        1267
Bread                          1035
Maize                           601
Rice (milled, local)            394
Yam (Abuja)                      85
Wheat                            73
Beans (niebe)                    24
Name: Commodity, dtype: int64
In [17]:
raw_data['Unit'].value_counts()
Out[17]:
100 KG        19125
KG            18740
1.3 KG         7056
L              5135
1.4 KG         5072
1.2 KG         4335
750 ML         3120
400 G          2387
0.5 KG         2094
50 KG          1965
20 G           1547
30 pcs         1530
2.1 KG         1521
250 G          1352
300 G          1287
Unit           1035
3.1 KG          867
1.1 KG          694
3.4 KG          680
1.5 KG          676
100 Tubers      666
100 L            97
Name: Unit, dtype: int64
In [18]:
raw_data['Price_Flag'].value_counts()
Out[18]:
actual              51969
aggregate           14594
forecast            13949
actual,aggregate      469
Name: Price_Flag, dtype: int64
In [19]:
raw_data['Price_Type'].value_counts()
Out[19]:
Retail       56111
Wholesale    24870
Name: Price_Type, dtype: int64
In [20]:
# Removing Outlier column
raw_data.drop('Currency',  axis=1, inplace=True)

Data Exploration¶

In [21]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))  # You can adjust the width and height as needed

# Choose a color palette with fewer colors
palette = sns.color_palette("Set1", len(raw_data['Category'].unique()))

# Get the top 8 categories (you can adjust the number as needed)
top_categories = raw_data['Category'].value_counts().index

# Filter the data to include only the top categories
filtered_data = raw_data[raw_data['Category'].isin(top_categories)]

# Creating the bar plot with different colors for each category
sns.barplot(x='Category', y='USD_Price', hue='Category', data=filtered_data, palette=palette)

# Adding titles and labels
plt.title('Bar Plot of USD Price by Category')
plt.xlabel('Category')
plt.ylabel('USD Price')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Category')

# Show the plot
plt.show()
In [22]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(10, 18))  # Swapping width and height for a horizontal plot

# Get the top categories
top_categories = raw_data['Commodity'].value_counts().index

# Filter the data to include only the top categories
filtered_data = raw_data[raw_data['Commodity'].isin(top_categories)]

# Creating the horizontal bar plot with different colors for each category
sns.barplot(x='USD_Price', y='Commodity', hue='Commodity', data=filtered_data)

# Adding titles and labels
plt.title('Plot of USD Price by Commodity')
plt.xlabel('USD Price')
plt.ylabel('Commodity')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=sns.color_palette()[i % len(sns.color_palette())], markersize=10) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Commodity')

# Show the plot
plt.show()
In [23]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(14, 5))

# Choose a color palette with fewer colors
palette = sns.color_palette("Set1", len(raw_data['Price_Type'].unique()))

# Get Price Type 
top_price_type = raw_data['Price_Type'].value_counts().index

# Filter the data to include only the top Price Type
filtered_data = raw_data[raw_data['Price_Type'].isin(top_price_type)]

# Creating the bar plot with different colors for each Price Type
ax = sns.barplot(x='Price_Type', y='USD_Price', hue='Price_Type', data=filtered_data, palette=palette)

# Adding titles and labels
plt.title('USD Price by Price Type')
plt.xlabel('Price Type')
plt.ylabel('USD Price')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_price_type)]
plt.legend(legend_handles, top_price_type, title='Price Type')

# Adding annotations on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Show the plot
plt.show()
In [24]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

# Creating a histogram with 20 bins
sns.histplot(raw_data["Price"], kde=True, bins=10,
    stat="density", kde_kws=dict(cut=3),
    alpha=.4, edgecolor=(1, 1, 1, .4),)

# Adding titles and labels
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')

# Show the plot
plt.show()
In [25]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

# Creating a histogram with 20 bins
sns.histplot(raw_data["USD_Price"], kde=True, bins=10,
    stat="density", kde_kws=dict(cut=3),
    alpha=.4, edgecolor=(1, 1, 1, .4),)

# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Frequency')

# Show the plot
plt.show()
In [26]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

sns.kdeplot(raw_data["USD_Price"])

# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Frequency')

# Show the plot
plt.show()
In [27]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8))  # Swapping width and height for a horizontal plot
sns.scatterplot(x=raw_data['Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of NGN Price vs Commodity')
plt.xlabel('Commodity')
plt.ylabel('NGN Price')
plt.show()
In [28]:
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8))  # Swapping width and height for a horizontal plot
sns.scatterplot(x=raw_data['USD_Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of USD Price vs Commodity')
plt.xlabel('X1')
plt.ylabel('y')
plt.show()
In [29]:
# sns.pairplot(raw_data)
# plt.show()
In [30]:
# sns.pairplot(raw_data, kind = 'reg' , plot_kws = {'line_kws':{'color': 'red'}})
# plt.show()

Encoding categorical data¶

Encoding categorical data is a crucial step in preparing data for machine learning models, as many algorithms require numerical input. Categorical data represents variables that can take on a limited, and usually fixed, number of values. There are several common techniques for encoding categorical data:

  1. Label Encoding:
    • Assigns a unique integer to each category.
    • Suitable for ordinal data where the order matters.
    • Sklearn provides LabelEncoder for this purpose.

  2. One-Hot Encoding:
    • Creates binary columns for each category and represents the presence of a category with a 1.
    • Suitable for nominal data where there is no inherent order.

  3. Ordinal Encoding:
    • Manually assign numerical values based on the order of categories.
    • Useful when there is an inherent order among categories.

  4. Binary Encoding:
    • Converts categories into binary code.
    • Reduces the number of columns compared to one-hot encoding.

  5. Hashing Encoding:
    • Converts categories into a fixed-size hash, useful when dealing with high cardinality.
In [31]:
pd.get_dummies(raw_data['Category'])
Out[31]:
cereals and tubers meat, fish and eggs milk and dairy miscellaneous food non-food oil and fats pulses and nuts vegetables and fruits
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 1 0 0 0 0 0 0 0
4 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ...
80976 0 0 0 0 0 0 0 1
80977 0 0 0 0 0 0 0 1
80978 0 0 0 0 0 0 0 1
80979 0 0 0 0 0 0 0 1
80980 0 0 0 0 0 0 0 1

80981 rows × 8 columns

In [32]:
pd.get_dummies(raw_data['Price_Flag'])
Out[32]:
actual actual,aggregate aggregate forecast
0 1 0 0 0
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 1 0 0 0
... ... ... ... ...
80976 0 0 0 1
80977 0 0 0 1
80978 0 0 0 1
80979 0 0 0 1
80980 0 0 0 1

80981 rows × 4 columns

In [33]:
pd.get_dummies(raw_data['Price_Type'])
Out[33]:
Retail Wholesale
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
... ... ...
80976 1 0
80977 1 0
80978 1 0
80979 1 0
80980 1 0

80981 rows × 2 columns

In [34]:
pd.get_dummies(raw_data['Unit'])
Out[34]:
0.5 KG 1.1 KG 1.2 KG 1.3 KG 1.4 KG 1.5 KG 100 KG 100 L 100 Tubers 2.1 KG ... 3.1 KG 3.4 KG 30 pcs 300 G 400 G 50 KG 750 ML KG L Unit
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
80976 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
80977 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
80978 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 0 0 0
80979 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
80980 0 0 0 0 0 0 0 0 0 1 ... 0 0 0 0 0 0 0 0 0 0

80981 rows × 22 columns

In [35]:
# This data has been seperated as test data because the price has to be predicted
test_data = raw_data[raw_data['Price_Flag'] == 'forecast'].reset_index(drop=True)
test_data
Out[35]:
Date Admin1 Admin2 Market Latitude Longitude Category Commodity Unit Price_Flag Price_Type Price USD_Price
0 2006-01-15 Sokoto Gada Illela (CBM) 13.645000 5.278000 cereals and tubers Wheat KG forecast Wholesale 0.0 0.0
1 2006-02-15 Sokoto Gada Illela (CBM) 13.645000 5.278000 cereals and tubers Wheat KG forecast Wholesale 0.0 0.0
2 2006-03-15 Katsina Mai'Adua Mai Adoua (CBM) 13.180000 8.230000 cereals and tubers Wheat KG forecast Wholesale 0.0 0.0
3 2006-03-15 Sokoto Gada Illela (CBM) 13.645000 5.278000 cereals and tubers Wheat KG forecast Wholesale 0.0 0.0
4 2006-04-15 Katsina Mai'Adua Mai Adoua (CBM) 13.180000 8.230000 cereals and tubers Wheat KG forecast Wholesale 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
13944 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Onions 400 G forecast Retail 0.0 0.0
13945 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Oranges 400 G forecast Retail 0.0 0.0
13946 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Spinach 300 G forecast Retail 0.0 0.0
13947 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Tomatoes 0.5 KG forecast Retail 0.0 0.0
13948 2023-12-15 Yobe Yusufari Yusufari 13.062665 11.171371 vegetables and fruits Watermelons 2.1 KG forecast Retail 0.0 0.0

13949 rows × 13 columns

In [36]:
raw_data = raw_data[raw_data['Price_Flag'] != 'forecast'].reset_index(drop=True)
raw_data
Out[36]:
Date Admin1 Admin2 Market Latitude Longitude Category Commodity Unit Price_Flag Price_Type Price USD_Price
0 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Maize KG actual Wholesale 175.92 1.5525
1 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Millet KG actual Wholesale 150.18 1.3254
2 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Rice (imported) KG actual Wholesale 358.70 3.1656
3 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Sorghum KG actual Wholesale 155.61 1.3733
4 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 pulses and nuts Beans (niebe) KG actual Wholesale 196.87 1.7374
... ... ... ... ... ... ... ... ... ... ... ... ... ...
67027 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Cowpeas (brown) KG aggregate Retail 298.55 0.6480
67028 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Cowpeas (white) 100 KG aggregate Wholesale 29440.00 63.8992
67029 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Cowpeas (white) KG aggregate Retail 274.82 0.5965
67030 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Groundnuts (shelled) 100 KG aggregate Wholesale 46960.00 101.9261
67031 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Groundnuts (shelled) KG aggregate Retail 467.66 1.0151

67032 rows × 13 columns

One-Hot Encoding¶

In [37]:
encoder = OneHotEncoder(drop='first')
In [38]:
def encoding_categorical_data(df, column):
    df = pd.concat([df, pd.DataFrame(encoder.fit_transform(df[[column]]).toarray(), columns=encoder.get_feature_names_out([column])).astype(int)],  axis=1) 
    # Removing un necessary column
    df.drop(column,  axis=1, inplace=True)
    return df
In [39]:
data = raw_data.copy()
for col in ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']:
    data = encoding_categorical_data(data, col)
    
for col in ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']:
    test_data = encoding_categorical_data(test_data, col)
In [40]:
data.shape
Out[40]:
(67032, 80)
In [41]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67032 entries, 0 to 67031
Data columns (total 80 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Date                                   67032 non-null  datetime64[ns]
 1   Admin1                                 67032 non-null  object        
 2   Admin2                                 67032 non-null  object        
 3   Market                                 67032 non-null  object        
 4   Latitude                               67032 non-null  float64       
 5   Longitude                              67032 non-null  float64       
 6   Price                                  67032 non-null  float64       
 7   USD_Price                              67032 non-null  float64       
 8   Price_Type_Wholesale                   67032 non-null  int32         
 9   Price_Flag_actual,aggregate            67032 non-null  int32         
 10  Price_Flag_aggregate                   67032 non-null  int32         
 11  Unit_1.1 KG                            67032 non-null  int32         
 12  Unit_1.2 KG                            67032 non-null  int32         
 13  Unit_1.3 KG                            67032 non-null  int32         
 14  Unit_1.4 KG                            67032 non-null  int32         
 15  Unit_1.5 KG                            67032 non-null  int32         
 16  Unit_100 KG                            67032 non-null  int32         
 17  Unit_100 L                             67032 non-null  int32         
 18  Unit_100 Tubers                        67032 non-null  int32         
 19  Unit_2.1 KG                            67032 non-null  int32         
 20  Unit_20 G                              67032 non-null  int32         
 21  Unit_250 G                             67032 non-null  int32         
 22  Unit_3.1 KG                            67032 non-null  int32         
 23  Unit_3.4 KG                            67032 non-null  int32         
 24  Unit_30 pcs                            67032 non-null  int32         
 25  Unit_300 G                             67032 non-null  int32         
 26  Unit_400 G                             67032 non-null  int32         
 27  Unit_50 KG                             67032 non-null  int32         
 28  Unit_750 ML                            67032 non-null  int32         
 29  Unit_KG                                67032 non-null  int32         
 30  Unit_L                                 67032 non-null  int32         
 31  Unit_Unit                              67032 non-null  int32         
 32  Commodity_Beans (niebe)                67032 non-null  int32         
 33  Commodity_Beans (red)                  67032 non-null  int32         
 34  Commodity_Beans (white)                67032 non-null  int32         
 35  Commodity_Bread                        67032 non-null  int32         
 36  Commodity_Cassava meal (gari, yellow)  67032 non-null  int32         
 37  Commodity_Cowpeas                      67032 non-null  int32         
 38  Commodity_Cowpeas (brown)              67032 non-null  int32         
 39  Commodity_Cowpeas (white)              67032 non-null  int32         
 40  Commodity_Eggs                         67032 non-null  int32         
 41  Commodity_Fish                         67032 non-null  int32         
 42  Commodity_Fuel (diesel)                67032 non-null  int32         
 43  Commodity_Fuel (petrol-gasoline)       67032 non-null  int32         
 44  Commodity_Gari (white)                 67032 non-null  int32         
 45  Commodity_Groundnuts                   67032 non-null  int32         
 46  Commodity_Groundnuts (shelled)         67032 non-null  int32         
 47  Commodity_Maize                        67032 non-null  int32         
 48  Commodity_Maize (white)                67032 non-null  int32         
 49  Commodity_Maize (yellow)               67032 non-null  int32         
 50  Commodity_Maize flour                  67032 non-null  int32         
 51  Commodity_Meat (beef)                  67032 non-null  int32         
 52  Commodity_Meat (goat)                  67032 non-null  int32         
 53  Commodity_Milk                         67032 non-null  int32         
 54  Commodity_Millet                       67032 non-null  int32         
 55  Commodity_Oil (palm)                   67032 non-null  int32         
 56  Commodity_Oil (vegetable)              67032 non-null  int32         
 57  Commodity_Onions                       67032 non-null  int32         
 58  Commodity_Oranges                      67032 non-null  int32         
 59  Commodity_Rice (imported)              67032 non-null  int32         
 60  Commodity_Rice (local)                 67032 non-null  int32         
 61  Commodity_Rice (milled, local)         67032 non-null  int32         
 62  Commodity_Salt                         67032 non-null  int32         
 63  Commodity_Sorghum                      67032 non-null  int32         
 64  Commodity_Sorghum (brown)              67032 non-null  int32         
 65  Commodity_Sorghum (white)              67032 non-null  int32         
 66  Commodity_Spinach                      67032 non-null  int32         
 67  Commodity_Sugar                        67032 non-null  int32         
 68  Commodity_Tomatoes                     67032 non-null  int32         
 69  Commodity_Watermelons                  67032 non-null  int32         
 70  Commodity_Wheat                        67032 non-null  int32         
 71  Commodity_Yam                          67032 non-null  int32         
 72  Commodity_Yam (Abuja)                  67032 non-null  int32         
 73  Category_meat, fish and eggs           67032 non-null  int32         
 74  Category_milk and dairy                67032 non-null  int32         
 75  Category_miscellaneous food            67032 non-null  int32         
 76  Category_non-food                      67032 non-null  int32         
 77  Category_oil and fats                  67032 non-null  int32         
 78  Category_pulses and nuts               67032 non-null  int32         
 79  Category_vegetables and fruits         67032 non-null  int32         
dtypes: datetime64[ns](1), float64(4), int32(72), object(3)
memory usage: 22.5+ MB
In [42]:
data.describe()
Out[42]:
Latitude Longitude Price USD_Price Price_Type_Wholesale Price_Flag_actual,aggregate Price_Flag_aggregate Unit_1.1 KG Unit_1.2 KG Unit_1.3 KG ... Commodity_Wheat Commodity_Yam Commodity_Yam (Abuja) Category_meat, fish and eggs Category_milk and dairy Category_miscellaneous food Category_non-food Category_oil and fats Category_pulses and nuts Category_vegetables and fruits
count 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 ... 67032.000000 67032.000000 67032.0000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000 67032.000000
mean 11.411172 10.273114 5904.063438 35.625326 0.320847 0.006997 0.217717 0.008832 0.052811 0.086645 ... 0.000910 0.048171 0.0010 0.076575 0.019185 0.033074 0.043681 0.060717 0.168979 0.107919
std 1.689944 2.934114 11435.785644 362.031271 0.466805 0.083354 0.412697 0.093561 0.223657 0.281317 ... 0.030153 0.214129 0.0316 0.265918 0.137175 0.178831 0.204385 0.238813 0.374736 0.310280
min 5.147640 3.401897 1.000000 0.002300 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.0000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 11.280160 8.429120 200.000000 0.600400 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.0000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% 11.831100 11.166670 446.745000 1.307900 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.0000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 12.362300 13.092646 6862.500000 28.969800 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.0000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
max 13.645000 13.264360 138000.000000 28662.419400 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 1.000000 1.0000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

8 rows × 76 columns

In [43]:
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13949 entries, 0 to 13948
Data columns (total 77 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Date                                   13949 non-null  datetime64[ns]
 1   Admin1                                 13949 non-null  object        
 2   Admin2                                 13949 non-null  object        
 3   Market                                 13949 non-null  object        
 4   Latitude                               13949 non-null  float64       
 5   Longitude                              13949 non-null  float64       
 6   Price                                  13949 non-null  float64       
 7   USD_Price                              13949 non-null  float64       
 8   Price_Type_Wholesale                   13949 non-null  int32         
 9   Unit_1.1 KG                            13949 non-null  int32         
 10  Unit_1.2 KG                            13949 non-null  int32         
 11  Unit_1.3 KG                            13949 non-null  int32         
 12  Unit_1.4 KG                            13949 non-null  int32         
 13  Unit_1.5 KG                            13949 non-null  int32         
 14  Unit_100 KG                            13949 non-null  int32         
 15  Unit_100 L                             13949 non-null  int32         
 16  Unit_100 Tubers                        13949 non-null  int32         
 17  Unit_2.1 KG                            13949 non-null  int32         
 18  Unit_20 G                              13949 non-null  int32         
 19  Unit_250 G                             13949 non-null  int32         
 20  Unit_3.1 KG                            13949 non-null  int32         
 21  Unit_3.4 KG                            13949 non-null  int32         
 22  Unit_30 pcs                            13949 non-null  int32         
 23  Unit_300 G                             13949 non-null  int32         
 24  Unit_400 G                             13949 non-null  int32         
 25  Unit_50 KG                             13949 non-null  int32         
 26  Unit_750 ML                            13949 non-null  int32         
 27  Unit_KG                                13949 non-null  int32         
 28  Unit_L                                 13949 non-null  int32         
 29  Unit_Unit                              13949 non-null  int32         
 30  Commodity_Beans (red)                  13949 non-null  int32         
 31  Commodity_Beans (white)                13949 non-null  int32         
 32  Commodity_Bread                        13949 non-null  int32         
 33  Commodity_Cassava meal (gari, yellow)  13949 non-null  int32         
 34  Commodity_Cowpeas                      13949 non-null  int32         
 35  Commodity_Cowpeas (brown)              13949 non-null  int32         
 36  Commodity_Cowpeas (white)              13949 non-null  int32         
 37  Commodity_Eggs                         13949 non-null  int32         
 38  Commodity_Fish                         13949 non-null  int32         
 39  Commodity_Fuel (diesel)                13949 non-null  int32         
 40  Commodity_Fuel (petrol-gasoline)       13949 non-null  int32         
 41  Commodity_Gari (white)                 13949 non-null  int32         
 42  Commodity_Groundnuts                   13949 non-null  int32         
 43  Commodity_Groundnuts (shelled)         13949 non-null  int32         
 44  Commodity_Maize                        13949 non-null  int32         
 45  Commodity_Maize (white)                13949 non-null  int32         
 46  Commodity_Maize (yellow)               13949 non-null  int32         
 47  Commodity_Maize flour                  13949 non-null  int32         
 48  Commodity_Meat (beef)                  13949 non-null  int32         
 49  Commodity_Meat (goat)                  13949 non-null  int32         
 50  Commodity_Milk                         13949 non-null  int32         
 51  Commodity_Millet                       13949 non-null  int32         
 52  Commodity_Oil (palm)                   13949 non-null  int32         
 53  Commodity_Oil (vegetable)              13949 non-null  int32         
 54  Commodity_Onions                       13949 non-null  int32         
 55  Commodity_Oranges                      13949 non-null  int32         
 56  Commodity_Rice (imported)              13949 non-null  int32         
 57  Commodity_Rice (local)                 13949 non-null  int32         
 58  Commodity_Rice (milled, local)         13949 non-null  int32         
 59  Commodity_Salt                         13949 non-null  int32         
 60  Commodity_Sorghum                      13949 non-null  int32         
 61  Commodity_Sorghum (brown)              13949 non-null  int32         
 62  Commodity_Sorghum (white)              13949 non-null  int32         
 63  Commodity_Spinach                      13949 non-null  int32         
 64  Commodity_Sugar                        13949 non-null  int32         
 65  Commodity_Tomatoes                     13949 non-null  int32         
 66  Commodity_Watermelons                  13949 non-null  int32         
 67  Commodity_Wheat                        13949 non-null  int32         
 68  Commodity_Yam                          13949 non-null  int32         
 69  Commodity_Yam (Abuja)                  13949 non-null  int32         
 70  Category_meat, fish and eggs           13949 non-null  int32         
 71  Category_milk and dairy                13949 non-null  int32         
 72  Category_miscellaneous food            13949 non-null  int32         
 73  Category_non-food                      13949 non-null  int32         
 74  Category_oil and fats                  13949 non-null  int32         
 75  Category_pulses and nuts               13949 non-null  int32         
 76  Category_vegetables and fruits         13949 non-null  int32         
dtypes: datetime64[ns](1), float64(4), int32(69), object(3)
memory usage: 4.5+ MB
In [44]:
data
Out[44]:
Date Admin1 Admin2 Market Latitude Longitude Price USD_Price Price_Type_Wholesale Price_Flag_actual,aggregate ... Commodity_Wheat Commodity_Yam Commodity_Yam (Abuja) Category_meat, fish and eggs Category_milk and dairy Category_miscellaneous food Category_non-food Category_oil and fats Category_pulses and nuts Category_vegetables and fruits
0 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 175.92 1.5525 1 0 ... 0 0 0 0 0 0 0 0 0 0
1 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 150.18 1.3254 1 0 ... 0 0 0 0 0 0 0 0 0 0
2 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 358.70 3.1656 1 0 ... 0 0 0 0 0 0 0 0 0 0
3 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 155.61 1.3733 1 0 ... 0 0 0 0 0 0 0 0 0 0
4 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 196.87 1.7374 1 0 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67027 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 298.55 0.6480 0 0 ... 0 0 0 0 0 0 0 0 1 0
67028 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 29440.00 63.8992 1 0 ... 0 0 0 0 0 0 0 0 1 0
67029 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 274.82 0.5965 0 0 ... 0 0 0 0 0 0 0 0 1 0
67030 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 46960.00 101.9261 1 0 ... 0 0 0 0 0 0 0 0 1 0
67031 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 467.66 1.0151 0 0 ... 0 0 0 0 0 0 0 0 1 0

67032 rows × 80 columns

In [45]:
data = data.drop(['Date', 'Admin1', 'Admin2', 'Market'], axis=1)
In [46]:
raw_data
Out[46]:
Date Admin1 Admin2 Market Latitude Longitude Category Commodity Unit Price_Flag Price_Type Price USD_Price
0 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Maize KG actual Wholesale 175.92 1.5525
1 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Millet KG actual Wholesale 150.18 1.3254
2 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Rice (imported) KG actual Wholesale 358.70 3.1656
3 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 cereals and tubers Sorghum KG actual Wholesale 155.61 1.3733
4 2002-01-15 Katsina Jibia Jibia (CBM) 13.08000 7.24000 pulses and nuts Beans (niebe) KG actual Wholesale 196.87 1.7374
... ... ... ... ... ... ... ... ... ... ... ... ... ...
67027 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Cowpeas (brown) KG aggregate Retail 298.55 0.6480
67028 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Cowpeas (white) 100 KG aggregate Wholesale 29440.00 63.8992
67029 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Cowpeas (white) KG aggregate Retail 274.82 0.5965
67030 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Groundnuts (shelled) 100 KG aggregate Wholesale 46960.00 101.9261
67031 2023-01-15 Zamfara Kaura Namoda Kaura Namoda 12.59519 6.58635 pulses and nuts Groundnuts (shelled) KG aggregate Retail 467.66 1.0151

67032 rows × 13 columns

In [47]:
data.duplicated()
Out[47]:
0        False
1        False
2        False
3        False
4        False
         ...  
67027    False
67028    False
67029    False
67030    False
67031    False
Length: 67032, dtype: bool

Co-relation¶

In [48]:
co_relation = data.corr()
In [49]:
plt.figure(figsize=(18, 8))  # Swapping width and height for a horizontal plot
sns.heatmap(co_relation, xticklabels=co_relation.columns, yticklabels=co_relation.columns, annot=True)
Out[49]:
<Axes: >

Data Spliting¶

In [50]:
y_variables = data[['Price']]
x_variables = data.drop(['Price','USD_Price'], axis=1)
In [51]:
x_variables
Out[51]:
Latitude Longitude Price_Type_Wholesale Price_Flag_actual,aggregate Price_Flag_aggregate Unit_1.1 KG Unit_1.2 KG Unit_1.3 KG Unit_1.4 KG Unit_1.5 KG ... Commodity_Wheat Commodity_Yam Commodity_Yam (Abuja) Category_meat, fish and eggs Category_milk and dairy Category_miscellaneous food Category_non-food Category_oil and fats Category_pulses and nuts Category_vegetables and fruits
0 13.08000 7.24000 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 13.08000 7.24000 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 13.08000 7.24000 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 13.08000 7.24000 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 13.08000 7.24000 1 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
67027 12.59519 6.58635 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
67028 12.59519 6.58635 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
67029 12.59519 6.58635 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
67030 12.59519 6.58635 1 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
67031 12.59519 6.58635 0 0 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0

67032 rows × 74 columns

In [52]:
# train, test split......
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
In [53]:
print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size :", (len(X_train)/len(x_variables))*100)
print("Train Size :", (len(X_test)/len(x_variables))*100)
X_train Size : 53625
Y_train Size : 53625
X_test Size : 13407
Y_test Size : 13407
Train Size : 79.99910490511995
Train Size : 20.000895094880057

Calculation of Multiple Linear Regression¶

Several circumstances that influence the dependent variable simultaneously can be controlled through multiple regression analysis. Regression analysis is a method of analyzing the relationship between independent variables and dependent variables.

Let $k$ represent the number of variables denoted by $x_1, x_2, x_3, ……, x_k$.

For this method, we assume that we have $k$ independent variables $x_1, . . . , x_k$ that we can set, then they probabilistically determine an outcome $Y$.

Furthermore, we assume that $Y$ is linearly dependent on the factors according to

$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + · · · + \beta_kx_k + \epsilon$$

The variable $y_i$ is dependent or predicted The slope of $y$ depends on the y-intercept, that is, when $x_i$ and $x_2$ are both zero, $y$ will be $\beta_0$. The regression coefficients $\beta_1$ and $\beta_2$ represent the change in $y$ as a result of one-unit changes in $x_{i1}$ and $x_{i2}$. $\beta_p$ refers to the slope coefficient of all independent variables $\epsilon$ term describes the random error (residual) in the model. Where $\epsilon$ is a standard error, this is just like we had for simple linear regression, except $k$ doesn’t have to be 1.

We have $n$ observations, $n$ typically being much more than $k$.

For $i$ th observation, we set the independent variables to the values $x_{i1}, x_{i2} . . . , x_{ik}$ and measure a value $y_i$ for the random variable $Y_i$.

Thus, the model can be described by the equations.

$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + · · · + \beta_kx_{ik} + i$$

for $i = 1, 2, . . . , n$,

Where the errors $i$ are independent standard variables, each with mean 0 and the same unknown variance $\sigma^2$.

Altogether the model for multiple linear regression has $k + 2$ unknown parameters:

$\beta_0, \beta_1, . . . , \beta_k$, and $\sigma^2$.

When $k$ was equal to 1, we found the least squares line $y = \hat{\beta}_0 +\hat{\beta}_1x$.

It was a line in the plane $R^2$.

Now, with $k ≥ 1$, we’ll have a least squares hyperplane.

$$y = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + · · · + \hat{\beta}_kx_k$$

in $R^{k+1}$.

The way to find the estimators $\hat{\beta}_0, \hat{\beta}_1, . . ., and \hat{\beta}_k$ is the same.

Take the partial derivatives of the squared error.

$$Q = \sum_{i=1}^{n} (y_i − (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + · · · + \beta_kx_{ik}))^2$$

When that system is solved we have fitted values

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \hat{\beta}_2x_{i2} + · · · + \hat{\beta}_kx_{ik}$$

for $i = 1, . . . , n$ that should be close to the actual values $y_i$

In [54]:
# Linear Rergression
regr = LinearRegression()
regr.fit(X_train, y_train)
Out[54]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [55]:
regr.coef_
Out[55]:
array([[ -496.37577043,    59.00419216,  7127.40673806, -1352.59510637,
         7550.79319011, -3778.135254  , -3440.99173812, -3791.11471381,
        -4587.58949543, -3372.19435216,  5538.41622035, 10824.82777782,
        29157.66962326, -1866.76080609,  -758.24411606, -1279.93091436,
        -7528.21499248, -7596.86213421, -2625.80215273, -1941.83515747,
          317.92836492,  3636.69856655, 11338.51910072, -9857.58587657,
         6735.37409185, -2575.2182841 , -1870.13105442, -1726.39911311,
        -1764.77844386, -2575.2182841 ,   606.46553692, -1870.37217076,
         3411.03963047,  2277.51264975, -2625.80215273, -2633.23666723,
        -3878.24010093, -4036.89809262, -1012.79963229, -1722.67623345,
         6714.54158886,  1337.60370884, -3583.79062706, -2379.76527505,
         1701.05276651,  4665.97254177,  4709.96026691,  -758.24411606,
         -344.69289164, -4390.94720729, -4379.87204189, -4090.90661068,
        -4218.01435362,  2395.96294133,  6096.6802472 ,  8903.42047086,
        -1279.93091435,  1782.27837774, -2246.98054402, -2905.38240885,
        -1941.83515747,  1689.62362447, -3922.10442867, -1866.76080609,
         1532.10846264,  5840.42898245, -7338.19608482,  4116.89398872,
         -758.24411606,   409.69271012, -7915.13819355, -8770.81924919,
         3448.7368535 ,  1659.70225979]])
In [56]:
len(regr.coef_)
Out[56]:
1
In [57]:
regr.intercept_
Out[57]:
array([7638.17668407])
In [58]:
print('Training data r-squared:', regr.score(X_train, y_train))
print('Test data r-squared:', regr.score(X_test, y_test))
Training data r-squared: 0.7633629459214317
Test data r-squared: 0.7691195969542567
In [59]:
pd.DataFrame(data = regr.coef_[0], index = X_train.columns, columns = ['Coef'])
Out[59]:
Coef
Latitude -496.375770
Longitude 59.004192
Price_Type_Wholesale 7127.406738
Price_Flag_actual,aggregate -1352.595106
Price_Flag_aggregate 7550.793190
... ...
Category_miscellaneous food 409.692710
Category_non-food -7915.138194
Category_oil and fats -8770.819249
Category_pulses and nuts 3448.736854
Category_vegetables and fruits 1659.702260

74 rows × 1 columns

Skewness¶

Skewness is a statistical measure that describes the asymmetry or departure from symmetry in a probability distribution. In the context of data analysis, skewness is often used to assess the distribution of a variable. The skewness of a distribution can be positive, negative, or zero: Skewness is a statistical measure that describes the asymmetry or departure from symmetry in a probability distribution. In the context of data analysis, skewness is often used to assess the distribution of a variable. The skewness of a distribution can be positive, negative, or zero:

  1. Positive Skewness (Right Skewed):

    • The right tail of the distribution is longer or fatter than the left tail.
    • The majority of the data points are concentrated on the left side, and the distribution extends further to the right.
  2. Negative Skewness (Left Skewed):

    • The left tail of the distribution is longer or fatter than the right tail.
    • The majority of the data points are concentrated on the right side, and the distribution extends further to the left.
  3. Zero Skewness:

    • The distribution is perfectly symmetrical.

The skewness of a variable (X) can be calculated using the formula:

$$ \text{Skewness} = \frac{1}{\sigma^3} E[(X-\mu)^3] $$

where:

  • $E$ denotes the expectation,
  • $X$ is a random variable,
  • $\mu$ is the mean of $X$, and
  • $\sigma$ is the standard deviation of $X$.

where (E) is the expectation operator, (\mu) is the mean of (X), and (\sigma) is the standard deviation of (X).

Now, regarding your question about applying skewness to (y) variables (target variables or dependent variables in a predictive modeling context), there are a couple of considerations:

  1. Assumption of Regression Models:

    • Many statistical and machine learning models assume that the residuals (the differences between the observed and predicted values) are normally distributed. Skewed distributions of the target variable might violate this assumption.
    • Transforming the target variable (applying transformations such as log transformation) can sometimes help in making the residuals more normally distributed, improving the performance of the model.
  2. Equalizing Variances:

    • In some cases, transforming the target variable can help equalize the variances across different levels of the predictor variables, which is a key assumption in the analysis of variance (ANOVA).
  3. Stabilizing Variances:

    • In financial modeling or time series analysis, transforming the target variable may be done to stabilize variances over time.

It's important to note that transforming the target variable should be done judiciously, and the choice of transformation depends on the characteristics of the data and the goals of the analysis. Common transformations include the logarithmic transformation, square root transformation, and Box-Cox transformation.

Always validate the impact of any transformation on your analysis, and consider the interpretability of the results after transformation. Additionally, be cautious about over-transforming the data, as it might lead to difficulties in interpreting the results.

In [60]:
plt.figure(figsize= (16,8))
plt.hist(data['Price'], bins =50, ec = 'black', color = 'blue', alpha = 0.7)# bins 50...setting color by  hexcode
plt.xlabel('Price')
plt.ylabel('Nr of Houses')
plt.show()
In [61]:
plt.figure(figsize= (16,8))
plt.hist(data['USD_Price'], bins =50, ec = 'black', color = 'blue', alpha = 0.7)# bins 50...setting color by  hexcode
plt.xlabel('Price')
plt.ylabel('Nr of Houses')
plt.show()
In [62]:
# The plot above has a skew.....Its not balanced by the middle value
# Lets find the skew..
data['USD_Price'].skew()
Out[62]:
27.614708422713022
In [63]:
data['Price'].skew()
Out[63]:
2.8713985830315463
In [64]:
y_log = np.log(data['Price'])
y_usd_log = np.log(data['USD_Price'])
In [65]:
y_log.skew()
Out[65]:
0.6090003213244045
In [66]:
y_usd_log.skew()
Out[66]:
0.6544660543802026
In [67]:
sns.set()
plt.figure(figsize=(20,6))
sns.histplot(y_log, kde=True)
plt.show()
In [68]:
plt.figure(figsize=(20,6))
sns.displot(y_usd_log,  kind="kde")
Out[68]:
<seaborn.axisgrid.FacetGrid at 0x2275db7f6a0>
<Figure size 2000x600 with 0 Axes>
In [69]:
sns.set()
plt.figure(figsize=(20,6))
sns.histplot(y_usd_log, kde=True)
plt.show()
In [70]:
plt.figure(figsize=(20,6))
sns.displot(y_usd_log,  kind="kde")
Out[70]:
<seaborn.axisgrid.FacetGrid at 0x2275db0c430>
<Figure size 2000x600 with 0 Axes>
In [71]:
print('Intercept', regr.intercept_)# intercept after log transformation of prices
Intercept [7638.17668407]
In [72]:
# using the stats model to get the p value...
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.params
Out[72]:
const                             6992.523453
Latitude                          -496.375770
Longitude                           59.004192
Price_Type_Wholesale              7127.406738
Price_Flag_actual,aggregate      -1352.595106
                                     ...     
Category_miscellaneous food        662.114666
Category_non-food                -7709.652678
Category_oil and fats            -8565.333733
Category_pulses and nuts          3722.717541
Category_vegetables and fruits    1967.930534
Length: 75, dtype: float64
In [73]:
data['Price'] = y_log
In [74]:
x_variables = data.drop(['Price','USD_Price'], axis=1)
y_variables = data[['Price']]
In [75]:
# train, test split......
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
In [76]:
regr = LinearRegression()
regr.fit(X_train, y_train)
Out[76]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [77]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)
Out[77]:
LinearRegression()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
In [78]:
print('Training data r-squared:', regr.score(X_train, y_train))
print('Test data r-squared:', regr.score(X_test, y_test))
Training data r-squared: 0.955473671807045
Test data r-squared: 0.9530163840929582
In [79]:
r_squared =  regr.score(X_test, y_test)
In [80]:
1/(1 - r_squared)
Out[80]:
21.28401530394177
In [81]:
pd.DataFrame(data=regr.coef_[0], index=X_train.columns, columns=['coef'])
Out[81]:
coef
Latitude -0.043979
Longitude 0.005610
Price_Type_Wholesale 0.695027
Price_Flag_actual,aggregate -0.024670
Price_Flag_aggregate 0.609545
... ...
Category_miscellaneous food 0.168125
Category_non-food -1.914440
Category_oil and fats -1.093030
Category_pulses and nuts 0.453877
Category_vegetables and fruits -0.195025

74 rows × 1 columns

In [82]:
print('Intercept', regr.intercept_)# intercept after log transformation of prices
Intercept [6.68274382]
In [83]:
# using the stats model to get the p value...
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.params
Out[83]:
const                             6.117853
Latitude                         -0.043979
Longitude                         0.005610
Price_Type_Wholesale              0.695027
Price_Flag_actual,aggregate      -0.024670
                                    ...   
Category_miscellaneous food       0.388973
Category_non-food                -1.734658
Category_oil and fats            -0.913248
Category_pulses and nuts          0.693586
Category_vegetables and fruits    0.074648
Length: 75, dtype: float64
In [84]:
results.pvalues
Out[84]:
const                              0.000000e+00
Latitude                          5.864980e-250
Longitude                          1.532625e-11
Price_Type_Wholesale               0.000000e+00
Price_Flag_actual,aggregate        2.700965e-01
                                      ...      
Category_miscellaneous food       6.518373e-260
Category_non-food                  1.576027e-85
Category_oil and fats              2.739426e-25
Category_pulses and nuts           0.000000e+00
Category_vegetables and fruits     3.880508e-03
Length: 75, dtype: float64
In [85]:
round(results.pvalues,3)
Out[85]:
const                             0.000
Latitude                          0.000
Longitude                         0.000
Price_Type_Wholesale              0.000
Price_Flag_actual,aggregate       0.270
                                  ...  
Category_miscellaneous food       0.000
Category_non-food                 0.000
Category_oil and fats             0.000
Category_pulses and nuts          0.000
Category_vegetables and fruits    0.004
Length: 75, dtype: float64
In [86]:
X = X_train.assign(const=1)
vif = pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)
C:\Users\bhara\anaconda3\envs\base-env\lib\site-packages\statsmodels\stats\outliers_influence.py:198: RuntimeWarning: divide by zero encountered in scalar divide
  vif = 1. / (1. - r_squared_i)
In [87]:
vif
Out[87]:
Latitude                           1.425167
Longitude                          1.767381
Price_Type_Wholesale              12.763579
Price_Flag_actual,aggregate        1.020677
Price_Flag_aggregate               1.630952
                                    ...    
Category_non-food                       inf
Category_oil and fats                   inf
Category_pulses and nuts                inf
Category_vegetables and fruits          inf
const                                   inf
Length: 75, dtype: float64
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: