Multiple Linear Regression¶

Predicting Food Prices in Nigreia¶

The Economics of Eating: Predicting Food Price Trends in Nigeria
This dataset contains Food Prices data for Nigeria, sourced from the World Food Programme Price Database. The World Food Programme Price Database covers foods such as maize, rice, beans, fish, and sugar for 98 countries and some 3000 markets. It is updated weekly but contains to a large extent monthly data. The data goes back as far as 1992 for a few countries, although many countries started reporting from 2003 or thereafter.

Exploratory Data Analysis: Unveiling Patterns and Insights¶

1. Introduction
    Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.

2. Objectives of Exploratory Data Analysis
    1. Understand Data Characteristics:
        Gain insights into the distribution, central tendency, and variability of the data.
        Identify the presence of missing values, outliers, and anomalies.

    2. Explore Relationships:
        Examine correlations and dependencies between different variables.
        Uncover potential patterns and trends within the dataset.

    3. Visualize Data Distributions:
        Utilize graphical representations to visualize the distribution of data.
        Choose appropriate plots such as histograms, box plots, and scatter plots.

    4. Identify Patterns and Anomalies:
        Uncover hidden patterns that may not be apparent in raw data.
        Detect outliers and anomalies that could impact analysis outcomes.


3. Techniques and Tools
    1. Descriptive Statistics:
        Calculate measures such as mean, median, and standard deviation.
        Utilize summary statistics to provide an overview of the dataset.
        Data Visualization:

        Employ graphical representations like histograms, box plots, and scatter plots.
        Create visualizations to illustrate trends, patterns, and relationships.
        Correlation Analysis:

        Use correlation matrices to quantify the relationships between variables.
        Identify strong positive/negative correlations and potential multicollinearity.
        Outlier Detection:

        Apply statistical methods or visual inspection to identify outliers.
        Assess the impact of outliers on the analysis and consider appropriate handling.

4. Steps in Exploratory Data Analysis
    1. Data Collection:
        Gather the raw dataset from reliable sources.

    2. Data Cleaning:
        Handle missing values, duplicate entries, and inconsistencies.
        Ensure data is in a suitable format for analysis.

    3. Descriptive Statistics:
        Compute basic statistics to describe the central tendency and dispersion.

    4. Visualization:
        Generate visualizations to explore data distributions and relationships.

    5. Correlation Analysis:
        Investigate correlations between variables.

    6. Outlier Detection:
        Identify and analyze outliers to understand their impact.

5. Case Study: Applying EDA to Real-World Data
    Provide a practical example where EDA is applied to a specific dataset, showcasing the step-by-step process and the insights gained.

6. Conclusion
    Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.

7. References
    Include references to any tools, libraries, or methodologies used in the EDA process.

Import Libraries¶

In [1]:

import pandas as pd
import numpy as np

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Documentation
import handcalcs.render

# Plot
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm # color map
import seaborn as sns
import plotly.express as px


from sympy import Sum, symbols, Indexed, lambdify, diff
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

In [2]:

# Path
data_path = './Data/'

Import Dataset¶

In [3]:

raw_data = pd.read_csv(data_path+"wfp_food_prices_nga.csv",  low_memory=False).reset_index(drop=True)
raw_data.shape

Out[3]:

(80982, 14)

In [4]:

raw_data

Out[4]:

	date	admin1	admin2	market	latitude	longitude	category	commodity	unit	priceflag	pricetype	currency	price	usdprice
0	#date	#adm1+name	#adm2+name	#loc+market+name	#geo+lat	#geo+lon	#item+type	#item+name	#item+unit	#item+price+flag	#item+price+type	#currency	#value	#value+usd
1	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Maize	KG	actual	Wholesale	NGN	175.92	1.5525
2	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Millet	KG	actual	Wholesale	NGN	150.18	1.3254
3	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Rice (imported)	KG	actual	Wholesale	NGN	358.7	3.1656
4	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Sorghum	KG	actual	Wholesale	NGN	155.61	1.3733
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
80977	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Onions	400 G	forecast	Retail	NGN	0.0	0.0
80978	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Oranges	400 G	forecast	Retail	NGN	0.0	0.0
80979	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Spinach	300 G	forecast	Retail	NGN	0.0	0.0
80980	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Tomatoes	0.5 KG	forecast	Retail	NGN	0.0	0.0
80981	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Watermelons	2.1 KG	forecast	Retail	NGN	0.0	0.0

80982 rows × 14 columns

In [5]:

column_info = raw_data.iloc[0]
column_info

Out[5]:

date                    #date
admin1             #adm1+name
admin2             #adm2+name
market       #loc+market+name
latitude             #geo+lat
longitude            #geo+lon
category           #item+type
commodity          #item+name
unit               #item+unit
priceflag    #item+price+flag
pricetype    #item+price+type
currency            #currency
price                  #value
usdprice           #value+usd
Name: 0, dtype: object

In [6]:

raw_data = raw_data.drop(0).reset_index(drop=True)
raw_data

Out[6]:

	date	admin1	admin2	market	latitude	longitude	category	commodity	unit	priceflag	pricetype	currency	price	usdprice
0	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Maize	KG	actual	Wholesale	NGN	175.92	1.5525
1	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Millet	KG	actual	Wholesale	NGN	150.18	1.3254
2	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Rice (imported)	KG	actual	Wholesale	NGN	358.7	3.1656
3	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	cereals and tubers	Sorghum	KG	actual	Wholesale	NGN	155.61	1.3733
4	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08	7.24	pulses and nuts	Beans (niebe)	KG	actual	Wholesale	NGN	196.87	1.7374
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
80976	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Onions	400 G	forecast	Retail	NGN	0.0	0.0
80977	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Oranges	400 G	forecast	Retail	NGN	0.0	0.0
80978	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Spinach	300 G	forecast	Retail	NGN	0.0	0.0
80979	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Tomatoes	0.5 KG	forecast	Retail	NGN	0.0	0.0
80980	2023-12-15	Yobe	Yusufari	Yusufari	13.062664985656738	11.171370506286621	vegetables and fruits	Watermelons	2.1 KG	forecast	Retail	NGN	0.0	0.0

80981 rows × 14 columns

In [7]:

# Rename the columns
raw_data.rename(columns={ 'date': 'Date', 'admin1': 'Admin1', 'admin2': 'Admin2', 'market': 'Market', 'latitude': 'Latitude', 'longitude': 'Longitude', 'category': 'Category', 
                     'commodity': 'Commodity', 'unit': 'Unit', 'priceflag': 'Price_Flag', 'pricetype': 'Price_Type', 'currency': 'Currency', 'price': 'Price', 'usdprice':'USD_Price' }, inplace=True)
raw_data.columns

Out[7]:

Index(['Date', 'Admin1', 'Admin2', 'Market', 'Latitude', 'Longitude',
       'Category', 'Commodity', 'Unit', 'Price_Flag', 'Price_Type', 'Currency',
       'Price', 'USD_Price'],
      dtype='object')

In [8]:

raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80981 entries, 0 to 80980
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Date        80981 non-null  object
 1   Admin1      80981 non-null  object
 2   Admin2      80981 non-null  object
 3   Market      80981 non-null  object
 4   Latitude    80981 non-null  object
 5   Longitude   80981 non-null  object
 6   Category    80981 non-null  object
 7   Commodity   80981 non-null  object
 8   Unit        80981 non-null  object
 9   Price_Flag  80981 non-null  object
 10  Price_Type  80981 non-null  object
 11  Currency    80981 non-null  object
 12  Price       80981 non-null  object
 13  USD_Price   80981 non-null  object
dtypes: object(14)
memory usage: 8.6+ MB

In [9]:

raw_data['Date'] = raw_data['Date'].astype('datetime64[ns]')
raw_data['Latitude'] = raw_data['Latitude'].astype(float)
raw_data['Longitude'] = raw_data['Longitude'].astype(float)
raw_data['Price'] = raw_data['Price'].astype(float)
raw_data['USD_Price'] = raw_data['USD_Price'].astype(float)

In [10]:

raw_data

Out[10]:

	Date	Admin1	Admin2	Market	Latitude	Longitude	Category	Commodity	Unit	Price_Flag	Price_Type	Currency	Price	USD_Price
0	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.080000	7.240000	cereals and tubers	Maize	KG	actual	Wholesale	NGN	175.92	1.5525
1	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.080000	7.240000	cereals and tubers	Millet	KG	actual	Wholesale	NGN	150.18	1.3254
2	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.080000	7.240000	cereals and tubers	Rice (imported)	KG	actual	Wholesale	NGN	358.70	3.1656
3	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.080000	7.240000	cereals and tubers	Sorghum	KG	actual	Wholesale	NGN	155.61	1.3733
4	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.080000	7.240000	pulses and nuts	Beans (niebe)	KG	actual	Wholesale	NGN	196.87	1.7374
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
80976	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Onions	400 G	forecast	Retail	NGN	0.00	0.0000
80977	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Oranges	400 G	forecast	Retail	NGN	0.00	0.0000
80978	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Spinach	300 G	forecast	Retail	NGN	0.00	0.0000
80979	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Tomatoes	0.5 KG	forecast	Retail	NGN	0.00	0.0000
80980	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Watermelons	2.1 KG	forecast	Retail	NGN	0.00	0.0000

80981 rows × 14 columns

In [11]:

raw_data.describe()

Out[11]:

	Latitude	Longitude	Price	USD_Price
count	80981.000000	80981.000000	80981.000000	80981.000000
mean	11.396802	10.278874	4887.086852	29.488854
std	1.699392	2.926734	10640.517897	329.652934
min	5.147640	3.401897	0.000000	0.000000
25%	11.280160	8.429120	100.000000	0.279300
50%	11.831100	11.166670	314.280000	0.968800
75%	12.362300	13.092646	1254.000000	3.909100
max	13.645000	13.264360	138000.000000	28662.419400

In [12]:

raw_data.nunique()

Out[12]:

Date            264
Admin1           14
Admin2           31
Market           40
Latitude         40
Longitude        40
Category          8
Commodity        42
Unit             22
Price_Flag        4
Price_Type        2
Currency          1
Price         20163
USD_Price     29982
dtype: int64

Finding for missing values¶

In [13]:

raw_data.isnull().sum()

Out[13]:

Date          0
Admin1        0
Admin2        0
Market        0
Latitude      0
Longitude     0
Category      0
Commodity     0
Unit          0
Price_Flag    0
Price_Type    0
Currency      0
Price         0
USD_Price     0
dtype: int64

In [14]:

for column in raw_data.columns:
    if column not in ['Date', 'Latitude', 'Longitude', 'Price', 'USD_Price']:
        print("-------------------------------------------------",column," - ",len(raw_data[column].unique()),"---------------------------------------------------")
        print(raw_data[column].unique())
        print("--------------------------------------------------------------------------------------------------------------")
        

------------------------------------------------- Admin1  -  14 ---------------------------------------------------
['Katsina' 'Sokoto' 'Borno' 'Kano' 'Jigawa' 'Oyo' 'Lagos' 'Kaduna'
 'Zamfara' 'Abia' 'Gombe' 'Kebbi' 'Adamawa' 'Yobe']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Admin2  -  31 ---------------------------------------------------
['Jibia' 'Gada' "Mai'Adua" 'Mobbar' 'Dawakin Tofa' 'Maigatari'
 'Ibadan North' 'Maiduguri' 'Kosofe' 'Giwa' 'Kaura Namoda' 'Oboma Ngwa'
 'Akko' 'Kaugama' 'Lere' 'Dandume' 'Gwandu' 'Hong' 'Biu' 'Damaturu'
 'Potiskum' 'Konduga' 'Geidam' 'Gujba' 'Jakusko' 'Karasuwa' 'Nguru'
 'Yunusari' 'Yusufari' 'Borsari' 'Gulani']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Market  -  40 ---------------------------------------------------
['Jibia (CBM)' 'Illela (CBM)' 'Mai Adoua (CBM)' 'Damassack (CBM)'
 'Dawanau' 'Mai Gatari (CBM)' 'Ibadan' 'Maiduguri' 'Lagos' 'Giwa'
 'Kaura Namoda' 'Aba' 'Gombe' 'Gujungu' 'Saminaka' 'Dandume' 'Gwandu'
 'Mubi' 'Biu' 'Damaturu' 'Potiskum' 'Abba Gamaram' 'Baga Road'
 'Bullunkutu' 'Budum' 'Custom' 'Kusawam Shanu' 'Monday' 'Tashan Bama'
 'Bolori Stores' 'Damaturu (Sunday Market)' 'Geidam' 'Gujba (Buni Yadi)'
 'Jakusko' 'Bade (Gashua)' 'Nguru' 'Yunusari' 'Yusufari' 'Bursari'
 'Gulani (Tettaba)']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Category  -  8 ---------------------------------------------------
['cereals and tubers' 'pulses and nuts' 'non-food' 'oil and fats'
 'meat, fish and eggs' 'milk and dairy' 'vegetables and fruits'
 'miscellaneous food']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Commodity  -  42 ---------------------------------------------------
['Maize' 'Millet' 'Rice (imported)' 'Sorghum' 'Beans (niebe)' 'Wheat'
 'Maize (white)' 'Sorghum (white)' 'Rice (milled, local)' 'Bread'
 'Cassava meal (gari, yellow)' 'Gari (white)' 'Maize (yellow)'
 'Rice (local)' 'Sorghum (brown)' 'Yam (Abuja)' 'Fuel (diesel)'
 'Fuel (petrol-gasoline)' 'Oil (palm)' 'Cowpeas (brown)' 'Cowpeas (white)'
 'Yam' 'Groundnuts (shelled)' 'Maize flour' 'Meat (beef)' 'Meat (goat)'
 'Milk' 'Oil (vegetable)' 'Beans (red)' 'Beans (white)' 'Groundnuts'
 'Onions' 'Fish' 'Eggs' 'Bananas' 'Oranges' 'Spinach' 'Watermelons'
 'Cowpeas' 'Tomatoes' 'Salt' 'Sugar']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Unit  -  22 ---------------------------------------------------
['KG' '100 KG' '50 KG' 'Unit' 'L' '100 L' '1.3 KG' '1.4 KG' '3.4 KG'
 '20 G' '750 ML' '1.1 KG' '0.5 KG' '1.5 KG' '30 pcs' '400 G' '300 G'
 '2.1 KG' '3.1 KG' '1.2 KG' '250 G' '100 Tubers']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Price_Flag  -  4 ---------------------------------------------------
['actual' 'actual,aggregate' 'aggregate' 'forecast']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Price_Type  -  2 ---------------------------------------------------
['Wholesale' 'Retail']
--------------------------------------------------------------------------------------------------------------
------------------------------------------------- Currency  -  1 ---------------------------------------------------
['NGN']
--------------------------------------------------------------------------------------------------------------

In [15]:

raw_data.Category.value_counts()

Out[15]:

cereals and tubers       39537
pulses and nuts          13841
vegetables and fruits     8794
meat, fish and eggs       6177
oil and fats              4928
non-food                  3436
miscellaneous food        2721
milk and dairy            1547
Name: Category, dtype: int64

In [16]:

raw_data.Commodity.value_counts()

Out[16]:

Millet                         4972
Rice (imported)                4537
Rice (local)                   4193
Yam                            4002
Oil (palm)                     3368
Maize (white)                  2946
Sorghum (white)                2904
Cowpeas (white)                2699
Groundnuts (shelled)           2663
Sorghum (brown)                2660
Gari (white)                   2599
Maize (yellow)                 2560
Cowpeas (brown)                2526
Cassava meal (gari, yellow)    2310
Sorghum                        2109
Fuel (petrol-gasoline)         1721
Fuel (diesel)                  1715
Groundnuts                     1561
Oil (vegetable)                1560
Maize flour                    1557
Beans (red)                    1554
Meat (beef)                    1552
Meat (goat)                    1552
Milk                           1547
Beans (white)                  1547
Onions                         1546
Fish                           1543
Eggs                           1530
Watermelons                    1521
Oranges                        1520
Bananas                        1505
Tomatoes                       1415
Sugar                          1369
Salt                           1352
Spinach                        1287
Cowpeas                        1267
Bread                          1035
Maize                           601
Rice (milled, local)            394
Yam (Abuja)                      85
Wheat                            73
Beans (niebe)                    24
Name: Commodity, dtype: int64

In [17]:

raw_data['Unit'].value_counts()

Out[17]:

100 KG        19125
KG            18740
1.3 KG         7056
L              5135
1.4 KG         5072
1.2 KG         4335
750 ML         3120
400 G          2387
0.5 KG         2094
50 KG          1965
20 G           1547
30 pcs         1530
2.1 KG         1521
250 G          1352
300 G          1287
Unit           1035
3.1 KG          867
1.1 KG          694
3.4 KG          680
1.5 KG          676
100 Tubers      666
100 L            97
Name: Unit, dtype: int64

In [18]:

raw_data['Price_Flag'].value_counts()

Out[18]:

actual              51969
aggregate           14594
forecast            13949
actual,aggregate      469
Name: Price_Flag, dtype: int64

In [19]:

raw_data['Price_Type'].value_counts()

Out[19]:

Retail       56111
Wholesale    24870
Name: Price_Type, dtype: int64

In [20]:

# Removing Outlier column
raw_data.drop('Currency',  axis=1, inplace=True)

Data Exploration¶

In [21]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))  # You can adjust the width and height as needed

# Choose a color palette with fewer colors
palette = sns.color_palette("Set1", len(raw_data['Category'].unique()))

# Get the top 8 categories (you can adjust the number as needed)
top_categories = raw_data['Category'].value_counts().index

# Filter the data to include only the top categories
filtered_data = raw_data[raw_data['Category'].isin(top_categories)]

# Creating the bar plot with different colors for each category
sns.barplot(x='Category', y='USD_Price', hue='Category', data=filtered_data, palette=palette)

# Adding titles and labels
plt.title('Bar Plot of USD Price by Category')
plt.xlabel('Category')
plt.ylabel('USD Price')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Category')

# Show the plot
plt.show()

In [22]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(10, 18))  # Swapping width and height for a horizontal plot

# Get the top categories
top_categories = raw_data['Commodity'].value_counts().index

# Filter the data to include only the top categories
filtered_data = raw_data[raw_data['Commodity'].isin(top_categories)]

# Creating the horizontal bar plot with different colors for each category
sns.barplot(x='USD_Price', y='Commodity', hue='Commodity', data=filtered_data)

# Adding titles and labels
plt.title('Plot of USD Price by Commodity')
plt.xlabel('USD Price')
plt.ylabel('Commodity')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=sns.color_palette()[i % len(sns.color_palette())], markersize=10) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Commodity')

# Show the plot
plt.show()

In [23]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(14, 5))

# Choose a color palette with fewer colors
palette = sns.color_palette("Set1", len(raw_data['Price_Type'].unique()))

# Get Price Type 
top_price_type = raw_data['Price_Type'].value_counts().index

# Filter the data to include only the top Price Type
filtered_data = raw_data[raw_data['Price_Type'].isin(top_price_type)]

# Creating the bar plot with different colors for each Price Type
ax = sns.barplot(x='Price_Type', y='USD_Price', hue='Price_Type', data=filtered_data, palette=palette)

# Adding titles and labels
plt.title('USD Price by Price Type')
plt.xlabel('Price Type')
plt.ylabel('USD Price')

# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_price_type)]
plt.legend(legend_handles, top_price_type, title='Price Type')

# Adding annotations on top of the bars
for p in ax.patches:
    ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center', xytext=(0, 10), textcoords='offset points')

# Show the plot
plt.show()

In [24]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

# Creating a histogram with 20 bins
sns.histplot(raw_data["Price"], kde=True, bins=10,
    stat="density", kde_kws=dict(cut=3),
    alpha=.4, edgecolor=(1, 1, 1, .4),)

# Adding titles and labels
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In [25]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

# Creating a histogram with 20 bins
sns.histplot(raw_data["USD_Price"], kde=True, bins=10,
    stat="density", kde_kws=dict(cut=3),
    alpha=.4, edgecolor=(1, 1, 1, .4),)

# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In [26]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))

sns.kdeplot(raw_data["USD_Price"])

# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In [27]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8))  # Swapping width and height for a horizontal plot
sns.scatterplot(x=raw_data['Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of NGN Price vs Commodity')
plt.xlabel('Commodity')
plt.ylabel('NGN Price')
plt.show()

In [28]:

# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8))  # Swapping width and height for a horizontal plot
sns.scatterplot(x=raw_data['USD_Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of USD Price vs Commodity')
plt.xlabel('X1')
plt.ylabel('y')
plt.show()

In [29]:

# sns.pairplot(raw_data)
# plt.show()

In [30]:

# sns.pairplot(raw_data, kind = 'reg' , plot_kws = {'line_kws':{'color': 'red'}})
# plt.show()

Encoding categorical data¶

Encoding categorical data is a crucial step in preparing data for machine learning models, as many algorithms require numerical input. Categorical data represents variables that can take on a limited, and usually fixed, number of values. There are several common techniques for encoding categorical data:

Label Encoding:
- Assigns a unique integer to each category.
- Suitable for ordinal data where the order matters.
- Sklearn provides LabelEncoder for this purpose.
One-Hot Encoding:
- Creates binary columns for each category and represents the presence of a category with a 1.
- Suitable for nominal data where there is no inherent order.
Ordinal Encoding:
- Manually assign numerical values based on the order of categories.
- Useful when there is an inherent order among categories.
Binary Encoding:
- Converts categories into binary code.
- Reduces the number of columns compared to one-hot encoding.
Hashing Encoding:
- Converts categories into a fixed-size hash, useful when dealing with high cardinality.

In [31]:

pd.get_dummies(raw_data['Category'])

Out[31]:

	cereals and tubers	meat, fish and eggs	milk and dairy	miscellaneous food	non-food	oil and fats	pulses and nuts	vegetables and fruits
0	1	0	0	0	0	0	0	0
1	1	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0
3	1	0	0	0	0	0	0	0
4	0	0	0	0	0	0	1	0
...	...	...	...	...	...	...	...	...
80976	0	0	0	0	0	0	0	1
80977	0	0	0	0	0	0	0	1
80978	0	0	0	0	0	0	0	1
80979	0	0	0	0	0	0	0	1
80980	0	0	0	0	0	0	0	1

80981 rows × 8 columns

In [32]:

pd.get_dummies(raw_data['Price_Flag'])

Out[32]:

	actual	actual,aggregate	aggregate	forecast
0	1	0	0	0
1	1	0	0	0
2	1	0	0	0
3	1	0	0	0
4	1	0	0	0
...	...	...	...	...
80976	0	0	0	1
80977	0	0	0	1
80978	0	0	0	1
80979	0	0	0	1
80980	0	0	0	1

80981 rows × 4 columns

In [33]:

pd.get_dummies(raw_data['Price_Type'])

Out[33]:

	Retail	Wholesale
0	0	1
1	0	1
2	0	1
3	0	1
4	0	1
...	...	...
80976	1	0
80977	1	0
80978	1	0
80979	1	0
80980	1	0

80981 rows × 2 columns

In [34]:

pd.get_dummies(raw_data['Unit'])

Out[34]:

	0.5 KG	1.1 KG	1.2 KG	1.3 KG	1.4 KG	1.5 KG	100 KG	100 L	100 Tubers	2.1 KG	...	3.1 KG	3.4 KG	30 pcs	300 G	400 G	50 KG	750 ML	KG	L	Unit
0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
2	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
4	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
80976	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0
80977	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	1	0	0	0	0	0
80978	0	0	0	0	0	0	0	0	0	0	...	0	0	0	1	0	0	0	0	0	0
80979	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
80980	0	0	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0

80981 rows × 22 columns

In [35]:

# This data has been seperated as test data because the price has to be predicted
test_data = raw_data[raw_data['Price_Flag'] == 'forecast'].reset_index(drop=True)
test_data

Out[35]:

	Date	Admin1	Admin2	Market	Latitude	Longitude	Category	Commodity	Unit	Price_Flag	Price_Type	Price	USD_Price
0	2006-01-15	Sokoto	Gada	Illela (CBM)	13.645000	5.278000	cereals and tubers	Wheat	KG	forecast	Wholesale	0.0	0.0
1	2006-02-15	Sokoto	Gada	Illela (CBM)	13.645000	5.278000	cereals and tubers	Wheat	KG	forecast	Wholesale	0.0	0.0
2	2006-03-15	Katsina	Mai'Adua	Mai Adoua (CBM)	13.180000	8.230000	cereals and tubers	Wheat	KG	forecast	Wholesale	0.0	0.0
3	2006-03-15	Sokoto	Gada	Illela (CBM)	13.645000	5.278000	cereals and tubers	Wheat	KG	forecast	Wholesale	0.0	0.0
4	2006-04-15	Katsina	Mai'Adua	Mai Adoua (CBM)	13.180000	8.230000	cereals and tubers	Wheat	KG	forecast	Wholesale	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
13944	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Onions	400 G	forecast	Retail	0.0	0.0
13945	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Oranges	400 G	forecast	Retail	0.0	0.0
13946	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Spinach	300 G	forecast	Retail	0.0	0.0
13947	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Tomatoes	0.5 KG	forecast	Retail	0.0	0.0
13948	2023-12-15	Yobe	Yusufari	Yusufari	13.062665	11.171371	vegetables and fruits	Watermelons	2.1 KG	forecast	Retail	0.0	0.0

13949 rows × 13 columns

In [36]:

raw_data = raw_data[raw_data['Price_Flag'] != 'forecast'].reset_index(drop=True)
raw_data

Out[36]:

	Date	Admin1	Admin2	Market	Latitude	Longitude	Category	Commodity	Unit	Price_Flag	Price_Type	Price	USD_Price
0	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Maize	KG	actual	Wholesale	175.92	1.5525
1	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Millet	KG	actual	Wholesale	150.18	1.3254
2	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Rice (imported)	KG	actual	Wholesale	358.70	3.1656
3	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Sorghum	KG	actual	Wholesale	155.61	1.3733
4	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	pulses and nuts	Beans (niebe)	KG	actual	Wholesale	196.87	1.7374
...	...	...	...	...	...	...	...	...	...	...	...	...	...
67027	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Cowpeas (brown)	KG	aggregate	Retail	298.55	0.6480
67028	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Cowpeas (white)	100 KG	aggregate	Wholesale	29440.00	63.8992
67029	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Cowpeas (white)	KG	aggregate	Retail	274.82	0.5965
67030	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Groundnuts (shelled)	100 KG	aggregate	Wholesale	46960.00	101.9261
67031	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Groundnuts (shelled)	KG	aggregate	Retail	467.66	1.0151

67032 rows × 13 columns

One-Hot Encoding¶

In [37]:

encoder = OneHotEncoder(drop='first')

In [38]:

def encoding_categorical_data(df, column):
    df = pd.concat([df, pd.DataFrame(encoder.fit_transform(df[[column]]).toarray(), columns=encoder.get_feature_names_out([column])).astype(int)],  axis=1) 
    # Removing un necessary column
    df.drop(column,  axis=1, inplace=True)
    return df

In [39]:

data = raw_data.copy()
for col in ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']:
    data = encoding_categorical_data(data, col)
    
for col in ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']:
    test_data = encoding_categorical_data(test_data, col)

In [40]:

data.shape

Out[40]:

(67032, 80)

In [41]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67032 entries, 0 to 67031
Data columns (total 80 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Date                                   67032 non-null  datetime64[ns]
 1   Admin1                                 67032 non-null  object        
 2   Admin2                                 67032 non-null  object        
 3   Market                                 67032 non-null  object        
 4   Latitude                               67032 non-null  float64       
 5   Longitude                              67032 non-null  float64       
 6   Price                                  67032 non-null  float64       
 7   USD_Price                              67032 non-null  float64       
 8   Price_Type_Wholesale                   67032 non-null  int32         
 9   Price_Flag_actual,aggregate            67032 non-null  int32         
 10  Price_Flag_aggregate                   67032 non-null  int32         
 11  Unit_1.1 KG                            67032 non-null  int32         
 12  Unit_1.2 KG                            67032 non-null  int32         
 13  Unit_1.3 KG                            67032 non-null  int32         
 14  Unit_1.4 KG                            67032 non-null  int32         
 15  Unit_1.5 KG                            67032 non-null  int32         
 16  Unit_100 KG                            67032 non-null  int32         
 17  Unit_100 L                             67032 non-null  int32         
 18  Unit_100 Tubers                        67032 non-null  int32         
 19  Unit_2.1 KG                            67032 non-null  int32         
 20  Unit_20 G                              67032 non-null  int32         
 21  Unit_250 G                             67032 non-null  int32         
 22  Unit_3.1 KG                            67032 non-null  int32         
 23  Unit_3.4 KG                            67032 non-null  int32         
 24  Unit_30 pcs                            67032 non-null  int32         
 25  Unit_300 G                             67032 non-null  int32         
 26  Unit_400 G                             67032 non-null  int32         
 27  Unit_50 KG                             67032 non-null  int32         
 28  Unit_750 ML                            67032 non-null  int32         
 29  Unit_KG                                67032 non-null  int32         
 30  Unit_L                                 67032 non-null  int32         
 31  Unit_Unit                              67032 non-null  int32         
 32  Commodity_Beans (niebe)                67032 non-null  int32         
 33  Commodity_Beans (red)                  67032 non-null  int32         
 34  Commodity_Beans (white)                67032 non-null  int32         
 35  Commodity_Bread                        67032 non-null  int32         
 36  Commodity_Cassava meal (gari, yellow)  67032 non-null  int32         
 37  Commodity_Cowpeas                      67032 non-null  int32         
 38  Commodity_Cowpeas (brown)              67032 non-null  int32         
 39  Commodity_Cowpeas (white)              67032 non-null  int32         
 40  Commodity_Eggs                         67032 non-null  int32         
 41  Commodity_Fish                         67032 non-null  int32         
 42  Commodity_Fuel (diesel)                67032 non-null  int32         
 43  Commodity_Fuel (petrol-gasoline)       67032 non-null  int32         
 44  Commodity_Gari (white)                 67032 non-null  int32         
 45  Commodity_Groundnuts                   67032 non-null  int32         
 46  Commodity_Groundnuts (shelled)         67032 non-null  int32         
 47  Commodity_Maize                        67032 non-null  int32         
 48  Commodity_Maize (white)                67032 non-null  int32         
 49  Commodity_Maize (yellow)               67032 non-null  int32         
 50  Commodity_Maize flour                  67032 non-null  int32         
 51  Commodity_Meat (beef)                  67032 non-null  int32         
 52  Commodity_Meat (goat)                  67032 non-null  int32         
 53  Commodity_Milk                         67032 non-null  int32         
 54  Commodity_Millet                       67032 non-null  int32         
 55  Commodity_Oil (palm)                   67032 non-null  int32         
 56  Commodity_Oil (vegetable)              67032 non-null  int32         
 57  Commodity_Onions                       67032 non-null  int32         
 58  Commodity_Oranges                      67032 non-null  int32         
 59  Commodity_Rice (imported)              67032 non-null  int32         
 60  Commodity_Rice (local)                 67032 non-null  int32         
 61  Commodity_Rice (milled, local)         67032 non-null  int32         
 62  Commodity_Salt                         67032 non-null  int32         
 63  Commodity_Sorghum                      67032 non-null  int32         
 64  Commodity_Sorghum (brown)              67032 non-null  int32         
 65  Commodity_Sorghum (white)              67032 non-null  int32         
 66  Commodity_Spinach                      67032 non-null  int32         
 67  Commodity_Sugar                        67032 non-null  int32         
 68  Commodity_Tomatoes                     67032 non-null  int32         
 69  Commodity_Watermelons                  67032 non-null  int32         
 70  Commodity_Wheat                        67032 non-null  int32         
 71  Commodity_Yam                          67032 non-null  int32         
 72  Commodity_Yam (Abuja)                  67032 non-null  int32         
 73  Category_meat, fish and eggs           67032 non-null  int32         
 74  Category_milk and dairy                67032 non-null  int32         
 75  Category_miscellaneous food            67032 non-null  int32         
 76  Category_non-food                      67032 non-null  int32         
 77  Category_oil and fats                  67032 non-null  int32         
 78  Category_pulses and nuts               67032 non-null  int32         
 79  Category_vegetables and fruits         67032 non-null  int32         
dtypes: datetime64[ns](1), float64(4), int32(72), object(3)
memory usage: 22.5+ MB

In [42]:

data.describe()

Out[42]:

	Latitude	Longitude	Price	USD_Price	Price_Type_Wholesale	Price_Flag_actual,aggregate	Price_Flag_aggregate	Unit_1.1 KG	Unit_1.2 KG	Unit_1.3 KG	...	Commodity_Wheat	Commodity_Yam	Commodity_Yam (Abuja)	Category_meat, fish and eggs	Category_milk and dairy	Category_miscellaneous food	Category_non-food	Category_oil and fats	Category_pulses and nuts	Category_vegetables and fruits
count	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	...	67032.000000	67032.000000	67032.0000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000	67032.000000
mean	11.411172	10.273114	5904.063438	35.625326	0.320847	0.006997	0.217717	0.008832	0.052811	0.086645	...	0.000910	0.048171	0.0010	0.076575	0.019185	0.033074	0.043681	0.060717	0.168979	0.107919
std	1.689944	2.934114	11435.785644	362.031271	0.466805	0.083354	0.412697	0.093561	0.223657	0.281317	...	0.030153	0.214129	0.0316	0.265918	0.137175	0.178831	0.204385	0.238813	0.374736	0.310280
min	5.147640	3.401897	1.000000	0.002300	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	11.280160	8.429120	200.000000	0.600400	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	11.831100	11.166670	446.745000	1.307900	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	12.362300	13.092646	6862.500000	28.969800	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	...	0.000000	0.000000	0.0000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
max	13.645000	13.264360	138000.000000	28662.419400	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	...	1.000000	1.000000	1.0000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000

8 rows × 76 columns

In [43]:

test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13949 entries, 0 to 13948
Data columns (total 77 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Date                                   13949 non-null  datetime64[ns]
 1   Admin1                                 13949 non-null  object        
 2   Admin2                                 13949 non-null  object        
 3   Market                                 13949 non-null  object        
 4   Latitude                               13949 non-null  float64       
 5   Longitude                              13949 non-null  float64       
 6   Price                                  13949 non-null  float64       
 7   USD_Price                              13949 non-null  float64       
 8   Price_Type_Wholesale                   13949 non-null  int32         
 9   Unit_1.1 KG                            13949 non-null  int32         
 10  Unit_1.2 KG                            13949 non-null  int32         
 11  Unit_1.3 KG                            13949 non-null  int32         
 12  Unit_1.4 KG                            13949 non-null  int32         
 13  Unit_1.5 KG                            13949 non-null  int32         
 14  Unit_100 KG                            13949 non-null  int32         
 15  Unit_100 L                             13949 non-null  int32         
 16  Unit_100 Tubers                        13949 non-null  int32         
 17  Unit_2.1 KG                            13949 non-null  int32         
 18  Unit_20 G                              13949 non-null  int32         
 19  Unit_250 G                             13949 non-null  int32         
 20  Unit_3.1 KG                            13949 non-null  int32         
 21  Unit_3.4 KG                            13949 non-null  int32         
 22  Unit_30 pcs                            13949 non-null  int32         
 23  Unit_300 G                             13949 non-null  int32         
 24  Unit_400 G                             13949 non-null  int32         
 25  Unit_50 KG                             13949 non-null  int32         
 26  Unit_750 ML                            13949 non-null  int32         
 27  Unit_KG                                13949 non-null  int32         
 28  Unit_L                                 13949 non-null  int32         
 29  Unit_Unit                              13949 non-null  int32         
 30  Commodity_Beans (red)                  13949 non-null  int32         
 31  Commodity_Beans (white)                13949 non-null  int32         
 32  Commodity_Bread                        13949 non-null  int32         
 33  Commodity_Cassava meal (gari, yellow)  13949 non-null  int32         
 34  Commodity_Cowpeas                      13949 non-null  int32         
 35  Commodity_Cowpeas (brown)              13949 non-null  int32         
 36  Commodity_Cowpeas (white)              13949 non-null  int32         
 37  Commodity_Eggs                         13949 non-null  int32         
 38  Commodity_Fish                         13949 non-null  int32         
 39  Commodity_Fuel (diesel)                13949 non-null  int32         
 40  Commodity_Fuel (petrol-gasoline)       13949 non-null  int32         
 41  Commodity_Gari (white)                 13949 non-null  int32         
 42  Commodity_Groundnuts                   13949 non-null  int32         
 43  Commodity_Groundnuts (shelled)         13949 non-null  int32         
 44  Commodity_Maize                        13949 non-null  int32         
 45  Commodity_Maize (white)                13949 non-null  int32         
 46  Commodity_Maize (yellow)               13949 non-null  int32         
 47  Commodity_Maize flour                  13949 non-null  int32         
 48  Commodity_Meat (beef)                  13949 non-null  int32         
 49  Commodity_Meat (goat)                  13949 non-null  int32         
 50  Commodity_Milk                         13949 non-null  int32         
 51  Commodity_Millet                       13949 non-null  int32         
 52  Commodity_Oil (palm)                   13949 non-null  int32         
 53  Commodity_Oil (vegetable)              13949 non-null  int32         
 54  Commodity_Onions                       13949 non-null  int32         
 55  Commodity_Oranges                      13949 non-null  int32         
 56  Commodity_Rice (imported)              13949 non-null  int32         
 57  Commodity_Rice (local)                 13949 non-null  int32         
 58  Commodity_Rice (milled, local)         13949 non-null  int32         
 59  Commodity_Salt                         13949 non-null  int32         
 60  Commodity_Sorghum                      13949 non-null  int32         
 61  Commodity_Sorghum (brown)              13949 non-null  int32         
 62  Commodity_Sorghum (white)              13949 non-null  int32         
 63  Commodity_Spinach                      13949 non-null  int32         
 64  Commodity_Sugar                        13949 non-null  int32         
 65  Commodity_Tomatoes                     13949 non-null  int32         
 66  Commodity_Watermelons                  13949 non-null  int32         
 67  Commodity_Wheat                        13949 non-null  int32         
 68  Commodity_Yam                          13949 non-null  int32         
 69  Commodity_Yam (Abuja)                  13949 non-null  int32         
 70  Category_meat, fish and eggs           13949 non-null  int32         
 71  Category_milk and dairy                13949 non-null  int32         
 72  Category_miscellaneous food            13949 non-null  int32         
 73  Category_non-food                      13949 non-null  int32         
 74  Category_oil and fats                  13949 non-null  int32         
 75  Category_pulses and nuts               13949 non-null  int32         
 76  Category_vegetables and fruits         13949 non-null  int32         
dtypes: datetime64[ns](1), float64(4), int32(69), object(3)
memory usage: 4.5+ MB

In [44]:

data

Out[44]:

	Date	Admin1	Admin2	Market	Latitude	Longitude	Price	USD_Price	Price_Type_Wholesale	Price_Flag_actual,aggregate	...	Commodity_Wheat	Commodity_Yam	Commodity_Yam (Abuja)	Category_meat, fish and eggs	Category_milk and dairy	Category_miscellaneous food	Category_non-food	Category_oil and fats	Category_pulses and nuts	Category_vegetables and fruits
0	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	175.92	1.5525	1	0	...	0	0	0	0	0	0	0	0	0	0
1	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	150.18	1.3254	1	0	...	0	0	0	0	0	0	0	0	0	0
2	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	358.70	3.1656	1	0	...	0	0	0	0	0	0	0	0	0	0
3	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	155.61	1.3733	1	0	...	0	0	0	0	0	0	0	0	0	0
4	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	196.87	1.7374	1	0	...	0	0	0	0	0	0	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
67027	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	298.55	0.6480	0	0	...	0	0	0	0	0	0	0	0	1	0
67028	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	29440.00	63.8992	1	0	...	0	0	0	0	0	0	0	0	1	0
67029	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	274.82	0.5965	0	0	...	0	0	0	0	0	0	0	0	1	0
67030	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	46960.00	101.9261	1	0	...	0	0	0	0	0	0	0	0	1	0
67031	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	467.66	1.0151	0	0	...	0	0	0	0	0	0	0	0	1	0

67032 rows × 80 columns

In [45]:

data = data.drop(['Date', 'Admin1', 'Admin2', 'Market'], axis=1)

In [46]:

raw_data

Out[46]:

	Date	Admin1	Admin2	Market	Latitude	Longitude	Category	Commodity	Unit	Price_Flag	Price_Type	Price	USD_Price
0	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Maize	KG	actual	Wholesale	175.92	1.5525
1	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Millet	KG	actual	Wholesale	150.18	1.3254
2	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Rice (imported)	KG	actual	Wholesale	358.70	3.1656
3	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	cereals and tubers	Sorghum	KG	actual	Wholesale	155.61	1.3733
4	2002-01-15	Katsina	Jibia	Jibia (CBM)	13.08000	7.24000	pulses and nuts	Beans (niebe)	KG	actual	Wholesale	196.87	1.7374
...	...	...	...	...	...	...	...	...	...	...	...	...	...
67027	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Cowpeas (brown)	KG	aggregate	Retail	298.55	0.6480
67028	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Cowpeas (white)	100 KG	aggregate	Wholesale	29440.00	63.8992
67029	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Cowpeas (white)	KG	aggregate	Retail	274.82	0.5965
67030	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Groundnuts (shelled)	100 KG	aggregate	Wholesale	46960.00	101.9261
67031	2023-01-15	Zamfara	Kaura Namoda	Kaura Namoda	12.59519	6.58635	pulses and nuts	Groundnuts (shelled)	KG	aggregate	Retail	467.66	1.0151

67032 rows × 13 columns

In [47]:

data.duplicated()

Out[47]:

0        False
1        False
2        False
3        False
4        False
         ...  
67027    False
67028    False
67029    False
67030    False
67031    False
Length: 67032, dtype: bool

Co-relation¶

In [48]:

co_relation = data.corr()

In [49]:

plt.figure(figsize=(18, 8))  # Swapping width and height for a horizontal plot
sns.heatmap(co_relation, xticklabels=co_relation.columns, yticklabels=co_relation.columns, annot=True)

Out[49]:

<Axes: >

Data Spliting¶

In [50]:

y_variables = data[['Price']]
x_variables = data.drop(['Price','USD_Price'], axis=1)

In [51]:

x_variables

Out[51]:

	Latitude	Longitude	Price_Type_Wholesale	Price_Flag_actual,aggregate	Price_Flag_aggregate	Unit_1.1 KG	Unit_1.2 KG	Unit_1.3 KG	Unit_1.4 KG	Unit_1.5 KG	...	Commodity_Wheat	Commodity_Yam	Commodity_Yam (Abuja)	Category_meat, fish and eggs	Category_milk and dairy	Category_miscellaneous food	Category_non-food	Category_oil and fats	Category_pulses and nuts	Category_vegetables and fruits
0	13.08000	7.24000	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	13.08000	7.24000	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	13.08000	7.24000	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	13.08000	7.24000	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	13.08000	7.24000	1	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
67027	12.59519	6.58635	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
67028	12.59519	6.58635	1	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
67029	12.59519	6.58635	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
67030	12.59519	6.58635	1	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0
67031	12.59519	6.58635	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	1	0

67032 rows × 74 columns

In [52]:

# train, test split......
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)

In [53]:

print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size :", (len(X_train)/len(x_variables))*100)
print("Train Size :", (len(X_test)/len(x_variables))*100)

X_train Size : 53625
Y_train Size : 53625
X_test Size : 13407
Y_test Size : 13407
Train Size : 79.99910490511995
Train Size : 20.000895094880057

Calculation of Multiple Linear Regression¶

Several circumstances that influence the dependent variable simultaneously can be controlled through multiple regression analysis. Regression analysis is a method of analyzing the relationship between independent variables and dependent variables.

Let $k$ represent the number of variables denoted by $x_1, x_2, x_3, ……, x_k$.

For this method, we assume that we have $k$ independent variables $x_1, . . . , x_k$ that we can set, then they probabilistically determine an outcome $Y$.

Furthermore, we assume that $Y$ is linearly dependent on the factors according to

$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + · · · + \beta_kx_k + \epsilon$$

The variable $y_i$ is dependent or predicted The slope of $y$ depends on the y-intercept, that is, when $x_i$ and $x_2$ are both zero, $y$ will be $\beta_0$. The regression coefficients $\beta_1$ and $\beta_2$ represent the change in $y$ as a result of one-unit changes in $x_{i1}$ and $x_{i2}$. $\beta_p$ refers to the slope coefficient of all independent variables $\epsilon$ term describes the random error (residual) in the model. Where $\epsilon$ is a standard error, this is just like we had for simple linear regression, except $k$ doesn’t have to be 1.

We have $n$ observations, $n$ typically being much more than $k$.

For $i$ th observation, we set the independent variables to the values $x_{i1}, x_{i2} . . . , x_{ik}$ and measure a value $y_i$ for the random variable $Y_i$.

Thus, the model can be described by the equations.

$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + · · · + \beta_kx_{ik} + i$$

for $i = 1, 2, . . . , n$,

Where the errors $i$ are independent standard variables, each with mean 0 and the same unknown variance $\sigma^2$.

Altogether the model for multiple linear regression has $k + 2$ unknown parameters:

$\beta_0, \beta_1, . . . , \beta_k$, and $\sigma^2$.

When $k$ was equal to 1, we found the least squares line $y = \hat{\beta}_0 +\hat{\beta}_1x$.

It was a line in the plane $R^2$.

Now, with $k ≥ 1$, we’ll have a least squares hyperplane.

$$y = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + · · · + \hat{\beta}_kx_k$$

in $R^{k+1}$.

The way to find the estimators $\hat{\beta}_0, \hat{\beta}_1, . . ., and \hat{\beta}_k$ is the same.

Take the partial derivatives of the squared error.

$$Q = \sum_{i=1}^{n} (y_i − (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + · · · + \beta_kx_{ik}))^2$$

When that system is solved we have fitted values

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \hat{\beta}_2x_{i2} + · · · + \hat{\beta}_kx_{ik}$$

for $i = 1, . . . , n$ that should be close to the actual values $y_i$

In [54]:

# Linear Rergression
regr = LinearRegression()
regr.fit(X_train, y_train)

Out[54]:

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [55]:

regr.coef_

Out[55]:

array([[ -496.37577043,    59.00419216,  7127.40673806, -1352.59510637,
         7550.79319011, -3778.135254  , -3440.99173812, -3791.11471381,
        -4587.58949543, -3372.19435216,  5538.41622035, 10824.82777782,
        29157.66962326, -1866.76080609,  -758.24411606, -1279.93091436,
        -7528.21499248, -7596.86213421, -2625.80215273, -1941.83515747,
          317.92836492,  3636.69856655, 11338.51910072, -9857.58587657,
         6735.37409185, -2575.2182841 , -1870.13105442, -1726.39911311,
        -1764.77844386, -2575.2182841 ,   606.46553692, -1870.37217076,
         3411.03963047,  2277.51264975, -2625.80215273, -2633.23666723,
        -3878.24010093, -4036.89809262, -1012.79963229, -1722.67623345,
         6714.54158886,  1337.60370884, -3583.79062706, -2379.76527505,
         1701.05276651,  4665.97254177,  4709.96026691,  -758.24411606,
         -344.69289164, -4390.94720729, -4379.87204189, -4090.90661068,
        -4218.01435362,  2395.96294133,  6096.6802472 ,  8903.42047086,
        -1279.93091435,  1782.27837774, -2246.98054402, -2905.38240885,
        -1941.83515747,  1689.62362447, -3922.10442867, -1866.76080609,
         1532.10846264,  5840.42898245, -7338.19608482,  4116.89398872,
         -758.24411606,   409.69271012, -7915.13819355, -8770.81924919,
         3448.7368535 ,  1659.70225979]])

In [56]:

len(regr.coef_)

Out[56]:

In [57]:

regr.intercept_

Out[57]:

array([7638.17668407])

In [58]:

print('Training data r-squared:', regr.score(X_train, y_train))
print('Test data r-squared:', regr.score(X_test, y_test))

Training data r-squared: 0.7633629459214317
Test data r-squared: 0.7691195969542567

In [59]:

pd.DataFrame(data = regr.coef_[0], index = X_train.columns, columns = ['Coef'])

Out[59]:

	Coef
Latitude	-496.375770
Longitude	59.004192
Price_Type_Wholesale	7127.406738
Price_Flag_actual,aggregate	-1352.595106
Price_Flag_aggregate	7550.793190
...	...
Category_miscellaneous food	409.692710
Category_non-food	-7915.138194
Category_oil and fats	-8770.819249
Category_pulses and nuts	3448.736854
Category_vegetables and fruits	1659.702260

74 rows × 1 columns

Skewness¶

Skewness is a statistical measure that describes the asymmetry or departure from symmetry in a probability distribution. In the context of data analysis, skewness is often used to assess the distribution of a variable. The skewness of a distribution can be positive, negative, or zero: Skewness is a statistical measure that describes the asymmetry or departure from symmetry in a probability distribution. In the context of data analysis, skewness is often used to assess the distribution of a variable. The skewness of a distribution can be positive, negative, or zero:

Positive Skewness (Right Skewed):
- The right tail of the distribution is longer or fatter than the left tail.
- The majority of the data points are concentrated on the left side, and the distribution extends further to the right.
Negative Skewness (Left Skewed):
- The left tail of the distribution is longer or fatter than the right tail.
- The majority of the data points are concentrated on the right side, and the distribution extends further to the left.
Zero Skewness:
- The distribution is perfectly symmetrical.

The skewness of a variable (X) can be calculated using the formula:

$$ \text{Skewness} = \frac{1}{\sigma^3} E[(X-\mu)^3] $$

where:

$E$ denotes the expectation,
$X$ is a random variable,
$\mu$ is the mean of $X$, and
$\sigma$ is the standard deviation of $X$.

where (E) is the expectation operator, (\mu) is the mean of (X), and (\sigma) is the standard deviation of (X).

Now, regarding your question about applying skewness to (y) variables (target variables or dependent variables in a predictive modeling context), there are a couple of considerations:

Assumption of Regression Models:
- Many statistical and machine learning models assume that the residuals (the differences between the observed and predicted values) are normally distributed. Skewed distributions of the target variable might violate this assumption.
- Transforming the target variable (applying transformations such as log transformation) can sometimes help in making the residuals more normally distributed, improving the performance of the model.
Equalizing Variances:
- In some cases, transforming the target variable can help equalize the variances across different levels of the predictor variables, which is a key assumption in the analysis of variance (ANOVA).
Stabilizing Variances:
- In financial modeling or time series analysis, transforming the target variable may be done to stabilize variances over time.

It's important to note that transforming the target variable should be done judiciously, and the choice of transformation depends on the characteristics of the data and the goals of the analysis. Common transformations include the logarithmic transformation, square root transformation, and Box-Cox transformation.

Always validate the impact of any transformation on your analysis, and consider the interpretability of the results after transformation. Additionally, be cautious about over-transforming the data, as it might lead to difficulties in interpreting the results.

In [60]:

plt.figure(figsize= (16,8))
plt.hist(data['Price'], bins =50, ec = 'black', color = 'blue', alpha = 0.7)# bins 50...setting color by  hexcode
plt.xlabel('Price')
plt.ylabel('Nr of Houses')
plt.show()

In [61]:

plt.figure(figsize= (16,8))
plt.hist(data['USD_Price'], bins =50, ec = 'black', color = 'blue', alpha = 0.7)# bins 50...setting color by  hexcode
plt.xlabel('Price')
plt.ylabel('Nr of Houses')
plt.show()

In [62]:

# The plot above has a skew.....Its not balanced by the middle value
# Lets find the skew..
data['USD_Price'].skew()

Out[62]:

27.614708422713022

In [63]:

data['Price'].skew()

Out[63]:

2.8713985830315463

In [64]:

y_log = np.log(data['Price'])
y_usd_log = np.log(data['USD_Price'])

In [65]:

y_log.skew()

Out[65]:

0.6090003213244045

In [66]:

y_usd_log.skew()

Out[66]:

0.6544660543802026

In [67]:

sns.set()
plt.figure(figsize=(20,6))
sns.histplot(y_log, kde=True)
plt.show()

In [68]:

plt.figure(figsize=(20,6))
sns.displot(y_usd_log,  kind="kde")

Out[68]:

<seaborn.axisgrid.FacetGrid at 0x2275db7f6a0>

<Figure size 2000x600 with 0 Axes>

In [69]:

sns.set()
plt.figure(figsize=(20,6))
sns.histplot(y_usd_log, kde=True)
plt.show()

In [70]:

plt.figure(figsize=(20,6))
sns.displot(y_usd_log,  kind="kde")

Out[70]:

<seaborn.axisgrid.FacetGrid at 0x2275db0c430>

<Figure size 2000x600 with 0 Axes>

In [71]:

print('Intercept', regr.intercept_)# intercept after log transformation of prices

Intercept [7638.17668407]

In [72]:

# using the stats model to get the p value...
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.params

Out[72]:

const                             6992.523453
Latitude                          -496.375770
Longitude                           59.004192
Price_Type_Wholesale              7127.406738
Price_Flag_actual,aggregate      -1352.595106
                                     ...     
Category_miscellaneous food        662.114666
Category_non-food                -7709.652678
Category_oil and fats            -8565.333733
Category_pulses and nuts          3722.717541
Category_vegetables and fruits    1967.930534
Length: 75, dtype: float64

In [73]:

data['Price'] = y_log

In [74]:

x_variables = data.drop(['Price','USD_Price'], axis=1)
y_variables = data[['Price']]

In [75]:

# train, test split......
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)

In [76]:

regr = LinearRegression()
regr.fit(X_train, y_train)

Out[76]:

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [77]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)

Out[77]:

LinearRegression()

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

In [78]:

print('Training data r-squared:', regr.score(X_train, y_train))
print('Test data r-squared:', regr.score(X_test, y_test))

Training data r-squared: 0.955473671807045
Test data r-squared: 0.9530163840929582

In [79]:

r_squared =  regr.score(X_test, y_test)

In [80]:

1/(1 - r_squared)

Out[80]:

21.28401530394177

In [81]:

pd.DataFrame(data=regr.coef_[0], index=X_train.columns, columns=['coef'])

Out[81]:

	coef
Latitude	-0.043979
Longitude	0.005610
Price_Type_Wholesale	0.695027
Price_Flag_actual,aggregate	-0.024670
Price_Flag_aggregate	0.609545
...	...
Category_miscellaneous food	0.168125
Category_non-food	-1.914440
Category_oil and fats	-1.093030
Category_pulses and nuts	0.453877
Category_vegetables and fruits	-0.195025

74 rows × 1 columns

In [82]:

print('Intercept', regr.intercept_)# intercept after log transformation of prices

Intercept [6.68274382]

In [83]:

# using the stats model to get the p value...
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.params

Out[83]:

const                             6.117853
Latitude                         -0.043979
Longitude                         0.005610
Price_Type_Wholesale              0.695027
Price_Flag_actual,aggregate      -0.024670
                                    ...   
Category_miscellaneous food       0.388973
Category_non-food                -1.734658
Category_oil and fats            -0.913248
Category_pulses and nuts          0.693586
Category_vegetables and fruits    0.074648
Length: 75, dtype: float64

In [84]:

results.pvalues

Out[84]:

const                              0.000000e+00
Latitude                          5.864980e-250
Longitude                          1.532625e-11
Price_Type_Wholesale               0.000000e+00
Price_Flag_actual,aggregate        2.700965e-01
                                      ...      
Category_miscellaneous food       6.518373e-260
Category_non-food                  1.576027e-85
Category_oil and fats              2.739426e-25
Category_pulses and nuts           0.000000e+00
Category_vegetables and fruits     3.880508e-03
Length: 75, dtype: float64

In [85]:

round(results.pvalues,3)

Out[85]:

const                             0.000
Latitude                          0.000
Longitude                         0.000
Price_Type_Wholesale              0.000
Price_Flag_actual,aggregate       0.270
                                  ...  
Category_miscellaneous food       0.000
Category_non-food                 0.000
Category_oil and fats             0.000
Category_pulses and nuts          0.000
Category_vegetables and fruits    0.004
Length: 75, dtype: float64

In [86]:

X = X_train.assign(const=1)
vif = pd.Series([variance_inflation_factor(X.values, i) 
               for i in range(X.shape[1])], 
              index=X.columns)

C:\Users\bhara\anaconda3\envs\base-env\lib\site-packages\statsmodels\stats\outliers_influence.py:198: RuntimeWarning: divide by zero encountered in scalar divide
  vif = 1. / (1. - r_squared_i)

In [87]:

vif

Out[87]:

Latitude                           1.425167
Longitude                          1.767381
Price_Type_Wholesale              12.763579
Price_Flag_actual,aggregate        1.020677
Price_Flag_aggregate               1.630952
                                    ...    
Category_non-food                       inf
Category_oil and fats                   inf
Category_pulses and nuts                inf
Category_vegetables and fruits          inf
const                                   inf
Length: 75, dtype: float64

In [ ]: