1. Introduction
Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.
2. Objectives of Exploratory Data Analysis
1. Understand Data Characteristics:
Gain insights into the distribution, central tendency, and variability of the data.
Identify the presence of missing values, outliers, and anomalies.
2. Explore Relationships:
Examine correlations and dependencies between different variables.
Uncover potential patterns and trends within the dataset.
3. Visualize Data Distributions:
Utilize graphical representations to visualize the distribution of data.
Choose appropriate plots such as histograms, box plots, and scatter plots.
4. Identify Patterns and Anomalies:
Uncover hidden patterns that may not be apparent in raw data.
Detect outliers and anomalies that could impact analysis outcomes.
3. Techniques and Tools
1. Descriptive Statistics:
Calculate measures such as mean, median, and standard deviation.
Utilize summary statistics to provide an overview of the dataset.
Data Visualization:
Employ graphical representations like histograms, box plots, and scatter plots.
Create visualizations to illustrate trends, patterns, and relationships.
Correlation Analysis:
Use correlation matrices to quantify the relationships between variables.
Identify strong positive/negative correlations and potential multicollinearity.
Outlier Detection:
Apply statistical methods or visual inspection to identify outliers.
Assess the impact of outliers on the analysis and consider appropriate handling.
4. Steps in Exploratory Data Analysis
1. Data Collection:
Gather the raw dataset from reliable sources.
2. Data Cleaning:
Handle missing values, duplicate entries, and inconsistencies.
Ensure data is in a suitable format for analysis.
3. Descriptive Statistics:
Compute basic statistics to describe the central tendency and dispersion.
4. Visualization:
Generate visualizations to explore data distributions and relationships.
5. Correlation Analysis:
Investigate correlations between variables.
6. Outlier Detection:
Identify and analyze outliers to understand their impact.
5. Case Study: Applying EDA to Real-World Data
Provide a practical example where EDA is applied to a specific dataset, showcasing the step-by-step process and the insights gained.
6. Conclusion
Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.
7. References
Include references to any tools, libraries, or methodologies used in the EDA process.
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Documentation
import handcalcs.render
# Plot
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import cm # color map
import seaborn as sns
import plotly.express as px
from sympy import Sum, symbols, Indexed, lambdify, diff
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import LabelEncoder
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Path
data_path = './Data/'
raw_data = pd.read_csv(data_path+"wfp_food_prices_nga.csv", low_memory=False).reset_index(drop=True)
raw_data.shape
(80982, 14)
raw_data
date | admin1 | admin2 | market | latitude | longitude | category | commodity | unit | priceflag | pricetype | currency | price | usdprice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | #date | #adm1+name | #adm2+name | #loc+market+name | #geo+lat | #geo+lon | #item+type | #item+name | #item+unit | #item+price+flag | #item+price+type | #currency | #value | #value+usd |
1 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Maize | KG | actual | Wholesale | NGN | 175.92 | 1.5525 |
2 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Millet | KG | actual | Wholesale | NGN | 150.18 | 1.3254 |
3 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Rice (imported) | KG | actual | Wholesale | NGN | 358.7 | 3.1656 |
4 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Sorghum | KG | actual | Wholesale | NGN | 155.61 | 1.3733 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
80977 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Onions | 400 G | forecast | Retail | NGN | 0.0 | 0.0 |
80978 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Oranges | 400 G | forecast | Retail | NGN | 0.0 | 0.0 |
80979 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Spinach | 300 G | forecast | Retail | NGN | 0.0 | 0.0 |
80980 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Tomatoes | 0.5 KG | forecast | Retail | NGN | 0.0 | 0.0 |
80981 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Watermelons | 2.1 KG | forecast | Retail | NGN | 0.0 | 0.0 |
80982 rows × 14 columns
column_info = raw_data.iloc[0]
column_info
date #date admin1 #adm1+name admin2 #adm2+name market #loc+market+name latitude #geo+lat longitude #geo+lon category #item+type commodity #item+name unit #item+unit priceflag #item+price+flag pricetype #item+price+type currency #currency price #value usdprice #value+usd Name: 0, dtype: object
raw_data = raw_data.drop(0).reset_index(drop=True)
raw_data
date | admin1 | admin2 | market | latitude | longitude | category | commodity | unit | priceflag | pricetype | currency | price | usdprice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Maize | KG | actual | Wholesale | NGN | 175.92 | 1.5525 |
1 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Millet | KG | actual | Wholesale | NGN | 150.18 | 1.3254 |
2 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Rice (imported) | KG | actual | Wholesale | NGN | 358.7 | 3.1656 |
3 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | cereals and tubers | Sorghum | KG | actual | Wholesale | NGN | 155.61 | 1.3733 |
4 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08 | 7.24 | pulses and nuts | Beans (niebe) | KG | actual | Wholesale | NGN | 196.87 | 1.7374 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
80976 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Onions | 400 G | forecast | Retail | NGN | 0.0 | 0.0 |
80977 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Oranges | 400 G | forecast | Retail | NGN | 0.0 | 0.0 |
80978 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Spinach | 300 G | forecast | Retail | NGN | 0.0 | 0.0 |
80979 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Tomatoes | 0.5 KG | forecast | Retail | NGN | 0.0 | 0.0 |
80980 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062664985656738 | 11.171370506286621 | vegetables and fruits | Watermelons | 2.1 KG | forecast | Retail | NGN | 0.0 | 0.0 |
80981 rows × 14 columns
# Rename the columns
raw_data.rename(columns={ 'date': 'Date', 'admin1': 'Admin1', 'admin2': 'Admin2', 'market': 'Market', 'latitude': 'Latitude', 'longitude': 'Longitude', 'category': 'Category',
'commodity': 'Commodity', 'unit': 'Unit', 'priceflag': 'Price_Flag', 'pricetype': 'Price_Type', 'currency': 'Currency', 'price': 'Price', 'usdprice':'USD_Price' }, inplace=True)
raw_data.columns
Index(['Date', 'Admin1', 'Admin2', 'Market', 'Latitude', 'Longitude', 'Category', 'Commodity', 'Unit', 'Price_Flag', 'Price_Type', 'Currency', 'Price', 'USD_Price'], dtype='object')
raw_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 80981 entries, 0 to 80980 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 80981 non-null object 1 Admin1 80981 non-null object 2 Admin2 80981 non-null object 3 Market 80981 non-null object 4 Latitude 80981 non-null object 5 Longitude 80981 non-null object 6 Category 80981 non-null object 7 Commodity 80981 non-null object 8 Unit 80981 non-null object 9 Price_Flag 80981 non-null object 10 Price_Type 80981 non-null object 11 Currency 80981 non-null object 12 Price 80981 non-null object 13 USD_Price 80981 non-null object dtypes: object(14) memory usage: 8.6+ MB
raw_data['Date'] = raw_data['Date'].astype('datetime64[ns]')
raw_data['Latitude'] = raw_data['Latitude'].astype(float)
raw_data['Longitude'] = raw_data['Longitude'].astype(float)
raw_data['Price'] = raw_data['Price'].astype(float)
raw_data['USD_Price'] = raw_data['USD_Price'].astype(float)
raw_data
Date | Admin1 | Admin2 | Market | Latitude | Longitude | Category | Commodity | Unit | Price_Flag | Price_Type | Currency | Price | USD_Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.080000 | 7.240000 | cereals and tubers | Maize | KG | actual | Wholesale | NGN | 175.92 | 1.5525 |
1 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.080000 | 7.240000 | cereals and tubers | Millet | KG | actual | Wholesale | NGN | 150.18 | 1.3254 |
2 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.080000 | 7.240000 | cereals and tubers | Rice (imported) | KG | actual | Wholesale | NGN | 358.70 | 3.1656 |
3 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.080000 | 7.240000 | cereals and tubers | Sorghum | KG | actual | Wholesale | NGN | 155.61 | 1.3733 |
4 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.080000 | 7.240000 | pulses and nuts | Beans (niebe) | KG | actual | Wholesale | NGN | 196.87 | 1.7374 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
80976 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Onions | 400 G | forecast | Retail | NGN | 0.00 | 0.0000 |
80977 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Oranges | 400 G | forecast | Retail | NGN | 0.00 | 0.0000 |
80978 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Spinach | 300 G | forecast | Retail | NGN | 0.00 | 0.0000 |
80979 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Tomatoes | 0.5 KG | forecast | Retail | NGN | 0.00 | 0.0000 |
80980 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Watermelons | 2.1 KG | forecast | Retail | NGN | 0.00 | 0.0000 |
80981 rows × 14 columns
raw_data.describe()
Latitude | Longitude | Price | USD_Price | |
---|---|---|---|---|
count | 80981.000000 | 80981.000000 | 80981.000000 | 80981.000000 |
mean | 11.396802 | 10.278874 | 4887.086852 | 29.488854 |
std | 1.699392 | 2.926734 | 10640.517897 | 329.652934 |
min | 5.147640 | 3.401897 | 0.000000 | 0.000000 |
25% | 11.280160 | 8.429120 | 100.000000 | 0.279300 |
50% | 11.831100 | 11.166670 | 314.280000 | 0.968800 |
75% | 12.362300 | 13.092646 | 1254.000000 | 3.909100 |
max | 13.645000 | 13.264360 | 138000.000000 | 28662.419400 |
raw_data.nunique()
Date 264 Admin1 14 Admin2 31 Market 40 Latitude 40 Longitude 40 Category 8 Commodity 42 Unit 22 Price_Flag 4 Price_Type 2 Currency 1 Price 20163 USD_Price 29982 dtype: int64
raw_data.isnull().sum()
Date 0 Admin1 0 Admin2 0 Market 0 Latitude 0 Longitude 0 Category 0 Commodity 0 Unit 0 Price_Flag 0 Price_Type 0 Currency 0 Price 0 USD_Price 0 dtype: int64
for column in raw_data.columns:
if column not in ['Date', 'Latitude', 'Longitude', 'Price', 'USD_Price']:
print("-------------------------------------------------",column," - ",len(raw_data[column].unique()),"---------------------------------------------------")
print(raw_data[column].unique())
print("--------------------------------------------------------------------------------------------------------------")
------------------------------------------------- Admin1 - 14 --------------------------------------------------- ['Katsina' 'Sokoto' 'Borno' 'Kano' 'Jigawa' 'Oyo' 'Lagos' 'Kaduna' 'Zamfara' 'Abia' 'Gombe' 'Kebbi' 'Adamawa' 'Yobe'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Admin2 - 31 --------------------------------------------------- ['Jibia' 'Gada' "Mai'Adua" 'Mobbar' 'Dawakin Tofa' 'Maigatari' 'Ibadan North' 'Maiduguri' 'Kosofe' 'Giwa' 'Kaura Namoda' 'Oboma Ngwa' 'Akko' 'Kaugama' 'Lere' 'Dandume' 'Gwandu' 'Hong' 'Biu' 'Damaturu' 'Potiskum' 'Konduga' 'Geidam' 'Gujba' 'Jakusko' 'Karasuwa' 'Nguru' 'Yunusari' 'Yusufari' 'Borsari' 'Gulani'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Market - 40 --------------------------------------------------- ['Jibia (CBM)' 'Illela (CBM)' 'Mai Adoua (CBM)' 'Damassack (CBM)' 'Dawanau' 'Mai Gatari (CBM)' 'Ibadan' 'Maiduguri' 'Lagos' 'Giwa' 'Kaura Namoda' 'Aba' 'Gombe' 'Gujungu' 'Saminaka' 'Dandume' 'Gwandu' 'Mubi' 'Biu' 'Damaturu' 'Potiskum' 'Abba Gamaram' 'Baga Road' 'Bullunkutu' 'Budum' 'Custom' 'Kusawam Shanu' 'Monday' 'Tashan Bama' 'Bolori Stores' 'Damaturu (Sunday Market)' 'Geidam' 'Gujba (Buni Yadi)' 'Jakusko' 'Bade (Gashua)' 'Nguru' 'Yunusari' 'Yusufari' 'Bursari' 'Gulani (Tettaba)'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Category - 8 --------------------------------------------------- ['cereals and tubers' 'pulses and nuts' 'non-food' 'oil and fats' 'meat, fish and eggs' 'milk and dairy' 'vegetables and fruits' 'miscellaneous food'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Commodity - 42 --------------------------------------------------- ['Maize' 'Millet' 'Rice (imported)' 'Sorghum' 'Beans (niebe)' 'Wheat' 'Maize (white)' 'Sorghum (white)' 'Rice (milled, local)' 'Bread' 'Cassava meal (gari, yellow)' 'Gari (white)' 'Maize (yellow)' 'Rice (local)' 'Sorghum (brown)' 'Yam (Abuja)' 'Fuel (diesel)' 'Fuel (petrol-gasoline)' 'Oil (palm)' 'Cowpeas (brown)' 'Cowpeas (white)' 'Yam' 'Groundnuts (shelled)' 'Maize flour' 'Meat (beef)' 'Meat (goat)' 'Milk' 'Oil (vegetable)' 'Beans (red)' 'Beans (white)' 'Groundnuts' 'Onions' 'Fish' 'Eggs' 'Bananas' 'Oranges' 'Spinach' 'Watermelons' 'Cowpeas' 'Tomatoes' 'Salt' 'Sugar'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Unit - 22 --------------------------------------------------- ['KG' '100 KG' '50 KG' 'Unit' 'L' '100 L' '1.3 KG' '1.4 KG' '3.4 KG' '20 G' '750 ML' '1.1 KG' '0.5 KG' '1.5 KG' '30 pcs' '400 G' '300 G' '2.1 KG' '3.1 KG' '1.2 KG' '250 G' '100 Tubers'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Price_Flag - 4 --------------------------------------------------- ['actual' 'actual,aggregate' 'aggregate' 'forecast'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Price_Type - 2 --------------------------------------------------- ['Wholesale' 'Retail'] -------------------------------------------------------------------------------------------------------------- ------------------------------------------------- Currency - 1 --------------------------------------------------- ['NGN'] --------------------------------------------------------------------------------------------------------------
raw_data.Category.value_counts()
cereals and tubers 39537 pulses and nuts 13841 vegetables and fruits 8794 meat, fish and eggs 6177 oil and fats 4928 non-food 3436 miscellaneous food 2721 milk and dairy 1547 Name: Category, dtype: int64
raw_data.Commodity.value_counts()
Millet 4972 Rice (imported) 4537 Rice (local) 4193 Yam 4002 Oil (palm) 3368 Maize (white) 2946 Sorghum (white) 2904 Cowpeas (white) 2699 Groundnuts (shelled) 2663 Sorghum (brown) 2660 Gari (white) 2599 Maize (yellow) 2560 Cowpeas (brown) 2526 Cassava meal (gari, yellow) 2310 Sorghum 2109 Fuel (petrol-gasoline) 1721 Fuel (diesel) 1715 Groundnuts 1561 Oil (vegetable) 1560 Maize flour 1557 Beans (red) 1554 Meat (beef) 1552 Meat (goat) 1552 Milk 1547 Beans (white) 1547 Onions 1546 Fish 1543 Eggs 1530 Watermelons 1521 Oranges 1520 Bananas 1505 Tomatoes 1415 Sugar 1369 Salt 1352 Spinach 1287 Cowpeas 1267 Bread 1035 Maize 601 Rice (milled, local) 394 Yam (Abuja) 85 Wheat 73 Beans (niebe) 24 Name: Commodity, dtype: int64
raw_data['Unit'].value_counts()
100 KG 19125 KG 18740 1.3 KG 7056 L 5135 1.4 KG 5072 1.2 KG 4335 750 ML 3120 400 G 2387 0.5 KG 2094 50 KG 1965 20 G 1547 30 pcs 1530 2.1 KG 1521 250 G 1352 300 G 1287 Unit 1035 3.1 KG 867 1.1 KG 694 3.4 KG 680 1.5 KG 676 100 Tubers 666 100 L 97 Name: Unit, dtype: int64
raw_data['Price_Flag'].value_counts()
actual 51969 aggregate 14594 forecast 13949 actual,aggregate 469 Name: Price_Flag, dtype: int64
raw_data['Price_Type'].value_counts()
Retail 56111 Wholesale 24870 Name: Price_Type, dtype: int64
# Removing Outlier column
raw_data.drop('Currency', axis=1, inplace=True)
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5)) # You can adjust the width and height as needed
# Choose a color palette with fewer colors
palette = sns.color_palette("Set1", len(raw_data['Category'].unique()))
# Get the top 8 categories (you can adjust the number as needed)
top_categories = raw_data['Category'].value_counts().index
# Filter the data to include only the top categories
filtered_data = raw_data[raw_data['Category'].isin(top_categories)]
# Creating the bar plot with different colors for each category
sns.barplot(x='Category', y='USD_Price', hue='Category', data=filtered_data, palette=palette)
# Adding titles and labels
plt.title('Bar Plot of USD Price by Category')
plt.xlabel('Category')
plt.ylabel('USD Price')
# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Category')
# Show the plot
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(10, 18)) # Swapping width and height for a horizontal plot
# Get the top categories
top_categories = raw_data['Commodity'].value_counts().index
# Filter the data to include only the top categories
filtered_data = raw_data[raw_data['Commodity'].isin(top_categories)]
# Creating the horizontal bar plot with different colors for each category
sns.barplot(x='USD_Price', y='Commodity', hue='Commodity', data=filtered_data)
# Adding titles and labels
plt.title('Plot of USD Price by Commodity')
plt.xlabel('USD Price')
plt.ylabel('Commodity')
# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], marker='o', color='w', markerfacecolor=sns.color_palette()[i % len(sns.color_palette())], markersize=10) for i, _ in enumerate(top_categories)]
plt.legend(legend_handles, top_categories, title='Commodity')
# Show the plot
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(14, 5))
# Choose a color palette with fewer colors
palette = sns.color_palette("Set1", len(raw_data['Price_Type'].unique()))
# Get Price Type
top_price_type = raw_data['Price_Type'].value_counts().index
# Filter the data to include only the top Price Type
filtered_data = raw_data[raw_data['Price_Type'].isin(top_price_type)]
# Creating the bar plot with different colors for each Price Type
ax = sns.barplot(x='Price_Type', y='USD_Price', hue='Price_Type', data=filtered_data, palette=palette)
# Adding titles and labels
plt.title('USD Price by Price Type')
plt.xlabel('Price Type')
plt.ylabel('USD Price')
# Creating a custom legend with colors
legend_handles = [plt.Line2D([0], [0], color=palette[i], lw=4) for i, _ in enumerate(top_price_type)]
plt.legend(legend_handles, top_price_type, title='Price Type')
# Adding annotations on top of the bars
for p in ax.patches:
ax.annotate(f'{p.get_height():.2f}', (p.get_x() + p.get_width() / 2., p.get_height()),
ha='center', va='center', xytext=(0, 10), textcoords='offset points')
# Show the plot
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))
# Creating a histogram with 20 bins
sns.histplot(raw_data["Price"], kde=True, bins=10,
stat="density", kde_kws=dict(cut=3),
alpha=.4, edgecolor=(1, 1, 1, .4),)
# Adding titles and labels
plt.title('Price Distribution')
plt.xlabel('Price')
plt.ylabel('Frequency')
# Show the plot
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))
# Creating a histogram with 20 bins
sns.histplot(raw_data["USD_Price"], kde=True, bins=10,
stat="density", kde_kws=dict(cut=3),
alpha=.4, edgecolor=(1, 1, 1, .4),)
# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Frequency')
# Show the plot
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 5))
sns.kdeplot(raw_data["USD_Price"])
# Adding titles and labels
plt.title('USD Price Distribution')
plt.xlabel('USD Price')
plt.ylabel('Frequency')
# Show the plot
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8)) # Swapping width and height for a horizontal plot
sns.scatterplot(x=raw_data['Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of NGN Price vs Commodity')
plt.xlabel('Commodity')
plt.ylabel('NGN Price')
plt.show()
# Adjusting figure size (set this before creating the plot)
plt.figure(figsize=(18, 8)) # Swapping width and height for a horizontal plot
sns.scatterplot(x=raw_data['USD_Price'], y=raw_data['Commodity'])
plt.title('Scatter Plot of USD Price vs Commodity')
plt.xlabel('X1')
plt.ylabel('y')
plt.show()
# sns.pairplot(raw_data)
# plt.show()
# sns.pairplot(raw_data, kind = 'reg' , plot_kws = {'line_kws':{'color': 'red'}})
# plt.show()
Encoding categorical data is a crucial step in preparing data for machine learning models, as many algorithms require numerical input. Categorical data represents variables that can take on a limited, and usually fixed, number of values. There are several common techniques for encoding categorical data:
pd.get_dummies(raw_data['Category'])
cereals and tubers | meat, fish and eggs | milk and dairy | miscellaneous food | non-food | oil and fats | pulses and nuts | vegetables and fruits | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
80976 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
80977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
80978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
80979 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
80980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
80981 rows × 8 columns
pd.get_dummies(raw_data['Price_Flag'])
actual | actual,aggregate | aggregate | forecast | |
---|---|---|---|---|
0 | 1 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 |
4 | 1 | 0 | 0 | 0 |
... | ... | ... | ... | ... |
80976 | 0 | 0 | 0 | 1 |
80977 | 0 | 0 | 0 | 1 |
80978 | 0 | 0 | 0 | 1 |
80979 | 0 | 0 | 0 | 1 |
80980 | 0 | 0 | 0 | 1 |
80981 rows × 4 columns
pd.get_dummies(raw_data['Price_Type'])
Retail | Wholesale | |
---|---|---|
0 | 0 | 1 |
1 | 0 | 1 |
2 | 0 | 1 |
3 | 0 | 1 |
4 | 0 | 1 |
... | ... | ... |
80976 | 1 | 0 |
80977 | 1 | 0 |
80978 | 1 | 0 |
80979 | 1 | 0 |
80980 | 1 | 0 |
80981 rows × 2 columns
pd.get_dummies(raw_data['Unit'])
0.5 KG | 1.1 KG | 1.2 KG | 1.3 KG | 1.4 KG | 1.5 KG | 100 KG | 100 L | 100 Tubers | 2.1 KG | ... | 3.1 KG | 3.4 KG | 30 pcs | 300 G | 400 G | 50 KG | 750 ML | KG | L | Unit | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
80976 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
80977 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
80978 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
80979 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
80980 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
80981 rows × 22 columns
# This data has been seperated as test data because the price has to be predicted
test_data = raw_data[raw_data['Price_Flag'] == 'forecast'].reset_index(drop=True)
test_data
Date | Admin1 | Admin2 | Market | Latitude | Longitude | Category | Commodity | Unit | Price_Flag | Price_Type | Price | USD_Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2006-01-15 | Sokoto | Gada | Illela (CBM) | 13.645000 | 5.278000 | cereals and tubers | Wheat | KG | forecast | Wholesale | 0.0 | 0.0 |
1 | 2006-02-15 | Sokoto | Gada | Illela (CBM) | 13.645000 | 5.278000 | cereals and tubers | Wheat | KG | forecast | Wholesale | 0.0 | 0.0 |
2 | 2006-03-15 | Katsina | Mai'Adua | Mai Adoua (CBM) | 13.180000 | 8.230000 | cereals and tubers | Wheat | KG | forecast | Wholesale | 0.0 | 0.0 |
3 | 2006-03-15 | Sokoto | Gada | Illela (CBM) | 13.645000 | 5.278000 | cereals and tubers | Wheat | KG | forecast | Wholesale | 0.0 | 0.0 |
4 | 2006-04-15 | Katsina | Mai'Adua | Mai Adoua (CBM) | 13.180000 | 8.230000 | cereals and tubers | Wheat | KG | forecast | Wholesale | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
13944 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Onions | 400 G | forecast | Retail | 0.0 | 0.0 |
13945 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Oranges | 400 G | forecast | Retail | 0.0 | 0.0 |
13946 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Spinach | 300 G | forecast | Retail | 0.0 | 0.0 |
13947 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Tomatoes | 0.5 KG | forecast | Retail | 0.0 | 0.0 |
13948 | 2023-12-15 | Yobe | Yusufari | Yusufari | 13.062665 | 11.171371 | vegetables and fruits | Watermelons | 2.1 KG | forecast | Retail | 0.0 | 0.0 |
13949 rows × 13 columns
raw_data = raw_data[raw_data['Price_Flag'] != 'forecast'].reset_index(drop=True)
raw_data
Date | Admin1 | Admin2 | Market | Latitude | Longitude | Category | Commodity | Unit | Price_Flag | Price_Type | Price | USD_Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Maize | KG | actual | Wholesale | 175.92 | 1.5525 |
1 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Millet | KG | actual | Wholesale | 150.18 | 1.3254 |
2 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Rice (imported) | KG | actual | Wholesale | 358.70 | 3.1656 |
3 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Sorghum | KG | actual | Wholesale | 155.61 | 1.3733 |
4 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | pulses and nuts | Beans (niebe) | KG | actual | Wholesale | 196.87 | 1.7374 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
67027 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Cowpeas (brown) | KG | aggregate | Retail | 298.55 | 0.6480 |
67028 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Cowpeas (white) | 100 KG | aggregate | Wholesale | 29440.00 | 63.8992 |
67029 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Cowpeas (white) | KG | aggregate | Retail | 274.82 | 0.5965 |
67030 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Groundnuts (shelled) | 100 KG | aggregate | Wholesale | 46960.00 | 101.9261 |
67031 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Groundnuts (shelled) | KG | aggregate | Retail | 467.66 | 1.0151 |
67032 rows × 13 columns
encoder = OneHotEncoder(drop='first')
def encoding_categorical_data(df, column):
df = pd.concat([df, pd.DataFrame(encoder.fit_transform(df[[column]]).toarray(), columns=encoder.get_feature_names_out([column])).astype(int)], axis=1)
# Removing un necessary column
df.drop(column, axis=1, inplace=True)
return df
data = raw_data.copy()
for col in ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']:
data = encoding_categorical_data(data, col)
for col in ['Price_Type', 'Price_Flag', 'Unit', 'Commodity', 'Category']:
test_data = encoding_categorical_data(test_data, col)
data.shape
(67032, 80)
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 67032 entries, 0 to 67031 Data columns (total 80 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 67032 non-null datetime64[ns] 1 Admin1 67032 non-null object 2 Admin2 67032 non-null object 3 Market 67032 non-null object 4 Latitude 67032 non-null float64 5 Longitude 67032 non-null float64 6 Price 67032 non-null float64 7 USD_Price 67032 non-null float64 8 Price_Type_Wholesale 67032 non-null int32 9 Price_Flag_actual,aggregate 67032 non-null int32 10 Price_Flag_aggregate 67032 non-null int32 11 Unit_1.1 KG 67032 non-null int32 12 Unit_1.2 KG 67032 non-null int32 13 Unit_1.3 KG 67032 non-null int32 14 Unit_1.4 KG 67032 non-null int32 15 Unit_1.5 KG 67032 non-null int32 16 Unit_100 KG 67032 non-null int32 17 Unit_100 L 67032 non-null int32 18 Unit_100 Tubers 67032 non-null int32 19 Unit_2.1 KG 67032 non-null int32 20 Unit_20 G 67032 non-null int32 21 Unit_250 G 67032 non-null int32 22 Unit_3.1 KG 67032 non-null int32 23 Unit_3.4 KG 67032 non-null int32 24 Unit_30 pcs 67032 non-null int32 25 Unit_300 G 67032 non-null int32 26 Unit_400 G 67032 non-null int32 27 Unit_50 KG 67032 non-null int32 28 Unit_750 ML 67032 non-null int32 29 Unit_KG 67032 non-null int32 30 Unit_L 67032 non-null int32 31 Unit_Unit 67032 non-null int32 32 Commodity_Beans (niebe) 67032 non-null int32 33 Commodity_Beans (red) 67032 non-null int32 34 Commodity_Beans (white) 67032 non-null int32 35 Commodity_Bread 67032 non-null int32 36 Commodity_Cassava meal (gari, yellow) 67032 non-null int32 37 Commodity_Cowpeas 67032 non-null int32 38 Commodity_Cowpeas (brown) 67032 non-null int32 39 Commodity_Cowpeas (white) 67032 non-null int32 40 Commodity_Eggs 67032 non-null int32 41 Commodity_Fish 67032 non-null int32 42 Commodity_Fuel (diesel) 67032 non-null int32 43 Commodity_Fuel (petrol-gasoline) 67032 non-null int32 44 Commodity_Gari (white) 67032 non-null int32 45 Commodity_Groundnuts 67032 non-null int32 46 Commodity_Groundnuts (shelled) 67032 non-null int32 47 Commodity_Maize 67032 non-null int32 48 Commodity_Maize (white) 67032 non-null int32 49 Commodity_Maize (yellow) 67032 non-null int32 50 Commodity_Maize flour 67032 non-null int32 51 Commodity_Meat (beef) 67032 non-null int32 52 Commodity_Meat (goat) 67032 non-null int32 53 Commodity_Milk 67032 non-null int32 54 Commodity_Millet 67032 non-null int32 55 Commodity_Oil (palm) 67032 non-null int32 56 Commodity_Oil (vegetable) 67032 non-null int32 57 Commodity_Onions 67032 non-null int32 58 Commodity_Oranges 67032 non-null int32 59 Commodity_Rice (imported) 67032 non-null int32 60 Commodity_Rice (local) 67032 non-null int32 61 Commodity_Rice (milled, local) 67032 non-null int32 62 Commodity_Salt 67032 non-null int32 63 Commodity_Sorghum 67032 non-null int32 64 Commodity_Sorghum (brown) 67032 non-null int32 65 Commodity_Sorghum (white) 67032 non-null int32 66 Commodity_Spinach 67032 non-null int32 67 Commodity_Sugar 67032 non-null int32 68 Commodity_Tomatoes 67032 non-null int32 69 Commodity_Watermelons 67032 non-null int32 70 Commodity_Wheat 67032 non-null int32 71 Commodity_Yam 67032 non-null int32 72 Commodity_Yam (Abuja) 67032 non-null int32 73 Category_meat, fish and eggs 67032 non-null int32 74 Category_milk and dairy 67032 non-null int32 75 Category_miscellaneous food 67032 non-null int32 76 Category_non-food 67032 non-null int32 77 Category_oil and fats 67032 non-null int32 78 Category_pulses and nuts 67032 non-null int32 79 Category_vegetables and fruits 67032 non-null int32 dtypes: datetime64[ns](1), float64(4), int32(72), object(3) memory usage: 22.5+ MB
data.describe()
Latitude | Longitude | Price | USD_Price | Price_Type_Wholesale | Price_Flag_actual,aggregate | Price_Flag_aggregate | Unit_1.1 KG | Unit_1.2 KG | Unit_1.3 KG | ... | Commodity_Wheat | Commodity_Yam | Commodity_Yam (Abuja) | Category_meat, fish and eggs | Category_milk and dairy | Category_miscellaneous food | Category_non-food | Category_oil and fats | Category_pulses and nuts | Category_vegetables and fruits | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | ... | 67032.000000 | 67032.000000 | 67032.0000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 | 67032.000000 |
mean | 11.411172 | 10.273114 | 5904.063438 | 35.625326 | 0.320847 | 0.006997 | 0.217717 | 0.008832 | 0.052811 | 0.086645 | ... | 0.000910 | 0.048171 | 0.0010 | 0.076575 | 0.019185 | 0.033074 | 0.043681 | 0.060717 | 0.168979 | 0.107919 |
std | 1.689944 | 2.934114 | 11435.785644 | 362.031271 | 0.466805 | 0.083354 | 0.412697 | 0.093561 | 0.223657 | 0.281317 | ... | 0.030153 | 0.214129 | 0.0316 | 0.265918 | 0.137175 | 0.178831 | 0.204385 | 0.238813 | 0.374736 | 0.310280 |
min | 5.147640 | 3.401897 | 1.000000 | 0.002300 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 11.280160 | 8.429120 | 200.000000 | 0.600400 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 11.831100 | 11.166670 | 446.745000 | 1.307900 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 12.362300 | 13.092646 | 6862.500000 | 28.969800 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.0000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
max | 13.645000 | 13.264360 | 138000.000000 | 28662.419400 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | ... | 1.000000 | 1.000000 | 1.0000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
8 rows × 76 columns
test_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 13949 entries, 0 to 13948 Data columns (total 77 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 13949 non-null datetime64[ns] 1 Admin1 13949 non-null object 2 Admin2 13949 non-null object 3 Market 13949 non-null object 4 Latitude 13949 non-null float64 5 Longitude 13949 non-null float64 6 Price 13949 non-null float64 7 USD_Price 13949 non-null float64 8 Price_Type_Wholesale 13949 non-null int32 9 Unit_1.1 KG 13949 non-null int32 10 Unit_1.2 KG 13949 non-null int32 11 Unit_1.3 KG 13949 non-null int32 12 Unit_1.4 KG 13949 non-null int32 13 Unit_1.5 KG 13949 non-null int32 14 Unit_100 KG 13949 non-null int32 15 Unit_100 L 13949 non-null int32 16 Unit_100 Tubers 13949 non-null int32 17 Unit_2.1 KG 13949 non-null int32 18 Unit_20 G 13949 non-null int32 19 Unit_250 G 13949 non-null int32 20 Unit_3.1 KG 13949 non-null int32 21 Unit_3.4 KG 13949 non-null int32 22 Unit_30 pcs 13949 non-null int32 23 Unit_300 G 13949 non-null int32 24 Unit_400 G 13949 non-null int32 25 Unit_50 KG 13949 non-null int32 26 Unit_750 ML 13949 non-null int32 27 Unit_KG 13949 non-null int32 28 Unit_L 13949 non-null int32 29 Unit_Unit 13949 non-null int32 30 Commodity_Beans (red) 13949 non-null int32 31 Commodity_Beans (white) 13949 non-null int32 32 Commodity_Bread 13949 non-null int32 33 Commodity_Cassava meal (gari, yellow) 13949 non-null int32 34 Commodity_Cowpeas 13949 non-null int32 35 Commodity_Cowpeas (brown) 13949 non-null int32 36 Commodity_Cowpeas (white) 13949 non-null int32 37 Commodity_Eggs 13949 non-null int32 38 Commodity_Fish 13949 non-null int32 39 Commodity_Fuel (diesel) 13949 non-null int32 40 Commodity_Fuel (petrol-gasoline) 13949 non-null int32 41 Commodity_Gari (white) 13949 non-null int32 42 Commodity_Groundnuts 13949 non-null int32 43 Commodity_Groundnuts (shelled) 13949 non-null int32 44 Commodity_Maize 13949 non-null int32 45 Commodity_Maize (white) 13949 non-null int32 46 Commodity_Maize (yellow) 13949 non-null int32 47 Commodity_Maize flour 13949 non-null int32 48 Commodity_Meat (beef) 13949 non-null int32 49 Commodity_Meat (goat) 13949 non-null int32 50 Commodity_Milk 13949 non-null int32 51 Commodity_Millet 13949 non-null int32 52 Commodity_Oil (palm) 13949 non-null int32 53 Commodity_Oil (vegetable) 13949 non-null int32 54 Commodity_Onions 13949 non-null int32 55 Commodity_Oranges 13949 non-null int32 56 Commodity_Rice (imported) 13949 non-null int32 57 Commodity_Rice (local) 13949 non-null int32 58 Commodity_Rice (milled, local) 13949 non-null int32 59 Commodity_Salt 13949 non-null int32 60 Commodity_Sorghum 13949 non-null int32 61 Commodity_Sorghum (brown) 13949 non-null int32 62 Commodity_Sorghum (white) 13949 non-null int32 63 Commodity_Spinach 13949 non-null int32 64 Commodity_Sugar 13949 non-null int32 65 Commodity_Tomatoes 13949 non-null int32 66 Commodity_Watermelons 13949 non-null int32 67 Commodity_Wheat 13949 non-null int32 68 Commodity_Yam 13949 non-null int32 69 Commodity_Yam (Abuja) 13949 non-null int32 70 Category_meat, fish and eggs 13949 non-null int32 71 Category_milk and dairy 13949 non-null int32 72 Category_miscellaneous food 13949 non-null int32 73 Category_non-food 13949 non-null int32 74 Category_oil and fats 13949 non-null int32 75 Category_pulses and nuts 13949 non-null int32 76 Category_vegetables and fruits 13949 non-null int32 dtypes: datetime64[ns](1), float64(4), int32(69), object(3) memory usage: 4.5+ MB
data
Date | Admin1 | Admin2 | Market | Latitude | Longitude | Price | USD_Price | Price_Type_Wholesale | Price_Flag_actual,aggregate | ... | Commodity_Wheat | Commodity_Yam | Commodity_Yam (Abuja) | Category_meat, fish and eggs | Category_milk and dairy | Category_miscellaneous food | Category_non-food | Category_oil and fats | Category_pulses and nuts | Category_vegetables and fruits | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | 175.92 | 1.5525 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | 150.18 | 1.3254 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | 358.70 | 3.1656 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | 155.61 | 1.3733 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | 196.87 | 1.7374 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
67027 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | 298.55 | 0.6480 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67028 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | 29440.00 | 63.8992 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67029 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | 274.82 | 0.5965 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67030 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | 46960.00 | 101.9261 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67031 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | 467.66 | 1.0151 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67032 rows × 80 columns
data = data.drop(['Date', 'Admin1', 'Admin2', 'Market'], axis=1)
raw_data
Date | Admin1 | Admin2 | Market | Latitude | Longitude | Category | Commodity | Unit | Price_Flag | Price_Type | Price | USD_Price | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Maize | KG | actual | Wholesale | 175.92 | 1.5525 |
1 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Millet | KG | actual | Wholesale | 150.18 | 1.3254 |
2 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Rice (imported) | KG | actual | Wholesale | 358.70 | 3.1656 |
3 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | cereals and tubers | Sorghum | KG | actual | Wholesale | 155.61 | 1.3733 |
4 | 2002-01-15 | Katsina | Jibia | Jibia (CBM) | 13.08000 | 7.24000 | pulses and nuts | Beans (niebe) | KG | actual | Wholesale | 196.87 | 1.7374 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
67027 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Cowpeas (brown) | KG | aggregate | Retail | 298.55 | 0.6480 |
67028 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Cowpeas (white) | 100 KG | aggregate | Wholesale | 29440.00 | 63.8992 |
67029 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Cowpeas (white) | KG | aggregate | Retail | 274.82 | 0.5965 |
67030 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Groundnuts (shelled) | 100 KG | aggregate | Wholesale | 46960.00 | 101.9261 |
67031 | 2023-01-15 | Zamfara | Kaura Namoda | Kaura Namoda | 12.59519 | 6.58635 | pulses and nuts | Groundnuts (shelled) | KG | aggregate | Retail | 467.66 | 1.0151 |
67032 rows × 13 columns
data.duplicated()
0 False 1 False 2 False 3 False 4 False ... 67027 False 67028 False 67029 False 67030 False 67031 False Length: 67032, dtype: bool
co_relation = data.corr()
plt.figure(figsize=(18, 8)) # Swapping width and height for a horizontal plot
sns.heatmap(co_relation, xticklabels=co_relation.columns, yticklabels=co_relation.columns, annot=True)
<Axes: >
y_variables = data[['Price']]
x_variables = data.drop(['Price','USD_Price'], axis=1)
x_variables
Latitude | Longitude | Price_Type_Wholesale | Price_Flag_actual,aggregate | Price_Flag_aggregate | Unit_1.1 KG | Unit_1.2 KG | Unit_1.3 KG | Unit_1.4 KG | Unit_1.5 KG | ... | Commodity_Wheat | Commodity_Yam | Commodity_Yam (Abuja) | Category_meat, fish and eggs | Category_milk and dairy | Category_miscellaneous food | Category_non-food | Category_oil and fats | Category_pulses and nuts | Category_vegetables and fruits | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 13.08000 | 7.24000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 13.08000 | 7.24000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 13.08000 | 7.24000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 13.08000 | 7.24000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 13.08000 | 7.24000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
67027 | 12.59519 | 6.58635 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67028 | 12.59519 | 6.58635 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67029 | 12.59519 | 6.58635 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67030 | 12.59519 | 6.58635 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67031 | 12.59519 | 6.58635 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
67032 rows × 74 columns
# train, test split......
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size :", (len(X_train)/len(x_variables))*100)
print("Train Size :", (len(X_test)/len(x_variables))*100)
X_train Size : 53625 Y_train Size : 53625 X_test Size : 13407 Y_test Size : 13407 Train Size : 79.99910490511995 Train Size : 20.000895094880057
Several circumstances that influence the dependent variable simultaneously can be controlled through multiple regression analysis. Regression analysis is a method of analyzing the relationship between independent variables and dependent variables.
Let $k$ represent the number of variables denoted by $x_1, x_2, x_3, ……, x_k$.
For this method, we assume that we have $k$ independent variables $x_1, . . . , x_k$ that we can set, then they probabilistically determine an outcome $Y$.
Furthermore, we assume that $Y$ is linearly dependent on the factors according to
$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + · · · + \beta_kx_k + \epsilon$$The variable $y_i$ is dependent or predicted The slope of $y$ depends on the y-intercept, that is, when $x_i$ and $x_2$ are both zero, $y$ will be $\beta_0$. The regression coefficients $\beta_1$ and $\beta_2$ represent the change in $y$ as a result of one-unit changes in $x_{i1}$ and $x_{i2}$. $\beta_p$ refers to the slope coefficient of all independent variables $\epsilon$ term describes the random error (residual) in the model. Where $\epsilon$ is a standard error, this is just like we had for simple linear regression, except $k$ doesn’t have to be 1.
We have $n$ observations, $n$ typically being much more than $k$.
For $i$ th observation, we set the independent variables to the values $x_{i1}, x_{i2} . . . , x_{ik}$ and measure a value $y_i$ for the random variable $Y_i$.
Thus, the model can be described by the equations.
$$Y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + · · · + \beta_kx_{ik} + i$$for $i = 1, 2, . . . , n$,
Where the errors $i$ are independent standard variables, each with mean 0 and the same unknown variance $\sigma^2$.
Altogether the model for multiple linear regression has $k + 2$ unknown parameters:
$\beta_0, \beta_1, . . . , \beta_k$, and $\sigma^2$.
When $k$ was equal to 1, we found the least squares line $y = \hat{\beta}_0 +\hat{\beta}_1x$.
It was a line in the plane $R^2$.
Now, with $k ≥ 1$, we’ll have a least squares hyperplane.
$$y = \hat{\beta}_0 + \hat{\beta}_1x_1 + \hat{\beta}_2x_2 + · · · + \hat{\beta}_kx_k$$in $R^{k+1}$.
The way to find the estimators $\hat{\beta}_0, \hat{\beta}_1, . . ., and \hat{\beta}_k$ is the same.
Take the partial derivatives of the squared error.
$$Q = \sum_{i=1}^{n} (y_i − (\beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + · · · + \beta_kx_{ik}))^2$$When that system is solved we have fitted values
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1x_{i1} + \hat{\beta}_2x_{i2} + · · · + \hat{\beta}_kx_{ik}$$for $i = 1, . . . , n$ that should be close to the actual values $y_i$
# Linear Rergression
regr = LinearRegression()
regr.fit(X_train, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
regr.coef_
array([[ -496.37577043, 59.00419216, 7127.40673806, -1352.59510637, 7550.79319011, -3778.135254 , -3440.99173812, -3791.11471381, -4587.58949543, -3372.19435216, 5538.41622035, 10824.82777782, 29157.66962326, -1866.76080609, -758.24411606, -1279.93091436, -7528.21499248, -7596.86213421, -2625.80215273, -1941.83515747, 317.92836492, 3636.69856655, 11338.51910072, -9857.58587657, 6735.37409185, -2575.2182841 , -1870.13105442, -1726.39911311, -1764.77844386, -2575.2182841 , 606.46553692, -1870.37217076, 3411.03963047, 2277.51264975, -2625.80215273, -2633.23666723, -3878.24010093, -4036.89809262, -1012.79963229, -1722.67623345, 6714.54158886, 1337.60370884, -3583.79062706, -2379.76527505, 1701.05276651, 4665.97254177, 4709.96026691, -758.24411606, -344.69289164, -4390.94720729, -4379.87204189, -4090.90661068, -4218.01435362, 2395.96294133, 6096.6802472 , 8903.42047086, -1279.93091435, 1782.27837774, -2246.98054402, -2905.38240885, -1941.83515747, 1689.62362447, -3922.10442867, -1866.76080609, 1532.10846264, 5840.42898245, -7338.19608482, 4116.89398872, -758.24411606, 409.69271012, -7915.13819355, -8770.81924919, 3448.7368535 , 1659.70225979]])
len(regr.coef_)
1
regr.intercept_
array([7638.17668407])
print('Training data r-squared:', regr.score(X_train, y_train))
print('Test data r-squared:', regr.score(X_test, y_test))
Training data r-squared: 0.7633629459214317 Test data r-squared: 0.7691195969542567
pd.DataFrame(data = regr.coef_[0], index = X_train.columns, columns = ['Coef'])
Coef | |
---|---|
Latitude | -496.375770 |
Longitude | 59.004192 |
Price_Type_Wholesale | 7127.406738 |
Price_Flag_actual,aggregate | -1352.595106 |
Price_Flag_aggregate | 7550.793190 |
... | ... |
Category_miscellaneous food | 409.692710 |
Category_non-food | -7915.138194 |
Category_oil and fats | -8770.819249 |
Category_pulses and nuts | 3448.736854 |
Category_vegetables and fruits | 1659.702260 |
74 rows × 1 columns
Skewness is a statistical measure that describes the asymmetry or departure from symmetry in a probability distribution. In the context of data analysis, skewness is often used to assess the distribution of a variable. The skewness of a distribution can be positive, negative, or zero: Skewness is a statistical measure that describes the asymmetry or departure from symmetry in a probability distribution. In the context of data analysis, skewness is often used to assess the distribution of a variable. The skewness of a distribution can be positive, negative, or zero:
Positive Skewness (Right Skewed):
Negative Skewness (Left Skewed):
Zero Skewness:
The skewness of a variable (X) can be calculated using the formula:
$$ \text{Skewness} = \frac{1}{\sigma^3} E[(X-\mu)^3] $$where:
where (E) is the expectation operator, (\mu) is the mean of (X), and (\sigma) is the standard deviation of (X).
Now, regarding your question about applying skewness to (y) variables (target variables or dependent variables in a predictive modeling context), there are a couple of considerations:
Assumption of Regression Models:
Equalizing Variances:
Stabilizing Variances:
It's important to note that transforming the target variable should be done judiciously, and the choice of transformation depends on the characteristics of the data and the goals of the analysis. Common transformations include the logarithmic transformation, square root transformation, and Box-Cox transformation.
Always validate the impact of any transformation on your analysis, and consider the interpretability of the results after transformation. Additionally, be cautious about over-transforming the data, as it might lead to difficulties in interpreting the results.
plt.figure(figsize= (16,8))
plt.hist(data['Price'], bins =50, ec = 'black', color = 'blue', alpha = 0.7)# bins 50...setting color by hexcode
plt.xlabel('Price')
plt.ylabel('Nr of Houses')
plt.show()
plt.figure(figsize= (16,8))
plt.hist(data['USD_Price'], bins =50, ec = 'black', color = 'blue', alpha = 0.7)# bins 50...setting color by hexcode
plt.xlabel('Price')
plt.ylabel('Nr of Houses')
plt.show()
# The plot above has a skew.....Its not balanced by the middle value
# Lets find the skew..
data['USD_Price'].skew()
27.614708422713022
data['Price'].skew()
2.8713985830315463
y_log = np.log(data['Price'])
y_usd_log = np.log(data['USD_Price'])
y_log.skew()
0.6090003213244045
y_usd_log.skew()
0.6544660543802026
sns.set()
plt.figure(figsize=(20,6))
sns.histplot(y_log, kde=True)
plt.show()
plt.figure(figsize=(20,6))
sns.displot(y_usd_log, kind="kde")
<seaborn.axisgrid.FacetGrid at 0x2275db7f6a0>
<Figure size 2000x600 with 0 Axes>
sns.set()
plt.figure(figsize=(20,6))
sns.histplot(y_usd_log, kde=True)
plt.show()
plt.figure(figsize=(20,6))
sns.displot(y_usd_log, kind="kde")
<seaborn.axisgrid.FacetGrid at 0x2275db0c430>
<Figure size 2000x600 with 0 Axes>
print('Intercept', regr.intercept_)# intercept after log transformation of prices
Intercept [7638.17668407]
# using the stats model to get the p value...
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.params
const 6992.523453 Latitude -496.375770 Longitude 59.004192 Price_Type_Wholesale 7127.406738 Price_Flag_actual,aggregate -1352.595106 ... Category_miscellaneous food 662.114666 Category_non-food -7709.652678 Category_oil and fats -8565.333733 Category_pulses and nuts 3722.717541 Category_vegetables and fruits 1967.930534 Length: 75, dtype: float64
data['Price'] = y_log
x_variables = data.drop(['Price','USD_Price'], axis=1)
y_variables = data[['Price']]
# train, test split......
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
regr = LinearRegression()
regr.fit(X_train, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
LinearRegression()
print('Training data r-squared:', regr.score(X_train, y_train))
print('Test data r-squared:', regr.score(X_test, y_test))
Training data r-squared: 0.955473671807045 Test data r-squared: 0.9530163840929582
r_squared = regr.score(X_test, y_test)
1/(1 - r_squared)
21.28401530394177
pd.DataFrame(data=regr.coef_[0], index=X_train.columns, columns=['coef'])
coef | |
---|---|
Latitude | -0.043979 |
Longitude | 0.005610 |
Price_Type_Wholesale | 0.695027 |
Price_Flag_actual,aggregate | -0.024670 |
Price_Flag_aggregate | 0.609545 |
... | ... |
Category_miscellaneous food | 0.168125 |
Category_non-food | -1.914440 |
Category_oil and fats | -1.093030 |
Category_pulses and nuts | 0.453877 |
Category_vegetables and fruits | -0.195025 |
74 rows × 1 columns
print('Intercept', regr.intercept_)# intercept after log transformation of prices
Intercept [6.68274382]
# using the stats model to get the p value...
results = sm.OLS(y_train, sm.add_constant(X_train)).fit()
results.params
const 6.117853 Latitude -0.043979 Longitude 0.005610 Price_Type_Wholesale 0.695027 Price_Flag_actual,aggregate -0.024670 ... Category_miscellaneous food 0.388973 Category_non-food -1.734658 Category_oil and fats -0.913248 Category_pulses and nuts 0.693586 Category_vegetables and fruits 0.074648 Length: 75, dtype: float64
results.pvalues
const 0.000000e+00 Latitude 5.864980e-250 Longitude 1.532625e-11 Price_Type_Wholesale 0.000000e+00 Price_Flag_actual,aggregate 2.700965e-01 ... Category_miscellaneous food 6.518373e-260 Category_non-food 1.576027e-85 Category_oil and fats 2.739426e-25 Category_pulses and nuts 0.000000e+00 Category_vegetables and fruits 3.880508e-03 Length: 75, dtype: float64
round(results.pvalues,3)
const 0.000 Latitude 0.000 Longitude 0.000 Price_Type_Wholesale 0.000 Price_Flag_actual,aggregate 0.270 ... Category_miscellaneous food 0.000 Category_non-food 0.000 Category_oil and fats 0.000 Category_pulses and nuts 0.000 Category_vegetables and fruits 0.004 Length: 75, dtype: float64
X = X_train.assign(const=1)
vif = pd.Series([variance_inflation_factor(X.values, i)
for i in range(X.shape[1])],
index=X.columns)
C:\Users\bhara\anaconda3\envs\base-env\lib\site-packages\statsmodels\stats\outliers_influence.py:198: RuntimeWarning: divide by zero encountered in scalar divide vif = 1. / (1. - r_squared_i)
vif
Latitude 1.425167 Longitude 1.767381 Price_Type_Wholesale 12.763579 Price_Flag_actual,aggregate 1.020677 Price_Flag_aggregate 1.630952 ... Category_non-food inf Category_oil and fats inf Category_pulses and nuts inf Category_vegetables and fruits inf const inf Length: 75, dtype: float64