The fish market dataset is a collection of data related to various species of fish and their characteristics. This dataset is designed for polynomial regression analysis and contains several columns with specific information. Here's a description of each column in the dataset:
Species: This column represents the species of the fish. It is a categorical variable that assigns each fish to one of seven species: "Perch," "Bream," "Roach," "Pike," "Smelt," "Parkki," and "Whitefish." It serves as a categorical predictor in the polynomial regression analysis, where the aim is to predict the fish's weight from its other attributes.
Weight: This column represents the weight of the fish. It is a numerical variable that is typically measured in grams. The weight is the dependent variable we want to predict using polynomial regression.
Length1: This column represents the first measurement of the fish's length. It is a numerical variable, typically measured in centimetres.
Length2: This column represents the second measurement of the fish's length. It is another numerical variable, typically measured in centimetres.
Length3: This column represents the third measurement of the fish's length. Similar to the previous two columns, it is a numerical variable, usually measured in centimetres.
Height: This column represents the height of the fish. It is a numerical variable, typically measured in centimetres.
Width: This column represents the width of the fish. Like the other numerical variables, it is also typically measured in centimetres.
The dataset is structured in such a way that each row corresponds to a single fish with its species and various physical measurements (lengths, height, and width). The goal of using polynomial regression on this dataset would be to build a predictive model that can estimate the weight of a fish based on its species and the provided physical measurements. Polynomial regression allows for modelling more complex relationships between the independent variables (lengths, height, and width) and the dependent variable (weight), which may be particularly useful if there are non-linear patterns in the data.
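Before the detailed walk-through that follows, here is a minimal, self-contained sketch of that modelling idea. The file path, the choice of features, and the degree-2 polynomial are assumptions made purely for illustration; the notebook below builds the model step by step with proper preprocessing.
## Minimal preview of the modelling idea (path, features, and degree are illustrative assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
## Load the fish measurements (adjust the path to your local copy)
fish = pd.read_csv("./Data/Fish.csv")
## Predict Weight from a few numerical measurements
X = fish[["Length1", "Height", "Width"]]
y = fish["Weight"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)
## Expand the features to a degree-2 polynomial, then fit ordinary least squares
poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
print(f"Test R-square: {model.score(poly.transform(X_te), y_te):.2f}")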
1. Introduction
Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document
aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.
2. Objectives of Exploratory Data Analysis
1. Understand Data Characteristics:
Gain insights into the distribution, central tendency, and variability of the data.
Identify the presence of missing values, outliers, and anomalies.
2. Explore Relationships:
Examine correlations and dependencies between different variables.
Uncover potential patterns and trends within the dataset.
3. Visualize Data Distributions:
Utilize graphical representations to visualize the distribution of data.
Choose appropriate plots such as histograms, box plots, and scatter plots.
4. Identify Patterns and Anomalies:
Uncover hidden patterns that may not be apparent in raw data.
Detect outliers and anomalies that could impact analysis outcomes.
3. Techniques and Tools
1. Descriptive Statistics:
Calculate measures such as mean, median, and standard deviation.
Utilize summary statistics to provide an overview of the dataset.
2. Data Visualization:
Employ graphical representations like histograms, box plots, and scatter plots.
Create visualizations to illustrate trends, patterns, and relationships.
3. Correlation Analysis:
Use correlation matrices to quantify the relationships between variables.
Identify strong positive/negative correlations and potential multicollinearity.
4. Outlier Detection:
Apply statistical methods or visual inspection to identify outliers.
Assess the impact of outliers on the analysis and consider appropriate handling (a minimal pandas sketch of these techniques follows this list).
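The following is a minimal pandas sketch of the descriptive-statistics, correlation, and IQR-based outlier-detection techniques listed above. The toy DataFrame and the 1.5×IQR rule are illustrative assumptions; the fish dataset itself is analysed in detail further below.
## Toy DataFrame (placeholder values) to illustrate the techniques above
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [2, 4, 6, 8, 10]})
## 1. Descriptive statistics: central tendency and dispersion
print(df.describe())
## 2. Data visualization (histograms, box plots, scatter plots) is shown on the fish data later in this document
## 3. Correlation analysis: pairwise correlation matrix
print(df.corr())
## 4. Outlier detection with the 1.5 * IQR rule
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["x"] < q1 - 1.5 * iqr) | (df["x"] > q3 + 1.5 * iqr)])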
4. Steps in Exploratory Data Analysis
1. Data Collection:
Gather the raw dataset from reliable sources.
2. Data Cleaning:
Handle missing values, duplicate entries, and inconsistencies.
Ensure data is in a suitable format for analysis.
3. Descriptive Statistics:
Compute basic statistics to describe the central tendency and dispersion.
4. Visualization:
Generate visualizations to explore data distributions and relationships.
5. Correlation Analysis:
Investigate correlations between variables.
6. Outlier Detection:
Identify and analyze outliers to understand their impact.
5. Case Study: Applying EDA to Real-World Data
The remainder of this document applies EDA to the fish market dataset described above, showcasing the step-by-step process and the insights gained.
6. Conclusion
Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.
7. References
Include references to any tools, libraries, or methodologies used in the EDA process.
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Documentation
import handcalcs.render
# Plot
%matplotlib inline
from matplotlib import colors
from matplotlib import cm # color map
import plotly.express as px
# Importing detect_outliers function from datasist library
from datasist.structdata import detect_outliers
from sympy import Sum, symbols, Indexed, lambdify, diff
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
# Path
data_path = './Data/'
raw_data = pd.read_csv(data_path+"Fish.csv", low_memory=False).reset_index(drop=True)
raw_data.shape
(159, 7)
raw_data
| | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.5200 | 4.0200 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.4800 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.7300 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.4440 | 5.1340 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | Smelt | 12.2 | 11.5 | 12.2 | 13.4 | 2.0904 | 1.3936 |
| 155 | Smelt | 13.4 | 11.7 | 12.4 | 13.5 | 2.4300 | 1.2690 |
| 156 | Smelt | 12.2 | 12.1 | 13.0 | 13.8 | 2.2770 | 1.2558 |
| 157 | Smelt | 19.7 | 13.2 | 14.3 | 15.2 | 2.8728 | 2.0672 |
| 158 | Smelt | 19.9 | 13.8 | 15.0 | 16.2 | 2.9322 | 1.8792 |

159 rows × 7 columns
columns = raw_data.columns
columns
Index(['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width'], dtype='object')
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Species  159 non-null    object
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
raw_data.describe()
| | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|
| count | 159.000000 | 159.000000 | 159.000000 | 159.000000 | 159.000000 | 159.000000 |
| mean | 398.326415 | 26.247170 | 28.415723 | 31.227044 | 8.970994 | 4.417486 |
| std | 357.978317 | 9.996441 | 10.716328 | 11.610246 | 4.286208 | 1.685804 |
| min | 0.000000 | 7.500000 | 8.400000 | 8.800000 | 1.728400 | 1.047600 |
| 25% | 120.000000 | 19.050000 | 21.000000 | 23.150000 | 5.944800 | 3.385650 |
| 50% | 273.000000 | 25.200000 | 27.300000 | 29.400000 | 7.786000 | 4.248500 |
| 75% | 650.000000 | 32.700000 | 35.500000 | 39.650000 | 12.365900 | 5.584500 |
| max | 1650.000000 | 59.000000 | 63.400000 | 68.000000 | 18.957000 | 8.142000 |
raw_data.nunique()
Species      7
Weight     101
Length1    116
Length2     93
Length3    124
Height     154
Width      152
dtype: int64
raw_data.isnull().sum()
Species    0
Weight     0
Length1    0
Length2    0
Length3    0
Height     0
Width      0
dtype: int64
for column in raw_data.columns:
if column in ['Species']:
print("-------------------------------------------------",column," - ",len(raw_data[column].unique()),"---------------------------------------------------")
print(raw_data[column].unique())
print("--------------------------------------------------------------------------------------------------------------")
------------------------------------------------- Species - 7 ---------------------------------------------------
['Bream' 'Roach' 'Whitefish' 'Parkki' 'Perch' 'Pike' 'Smelt']
--------------------------------------------------------------------------------------------------------------
raw_data.Species.value_counts()
Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       11
Whitefish     6
Name: Species, dtype: int64
# Setting seaborn visualization parameters
sns.set(rc={"figure.figsize" : [15,8]}, font_scale=1.2)
sns.set(rc={"axes.facecolor":"#F2F3F4","figure.facecolor":"#F2F3F4"})
palette = ["#F08080", "#FA8072", "#E9967A", "#FFA07A", "#CD5C5C", "#AF601A", "#CA6F1E"]
sns.set_palette(palette)
color_map = colors.ListedColormap(palette)
## Extracting numerical columns from the raw_data DataFrame
numerical_columns = raw_data.select_dtypes(include="number").columns.to_list()
## fig. size
plt.figure(figsize=(20, 15))
## Looping through each numerical column for plotting
for idx, column in enumerate(numerical_columns):
## Creating subplots within the grid
plt.subplot(2, 3, idx+1)
## Plotting histogram with KDE for the current numerical column
sns.histplot(data=raw_data,
x=column,
bins=25,
kde=True)
## title for the subplot
plt.title(f"{column} distribution.", weight="bold")
## overall title
plt.suptitle("Distribution for Numerical Columns".title(), weight="bold", fontsize=25, x=0.5, y=0.92, color="#CA6F1E")
## Displaying the plot
plt.show()
plt.figure(figsize=(20, 17))
for idx, column in enumerate(numerical_columns):
plt.subplot(2, 3, idx+1)
sns.boxplot(data=raw_data, x=column)
sns.swarmplot(data=raw_data, x=column, color="k")
plt.title(f"{column} distribution .", weight="bold")
plt.suptitle("Detect Outliers .".title(), weight="bold", fontsize=25, x=0.5, y=0.91, color="#CA6F1E")
plt.show()
# Detect Outliers Using datasist library
idx = detect_outliers(
data=raw_data,
n=0, ## the benchmark for the number of allowable outliers in the columns
features=['Weight', 'Length1', 'Length2', 'Length3']
)
raw_data.iloc[idx]
| | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 142 | Pike | 1600.0 | 56.0 | 60.0 | 64.0 | 9.600 | 6.144 |
| 143 | Pike | 1550.0 | 56.0 | 60.0 | 64.0 | 9.600 | 6.144 |
| 144 | Pike | 1650.0 | 59.0 | 63.4 | 68.0 | 10.812 | 7.480 |
Encoding categorical data is a crucial step in preparing data for machine learning models, as many algorithms require numerical input. Categorical data represents variables that can take on a limited, and usually fixed, number of values. Common encoding techniques include one-hot encoding and label encoding; one-hot encoding is the idea behind the OneHotEncoder set up in the ColumnTransformer further below.
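As a quick, hedged illustration (not part of the modelling pipeline itself), the Species column can be expanded into one indicator column per species with pandas; pd.get_dummies is used here purely for demonstration.
## Minimal one-hot encoding sketch for the Species column (illustration only;
## the pipeline further below is set up to use sklearn's OneHotEncoder instead)
species_dummies = pd.get_dummies(raw_data["Species"], prefix="Species")
print(species_dummies.head())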
raw_data.describe(exclude="number")
| | Species |
|---|---|
| count | 159 |
| unique | 7 |
| top | Perch |
| freq | 56 |
raw_data["Species"].value_counts(normalize=True).to_frame()
| | Species |
|---|---|
| Perch | 0.352201 |
| Bream | 0.220126 |
| Roach | 0.125786 |
| Pike | 0.106918 |
| Smelt | 0.088050 |
| Parkki | 0.069182 |
| Whitefish | 0.037736 |
## Create a count plot for the "Species" column
ax = sns.countplot(data=raw_data, y="Species", order=raw_data["Species"].value_counts().index)
## Add counts on top of each bar
for container in ax.containers:
ax.bar_label(container, label_type="center", color="k")
## Add title and labels
plt.title("Distribution of Fish Species", fontsize=15, weight="bold", color="#CA6F1E")
plt.xlabel("Count")
plt.ylabel("Fish Species")
plt.grid(axis="x", linestyle="--", alpha=0.6, c="k")
## Show plot
plt.show()
condition = raw_data.columns.str.contains("Length")
all_length = raw_data.columns[condition].tolist()
plt.figure(figsize=(15,5))
for idx, column in enumerate(all_length):
plt.subplot(1, 3, idx+1)
sns.scatterplot(data=raw_data, x="Weight", y=column)
plt.suptitle("Correlation between Weight and Length", weight="bold", color="#CA6F1E")
plt.show()
raw_data[["Length1", "Length2", "Length3", "Weight"]].corr()
| | Length1 | Length2 | Length3 | Weight |
|---|---|---|---|---|
| Length1 | 1.000000 | 0.999517 | 0.992031 | 0.915712 |
| Length2 | 0.999517 | 1.000000 | 0.994103 | 0.918618 |
| Length3 | 0.992031 | 0.994103 | 1.000000 | 0.923044 |
| Weight | 0.915712 | 0.918618 | 0.923044 | 1.000000 |
sns.boxplot(data=raw_data, x="Weight", y="Species")
plt.title('Boxplot of Weight by Species .', weight="bold", color="#CA6F1E")
plt.xlabel('Weight')
plt.ylabel('Species')
plt.grid(True)
plt.show()
## Create barplot using seaborn between species and the mean of weight
sns.barplot(
data=raw_data,
x="Species",
y="Weight",
errorbar=None,  ## disable error bars
estimator='mean'
)
## title and labels name
plt.title('Mean of Fish Weight by Species .', weight="bold", color="#CA6F1E")
plt.xlabel('Species')
plt.ylabel('Weight')
## Show plot
plt.show()
plt.figure(figsize=(15,5))
for idx, column in enumerate(all_length):
plt.subplot(1, 3, idx+1)
sns.scatterplot(data=raw_data, x="Weight", y=column, hue="Species")
plt.suptitle("Correlation between Weight and Length by Species .", weight="bold", color="#CA6F1E")
plt.show()
raw_data[["Height", "Width"]].corr()
| | Height | Width |
|---|---|---|
| Height | 1.000000 | 0.792881 |
| Width | 0.792881 | 1.000000 |
sns.scatterplot(data=raw_data, x="Height", y="Width", hue="Species", palette="Set1" )
plt.title('Scatter Plot of Height vs Width.', weight="bold", color="#CA6F1E")
plt.xlabel('Height')
plt.ylabel('Width')
plt.grid(True)
plt.show()
sns.scatterplot(data=raw_data, x="Height", y="Width", hue="Species", palette="Set1")
plt.title('Scatter Plot of Height vs Width', weight="bold", color="#CA6F1E")
plt.xlabel('Height')
plt.ylabel('Width')
plt.grid(True)
plt.show()
## Selecting rows where the species is "Pike" and extracting columns "Height" and "Width"
pike_data = raw_data[raw_data["Species"] == "Pike"][["Height", "Width"]]
## Calculating the correlation coefficient between the height and width of Pike
corr_coeff = pike_data.corr().iloc[0, 1].round(2)
print(f"correlation between height and width for pike type : {corr_coeff}")
correlation between height and width for pike type : 0.97
## Selecting rows where the species is "Smelt" and extracting columns "Height" and "Width"
smelt_data = raw_data[raw_data["Species"] == "Smelt"][["Height", "Width"]]
## Calculating the correlation coefficient between the height and width of Smelt
corr_coeff = smelt_data.corr().iloc[0, 1].round(2)
print(f"correlation between height and width for smelt type : {corr_coeff}")
correlation between height and width for smelt type : 0.87
raw_data[["Length1", "Length2", "Length3", "Height"]].corr()[["Height"]]
| | Height |
|---|---|
| Length1 | 0.625378 |
| Length2 | 0.640441 |
| Length3 | 0.703409 |
| Height | 1.000000 |
pair_plot = sns.pairplot(
data=raw_data,
hue="Species",
diag_kind="kde", # Use kernel density estimates on diagonal plots
height=3.5, # Set the height of each subplot
aspect=1.2 # Adjust the aspect ratio of the subplots
)
pair_plot.fig.suptitle("Pair Plot by Species", y=1.02, fontsize=30, weight="bold", color="#CA6F1E")
plt.show()
## Calculate the correlation matrix, excluding the 'Weight' column (target)
multi_corr = raw_data.drop("Weight", axis=1).corr(numeric_only=True)
## Create a mask for the upper triangle to hide it
mask = np.triu(np.ones_like(multi_corr), k=1)
## figure size
plt.figure(figsize=(10, 8))
## heatmap using seaborn
sns.heatmap(
multi_corr,
annot=True, # Display the correlation values
mask=mask, # Apply the mask to hide the upper triangle
square=True, # Ensure square-shaped cells
fmt="0.3f"
)
## title of the heatmap
plt.title("Correlation Heatmap Excluding Weight (Target) .", weight="bold", color="#CA6F1E")
## Show the plot
plt.show()
## Drop columns with multicollinearity
raw_data.drop(columns=["Length2", "Length3"], inplace=True)
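As an optional cross-check (not part of the original workflow), variance inflation factors quantify how much collinearity remains among the kept numerical features; this sketch assumes statsmodels is installed.
## Optional check: variance inflation factors (VIF) for the remaining numerical features
## (assumes statsmodels is available; values far above ~10 would indicate strong collinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
numeric_feats = raw_data.select_dtypes(include="number").drop(columns=["Weight"])
vif = pd.Series(
    [variance_inflation_factor(numeric_feats.values, i) for i in range(numeric_feats.shape[1])],
    index=numeric_feats.columns,
)
print(vif)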
target = "Weight"
x_variables = raw_data.drop(target, axis=1)
y_variables = pd.DataFrame(raw_data[target])
### Split Data - train, test
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size :", (len(X_train)/len(x_variables))*100)
print("Train Size :", (len(X_test)/len(x_variables))*100)
X_train Size : 127
Y_train Size : 127
X_test Size : 32
Y_test Size : 32
Train Size : 79.87421383647799
Test Size : 20.125786163522015
num_feat = x_variables.select_dtypes(include="number").columns.to_list()
categ_feat = y_variables.select_dtypes(exclude="number").columns.to_list()
print("After addressing multicollinearity, the remaining columns are:")
print(f"Numerical Feature in the Data ==> {num_feat}")
print(f"Categorical Feature in the Data ==> {categ_feat}")
After addressing multicollinearity, the remaining columns are:
Numerical Feature in the Data ==> ['Length1', 'Height', 'Width']
Categorical Feature in the Data ==> []
## Create a ColumnTransformer for preprocessing numerical and categorical features
preprocess_cols = ColumnTransformer([
("Numerical Cols", StandardScaler(), num_feat),
("Categorical Cols", OneHotEncoder(), categ_feat)
])
## Fit the ColumnTransformer on the training data..
preprocess_cols.fit(X_train)
## Transform the training and test data using the fitted ColumnTransformer
X_train_final = preprocess_cols.transform(X_train)
X_test_final = preprocess_cols.transform(X_test)
## Create an instance of the LinearRegression model
lin_reg = LinearRegression()
## Fit the model to the training data
## X_train_final: Features of the training set
## y_train: Target values of the training set
lin_reg.fit(X_train_final, y_train)
LinearRegression()
## Predictions on the training set using the trained Linear Regression model
lin_reg_train_predict = lin_reg.predict(X_train_final)
## Predictions on the testing set using the trained Linear Regression model
lin_reg_test_predict = lin_reg.predict(X_test_final)
## Mean Squared Error (MSE) for the training data
print(f"MSE For Training Data (Linear Reg.) : {mean_squared_error(y_train, lin_reg_train_predict).round(2)}")
## Mean Squared Error (MSE) for the testing data
print(f"MSE For Testing Data (Linear Reg.) : {mean_squared_error(y_test, lin_reg_test_predict).round(2)}")
print("**" * 50)
## Mean Absolute Error (MAE) for the training data
print(f"MAE For Training Data (Linear Reg.) : {mean_absolute_error(y_train, lin_reg_train_predict).round(2)}")
## Mean Absolute Error (MAE) for the testing data
print(f"MAE For Testing Data (Linear Reg.) : {mean_absolute_error(y_test, lin_reg_test_predict).round(2)}")
print("**" * 50)
## R-squared score for the training data
print(f"R-Square Score For Training Data (Linear Reg.) : {r2_score(y_train, lin_reg_train_predict).round(2) * 100} %")
## R-squared score for the testing data
print(f"R-Square Score For Testing Data (Linear Reg.) : {r2_score(y_test, lin_reg_test_predict).round(2) * 100} %")
MSE For Training Data (Linear Reg.) : 16099.32
MSE For Testing Data (Linear Reg.) : 11054.92
****************************************************************************************************
MAE For Training Data (Linear Reg.) : 99.2
MAE For Testing Data (Linear Reg.) : 84.99
****************************************************************************************************
R-Square Score For Training Data (Linear Reg.) : 89.0 %
R-Square Score For Testing Data (Linear Reg.) : 71.0 %
# Gradient Descent
## Create an instance of the SGDRegressor model with specific parameters
## penalty=None: No regularization penalty
## random_state=90: Seed for reproducibility
## learning_rate='constant': Constant learning rate
## eta0=0.02: Initial learning rate
SGD = SGDRegressor(
penalty=None, random_state=90, learning_rate='constant', eta0=0.02, max_iter=1000
)
## Fit the model to the training data
## X_train_final: Features of the training set
## y_train: Target values of the training set
SGD.fit(X_train_final, y_train.iloc[:, 0].values)
SGDRegressor(eta0=0.02, learning_rate='constant', penalty=None, random_state=90)
SGD_train_predict = SGD.predict(X_train_final)
SGD_test_predict = SGD.predict(X_test_final)
print(f"MSE For Training Data (SGD) : {mean_squared_error(y_train, SGD_train_predict).round(2)}")
print(f"MSE For Testing Data (SGD) : {mean_squared_error(y_test, SGD_test_predict).round(2)}")
print("**" * 50)
print(f"MAE For Training Data (SGD) : {mean_absolute_error(y_train, SGD_train_predict).round(2)}")
print(f"MAE For Testing Data (SGD) : {mean_absolute_error(y_test, SGD_test_predict).round(2)}")
print("**" * 50)
print(f"R-Square Score For Training Data (SGD) : {r2_score(y_train, SGD_train_predict).round(2) * 100} %")
print(f"R-Square Score For Testing Data (SGD) : {r2_score(y_test, SGD_test_predict).round(2) * 100} %")
MSE For Training Data (SGD) : 16156.4
MSE For Testing Data (SGD) : 11612.37
****************************************************************************************************
MAE For Training Data (SGD) : 99.49
MAE For Testing Data (SGD) : 86.94
****************************************************************************************************
R-Square Score For Training Data (SGD) : 89.0 %
R-Square Score For Testing Data (SGD) : 70.0 %
# cross validation
results = cross_val_score(estimator=SGD, X=X_train_final, y=y_train.iloc[:, 0].values, scoring="r2", cv=5)
print(f'CV Score using R-Square: {results}')
print(f'Mean of Results: {results.mean().round(2) * 100} %')
print(f'SD of results: {results.std().round(4)}') ## Standard Deviation
CV Score using R-Square: [0.86903092 0.87879141 0.81720361 0.89172989 0.87064825]
Mean of Results: 87.0 %
SD of results: 0.0254
# Create KDE plots
sns.kdeplot(data=y_test, linewidth=3, label='Actual Values')
sns.kdeplot(data=SGD_test_predict, linewidth=3, color="#64340A", label='Predicted Values')
# Set labels and title
plt.xlabel("Values")
plt.ylabel("Density")
plt.title("Kernel Density Estimate (KDE) Plot for Actual vs Predicted Values (SGD)", weight="bold", color="#CA6F1E")
# Display legend
plt.legend()
# Show the plot
plt.show()
Polynomial Regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$th degree polynomial. The general equation of a polynomial regression of degree $n$ is:
$$y = b_0 + b_1x + b_2x^2 + ... + b_nx^n$$
where $b_0, b_1, ..., b_n$ are the coefficients of the polynomial.
Fitting a polynomial regression to a dataset involves estimating the coefficients $b_0, b_1, ..., b_n$. This is typically done using a method such as Ordinary Least Squares (OLS), which minimizes the sum of the squared residuals:
$$\min_{b_0, b_1, ..., b_n} \sum_{i=1}^{m} (y_i - (b_0 + b_1x_i + b_2x_i^2 + ... + b_nx_i^n))^2$$
where $m$ is the number of observations, $y_i$ is the observed value for the $i$-th observation, and $x_i$ is the corresponding value of the independent variable.
One of the challenges with Polynomial Regression is selecting the right degree of the polynomial. A model with too high a degree can lead to overfitting, where the model fits the training data too closely and performs poorly on new data. Conversely, a model with too low a degree can lead to underfitting, where the model does not fit the training data well enough.
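Since choosing the degree is the main modelling decision here, the hedged sketch below compares a few candidate degrees by cross-validated R-square on the training split; the degree range and the 5-fold setup are assumptions, mirroring the cross_val_score usage elsewhere in this notebook. Degree 2 is used in the cells that follow.
## Hedged sketch: score candidate polynomial degrees with 5-fold cross-validation
for degree in [1, 2, 3, 4]:
    candidate_features = PolynomialFeatures(degree=degree).fit_transform(X_train_final)
    scores = cross_val_score(LinearRegression(), candidate_features, y_train, scoring="r2", cv=5)
    print(f"degree={degree}: mean CV R-square = {scores.mean():.3f} (std = {scores.std():.3f})")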
## Create an instance of PolynomialFeatures
poly = PolynomialFeatures(degree=2)
## Transform the features of the training set to include polynomial combinations
X_train_poly = poly.fit_transform(X_train_final)
## Transform the features of the testing set to include polynomial combinations
X_test_poly = poly.transform(X_test_final)
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
LinearRegression()
poly_train_predict = poly_reg.predict(X_train_poly)
poly_test_predict = poly_reg.predict(X_test_poly)
print(f"MSE For Training Data (Polynomial Regression) : {mean_squared_error(y_train, poly_train_predict).round(2)}")
print(f"MSE For Testing Data (Polynomial Regression) : {mean_squared_error(y_test, poly_test_predict).round(2)}")
print("**" * 50)
print(f"MAE For Training Data (Polynomial Regression) : {mean_absolute_error(y_train, poly_train_predict).round(2)}")
print(f"MAE For Testing Data (Polynomial Regression) : {mean_absolute_error(y_test, poly_test_predict).round(2)}")
print("**" * 50)
print(f"R-Square Score For Training Data (Polynomial Regression) : {r2_score(y_train, poly_train_predict).round(2) * 100} %")
print(f"R-Square Score For Testing Data (Polynomial Regression) : {r2_score(y_test, poly_test_predict).round(2) * 100} %")
MSE For Training Data (Polynomial Regression) : 2594.66
MSE For Testing Data (Polynomial Regression) : 1385.32
****************************************************************************************************
MAE For Training Data (Polynomial Regression) : 34.22
MAE For Testing Data (Polynomial Regression) : 25.14
****************************************************************************************************
R-Square Score For Training Data (Polynomial Regression) : 98.0 %
R-Square Score For Testing Data (Polynomial Regression) : 96.0 %
# Cross Validation
results = cross_val_score(
estimator=poly_reg, X=X_train_poly, y=y_train, scoring="r2", cv=5
)
print(f'CV Score using R-Square: {results}')
print(f'Mean of Results: {results.mean().round(2) * 100} %')
print(f'SD of results: {results.std().round(4)}') ## Standard Deviation
CV Score using R-Square: [0.97268534 0.97946351 0.95263976 0.98247351 0.96779586]
Mean of Results: 97.0 %
SD of results: 0.0105
# Plotting the actual vs predicted values
# Create KDE plots
sns.kdeplot(data=y_test, linewidth=3, label='Actual Values')
sns.kdeplot(data=poly_test_predict, linewidth=3, color="#64340A", label='Predicted Values')
# Set labels and title
plt.xlabel("Values")
plt.ylabel("Density")
plt.title("Kernel Density Estimate (KDE) Plot for Actual vs Predicted Values (SGD)", weight="bold", color="#CA6F1E")
# Display legend
plt.legend()
# Show the plot
plt.show()