The fish market dataset is a collection of data related to various species of fish and their characteristics. This dataset is designed for polynomial regression analysis and contains several columns with specific information. Here's a description of each column in the dataset:
Species: This column represents the species of the fish. It is a categorical variable that assigns each fish to one of seven species: "Perch," "Bream," "Roach," "Pike," "Smelt," "Parkki," and "Whitefish." It serves as a categorical predictor in the polynomial regression analysis, where the aim is to predict the fish's weight from its other attributes.
Weight: This column represents the weight of the fish. It is a numerical variable that is typically measured in grams. The weight is the dependent variable we want to predict using polynomial regression.
Length1: This column represents the first measurement of the fish's length. It is a numerical variable, typically measured in centimetres.
Length2: This column represents the second measurement of the fish's length. It is another numerical variable, typically measured in centimetres.
Length3: This column represents the third measurement of the fish's length. Similar to the previous two columns, it is a numerical variable, usually measured in centimetres.
Height: This column represents the height of the fish. It is a numerical variable, typically measured in centimetres.
Width: This column represents the width of the fish. Like the other numerical variables, it is also typically measured in centimetres.
The dataset is structured in such a way that each row corresponds to a single fish with its species and various physical measurements (lengths, height, and width). The goal of using polynomial regression on this dataset would be to build a predictive model that can estimate the weight of a fish based on its species and the provided physical measurements. Polynomial regression allows for modelling more complex relationships between the independent variables (lengths, height, and width) and the dependent variable (weight), which may be particularly useful if there are non-linear patterns in the data.
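Before the detailed walk-through that follows, here is a minimal, self-contained sketch of that modelling idea. The file path, the choice of features, and the degree-2 polynomial are assumptions made purely for illustration; the notebook below builds the model step by step with proper preprocessing.
## Minimal preview of the modelling idea (path, features, and degree are illustrative assumptions)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
## Load the fish measurements (adjust the path to your local copy)
fish = pd.read_csv("./Data/Fish.csv")
## Predict Weight from a few numerical measurements
X = fish[["Length1", "Height", "Width"]]
y = fish["Weight"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=10)
## Expand the features to a degree-2 polynomial, then fit ordinary least squares
poly = PolynomialFeatures(degree=2)
model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
print(f"Test R-square: {model.score(poly.transform(X_te), y_te):.2f}")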
1. Introduction
Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document
aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.
2. Objectives of Exploratory Data Analysis
1. Understand Data Characteristics:
Gain insights into the distribution, central tendency, and variability of the data.
Identify the presence of missing values, outliers, and anomalies.
2. Explore Relationships:
Examine correlations and dependencies between different variables.
Uncover potential patterns and trends within the dataset.
3. Visualize Data Distributions:
Utilize graphical representations to visualize the distribution of data.
Choose appropriate plots such as histograms, box plots, and scatter plots.
4. Identify Patterns and Anomalies:
Uncover hidden patterns that may not be apparent in raw data.
Detect outliers and anomalies that could impact analysis outcomes.
3. Techniques and Tools
1. Descriptive Statistics:
Calculate measures such as mean, median, and standard deviation.
Utilize summary statistics to provide an overview of the dataset.
2. Data Visualization:
Employ graphical representations like histograms, box plots, and scatter plots.
Create visualizations to illustrate trends, patterns, and relationships.
3. Correlation Analysis:
Use correlation matrices to quantify the relationships between variables.
Identify strong positive/negative correlations and potential multicollinearity.
4. Outlier Detection:
Apply statistical methods or visual inspection to identify outliers.
Assess the impact of outliers on the analysis and consider appropriate handling (a minimal pandas sketch of these techniques follows this list).
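The following is a minimal pandas sketch of the descriptive-statistics, correlation, and IQR-based outlier-detection techniques listed above. The toy DataFrame and the 1.5×IQR rule are illustrative assumptions; the fish dataset itself is analysed in detail further below.
## Toy DataFrame (placeholder values) to illustrate the techniques above
import pandas as pd
df = pd.DataFrame({"x": [1, 2, 3, 4, 100], "y": [2, 4, 6, 8, 10]})
## 1. Descriptive statistics: central tendency and dispersion
print(df.describe())
## 2. Data visualization (histograms, box plots, scatter plots) is shown on the fish data later in this document
## 3. Correlation analysis: pairwise correlation matrix
print(df.corr())
## 4. Outlier detection with the 1.5 * IQR rule
q1, q3 = df["x"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["x"] < q1 - 1.5 * iqr) | (df["x"] > q3 + 1.5 * iqr)])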
4. Steps in Exploratory Data Analysis
1. Data Collection:
Gather the raw dataset from reliable sources.
2. Data Cleaning:
Handle missing values, duplicate entries, and inconsistencies.
Ensure data is in a suitable format for analysis.
3. Descriptive Statistics:
Compute basic statistics to describe the central tendency and dispersion.
4. Visualization:
Generate visualizations to explore data distributions and relationships.
5. Correlation Analysis:
Investigate correlations between variables.
6. Outlier Detection:
Identify and analyze outliers to understand their impact.
5. Case Study: Applying EDA to Real-World Data
The remainder of this document applies EDA to the fish market dataset described above, showcasing the step-by-step process and the insights gained.
6. Conclusion
Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.
7. References
Include references to any tools, libraries, or methodologies used in the EDA process.
import pandas as pd
import numpy as np
# Plotting
import matplotlib.pyplot as plt
import seaborn as sns
# Documentation
import handcalcs.render
# Plot
%matplotlib inline
from matplotlib import colors
from matplotlib import cm # color map
import plotly.express as px
# Importing detect_outliers function from datasist library
from datasist.structdata import detect_outliers
from sympy import Sum, symbols, Indexed, lambdify, diff
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
# Path
data_path = './Data/'
raw_data = pd.read_csv(data_path+"Fish.csv", low_memory=False).reset_index(drop=True)
raw_data.shape
(159, 7)
raw_data
| | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 0 | Bream | 242.0 | 23.2 | 25.4 | 30.0 | 11.5200 | 4.0200 |
| 1 | Bream | 290.0 | 24.0 | 26.3 | 31.2 | 12.4800 | 4.3056 |
| 2 | Bream | 340.0 | 23.9 | 26.5 | 31.1 | 12.3778 | 4.6961 |
| 3 | Bream | 363.0 | 26.3 | 29.0 | 33.5 | 12.7300 | 4.4555 |
| 4 | Bream | 430.0 | 26.5 | 29.0 | 34.0 | 12.4440 | 5.1340 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 154 | Smelt | 12.2 | 11.5 | 12.2 | 13.4 | 2.0904 | 1.3936 |
| 155 | Smelt | 13.4 | 11.7 | 12.4 | 13.5 | 2.4300 | 1.2690 |
| 156 | Smelt | 12.2 | 12.1 | 13.0 | 13.8 | 2.2770 | 1.2558 |
| 157 | Smelt | 19.7 | 13.2 | 14.3 | 15.2 | 2.8728 | 2.0672 |
| 158 | Smelt | 19.9 | 13.8 | 15.0 | 16.2 | 2.9322 | 1.8792 |

159 rows × 7 columns
columns = raw_data.columns
columns
Index(['Species', 'Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width'], dtype='object')
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Species  159 non-null    object
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
raw_data.describe()
| | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|
| count | 159.000000 | 159.000000 | 159.000000 | 159.000000 | 159.000000 | 159.000000 |
| mean | 398.326415 | 26.247170 | 28.415723 | 31.227044 | 8.970994 | 4.417486 |
| std | 357.978317 | 9.996441 | 10.716328 | 11.610246 | 4.286208 | 1.685804 |
| min | 0.000000 | 7.500000 | 8.400000 | 8.800000 | 1.728400 | 1.047600 |
| 25% | 120.000000 | 19.050000 | 21.000000 | 23.150000 | 5.944800 | 3.385650 |
| 50% | 273.000000 | 25.200000 | 27.300000 | 29.400000 | 7.786000 | 4.248500 |
| 75% | 650.000000 | 32.700000 | 35.500000 | 39.650000 | 12.365900 | 5.584500 |
| max | 1650.000000 | 59.000000 | 63.400000 | 68.000000 | 18.957000 | 8.142000 |
raw_data.nunique()
Species      7
Weight     101
Length1    116
Length2     93
Length3    124
Height     154
Width      152
dtype: int64
raw_data.isnull().sum()
Species    0
Weight     0
Length1    0
Length2    0
Length3    0
Height     0
Width      0
dtype: int64
for column in raw_data.columns:
if column in ['Species']:
print("-------------------------------------------------",column," - ",len(raw_data[column].unique()),"---------------------------------------------------")
print(raw_data[column].unique())
print("--------------------------------------------------------------------------------------------------------------")
------------------------------------------------- Species - 7 ---------------------------------------------------
['Bream' 'Roach' 'Whitefish' 'Parkki' 'Perch' 'Pike' 'Smelt']
--------------------------------------------------------------------------------------------------------------
raw_data.Species.value_counts()
Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       11
Whitefish     6
Name: Species, dtype: int64
# Setting seaborn visualization parameters
sns.set(rc={"figure.figsize" : [15,8]}, font_scale=1.2)
sns.set(rc={"axes.facecolor":"#F2F3F4","figure.facecolor":"#F2F3F4"})
palette = ["#F08080", "#FA8072", "#E9967A", "#FFA07A", "#CD5C5C", "#AF601A", "#CA6F1E"]
sns.set_palette(palette)
color_map = colors.ListedColormap(palette)
## Extracting numerical columns from the raw_data DataFrame
numerical_columns = raw_data.select_dtypes(include="number").columns.to_list()
## fig. size
plt.figure(figsize=(20, 15))
## Looping through each numerical column for plotting
for idx, column in enumerate(numerical_columns):
## Creating subplots within the grid
plt.subplot(2, 3, idx+1)
## Plotting histogram with KDE for the current numerical column
sns.histplot(data=raw_data,
x=column,
bins=25,
kde=True)
## title for the subplot
plt.title(f"{column} distribution.", weight="bold")
## overall title
plt.suptitle("Distribution for Numerical Columns".title(), weight="bold", fontsize=25, x=0.5, y=0.92, color="#CA6F1E")
## Displaying the plot
plt.show()
plt.figure(figsize=(20, 17))
for idx, column in enumerate(numerical_columns):
plt.subplot(2, 3, idx+1)
sns.boxplot(data=raw_data, x=column)
sns.swarmplot(data=raw_data, x=column, color="k")
plt.title(f"{column} distribution .", weight="bold")
plt.suptitle("Detect Outliers .".title(), weight="bold", fontsize=25, x=0.5, y=0.91, color="#CA6F1E")
plt.show()
# Detect Outliers Using datasist library
idx = detect_outliers(
data=raw_data,
n=0, ## the benchmark for the number of allowable outliers in the columns
features=['Weight', 'Length1', 'Length2', 'Length3']
)
raw_data.iloc[idx]
| | Species | Weight | Length1 | Length2 | Length3 | Height | Width |
|---|---|---|---|---|---|---|---|
| 142 | Pike | 1600.0 | 56.0 | 60.0 | 64.0 | 9.600 | 6.144 |
| 143 | Pike | 1550.0 | 56.0 | 60.0 | 64.0 | 9.600 | 6.144 |
| 144 | Pike | 1650.0 | 59.0 | 63.4 | 68.0 | 10.812 | 7.480 |
Encoding categorical data is a crucial step in preparing data for machine learning models, as many algorithms require numerical input. Categorical data represents variables that can take on a limited, and usually fixed, number of values. Common encoding techniques include one-hot encoding and label encoding; one-hot encoding is the idea behind the OneHotEncoder set up in the ColumnTransformer further below.
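As a quick, hedged illustration (not part of the modelling pipeline itself), the Species column can be expanded into one indicator column per species with pandas; pd.get_dummies is used here purely for demonstration.
## Minimal one-hot encoding sketch for the Species column (illustration only;
## the pipeline further below is set up to use sklearn's OneHotEncoder instead)
species_dummies = pd.get_dummies(raw_data["Species"], prefix="Species")
print(species_dummies.head())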
raw_data.describe(exclude="number")
| | Species |
|---|---|
| count | 159 |
| unique | 7 |
| top | Perch |
| freq | 56 |
raw_data["Species"].value_counts(normalize=True).to_frame()
| | Species |
|---|---|
| Perch | 0.352201 |
| Bream | 0.220126 |
| Roach | 0.125786 |
| Pike | 0.106918 |
| Smelt | 0.088050 |
| Parkki | 0.069182 |
| Whitefish | 0.037736 |
## Create a count plot for the "Species" column
ax = sns.countplot(data=raw_data, y="Species", order=raw_data["Species"].value_counts().index)
## Add counts on top of each bar
for container in ax.containers:
ax.bar_label(container, label_type="center", color="k")
## Add title and labels
plt.title("Distribution of Fish Species", fontsize=15, weight="bold", color="#CA6F1E")
plt.xlabel("Count")
plt.ylabel("Fish Species")
plt.grid(axis="x", linestyle="--", alpha=0.6, c="k")
## Show plot
plt.show()
condition = raw_data.columns.str.contains("Length")
all_length = raw_data.columns[condition].tolist()
plt.figure(figsize=(15,5))
for idx, column in enumerate(all_length):
plt.subplot(1, 3, idx+1)
sns.scatterplot(data=raw_data, x="Weight", y=column)
plt.suptitle("Correlation between Weight and Length", weight="bold", color="#CA6F1E")
plt.show()
raw_data[["Length1", "Length2", "Length3", "Weight"]].corr()
| | Length1 | Length2 | Length3 | Weight |
|---|---|---|---|---|
| Length1 | 1.000000 | 0.999517 | 0.992031 | 0.915712 |
| Length2 | 0.999517 | 1.000000 | 0.994103 | 0.918618 |
| Length3 | 0.992031 | 0.994103 | 1.000000 | 0.923044 |
| Weight | 0.915712 | 0.918618 | 0.923044 | 1.000000 |
sns.boxplot(data=raw_data, x="Weight", y="Species")
plt.title('Boxplot of Weight by Species .', weight="bold", color="#CA6F1E")
plt.xlabel('Weight')
plt.ylabel('Species')
plt.grid(True)
plt.show()
## Create barplot using seaborn between species and the mean of weight
sns.barplot(
data=raw_data,
x="Species",
y="Weight",
errorbar=None,  ## disable error bars
estimator='mean'
)
## title and labels name
plt.title('Mean of Fish Weight by Species .', weight="bold", color="#CA6F1E")
plt.xlabel('Species')
plt.ylabel('Weight')
## Show plot
plt.show()
plt.figure(figsize=(15,5))
for idx, column in enumerate(all_length):
plt.subplot(1, 3, idx+1)
sns.scatterplot(data=raw_data, x="Weight", y=column, hue="Species")
plt.suptitle("Correlation between Weight and Length by Species .", weight="bold", color="#CA6F1E")
plt.show()
raw_data[["Height", "Width"]].corr()
| | Height | Width |
|---|---|---|
| Height | 1.000000 | 0.792881 |
| Width | 0.792881 | 1.000000 |
sns.scatterplot(data=raw_data, x="Height", y="Width", hue="Species", palette="Set1" )
plt.title('Scatter Plot of Height vs Width.', weight="bold", color="#CA6F1E")
plt.xlabel('Height')
plt.ylabel('Width')
plt.grid(True)
plt.show()
sns.scatterplot(data=raw_data, x="Height", y="Width", hue="Species", palette="Set1")
plt.title('Scatter Plot of Height vs Width', weight="bold", color="#CA6F1E")
plt.xlabel('Height')
plt.ylabel('Width')
plt.grid(True)
plt.show()
## Selecting rows where the species is "Pike" and extracting columns "Height" and "Width"
pike_data = raw_data[raw_data["Species"] == "Pike"][["Height", "Width"]]
## Calculating the correlation coefficient between the height and width of Pike
corr_coeff = pike_data.corr().iloc[0, 1].round(2)
print(f"correlation between height and width for pike type : {corr_coeff}")
correlation between height and width for pike type : 0.97
## Selecting rows where the species is "Smelt" and extracting columns "Height" and "Width"
smelt_data = raw_data[raw_data["Species"] == "Smelt"][["Height", "Width"]]
## Calculating the correlation coefficient between the height and width of Smelt
corr_coeff = smelt_data.corr().iloc[0, 1].round(2)
print(f"correlation between height and width for smelt type : {corr_coeff}")
correlation between height and width for smelt type : 0.87
raw_data[["Length1", "Length2", "Length3", "Height"]].corr()[["Height"]]
| | Height |
|---|---|
| Length1 | 0.625378 |
| Length2 | 0.640441 |
| Length3 | 0.703409 |
| Height | 1.000000 |
pair_plot = sns.pairplot(
data=raw_data,
hue="Species",
diag_kind="kde", # Use kernel density estimates on diagonal plots
height=3.5, # Set the height of each subplot
aspect=1.2 # Adjust the aspect ratio of the subplots
)
pair_plot.fig.suptitle("Pair Plot by Species", y=1.02, fontsize=30, weight="bold", color="#CA6F1E")
plt.show()
## Calculate the correlation matrix, excluding the 'Weight' column (target)
multi_corr = raw_data.drop("Weight", axis=1).corr(numeric_only=True)
## Create a mask for the upper triangle to hide it
mask = np.triu(np.ones_like(multi_corr), k=1)
## figure size
plt.figure(figsize=(10, 8))
## heatmap using seaborn
sns.heatmap(
multi_corr,
annot=True, # Display the correlation values
mask=mask, # Apply the mask to hide the upper triangle
square=True, # Ensure square-shaped cells
fmt="0.3f"
)
## title of the heatmap
plt.title("Correlation Heatmap Excluding Weight (Target) .", weight="bold", color="#CA6F1E")
## Show the plot
plt.show()
## Drop columns with multicollinearity
raw_data.drop(columns=["Length2", "Length3"], inplace=True)
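As an optional cross-check (not part of the original workflow), variance inflation factors quantify how much collinearity remains among the kept numerical features; this sketch assumes statsmodels is installed.
## Optional check: variance inflation factors (VIF) for the remaining numerical features
## (assumes statsmodels is available; values far above ~10 would indicate strong collinearity)
from statsmodels.stats.outliers_influence import variance_inflation_factor
numeric_feats = raw_data.select_dtypes(include="number").drop(columns=["Weight"])
vif = pd.Series(
    [variance_inflation_factor(numeric_feats.values, i) for i in range(numeric_feats.shape[1])],
    index=numeric_feats.columns,
)
print(vif)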
target = "Weight"
x_variables = raw_data.drop(target, axis=1)
y_variables = pd.DataFrame(raw_data[target])
### Split Data - train, test
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size :", (len(X_train)/len(x_variables))*100)
print("Train Size :", (len(X_test)/len(x_variables))*100)
X_train Size : 127
Y_train Size : 127
X_test Size : 32
Y_test Size : 32
Train Size : 79.87421383647799
Test Size : 20.125786163522015
num_feat = x_variables.select_dtypes(include="number").columns.to_list()
categ_feat = y_variables.select_dtypes(exclude="number").columns.to_list()
print("After addressing multicollinearity, the remaining columns are:")
print(f"Numerical Feature in the Data ==> {num_feat}")
print(f"Categorical Feature in the Data ==> {categ_feat}")
After addressing multicollinearity, the remaining columns are:
Numerical Feature in the Data ==> ['Length1', 'Height', 'Width']
Categorical Feature in the Data ==> []
## Create a ColumnTransformer for preprocessing numerical and categorical features
preprocess_cols = ColumnTransformer([
("Numerical Cols", StandardScaler(), num_feat),
("Categorical Cols", OneHotEncoder(), categ_feat)
])
## Fit the ColumnTransformer on the training data..
preprocess_cols.fit(X_train)
## Transform the training and test data using the fitted ColumnTransformer
X_train_final = preprocess_cols.transform(X_train)
X_test_final = preprocess_cols.transform(X_test)
## Create an instance of the LinearRegression model
lin_reg = LinearRegression()
## Fit the model to the training data
## X_train_final: Features of the training set
## y_train: Target values of the training set
lin_reg.fit(X_train_final, y_train)
LinearRegression()
## Predictions on the training set using the trained Linear Regression model
lin_reg_train_predict = lin_reg.predict(X_train_final)
## Predictions on the testing set using the trained Linear Regression model
lin_reg_test_predict = lin_reg.predict(X_test_final)
## Mean Squared Error (MSE) for the training data
print(f"MSE For Training Data (Linear Reg.) : {mean_squared_error(y_train, lin_reg_train_predict).round(2)}")
## Mean Squared Error (MSE) for the testing data
print(f"MSE For Testing Data (Linear Reg.) : {mean_squared_error(y_test, lin_reg_test_predict).round(2)}")
print("**" * 50)
## Mean Absolute Error (MAE) for the training data
print(f"MAE For Training Data (Linear Reg.) : {mean_absolute_error(y_train, lin_reg_train_predict).round(2)}")
## Mean Absolute Error (MAE) for the testing data
print(f"MAE For Testing Data (Linear Reg.) : {mean_absolute_error(y_test, lin_reg_test_predict).round(2)}")
print("**" * 50)
## R-squared score for the training data
print(f"R-Square Score For Training Data (Linear Reg.) : {r2_score(y_train, lin_reg_train_predict).round(2) * 100} %")
## R-squared score for the testing data
print(f"R-Square Score For Testing Data (Linear Reg.) : {r2_score(y_test, lin_reg_test_predict).round(2) * 100} %")
MSE For Training Data (Linear Reg.) : 16099.32
MSE For Testing Data (Linear Reg.) : 11054.92
****************************************************************************************************
MAE For Training Data (Linear Reg.) : 99.2
MAE For Testing Data (Linear Reg.) : 84.99
****************************************************************************************************
R-Square Score For Training Data (Linear Reg.) : 89.0 %
R-Square Score For Testing Data (Linear Reg.) : 71.0 %
# Gradient Descent
## Create an instance of the SGDRegressor model with specific parameters
## penalty=None: No regularization penalty
## random_state=90: Seed for reproducibility
## learning_rate='constant': Constant learning rate
## eta0=0.02: Initial learning rate
SGD = SGDRegressor(
penalty=None, random_state=90, learning_rate='constant', eta0=0.02, max_iter=1000
)
## Fit the model to the training data
## X_train_final: Features of the training set
## y_train: Target values of the training set
SGD.fit(X_train_final, y_train.iloc[:, 0].values)
SGDRegressor(eta0=0.02, learning_rate='constant', penalty=None, random_state=90)
SGD_train_predict = SGD.predict(X_train_final)
SGD_test_predict = SGD.predict(X_test_final)
print(f"MSE For Training Data (SGD) : {mean_squared_error(y_train, SGD_train_predict).round(2)}")
print(f"MSE For Testing Data (SGD) : {mean_squared_error(y_test, SGD_test_predict).round(2)}")
print("**" * 50)
print(f"MAE For Training Data (SGD) : {mean_absolute_error(y_train, SGD_train_predict).round(2)}")
print(f"MAE For Testing Data (SGD) : {mean_absolute_error(y_test, SGD_test_predict).round(2)}")
print("**" * 50)
print(f"R-Square Score For Training Data (SGD) : {r2_score(y_train, SGD_train_predict).round(2) * 100} %")
print(f"R-Square Score For Testing Data (SGD) : {r2_score(y_test, SGD_test_predict).round(2) * 100} %")
MSE For Training Data (SGD) : 16156.4
MSE For Testing Data (SGD) : 11612.37
****************************************************************************************************
MAE For Training Data (SGD) : 99.49
MAE For Testing Data (SGD) : 86.94
****************************************************************************************************
R-Square Score For Training Data (SGD) : 89.0 %
R-Square Score For Testing Data (SGD) : 70.0 %
# cross validation
results = cross_val_score(estimator=SGD, X=X_train_final, y=y_train.iloc[:, 0].values, scoring="r2", cv=5)
print(f'CV Score using R-Square: {results}')
print(f'Mean of Results: {results.mean().round(2) * 100} %')
print(f'SD of results: {results.std().round(4)}') ## Standard Deviation
CV Score using R-Square: [0.86903092 0.87879141 0.81720361 0.89172989 0.87064825]
Mean of Results: 87.0 %
SD of results: 0.0254
# Create KDE plots
sns.kdeplot(data=y_test, linewidth=3, label='Actual Values')
sns.kdeplot(data=SGD_test_predict, linewidth=3, color="#64340A", label='Predicted Values')
# Set labels and title
plt.xlabel("Values")
plt.ylabel("Density")
plt.title("Kernel Density Estimate (KDE) Plot for Actual vs Predicted Values (SGD)", weight="bold", color="#CA6F1E")
# Display legend
plt.legend()
# Show the plot
plt.show()
Polynomial Regression is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modeled as an $n$th degree polynomial. The general equation of a polynomial regression of degree $n$ is:
$$y = b_0 + b_1x + b_2x^2 + ... + b_nx^n$$
where $b_0, b_1, ..., b_n$ are the coefficients of the polynomial.
Fitting a polynomial regression to a dataset involves estimating the coefficients $b_0, b_1, ..., b_n$. This is typically done using a method such as Ordinary Least Squares (OLS), which minimizes the sum of the squared residuals:
$$\min_{b_0, b_1, ..., b_n} \sum_{i=1}^{m} (y_i - (b_0 + b_1x_i + b_2x_i^2 + ... + b_nx_i^n))^2$$
where $m$ is the number of observations, $y_i$ is the observed value for the $i$-th observation, and $x_i$ is the corresponding value of the independent variable.
One of the challenges with Polynomial Regression is selecting the right degree of the polynomial. A model with too high a degree can lead to overfitting, where the model fits the training data too closely and performs poorly on new data. Conversely, a model with too low a degree can lead to underfitting, where the model does not fit the training data well enough.
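Since choosing the degree is the main modelling decision here, the hedged sketch below compares a few candidate degrees by cross-validated R-square on the training split; the degree range and the 5-fold setup are assumptions, mirroring the cross_val_score usage elsewhere in this notebook. Degree 2 is used in the cells that follow.
## Hedged sketch: score candidate polynomial degrees with 5-fold cross-validation
for degree in [1, 2, 3, 4]:
    candidate_features = PolynomialFeatures(degree=degree).fit_transform(X_train_final)
    scores = cross_val_score(LinearRegression(), candidate_features, y_train, scoring="r2", cv=5)
    print(f"degree={degree}: mean CV R-square = {scores.mean():.3f} (std = {scores.std():.3f})")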
## Create an instance of PolynomialFeatures
poly = PolynomialFeatures(degree=2)
## Transform the features of the training set to include polynomial combinations
X_train_poly = poly.fit_transform(X_train_final)
## Transform the features of the testing set to include polynomial combinations
X_test_poly = poly.transform(X_test_final)
poly_reg = LinearRegression()
poly_reg.fit(X_train_poly, y_train)
LinearRegression()
poly_train_predict = poly_reg.predict(X_train_poly)
poly_test_predict = poly_reg.predict(X_test_poly)
print(f"MSE For Training Data (Polynomial Regression) : {mean_squared_error(y_train, poly_train_predict).round(2)}")
print(f"MSE For Testing Data (Polynomial Regression) : {mean_squared_error(y_test, poly_test_predict).round(2)}")
print("**" * 50)
print(f"MAE For Training Data (Polynomial Regression) : {mean_absolute_error(y_train, poly_train_predict).round(2)}")
print(f"MAE For Testing Data (Polynomial Regression) : {mean_absolute_error(y_test, poly_test_predict).round(2)}")
print("**" * 50)
print(f"R-Square Score For Training Data (Polynomial Regression) : {r2_score(y_train, poly_train_predict).round(2) * 100} %")
print(f"R-Square Score For Testing Data (Polynomial Regression) : {r2_score(y_test, poly_test_predict).round(2) * 100} %")
MSE For Training Data (Polynomial Regression) : 2594.66
MSE For Testing Data (Polynomial Regression) : 1385.32
****************************************************************************************************
MAE For Training Data (Polynomial Regression) : 34.22
MAE For Testing Data (Polynomial Regression) : 25.14
****************************************************************************************************
R-Square Score For Training Data (Polynomial Regression) : 98.0 %
R-Square Score For Testing Data (Polynomial Regression) : 96.0 %
# Cross Validation
results = cross_val_score(
estimator=poly_reg, X=X_train_poly, y=y_train, scoring="r2", cv=5
)
print(f'CV Score using R-Square: {results}')
print(f'Mean of Results: {results.mean().round(2) * 100} %')
print(f'SD of results: {results.std().round(4)}') ## Standard Deviation
CV Score using R-Square: [0.97268534 0.97946351 0.95263976 0.98247351 0.96779586]
Mean of Results: 97.0 %
SD of results: 0.0105
# Plotting the actual vs predicted values
# Create KDE plots
sns.kdeplot(data=y_test, linewidth=3, label='Actual Values')
sns.kdeplot(data=poly_test_predict, linewidth=3, color="#64340A", label='Predicted Values')
# Set labels and title
plt.xlabel("Values")
plt.ylabel("Density")
plt.title("Kernel Density Estimate (KDE) Plot for Actual vs Predicted Values (SGD)", weight="bold", color="#CA6F1E")
# Display legend
plt.legend()
# Show the plot
plt.show()