Predict Loan Eligibility for Dream Housing Finance Company. Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan, and the company then validates the customer's eligibility.
The company wants to automate the loan eligibility process in real time, based on the customer details provided in the online application form: Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate this process, they have provided a dataset for identifying the customer segments that are eligible for a loan, so that these customers can be targeted specifically.
1. Introduction
Exploratory Data Analysis (EDA) is a crucial phase in the data analysis pipeline, serving as the foundation for making informed decisions and deriving meaningful insights from raw data. This document aims to provide a comprehensive understanding of the EDA process, its importance, and the key techniques involved.
2. Objectives of Exploratory Data Analysis
1. Understand Data Characteristics:
Gain insights into the distribution, central tendency, and variability of the data.
Identify the presence of missing values, outliers, and anomalies.
2. Explore Relationships:
Examine correlations and dependencies between different variables.
Uncover potential patterns and trends within the dataset.
3. Visualize Data Distributions:
Utilize graphical representations to visualize the distribution of data.
Choose appropriate plots such as histograms, box plots, and scatter plots.
4. Identify Patterns and Anomalies:
Uncover hidden patterns that may not be apparent in raw data.
Detect outliers and anomalies that could impact analysis outcomes.
3. Techniques and Tools
1. Descriptive Statistics:
Calculate measures such as mean, median, and standard deviation.
Utilize summary statistics to provide an overview of the dataset.
2. Data Visualization:
Employ graphical representations like histograms, box plots, and scatter plots.
Create visualizations to illustrate trends, patterns, and relationships.
3. Correlation Analysis:
Use correlation matrices to quantify the relationships between variables.
Identify strong positive/negative correlations and potential multicollinearity.
4. Outlier Detection:
Apply statistical methods or visual inspection to identify outliers (see the IQR sketch after this list).
Assess the impact of outliers on the analysis and consider appropriate handling.
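As a concrete example of the outlier-detection technique above, here is a minimal sketch of IQR-based detection using Tukey's 1.5 × IQR fences; the function name and sample values are illustrative, not part of the loan pipeline below.
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    # Tukey's rule: flag points beyond 1.5 * IQR from the quartiles
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Example: one extreme income dominates the rest
print(iqr_outliers(pd.Series([2500, 3000, 3200, 4100, 81000])))  # flags 81000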
4. Steps in Exploratory Data Analysis
1. Data Collection:
Gather the raw dataset from reliable sources.
2. Data Cleaning:
Handle missing values, duplicate entries, and inconsistencies.
Ensure data is in a suitable format for analysis.
3. Descriptive Statistics:
Compute basic statistics to describe the central tendency and dispersion.
4. Visualization:
Generate visualizations to explore data distributions and relationships.
5. Correlation Analysis:
Investigate correlations between variables.
6. Outlier Detection:
Identify and analyze outliers to understand their impact. (A minimal pandas skeleton of these six steps appears below.)
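To make the six steps concrete, here is a compact pandas skeleton; the file name data.csv and the column amount are hypothetical placeholders, not part of the loan dataset used later.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")                                # 1. Data collection
df = df.drop_duplicates()                                   # 2. Cleaning: duplicates...
df["amount"] = df["amount"].fillna(df["amount"].median())   #    ...and missing values
print(df.describe())                                        # 3. Descriptive statistics
df["amount"].hist()                                         # 4. Visualization
plt.show()
print(df.corr(numeric_only=True))                           # 5. Correlation analysis
q1, q3 = df["amount"].quantile([0.25, 0.75])                # 6. Outlier detection (IQR fences)
print(df[(df["amount"] < q1 - 1.5 * (q3 - q1)) | (df["amount"] > q3 + 1.5 * (q3 - q1))])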
5. Case Study: Applying EDA to Real-World Data
The remainder of this document is a practical example: EDA applied to the Dream Housing Finance loan dataset, showing the step-by-step process and the insights gained.
6. Conclusion
Summarize the key findings from the EDA process and emphasize its importance in guiding subsequent data analysis and decision-making.
7. References
The analysis below relies on pandas, NumPy, Matplotlib, Seaborn, Plotly, datasist, SymPy, and scikit-learn; see the import block that follows.
import pandas as pd
import numpy as np
# Plotting
%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import colors
from matplotlib import cm  # color map
from mpl_toolkits.mplot3d.axes3d import Axes3D
import seaborn as sns
import plotly.express as px
# Documentation (importing handcalcs.render enables the %%render cell magic)
import handcalcs.render
# Symbolic math
from sympy import Sum, symbols, Indexed, lambdify, diff
# Outlier detection helper from the datasist library (not called below)
from datasist.structdata import detect_outliers
# scikit-learn: model selection, preprocessing, model, and metrics
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, PolynomialFeatures
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.metrics import roc_curve, roc_auc_score
# Path
data_path = './Data/'
train_data = pd.read_csv(data_path+"train.csv", low_memory=False).reset_index(drop=True)
test_data = pd.read_csv(data_path+"test.csv", low_memory=False).reset_index(drop=True)
print("Shape of the train data: ", train_data.shape)
print("Shape of the test data: ", test_data.shape)
Shape of the train data:  (614, 13)
Shape of the test data:  (367, 12)
train_data
| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | NaN | 360.0 | 1.0 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.0 | 360.0 | 1.0 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.0 | 360.0 | 1.0 | Urban | Y |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.0 | 360.0 | 1.0 | Urban | Y |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.0 | 360.0 | 1.0 | Urban | Y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
609 | LP002978 | Female | No | 0 | Graduate | No | 2900 | 0.0 | 71.0 | 360.0 | 1.0 | Rural | Y |
610 | LP002979 | Male | Yes | 3+ | Graduate | No | 4106 | 0.0 | 40.0 | 180.0 | 1.0 | Rural | Y |
611 | LP002983 | Male | Yes | 1 | Graduate | No | 8072 | 240.0 | 253.0 | 360.0 | 1.0 | Urban | Y |
612 | LP002984 | Male | Yes | 2 | Graduate | No | 7583 | 0.0 | 187.0 | 360.0 | 1.0 | Urban | Y |
613 | LP002990 | Female | No | 0 | Graduate | Yes | 4583 | 0.0 | 133.0 | 360.0 | 0.0 | Semiurban | N |
614 rows × 13 columns
test_data
| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001015 | Male | Yes | 0 | Graduate | No | 5720 | 0 | 110.0 | 360.0 | 1.0 | Urban |
1 | LP001022 | Male | Yes | 1 | Graduate | No | 3076 | 1500 | 126.0 | 360.0 | 1.0 | Urban |
2 | LP001031 | Male | Yes | 2 | Graduate | No | 5000 | 1800 | 208.0 | 360.0 | 1.0 | Urban |
3 | LP001035 | Male | Yes | 2 | Graduate | No | 2340 | 2546 | 100.0 | 360.0 | NaN | Urban |
4 | LP001051 | Male | No | 0 | Not Graduate | No | 3276 | 0 | 78.0 | 360.0 | 1.0 | Urban |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
362 | LP002971 | Male | Yes | 3+ | Not Graduate | Yes | 4009 | 1777 | 113.0 | 360.0 | 1.0 | Urban |
363 | LP002975 | Male | Yes | 0 | Graduate | No | 4158 | 709 | 115.0 | 360.0 | 1.0 | Urban |
364 | LP002980 | Male | No | 0 | Graduate | No | 3250 | 1993 | 126.0 | 360.0 | NaN | Semiurban |
365 | LP002986 | Male | Yes | 0 | Graduate | No | 5000 | 2393 | 158.0 | 360.0 | 1.0 | Rural |
366 | LP002989 | Male | No | 0 | Graduate | Yes | 9200 | 0 | 98.0 | 180.0 | 1.0 | Rural |
367 rows × 12 columns
columns = train_data.columns
columns
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'], dtype='object')
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            614 non-null    object
 1   Gender             601 non-null    object
 2   Married            611 non-null    object
 3   Dependents         599 non-null    object
 4   Education          614 non-null    object
 5   Self_Employed      582 non-null    object
 6   ApplicantIncome    614 non-null    int64
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object
 12  Loan_Status        614 non-null    object
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB
train_data.describe()
| | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History |
|---|---|---|---|---|---|
count | 614.000000 | 614.000000 | 592.000000 | 600.00000 | 564.000000 |
mean | 5403.459283 | 1621.245798 | 146.412162 | 342.00000 | 0.842199 |
std | 6109.041673 | 2926.248369 | 85.587325 | 65.12041 | 0.364878 |
min | 150.000000 | 0.000000 | 9.000000 | 12.00000 | 0.000000 |
25% | 2877.500000 | 0.000000 | 100.000000 | 360.00000 | 1.000000 |
50% | 3812.500000 | 1188.500000 | 128.000000 | 360.00000 | 1.000000 |
75% | 5795.000000 | 2297.250000 | 168.000000 | 360.00000 | 1.000000 |
max | 81000.000000 | 41667.000000 | 700.000000 | 480.00000 | 1.000000 |
train_data.nunique()
Loan_ID              614
Gender                 2
Married                2
Dependents             4
Education              2
Self_Employed          2
ApplicantIncome      505
CoapplicantIncome    287
LoanAmount           203
Loan_Amount_Term      10
Credit_History         2
Property_Area          3
Loan_Status            2
dtype: int64
# Combine train and test so that imputation and encoding are applied consistently
raw_data = pd.concat([train_data, test_data], ignore_index=True)
raw_data.shape
(981, 13)
raw_data.isnull().sum()
Loan_ID                0
Gender                24
Married                3
Dependents            25
Education              0
Self_Employed         55
ApplicantIncome        0
CoapplicantIncome      0
LoanAmount            27
Loan_Amount_Term      20
Credit_History        79
Property_Area          0
Loan_Status          367
dtype: int64
raw_data["LoanAmount"].fillna(raw_data["LoanAmount"].mean(),inplace=True)
raw_data["CoapplicantIncome"].fillna(raw_data["CoapplicantIncome"].mean(),inplace=True)
raw_data["ApplicantIncome"].fillna(raw_data["ApplicantIncome"].mean(),inplace=True)
raw_data["Loan_Amount_Term"].fillna(raw_data["Loan_Amount_Term"].mean(),inplace=True)
raw_data["Credit_History"].fillna(raw_data["Credit_History"].mean(),inplace=True)
# Categorical columns to impute with the mode. Credit_History holds 0/1 values,
# so it is listed here as categorical, but its missing values were already
# mean-imputed above, making the mode fill a no-op for it (this is why
# Credit_History later shows a third unique value, 0.83592). Loan_Status is
# missing for every test row; the mode fill gives those rows a placeholder
# label that is never used for modelling.
cat_cols = ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
            'Self_Employed', 'Property_Area', 'Loan_Status', 'Credit_History']
# Impute the object columns with their mode
for i in cat_cols:
    raw_data[i] = raw_data[i].fillna(raw_data[i].mode()[0])
raw_data.isnull().sum().sort_values(ascending=False)
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64
raw_data
| | Loan_ID | Gender | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Property_Area | Loan_Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | LP001002 | Male | No | 0 | Graduate | No | 5849 | 0.0 | 142.51153 | 360.0 | 1.00000 | Urban | Y |
1 | LP001003 | Male | Yes | 1 | Graduate | No | 4583 | 1508.0 | 128.00000 | 360.0 | 1.00000 | Rural | N |
2 | LP001005 | Male | Yes | 0 | Graduate | Yes | 3000 | 0.0 | 66.00000 | 360.0 | 1.00000 | Urban | Y |
3 | LP001006 | Male | Yes | 0 | Not Graduate | No | 2583 | 2358.0 | 120.00000 | 360.0 | 1.00000 | Urban | Y |
4 | LP001008 | Male | No | 0 | Graduate | No | 6000 | 0.0 | 141.00000 | 360.0 | 1.00000 | Urban | Y |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
976 | LP002971 | Male | Yes | 3+ | Not Graduate | Yes | 4009 | 1777.0 | 113.00000 | 360.0 | 1.00000 | Urban | Y |
977 | LP002975 | Male | Yes | 0 | Graduate | No | 4158 | 709.0 | 115.00000 | 360.0 | 1.00000 | Urban | Y |
978 | LP002980 | Male | No | 0 | Graduate | No | 3250 | 1993.0 | 126.00000 | 360.0 | 0.83592 | Semiurban | Y |
979 | LP002986 | Male | Yes | 0 | Graduate | No | 5000 | 2393.0 | 158.00000 | 360.0 | 1.00000 | Rural | Y |
980 | LP002989 | Male | No | 0 | Graduate | Yes | 9200 | 0.0 | 98.00000 | 180.0 | 1.00000 | Rural | Y |
981 rows × 13 columns
raw_data.nunique()
Loan_ID              981
Gender                 2
Married                2
Dependents             4
Education              2
Self_Employed          2
ApplicantIncome      752
CoapplicantIncome    437
LoanAmount           233
Loan_Amount_Term      13
Credit_History         3
Property_Area          3
Loan_Status            2
dtype: int64
# Inspect the unique values of each categorical column
for column in ['Gender', 'Married', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']:
    print(f"---- {column} - {len(raw_data[column].unique())} ----")
    print(raw_data[column].unique())
---- Gender - 2 ----
['Male' 'Female']
---- Married - 2 ----
['No' 'Yes']
---- Education - 2 ----
['Graduate' 'Not Graduate']
---- Self_Employed - 2 ----
['No' 'Yes']
---- Property_Area - 3 ----
['Urban' 'Rural' 'Semiurban']
---- Loan_Status - 2 ----
['Y' 'N']
# Seaborn visualization parameters, set in a single call
# (each sns.set call resets previous settings, so they must not be split up)
sns.set(rc={"figure.figsize": [12, 4],
            "axes.facecolor": "#F2F3F4",
            "figure.facecolor": "#F2F3F4"},
        font_scale=1.2)
palette = ["#F08080", "#FA8072", "#E9967A", "#FFA07A", "#CD5C5C", "#AF601A", "#CA6F1E"]
sns.set_palette(palette)
color_map = colors.ListedColormap(palette)
# Univariate Analysis For Numerical Features:
plt.figure(figsize=(12,7))
t = 1
for i in raw_data.select_dtypes(include=np.number).columns:
plt.subplot(3,2,t)
sns.histplot(raw_data[i], kde=True)
plt.title(f"Skewness:{raw_data[i].skew().round(2)}, Kurtosis:{raw_data[i].kurt().round(2)}")
t+=1
plt.tight_layout()
plt.show()
# Box plots of the numerical features (for outlier inspection):
plt.figure(figsize=(5,10))
t = 1
for i in raw_data.select_dtypes(include=np.number).columns:
plt.subplot(3,2,t)
sns.boxplot(raw_data[i])
t+=1
plt.tight_layout()
plt.show()
# Pie plot for Loan_Status (the target variable)
raw_data["Loan_Status"].value_counts().plot(kind="pie", autopct="%.1f%%");
# Tabulate approvals by gender (computed for reference; the plots below
# show the same information graphically)
loan_status_counts = raw_data.groupby(['Gender', 'Loan_Status']).size().unstack()
# Count plots of Loan_Status against each categorical feature
sns.set(style="whitegrid")
legend_labels = {'Y': 'Approved', 'N': 'Not Approved'}
for col in ['Gender', 'Married', 'Dependents', 'Education',
            'Self_Employed', 'Property_Area', 'Credit_History']:
    ax = sns.countplot(x=col, hue='Loan_Status', data=raw_data)
    # Relabel the Y/N legend entries as Approved / Not Approved
    handles, labels = ax.get_legend_handles_labels()
    ax.legend(handles, [legend_labels.get(l, l) for l in labels],
              title='Loan Status', bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.title(f"Loan Status by {col.replace('_', ' ')}")
    plt.xlabel(col.replace('_', ' '))
    plt.ylabel('Number of Applicants')
    plt.show()
# Replace the "3+" category in Dependents with 3
raw_data["Dependents"] = np.where(raw_data["Dependents"] == "3+", 3, raw_data["Dependents"])
# Convert Dependents to int now that it holds a numeric count
raw_data["Dependents"] = raw_data["Dependents"].astype(int)
# Map the binary object columns to 0/1, letting the dtype become int
raw_data['Married']=raw_data['Married'].map({'Yes':1, 'No':0})
raw_data['Education']=raw_data['Education'].map({'Graduate':1, 'Not Graduate':0})
raw_data['Self_Employed']=raw_data['Self_Employed'].map({'Yes':1, 'No':0})
raw_data['Loan_Status']=raw_data['Loan_Status'].map({'Y':1, 'N':0})
raw_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 981 entries, 0 to 980
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Loan_ID            981 non-null    object
 1   Gender             981 non-null    object
 2   Married            981 non-null    int64
 3   Dependents         981 non-null    int32
 4   Education          981 non-null    int64
 5   Self_Employed      981 non-null    int64
 6   ApplicantIncome    981 non-null    int64
 7   CoapplicantIncome  981 non-null    float64
 8   LoanAmount         981 non-null    float64
 9   Loan_Amount_Term   981 non-null    float64
 10  Credit_History     981 non-null    float64
 11  Property_Area      981 non-null    object
 12  Loan_Status        981 non-null    int64
dtypes: float64(4), int32(1), int64(5), object(3)
memory usage: 95.9+ KB
# Split the combined frame back into the original train and test portions
new_train_data = raw_data.loc[:train_data.shape[0] - 1]
new_test_data = raw_data.loc[train_data.shape[0]:]
new_train_data = new_train_data.drop('Loan_ID', axis=1)
# One-hot encode the remaining object columns (Gender, Property_Area),
# for which a simple 0/1 mapping can't be used
new_train_data = pd.get_dummies(new_train_data)
new_train_data.head()
| | Married | Dependents | Education | Self_Employed | ApplicantIncome | CoapplicantIncome | LoanAmount | Loan_Amount_Term | Credit_History | Loan_Status | Gender_Female | Gender_Male | Property_Area_Rural | Property_Area_Semiurban | Property_Area_Urban |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 5849 | 0.0 | 142.51153 | 360.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 |
1 | 1 | 1 | 1 | 0 | 4583 | 1508.0 | 128.00000 | 360.0 | 1.0 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 1 | 0 | 1 | 1 | 3000 | 0.0 | 66.00000 | 360.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 |
3 | 1 | 0 | 0 | 0 | 2583 | 2358.0 | 120.00000 | 360.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 |
4 | 0 | 0 | 1 | 0 | 6000 | 0.0 | 141.00000 | 360.0 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 |
# Check the correlations again now that the object columns have been encoded
plt.figure(figsize=(15,5))
sns.heatmap(new_train_data.corr(),annot=True)
plt.show()
# Splitting the dataset into features and target
x_variables = new_train_data.drop('Loan_Status',axis=1)
y_variables = new_train_data["Loan_Status"]
### Split Data - train, test
X_train, X_test, y_train, y_test = train_test_split(x_variables, y_variables, test_size=0.2, random_state = 10)
print("X_train Size :",len(X_train))
print("Y_train Size :",len(y_train))
print("X_test Size :",len(X_test))
print("Y_test Size :",len(y_test))
print("Train Size :", (len(X_train)/len(x_variables))*100)
print("Train Size :", (len(X_test)/len(x_variables))*100)
X_train Size : 491
Y_train Size : 491
X_test Size : 123
Y_test Size : 123
Train Size : 79.96742671009773
Test Size : 20.03257328990228
Logistic regression is a type of regression analysis used for predicting the probability of an outcome, which makes it particularly useful for binary classification problems such as loan approval. The model is based on the logistic function, also known as the sigmoid function.

The logistic function is defined as:

$$ S(z) = \frac{1}{1 + e^{-z}} $$

where \( z \) is any real-valued input and \( e \) is Euler's number; \( S(z) \) always lies between 0 and 1, so it can be interpreted as a probability.

In the context of logistic regression, \( z \) is the linear combination of the input features:

$$ z = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \ldots + \beta_n \cdot x_n $$

where \( \beta_0 \) is the intercept, \( \beta_1, \ldots, \beta_n \) are the model coefficients, and \( x_1, \ldots, x_n \) are the input features.

Putting it all together, the logistic regression model can be expressed as:

$$ P(Y=1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 + \ldots + \beta_n \cdot x_n)}} $$

This equation gives the probability that the dependent variable \( Y \) equals 1 (here, that the loan is approved) given the values of the input features \( x_1, x_2, \ldots, x_n \).
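To make the sigmoid concrete, here is a minimal NumPy sketch; the function name and test values are illustrative and not part of the pipeline below.
def sigmoid(z):
    # Logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Sanity check: sigmoid(0) = 0.5; large |z| pushes the output towards 0 or 1
print(sigmoid(np.array([-4.0, 0.0, 4.0])))  # approximately [0.018 0.5 0.982]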
# Using logistic regression supervised ML classification model
logistic = LogisticRegression(max_iter=1000)
logistic.fit(X_train, y_train)
LogisticRegression(max_iter=1000)
y_pred = logistic.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
0.7967479674796748
[[12 24]
 [ 1 86]]
              precision    recall  f1-score   support

           0       0.92      0.33      0.49        36
           1       0.78      0.99      0.87        87

    accuracy                           0.80       123
   macro avg       0.85      0.66      0.68       123
weighted avg       0.82      0.80      0.76       123
kf = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)
for i, (train_index, test_index) in enumerate(kf.split(x_variables, y_variables), start=1):
    print('{} of kfold {}'.format(i, kf.n_splits))
    xtr, xvl = x_variables.iloc[train_index], x_variables.iloc[test_index]
    ytr, yvl = y_variables.iloc[train_index], y_variables.iloc[test_index]
    model = LogisticRegression(random_state=1, max_iter=1000)
    model.fit(xtr, ytr)
    pred_test = model.predict(xvl)
    score = accuracy_score(yvl, pred_test)
    print('Accuracy score: ', score)
# After the loop, `model` is the classifier trained on the final fold
pred_test = model.predict(X_test)
pred = model.predict_proba(xvl)[:, 1]  # probabilities on the last validation fold
1 of kfold 5
Accuracy score:  0.8048780487804879
2 of kfold 5
Accuracy score:  0.8373983739837398
3 of kfold 5
Accuracy score:  0.7967479674796748
4 of kfold 5
Accuracy score:  0.8048780487804879
5 of kfold 5
Accuracy score:  0.7868852459016393
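Since cross_val_score is already imported, the same five-fold estimate can be obtained more compactly. A sketch (the mean of the fold scores above works out to roughly 0.806):
cv_scores = cross_val_score(
    LogisticRegression(random_state=1, max_iter=1000),
    x_variables, y_variables,
    cv=StratifiedKFold(n_splits=5, random_state=1, shuffle=True),
    scoring='accuracy')
print(cv_scores.mean())  # expected to be around 0.806, matching the loop above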
# An ROC computed from hard 0/1 predictions collapses to just three points
print(roc_curve(y_test, y_pred))
(array([0. , 0.66666667, 1. ]), array([0. , 0.98850575, 1. ]), array([inf, 1., 0.]))
# For a full curve, use the predicted probabilities from the last validation fold
fpr, tpr, _ = metrics.roc_curve(yvl, pred)
auc = metrics.roc_auc_score(yvl,pred)
plt.figure(figsize=(10,3))
plt.plot(fpr,tpr,label='validation, auc='+str(auc))
plt.xlabel('False Positive rate')
plt.ylabel('True Positive Rate')
plt.legend(loc=4)
plt.show()
# AUC computed from hard labels on the held-out test set; a probability-based
# score (sketched below) is usually more informative
print(roc_auc_score(y_test, y_pred))
0.6609195402298851
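For comparison, a sketch of a probability-based AUC on the same held-out split, reusing the fitted logistic model from earlier; the resulting value is not recorded in this run.
y_prob = logistic.predict_proba(X_test)[:, 1]  # probability of class 1 (approved)
print(roc_auc_score(y_test, y_prob))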