{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "provenance": [] }, "kernelspec": { "name": "python3", "display_name": "Python 3" }, "language_info": { "name": "python" } }, "cells": [ { "cell_type": "markdown", "source": [ "## * What is Data Science?\n", " - Intersection of CS/IT + Maths/Stats + Domain/Business Knowledge\n", " - together, these make up data science\n", "\n", "## * What is Analytics?\n", " - Descriptive (ANOVA)\n", " - Predictive (LR)\n", " - Prescriptive (LPP)\n", "\n", "## * What is AI?\n", " - includes => learning + reasoning + self-correction + simulation\n", " - E.g. driverless cars\n", " * What can it do? \n", " - answer questions\n", " - monitor health\n", " - deliver groceries\n", " - enable breakthroughs in genomics\n", " * What can it not do?\n", " - human-level general intelligence\n", "\n", "## * What is ML?\n", " - in 1959 Arthur Samuel defined it as \"Field of study that gives computers the ability to learn without being explicitly programmed\"\n", " - it can be described as \"the effort to automate intellectual tasks normally performed by humans\"\n", " - in classical programming we input Rules + Data to get Answers, while in ML we input Data + Answers to get Rules \n", "\n", "## * Where is ML used? \n", " - Medicine\n", " - National Security\n", " - from a list of customers, which will respond?\n", " - which customers are likely to commit fraud?\n", "\n", "## * Role of ML\n", " - to help achieve AI\n", " - driven by mathematical concepts\n", " - analyse patterns in captured data to build predictive models of existing business phenomena\n", " - broadly, there are 3 types of ML algorithms\n", " 1. Supervised \n", " 2. Unsupervised \n", " 3. Reinforcement " ], "metadata": { "id": "Bb_rhylYOXWJ" } }, { "cell_type": "markdown", "source": [ "## Types of ML\n", "1. Supervised\n", "2. 
Unsupervised\n", "\n", "## Supervised\n", "* when the response variable is numerical, the predictive modelling task is called regression\n", "* when the response variable is categorical, the task is called classification \n", "\n", " ### * Example of Regression:\n", " * Sales are influenced by variables like advertising expenses, costs, dealers' cost etc.\n", " \n", " Sales = function(Adv. Exp, Manpower, Cost, Dealers...)\n", "\n", " ### * Example of Classification: \n", " * Customer may buy/not buy, patient may die/not die\n", "\n", " Prob(Cust Purchase) = function(Age, Income, Residence...)\n", "\n", " Prob(Cust defaults) = function(Expense, Taxes, Charges...)\n", "\n", " ### * Algorithms of Supervised Learning:\n", "\n", " 1. Linear Regression\n", " 2. Regression Trees\n", " 3. Non-Linear Regression\n", " 4. Bayesian Linear Regression\n", " 5. Polynomial Regression\n", "\n", "## Unsupervised\n", "* there is no outcome (response variable y) \n", "\n", " ### * Algorithms of Unsupervised Learning:\n", " 1. Clustering\n", " 2. PCA\n", " 3. Association Rules\n", "\n", "## Lifecycle of any ML project\n", "1. Define the scope\n", "2. Collect/Extract the Data\n", "3. Train an appropriate ML Model and Evaluate it\n", "4. Deploy the Model\n", "5. 
Collect feedback and backtrack with new inputs to make changes to the Model and re-Deploy\n", "\n", "## Types of predicted values\n", " * ### categorical\n", " * we use the classification confusion matrix\n", " * ### numerical \n", " * we use error metrics such as mean squared error (MSE) or R-squared\n", "\n", "\n" ], "metadata": { "id": "DPSAjMAsPVeK" } }, { "cell_type": "code", "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "from sklearn.decomposition import PCA\n", "from sklearn.preprocessing import StandardScaler\n", "from google.colab import drive\n", "import matplotlib.pyplot as plt\n", "drive.mount('/content/gdrive')" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Bf57lIcNUgtR", "outputId": "3943e175-dfcc-4e33-83bb-d3e0f4ddaec6" }, "execution_count": 2, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount(\"/content/gdrive\", force_remount=True).\n" ] } ] }, { "cell_type": "code", "source": [ "# Load the riding-mowers data and plot Income vs Lot_Size, coloured by Response\n", "mowers_path = '/content/gdrive/MyDrive/Datasets/RidingMowers.csv'\n", "mowers = pd.read_csv(mowers_path)\n", "with plt.style.context('seaborn-whitegrid'):\n", " fig_size = plt.rcParams[\"figure.figsize\"]\n", " fig_size[0] = 12 # X scaling of fig \n", " fig_size[1] = 8 # Y scaling of fig\n", " sns.scatterplot(data = mowers, x = 'Income', y = 'Lot_Size', hue = 'Response')\n", " plt.show()" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 493 }, "id": "lDx2MwjQUvXf", "outputId": "d95fd2b3-486f-4579-912f-cf7363c8897f" }, "execution_count": 8, "outputs": [ { "output_type": "display_data", "data": { "text/plain": [ "
" ], "image/png": "\n" }, "metadata": {} } ] }, { "cell_type": "code", "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import confusion_matrix, r2_score, f1_score, accuracy_score, classification_report\n", "# create dummy\n", "# train test split\n", "# go for knn\n", "# test and conclude\n", "dum_mow =pd.get_dummies(mowers,drop_first=True)\n", "\n", "X = dum_mow.drop('Response_Not Bought',axis=1)\n", "y= dum_mow['Response_Not Bought']\n", "X_train,X_test,y_train,y_test= train_test_split(X,y,stratify=y,random_state=2022,train_size=0.7)\n", "\n", "knn= KNeighborsClassifier(n_neighbors=3)\n", "knn.fit(X_train,y_train)\n", "\n", "y_pred= knn.predict(X_test)\n", "\n", "print(confusion_matrix(y_test,y_pred))\n", "print(accuracy_score(y_test,y_pred))\n", "print(classification_report(y_test,y_pred))\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "JBjB_wYde32n", "outputId": "5b5cb727-19d1-4ab6-f226-be248ef5025b" }, "execution_count": 20, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "[[22 0]\n", " [ 1 32]]\n", "0.9818181818181818\n", " precision recall f1-score support\n", "\n", " 0 0.96 1.00 0.98 22\n", " 1 1.00 0.97 0.98 33\n", "\n", " accuracy 0.98 55\n", " macro avg 0.98 0.98 0.98 55\n", "weighted avg 0.98 0.98 0.98 55\n", "\n" ] } ] }, { "cell_type": "code", "source": [ "#loop\n", "acc=[]\n", "ks = [x for x in range(1,16) if x%2!=0]\n", "for i in ks:\n", " knn =KNeighborsClassifier(n_neighbors=i)\n", " knn.fit(X_train,y_train)\n", " y_pred = knn.predict(X_test)\n", " acc.append(accuracy_score(y_test,y_pred))\n", "\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neigbors =\",best_k)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "4FNWE2DulSqY", "outputId": "21e5e96f-3837-4b0d-bc6f-c146f794f8ae" }, "execution_count": 25, "outputs": [ { "output_type": "stream", "name": 
"stdout", "text": [ "Best n_neigbors = 3\n" ] } ] }, { "cell_type": "markdown", "source": [ "# Functions:\n", " - roc_auc : helps to draw ROC curve\n", " - roc_auc_score : calculates area under the curve " ], "metadata": { "id": "y9eSYgYb-UXc" } }, { "cell_type": "code", "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "acc=[]\n", "ks = [x for x in range(1,16) if x%2!=0]\n", "for i in ks:\n", " knn =KNeighborsClassifier(n_neighbors=i)\n", " knn.fit(X_train,y_train)\n", " y_pred_prob = knn.predict_proba(X_test)[:,1]\n", " acc.append(roc_auc_score(y_test,y_pred_prob))\n", "\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neigbors =\",best_k)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Aayuzq_zp5gs", "outputId": "8298921a-bf06-40b4-f790-4ede8e05b001" }, "execution_count": 32, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neigbors = 3\n" ] } ] }, { "cell_type": "code", "source": [ "from sklearn.metrics import log_loss\n", "acc=[]\n", "ks = [x for x in range(1,16) if x%2!=0]\n", "for i in ks:\n", " knn =KNeighborsClassifier(n_neighbors=i)\n", " knn.fit(X_train,y_train)\n", " y_pred_prob = knn.predict_proba(X_test)[:,1]\n", " acc.append(-log_loss(y_test,y_pred_prob))\n", "\n", "i_max = np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"best n_neigbors =\",best_k)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "thAdHB8lH86L", "outputId": "23a64ba0-8516-476e-ae10-8f9d71b40c08" }, "execution_count": 42, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "best n_neigbors = 15\n" ] } ] }, { "cell_type": "code", "source": [ "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.metrics import mean_squared_error\n", "boston_path = '/content/gdrive/MyDrive/Datasets/Boston.csv'\n", "boston = pd.read_csv(boston_path)\n", "X= boston.drop('medv',axis=1)\n", "y= boston['medv']\n", "\n", "X_train, X_test,y_train,y_test= 
train_test_split(X,y,random_state=2022,train_size=0.7)\n", "acc=[]\n", "ks =np.arange(1,16)\n", "for i in ks:\n", " knn= KNeighborsRegressor(n_neighbors=i)\n", " knn.fit(X_train,y_train)\n", "\n", " y_pred =knn.predict(X_test)\n", " acc.append(-mean_squared_error(y_test,y_pred))\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neighbors =\",best_k)\n", "print(\"Best score =\",acc[i_max])\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "GeQy_XI7OCIx", "outputId": "7bc43384-78d4-4456-dd3f-95ed03999900" }, "execution_count": 53, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neighbors = 9\n", "Best score = -39.11516325536063\n" ] } ] }, { "cell_type": "code", "source": [ "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.metrics import mean_squared_error\n", "boston_path = '/content/gdrive/MyDrive/Datasets/Boston.csv'\n", "boston = pd.read_csv(boston_path)\n", "X= boston.drop('medv',axis=1)\n", "y= boston['medv']\n", "\n", "X_train, X_test,y_train,y_test= train_test_split(X,y,random_state=2022,train_size=0.7)\n", "acc=[]\n", "ks =np.arange(1,16)\n", "for i in ks:\n", " knn= KNeighborsRegressor(n_neighbors=i)\n", " knn.fit(X_train,y_train)\n", "\n", " y_pred =knn.predict(X_test)\n", " acc.append(r2_score(y_test,y_pred))\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neighbors =\",best_k)\n", "print(\"Best score =\",acc[i_max])\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "uZNe4b3-T0LX", "outputId": "f688dad7-f011-4c95-9597-6b032c1b6e6e" }, "execution_count": 54, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neighbors = 9\n", "Best score = 0.5433824716755711\n" ] } ] }, { "cell_type": "code", "source": [ "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", "X_trn_scl =scaler.transform(X_train)\n", "X_tst_scl =scaler.transform(X_test)\n", "acc=[]\n", "ks =np.arange(1,16)\n", "for i in 
ks:\n", " knn= KNeighborsRegressor(n_neighbors=i)\n", " knn.fit(X_trn_scl,y_train)\n", " y_pred =knn.predict(X_tst_scl)\n", " acc.append(r2_score(y_test,y_pred))\n", "\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neighbors =\",best_k)\n", "print(\"Best score =\",acc[i_max])\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sWnO14h2XXo2", "outputId": "d2ac5e4f-41c1-4bb5-e1e8-d58537f4ba8a" }, "execution_count": 58, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neighbors = 4\n", "Best score = 0.8398330163891028\n" ] } ] }, { "cell_type": "code", "source": [ "# without scaling\n", "concrete_path = '/content/gdrive/MyDrive/Datasets/Concrete_Data.csv'\n", "concrete = pd.read_csv(concrete_path)\n", "X= concrete.drop('Strength',axis=1)\n", "y= concrete['Strength']\n", "\n", "X_train, X_test,y_train,y_test= train_test_split(X,y,random_state=2022,train_size=0.7)\n", "acc=[]\n", "ks =np.arange(1,16)\n", "for i in ks:\n", " knn= KNeighborsRegressor(n_neighbors=i)\n", " knn.fit(X_train,y_train)\n", " y_pred =knn.predict(X_test)\n", " acc.append(r2_score(y_test,y_pred))\n", "\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neighbors =\",best_k)\n", "print(\"Best score =\",acc[i_max])\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Mqr4nzqUaMvy", "outputId": "8532003e-dc27-4395-81af-26b199434e94" }, "execution_count": 62, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neighbors = 2\n", "Best score = 0.712527052078325\n" ] } ] }, { "cell_type": "code", "source": [ "# with scaling\n", "concrete_path = '/content/gdrive/MyDrive/Datasets/Concrete_Data.csv'\n", "concrete = pd.read_csv(concrete_path)\n", "X= concrete.drop('Strength',axis=1)\n", "y= concrete['Strength']\n", "\n", "X_train, X_test,y_train,y_test= train_test_split(X,y,random_state=2022,train_size=0.7)\n", "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", 
"X_trn_scl =scaler.transform(X_train)\n", "X_tst_scl =scaler.transform(X_test)\n", "acc=[]\n", "ks =np.arange(1,16)\n", "for i in ks:\n", " knn= KNeighborsRegressor(n_neighbors=i)\n", " knn.fit(X_trn_scl,y_train)\n", " y_pred =knn.predict(X_tst_scl)\n", " acc.append(r2_score(y_test,y_pred))\n", "\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neighbors =\",best_k)\n", "print(\"Best score =\",acc[i_max])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "IFLdbPAvaZuR", "outputId": "d6e88dcd-835d-4fa2-e8a9-c62aa957cf3d" }, "execution_count": 63, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neighbors = 4\n", "Best score = 0.7273783938643456\n" ] } ] }, { "cell_type": "code", "source": [ "from sklearn.model_selection import StratifiedKFold\n", "from sklearn.model_selection import cross_val_score\n", "\n", "mowers_path = '/content/gdrive/MyDrive/Datasets/RidingMowers.csv'\n", "mowers = pd.read_csv(mowers_path)\n", "\n", "dum_mow =pd.get_dummies(mowers,drop_first=True)\n", "X= dum_mow.drop('Response_Not Bought',axis=1)\n", "y =dum_mow['Response_Not Bought']\n", "\n", "knn = KNeighborsClassifier(n_neighbors=3)\n", "kfold = StratifiedKFold(n_splits=5, shuffle=True,random_state=2022)\n", "\n", "#accuracy\n", "cross_val_score(knn,X,y,cv =kfold)\n", "\n", "#roc auc\n", "result = cross_val_score(knn,X,y,cv=kfold,scoring='roc_auc')\n", "print(result.mean())\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "xQPMLOXTkf4-", "outputId": "f12bde50-d240-4cc3-e2d1-f09f4d7253ec" }, "execution_count": 68, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "0.9353391053391054\n" ] } ] }, { "cell_type": "code", "source": [ "acc=[]\n", "kfold = StratifiedKFold(n_splits=5, shuffle=True,random_state=2022)\n", "ks=[x for x in range(1,16) if x%2!=0]\n", "for i in ks:\n", " knn = KNeighborsClassifier(n_neighbors=i)\n", " result = 
cross_val_score(knn,X,y,cv=kfold,scoring='roc_auc')\n", " acc.append(result.mean())\n", "\n", "i_max =np.argmax(acc)\n", "best_k =ks[i_max]\n", "print(\"Best n_neighbors =\",best_k)\n", "print(\"Best Cross Validation Score =\",acc[i_max])" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "vKOiUKwJn7nj", "outputId": "eddbd08d-bb21-4fc2-aba7-4fb84bf267f9" }, "execution_count": 79, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "Best n_neighbors = 7\n", "Best Cross Validation Score = 0.9387373737373738\n" ] } ] }, { "cell_type": "code", "source": [ "# Grid Search Cross Validation is even better for complicated algorithms without using the loop method\n", "from sklearn.model_selection import GridSearchCV\n", "ks= np.arange(1,16,2)\n", "parameters = {'n_neighbors':ks}\n", "knn=KNeighborsClassifier()\n", "gcv =GridSearchCV(knn,param_grid=parameters,scoring ='roc_auc',cv=kfold)\n", "gcv.fit(X,y)\n", "print(gcv.best_params_)\n", "print(gcv.best_score_)\n" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Ieah7zTCo9Si", "outputId": "6ff38ece-e82a-4afd-c76b-9a29cd42ccf7" }, "execution_count": 85, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'n_neighbors': 7}\n", "0.9387373737373738\n" ] } ] }, { "cell_type": "code", "source": [ "# kfold\n", "# KNeighborsRegressor\n", "# r2_score\n", "# boston \n", "\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.model_selection import KFold\n", "\n", "\n", "boston_path = '/content/gdrive/MyDrive/Datasets/Boston.csv'\n", "boston = pd.read_csv(boston_path)\n", "X= boston.drop('medv',axis=1)\n", "y= boston['medv']\n", "\n", "\n", "kfold = KFold(n_splits=5, shuffle=True,random_state=2022)\n", "ks= np.arange(1,16)\n", "parameters = {'n_neighbors':ks}\n", "knn=KNeighborsRegressor()\n", "\n", "gcv =GridSearchCV(knn,param_grid=parameters,scoring ='r2',cv=kfold)\n", 
"gcv.fit(X,y)\n", "\n", "\n", "print(gcv.best_params_)\n", "print(gcv.best_score_)" ], "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "M5WElObkqZgJ", "outputId": "6942fa65-03fc-421c-d5a8-2017401948c8" }, "execution_count": 88, "outputs": [ { "output_type": "stream", "name": "stdout", "text": [ "{'n_neighbors': 4}\n", "0.5460972244133464\n" ] } ] } ] }