{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Andrey Varkentin. Prediction of heart diseases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Feature and data explanation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this project we are going to analize factors, that contribute to heart diseases. The data was taken from UCI Machine Learning Repository from [here](http://archive.ics.uci.edu/ml/datasets/Heart+Disease)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "14 features have been used:\n", " \n", " -- 1. (age) \n", " \n", " -- 2. (sex) (1 = male; 0 = female) \n", " \n", " -- 3. (cp) chest pain type\n", "\n", " -- Value 1: typical angina\n", " -- Value 2: atypical angina\n", " -- Value 3: non-anginal pain\n", " -- Value 4: asymptomatic \n", " \n", " -- 4. (trestbps) resting blood pressure (in mm Hg on admission to the hospital)\n", " \n", " -- 5. (chol) serum cholestoral in mg/dl \n", " \n", " -- 6. (fbs) (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)\n", "\n", " -- 7. (restecg) resting electrocardiographic results\n", " \n", " -- Value 0: normal\n", " -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST \n", " elevation or depression of > 0.05 mV)\n", " -- Value 2: showing probable or definite left ventricular hypertrophy\n", " by Estes' criteria\n", " \n", " -- 8. (thalach) maximum heart rate achieved \n", " \n", " -- 9. (exang) exercise induced angina (1 = yes; 0 = no) \n", " \n", " -- 10. (oldpeak) = ST depression induced by exercise relative to rest\n", " \n", " -- 11. (slope) the slope of the peak exercise ST segment\n", " \n", " -- Value 1: upsloping\n", " -- Value 2: flat\n", " -- Value 3: downsloping \n", " \n", " -- 12. (ca) number of major vessels (0-3) colored by flourosopy \n", " \n", " -- 13. (thal) 3 = normal; 6 = fixed defect; 7 = reversable defect \n", " \n", " -- 14. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's import the necessary modules, read the data and divide the features into numerical and categorical ones" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib\n", "from matplotlib import pyplot as plt\n", "import matplotlib.patches as mpatches\n", "matplotlib.style.use('ggplot')\n", "%matplotlib inline\n", "import seaborn as sns" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "from sklearn.preprocessing import StandardScaler\n", "import matplotlib.cm as cm\n", "import warnings\n", "warnings.filterwarnings('ignore')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data = pd.read_csv('C:/Users/Andrey/Downloads/processed.cleveland.data.txt', sep=\",\", header=None)\n", "data = pd.read_csv('/home/andrey/Загрузки/processed.cleveland.data.txt', sep=\",\", header=None)\n", "data.columns = ['age', 'sex', 'pain', 'restbp', 'chol', 'fbs', 'restecg', \n", "                'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.head()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "numer = ['age', 'restbp', 'chol', 'thalach', 'oldpeak']\n", "categ = data.drop(columns=numer).columns.tolist()\n", "categ" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Primary data analysis " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's dive into the distribution of the features" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in categ:\n", "    print(i)\n", "    print(data[i].value_counts())\n", "    print(10 * \"-\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Oh, here we have some missing values (encoded as \"?\"). Let's clean the data" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# replace the \"?\" placeholders with NaN\n", "data['ca'] = data['ca'].apply(lambda x: np.nan if str(x) == \"?\" else x)\n", "data['thal'] = data['thal'].apply(lambda x: np.nan if str(x) == \"?\" else x)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# fill missing 'ca' with the most frequent value (0) and backfill 'thal'\n", "data['ca'].fillna(0, inplace=True)\n", "data['thal'].fillna(method='bfill', inplace=True)\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data['ca'].value_counts()" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "for i in categ:\n", "    data[i] = data[i].astype(\"float64\")" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data.dtypes" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot the features." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Primary visual data analysis " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data[\"num\"].value_counts().plot(kind=\"bar\")\n", "\n", "plt.title(\"Target distribution\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Classes are imbalanced; we will account for this in our models. Moreover, there are few observations for each of the disease classes 1-4, so it will be easier for an algorithm to recognize class 0 and much harder to distinguish between the individual diseases" ] },
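{ "cell_type": "markdown", "metadata": {}, "source": [ "To quantify the imbalance, the relative class frequencies can be printed directly (a quick check of the claim above):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# share of each target class in the dataset\n", "data['num'].value_counts(normalize=True)" ] },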
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(data[numer+[\"num\"]], \n", "             hue=\"num\", diag_kind=\"kde\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Almost all features look close to normally distributed. Anomalies can be noted in age and oldpeak for people in class 0" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=3, ncols=2, figsize=(16, 16))\n", "\n", "for idx, feat in enumerate(numer+['num']):\n", "    sns.boxplot(x='num', y=feat, data=data, ax=axes[idx // 2, idx % 2])\n", "    axes[idx // 2, idx % 2].set_xlabel('num')\n", "    axes[idx // 2, idx % 2].set_ylabel(feat)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We can't say there are many differences between the observed groups of people, though there are a lot of outliers for people in class 0, except for the age variable" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.heatmap(data[numer].corr(), square=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The most correlated variables are restbp and age, which is in line with age-related hypertension; still, the correlation is only moderate" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(16, 16))\n", "\n", "for idx, feat in enumerate(categ):\n", "    sns.countplot(x=feat, hue='num', data=data, ax=axes[idx // 2, idx % 2]);\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Conclusions from the graph:\n", "\n", "- Males are ill more often in our dataset\n", "\n", "- Asymptomatic chest pain is spread fairly evenly across the classes, while the other pain types are reported mostly in class 0\n", "\n", "- Blood sugar is rarely increased\n", "\n", "- ST-T wave abnormality is rare\n", "\n", "- Exercise-induced angina is reported more often for people in classes 2-4\n", "\n", "- The slope of the peak exercise ST segment is usually flat for all classes except class 0\n", "\n", "- The thal defect is more often reversible for all classes except class 0\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "y = data['num']\n", "X = data.drop(columns=['num'])" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "X_scaled = scaler.fit_transform(X)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tsne = TSNE(random_state=1)\n", "tsne_representation = tsne.fit_transform(X_scaled)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12,8))\n", "plt.scatter(tsne_representation[:, 0], tsne_representation[:, 1], \n", "            c=y, cmap='viridis_r', alpha=.8);" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "People in class 0 seem to be much easier to separate as a distinct cluster, while it is difficult to find clusters among the other diseases" ] },
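{ "cell_type": "markdown", "metadata": {}, "source": [ "The scatter above has no legend, so it is hard to tell which colour is which class. A small sketch that redraws it with one colour patch per class (reusing the mpatches import from the beginning):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(12, 8))\n", "plt.scatter(tsne_representation[:, 0], tsne_representation[:, 1], \n", "            c=y, cmap='viridis_r', alpha=.8)\n", "# one colour patch per class so the clusters can be read off the plot\n", "handles = [mpatches.Patch(color=plt.cm.viridis_r(i / y.max()), label='num = %d' % i)\n", "           for i in sorted(y.unique())]\n", "plt.legend(handles=handles);" ] },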
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Insights and found dependencies " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "From the previous EDA we've found that class 0 has many more peculiarities than the other classes in our data. That is why we should keep class balancing in mind and be rather cautious about the results we'll get for diseases 1-4, since we're going to solve a multiclass classification problem" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Data preprocessing \n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Choose the categorical variables to use in the models" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "categ2 = ['sex', 'pain', 'fbs', 'restecg', 'exang', 'slope', 'ca', 'thal']" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's encode the categorical features with DictVectorizer and scale the numeric features" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_extraction import DictVectorizer as DV\n", "\n", "# note: DictVectorizer one-hot encodes string values and passes numeric codes through unchanged\n", "encoder = DV(sparse=False)\n", "cat_hot_x = encoder.fit_transform(X[categ2].T.to_dict().values())" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# the same random_state and stratify settings keep the two splits row-aligned\n", "(X_train_cat_oh,\n", " X_test_cat_oh) = train_test_split(cat_hot_x, \n", "                                   test_size=0.3, \n", "                                   random_state=0, stratify=y)\n", "\n", "(X_train_num, \n", " X_test_num, \n", " y_train, y_test) = train_test_split(X[numer], y, \n", "                                     test_size=0.3, \n", "                                     random_state=0, stratify=y)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler_train = StandardScaler()\n", "scaler_train.fit(X_train_num, y_train)\n", "\n", "X_train_num_scaled = scaler_train.transform(X_train_num)\n", "X_test_num_scaled = scaler_train.transform(X_test_num)\n", "\n", "x_train = np.hstack((X_train_num_scaled, X_train_cat_oh))\n", "x_test = np.hstack((X_test_num_scaled, X_test_cat_oh))\n", "\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 6. Model selection" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We are solving a multiclass classification problem. The classifiers supported by scikit-learn that we are going to try here:\n", "\n", "- Inherently multiclass:\n", "  - sklearn.tree.DecisionTreeClassifier\n", "  - sklearn.neighbors.KNeighborsClassifier\n", "  - sklearn.ensemble.RandomForestClassifier\n", "  - sklearn.linear_model.RidgeClassifier\n", "- Multiclass as One-Vs-One:\n", "  - sklearn.svm.SVC\n", "- Multiclass as One-Vs-All:\n", "  - sklearn.linear_model.LogisticRegression (setting multi_class='ovr')\n", "\n", "Moreover, let's include XGBoost\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 7. Metrics selection" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We'll use a KFold cross-validation scheme with shuffling, and we will also divide our data into train and test sets. \n", "\n", "After that we are going to fit the models with default parameters and then tune them with GridSearchCV.\n", "\n", "As for metrics: in our case it is better to mistakenly flag a person from the mild class 0 as having one of the rarer diseases (they will simply go for an extra examination) than to miss a more dangerous disease. In other words, we want fewer false negatives (saying a person is relatively healthy when they are not), so we place the emphasis on recall. To account for this we can give recall more weight in the F1 metric; in scikit-learn this is done with the fbeta_score function (beta > 1 favours recall). We will also report accuracy and the ordinary F1 score to see how they differ from our target metric." ] },
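{ "cell_type": "markdown", "metadata": {}, "source": [ "A tiny illustration of how beta > 1 shifts the metric towards recall (toy labels, not our data):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import f1_score, fbeta_score\n", "\n", "# toy example: two positives are missed, so recall is the weak side\n", "y_true_toy = [1, 1, 1, 1, 0, 0]\n", "y_pred_toy = [1, 1, 0, 0, 0, 0]\n", "print('f1        :', f1_score(y_true_toy, y_pred_toy))\n", "print('f beta=1.5:', fbeta_score(y_true_toy, y_pred_toy, beta=1.5))  # lower, because it penalises the missed positives more" ] },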
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.model_selection import GridSearchCV\n", "from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier\n", "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.naive_bayes import GaussianNB\n", "from sklearn.svm import NuSVC, SVC\n", "import xgboost as xgb\n", "from sklearn.model_selection import KFold\n", "\n", "classifier = [RandomForestClassifier(max_depth=4, criterion='entropy'),\n", "              KNeighborsClassifier(n_neighbors=10),\n", "              xgb.XGBClassifier(learning_rate=0.1, max_depth=5, n_estimators=40, min_child_weight=3),\n", "              RidgeClassifier(random_state=1),\n", "              SVC(random_state=1),\n", "              LogisticRegression(random_state=0, multi_class='ovr'),\n", "              DecisionTreeClassifier(random_state=1)]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_val_score\n", "from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, fbeta_score\n", "metr = ['accuracy_score', 'roc_auc_score', 'f1_score']" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "CV = KFold(n_splits=3, shuffle=True, random_state=1)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scoress = []\n", "model_name = []\n", "for model in classifier:\n", "    model.fit(x_train, y_train)\n", "    y_pred = model.predict(x_test)\n", "    model_name.append(model.__class__.__name__)\n", "    scoress.append({\n", "        'accuracy': accuracy_score(y_test, y_pred),\n", "        'f1_score_wei': f1_score(y_test, y_pred, average='weighted'),\n", "        'f_beta': fbeta_score(y_test, y_pred, beta=1.5, average='weighted')})\n", "\n", "results = pd.DataFrame(data=scoress, columns=['Algorithm', 'accuracy', 'f1_score_wei', 'f_beta'])\n", "results['Algorithm'] = model_name" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "According to f_beta the best model so far is the DecisionTree. The same is true for f1_score, though the most accurate one is the Ridge model.\n", "\n", "Let's tune everything" ] },
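{ "cell_type": "markdown", "metadata": {}, "source": [ "confusion_matrix was imported above but has not been used yet; here is a quick sketch of how it could show which classes get mixed up, taking for instance the last fitted model from the list:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# per-class breakdown for one of the models fitted above (illustrative)\n", "cm = confusion_matrix(y_test, classifier[-1].predict(x_test))\n", "sns.heatmap(cm, annot=True, fmt='d', square=True)\n", "plt.xlabel('predicted')\n", "plt.ylabel('true');" ] },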
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 8. Cross-validation and adjustment of model hyperparameters " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "param_grid_rand = {'n_estimators' : [20, 50, 100],\n", "                   'max_depth' : [5, 7, 10, 15],\n", "                   'criterion' : ['gini', 'entropy']}\n", "param_grid_knn = {'n_neighbors': [2, 5, 10, 15],\n", "                  'weights': ['uniform', 'distance'],\n", "                  'p': [1, 2]}\n", "param_grid_log = {'penalty' : ['l1', 'l2'],\n", "                  'C' : [0.01, 0.05, 0.1, 0.5, 1, 5, 10]}\n", "param_grid_xgb = {'max_depth' : [2, 5, 7, 10],\n", "                  'learning_rate' : [0.01, 0.1, 1],\n", "                  'gamma' : [0.01, 0.1],\n", "                  'n_estimators' : [20, 30, 40, 50]}\n", "\n", "param_grid_ridge = {'alpha' : [0.001, 0.01, 0.1, 1]}\n", "param_grid_svc = {'kernel': ['linear', 'poly', 'rbf'],\n", "                  'degree': [2, 3],\n", "                  'decision_function_shape': ['ovo', 'ovr']}\n", "\n", "param_grid_tree = {'max_depth' : [3, 4, 5, 7, 10],\n", "                   'min_samples_split' : [2, 3, 5],\n", "                   'max_features' : ['auto', None]}\n", "# the grids are listed in the same order as the models in `classifier`\n", "param_grid = [param_grid_rand, param_grid_knn, param_grid_xgb, param_grid_ridge, \n", "              param_grid_svc, param_grid_log, \n", "              param_grid_tree]" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "\n", "scores2 = []\n", "model_name = []\n", "for i, clf in enumerate(classifier):\n", "    grid_cv = GridSearchCV(clf, param_grid[i], cv=3)\n", "    grid_cv.fit(x_train, y_train.values)\n", "    print(clf.__class__.__name__, grid_cv.best_params_)\n", "    pred = grid_cv.best_estimator_.predict(x_test)\n", "    model_name.append(clf.__class__.__name__)\n", "    scores2.append({\n", "        'accuracy': accuracy_score(y_test, pred),\n", "        'f1_score_wei': f1_score(y_test, pred, average='weighted'),\n", "        'f_beta': fbeta_score(y_test, pred, beta=1.5, average='weighted')})\n", "\n", "results2 = pd.DataFrame(data=scores2, columns=['Algorithm', 'accuracy', 'f1_score_wei', 'f_beta'])\n", "results2['Algorithm'] = model_name" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results2" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's plot all the metrics for the tuned models" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "f, ax = plt.subplots(figsize=(7, 7))\n", "sns.set(style=\"whitegrid\")\n", "\n", "sns.set_color_codes(\"muted\")\n", "sns.barplot(x=\"accuracy\", y='Algorithm', data=results2,\n", "            label=\"accuracy\", color=\"b\", alpha=0.5)\n", "\n", "sns.barplot(x=\"f_beta\", y='Algorithm', data=results2,\n", "            label=\"f_beta\", color=\"g\", alpha=0.5)\n", "\n", "sns.barplot(x=\"f1_score_wei\", y='Algorithm', data=results2,\n", "            label=\"f1_score_wei\", color=\"r\", alpha=0.5)\n", "\n", "\n", "ax.legend(ncol=2, loc=\"lower right\", frameon=True)\n", "ax.set(xlim=(0, 1), ylabel=\"\",\n", "       xlabel=\"score\")\n", "sns.despine(left=True, bottom=True)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "As we can see, the best performing model is SVC with parameters {'decision_function_shape': 'ovo', 'degree': 2, 'kernel': 'linear'}" ] },
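{ "cell_type": "markdown", "metadata": {}, "source": [ "Note that GridSearchCV above optimizes its default scorer (accuracy). Since our target metric is the recall-weighted F-beta, the search could also be pointed directly at it via make_scorer; a sketch for the SVC grid only (not re-run for the other models here):" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import make_scorer\n", "\n", "# tune the SVC grid against the same recall-weighted metric used for evaluation\n", "fbeta_scorer = make_scorer(fbeta_score, beta=1.5, average='weighted')\n", "grid_cv_fbeta = GridSearchCV(SVC(random_state=1), param_grid_svc, cv=3, scoring=fbeta_scorer)\n", "grid_cv_fbeta.fit(x_train, y_train.values)\n", "grid_cv_fbeta.best_params_" ] },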
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 9. Prediction for test or hold-out samples " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Here is the prediction of the best model" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = SVC(decision_function_shape='ovo', degree=2, kernel='linear')\n", "estimator.fit(x_train, y_train)\n", "pred_new = estimator.predict(x_test)\n", "fbeta_score(y_test, pred_new, beta=1.5, average='weighted')" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pred_new" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 10. Plotting training and validation curves" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Let's take our best model and look at the learning curves" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "from sklearn.model_selection import learning_curve\n", "\n", "RANDOM_SEED = 0\n", "train_sizes, train_scores, test_scores = learning_curve(estimator=SVC(decision_function_shape='ovo', \n", "                                                                      degree=2, kernel='linear'),\n", "                                                        X=x_train, \n", "                                                        y=y_train,\n", "                                                        train_sizes=[0.25, 0.5, 0.75, 1.0],\n", "                                                        cv=CV,\n", "                                                        shuffle=True,\n", "                                                        random_state=RANDOM_SEED)\n", "\n", "train_scores_mean = np.mean(train_scores, axis=1)\n", "train_scores_std = np.std(train_scores, axis=1)\n", "test_scores_mean = np.mean(test_scores, axis=1)\n", "test_scores_std = np.std(test_scores, axis=1)\n", "\n", "fig = plt.figure(figsize=(20, 5))\n", "\n", "plt.xlabel('train_sizes')\n", "plt.ylabel('accuracy')  # learning_curve uses the estimator's default scorer, i.e. accuracy\n", "# plt.ylim(0.5, 1.01)\n", "\n", "plt.plot(train_sizes,\n", "         train_scores_mean,\n", "         label=\"Training score\",\n", "         color=\"b\", marker='o')\n", "\n", "plt.plot(train_sizes,\n", "         test_scores_mean, \n", "         label=\"CV score\",\n", "         color=\"g\", marker='s')\n", "\n", "plt.fill_between(train_sizes, \n", "                 train_scores_mean - train_scores_std,\n", "                 train_scores_mean + train_scores_std, \n", "                 alpha=0.2, color=\"b\")\n", "\n", "plt.fill_between(train_sizes,\n", "                 test_scores_mean - test_scores_std,\n", "                 test_scores_mean + test_scores_std,\n", "                 alpha=0.2, color=\"g\")\n", "\n", "\n", "plt.legend(loc=\"best\")\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The good news: the curves are getting closer to each other. The bad news: the CV score hardly changes. Maybe this model requires a specific CV scheme" ] },
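{ "cell_type": "markdown", "metadata": {}, "source": [ "One natural candidate is StratifiedKFold, which keeps the class proportions of the target roughly constant in every fold; this matters with such an imbalanced target. A quick illustrative check of the per-fold class counts:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold\n", "\n", "skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)\n", "for fold, (tr_idx, val_idx) in enumerate(skf.split(x_train, y_train)):\n", "    # class counts inside each validation fold stay close to the overall proportions\n", "    print('fold', fold, np.bincount(y_train.iloc[val_idx].astype(int)))" ] },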
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import StratifiedKFold\n", "\n", "\n", "CV2 = StratifiedKFold(n_splits=3, random_state=0, shuffle=True)\n", "RANDOM_SEED = 0\n", "train_sizes, train_scores, test_scores = learning_curve(estimator=SVC(decision_function_shape='ovo', \n", "                                                                      degree=2, kernel='linear'),\n", "                                                        X=x_train, \n", "                                                        y=y_train,\n", "                                                        train_sizes=[0.25, 0.5, 0.75, 1.0],\n", "                                                        cv=CV2,\n", "                                                        shuffle=True,\n", "                                                        random_state=RANDOM_SEED)\n", "\n", "train_scores_mean = np.mean(train_scores, axis=1)\n", "train_scores_std = np.std(train_scores, axis=1)\n", "test_scores_mean = np.mean(test_scores, axis=1)\n", "test_scores_std = np.std(test_scores, axis=1)\n", "\n", "fig = plt.figure(figsize=(20, 5))\n", "\n", "plt.xlabel('train_sizes')\n", "plt.ylabel('accuracy')\n", "# plt.ylim(0.5, 1.01)\n", "\n", "plt.plot(train_sizes,\n", "         train_scores_mean,\n", "         label=\"Training score\",\n", "         color=\"b\", marker='o')\n", "\n", "plt.plot(train_sizes,\n", "         test_scores_mean, \n", "         label=\"CV score\",\n", "         color=\"g\", marker='s')\n", "\n", "plt.fill_between(train_sizes, \n", "                 train_scores_mean - train_scores_std,\n", "                 train_scores_mean + train_scores_std, \n", "                 alpha=0.2, color=\"b\")\n", "\n", "plt.fill_between(train_sizes,\n", "                 test_scores_mean - test_scores_std,\n", "                 test_scores_mean + test_scores_std,\n", "                 alpha=0.2, color=\"g\")\n", "\n", "\n", "plt.legend(loc=\"best\")\n", "plt.show()" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Yes, with stratified KFold the situation is better" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 11. Creation of new features and description of this process " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Can we improve the model by creating new features? The data consists mostly of categorical variables, so let's generate new features as cross-interactions of the existing ones. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "\n", "poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)\n", "data_poly_train = poly.fit_transform(x_train)\n", "data_poly_test = poly.transform(x_test)\n", "\n", "scaler = StandardScaler()\n", "data_poly_scaled_train = scaler.fit_transform(data_poly_train)\n", "data_poly_scaled_test = scaler.transform(data_poly_test)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "x_train_new = np.hstack((x_train, data_poly_scaled_train))\n", "x_test_new = np.hstack((x_test, data_poly_scaled_test))\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_train.shape, data_poly_scaled_train.shape, x_train_new.shape" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "estimator = SVC(decision_function_shape='ovo', degree=2, kernel='linear')\n", "estimator.fit(x_train_new, y_train)\n", "pred_new = estimator.predict(x_test_new)\n", "fbeta_score(y_test, pred_new, beta=1.5, average='weighted')" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "It got much worse, so let's just forget about this :)" ] },
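{ "cell_type": "markdown", "metadata": {}, "source": [ "One possible refinement (a sketch, not explored further here): instead of adding all interaction terms at once, keep only the ones that look most informative, e.g. with univariate feature selection. The value of k below is an arbitrary illustrative choice:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectKBest, f_classif\n", "\n", "# keep only the k interaction features with the highest univariate F-scores\n", "selector = SelectKBest(f_classif, k=20)\n", "x_train_sel = np.hstack((x_train, selector.fit_transform(data_poly_scaled_train, y_train)))\n", "x_test_sel = np.hstack((x_test, selector.transform(data_poly_scaled_test)))\n", "\n", "estimator_sel = SVC(decision_function_shape='ovo', degree=2, kernel='linear')\n", "estimator_sel.fit(x_train_sel, y_train)\n", "fbeta_score(y_test, estimator_sel.predict(x_test_sel), beta=1.5, average='weighted')" ] },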
{ "cell_type": "markdown", "metadata": {}, "source": [ "### 12. Conclusions " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "We have tried 7 different algorithms for the multiclass classification task. As we can see, distinguishing between the disease classes is more difficult than catching whether a person has any disease at all. In our analysis the best performing model turned out to be SVC. \n", "\n", "As for ways of improvement, the learning curves suggest that more data is needed. Moreover, more heart-related features could be included in the dataset, which would help to build more sophisticated algorithms.\n", "\n", "Use cases for such an approach are obvious: automation of patient monitoring and helping doctors clarify a diagnosis.\n", "\n", "Hope you've found this project interesting, as I did. (^_^)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }