{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "**Research plan**\n", "\n", "[Part 0. Mobile Price Classification](#mpc)
\n", "[Part 1. Feature and data explanation](#part1)
\n", "[Part 2. Primary data analysis](#EDA)
\n", "[Part 3. Primary visual data analysis](#part3)
\n", "[Part 4. Insights and found dependencies](#part4)
\n", "[Part 5. Metrics selection](#part5)
\n", "[Part 6. Model selection](#part6)
\n", "[Part 7. Data preprocessing](#part7)
\n", "[Part 8. Cross-validation and adjustment of model hyperparameters](#part8)
\n", "[Part 9. Creation of new features and description of this process](#part9)
\n", "[Part 10. Plotting training and validation curves](#part10)
\n", "[Part 11. Prediction for test or hold-out samples](#part11)
\n", "[Part 12. Conclusions](#part12)
\n", "[Bonus Part. Clustering](#bonus)
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Mobile Price Classification \n", "
Author: Andrey Trefilov" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 1. Feature and data explanation " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bob has started his own mobile company. He wants to compete with big companies like Apple, Samsung, etc.\n", "\n", "He does not know how to estimate the prices of the phones his company creates, and in this competitive mobile phone market you cannot simply assume things. To solve this problem he collects sales data on mobile phones from various companies.\n", "\n", "Bob wants to find a relation between the features of a mobile phone (e.g. RAM, internal memory) and its selling price.\n", "\n", "In this project we have to predict the price range, which indicates how high the price is." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Download the dataset from the [Kaggle page](https://www.kaggle.com/iabhishekofficial/mobile-price-classification)\n", "
\n", "Dataset contain train (with target variable) and test (without target variable) samples.\n", "
\n", "For the train sample, we will solve the multiclass classification problem with 4 class, and for the test sample we will solve the clustering problem." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The dataset has the following features (copied from Kaggle):\n", "Every object - it is a unique mobile phone.\n", "- **battery_power** - Total energy a battery can store in one time measured in mAh (quantitative);\n", "- **blue** - Has bluetooth or not (binary);\n", "- **clock_speed** - speed at which microprocessor executes instructions (quantitative);\n", "- **dual_sim** - Has dual sim support or not (binary);\n", "- **fc** - Front Camera mega pixels (categorical);\n", "- **four_g** - Has 4G or not (binary);\n", "- **int_memory** - Internal Memory in Gigabytes (quantitative);\n", "- **m_dep** - Mobile Depth in cm (categorical); \n", "- **mobile_wt** - Weight of mobile phone (quantitative);\n", "- **n_cores** - Number of cores of processor (categorical);\n", "- **pc** - Primary Camera mega pixels (categorical);\n", "- **px_height** - Pixel Resolution Heigh (quantitative);\n", "- **px_width** - Pixel Resolution Width (quantitative);\n", "- **ram** - Random Access Memory in Megabytes (quantitative);\n", "- **sc_h** - Screen Height of mobile in cm (categorical);\n", "- **sc_w** - Screen Width of mobile in cm (categorical);\n", "- **talk_time** - longest time that a single battery charge will last when you are (quantitative);\n", "- **three_g** - Has 3G or not (binary);\n", "- **touch_screen** - Has touch screen or not (binary);\n", "- **wifi** - Has wifi or not (binary);\n", "
\n", "\n", "- **price_range** - This is the `target variable` with value of 0(low cost), 1(medium cost), 2(high cost) and 3(very high cost). Contain only the in train sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2. Primary data analysis " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Importing libraries:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "\n", "from pylab import rcParams\n", "rcParams['figure.figsize'] = 10, 8\n", "#%config InlineBackend.figure_format = 'svg'\n", "import warnings\n", "warnings.simplefilter('ignore')\n", "from sklearn.decomposition import PCA\n", "from matplotlib import pyplot as plt\n", "from sklearn.manifold import TSNE\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.pipeline import Pipeline\n", "from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier\n", "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.svm import SVC\n", "from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor\n", "from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_predict, StratifiedKFold, validation_curve\n", "from sklearn.metrics import roc_auc_score, accuracy_score, precision_score, recall_score,\\\n", " f1_score, make_scorer, classification_report, confusion_matrix\n", "from sklearn import svm, datasets\n", "from sklearn.metrics import roc_curve, auc\n", "pd.set_option('display.max_rows', 20)\n", "pd.set_option('display.max_columns', 21)\n", "from sklearn import metrics\n", "from sklearn.cluster import KMeans, AgglomerativeClustering, AffinityPropagation, SpectralClustering\n", "from tqdm import tqdm_notebook\n", "from sklearn.metrics.cluster import adjusted_rand_score\n", "from scipy.cluster import hierarchy\n", "from scipy.spatial.distance import pdist\n", "from sklearn.model_selection import learning_curve\n", "from sklearn.model_selection import ShuffleSplit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let`s look at data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train = pd.read_csv('../data/mobile/train.csv')\n", "data_test = pd.read_csv('../data/mobile/test.csv')\n", "data_test.drop(columns='id', inplace=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_test.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our samples we have quantitative features, categorical and binary features\n", "\n", "
\n", "And our samples haven't missing items in the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train.info()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_test.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the distribution of target feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train.groupby('price_range')[['price_range']].count().rename(columns={'price_range': 'count'}).T" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, it is a toy dataset..)We see that the target variable is uniform distributed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 3. Primary visual data analysis " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's draw plot of correlation matrix (before this, drop a boolean variables):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corr_matrix = data_train.drop(['blue', 'dual_sim', 'four_g', 'three_g', 'touch_screen', 'wifi'], axis=1).corr()\n", "fig, ax = plt.subplots(figsize=(16,12))\n", "sns.heatmap(corr_matrix,annot=True,fmt='.1f',linewidths=0.5);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, we see that there is a correlation between the `target` variable and four features: `battery_power`, `px_height`, `px_width` and `ram`.\n", "\n", "\n", "And some variables are correlated with each other: `pc` and `fc` (photo modules), `sc_w` and `sc_h` (screen width and heght), `px_width` and `px_height` (pixel resolution heigh and width)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Draw plot of distribution of target variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train['price_range'].value_counts().plot(kind='bar',figsize=(14,6))\n", "plt.title('Distribution of target variable');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, we again see that the target variable is uniform distributed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Look at the distribution of quantitative features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features = list(data_train.drop(['price_range', 'blue', 'dual_sim',\\\n", " 'four_g', 'fc', 'm_dep', 'n_cores',\\\n", " 'pc', 'sc_h', 'sc_w', 'three_g', 'wifi', 'touch_screen'], axis=1).columns)\n", "data_train[features].hist(figsize=(20,12));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at the interaction of different features among themselves with `sns.pairplot`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sns.pairplot(data_train[features + ['price_range']], hue='price_range');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the `ram` feature of a good separates our objects by different price categories." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Construct the `sns.boxplot`, describe the distribution statistics of quantitative traits:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(nrows=2, ncols=4, figsize=(20, 12))\n", "\n", "for idx, feat in enumerate(features):\n", " sns.boxplot(x='price_range', y=feat, data=data_train, ax=axes[int(idx / 4), idx % 4])\n", " axes[int(idx / 4), idx % 4].set_xlabel('price_range')\n", " axes[int(idx / 4), idx % 4].set_ylabel(feat);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that it is better to difference our price categories the following features: `battery_power`, `px_height`, `px_width` и `ram`. As well as the plot of the correlation matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, let's plot the distribution for `sc_w` - categorical feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(16,10))\n", "sns.countplot(x='sc_w', hue='price_range', data=data_train);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wee see that count of our object decreases with increasing width" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "plot the distribution for `sc_w` - categorical feature:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(16,10))\n", "sns.countplot(x='sc_h', hue='price_range', data=data_train);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at the connection of binary features of `blue`, `dual_sim`, `four_g` and `three_g` with our target `price_range`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_, axes = plt.subplots(1, 4, sharey=True, figsize=(16,6))\n", "\n", "sns.countplot(x='blue', hue='price_range', data=data_train, ax=axes[0]);\n", "sns.countplot(x='dual_sim', hue='price_range', data=data_train, ax=axes[1]);\n", "sns.countplot(x='four_g', hue='price_range', data=data_train, ax=axes[2]);\n", "sns.countplot(x='three_g', hue='price_range', data=data_train, ax=axes[3]);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All about the same, but count objects with 3G more than without." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's build a t-SNE representation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = data_train.drop('price_range', axis=1)\n", "y = data_train.price_range" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tsne = TSNE(random_state=17)\n", "tsne_representation = tsne.fit_transform(X)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(16,10))\n", "cmap = sns.cubehelix_palette(dark=.1, light=.8, as_cmap=True)\n", "sns.scatterplot(tsne_representation[:, 0], tsne_representation[:, 1],\\\n", " s=100, hue=data_train['price_range'], palette=\"Accent\");\n", "plt.title('t-SNE projection');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the object is well distinguished." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's look at another representation of the `scaled data` colored by binary features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "scaler = StandardScaler()\n", "X_scaled = scaler.fit_transform(X)\n", "tsne2 = TSNE(random_state=17)\n", "tsne_representation2 = tsne2.fit_transform(X_scaled)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_, axes = plt.subplots(2, 2, sharey=True, figsize=(16,10))\n", "\n", "axes[0][0].scatter(tsne_representation2[:, 0], tsne_representation2[:, 1], \n", " c=data_train['three_g'].map({0: 'blue', 1: 'orange'}));\n", "axes[0][1].scatter(tsne_representation2[:, 0], tsne_representation2[:, 1], \n", " c=data_train['four_g'].map({0: 'blue', 1: 'orange'}));\n", "axes[1][0].scatter(tsne_representation2[:, 0], tsne_representation2[:, 1], \n", " c=data_train['blue'].map({0: 'blue', 1: 'orange'}));\n", "axes[1][1].scatter(tsne_representation2[:, 0], tsne_representation2[:, 1], \n", " c=data_train['dual_sim'].map({0: 'blue', 1: 'orange'}));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, we see that the binary features are a bunch)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 4. Insights and found dependencies " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combining the observation from the previous paragraphs, the following is to be denoted:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 5. Metrics selection " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have a problem of multi-class classification. It is necessary to predict the class itself, not the probability of belonging to the class, so we use the metrics from the classification problem, namely `accuracy`, `precision`, `recall `, `f1`. The basic metric we will have is `accuracy` but we will use `classification_report` to estimate other metrics.\n", "\n", "We can use `accuracy`, because we have uniform distribution of target variable.\n", "\n", "$$\\mathcal accuracy = \\dfrac{1}{l}\\sum_{i=1}^l [a(x_{i})=y_{i}]$$\n", "\n", "We will also consider the `confusion matrix`, columns `i` - true class label, line `j` - assessment of class membership from our algorithm, where $q_{ij}$: \n", "\n", "$$\\mathcal q_{ij} = \\sum_{m=1}^l [a(x_{m})=i][y_{m}=j]$$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 6. Model selection " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, we have a problem of multi-class classification, and as we already know our task linearly separable.\n", "That's why we can use `LogisticRegression`. Well, we have four classes, and to solve this problem is well suited `OneVsOneClassifier` - a model that trains K(K-1) models for each pair of classes.\n", "\n", "With a problem of multi-class classification the following models also work well by default:\n", "\n", "- KNeighborsClassifier\n", "- RandomForestClassifier\n", "- SVC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 7. 
Data preprocessing " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We split our sample into a feature matrix and a target vector:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = data_train.drop('price_range', axis=1)\n", "y = data_train.price_range" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's split the data into a train sample and a hold-out sample:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train_part, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, stratify=y, random_state=17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some models do not need feature scaling, but for others it is necessary:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scaler = StandardScaler()\n", "X_scaled = scaler.fit_transform(X)\n", "X_train_part_scaled, X_valid_scaled, y_train, y_valid = train_test_split(X_scaled, y,\\\n", " test_size=0.3, stratify=y, random_state=17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 8. Cross-validation and adjustment of model hyperparameters " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `LogisticRegression` with scaled features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "lr = LogisticRegression(random_state=17)\n", "lr.fit(X_train_part_scaled, y_train);\n", "print(accuracy_score(y_valid, lr.predict(X_valid_scaled)))\n", "print(classification_report(y_valid, lr.predict(X_valid_scaled)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion matrix for `LogisticRegression`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y_valid, lr.predict(X_valid_scaled), margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = tab.index\n", "tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For `GridSearchCV` we tune the following `LogisticRegression` parameters: `C` - inverse of regularization strength (smaller values specify stronger regularization); `solver` - algorithm to use in the optimization problem; `class_weight` - weights associated with classes in the form {class_label: weight}." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = {'C': np.logspace(-5, 5, 11),\n", " 'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n", " 'class_weight' : ['balanced', None]}\n", "\n", "\n", "lr_grid = GridSearchCV(lr, params, n_jobs=-1, cv=5, scoring='accuracy', verbose=1)\n", "lr_grid.fit(X_train_part_scaled, y_train);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(accuracy_score(y_valid, lr_grid.predict(X_valid_scaled)))\n", "print(classification_report(y_valid, lr_grid.predict(X_valid_scaled)))\n", "lr_grid.best_params_, lr_grid.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice, after `GridSearchCV` we see that the score increases."
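 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Besides `best_params_`, it is often useful to inspect the whole search table (a small illustrative sketch based on the standard `cv_results_` attribute of the fitted `lr_grid`):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Illustrative: top parameter combinations ranked by mean CV accuracy\n", "pd.DataFrame(lr_grid.cv_results_).sort_values('rank_test_score')[['params', 'mean_test_score', 'std_test_score']].head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Combinations with a similar mean score but a smaller standard deviation are usually the safer choice."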
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion matrix for `LogisticRegression` after `GridSearchCV`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y_valid, lr_grid.predict(X_valid_scaled), margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = tab.index\n", "tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `KNeighborsClassifier` with unscaled features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kneigh = KNeighborsClassifier()\n", "kneigh.fit(X_train_part, y_train) \n", "\n", "print(accuracy_score(y_valid, kneigh.predict(X_valid)))\n", "print(classification_report(y_valid, kneigh.predict(X_valid)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion matrix for `KNeighborsClassifier`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y_valid, kneigh.predict(X_valid), margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = tab.index\n", "tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `OneVsOneClassifier` with scaled features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf = OneVsOneClassifier(LogisticRegression(random_state=17))\n", "clf.fit(X_train_part_scaled, y_train);\n", "\n", "print(accuracy_score(y_valid, clf.predict(X_valid_scaled)))\n", "print(classification_report(y_valid, clf.predict(X_valid_scaled)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Doing `GridSearchCV` for `OneVsOneClassifier` with `LogisticRegression`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params = {'estimator__C': np.logspace(-5, 5, 11),\n", " 'estimator__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n", " 'estimator__class_weight' : ['balanced', None]}\n", "\n", "\n", "clf_grid = GridSearchCV(clf, params, n_jobs=-1, cv=5, scoring='accuracy', verbose=1)\n", "clf_grid.fit(X_train_part_scaled, y_train);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(accuracy_score(y_valid, clf_grid.predict(X_valid_scaled)))\n", "print(classification_report(y_valid, clf_grid.predict(X_valid_scaled)))\n", "clf_grid.best_params_, clf_grid.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion matrix for `OneVsOneClassifier` after `GridSearchCV`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y_valid, clf_grid.predict(X_valid_scaled), margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = tab.index\n", "tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this task `OneVsOneClassifier` very good classifier!" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `RandomForestClassifier` with unscaled features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_clf = RandomForestClassifier(random_state=17)\n", "rf_clf.fit(X_train_part, y_train) \n", "print(accuracy_score(y_valid, rf_clf.predict(X_valid)))\n", "print(classification_report(y_valid, rf_clf.predict(X_valid)))\n", "#print(confusion_matrix(y_valid, rf_clf.predict(X_valid)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Let's see `feature_importances_ ` for `RandomForestClassifier`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({'feat': X_train_part.columns,\n", " 'coef': np.abs(rf_clf.feature_importances_).flatten().tolist()}).\\\n", " sort_values(by='coef', ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### No wonder the correlation matrix told us that already." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion matrix for `RandomForestClassifier`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y_valid, rf_clf.predict(X_valid), margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = tab.index\n", "tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## `SVC` with unscaled features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svc = SVC(kernel='linear', probability=True, random_state=17)\n", "svc.fit(X_train_part, y_train);\n", "\n", "print(accuracy_score(y_valid, svc.predict(X_valid)))\n", "print(classification_report(y_valid, svc.predict(X_valid)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svc = SVC(kernel='linear', probability=True, random_state=17)\n", "svc.fit(X_train_part, y_train);\n", "\n", "print(accuracy_score(y_valid, svc.predict(X_valid)))\n", "print(classification_report(y_valid, svc.predict(X_valid)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Doing `GridSearchCV` for `SVC`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "params_svc = {'C': np.logspace(-1, 1, 3),\n", " 'decision_function_shape': ['ovr', 'ovo'],\n", " 'class_weight' : ['balanced', None]}\n", "\n", "\n", "svc_grid = GridSearchCV(svc, params_svc, n_jobs=-1, cv=3, scoring='accuracy', verbose=1)\n", "svc_grid.fit(X_train_part, y_train);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(accuracy_score(y_valid, svc_grid.predict(X_valid)))\n", "print(classification_report(y_valid, svc_grid.predict(X_valid)))\n", "svc_grid.best_params_, svc_grid.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Confusion matrix for `SVC` after `GridSearchCV`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y_valid, svc_grid.predict(X_valid), margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = tab.index\n", "tab" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We have 2 models with amazing score - `OneVsOneClassifier` with `LogisticRegression` (scaled features), and `SVC` (unscaled features), with `accuracy = 
0.9766` and `accuracy = 0.98` after `GridSearchCV` respectively!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 9. Creation of new features and description of this process " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `inch` (abbreviation: in or ″) is a unit of length in the (British) imperial and United States customary systems of measurement. It is equal to 1⁄36 yard or 1⁄12 of a foot. Derived from the Roman uncia (\"twelfth\"), the word inch is also sometimes used to translate similar units in other measurement systems, usually understood as deriving from the width of the human thumb. Standards for the exact length of an inch have varied in the past, but since the adoption of the international yard during the 1950s and 1960s it has been based on the metric system and defined as exactly 2.54 cm. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pixels per inch (`ppi`) or pixels per centimeter (ppcm) are measurements of the pixel density (resolution) of an electronic image device, such as a computer monitor or television display, or image digitizing device such as a camera or image scanner. Horizontal and vertical density are usually the same, as most devices have square pixels, but differ on devices that have non-square pixels. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So I think `ppi` is a good feature: the larger its value, the sharper the image.\n", "\n", "Let's check this." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train2 = data_train.copy()\n", "data_train2['inch'] = (np.sqrt(data_train2['sc_h']**2 + data_train2['sc_w']**2)/2.54).astype('int')\n", "data_train2['ppi'] = np.sqrt(data_train2['px_width']**2 + data_train2['px_height']**2)/data_train2['inch']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's also add a feature that flags phones with the characteristics of current top models:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_train2['top'] = ((data_train2['touch_screen'] ==1)|\\\n", " (data_train2['ppi'] >=500)&\\\n", " (data_train2['inch'] >=5)&\\\n", " (data_train2['four_g'] ==1)|\\\n", " (data_train2['blue'] ==1)|\\\n", " (data_train2['int_memory'] >=36)|\\\n", " (data_train2['ram'] >=2600)).astype('int64')\n", "data_train2['top'].value_counts() " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, let's check these features with our models:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For `SVC`, the unscaled feature matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train2, y2 = data_train2.drop(['price_range','inch'], axis=1), data_train2['price_range']\n", "X_train_part2, X_valid2, y_train2, y_valid2 = train_test_split\\\n", " (X_train2, y2, test_size=.3, stratify=y2, random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svc2 = SVC(kernel='linear', probability=True, random_state=17)\n", "svc2.fit(X_train_part2, y_train2);\n", "\n", "print(accuracy_score(y_valid2, svc2.predict(X_valid2)))\n", "print(classification_report(y_valid2, svc2.predict(X_valid2)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "params_svc2 = {'C': np.logspace(-1, 1, 3),\n", " 'decision_function_shape': ['ovr', 'ovo'],\n", " 'class_weight' : ['balanced', None]}\n", "\n", "\n", "svc_grid2 = 
GridSearchCV(svc2, params_svc2, n_jobs=-1, cv=3, scoring='accuracy', verbose=1)\n", "svc_grid2.fit(X_train_part2, y_train2);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(accuracy_score(y_valid2, svc_grid2.predict(X_valid2)))\n", "print(classification_report(y_valid2, svc_grid2.predict(X_valid2)))\n", "svc_grid2.best_params_, svc_grid2.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For `OneVsOneClassifier` with `LogisticRegression`, the scaled feature matrix:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X2 = data_train2.drop(['price_range','inch'], axis=1)\n", "scaler2 = StandardScaler()\n", "X_scaled2, y2 = scaler2.fit_transform(X2), data_train2['price_range']\n", "X_train_part_scaled2, X_valid_scaled2, y_train2, y_valid2 = train_test_split\\\n", " (X_scaled2, y2, test_size=.3, stratify=y2, random_state=17)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf2 = OneVsOneClassifier(LogisticRegression(random_state=17))\n", "clf2.fit(X_train_part_scaled2, y_train2);\n", "\n", "print(accuracy_score(y_valid2, clf2.predict(X_valid_scaled2)))\n", "print(classification_report(y_valid2, clf2.predict(X_valid_scaled2)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "params2 = {'estimator__C': np.logspace(-5, 5, 11),\n", " 'estimator__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],\n", " 'estimator__class_weight' : ['balanced', None]}\n", "\n", "\n", "clf_grid2 = GridSearchCV(clf2, params2, n_jobs=-1, cv=5, scoring='accuracy', verbose=1)\n", "clf_grid2.fit(X_train_part_scaled2, y_train2);" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(accuracy_score(y_valid2, clf_grid2.predict(X_valid_scaled2)))\n", "print(classification_report(y_valid2, clf_grid2.predict(X_valid_scaled2)))\n", "clf_grid2.best_params_, clf_grid2.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, with the new features we observe the following: `OneVsOneClassifier` with `LogisticRegression`, compared with the default train sample, improves its score after `GridSearchCV` and now reaches `accuracy = 0.98`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`SVC` with the new features and without `GridSearchCV` improves from `accuracy = 0.9766` to `accuracy = 0.9783`, but `GridSearchCV` does not improve the result further." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 10. 
Plotting training and validation curves " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,\n", " n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):\n", "\n", " plt.figure()\n", " plt.title(title)\n", " if ylim is not None:\n", " plt.ylim(*ylim)\n", " plt.xlabel(\"Training examples\")\n", " plt.ylabel(\"Score\")\n", " train_sizes, train_scores, test_scores = learning_curve(\n", " estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)\n", " train_scores_mean = np.mean(train_scores, axis=1)\n", " train_scores_std = np.std(train_scores, axis=1)\n", " test_scores_mean = np.mean(test_scores, axis=1)\n", " test_scores_std = np.std(test_scores, axis=1)\n", " plt.grid()\n", "\n", " plt.fill_between(train_sizes, train_scores_mean - train_scores_std,\n", " train_scores_mean + train_scores_std, alpha=0.1,\n", " color=\"r\")\n", " plt.fill_between(train_sizes, test_scores_mean - test_scores_std,\n", " test_scores_mean + test_scores_std, alpha=0.1, color=\"g\")\n", " plt.plot(train_sizes, train_scores_mean, 'o-', color=\"r\",\n", " label=\"Training score\")\n", " plt.plot(train_sizes, test_scores_mean, 'o-', color=\"g\",\n", " label=\"Cross-validation score\")\n", "\n", " plt.legend(loc=\"best\")\n", " return plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting training and validation curves for the tuned `SVC` model with the new features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "svc3 = SVC(C=0.1, kernel='linear', probability=True, class_weight='balanced', random_state=17)\n", "title = \"Learning Curves (SVM, Linear kernel, C=0.1)\"\n", "plot_learning_curve(svc3, title, X_train_part2, y_train2, (0.7, 1.01), cv=20, n_jobs=4)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting training and validation curves for the tuned `OneVsOneClassifier` with `LogisticRegression` and the new features:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clf3 = OneVsOneClassifier(LogisticRegression(C=100,\\\n", " class_weight='balanced', solver='newton-cg', random_state=17))\n", "title = \"Learning Curves (OneVsOneClassifier, LogisticRegression base model, C=100)\"\n", "plot_learning_curve(clf3, title, X_train_part_scaled2, y_train2, (0.7, 1.01), cv=20, n_jobs=4)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the curves practically converge, which indicates a high quality of the predictions; if we continue to move to the right (add more data to the model), we could still improve the validation quality slightly." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 11. Prediction for test or hold-out samples " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This was discussed in Parts 8 and 9." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 12. 
Conclusions " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We had a multi-class classification problem, and we saw that the following methods do the best job: `OneVsOneClassifier` with `LogisticRegression` and `SVC`.\n", "We got a very good score.\n", "\n", "Now Bob knows how to price the phones his company produces!\n", "\n", "Further ways to improve the solution:\n", "\n", "- collect additional characteristics of the phone components (manufacturer, type, brand);\n", "- collect data about other phones;\n", "- engineer more new features;\n", "- combine multiple predictive models.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Bonus Part. Clustering " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Consider the train sample:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reduce the dimension while preserving the variance:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pca = PCA(n_components=0.9, random_state=17).fit(X2)\n", "X_pca = pca.transform(X2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Projection of our data onto the first two principal components:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(16,12))\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y2, s=100, cmap=plt.cm.get_cmap('nipy_spectral', 4));\n", "plt.colorbar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "t-SNE representation of our data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "tsne3 = TSNE(random_state=17)\n", "\n", "X_tsne = tsne3.fit_transform(X2)\n", "\n", "plt.figure(figsize=(16,10))\n", "plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y2, \n", " edgecolor='none', alpha=0.7, s=200,\n", " cmap=plt.cm.get_cmap('viridis', 4))\n", "plt.colorbar()\n", "plt.title('t-SNE projection')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "K-means clustering:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "kmeans = KMeans(n_clusters=4,random_state=17, n_jobs=1)\n", "kmeans.fit(X_pca)\n", "kmeans_labels = kmeans.labels_+1\n", "plt.figure(figsize=(16,12))\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1],\\\n", " c=kmeans_labels, s=20,\\\n", " cmap=plt.cm.get_cmap('nipy_spectral', 4));\n", "plt.colorbar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The confusion matrix is quite bad:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tab = pd.crosstab(y2, kmeans_labels, margins=True)\n", "tab.index = ['low cost', 'medium cost', 'high cost', 'very high cost', 'all']\n", "tab.columns = ['cluster' + str(i + 1) for i in range(4)] + ['all']\n", "tab" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.Series(tab.iloc[:-1,:-1].max(axis=1).values / \n", " tab.iloc[:-1,-1].values, index=tab.index[:-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inertia = []\n", "for k in tqdm_notebook(range(1, 12)):\n", " kmeans = KMeans(n_clusters=k, random_state=17).fit(X2)\n", " inertia.append(np.sqrt(kmeans.inertia_))\n", "plt.plot(range(1, 12), inertia, marker='s');\n", "plt.xlabel('$k$')\n", "plt.ylabel('$J(C_k)$');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Agglomerative clustering:" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "ag = AgglomerativeClustering(n_clusters=4, \n", " linkage='ward').fit(X_pca)\n", "ag_labels = ag.labels_+1\n", "plt.figure(figsize=(16,12))\n", "plt.scatter(X_pca[:, 0], X_pca[:, 1],\\\n", " c=ag_labels, s=20,\\\n", " cmap=plt.cm.get_cmap('nipy_spectral', 4));#cmap='viridis');\n", "plt.colorbar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Score ARI for K-MEANS and Agglomerative Clustering:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adjusted_rand_score(y2, ag.labels_)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "adjusted_rand_score(y2, kmeans.labels_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dendrogram:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "distance_mat = pdist(X2) # pdist calculates the upper triangle of the distance matrix\n", "\n", "Z = hierarchy.linkage(distance_mat, 'single') # linkage is agglomerative clustering algorithm\n", "plt.figure(figsize=(10, 5))\n", "dn = hierarchy.dendrogram(Z, color_threshold=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A summary of the score on the train sample:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "algorithms = []\n", "algorithms.append(KMeans(n_clusters=4, random_state=17))\n", "algorithms.append(AffinityPropagation())\n", "algorithms.append(SpectralClustering(n_clusters=4, random_state=17,\n", " affinity='nearest_neighbors'))\n", "algorithms.append(AgglomerativeClustering(n_clusters=4))\n", "\n", "data = []\n", "for algo in algorithms:\n", " algo.fit(X_pca)\n", " data.append(({\n", " 'ARI': metrics.adjusted_rand_score(y2, algo.labels_),\n", " 'AMI': metrics.adjusted_mutual_info_score(y2, algo.labels_),\n", " 'Homogenity': metrics.homogeneity_score(y2, algo.labels_),\n", " 'Completeness': metrics.completeness_score(y2, algo.labels_),\n", " 'V-measure': metrics.v_measure_score(y2, algo.labels_),\n", " 'Silhouette': metrics.silhouette_score(X_pca, algo.labels_)}))\n", "\n", "results = pd.DataFrame(data=data, columns=['ARI', 'AMI', 'Homogenity',\n", " 'Completeness', 'V-measure', \n", " 'Silhouette'],\n", " index=['K-means', 'Affinity', \n", " 'Spectral', 'Agglomerative'])\n", "\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Сonsider the test sample:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X3 = data_test\n", "pca3 = PCA(n_components=0.9, random_state=17).fit(X3)\n", "X_pca3 = pca3.transform(X3)\n", "kmeans = KMeans(n_clusters=4,random_state=17, n_jobs=1)\n", "kmeans.fit(X_pca3)\n", "kmeans_labels = kmeans.labels_+1\n", "plt.figure(figsize=(16,12))\n", "plt.scatter(X_pca3[:, 0], X_pca3[:, 1],\\\n", " c=kmeans_labels, s=20,\\\n", " cmap=plt.cm.get_cmap('nipy_spectral', 4));#cmap='viridis');\n", "plt.colorbar();" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ag = AgglomerativeClustering(n_clusters=4, \n", " linkage='ward').fit(X_pca3)\n", "ag_labels = ag.labels_+1\n", "plt.figure(figsize=(16,12))\n", "plt.scatter(X_pca3[:, 0], X_pca3[:, 1],\\\n", " c=ag_labels, s=20,\\\n", " cmap=plt.cm.get_cmap('nipy_spectral', 4));#cmap='viridis');\n", "plt.colorbar();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### We can only evaluate with silhouette:" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metrics.silhouette_score(X_pca3, ag_labels)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "metrics.silhouette_score(X_pca3, kmeans_labels)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.7" } }, "nbformat": 4, "nbformat_minor": 2 }