{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", " \n", "## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n", "Author: [Yury Kashnitskiy](https://yorko.github.io) (@yorko). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose. This material is a translated version of the Capstone project (by the same author) from specialization \"Machine learning and data analysis\" by Yandex and MIPT. No solutions shared." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#
Week 4. Classification algorithms comparison\n", "\n", "Finally, we are going to train classification models, compare several algorithms via cross-validation, and figure out which session parameters (*session_length* and *window_size*) work best. Also, for the chosen algorithm, we will plot learning curves (which show the dependency of model performance on the amount of training data) and validation curves (which show the dependency of model performance on one of its hyperparameters).\n", "\n", "**Week 4 roadmap:**\n", "- Part 1. Different algorithms comparison on sessions of 10 websites\n", "- Part 2. Hyperparameter tuning – session_length and window_size\n", "- Part 3. Particular user identification and learning curves\n", "\n", "**You might find the following links useful:**\n", " - [Hyperparameter Optimization in Machine Learning Models](https://www.datacamp.com/community/tutorials/parameter-optimization-machine-learning-models)\n", " - [Optimizing your model with cross-validation](http://blog.kaggle.com/2015/06/29/scikit-learn-video-7-optimizing-your-model-with-cross-validation/)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Your task**\n", "1. Fill in the code in the provided notebook\n", "2. 
Choose the answers in the [form](https://docs.google.com/forms/d/10kYgawyf9kId7VDOnBhQ6PH64rU2HR0CmTy5CDwy2Vo)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# pip install watermark\n", "%load_ext watermark" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPython 3.7.0\n", "IPython 7.1.1\n", "\n", "numpy 1.15.4\n", "scipy 1.1.0\n", "pandas 0.23.4\n", "matplotlib 3.0.2\n", "statsmodels 0.9.0\n", "sklearn 0.20.0\n", "\n", "compiler : GCC 7.3.0\n", "system : Linux\n", "release : 4.17.14-041714-generic\n", "machine : x86_64\n", "processor : x86_64\n", "CPU cores : 12\n", "interpreter: 64bit\n", "Git hash : d2fa7c7dfca896055c40b5fea2f513a384ff1fda\n" ] } ], "source": [ "%watermark -v -m -p numpy,scipy,pandas,matplotlib,statsmodels,sklearn -g" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from __future__ import division, print_function\n", "# disable Anaconda warnings\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "from time import time\n", "import itertools\n", "import os\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "sns.set()\n", "from matplotlib import pyplot as plt\n", "import pickle\n", "from scipy.sparse import csr_matrix\n", "from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV\n", "from sklearn.metrics import accuracy_score, f1_score" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Change the path to data\n", "PATH_TO_DATA = '../../data/capstone_user_identification/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 1. 
Different algorithms comparison on sessions of 10 websites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Load *X_sparse_10users* and *y_10users* objects serialized earlier, which correspond to 10 users data.**" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "#You might wanna check the `encoding` param if you face any error while opening the .pkl files..\n", "with open(os.path.join(PATH_TO_DATA, \n", " 'X_sparse_10users.pkl'), 'rb') as X_sparse_10users_pkl:\n", " X_sparse_10users = pickle.load(X_sparse_10users_pkl)\n", "with open(os.path.join(PATH_TO_DATA, \n", " 'y_10users.pkl'), 'rb') as y_10users_pkl:\n", " y_10users = pickle.load(y_10users_pkl)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**There are more than 14 thousand sessions and almost 5 thousand unique websites.**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(14061, 4913)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_sparse_10users.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Split the data into two parts. 
We are going to use the first part for cross-validation; the second part will be used to evaluate the performance of the model that we end up with after cross-validation.**" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "X_train, X_valid, y_train, y_valid = train_test_split(X_sparse_10users, y_10users, \n", " test_size=0.3, \n", " random_state=17, stratify=y_10users)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Define cross-validation: 3-fold, with shuffle, random_state=17 – for reproducibility.**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Utility function to plot validation curves after running `GridSearchCV` (or `RandomizedSearchCV`). The search should be run with `return_train_score=True`, since the function reads the train scores from `cv_results_`.**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def plot_validation_curves(param_values, grid_cv_results_):\n", " \n", " train_mu, train_std = grid_cv_results_['mean_train_score'], grid_cv_results_['std_train_score']\n", " valid_mu, valid_std = grid_cv_results_['mean_test_score'], grid_cv_results_['std_test_score']\n", " train_line = plt.plot(param_values, train_mu, '-', label='train', color='green')\n", " valid_line = plt.plot(param_values, valid_mu, '-', label='test', color='red')\n", " plt.fill_between(param_values, train_mu - train_std, train_mu + train_std, edgecolor='none',\n", " facecolor=train_line[0].get_color(), alpha=0.2)\n", " plt.fill_between(param_values, valid_mu - valid_std, valid_mu + valid_std, edgecolor='none',\n", " facecolor=valid_line[0].get_color(), alpha=0.2)\n", " plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**1. 
Train `KNeighborsClassifier` with 100 nearest neighbours (leave the other parameters at their default values, only set `n_jobs = -1` for parallelization) and compare the model's mean accuracy during 3-fold cross-validation (for reproducibility use the `skf` object) on `(X_train, y_train)` with its accuracy on `(X_valid, y_valid)`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "knn = KNeighborsClassifier ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 1. Evaluate KNeighborsClassifier's mean accuracy during cross-validation and model's accuracy on the validation dataset. Round your answers to three decimal places. Write your answers with a space between them.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**2. Train a random forest (`RandomForestClassifier`) consisting of 100 trees (for reproducibility use `random_state`=17). Compare the model's OOB score (set `oob_score=True` to have it computed) with its accuracy on `(X_valid, y_valid)`. Use `n_jobs = -1` for parallelization.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "forest = RandomForestClassifier ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 2. Evaluate `RandomForestClassifier` Out-of-Bag aka OOB score and its accuracy on the validation dataset. Round your answers to three decimal places. 
Write your answers with a space between them.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**3. Train logistic regression with the default `C` value and `random_state`=17. Compare the model's mean accuracy during 3-fold cross-validation (don't forget to use the `skf` object) on `(X_train, y_train)` with its accuracy on `(X_valid, y_valid)`. Use `n_jobs = -1` for parallelization.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression, LogisticRegressionCV" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit = LogisticRegression ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Read the documentation for [LogisticRegressionCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html). Logistic regression is well studied and there are algorithms for fast parameter `C` search (faster than using `GridSearchCV`).**\n", "\n", "**Using `LogisticRegressionCV`, find the optimal `C` value. First, try a wider range: 10 values from 1e-4 up to 1e2 using `logspace` from `NumPy`. Specify `multi_class`='multinomial' and `random_state`=17 for `LogisticRegressionCV`. For cross-validation use the `skf` object created earlier. 
Use `n_jobs`=-1 for parallelization.**\n", "\n", "**Plot validation curves for parameter `C`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "logit_c_values1 = np.logspace(-4, 2, 10)\n", "\n", "logit_grid_searcher1 = LogisticRegressionCV ''' YOUR CODE IS HERE ''' \n", "logit_grid_searcher1.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Mean accuracy during cross-validation for each of 10 `C` values.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit_mean_cv_scores1 = ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print the best accuracy on cross-validation and corresponding value of `C`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plot Accuracy vs. `C` dependency graph on cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.plot(logit_c_values1, logit_mean_cv_scores1);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Now do the same again, but search `C` values in the range `np.linspace(0.1, 7, 20)`. 
Plot the validation curves and find the best accuracy on cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "logit_c_values2 = np.linspace(0.1, 7, 20)\n", "\n", "logit_grid_searcher2 = LogisticRegressionCV ''' YOUR CODE IS HERE '''\n", "logit_grid_searcher2.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Mean accuracy during cross-validation for each of the 20 `C` values.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print the best accuracy on cross-validation and corresponding value of C.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plot Accuracy vs. `C` dependency graph on cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plt.plot(logit_c_values2, logit_mean_cv_scores2);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print logistic regression's accuracy with the best `C` value on `(X_valid, y_valid)`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "logit_cv_acc = accuracy_score ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 3. Evaluate model's mean accuracy for `logit_grid_searcher2` on cross-validation using the best `C` and its accuracy on the validation dataset. Round your answers to three decimal places. 
Write your answers with a space between them.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**4. Train SVM (`LinearSVC`) with `C`=1 and `random_state`=17. Compare the model's mean accuracy during cross-validation (don't forget to use the `skf` object) with its accuracy on `(X_valid, y_valid)`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.svm import LinearSVC" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "svm = LinearSVC ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Using `GridSearchCV`, find the optimal `C` value for SVM. First, try a wider range: 10 values from 1e-4 up to 1e4 using `linspace` from NumPy. Plot the validation curves.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "svm_params1 = {'C': np.linspace(1e-4, 1e4, 10)}\n", "\n", "svm_grid_searcher1 = GridSearchCV ''' YOUR CODE IS HERE '''\n", "svm_grid_searcher1.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print the best accuracy on cross-validation and corresponding value of `C`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plot Accuracy vs. 
C dependency graph on cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plot_validation_curves(svm_params1['C'], svm_grid_searcher1.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**But recall that with the default regularization parameter (`C`=1) we got a higher accuracy on cross-validation. This is a fairly common mistake: searching for parameters in the wrong range (here we took a uniform grid over a large scale and missed the optimal interval of `C` values). It is more meaningful to search for `C` near 1; in addition, the model trains faster than with higher values of `C`.**\n", "\n", "**Using `GridSearchCV`, find the optimal `C` value for SVM in the range (1e-3, 1): 30 values, generated with `linspace` from NumPy. Plot the validation curves.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "svm_params2 = {'C': np.linspace(1e-3, 1, 30)}\n", "\n", "svm_grid_searcher2 = GridSearchCV ''' YOUR CODE IS HERE '''\n", "svm_grid_searcher2.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print the best accuracy on cross-validation and corresponding value of `C`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Plot Accuracy vs. 
C dependency graph on cross-validation.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plot_validation_curves(svm_params2['C'], svm_grid_searcher2.cv_results_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Print `LinearSVC`'s accuracy with the best `C` value on (X_valid, y_valid).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "svm_cv_acc = accuracy_score ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 4. Evaluate model's mean accuracy for `svm_grid_searcher2` on cross-validation using the best `C` and its accuracy on the validation dataset. Round your answers to three decimal places. Write your answers with a space between them.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 2. Parameter tuning – session_length and window_size" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Let's take `LinearSVC`, since it performed best on cross-validation in Part 1, and check its performance on 8 more datasets of 10 users (with different combinations of `session_length` and `window_size`). Since there is much more computation involved, we will not search for the regularization parameter `C` each time.**\n", "\n", "**Write the `model_assessment` function with the specification provided below. Pay attention to all the details, e.g. `train_test_split` should be stratified. 
Don't forget to set `random_state` wherever it applies.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def model_assessment(estimator, path_to_X_pickle, path_to_y_pickle, cv, random_state=17, test_size=0.3):\n", " '''\n", " Estimates CV-accuracy for (1 - test_size) share of (X_sparse, y) \n", " loaded from path_to_X_pickle and path_to_y_pickle and holdout accuracy for (test_size) share of (X_sparse, y).\n", " The split is made with stratified train_test_split with params random_state and test_size.\n", " \n", " :param estimator – Scikit-learn estimator (classifier or regressor)\n", " :param path_to_X_pickle – path to pickled sparse X (instances and their features)\n", " :param path_to_y_pickle – path to pickled y (responses)\n", " :param cv – cross-validation as in cross_val_score (use StratifiedKFold here)\n", " :param random_state – for train_test_split\n", " :param test_size – for train_test_split\n", " \n", " :returns mean CV-accuracy for (X_train, y_train) and accuracy for (X_valid, y_valid) where (X_train, y_train)\n", " and (X_valid, y_valid) are (1 - test_size) and (test_size) shares of (X_sparse, y).\n", " '''\n", " \n", " ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Double-check that the function is working.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model_assessment(svm_grid_searcher2.best_estimator_, \n", " os.path.join(PATH_TO_DATA, 'X_sparse_10users.pkl'),\n", " os.path.join(PATH_TO_DATA, 'y_10users.pkl'), skf, random_state=17, test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Apply the `model_assessment` function to the best algorithm from the previous part (namely, `svm_grid_searcher2.best_estimator_`) and 9 datasets with different combinations of `session_length` and `window_size` of 10 users. 
Print the `session_length` and `window_size` parameters in the loop, as well as the output of the `model_assessment` function.\n", "It's handy if the `model_assessment` function returns execution time as a third output argument. It took 20 sec to execute this code snippet on my laptop. With the 150-user dataset, however, each iteration takes a couple of minutes.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, for convenience, it is worth creating copies of the pickle files `X_sparse_10users.pkl`, `X_sparse_150users.pkl`, `y_10users.pkl` and `y_150users.pkl`, adding `s10_w10` to their names, which stands for a session length of 10 and a window width of 10." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Works on Unix-like systems only: creates copies of the files\n", "!cp $PATH_TO_DATA/X_sparse_10users.pkl $PATH_TO_DATA/X_sparse_10users_s10_w10.pkl \n", "!cp $PATH_TO_DATA/X_sparse_150users.pkl $PATH_TO_DATA/X_sparse_150users_s10_w10.pkl \n", "!cp $PATH_TO_DATA/y_10users.pkl $PATH_TO_DATA/y_10users_s10_w10.pkl \n", "!cp $PATH_TO_DATA/y_150users.pkl $PATH_TO_DATA/y_150users_s10_w10.pkl " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "estimator = svm_grid_searcher2.best_estimator_\n", "\n", "for window_size, session_length in itertools.product([10, 7, 5], [15, 10, 7, 5]):\n", "    if window_size <= session_length:\n", "        path_to_X_pkl = ''' YOUR CODE IS HERE '''\n", "        path_to_y_pkl = ''' YOUR CODE IS HERE '''\n", "        print ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 5. Evaluate `LinearSVC`'s accuracy with optimal `C` on `X_sparse_10users_s15_w5` dataset. Write model's mean accuracy on cross-validation and its accuracy on validation dataset. Round your answers to three decimal places. 
Write your answers with a space between them.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Comment on the results. Compare mean accuracy on cross-validation and on the validation dataset using the following combinations of parameters (`session_length, window_size`): (5,5), (7,7) and (10,10). On an average laptop it could take up to an hour. After all, it's data science :)**.\n", "\n", "**Draw a conclusion about how accuracy depends on session length and window width.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "estimator = svm_grid_searcher2.best_estimator_\n", "\n", "for window_size, session_length in [(5,5), (7,7), (10,10)]:\n", " path_to_X_pkl = ''' YOUR CODE IS HERE '''\n", " path_to_y_pkl = ''' YOUR CODE IS HERE '''\n", " print ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 6. Evaluate `LinearSVC`'s accuracy with optimal `C` value on `X_sparse_150users`. Write model's accuracy on cross-validation and on the validation dataset. Round your answers to three decimal places. Write your answers with a space between them.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Part 3. 
Particular user identification and learning curves" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The accuracy of multiclass classification on the 150-user dataset may be disappointingly low, so let's take comfort in the fact that some particular users can be identified quite well.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Load the `X_sparse_150users` and `y_150users` objects serialized earlier, which correspond to the 150-user dataset with parameters (*session_length, window_size*) = (10,10). Split them into two parts: 70% train data and 30% validation data.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "with open(os.path.join(PATH_TO_DATA, 'X_sparse_150users.pkl'), 'rb') as X_sparse_150users_pkl:\n", " X_sparse_150users = pickle.load(X_sparse_150users_pkl)\n", "with open(os.path.join(PATH_TO_DATA, 'y_150users.pkl'), 'rb') as y_150users_pkl:\n", " y_150users = pickle.load(y_150users_pkl)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "X_train_150, X_valid_150, y_train_150, y_valid_150 = train_test_split(X_sparse_150users, \n", " y_150users, test_size=0.3, \n", " random_state=17, stratify=y_150users)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Train `LogisticRegressionCV` with a single `C` value (take the best `C` value found on cross-validation in Part 1; use the exact value, not a rounded one). Now we are going to solve 150 One-vs-All tasks, hence set `multi_class`='ovr'. 
As usual, set `n_jobs=-1` and `random_state`=17 where possible (this training might take up to 20 min).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "logit_cv_150users = LogisticRegressionCV ''' YOUR CODE IS HERE '''\n", "logit_cv_150users.fit(X_train_150, y_train_150)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Compare mean accuracy on cross-validation for each user identification problem separately.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "cv_scores_by_user = {}\n", "for user_id in logit_cv_150users.scores_:\n", " print('User {}, CV score: {}'.format ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The accuracy may seem impressive, but we may be forgetting about class imbalance: high accuracy could be obtained with just a constant prediction. For each user in `y_train_150`, evaluate the difference between the accuracy on cross-validation (which we have just computed using `LogisticRegressionCV`) and the fraction of labels that differ from user_id (that is the accuracy we get if the classifier always says that it is not the $i$-th user in the $i$-vs-All classification task).**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class_distr = np.bincount(y_train_150.astype('int'))\n", "\n", "for user_id in np.unique(y_train_150):\n", " ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "num_better_than_default = (np.array(list(acc_diff_vs_constant.values())) > 0).sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question 7. Evaluate the fraction of users where `LogisticRegressionCV` performs better than just a constant prediction. 
Round your answer to three decimal places.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "''' YOUR CODE IS HERE '''" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**The next step is to plot learning curves for a particular user, say user 128. Make a new binary vector from `y_150users` whose values are 1 or 0 depending on whether user_id=128 or not.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "y_binary_128 = ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import learning_curve\n", "\n", "def plot_learning_curve(val_train, val_test, train_sizes, \n", "                        xlabel='Training Set Size', ylabel='score'):\n", "    def plot_with_err(x, data, **kwargs):\n", "        mu, std = data.mean(1), data.std(1)\n", "        lines = plt.plot(x, mu, '-', **kwargs)\n", "        plt.fill_between(x, mu - std, mu + std, edgecolor='none',\n", "                         facecolor=lines[0].get_color(), alpha=0.2)\n", "    plot_with_err(train_sizes, val_train, label='train')\n", "    plot_with_err(train_sizes, val_test, label='valid')\n", "    plt.xlabel(xlabel); plt.ylabel(ylabel)\n", "    plt.legend(loc='lower right');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Evaluate accuracy on cross-validation in the \"user128-vs-All\" task depending on the train size. 
It would be useful to check the documentation for `learning_curve`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%time\n", "train_sizes = np.linspace(0.25, 1, 20)\n", "estimator = svm_grid_searcher2.best_estimator_\n", "n_train, val_train, val_test = learning_curve ''' YOUR CODE IS HERE '''" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "plot_learning_curve(val_train, val_test, n_train, \n", " xlabel='train_size', ylabel='accuracy')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Draw a conclusion on whether more data helps to improve the model's accuracy under the same problem definition.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next week, we will revisit linear models trained with stochastic gradient descent and enjoy how much faster they work." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }