{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Network Tour of Data Science\n", "### Xavier Bresson, Winter 2016/17\n", "## Exercise 3 : Baseline Classification Techniques" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load libraries\n", "import numpy as np # Math\n", "import scipy.io # Import data\n", "import time\n", "import sklearn.neighbors, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes # Baseline classification techniques\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load 400 text documents representing 5 classes\n", "# X_train matrix contains the training data\n", "# y_train vector contains the training labels\n", "# X_test matrix contains the test data\n", "# y_test vector contains the test labels\n", "[X_train, y_train, X_test, y_test] = np.load('datasets/20news_5classes_400docs.npy')\n", "print('X_train size=',X_train.shape)\n", "print('X_test size=',X_test.shape)\n", "print('y_train size=',y_train.shape)\n", "print('y_test size=',y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1a: Run the following baseline classification techniques:\n", "* k-NN classifier: You may use *sklearn.neighbors.KNeighborsClassifier()*\n", "* Linear SVM classifier: You may use *sklearn.svm.LinearSVC()*\n", "* Logistic Regression classifier: You may use *sklearn.linear_model.LogisticRegression()*\n", "* Random Forest classifier: You may use *sklearn.ensemble.RandomForestClassifier()*\n", "* Ridge classifier: You may use *sklearn.linear_model.RidgeClassifier()*\n", "* Naive Bayes classifier with Bernoulli: You may use *sklearn.naive_bayes.BernoulliNB()*\n", "* Naive Bayes classifier with Multinomial: You may use *sklearn.naive_bayes.MultinomialNB()*\n", "\n", "### Question 1b: \n", "* Print accuracy for train dataset and test dataset: You may use function *sklearn.metrics.accuracy_score()*\n", "* Print the computational time to train each model: You may use commands *t_start = time.process_time()*, and *exec_time = time.process_time() - t_start*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Your code here\n", "clf, train_accuracy, test_accuracy, exec_time = [], [], [], []\n", "clf.append(sklearn.neighbors.KNeighborsClassifier()) # k-NN classifier\n", "clf.append(sklearn.svm.LinearSVC()) # linear SVM classifier\n", "clf.append(sklearn.linear_model.LogisticRegression()) # logistic classifier\n", "clf.append(sklearn.ensemble.RandomForestClassifier())\n", "clf.append(sklearn.linear_model.RidgeClassifier())\n", "clf.append(sklearn.naive_bayes.BernoulliNB())\n", "clf.append(sklearn.naive_bayes.MultinomialNB())\n", "\n", "for c in clf:\n", " t_start = time.process_time()\n", " c.fit(X_train, y_train)\n", " train_pred = c.predict(X_train)\n", " test_pred = c.predict(X_test)\n", " train_accuracy.append('{:5.2f}'.format(100*sklearn.metrics.accuracy_score(y_train, train_pred)))\n", " test_accuracy.append('{:5.2f}'.format(100*sklearn.metrics.accuracy_score(y_test, test_pred)))\n", " exec_time.append('{:5.2f}'.format(time.process_time() - t_start))\n", "print('Train accuracy: {}'.format(' '.join(train_accuracy)))\n", "print('Test accuracy: {}'.format(' '.join(test_accuracy)))\n", "print('Execution time: {}'.format(' '.join(exec_time)))" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Observe the best result. What is the best technique?<br> \n", "Do you think the other classification techniques are not as efficient?<br> \n", "Should you believe all blackbox data analysis techniques?\n", "\n", "Let us consider one classification technique like logistic regression:<br>\n", " *model = sklearn.linear_model.LogisticRegression(C=C_value)*<br>\n", "and its hyperparamater C, which is the trade-off between the data term and the regularization term.\n", "\n", "### Question 2: Estimate the hyperparameter C of the logistic regression classifier by cross-validation " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2a: First, split the training set into 5 folds\n", "\n", "Hint: You may use the function *np.array_split()*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "num_folds = 5 \n", "X_train = X_train.toarray() # for np.array_split\n", "\n", "X_train_folds = np.array_split(X_train, num_folds)\n", "y_train_folds = np.array_split(y_train, num_folds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values of the hyperparameter C:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "C_choices = [1e-2, 5*1e-2, 1e-1, 5*1e-1, 1e0, 5*1e0, 1e1, 5*1e1, 1e2, 5*1e2, 1e3, 5*1e3]\n", "num_Cs = len(C_choices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2b: Compute the accuracy for all folds and all hyperparameter values (and store it for example in a tab like *accuracy_tab*)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "accuracy_tab = np.zeros([num_folds,num_Cs])\n", "\n", "for C_idx, C_value in enumerate(C_choices):\n", "\n", " for fold_idx in range(num_folds):\n", " \n", " # Extract train dataset for the current fold\n", " fold_x_train = np.concatenate([X_train_folds[i] for i in range(num_folds) if i!=fold_idx]) \n", " fold_y_train = np.concatenate([y_train_folds[i] for i in range(num_folds) if i!=fold_idx]) \n", "\n", " # validation dataset for the current fold\n", " fold_x_val = X_train_folds[fold_idx]\n", " fold_y_val = y_train_folds[fold_idx]\n", " \n", " # Run Logistic Regression model for the current fold\n", " model = sklearn.linear_model.LogisticRegression(C=C_value)\n", " model.fit(fold_x_train, fold_y_train)\n", " test_pred = model.predict(fold_x_val)\n", " accuracy = sklearn.metrics.accuracy_score(test_pred, fold_y_val)\n", " \n", " # Store accuracy value\n", " accuracy_tab[fold_idx,C_idx] = accuracy\n", "\n", "print(accuracy_tab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2c: Plot the following:\n", "* The accuracy values for all folds and all hyperparameter values\n", "* The mean and standard deviation accuracies over the folds for all hyperparameter values\n", "\n", "Hint: You may use the function *plt.scatter(), np.mean(), np.std(), plt.errorbar(), plt.show()*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# plot the raw observations\n", "for C_idx, C_value in enumerate(C_choices):\n", " accuracies_C_idx = accuracy_tab[:,C_idx]\n", " plt.scatter([np.log(C_value)]* len(accuracies_C_idx), accuracies_C_idx)\n", " \n", "# plot the trend line with error bars that correspond to standard deviation\n", "accuracies_mean = np.mean(accuracy_tab,axis=0)\n", "accuracies_std = 
np.std(accuracy_tab, axis=0)\n", "plt.errorbar(np.log(C_choices), accuracies_mean, yerr=accuracies_std)\n", "\n", "# Add title and axis labels\n", "plt.title('Cross-validation on C')\n", "plt.xlabel('log C')\n", "plt.ylabel('Cross-validation accuracy')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2d: Based on the cross-validation results above, choose the best value of C for the classifier, retrain on the full training set, and apply the model to the test set. What is the accuracy?\n", "\n", "Did we do better than the best technique from Question 1?\n", "\n", "Hint: You may use the function *np.argmax()*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Select the C value with the best mean cross-validation accuracy\n", "idx_best_C = np.argmax(accuracies_mean)\n", "best_C = C_choices[idx_best_C]\n", "\n", "# Retrain on the full training set with the best C and evaluate on the test set\n", "model = sklearn.linear_model.LogisticRegression(C=best_C)\n", "model.fit(X_train, y_train)\n", "test_pred = model.predict(X_test)\n", "accuracy_testset = sklearn.metrics.accuracy_score(y_test, test_pred)\n", "print('best C=', best_C)\n", "print('test set accuracy=', accuracy_testset)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }