{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Network Tour of Data Science\n", "###       Xavier Bresson, Winter 2016/17\n", "## Exercise 3: Baseline Classification Techniques" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Load libraries\n", "import numpy as np # Math\n", "import scipy.io # Import data\n", "import time\n", "import sklearn.neighbors, sklearn.svm, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes # Baseline classification techniques\n", "import sklearn.metrics # Accuracy score\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Load 400 text documents representing 5 classes\n", "# X_train matrix contains the training data\n", "# y_train vector contains the training labels\n", "# X_test matrix contains the test data\n", "# y_test vector contains the test labels\n", "# allow_pickle=True: the file stores Python objects, which recent NumPy versions do not load by default\n", "[X_train, y_train, X_test, y_test] = np.load('datasets/20news_5classes_400docs.npy', allow_pickle=True)\n", "print('X_train size=',X_train.shape)\n", "print('X_test size=',X_test.shape)\n", "print('y_train size=',y_train.shape)\n", "print('y_test size=',y_test.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 1a: Run the following baseline classification techniques:\n", "* k-NN classifier: You may use *sklearn.neighbors.KNeighborsClassifier()*\n", "* Linear SVM classifier: You may use *sklearn.svm.LinearSVC()*\n", "* Logistic Regression classifier: You may use *sklearn.linear_model.LogisticRegression()*\n", "* Random Forest classifier: You may use *sklearn.ensemble.RandomForestClassifier()*\n", "* Ridge classifier: You may use *sklearn.linear_model.RidgeClassifier()*\n", "* Naive Bayes classifier with Bernoulli: You may use *sklearn.naive_bayes.BernoulliNB()*\n", "* Naive Bayes classifier with Multinomial: You may use *sklearn.naive_bayes.MultinomialNB()*\n", "\n", "### Question 1b:\n", "* Print accuracy for train dataset 
and test dataset: You may use the function *sklearn.metrics.accuracy_score()*\n", "* Print the computation time needed to train each model: You may use the commands *t_start = time.process_time()* and *exec_time = time.process_time() - t_start*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train_accuracy = YOUR CODE HERE\n", "test_accuracy = YOUR CODE HERE\n", "exec_time = YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Observe the results. Which technique performs best?
\n", "Do you think the other classification techniques are really less effective?
\n", "Should you trust all black-box data analysis techniques?\n", "\n", "Let us consider one classification technique, logistic regression:
\n", " *model = sklearn.linear_model.LogisticRegression(C=C_value)*
\n", "and its hyperparameter C, which controls the trade-off between the data term and the regularization term (in scikit-learn, a smaller C means stronger regularization).\n", "\n", "### Question 2: Estimate the hyperparameter C of the logistic regression classifier by cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2a: First, split the training set into 5 folds\n", "\n", "Hint: You may use the function *np.array_split()*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "num_folds = 5\n", "X_train = X_train.toarray() # for np.array_split\n", "\n", "X_train_folds = np.array_split(YOUR CODE HERE)\n", "y_train_folds = YOUR CODE HERE" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Values of the hyperparameter C:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "C_choices = [1e-2, 5*1e-2, 1e-1, 5*1e-1, 1e0, 5*1e0, 1e1, 5*1e1, 1e2, 5*1e2, 1e3, 5*1e3]\n", "num_Cs = len(C_choices)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2b: Compute the accuracy for all folds and all hyperparameter values (and store them, for example, in an array such as *accuracy_tab*)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "accuracy_tab = np.zeros([num_folds,num_Cs])\n", "\n", "for C_idx, C_value in enumerate(C_choices):\n", "\n", "    for fold_idx in range(num_folds):\n", "\n", "        # Extract train dataset for the current fold\n", "        fold_x_train = np.concatenate([X_train_folds[i] for i in range(num_folds) if i!=fold_idx])\n", "        fold_y_train = YOUR CODE HERE\n", "\n", "        # validation dataset for the current fold\n", "        fold_x_val = X_train_folds[fold_idx]\n", "        fold_y_val = YOUR CODE HERE\n", "\n", "        # Run Logistic Regression model for the current fold\n", "        accuracy = YOUR CODE HERE\n", "\n", "        # Store accuracy value\n", "        accuracy_tab[fold_idx,C_idx] = 
accuracy\n", "\n", "print(accuracy_tab)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2c: Plot the following:\n", "* The accuracy values for all folds and all hyperparameter values\n", "* The mean and standard deviation of the accuracy over the folds for each hyperparameter value\n", "\n", "Hint: You may use the functions *plt.scatter(), np.mean(), np.std(), plt.errorbar(), plt.show()*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# plot the raw observations\n", "for C_idx, C_value in enumerate(C_choices):\n", "    accuracies_C_idx = accuracy_tab[:,C_idx]\n", "    plt.scatter(YOUR CODE HERE)\n", "\n", "# plot the trend line with error bars that correspond to standard deviation\n", "accuracies_mean = YOUR CODE HERE\n", "accuracies_std = YOUR CODE HERE\n", "plt.errorbar(np.log(C_choices), accuracies_mean, yerr=accuracies_std)\n", "\n", "# Add text\n", "plt.title('Cross-validation on C')\n", "plt.xlabel('log C')\n", "plt.ylabel('Cross-validation accuracy')\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question 2d: Based on the cross-validation results above, choose the best value for C and evaluate the resulting model on the test set. What is the accuracy for the best C value?\n", "\n", "Did we do better than the best technique in Question 1? 
\n", "\n", "Hint: You may use the function *np.argmax()*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "idx_best_C = YOUR CODE HERE\n", "accuracy_testset = YOUR CODE HERE\n", "print('test accuracy for best C=',accuracy_testset)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 0 }