{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# A Network Tour of Data Science\n",
    "###       Xavier Bresson, Winter 2016/17\n",
    "## Exercise 3 : Baseline Classification Techniques"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# Load libraries\n",
    "import numpy as np # Math\n",
    "import scipy.io # Import data\n",
    "import time\n",
    "import sklearn.neighbors, sklearn.linear_model, sklearn.ensemble, sklearn.naive_bayes # Baseline classification techniques\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Load 400 text documents representing 5 classes\n",
    "# X_train matrix contains the training data\n",
    "# y_train vector contains the training labels\n",
    "# X_test matrix contains the test data\n",
    "# y_test vector contains the test labels\n",
    "[X_train, y_train, X_test, y_test] = np.load('datasets/20news_5classes_400docs.npy')\n",
    "print('X_train size=',X_train.shape)\n",
    "print('X_test size=',X_test.shape)\n",
    "print('y_train size=',y_train.shape)\n",
    "print('y_test size=',y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 1a: Run the following baseline classification techniques:\n",
    "* k-NN classifier: You may use *sklearn.neighbors.KNeighborsClassifier()*\n",
    "* Linear SVM classifier: You may use *sklearn.svm.LinearSVC()*\n",
    "* Logistic Regression classifier: You may use *sklearn.linear_model.LogisticRegression()*\n",
    "* Random Forest classifier: You may use *sklearn.ensemble.RandomForestClassifier()*\n",
    "* Ridge classifier: You may use *sklearn.linear_model.RidgeClassifier()*\n",
    "* Naive Bayes classifier with Bernoulli: You may use *sklearn.naive_bayes.BernoulliNB()*\n",
    "* Naive Bayes classifier with Multinomial: You may use *sklearn.naive_bayes.MultinomialNB()*\n",
    "\n",
    "### Question 1b: \n",
    "* Print accuracy for train dataset and test dataset: You may use function *sklearn.metrics.accuracy_score()*\n",
    "* Print the computational time to train each model: You may use commands *t_start = time.process_time()*, and *exec_time = time.process_time() - t_start*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Your code here\n",
    "clf, train_accuracy, test_accuracy, exec_time = [], [], [], []\n",
    "clf.append(sklearn.neighbors.KNeighborsClassifier()) # k-NN classifier\n",
    "clf.append(sklearn.svm.LinearSVC()) # linear SVM classifier\n",
    "clf.append(sklearn.linear_model.LogisticRegression()) # logistic classifier\n",
    "clf.append(sklearn.ensemble.RandomForestClassifier())\n",
    "clf.append(sklearn.linear_model.RidgeClassifier())\n",
    "clf.append(sklearn.naive_bayes.BernoulliNB())\n",
    "clf.append(sklearn.naive_bayes.MultinomialNB())\n",
    "\n",
    "for c in clf:\n",
    "    t_start = time.process_time()\n",
    "    c.fit(X_train, y_train)\n",
    "    train_pred = c.predict(X_train)\n",
    "    test_pred = c.predict(X_test)\n",
    "    train_accuracy.append('{:5.2f}'.format(100*sklearn.metrics.accuracy_score(y_train, train_pred)))\n",
    "    test_accuracy.append('{:5.2f}'.format(100*sklearn.metrics.accuracy_score(y_test, test_pred)))\n",
    "    exec_time.append('{:5.2f}'.format(time.process_time() - t_start))\n",
    "print('Train accuracy:      {}'.format(' '.join(train_accuracy)))\n",
    "print('Test accuracy:       {}'.format(' '.join(test_accuracy)))\n",
    "print('Execution time:      {}'.format(' '.join(exec_time)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Observe the best result. What is the best technique?<br> \n",
    "Do you think the other classification techniques are not as efficient?<br> \n",
    "Should you believe all blackbox data analysis techniques?\n",
    "\n",
    "Let us consider one classification technique like logistic regression:<br>\n",
    "    *model = sklearn.linear_model.LogisticRegression(C=C_value)*<br>\n",
    "and its hyperparamater C, which is the trade-off between the data term and the regularization term.\n",
    "\n",
    "### Question 2: Estimate the hyperparameter C of the logistic regression classifier by cross-validation "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2a: First, split the training set into 5 folds\n",
    "\n",
    "Hint: You may use the function *np.array_split()*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "num_folds = 5 \n",
    "X_train = X_train.toarray() # for np.array_split\n",
    "\n",
    "X_train_folds = np.array_split(X_train, num_folds)\n",
    "y_train_folds = np.array_split(y_train, num_folds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Values of the hyperparameter C:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "C_choices = [1e-2, 5*1e-2, 1e-1, 5*1e-1, 1e0, 5*1e0, 1e1, 5*1e1, 1e2, 5*1e2, 1e3, 5*1e3]\n",
    "num_Cs = len(C_choices)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2b: Compute the accuracy for all folds and all hyperparameter values (and store it for example in a tab like *accuracy_tab*)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "accuracy_tab = np.zeros([num_folds,num_Cs])\n",
    "\n",
    "for C_idx, C_value in enumerate(C_choices):\n",
    "\n",
    "    for fold_idx in range(num_folds):\n",
    "        \n",
    "        # Extract train dataset for the current fold\n",
    "        fold_x_train = np.concatenate([X_train_folds[i] for i in range(num_folds) if i!=fold_idx])       \n",
    "        fold_y_train = np.concatenate([y_train_folds[i] for i in range(num_folds) if i!=fold_idx])   \n",
    "\n",
    "        # validation dataset for the current fold\n",
    "        fold_x_val  = X_train_folds[fold_idx]\n",
    "        fold_y_val  = y_train_folds[fold_idx]\n",
    "        \n",
    "        # Run Logistic Regression model for the current fold\n",
    "        model = sklearn.linear_model.LogisticRegression(C=C_value)\n",
    "        model.fit(fold_x_train, fold_y_train)\n",
    "        test_pred = model.predict(fold_x_val)\n",
    "        accuracy = sklearn.metrics.accuracy_score(test_pred, fold_y_val)\n",
    "        \n",
    "        # Store accuracy value\n",
    "        accuracy_tab[fold_idx,C_idx] = accuracy\n",
    "\n",
    "print(accuracy_tab)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2c: Plot the following:\n",
    "* The accuracy values for all folds and all hyperparameter values\n",
    "* The mean and standard deviation accuracies over the folds for all hyperparameter values\n",
    "\n",
    "Hint: You may use the function *plt.scatter(), np.mean(), np.std(), plt.errorbar(), plt.show()*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# plot the raw observations\n",
    "for C_idx, C_value in enumerate(C_choices):\n",
    "    accuracies_C_idx = accuracy_tab[:,C_idx]\n",
    "    plt.scatter([np.log(C_value)]* len(accuracies_C_idx), accuracies_C_idx)\n",
    "    \n",
    "# plot the trend line with error bars that correspond to standard deviation\n",
    "accuracies_mean = np.mean(accuracy_tab,axis=0)\n",
    "accuracies_std = np.std(accuracy_tab,axis=0)\n",
    "plt.errorbar(np.log(C_choices), accuracies_mean, yerr=accuracies_std)\n",
    "\n",
    "# Add text\n",
    "plt.title('Cross-validation on C')\n",
    "plt.xlabel('log C')\n",
    "plt.ylabel('Cross-validation accuracy')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question 2d: Based on the cross-validation results above, choose the best value for C for the classifier, and apply it on the test set. What is the accuracy?\n",
    "\n",
    "Did we do better than the best technique in Question 1? or not?\n",
    "\n",
    "Hint: You may use the function *np.argmax()*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "idx_best_C = np.argmax(accuracies_mean)\n",
    "best_C = C_choices[idx_best_C]\n",
    "model = sklearn.linear_model.LogisticRegression(C=best_C)\n",
    "model.fit(X_train, y_train)\n",
    "test_pred = model.predict(X_test)\n",
    "accuracy_testset = sklearn.metrics.accuracy_score(test_pred, y_test)\n",
    "print('best accuracy=',accuracy_testset)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}