{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    },
    "tags": [
     "s1",
     "content",
     "l1"
    ]
   },
   "source": [
    "# Support Vector Machines (SVMs)\n",
    "\n",
    "## Introduction\n",
    "\n",
    "Support Vector Machines are classifiers that can classify datasets by a introducing an optimal hyperplane between the multi-dimensional data points. An hyperplane is a multi-dimensional structure that extends a two-dimensional plane. If the datasets consists of two dimensional dataset, then an estimate line is fit that provides the best classification on the  dataset. By \"best classification\", it is to be noted that a plane that not necessarily provides perfect classification of all points in the training dataset but fits a criterion such that the line is farthest from all points. You can see from the figure below that a hyperplane classifies the dataset as shown.\n",
    "\n",
    "<img src=\"../images/SVM.png\", style=\"width: 700px;\"> \n",
    "\n",
    "We shall use a plot_learning_curve function from sklearn:\n",
    "ref: http://scikit-learn.org/stable/auto_examples/model_selection/plot_learning_curve.html .We will generate some simple toy data using [sklearn](http://scikit-learn.org)'s `make_classification` function.\n",
    "\n",
    "## Exercise\n",
    "\n",
    "* In the titanic dataset that has been cleaned, train a SVM classifier on the 'features' list provided below.\n",
    "* Perform a Train-Test split\n",
    "* Perform 10 fold cross-validation\n",
    "* Find out mean accuracy of the trained model and put the obtained accuracy in accuracy_train variable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": true,
    "slideshow": {
     "slide_type": "subslide"
    },
    "tags": [
     "s1",
     "ce",
     "l1"
    ]
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn import model_selection\n",
    "from sklearn.metrics import classification_report\n",
    "from sklearn.metrics import confusion_matrix\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.svm import SVC\n",
    "from sklearn.cross_validation import KFold\n",
    "\n",
    "train_data = pd.read_csv(\"https://raw.githubusercontent.com/colaberry/data/master/Titanic/train_data.csv\")\n",
    "test_data = pd.read_csv(\"https://raw.githubusercontent.com/colaberry/data/master/Titanic/test_data.csv\")\n",
    "features = ['Pclass', 'Survived','Age_Imputed', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']\n",
    "\n",
    "#Keeping relevant data for processing \n",
    "train_data = train_data[features]\n",
    "\n",
    "#Converting dataset into array for Cross validation\n",
    "array = train_data.values\n",
    "\n",
    "#Seperating target variable and indepentdent variables\n",
    "X=np.delete(array, 1, axis=1)\n",
    "Y=array[:,1]\n",
    "\n",
    "#Setting the test size and train size\n",
    "test_size = 0.20\n",
    "seed = 7\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "s1",
     "l1",
     "hint"
    ]
   },
   "source": [
    "Use model_selection.train_test_split(....) function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "tags": [
     "s1",
     "l1",
     "ans"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.71514084507\n"
     ]
    }
   ],
   "source": [
    "X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)\n",
    "\n",
    "scoring = 'accuracy'\n",
    "models=SVC()\n",
    "kfold = model_selection.KFold(n_splits=10, random_state=seed)\n",
    "\n",
    "cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)\n",
    "results=(cv_results)\n",
    "accuracy_train = cv_results.mean()\n",
    "print(accuracy_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    },
    "tags": [
     "s1",
     "hid",
     "l1"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "continue\n"
     ]
    }
   ],
   "source": [
    "ref_tmp_var = False\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "try:\n",
    "    ref_assert_var = False\n",
    "    if( accuracy_train >0.65 and accuracy_train <0.75):\n",
    "        ref_assert_var = True\n",
    "    else:\n",
    "        ref_assert_var = False\n",
    "    \n",
    "except Exception:\n",
    "    print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
    "else:\n",
    "    if ref_assert_var:\n",
    "        ref_tmp_var = True\n",
    "    else:\n",
    "        print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
    "\n",
    "assert ref_tmp_var"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "l2",
     "content",
     "s2"
    ]
   },
   "source": [
    "We should now predict the model on our test data.\n",
    "\n",
    "## Exercise\n",
    "\n",
    "* Predict on the test data and find the accuracy of the model. Put the accuracy of the model in a variable called accuracy_test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": true,
    "tags": [
     "l2",
     "ce",
     "s2"
    ]
   },
   "outputs": [],
   "source": [
    "# Make predictions on test dataset\n",
    "svm = SVC()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "l2",
     "s2",
     "hint"
    ]
   },
   "source": [
    "use svm.fit(..) to fit the model and then accuracy_score(..) to find the accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "tags": [
     "l2",
     "s2",
     "ans"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.68156424581\n"
     ]
    }
   ],
   "source": [
    "svm.fit(X_train, Y_train)\n",
    "predictions = svm.predict(X_test)\n",
    "accuracy_test= accuracy_score(Y_test, predictions)\n",
    "print(accuracy_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "tags": [
     "l2",
     "hid",
     "s2"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "continue\n"
     ]
    }
   ],
   "source": [
    "ref_tmp_var = False\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "try:\n",
    "    ref_assert_var = False\n",
    "    if( accuracy_test >0.5 and accuracy_test <0.75):\n",
    "        ref_assert_var = True\n",
    "    else:\n",
    "        ref_assert_var = False\n",
    "    \n",
    "except Exception:\n",
    "    print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
    "else:\n",
    "    if ref_assert_var:\n",
    "        ref_tmp_var = True\n",
    "    else:\n",
    "        print('Please follow the instructions given and use the same variables provided in the instructions.')\n",
    "\n",
    "assert ref_tmp_var"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "executed_sections": [],
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}