{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross validation and SGD\n", "\n", "We continue from lab 03 with the Titanic dataset\n", "\n", "In this part, you will now:\n", "\n", "* Convert the categorical variables to numeric ones with LabelEncoder\n", "\n", "\n", "\n", "## Illustrating overfitting\n", "\n", "In what follows, we use AUC as the metric.\n", "\n", "\n", "* Split the original dataset into train and test subsets (80 / 20). \n", " * to do so, use the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function \n", " ```from sklearn.model_selection import train_test_split```\n", "\n", " * You will probably need to *shuffle* the original dataset first with \n", " ```df.sample(frac = 1)```\n", " \n", "\n", "Train an [SGD](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) classifier \n", "and, for each model \n", "\n", "* Case 1: no regularization\n", " * remove the regularization ```penalty = 'none'```\n", " * vary the learning rate and observe how the AUC score on the test subset changes.\n", " * to vary the learning rate manually, you need to \n", " * set learning_rate = 'constant'\n", " * set eta0 = 0.01, 0.1, ...\n", " * Compare with the score on the train subset\n", " * do you observe overfitting?\n", " \n", " \n", "* Case 2: L2 regularization\n", " * penalty = 'l2'\n", " * vary the ```alpha``` parameter and observe how the score changes on the test and training sets\n", " * Are the results different if you reshuffle the original dataset?\n", " * to get the same results from one run \n", "\n", "* Case 3: K-fold cross validation\n", " * set learning_rate = 'optimal'\n", " * select the best model as a function of the regularization using k-fold cross validation\n", " * see the following functions\n", " * 
class sklearn.model_selection.KFold(n_splits='warn', shuffle=False, random_state=None)\n", " * sklearn.model_selection.cross_val_score(estimator, X, y=None, groups=None, scoring=None, cv='warn', n_jobs=None, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', error_score='raise-deprecating')\n", "\n", "\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }