{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "

Machine Learning Methods

\n", "

Seminar: Decision Trees

" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "plt.style.use('ggplot')\n", "plt.rcParams['figure.figsize'] = (12,8)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "Для тех у кого проблемы с graphviz (В основном касается Windows/)\n", "\n", "1) Установите graphviz через Anaconda Navigator - Environments\n", "Not installed, выберите graphviz\n", "\n", "2) найдите куда скачался graphviz\n", "C:\\Users\\\\Anaconda3\\Library\\bin\\graphviz\n", "\n", "3) скопируйте путь и добавьте его в переменную PATH\n", "- Зайдите в Панель управления -> Система и Безопасность -> Система-> Продвинутые настройки системы - > Переменные окружения.\n", "control panel -> system and security -> system -> advanced system settings - > environment variables.\n", "\n", "- Выделите переменную PATH и нажмите \"редактировать\"\n", "- Добавьте новый путь к graphviz (который вы скопировали в буфер обмена)\n", "4) перезагрузите компьютер" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Titanic Dataset" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "from sklearn.tree import export_graphviz\n", "import subprocess" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Load titanic [dataset](https://cloud.mail.ru/public/N1Tn/25zEKkqge)" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "VARIABLE DESCRIPTIONS:\n", "survival Survival\n", " (0 = No; 1 = Yes)\n", "pclass Passenger Class\n", " (1 = 1st; 2 = 2nd; 3 = 3rd)\n", "name Name\n", "sex Sex\n", "age Age\n", "sibsp Number of Siblings/Spouses Aboard\n", "parch Number of Parents/Children Aboard\n", "ticket Ticket Number\n", "fare Passenger Fare\n", "cabin Cabin\n", 
"embarked Port of Embarkation\n", " (C = Cherbourg; Q = Queenstown; S = Southampton)\n", "\n", "SPECIAL NOTES:\n", "Pclass is a proxy for socio-economic status (SES)\n", " 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower\n", "\n", "Age is in Years; Fractional if Age less than One (1)\n", " If the Age is Estimated, it is in the form xx.5" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"<tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"<tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"<tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('./data/titanic.csv')\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Проанализируем данные\n", " * Типы признаков\n", " * Пропущенные значения?\n", " * Пропорции классов" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "PassengerId 0.000000\n", "Survived 0.000000\n", "Pclass 0.000000\n", "Name 0.000000\n", "Sex 0.000000\n", "Age 0.198653\n", "SibSp 0.000000\n", "Parch 0.000000\n", "Ticket 0.000000\n", "Fare 0.000000\n", "Cabin 0.771044\n", "Embarked 0.002245\n", "dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().mean()" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 549\n", "1 342\n", "Name: Survived, dtype: int64" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Survived.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Предобработка данных\n", " * выкидываем ненужные признаки\n", " * готовимся к работе с пропусками\n", " * подготовка категориальных признаков\n", " * разбиваем на обучение и контроль" ] }, { "cell_type": 
"code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 S\n", "dtype: object" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Embarked.mode()" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [], "source": [ "drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin']\n", "\n", "df.loc[:, 'Age'] = df.Age.fillna(df.Age.mean())\n", "df.loc[:, 'Embarked'] = df.Embarked.fillna(df.Embarked.mode()[0])\n", "df = pd.get_dummies(df, columns=['Embarked'])\n", "\n", "df.loc[:, 'Sex'] = df.Sex.replace({'male': 1, 'female': 0})\n", "\n", "df_result = df.drop(drop_cols, axis=1)" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>Survived</th><th>Pclass</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Fare</th><th>Embarked_C</th><th>Embarked_Q</th><th>Embarked_S</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>3</td><td>1</td><td>22.0</td><td>1</td><td>0</td><td>7.2500</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>1</th><td>1</td><td>1</td><td>0</td><td>38.0</td><td>1</td><td>0</td><td>71.2833</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>2</th><td>1</td><td>3</td><td>0</td><td>26.0</td><td>0</td><td>0</td><td>7.9250</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>3</th><td>1</td><td>1</td><td>0</td><td>35.0</td><td>1</td><td>0</td><td>53.1000</td><td>0</td><td>0</td><td>1</td></tr>\n",
"<tr><th>4</th><td>0</td><td>3</td><td>1</td><td>35.0</td><td>0</td><td>0</td><td>8.0500</td><td>0</td><td>0</td><td>1</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " Survived Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q \\\n", "0 0 3 1 22.0 1 0 7.2500 0 0 \n", "1 1 1 0 38.0 1 0 71.2833 1 0 \n", "2 1 3 0 26.0 0 0 7.9250 0 0 \n", "3 1 1 0 35.0 1 0 53.1000 0 0 \n", "4 0 3 1 35.0 0 0 8.0500 0 0 \n", "\n", " Embarked_S \n", "0 1 \n", "1 0 \n", "2 1 \n", "3 1 \n", "4 1 " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_result.head()" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "X = df_result.iloc[:, 1:].values\n", "y = df_result.iloc[:, 0].values" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [], "source": [ "X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=123)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " * Обучим модель и визуализируем ее\n", " * Посмотрим на важность признаков" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [], "source": [ "def plot_tree(tree, feature_names=None, class_names=['0', '1']):\n", " with open('tree.dot', 'w') as fout:\n", " export_graphviz(tree, out_file=fout, filled=True, feature_names=feature_names, class_names=class_names)\n", " command = [\"dot\", \"-Tpng\", \"tree.dot\", \"-o\", \"tree.png\"]\n", " subprocess.check_call(command)\n", " plt.imshow(plt.imread('tree.png'))\n", " plt.axis(\"off\")\n" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [], "source": [ "model = DecisionTreeClassifier(max_depth=4)" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=4,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", " splitter='best')" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [], "source": [ "y_hat = model.predict_proba(X_valid)" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.38541667, 0.61458333],\n", " [0.86486486, 0.13513514],\n", " [0.60465116, 0.39534884],\n", " [0.60465116, 0.39534884],\n", " [0.86486486, 0.13513514],\n", " [0.93902439, 0.06097561],\n", " [0.11111111, 0.88888889],\n", " [0.11111111, 0.88888889],\n", " [0.38541667, 0.61458333],\n", " [0.38541667, 0.61458333]])" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_hat[:10]" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import roc_auc_score, roc_curve" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "fpr, tpr, thresh = roc_curve(y_valid, y_hat[:, 1])" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0,0.5,'TPR')" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(fpr, tpr)\n", "plt.xlabel('FPR')\n", "plt.ylabel('TPR')" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8749662618083671" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roc_auc_score(y_valid, y_hat[:, 1])" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Pclass 0.195365\n", "Sex 0.577291\n", "Age 0.095079\n", "SibSp 0.072319\n", "Parch 0.005876\n", "Fare 0.054071\n", "Embarked_C 0.000000\n", "Embarked_Q 0.000000\n", "Embarked_S 0.000000\n", "dtype: float64" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(index=df_result.columns[1:], data=model.feature_importances_)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plot_tree(model, feature_names=df_result.columns[1:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Speed Dating Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Предобработка данных" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('./data/speed-dating-experiment/Speed Dating Data.csv', encoding='latin1')" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8378, 195)" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [], "source": [ "df = df.iloc[:, :97]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Рассмотрим нужные признаки по очереди" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### iid\n", "unique subject number, group(wave id gender)\n", "\n", "Кажется это идентификатор" ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "551" ] }, "execution_count": 82, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iid.nunique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### id \n", "\n", "Subject number within wave\n", "\n", "Кажется это нам не нужно" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['id'], axis=1)\n", "df = df.drop(['idg'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### gender\n", "\n", "* Female=0\n", "* Male=1" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1 277\n", "0 274\n", "Name: gender, dtype: int64" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"df.drop_duplicates(subset=['iid']).gender.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### condtn:\n", "* 1=limited choice\n", "* 2=extensive choice\n", "\n", "???" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2 386\n", "1 165\n", "Name: condtn, dtype: int64" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates(subset=['iid']).condtn.value_counts()" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['condtn'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### wave\n", "\n", "Пока оставим в таблице, но в качестве признака рассматривать не будем" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,\n", " 18, 19, 20, 21])" ] }, "execution_count": 87, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.wave.unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### round:\n", "\n", "number of people that met in wave\n", "\n", "Можно взять в качестве признака.." 
] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['round'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### position:\n", "station number where met partner \n", "\n", "#### positin1\n", "station number where started \n", "\n", "Drop both" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['position', 'positin1'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### order: \t\t\n", "the number of date that night when met partner\n" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['order'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### partner: \t\n", "partner’s id number the night of event\n", "\n", "This can be dropped\n", "\n", "#### pid: \t\t\n", "partner’s iid number\n", "\n", "This one, however, is important\n" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['partner'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### match\t\t\n", "* 1=yes, \n", "* 0=no\n", "\n", "Our target" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### int_corr:\t\n", "correlation between participant’s and partner’s ratings of interests in \t\t\n", "\n", "#### samerace: \t\n", "participant and the partner were the same race. 
1= yes, 0=no\n", "\n", "These features have already been engineered for us" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### age_o:\t\t\n", "age of partner\n", "#### race_o:\t\t\n", "race of partner\n", "#### pf_o_att: \t\n", "partner’s stated preference at Time 1 (attr1_1) for all 6 attributes\n", "#### dec_o: \t\t\n", "decision of partner the night of event\n", "#### attr_o: \t\t\n", "rating by partner the night of the event, for all 6 attributes\n", "\n", "Drop these" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['age_o', 'race_o', 'pf_o_att',\n", "              'pf_o_sin', 'pf_o_int',\n", "              'pf_o_fun', 'pf_o_amb', 'pf_o_sha',\n", "              'dec_o', 'attr_o', 'sinc_o', 'intel_o', 'fun_o',\n", "              'amb_o', 'shar_o', 'like_o', 'prob_o', 'met_o'],\n", "             axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### age\n", "Keep it" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.drop_duplicates(subset=['iid']).age.hist(bins=20)" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').age.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "df = df.dropna(subset=['age'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### field:\t\t\n", "field of study \n", "\n", "#### field_cd: \t\n", "field coded \n" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==========\n", "Field Code 1.0\n", "['Law' 'law' 'LAW' 'Law and Social Work'\n", " 'Law and English Literature (J.D./Ph.D.)' 'Intellectual Property Law'\n", " 'Law/Business']\n", "==========\n", "Field Code 2.0\n", "['Economics' 'Mathematics' 'Statistics' 'math' 'Mathematics, PhD' 'Stats'\n", " 'math of finance' 'Math']\n", "==========\n", "Field Code 3.0\n", "['Psychology' 'Speech Language Pathology' 'Speech Languahe Pathology'\n", " 'Educational Psychology' 'Organizational Psychology' 'psychology'\n", " 'Communications' 'Sociology' 'psychology and english' 'theory'\n", " 'Health policy' 'Clinical Psychology' 'Sociology and Education'\n", " 'sociology' 'Anthropology/Education' 'speech pathology'\n", " 'Speech Pathology' 'Anthropology' 'School Psychology' 'anthropology'\n", " 'Counseling Psychology' 'African-American Studies/History']\n", "==========\n", "Field Code 4.0\n", "['Medicine' 'Art History/medicine'\n", " 'Sociomedical Sciences- School of Public Health' 'Epidemiology'\n", " 'GS Postbacc PreMed' 'medicine']\n", "==========\n", "Field Code 5.0\n", "['Operations Research' 'Mechanical Engineering' 'Engineering'\n", " 'Electrical Engineering' 'Operations Research 
(SEAS)'\n", " 'Education Administration' 'Computer Science' 'Biomedical Engineering'\n", " 'electrical engineering' 'engineering' 'Medical Informatics'\n", " 'medical informatics' 'Electrical Engg.' 'Environmental Engineering'\n", " 'Instructional Tech & Media' 'MA in Quantitative Methods'\n", " 'Urban Planning' 'Financial Engineering' 'biomedical engineering'\n", " 'biomedical informatics' 'ELECTRICAL ENGINEERING'\n", " 'Biomedical engineering' 'Industrial Engineering'\n", " 'Industrial Engineering/Operations Research'\n", " 'Masters of Industrial Engineering' 'Biomedical Informatics']\n", "==========\n", "Field Code 6.0\n", "['MFA Creative Writing' 'Classics' 'Journalism' 'English'\n", " 'Comparative Literature' 'English and Comp Lit'\n", " 'Communications in Education' 'Creative Writing'\n", " 'Creative Writing - Nonfiction' 'Writing: Literary Nonfiction'\n", " 'Creative Writing (Nonfiction)' 'NonFiction Writing' 'SOA -- writing'\n", " 'journalism' 'Nonfiction writing']\n", "==========\n", "Field Code 7.0\n", "['German Literature' 'Religion' 'philosophy' 'History of Religion'\n", " 'Modern Chinese Literature' 'Philosophy' 'Religion, GSAS' 'History'\n", " 'History (GSAS - PhD)' 'American Studies' 'Philosophy (Ph.D.)'\n", " 'Philosophy and Physics' 'Art History' 'art history']\n", "==========\n", "Field Code 8.0\n", "['Finance' 'Business' 'money' 'Applied Maths/Econs' 'Economics' 'Finanace'\n", " 'Finance&Economics' 'Mathematical Finance' 'MBA'\n", " 'Business & International Affairs' 'Marketing' 'Business (MBA)'\n", " 'financial math' 'Business- MBA' 'Economics, English'\n", " 'Economics, Sociology' 'Economics and Political Science' 'business'\n", " 'Business, marketing' 'Business/ Finance/ Real Estate'\n", " 'International Affairs/Finance' 'international finance and business'\n", " 'International Business' 'International Finance, Economic Policy'\n", " 'Business/Law' 'Business and International Affairs (MBA/MIA Dual Degree)'\n", " 'QMSS' 'Public Administration' 
'Master in Public Administration'\n", " 'Business School' 'MBA / Master of International Affairs (SIPA)'\n", " 'Finance/Economics' 'Business Administration' 'MBA Finance'\n", " 'BUSINESS CONSULTING' 'business school' 'Business, Media'\n", " 'Fundraising Management' 'Business (Finance & Marketing)' 'Consulting'\n", " 'MBA - Private Equity / Real Estate' 'General management/finance']\n", "==========\n", "Field Code 9.0\n", "['TC (Health Ed)' 'Elementary/Childhood Education (MA)'\n", " 'International Educational Development' 'Art Education'\n", " 'elementary education' 'MA Science Education' 'Social Studies Education'\n", " 'MA Teaching Social Studies' 'Education Policy'\n", " 'Education- Literacy Specialist' 'bilingual education' 'Education'\n", " 'math education' 'TESOL' 'Elementary Education'\n", " 'Cognitive Studies in Education' 'education'\n", " 'Curriculum and Teaching/Giftedness' 'Instructional Media and Technology'\n", " 'English Education' 'art education' 'Early Childhood Education'\n", " 'Ed.D. in higher education policy at TC' 'EDUCATION' 'music education'\n", " 'Music Education' 'Higher Ed. - M.A.' 'Neuroscience and Education'\n", " 'Elementary Education - Preservice'\n", " 'Education Leadership - Public School Administration'\n", " 'Bilingual Education' 'teaching of English']\n", "==========\n", "Field Code 10.0\n", "['chemistry' 'microbiology' 'Chemistry'\n", " 'Climate-Earth and Environ. 
Science' 'marine geophysics'\n", " 'Nutrition/Genetics' 'Neuroscience' 'physics (astrophysics)' 'Physics'\n", " 'Biochemistry' 'biology' 'Cell Biology' 'Microbiology' 'climate change'\n", " 'MA Biotechnology' 'Ecology' 'Computational Biochemsistry' 'Neurobiology'\n", " 'biomedicine' 'Biology' 'Conservation biology' 'biotechnology'\n", " 'Earth and Environmental Science' 'nutrition' 'Genetics' 'Nutritiron'\n", " 'Molecular Biology' 'Genetics & Development' 'genetics'\n", " 'medicine and biochemistry' 'Epidemiology' 'Nutrition'\n", " 'Applied Physiology & Nutrition' 'Biomedical Engineering' 'physics'\n", " 'Biotechnology' 'Neurosciences/Stem cells' 'Biology PhD'\n", " 'biochemistry/genetics' 'epidemiology'\n", " 'Biochemistry & Molecular Biophysics']\n", "==========\n", "Field Code 11.0\n", "['social work' 'Social Work' 'Masters of Social Work' 'Social work'\n", " 'International Affairs' 'Social Work/SIPA']\n", "==========\n", "Field Code 12.0\n", "['Undergrad - GS']\n", "==========\n", "Field Code 13.0\n", "['Masters in Public Administration' 'Masters of Social Work&Education'\n", " 'political science' 'International Relations'\n", " 'international affairs - economic development' 'Political Science'\n", " 'American Studies (Masters)' 'International Affairs'\n", " 'international affairs/international finance' 'International Development'\n", " 'International Affairs and Public Health' 'International affairs'\n", " 'International Affairs/Business' 'Master of International Affairs'\n", " 'International Politics' 'SIPA / MIA'\n", " 'International Security Policy - SIPA' 'Intrernational Affairs'\n", " 'International Affairs - Economic Policy' 'SIPA - Energy' 'Public Policy'\n", " 'Human Rights: Middle East' 'Human Rights' 'SIPA-International Affairs'\n", " 'Public Administration']\n", "==========\n", "Field Code 14.0\n", "['Film' 'MFA -Film' 'film']\n", "==========\n", "Field Code 15.0\n", "['Arts Administration' 'Museum Anthropology'\n", " 'Theatre Management & 
Producing' 'MFA Writing' 'MFA Poetry' 'Theater'\n", " 'MFA Acting Program' 'Acting' 'Public Health']\n", "==========\n", "Field Code 16.0\n", "['Polish' 'Japanese Literature' 'french']\n", "==========\n", "Field Code 17.0\n", "['Architecture']\n", "==========\n", "Field Code 18.0\n", "['working' 'GSAS' 'Climate Dynamics']\n" ] } ], "source": [ "# Print the free-text field values that fall under each field code\n", "for i, group in df.groupby('field_cd'):\n", "    print('=' * 10)\n", "    print('Field Code {}'.format(i))\n", "    print(group.field.unique())" ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.field_cd.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "# Put rows with a missing field code into a separate category, 19\n", "df['field_cd'] = df['field_cd'].fillna(19)" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['field'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to encode field_cd somehow!" ] }, { "cell_type": "code", "execution_count": 100, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [], "source": [ "df = pd.get_dummies(df, prefix='field_code', prefix_sep='=',\n", "                    columns=['field_cd'])" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>iid</th><th>gender</th><th>wave</th><th>pid</th><th>match</th><th>int_corr</th><th>samerace</th><th>age</th><th>undergra</th><th>mn_sat</th><th>...</th><th>field_code=10.0</th><th>field_code=11.0</th><th>field_code=12.0</th><th>field_code=13.0</th><th>field_code=14.0</th><th>field_code=15.0</th><th>field_code=16.0</th><th>field_code=17.0</th><th>field_code=18.0</th><th>field_code=19.0</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>1</td><td>0</td><td>1</td><td>11.0</td><td>0</td><td>0.14</td><td>0</td><td>21.0</td><td>NaN</td><td>NaN</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>1</td><td>0</td><td>1</td><td>12.0</td><td>0</td><td>0.54</td><td>0</td><td>21.0</td><td>NaN</td><td>NaN</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>2</th><td>1</td><td>0</td><td>1</td><td>13.0</td><td>1</td><td>0.16</td><td>1</td><td>21.0</td><td>NaN</td><td>NaN</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>3</th><td>1</td><td>0</td><td>1</td><td>14.0</td><td>1</td><td>0.61</td><td>0</td><td>21.0</td><td>NaN</td><td>NaN</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>4</th><td>1</td><td>0</td><td>1</td><td>15.0</td><td>1</td><td>0.21</td><td>0</td><td>21.0</td><td>NaN</td><td>NaN</td><td>...</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 88 columns</p>
" ], "text/plain": [ " iid gender wave pid match int_corr samerace age undergra mn_sat \\\n", "0 1 0 1 11.0 0 0.14 0 21.0 NaN NaN \n", "1 1 0 1 12.0 0 0.54 0 21.0 NaN NaN \n", "2 1 0 1 13.0 1 0.16 1 21.0 NaN NaN \n", "3 1 0 1 14.0 1 0.61 0 21.0 NaN NaN \n", "4 1 0 1 15.0 1 0.21 0 21.0 NaN NaN \n", "\n", " ... field_code=10.0 field_code=11.0 field_code=12.0 \\\n", "0 ... 0 0 0 \n", "1 ... 0 0 0 \n", "2 ... 0 0 0 \n", "3 ... 0 0 0 \n", "4 ... 0 0 0 \n", "\n", " field_code=13.0 field_code=14.0 field_code=15.0 field_code=16.0 \\\n", "0 0 0 0 0 \n", "1 0 0 0 0 \n", "2 0 0 0 0 \n", "3 0 0 0 0 \n", "4 0 0 0 0 \n", "\n", " field_code=17.0 field_code=18.0 field_code=19.0 \n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", "[5 rows x 88 columns]" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### undergrd: \t\n", "school attended for undergraduate degree\n", "\n", "Пока выкинем" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "UC Berkeley 107\n", "Harvard 104\n", "Columbia 95\n", "Yale 86\n", "NYU 78\n", "Name: undergra, dtype: int64" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.undergra.value_counts().head()" ] }, { "cell_type": "code", "execution_count": 104, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['undergra'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### mn_sat: \t\n", "Median SAT score for the undergraduate institution where attended. 
\t\t\t" ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1,400.00 403\n", "1,430.00 262\n", "1,290.00 190\n", "1,450.00 163\n", "1,340.00 146\n", "Name: mn_sat, dtype: int64" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.mn_sat.value_counts().head()" ] }, { "cell_type": "code", "execution_count": 106, "metadata": {}, "outputs": [], "source": [ "# Strip the thousands separator and convert to float (np.float was removed from NumPy)\n", "df['mn_sat'] = df['mn_sat'].str.replace(',', '').astype(float)" ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.drop_duplicates('iid').mn_sat.hist()" ] }, { "cell_type": "code", "execution_count": 108, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "342" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').mn_sat.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [], "source": [ "# Что будем делать?\n", "df = df.drop(['mn_sat'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### tuition: \t\n", "Tuition listed for each response to undergrad in Barron’s 25th Edition college profile book." ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "26,908.00 241\n", "26,019.00 174\n", "15,162.00 138\n", "25,380.00 112\n", "26,062.00 108\n", "Name: tuition, dtype: int64" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tuition.value_counts().head()" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [], "source": [ "df.loc[:, 'tuition'] = df.loc[:, 'tuition'].str.replace(',', '').astype(np.float)" ] }, { "cell_type": "code", "execution_count": 112, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 112, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.drop_duplicates('iid').tuition.hist()" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "310" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').tuition.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 114, "metadata": {}, "outputs": [], "source": [ "# Что будем делать?\n", "df = df.drop(['tuition'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### race:\n", "* Black/African American=1\n", "* European/Caucasian-American=2\n", "* Latino/Hispanic American=3\n", "* Asian/Pacific Islander/Asian-American=4\n", "* Native American=5\n", "* Other=6\n" ] }, { "cell_type": "code", "execution_count": 115, "metadata": {}, "outputs": [], "source": [ "# Ну тут вы уже сами знаете как быть\n", "df = pd.get_dummies(df, prefix='race', prefix_sep='=',\n", " columns=['race'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### imprace:\n", "How important is it to you (on a scale of 1-10) that a person you date be of the same racial/ethnic background?\n", "\n", "#### imprelig:\n", " How important is it to you (on a scale of 1-10) that a person you date be of the same religious background?\n" ] }, { "cell_type": "code", "execution_count": 116, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 116, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').imprace.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1" ] }, "execution_count": 117, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').imprelig.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [], "source": [ "# Что делать?\n", "\n", "df = 
df.dropna(subset=['imprelig', 'imprace'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### from:\n", "Where are you from originally (before coming to Columbia)? \n", "\n", "#### zipcode:\n", "What was the zip code of the area where you grew up? \n", "\n", "We drop both features." ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['from', 'zipcode'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### income" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [], "source": [ "df.loc[:, 'income'] = df.loc[:, 'income'].str.replace(',', '').astype(float)" ] }, { "cell_type": "code", "execution_count": 121, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 121, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "df.drop_duplicates('iid').loc[:, 'income'].hist()" ] }, { "cell_type": "code", "execution_count": 122, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "261" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').loc[:, 'income'].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 123, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['income'], axis=1)\n", "# df.loc[:, 'income'] = df.loc[:, 'income'].fillna(-999)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### goal:\n", " What is your primary goal in participating in this event? \n", "\tSeemed like a fun night out=1\n", "\tTo meet new people=2\n", "\tTo get a date=3\n", "\tLooking for a serious relationship=4\n", "\tTo say I did it=5\n", "\tOther=6\n", "\n", "#### date:\n", " In general, how frequently do you go on dates? \n", "\tSeveral times a week=1\n", "\tTwice a week=2\n", "\tOnce a week=3\n", "\tTwice a month=4\n", "\tOnce a month=5\n", "\tSeveral times a year=6\n", "\tAlmost never=7\n", "\n", "#### go out:\n", " How often do you go out (not necessarily on dates)?\n", "\tSeveral times a week=1\n", "\tTwice a week=2\n", "\tOnce a week=3\n", "\tTwice a month=4\n", "\tOnce a month=5\n", "\tSeveral times a year=6\n", "\tAlmost never=7\n", "\n", "How would you suggest encoding these variables?" 
] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').goal.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "df = pd.get_dummies(df, prefix='goal',\n", "                    prefix_sep='=', columns=['goal'])" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "df = df.dropna(subset=['date'])" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').go_out.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### career:\n", "What is your intended career?\n", "\n", "#### career_c: \n", "career coded \n", "\n", "We handle these the same way as `field` and `field_cd`." ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "==========\n", "Career Code 1.0\n", "['lawyer/policy work' 'lawyer' 'Law' 'Corporate Lawyer' 'Lawyer'\n", " 'Corporate attorney' 'law' 'Intellectual Property Attorney' 'LAWYER'\n", " 'attorney' 'Lawyer or professional surfer' 'lawyer/gov.position'\n", " 'Law or finance' 'IP Law' 'Academic (Law)' 'Private Equity' 'attorney?'\n", " 'Corporate law' 'tax lawyer' 'Business/Law' 'Assistant District Attorney']\n", "==========\n", "Career Code 2.0\n", "['Academia, Research, Banking, Life' 'academics or journalism' 'Professor'\n", " 'Academic' 'academia' 'teacher' 'industrial scientist'\n", " 'teaching and then...' 
'Professor of Media Studies'\n", " 'Education Administration' 'Academic or Research staff'\n", " 'University Professor' 'Research Scientist'\n", " 'research in industry or academia' 'Teacher/Professor'\n", " 'no idea, maybe a professor' 'a research position' 'professor' 'teaching'\n", " 'engineering professional' 'research' 'Neuroscientist/Professor'\n", " 'Education' 'Professor and Government Official'\n", " 'physicist, probably academia' 'college art teacher' 'academic'\n", " 'Research scientist, professor' 'academics' 'academic research'\n", " 'academician' 'professional student' 'education' 'Historian'\n", " 'college professor' 'scientific research' 'Academic Physician'\n", " 'Researcher' 'Professor or Consultant' 'History Professor'\n", " 'Educational Policy' 'elementary school teacher' 'Research/Teaching'\n", " 'researcher in sociology' 'scientist' 'Naturalist'\n", " 'professor, poet/critic' 'researcher/academia' 'Art educator and Artist'\n", " 'Teacher' 'Scientist' 'Scientist/educator'\n", " 'scientific research for now but who knows' 'College Professor'\n", " 'Professor or Lawyer' 'research position in pharmaceutical industry'\n", " 'Academia' 'research/academia' 'Secondary Education Teacher'\n", " 'High School Social Studies Teacher' 'Education Policy Analyst'\n", " 'Literacy Organization head/ Director of Development for non-profit'\n", " 'English Teacher' 'Program development / policy work'\n", " 'professor of education' 'Educator' 'teaching/education'\n", " 'professor in college' 'Academia; Research; Teaching'\n", " 'curriculum developer' 'academic or consulting' 'Academia or UN'\n", " 'I am a teacher.' 'Professor or journalist'\n", " 'to get Ph.D and be a professor'\n", " 'Early Childhood Ed. - College/univ. 
faculity'\n", " 'medical examiner or researcher' 'University President'\n", " 'EDUCATION ADMINISTRATION' 'music educator, performer'\n", " 'Elementary Education Teaching' 'research - teaching' 'Research'\n", " 'Elementary school teacher' 'Bilingual Elementary School Teacher'\n", " 'Professor, or Engineer' 'Professor; Human Rights Director'\n", " 'Clinic Trial' 'English teacher' 'writer/teacher' 'Professor...?'\n", " 'acadeic' 'researcher' 'biology industry' 'Epidemiologist'\n", " 'epidemiologist' 'teacher and performer' 'TEACHING' 'Academic/ Finance'\n", " 'Science' 'Academic Work, Consultant']\n", "==========\n", "Career Code 3.0\n", "['psychologist' 'Social Worker.... Clinician' 'Psychologist'\n", " 'school psychologist' 'School Psychologist' 'Clinical Psychology'\n", " 'Clinical Psychologist' 'clinical psychologist, researcher, professor'\n", " 'School Counseling' 'Sex Therapist']\n", "==========\n", "Career Code 4.0\n", "['Biostatistics' 'Medicine' 'pharmaceuticals' 'Cardiologist' 'Pediatrics'\n", " 'medicine' 'pharmaceuticals and biotechnology' 'Physician Scientist'\n", " 'health policy' 'Epidemiologist' 'nutrition and dental' 'Physician'\n", " 'dietician' 'doctor and entrepreneur' 'Healthcare' 'Nutritionist'\n", " 'Private practice Dietician' 'physician, informaticist' 'physician'\n", " 'Medical Sciences' 'physician/healthcare' 'Doctor']\n", "==========\n", "Career Code 5.0\n", "['Informatics' 'Engineer' 'Ph.D. 
Electrical Engineering'\n", " 'Operations Research' 'Engineering' 'Mechanical Engineering'\n", " 'Civil Engineer' 'Urban Planner' 'Planning' 'ASIC Engineer'\n", " 'software engr, network engr' 'Research Engineer']\n", "==========\n", "Career Code 6.0\n", "['Journalist' \"Clidren's TV\" 'Music production' 'comedienne' 'novelist'\n", " 'Journalism' 'film' 'Writer' 'Porn Star' 'boxing champ'\n", " 'Paper Back Writer'\n", " 'Poet, Writer, Singer, Policy Maker with the UN and/or Indian Govt.'\n", " 'Entertainment/Sports' 'WRITING' 'manage a museum or art gallery'\n", " 'Entertainment/Media' 'Film/Television' 'Writing'\n", " 'Museum Work (Curation?)' 'Music Industry' 'Artist' 'Art Management'\n", " 'film directing' 'Screenwriter' 'Filmmaker' 'Writer/teacher'\n", " 'Writing or Editorial' 'writer/editor'\n", " 'producer at a non-profit regional theatre' 'writer' 'playing music'\n", " 'writer/producer' 'film and radio' 'Film' 'Writer/Editor' 'Actress'\n", " 'Acting']\n", "==========\n", "Career Code 7.0\n", "['research/financial industry' 'Financial Services' 'ceo' 'CEO' 'Banking'\n", " 'Capital Markets' 'Organizational Change Consultant' 'banker / academia'\n", " 'banker' 'Entrepreneur' 'consulting' 'Private Equity Investing'\n", " 'Investment Banking' 'Engineer or iBanker or consultant' 'Trading'\n", " 'Economic research' 'Microfinancing Program Manager' 'Marketing'\n", " 'Business - Investment Management' 'Finance' 'business'\n", " 'Marketing, Advertising' 'Asset Management' 'investment banking' 'MBA'\n", " 'Business' 'finance' 'Marketing and Media' 'Brand Management'\n", " 'Management Consulting' 'management consulting'\n", " 'financial service or fashion' 'International Business' 'Private Equity'\n", " 'Investment Management' 'Development work' 'marketing / brand management'\n", " 'Biotech/business' 'Country Analysis/Research/Credit Analysis'\n", " 'Consulting' 'corporate finance'\n", " 'CEO in For Profit Biomedical Organization' 'banking'\n", " 'Conservation training 
and education' 'president' 'Management Consultant'\n", " 'Trader' 'Wall Street Economist' 'enterpreneur' 'Industry CTO/CEO'\n", " 'finance or engineering' 'Venture Capital/Consulting/Government'\n", " \"Int'l Business\" 'Pharmaceuticals/Consulting' 'Investment banking'\n", " 'International Development banker'\n", " 'Corporate Finance, Asset Management/ Hedge Funds'\n", " 'Real Estate Consulting' 'Director of Training and Development'\n", " 'Marketing or Strategy and Business Development' 'Business Consulting'\n", " 'CONSULTING' 'investment management' 'Finance Related'\n", " 'Media Marketing/Entrepreneurship' 'Director of Admissions'\n", " 'Consultin \\\\ Management'\n", " 'Financial Mathematics-Investment Bank or Hedge Fund-Derivatives Quant Analyst'\n", " 'Work in an investment bank' 'M&A Advisory' 'millionaire'\n", " 'Fundraising for Non-Profits' 'Money Management' 'General Management'\n", " 'Public School Principal' 'Media Management' 'Public Finance'\n", " 'Business Management' 'private equity' 'Health care finance'\n", " 'Entrepreneurship' 'Fixed Income Sales & Trading'\n", " 'Consulting, later Arts or Non-Profit' 'Finance/Economics'\n", " 'Investment Banker' 'consultant'\n", " 'Business Management and Information Technology' 'self-made millionare'\n", " 'To go into Finance' 'Private Equity - Leveraged Buy-Outs' 'Management'\n", " 'General management/consulting']\n", "==========\n", "Career Code 8.0\n", "['Real Estate' 'Real Estate/ Private Equity']\n", "==========\n", "Career Code 9.0\n", "['Congresswoman, and comedian'\n", " 'To create early childhood intervention programs'\n", " 'health/nutrition oriented social worker' 'Social Worker'\n", " 'Social work with children' 'Speech Language Pathologist'\n", " 'Social Work Administration' 'social worker' 'Social Services/ Policy'\n", " 'Clinical Social Worker' 'international development work' 'Nonprofit'\n", " 'Child Rights' 'Development work on field in the middle of nowhere'\n", " 'International Development' 
'UN Civil Servant'\n", " 'Humanitarian Affairs/Human Rights'\n", " 'International affairs related career' 'public service'\n", " 'Security Policy - Homeland Defense'\n", " 'reorganizing society. no, I am not being flip.' 'Intl Development'\n", " \"Diplomat / Int'l civil servant\" 'Diplomat/Business'\n", " 'Economic Policy Advisor on Latin America' 'Energy Management' 'Diplomat'\n", " 'Work at the UN' 'Foreign Service'\n", " 'Exec. Director of social service non-profit']\n", "==========\n", "Career Code 10.0\n", "['Undecided' \"I don't know\" 'What a question!' 'if only i knew'\n", " \"don't know\" 'Not Sure' 'undecided' 'TBA' 'Am not sure' 'Who knows' '?'\n", " 'not sure yet :)' 'Make money' 'still wondering' 'Not sure yet' 'unknown'\n", " 'unsure' '??' 'dont know yet']\n", "==========\n", "Career Code 11.0\n", "['Social Worker' 'Counseling Adolescents' 'Social work' 'Social Work'\n", " 'Social Work Policy' 'Clinical Social Worker']\n", "==========\n", "Career Code 12.0\n", "['speech pathologist' 'Speech Pathologist']\n", "==========\n", "Career Code 13.0\n", "['GOVERNOR' 'Political Development in Africa' 'Lobbyist' 'politics'\n", " 'School Leadership/Politics']\n", "==========\n", "Career Code 14.0\n", "['Pro Beach Volleyball']\n", "==========\n", "Career Code 15.0\n", "['Hero' 'Energy' 'Trade Specialist' 'professional career'\n", " \"assistant master of the universe (otherwise it's too much work)\"]\n", "==========\n", "Career Code 16.0\n", "['journalism' 'Writer/journalist']\n", "==========\n", "Career Code 17.0\n", "['Architecture and design']\n" ] } ], "source": [ "for i, group in df.groupby('career_c'):\n", "    print('=' * 10)\n", "    print('Career Code {}'.format(i))\n", "    print(group.career.unique())" ] }, { "cell_type": "code", "execution_count": 129, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "59" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.career_c.isnull().sum()" ] }, { "cell_type": "code", 
"execution_count": 130, "metadata": {}, "outputs": [], "source": [ "# Missing career codes get a category of their own (18)\n", "df.loc[:, 'career_c'] = df.loc[:, 'career_c'].fillna(18)" ] }, { "cell_type": "code", "execution_count": 131, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['career'], axis=1)" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [], "source": [ "# Now this needs to be encoded\n", "df = pd.get_dummies(df, prefix='career', prefix_sep='=',\n", "                    columns=['career_c'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How interested are you in the following activities, on a scale of 1-10?\n", " \n", " sports: Playing sports/ athletics\n", " tvsports: Watching sports\n", " exercise: Body building/exercising\n", " dining: Dining out\n", " museums: Museums/galleries\n", " art: Art\n", " hiking: Hiking/camping\n", " gaming: Gaming\n", " clubbing: Dancing/clubbing\n", " reading: Reading\n", " tv: Watching TV\n", " theater: Theater\n", " movies: Movies\n", " concerts: Going to concerts\n", " music: Music\n", " shopping: Shopping\n", " yoga: Yoga/meditation\n", "\n", "There is a lot one could do with these features. For example, we already have a feature that measures the correlation between the two partners' interests (`int_corr`). 
For now we drop them all." ] }, { "cell_type": "code", "execution_count": 133, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sports 0\n", "tvsports 0\n", "exercise 0\n", "dining 0\n", "museums 0\n", "art 0\n", "hiking 0\n", "gaming 0\n", "clubbing 0\n", "reading 0\n", "tv 0\n", "theater 0\n", "movies 0\n", "concerts 0\n", "music 0\n", "shopping 0\n", "yoga 0\n", "dtype: int64" ] }, "execution_count": 133, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[:, ['sports','tvsports','exercise','dining','museums','art','hiking','gaming',\n", "           'clubbing','reading','tv','theater','movies','concerts','music','shopping','yoga']\n", "      ].isnull().sum()" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['sports','tvsports','exercise','dining','museums','art','hiking','gaming',\n", "              'clubbing','reading','tv','theater','movies','concerts','music','shopping','yoga'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### exphappy:\n", "Overall, on a scale of 1-10, how happy do you expect to be with the people you meet \n", "during the speed-dating event?\n", "\n", "#### expnum: \n", "Out of the 20 people you will meet, how many do you expect will be interested in dating you? 
\n" ] }, { "cell_type": "code", "execution_count": 135, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 135, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').exphappy.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "416" ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.drop_duplicates('iid').expnum.isnull().sum()" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['expnum'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attr1\n", "\n", "We want to know what you look for in the opposite sex. \n", "Waves 6-9: Please rate the importance of the following attributes in a potential date on a scale of 1-10 (1=not at all important, 10=extremely important):\n", "Waves 1-5, 10-21: You have 100 points to distribute among the following attributes -- give more points to those attributes that are more important in a potential date, and fewer points to those attributes that are less important in a potential date. 
Total points must equal 100.\n", "\n", "attr1_1 \n", "Attractive\n", "\n", "sinc1_1\n", "Sincere\n", "\n", "intel1_1\n", "Intelligent\n", "\n", "fun1_1\n", "Fun\n", "\n", "amb1_1\n", "Ambitious\n", "\n", "shar1_1\n", "Has shared interests/hobbies\n" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "feat = ['iid', 'wave', 'attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']" ] }, { "cell_type": "code", "execution_count": 139, "metadata": {}, "outputs": [], "source": [ "temp = df.drop_duplicates(subset=['iid', 'wave']).loc[:, feat]" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [], "source": [ "temp.loc[:, 'totalsum'] = temp.iloc[:, 2:].sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "idx = ((temp.wave < 6) | (temp.wave > 9)) & (temp.totalsum < 99)" ] }, { "cell_type": "code", "execution_count": 142, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " iid wave attr1_1 sinc1_1 intel1_1 fun1_1 amb1_1 shar1_1 totalsum\n", "918 67 3 20.0 15.0 20.0 20.0 5.0 10.0 90.0\n", "1530 105 4 30.0 15.0 20.0 20.0 0.0 5.0 90.0\n", "7221 489 19 20.0 10.0 20.0 20.0 20.0 0.0 90.0\n", "7586 517 21 15.0 20.0 20.0 20.0 5.0 10.0 90.0\n", "7784 526 21 10.0 10.0 30.0 20.0 10.0 15.0 95.0" ] }, "execution_count": 142, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp.loc[idx, ]" ] }, { "cell_type": "code", "execution_count": 143, "metadata": {}, "outputs": [], "source": [ "idx = ((temp.wave >= 6) & (temp.wave <= 9))" ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "# temp.loc[idx, ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Clearly, the original features need a small adjustment before we move on: we rescale each respondent's six scores so that they sum to 100." ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "df.loc[:, 'temp_totalsum'] = df.loc[:, ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']].sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 146, "metadata": {}, "outputs": [], "source": [ "df.loc[:, ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']] = \\\n", "(df.loc[:, ['attr1_1', 'sinc1_1', 'intel1_1', 'fun1_1', 'amb1_1', 'shar1_1']].T/df.loc[:, 'temp_totalsum'].T).T * 100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do the same for the `attr2` features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Attr2" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [], "source": [ "feat = ['iid', 'wave', 'attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1']" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "temp = df.drop_duplicates(subset=['iid', 'wave']).loc[:, feat]" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [], "source": [ "temp.loc[:, 
'totalsum'] = temp.iloc[:, 2:].sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [], "source": [ "idx = ((temp.wave < 6) | (temp.wave > 9)) & (temp.totalsum < 90) & (temp.totalsum != 0)" ] }, { "cell_type": "code", "execution_count": 151, "metadata": {}, "outputs": [ { "data": { "text/plain": [ " iid wave attr2_1 sinc2_1 intel2_1 fun2_1 amb2_1 shar2_1 totalsum\n", "4816 320 12 20.0 10.0 10.0 10.0 20.0 10.0 80.0" ] }, "execution_count": 151, "metadata": {}, "output_type": "execute_result" } ], "source": [ "temp.loc[idx, ]" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [], "source": [ "idx = ((temp.wave >= 6) & (temp.wave <= 9))" ] }, { "cell_type": "code", "execution_count": 153, "metadata": {}, "outputs": [], "source": [ "# temp.loc[idx, ]" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [], "source": [ "df.loc[:, 'temp_totalsum'] = df.loc[:, ['attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1']].sum(axis=1)" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [], "source": [ "df.loc[:, ['attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1']] = \\\n", "(df.loc[:, ['attr2_1', 'sinc2_1', 'intel2_1', 'fun2_1', 'amb2_1', 'shar2_1']].T/df.loc[:, 'temp_totalsum'].T).T * 100" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['temp_totalsum'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For now we drop the `attr4` and `attr5` features." ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "scrolled": true }, "outputs": [], "source": [ "for i in [4, 5]:\n", "    feat = ['attr{}_1'.format(i), 'sinc{}_1'.format(i),\n", "            'intel{}_1'.format(i), 'fun{}_1'.format(i),\n", "            'amb{}_1'.format(i), 'shar{}_1'.format(i)]\n", "\n", "    # the attr5 block has no shared-interests column\n", "    if i != 4:\n", "        feat.remove('shar{}_1'.format(i))\n", "\n", "    df = df.drop(feat, axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we build one table for men and one for women, and join them." ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [], "source": [ "df = df.drop(['wave'], axis=1)" ] }, { "cell_type": "code", "execution_count": 159, "metadata": {}, 
"outputs": [ { "data": { "text/plain": [ "(8249, 77)" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "df_male = df.query('gender == 1').drop_duplicates(subset=['iid', 'pid'])\\\n", "            .drop(['gender'], axis=1)\\\n", "            .dropna()\n", "df_female = df.query('gender == 0').drop_duplicates(subset=['iid'])\\\n", "              .drop(['gender', 'match', 'int_corr', 'samerace'], axis=1)\\\n", "              .dropna()\n", "\n", "df_female.columns = df_female.columns + '_f'" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [], "source": [ "df_pair = df_male.join(df_female.set_index('iid_f'),\n", "                       on='pid', how='inner')" ] }, { "cell_type": "code", "execution_count": 162, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3999, 148)" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pair.shape" ] }, { "cell_type": "code", "execution_count": 163, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 3345\n", "1 654\n", "Name: match, dtype: int64" ] }, "execution_count": 163, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_pair.match.value_counts()" ] }, { "cell_type": "code", "execution_count": 164, "metadata": {}, "outputs": [], "source": [ "label_col = 'match'" ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "df_pair = df_pair.drop(['iid', 'pid'], axis=1)" ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [], "source": [ "X = df_pair.loc[:, df_pair.columns != label_col].values\n", "y = df_pair.loc[:, df_pair.columns == label_col].values.flatten()" ] }, { "cell_type": "code", "execution_count": 167, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, 1, 1, 0, 0, 0, 1, 0])" ] }, "execution_count": 167, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "y[:10]" ] }, { "cell_type": "code", "execution_count": 168, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[ 0.14, 0. , 27. , ..., 0. , 0. , 1. ],\n", " [ 0.54, 0. , 22. , ..., 0. , 0. , 1. ],\n", " [ 0.16, 1. , 22. , ..., 0. , 0. , 1. ],\n", " ...,\n", " [ 0.5 , 0. , 27. , ..., 0. , 0. , 1. ],\n", " [ 0.28, 0. , 28. , ..., 0. , 0. , 1. ],\n", " [-0.36, 0. , 24. , ..., 0. , 0. , 1. ]])" ] }, "execution_count": 168, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X[:10]" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3999, 145)" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape" ] }, { "cell_type": "code", "execution_count": 170, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,\n", "                                                    random_state=123)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task\n", "\n", "We focus on a single decision-tree hyperparameter: the maximum depth.\n", "\n", "Choose the best tree depth `d` using \n", "* the cross-validated ROC AUC, averaged over folds, for different values of `d`" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "from sklearn.tree import DecisionTreeClassifier\n", "\n", "try:\n", "    from sklearn.model_selection import cross_val_score\n", "except ImportError:\n", "    from sklearn.cross_validation import cross_val_score\n", "\n", "try:\n", "    from sklearn.model_selection import validation_curve\n", "except ImportError:\n", "    from sklearn.learning_curve import validation_curve" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [ { "data": { "text/plain": [ 
"DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=2,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", " splitter='best')" ] }, "execution_count": 173, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = DecisionTreeClassifier(max_depth=2, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "y_hat = model.predict(X_train)" ] }, { "cell_type": "code", "execution_count": 175, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0, 0, 1, ..., 0, 1, 1])" ] }, "execution_count": 175, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_hat" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [], "source": [ "y_hat_proba = model.predict_proba(X_train)" ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.55207836, 0.44792164],\n", " [0.55207836, 0.44792164],\n", " [0.39504568, 0.60495432],\n", " [0.39504568, 0.60495432],\n", " [0.55207836, 0.44792164],\n", " [0.39504568, 0.60495432],\n", " [0.39504568, 0.60495432],\n", " [0.39504568, 0.60495432],\n", " [0.39504568, 0.60495432],\n", " [0.55207836, 0.44792164]])" ] }, "execution_count": 177, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_hat_proba[:10]" ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [], "source": [ "scores = []\n", "cv = 5  # number of cross-validation folds\n", "for d in range(1, 20):\n", "    model = DecisionTreeClassifier(max_depth=d, class_weight='balanced')\n", "    scores.append(\n", "        cross_val_score(model, X_train, y_train, scoring='roc_auc',\n", "                        cv=cv, n_jobs=-1).mean()\n", "    )" ] }, { "cell_type": "code", "execution_count": 185, "metadata": { 
"scrolled": true }, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 185, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(range(1,20), scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task\n", "\n", "Sort the features by importance. The vector of feature importances is available as `model.feature_importances_`." ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(class_weight='balanced', criterion='gini', max_depth=6,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=None,\n", " splitter='best')" ] }, "execution_count": 186, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model = DecisionTreeClassifier(max_depth=6, class_weight='balanced')\n", "model.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "imp = pd.Series(index=df_pair.columns[df_pair.columns != label_col],\n", "                data=model.feature_importances_)" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "date 0.110340\n", "sinc2_1 0.084710\n", "shar1_1_f 0.079804\n", "amb3_1_f 0.074389\n", "sinc3_1 0.044790\n", "dtype: float64" ] }, "execution_count": 188, "metadata": {}, "output_type": "execute_result" } ], "source": [ "imp.sort_values(ascending=False).head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task \n", "* Get predictions on the test set\n", "* Plot ROC curves for the training and test sets" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5681619160208415" ] }, "execution_count": 190, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import roc_auc_score\n", "\n", "y_hat = model.predict_proba(X_test)\n", "roc_auc_score(y_test, 
y_hat[:, 1])" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [], "source": [ "# A sharp drop in quality: the tree has overfit after all" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Task\n", "Let us move on to the random forest model. Fix some tree depth (you can take the optimal one from the previous tasks).\n", "\n", "Compare the quality of\n", "* A single decision tree\n", "* Bagging over 10, 20, ..., 100 decision trees\n", "* A random forest with 10, 20, ..., 100 decision trees\n", "\n", "Produce a plot with the number of trees on the X axis and the classification quality on the Y axis." ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import BaggingClassifier\n", "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 195, "metadata": {}, "outputs": [], "source": [ "bag_valid_score = []\n", "rf_valid_score = []\n", "n_est_range = range(10, 110, 10)\n", "\n", "for n_est in n_est_range:\n", "\n", "    # The base tree is passed positionally: the keyword is base_estimator\n", "    # in older scikit-learn releases and estimator in newer ones\n", "    model_bag = BaggingClassifier(DecisionTreeClassifier(max_depth=6, class_weight='balanced'),\n", "                                  n_estimators=n_est)\n", "    model_rf = RandomForestClassifier(max_depth=6, class_weight='balanced_subsample', n_estimators=n_est)\n", "\n", "    # Mean ROC AUC over 5 cross-validation folds for each ensemble\n", "    bag_valid_score.append(\n", "        cross_val_score(model_bag, X_train, y_train, scoring='roc_auc',\n", "                        cv=5, n_jobs=-1).mean()\n", "    )\n", "    rf_valid_score.append(\n", "        cross_val_score(model_rf, X_train, y_train, scoring='roc_auc',\n", "                        cv=5, n_jobs=-1).mean()\n", "    )" ] }, { "cell_type": "code", "execution_count": 198, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 198, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.plot(n_est_range, bag_valid_score, label='bagging cv')\n", "plt.plot(n_est_range, rf_valid_score, label='random forest cv')\n", "plt.legend()" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',\n", " criterion='gini', max_depth=6, max_features='auto',\n", " max_leaf_nodes=None, min_impurity_decrease=0.0,\n", " min_impurity_split=None, min_samples_leaf=1,\n", " min_samples_split=2, min_weight_fraction_leaf=0.0,\n", " n_estimators=100, n_jobs=None, oob_score=False,\n", " random_state=None, verbose=0, warm_start=False)" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_rf = RandomForestClassifier(max_depth=6, class_weight='balanced_subsample', n_estimators=100)\n", "model_rf.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 202, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6817614606593985" ] }, "execution_count": 202, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_hat = model_rf.predict_proba(X_test)\n", "roc_auc_score(y_test, y_hat[:, 1])\n", "# There is still a drop, but it is much smaller" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tuning all hyperparameters\n", "\n", "Hyperparameters are usually tuned in groups rather than one at a time. There are several ways to do this:\n", "* Exhaustive search (Grid Search): all candidate values of each parameter are listed explicitly, and then every combination of these values is evaluated\n", "* Random search (Random Search): for some parameters a probability distribution is specified instead of a list of values, along with the number of random combinations to try\n", "* \"Smart\" search ([hyperopt](http://hyperopt.github.io/hyperopt/)): after each step the next combination is chosen so as to explore untested regions on the one hand and minimize the loss function on the other. It does not always work as well as it sounds.\n", "\n", "Here we will try random search. Why random search is better than grid search:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### Task\n", "* Use GridSearchCV or RandomizedSearchCV to find the best hyperparameters for the random forest.\n", "* For these hyperparameters, compare the mean cross-validation score with the score on the hold-out set\n", "\n" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "from sklearn.model_selection import RandomizedSearchCV" ] }, { "cell_type": "code", "execution_count": 204, "metadata": {}, "outputs": [], "source": [ "# Your Code Here" ] }, { "cell_type": "code", "execution_count": 205, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import randint\n", "from scipy.stats import uniform\n", "\n", "try:\n", "    from sklearn.model_selection import GridSearchCV\n", "    from sklearn.model_selection import RandomizedSearchCV\n", "    from sklearn.model_selection import StratifiedKFold\n", "except ImportError:\n", "    # scikit-learn < 0.18: the search classes lived in sklearn.grid_search\n", "    from sklearn.grid_search import GridSearchCV\n", "    from sklearn.grid_search import RandomizedSearchCV\n", "    from sklearn.cross_validation import StratifiedKFold\n", "\n", "\n", "RND_SEED = 123" ] }, { "cell_type": "code", "execution_count": 207, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomizedSearchCV(cv=StratifiedKFold(n_splits=5, random_state=123, shuffle=True),\n", " error_score='raise-deprecating',\n", " estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=1, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=123,\n", " splitter='best'),\n", " fit_params=None, iid='warn', n_iter=200, n_jobs=-1,\n", " param_distributions={'criterion': ['gini', 'entropy'], 'max_depth': , 'min_samples_leaf': , 'class_weight': [None, 
'balanced']},\n", " pre_dispatch='2*n_jobs', random_state=123, refit=True,\n", " return_train_score='warn', scoring='roc_auc', verbose=0)" ] }, "execution_count": 207, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Define the search space\n", "\n", "param_grid = {\n", "    'criterion': ['gini', 'entropy'],\n", "    'max_depth': randint(2, 10),\n", "    'min_samples_leaf': randint(1, 100),\n", "    'class_weight': [None, 'balanced']}\n", "\n", "# Some parameters are specified as distributions rather than as explicit\n", "# lists of values. scipy's randint(low, high) excludes high, so max_depth\n", "# is drawn from 2..9 and min_samples_leaf from 1..99.\n", "\n", "# Run 200 search iterations\n", "cv = StratifiedKFold(n_splits=5, random_state=123, shuffle=True)\n", "\n", "model = DecisionTreeClassifier(random_state=123)\n", "random_search = RandomizedSearchCV(model, param_distributions=param_grid, n_iter=200, n_jobs=-1,\n", "                                   cv=cv, scoring='roc_auc', random_state=123)\n", "# After that, just call .fit()\n", "random_search.fit(X, y)" ] }, { "cell_type": "code", "execution_count": 208, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=8,\n", " max_features=None, max_leaf_nodes=None,\n", " min_impurity_decrease=0.0, min_impurity_split=None,\n", " min_samples_leaf=95, min_samples_split=2,\n", " min_weight_fraction_leaf=0.0, presort=False, random_state=123,\n", " splitter='best')" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search.best_estimator_" ] }, { "cell_type": "code", "execution_count": 209, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6135310935440562" ] }, "execution_count": 209, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search.best_score_" ] }, { "cell_type": "code", "execution_count": 210, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'class_weight': None,\n", " 'criterion': 'gini',\n", " 'max_depth': 8,\n", " 'min_samples_leaf': 95}" ] }, "execution_count": 
210, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search.best_params_" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.0" }, "toc": { "base_numbering": 1, "nav_menu": { "height": "142px", "width": "252px" }, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }