{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Data Science/Machine Learning Code Walkthrough

\n", "
\n", "\n", "Wharton\n", "
\n", "\n", "

Fall 2018, OIDD314/662

\n", "

Alex P. Miller, Kartik Hosanagar

\n", "\n", "

{alexmill,kartikh}@wharton.upenn.edu

\n", "

@alexpmil, @KHosanagar

\n", "\n", "

https://github.com/alexmill/machine-learning-wharton

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Main goals:\n", "- Understand the basics of working with raw data in ML\n", "- Understand what \"machine learning\" looks like in practice\n", "- Get a sense of where fancy methods help and where they don't\n", "- Give you a jumping-off point if you want to learn more\n", "\n", "(I will be walking through the code for illustrative purposes, but I can't teach you how to program in 20 minutes!)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# Import basic libraries\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from copy import deepcopy\n", "\n", "pd.set_option('display.max_columns', 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset: Online Dating Profiles\n", "\n", "This is a useful, publicly available dataset for demonstrating some common data science techniques ([data source](https://github.com/rudeboybert/JSE_OkCupid)). We'll build some toy examples here, but the methods/principles are easily generalizable to other datasets.\n", "\n", "# Part 1: Basic Data Processing and Prediction\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>body_type</th><th>diet</th><th>drinks</th><th>drugs</th><th>education</th><th>essay0</th><th>essay1</th><th>essay2</th><th>essay3</th><th>essay4</th><th>essay5</th><th>essay6</th><th>essay7</th><th>essay8</th><th>essay9</th><th>ethnicity</th><th>height</th><th>income</th><th>job</th><th>last_online</th><th>location</th><th>offspring</th><th>orientation</th><th>pets</th><th>religion</th><th>sex</th><th>sign</th><th>smokes</th><th>speaks</th><th>status</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>53518</th><td>62</td><td>thin</td><td>NaN</td><td>socially</td><td>never</td><td>graduated from space camp</td><td>i'm a very lighthearted person. i prefer laugh...</td><td>i'm making a point of not dwelling on what i m...</td><td>making friends out of strangers. i seem to be ...</td><td>my smile.</td><td>i prefer books that are about true events and ...</td><td>1. my family&lt;br /&gt;\\n2. food&lt;br /&gt;\\n3. my compu...</td><td>just what to think about.</td><td>NaN</td><td>it's not very private if you plan on admitting...</td><td>if you have read my profile and think we may b...</td><td>NaN</td><td>64.0</td><td>-1</td><td>retired</td><td>2012-05-25-22-35</td><td>hayward, california</td><td>has kids</td><td>straight</td><td>has dogs</td><td>christianity</td><td>f</td><td>leo but it doesn&amp;rsquo;t matter</td><td>sometimes</td><td>english</td><td>single</td></tr>\n",
"    <tr><th>24790</th><td>27</td><td>curvy</td><td>mostly vegetarian</td><td>socially</td><td>NaN</td><td>graduated from college/university</td><td>i love to travel, check out coffee shops, sals...</td><td>meeting people! making the most out of life......</td><td>having an open mind, hosting parties, going wi...</td><td>my smile :-)</td><td>books- eat pray love, kite runner, homage to c...</td><td>music&lt;br /&gt;\\nphone&lt;br /&gt;\\ninternet&lt;br /&gt;\\nhot ...</td><td>what i want to be when i grow up, the world we...</td><td>maybe perusing the streets of the city with my...</td><td>i like it when it's foggy and raining outside</td><td>you define success as happiness.&lt;br /&gt;\\n&lt;br /&gt;...</td><td>other</td><td>67.0</td><td>-1</td><td>banking / financial / real estate</td><td>2012-06-25-22-29</td><td>san francisco, california</td><td>doesn&amp;rsquo;t have kids</td><td>straight</td><td>likes dogs and has cats</td><td>agnosticism and somewhat serious about it</td><td>f</td><td>pisces</td><td>no</td><td>english (fluently), spanish (poorly), tagalog ...</td><td>single</td></tr>\n",
"    <tr><th>21691</th><td>28</td><td>average</td><td>NaN</td><td>socially</td><td>never</td><td>graduated from college/university</td><td>NaN</td><td>working as marketing consultant helping compan...</td><td>NaN</td><td>my eyes</td><td>NaN</td><td>good food, friends, family, books, travel and ...</td><td>everything</td><td>going to dinner with friends</td><td>NaN</td><td>you think we have a lot in common and want to ...</td><td>white</td><td>66.0</td><td>60000</td><td>sales / marketing / biz dev</td><td>2012-06-29-13-31</td><td>san francisco, california</td><td>NaN</td><td>straight</td><td>likes dogs</td><td>NaN</td><td>f</td><td>aries and it&amp;rsquo;s fun to think about</td><td>no</td><td>english</td><td>single</td></tr>\n",
"    <tr><th>18310</th><td>33</td><td>athletic</td><td>NaN</td><td>socially</td><td>never</td><td>graduated from college/university</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>you are cool - no pressure - like to sing - li...</td><td>white</td><td>71.0</td><td>-1</td><td>NaN</td><td>2011-10-12-09-49</td><td>san francisco, california</td><td>NaN</td><td>straight</td><td>NaN</td><td>other</td><td>m</td><td>scorpio</td><td>no</td><td>english</td><td>single</td></tr>\n",
"    <tr><th>36621</th><td>24</td><td>athletic</td><td>mostly vegetarian</td><td>socially</td><td>never</td><td>graduated from masters program</td><td>hey! i am sanket. about me? here it goes:&lt;br /...</td><td>i am trying to make it worth by having lots of...</td><td>well i am good at understanding people. i am g...</td><td>you will notice that i am kind of cute (atleas...</td><td>dont read books a lot. i like thrillers though...</td><td>my family.&lt;br /&gt;\\nsome of my friends.&lt;br /&gt;\\nm...</td><td>dont waste time in thinking too much. i am kin...</td><td>definitely outside my apartment somewhere. i h...</td><td>i am going to be a millionaire!</td><td>you like what you see... you can be sure\\nwhat...</td><td>NaN</td><td>70.0</td><td>-1</td><td>science / tech / engineering</td><td>2012-06-30-14-41</td><td>san carlos, california</td><td>doesn&amp;rsquo;t want kids</td><td>straight</td><td>NaN</td><td>hinduism but not too serious about it</td><td>m</td><td>sagittarius but it doesn&amp;rsquo;t matter</td><td>no</td><td>english (fluently), hindi (fluently)</td><td>single</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
\n", "
" ], "text/plain": [ " age body_type diet drinks drugs \\\n", "53518 62 thin NaN socially never \n", "24790 27 curvy mostly vegetarian socially NaN \n", "21691 28 average NaN socially never \n", "18310 33 athletic NaN socially never \n", "36621 24 athletic mostly vegetarian socially never \n", "\n",
" education \\\n", "53518 graduated from space camp \n", "24790 graduated from college/university \n", "21691 graduated from college/university \n", "18310 graduated from college/university \n", "36621 graduated from masters program \n", "\n",
" essay0 \\\n", "53518 i'm a very lighthearted person. i prefer laugh... \n", "24790 i love to travel, check out coffee shops, sals... \n", "21691 NaN \n", "18310 NaN \n", "36621 hey! i am sanket. about me? here it goes:<br /... \n", "\n",
" essay1 \\\n", "53518 i'm making a point of not dwelling on what i m... \n", "24790 meeting people! making the most out of life...... \n", "21691 working as marketing consultant helping compan... \n", "18310 NaN \n", "36621 i am trying to make it worth by having lots of... \n", "\n",
" essay2 \\\n", "53518 making friends out of strangers. i seem to be ... \n", "24790 having an open mind, hosting parties, going wi... \n", "21691 NaN \n", "18310 NaN \n", "36621 well i am good at understanding people. i am g... \n", "\n",
" essay3 \\\n", "53518 my smile. \n", "24790 my smile :-) \n", "21691 my eyes \n", "18310 NaN \n", "36621 you will notice that i am kind of cute (atleas... \n", "\n",
" essay4 \\\n", "53518 i prefer books that are about true events and ... \n", "24790 books- eat pray love, kite runner, homage to c... \n", "21691 NaN \n", "18310 NaN \n", "36621 dont read books a lot. i like thrillers though... \n", "\n",
" essay5 \\\n", "53518 1. my family<br />\\n2. food<br />\\n3. my compu... \n", "24790 music<br />\\nphone<br />\\ninternet<br />\\nhot ... \n", "21691 good food, friends, family, books, travel and ... \n", "18310 NaN \n", "36621 my family.<br />\\nsome of my friends.<br />\\nm... \n", "\n",
" essay6 \\\n", "53518 just what to think about. \n", "24790 what i want to be when i grow up, the world we... \n", "21691 everything \n", "18310 NaN \n", "36621 dont waste time in thinking too much. i am kin... \n", "\n",
" essay7 \\\n", "53518 NaN \n", "24790 maybe perusing the streets of the city with my... \n", "21691 going to dinner with friends \n", "18310 NaN \n", "36621 definitely outside my apartment somewhere. i h... \n", "\n",
" essay8 \\\n", "53518 it's not very private if you plan on admitting... \n", "24790 i like it when it's foggy and raining outside \n", "21691 NaN \n", "18310 NaN \n", "36621 i am going to be a millionaire! \n", "\n",
" essay9 ethnicity height \\\n", "53518 if you have read my profile and think we may b... NaN 64.0 \n", "24790 you define success as happiness.<br />\\n<br />... other 67.0 \n", "21691 you think we have a lot in common and want to ... white 66.0 \n", "18310 you are cool - no pressure - like to sing - li... white 71.0 \n", "36621 you like what you see... you can be sure\\nwhat... NaN 70.0 \n", "\n",
" income job last_online \\\n", "53518 -1 retired 2012-05-25-22-35 \n", "24790 -1 banking / financial / real estate 2012-06-25-22-29 \n", "21691 60000 sales / marketing / biz dev 2012-06-29-13-31 \n", "18310 -1 NaN 2011-10-12-09-49 \n", "36621 -1 science / tech / engineering 2012-06-30-14-41 \n", "\n",
" location offspring orientation \\\n", "53518 hayward, california has kids straight \n", "24790 san francisco, california doesn’t have kids straight \n", "21691 san francisco, california NaN straight \n", "18310 san francisco, california NaN straight \n", "36621 san carlos, california doesn’t want kids straight \n", "\n",
" pets religion sex \\\n", "53518 has dogs christianity f \n", "24790 likes dogs and has cats agnosticism and somewhat serious about it f \n", "21691 likes dogs NaN f \n", "18310 NaN other m \n", "36621 NaN hinduism but not too serious about it m \n", "\n",
" sign smokes \\\n", "53518 leo but it doesn’t matter sometimes \n", "24790 pisces no \n", "21691 aries and it’s fun to think about no \n", "18310 scorpio no \n", "36621 sagittarius but it doesn’t matter no \n", "\n",
" speaks status \n", "53518 english single \n", "24790 english (fluently), spanish (poorly), tagalog ... 
single \n", "21691 english single \n", "18310 english single \n", "36621 english (fluently), hindi (fluently) single " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Load the raw profiles\n", "dating_data = pd.read_csv(\"./dating_data/profiles_sample.csv\", index_col=0)\n", "dating_data.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(9627, 31)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dating_data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question: Can we predict a person's age from their profile characteristics?\n", "\n", "In business contexts, similar methods can take a visitor's profile on your website and predict whether they would be interested in your product." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>body_type</th><th>diet</th><th>drinks</th><th>drugs</th><th>education</th><th>location</th><th>job</th><th>orientation</th><th>sex</th><th>smokes</th><th>speaks</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>53518</th><td>thin</td><td>NaN</td><td>socially</td><td>never</td><td>graduated from space camp</td><td>hayward, california</td><td>retired</td><td>straight</td><td>f</td><td>sometimes</td><td>english</td></tr>\n",
"    <tr><th>24790</th><td>curvy</td><td>mostly vegetarian</td><td>socially</td><td>NaN</td><td>graduated from college/university</td><td>san francisco, california</td><td>banking / financial / real estate</td><td>straight</td><td>f</td><td>no</td><td>english (fluently), spanish (poorly), tagalog ...</td></tr>\n",
"    <tr><th>21691</th><td>average</td><td>NaN</td><td>socially</td><td>never</td><td>graduated from college/university</td><td>san francisco, california</td><td>sales / marketing / biz dev</td><td>straight</td><td>f</td><td>no</td><td>english</td></tr>\n",
"    <tr><th>18310</th><td>athletic</td><td>NaN</td><td>socially</td><td>never</td><td>graduated from college/university</td><td>san francisco, california</td><td>NaN</td><td>straight</td><td>m</td><td>no</td><td>english</td></tr>\n",
"    <tr><th>36621</th><td>athletic</td><td>mostly vegetarian</td><td>socially</td><td>never</td><td>graduated from masters program</td><td>san carlos, california</td><td>science / tech / engineering</td><td>straight</td><td>m</td><td>no</td><td>english (fluently), hindi (fluently)</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " body_type diet drinks drugs \\\n", "53518 thin NaN socially never \n", "24790 curvy mostly vegetarian socially NaN \n", "21691 average NaN socially never \n", "18310 athletic NaN socially never \n", "36621 athletic mostly vegetarian socially never \n", "\n", " education location \\\n", "53518 graduated from space camp hayward, california \n", "24790 graduated from college/university san francisco, california \n", "21691 graduated from college/university san francisco, california \n", "18310 graduated from college/university san francisco, california \n", "36621 graduated from masters program san carlos, california \n", "\n", " job orientation sex smokes \\\n", "53518 retired straight f sometimes \n", "24790 banking / financial / real estate straight f no \n", "21691 sales / marketing / biz dev straight f no \n", "18310 NaN straight m no \n", "36621 science / tech / engineering straight m no \n", "\n", " speaks \n", "53518 english \n", "24790 english (fluently), spanish (poorly), tagalog ... \n", "21691 english \n", "18310 english \n", "36621 english (fluently), hindi (fluently) " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Let's use just these features to try to predict a person's age\n", "# (I'm excluding variables like \"kids\", which might be dead giveaways.)\n", "prof_cols = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'job', 'orientation', 'sex', 'smokes', 'speaks']\n", "dating_data[prof_cols].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### But wait...\n", "**Question:** How do we get a computer to \"understand\" a person's dating profile?\n", "\n", "**Answer:** Math! (matrices, linear algebra)." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['socially', 'not at all', 'rarely', 'often', nan, 'very often',\n", " 'desperately'], dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Most columns are \"categorical\"\n", "# e.g., for whether or not someone drinks alcohol, they\n", "# can choose from among the following categories:\n", "dating_data.drinks.unique()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>desperately</th><th>not at all</th><th>often</th><th>rarely</th><th>socially</th><th>very often</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>53518</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>24790</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>21691</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>18310</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>36621</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>50834</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>7486</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>22184</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>11898</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>26979</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>4513</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>30374</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>48060</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>50583</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>38203</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>54866</th><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>43656</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>55574</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>48378</th><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>19199</th><td>0</td><td>0</td><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " desperately not at all often rarely socially very often\n", "53518 0 0 0 0 1 0\n", "24790 0 0 0 0 1 0\n", "21691 0 0 0 0 1 0\n", "18310 0 0 0 0 1 0\n", "36621 0 0 0 0 1 0\n", "50834 0 1 0 0 0 0\n", "7486 0 0 0 0 1 0\n", "22184 0 0 0 0 1 0\n", "11898 0 0 0 1 0 0\n", "26979 0 0 0 0 1 0\n", "4513 0 0 0 1 0 0\n", "30374 0 0 0 0 1 0\n", "48060 0 0 0 1 0 0\n", "50583 0 0 0 0 1 0\n", "38203 0 0 0 0 1 0\n", "54866 0 0 0 1 0 0\n", "43656 0 0 0 0 1 0\n", "55574 0 0 0 0 1 0\n", "48378 0 1 0 0 0 0\n", "19199 0 0 0 0 1 0" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# To convert this data into a matrix, we will take each \n", "# category and convert it into a binary column:\n", "dating_data.drinks.str.get_dummies().head(n=20)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['has dogs', 'likes dogs and has cats', 'likes dogs', nan,\n", " 'has cats', 'likes dogs and likes cats', 'has dogs and has cats',\n", " 'has dogs and dislikes cats', 'dislikes cats',\n", " 'has dogs and likes cats', 'dislikes dogs and likes cats',\n", " 'likes dogs and dislikes cats', 'likes cats',\n", " 'dislikes dogs and dislikes cats', 'dislikes dogs and has cats',\n", " 'dislikes dogs'], dtype=object)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note: data is often very messy\n", "# Lots of work in data science is just cleaning/processing data\n", "\n", "# Example:\n", "dating_data.pets.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>BODY_TYPE_alittleextra</th><th>BODY_TYPE_athletic</th><th>BODY_TYPE_average</th><th>BODY_TYPE_curvy</th><th>BODY_TYPE_fit</th><th>BODY_TYPE_fullfigured</th><th>BODY_TYPE_jacked</th><th>BODY_TYPE_overweight</th><th>BODY_TYPE_rathernotsay</th><th>BODY_TYPE_skinny</th><th>BODY_TYPE_thin</th><th>BODY_TYPE_usedup</th><th>DIET_anything</th><th>DIET_mostlyanything</th><th>DIET_mostlyhalal</th><th>DIET_mostlykosher</th><th>DIET_mostlyother</th><th>DIET_mostlyvegan</th><th>DIET_mostlyvegetarian</th><th>DIET_other</th><th>DIET_strictlyanything</th><th>DIET_strictlyhalal</th><th>DIET_strictlykosher</th><th>DIET_strictlyother</th><th>DIET_strictlyvegan</th><th>...</th><th>JOB_sciencetechengineering</th><th>JOB_student</th><th>JOB_transportation</th><th>JOB_unemployed</th><th>ORIENTATION_bisexual</th><th>ORIENTATION_gay</th><th>ORIENTATION_straight</th><th>SEX_f</th><th>SEX_m</th><th>SMOKES_no</th><th>SMOKES_sometimes</th><th>SMOKES_tryingtoquit</th><th>SMOKES_whendrinking</th><th>SMOKES_yes</th><th>SPEAKS_c++</th><th>SPEAKS_chinese</th><th>SPEAKS_english</th><th>SPEAKS_german</th><th>SPEAKS_hindi</th><th>SPEAKS_italian</th><th>SPEAKS_japanese</th><th>SPEAKS_latin</th><th>SPEAKS_spanish</th><th>SPEAKS_tagalog</th><th>SPEAKS_turkish</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>53518</th> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>1</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>1</td><td>0</td><td>0</td> <td>1</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>24790</th> <td>0</td><td>0</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>1</td><td>0</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>1</td><td>1</td><td>0</td></tr>\n",
"    <tr><th>21691</th> <td>0</td><td>0</td><td>1</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>1</td><td>0</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>18310</th> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>36621</th> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>1</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>50834</th> <td>0</td><td>0</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>1</td><td>0</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>7486</th> <td>0</td><td>0</td><td>0</td><td>0</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>1</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>22184</th> <td>0</td><td>0</td><td>0</td><td>0</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>11898</th> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>1</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>1</td><td>0</td><td>0</td></tr>\n",
"    <tr><th>26979</th> <td>0</td><td>0</td><td>1</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>...</td> <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td> <td>0</td><td>1</td><td>0</td><td>1</td><td>0</td> <td>0</td><td>0</td><td>1</td><td>0</td><td>0</td> <td>1</td><td>1</td><td>1</td><td>0</td><td>0</td> <td>1</td><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>10 rows × 215 columns</p>\n",
"</div>
" ], "text/plain": [ " BODY_TYPE_alittleextra BODY_TYPE_athletic BODY_TYPE_average \\\n", "53518 0 0 0 \n", "24790 0 0 0 \n", "21691 0 0 1 \n", "18310 0 1 0 \n", "36621 0 1 0 \n", "50834 0 0 0 \n", "7486 0 0 0 \n", "22184 0 0 0 \n", "11898 0 0 0 \n", "26979 0 0 1 \n", "\n", " BODY_TYPE_curvy BODY_TYPE_fit BODY_TYPE_fullfigured \\\n", "53518 0 0 0 \n", "24790 1 0 0 \n", "21691 0 0 0 \n", "18310 0 0 0 \n", "36621 0 0 0 \n", "50834 1 0 0 \n", "7486 0 1 0 \n", "22184 0 1 0 \n", "11898 0 0 0 \n", "26979 0 0 0 \n", "\n", " BODY_TYPE_jacked BODY_TYPE_overweight BODY_TYPE_rathernotsay \\\n", "53518 0 0 0 \n", "24790 0 0 0 \n", "21691 0 0 0 \n", "18310 0 0 0 \n", "36621 0 0 0 \n", "50834 0 0 0 \n", "7486 0 0 0 \n", "22184 0 0 0 \n", "11898 0 0 0 \n", "26979 0 0 0 \n", "\n", " BODY_TYPE_skinny BODY_TYPE_thin BODY_TYPE_usedup DIET_anything \\\n", "53518 0 1 0 0 \n", "24790 0 0 0 0 \n", "21691 0 0 0 0 \n", "18310 0 0 0 0 \n", "36621 0 0 0 0 \n", "50834 0 0 0 0 \n", "7486 0 0 0 1 \n", "22184 0 0 0 0 \n", "11898 0 0 0 0 \n", "26979 0 0 0 0 \n", "\n", " DIET_mostlyanything DIET_mostlyhalal DIET_mostlykosher \\\n", "53518 0 0 0 \n", "24790 0 0 0 \n", "21691 0 0 0 \n", "18310 0 0 0 \n", "36621 0 0 0 \n", "50834 0 0 0 \n", "7486 0 0 0 \n", "22184 0 0 0 \n", "11898 0 0 0 \n", "26979 0 0 0 \n", "\n", " DIET_mostlyother DIET_mostlyvegan DIET_mostlyvegetarian DIET_other \\\n", "53518 0 0 0 0 \n", "24790 0 0 1 0 \n", "21691 0 0 0 0 \n", "18310 0 0 0 0 \n", "36621 0 0 1 0 \n", "50834 0 0 0 0 \n", "7486 0 0 0 0 \n", "22184 0 0 0 0 \n", "11898 0 0 0 0 \n", "26979 0 0 0 0 \n", "\n", " DIET_strictlyanything DIET_strictlyhalal DIET_strictlykosher \\\n", "53518 0 0 0 \n", "24790 0 0 0 \n", "21691 0 0 0 \n", "18310 0 0 0 \n", "36621 0 0 0 \n", "50834 0 0 0 \n", "7486 0 0 0 \n", "22184 0 0 0 \n", "11898 0 0 0 \n", "26979 0 0 0 \n", "\n", " DIET_strictlyother DIET_strictlyvegan ... \\\n", "53518 0 0 ... \n", "24790 0 0 ... \n", "21691 0 0 ... \n", "18310 0 0 ... \n", "36621 0 0 ... 
\n", "50834 0 0 ... \n", "7486 0 0 ... \n", "22184 0 0 ... \n", "11898 0 0 ... \n", "26979 0 0 ... \n", "\n", " JOB_sciencetechengineering JOB_student JOB_transportation \\\n", "53518 0 0 0 \n", "24790 0 0 0 \n", "21691 0 0 0 \n", "18310 0 0 0 \n", "36621 1 0 0 \n", "50834 0 0 0 \n", "7486 0 1 0 \n", "22184 0 0 0 \n", "11898 0 0 0 \n", "26979 0 0 0 \n", "\n", " JOB_unemployed ORIENTATION_bisexual ORIENTATION_gay \\\n", "53518 0 0 0 \n", "24790 0 0 0 \n", "21691 0 0 0 \n", "18310 0 0 0 \n", "36621 0 0 0 \n", "50834 0 0 0 \n", "7486 0 0 0 \n", "22184 0 0 0 \n", "11898 0 0 0 \n", "26979 0 0 0 \n", "\n", " ORIENTATION_straight SEX_f SEX_m SMOKES_no SMOKES_sometimes \\\n", "53518 1 1 0 0 1 \n", "24790 1 1 0 1 0 \n", "21691 1 1 0 1 0 \n", "18310 1 0 1 1 0 \n", "36621 1 0 1 1 0 \n", "50834 1 1 0 1 0 \n", "7486 1 0 1 0 0 \n", "22184 1 0 1 1 0 \n", "11898 1 0 1 1 0 \n", "26979 1 0 1 0 0 \n", "\n", " SMOKES_tryingtoquit SMOKES_whendrinking SMOKES_yes SPEAKS_c++ \\\n", "53518 0 0 0 0 \n", "24790 0 0 0 0 \n", "21691 0 0 0 0 \n", "18310 0 0 0 0 \n", "36621 0 0 0 0 \n", "50834 0 0 0 0 \n", "7486 0 0 0 0 \n", "22184 0 0 0 0 \n", "11898 0 0 0 0 \n", "26979 0 1 0 0 \n", "\n", " SPEAKS_chinese SPEAKS_english SPEAKS_german SPEAKS_hindi \\\n", "53518 0 1 0 0 \n", "24790 0 1 0 0 \n", "21691 0 1 0 0 \n", "18310 0 1 0 0 \n", "36621 0 1 0 1 \n", "50834 0 1 0 0 \n", "7486 0 1 0 0 \n", "22184 0 1 0 0 \n", "11898 0 1 0 0 \n", "26979 1 1 1 0 \n", "\n", " SPEAKS_italian SPEAKS_japanese SPEAKS_latin SPEAKS_spanish \\\n", "53518 0 0 0 0 \n", "24790 0 0 0 1 \n", "21691 0 0 0 0 \n", "18310 0 0 0 0 \n", "36621 0 0 0 0 \n", "50834 0 0 0 0 \n", "7486 0 0 0 0 \n", "22184 0 0 0 0 \n", "11898 0 0 1 1 \n", "26979 0 1 0 0 \n", "\n", " SPEAKS_tagalog SPEAKS_turkish \n", "53518 0 0 \n", "24790 1 0 \n", "21691 0 0 \n", "18310 0 0 \n", "36621 0 0 \n", "50834 0 0 \n", "7486 0 0 \n", "22184 0 0 \n", "11898 0 0 \n", "26979 0 0 \n", "\n", "[10 rows x 215 columns]" ] }, "execution_count": 8, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "# I've done the processing work ahead of time for\n", "# the rest of the columns in the dataset\n", "\n", "# Load in pre-processed data:\n", "profile_features = pd.read_csv(\"./dating_data/profile_features.csv\", index_col=0)\n", "profile_features.head(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outcome variable: Age" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53518 62\n", "24790 27\n", "21691 28\n", "18310 33\n", "36621 24\n", "Name: age, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# How to define outcome variable (age)?\n", "age = dating_data.age\n", "age.head()" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYAAAAEICAYAAABWJCMKAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDIuMi4zLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvIxREBQAAGMFJREFUeJzt3X+UXGV9x/H3h4TfIEnIQkMS2KCpgp4S6RrjQS0FhQBq7KnYUNGU4oltQ9WKbQP+AFEqniq0nqO0UVICqBABSwqpGClUacuPDfIrRMoaAlkTkg3hNxYNfPvH8yzcbGZ3ZzezO2Sez+ucOXvnuc+993lmZucz97l37igiMDOz8uzS7AaYmVlzOADMzArlADAzK5QDwMysUA4AM7NCOQDMzArlACiQpH+S9LkGretgSc9KGpPv3yLpo41Yd17fv0ua16j1DWG7X5K0WdJjo73tekg6W9K3G7SuP5F0ayPWZTuXsc1ugDWWpLXAgcBW4EXgAeAyYFFEvAQQEX82hHV9NCJ+3F+diHgU2GfHWv3y9s4FXhcRp1bWf0Ij1j3EdkwFzgQOiYhNo739ekTE3zVju7Weo515O6XzHkBrem9E7AscAlwA/C1wSaM3IqlVP0AcAjz+an3zN2uYiPCthW7AWuBdfcpmAi8Bb8r3LwW+lKcnAtcDTwJbgJ+SPhhcnpf5FfAs8DdAOxDA6cCjwE8qZWPz+m4BvgzcATwFXAdMyPOOBrprtReYDfwa+E3e3j2V9X00T+8CfBZ4BNhE2rPZL8/rbce83LbNwGcGeJz2y8v35PV9Nq//XbnPL+V2XFpj2fH5MesBnsjTUyrzp+XH5hngx8A3gCsq82cB/50f83uAoyvz/gRYk5d9GPhQP+0/t3edw+j7/sAy4On8PH0RuLUy/x+BdXn+SuAduby/5+g0YHVu8xrgY5V11Xx95XkHAdfkx/Fh4OMDbce3EXi/aHYDfGvwE1ojAHL5o8Cf5+lLeSUAvgz8E
7Brvr0DUK11Vd5oLgP2BvakdgD8EnhTrnNN5Y3qaPoJgDz98ptaZf4tvBIAfwp0AYeShp2uBS7v07Zv5XYdAbwAHNbP43QZKZz2zcv+L3B6f+3ss+z+wB8Ce+Xlvw/8a2X+/wBfBXYD3k56I+19DCYDjwMnkgLn3fl+W368ngZen+tOAt7YTxvOZfsAqLfvVwJL8/belJ+vagCcmvs4ljQU9hiwxwDP0UnAawEBvwc8Dxw50Osr930l8Pn8OB1KCo/j+9uOb42/eQioHOuBCTXKf0N6ozkkIn4TET+N/B84gHMj4rmI+FU/8y+PiPsj4jngc8AHew8S76APARdGxJqIeBY4C5jbZyjqCxHxq4i4h/Tp+oi+K8lt+SPgrIh4JiLWAl8DPlxPIyLi8Yi4JiKej4hngPNJb3xIOhh4C/D5iPh1RNxK+rTd61RgeUQsj4iXImIF0EkKBMh7apL2jIgNEbGqvodmSH3/w9y+5yLifmBJn/5dkfu4NSK+BuwOvH6Ax+OGiPhFJP8J/Ij0Rg/9v77eArRFxHn5cVpDCrC5Q+iv7SAHQDkmk3bB+/p70qfqH0laI2lhHetaN4T5j5A++U2sq5UDOyivr7rusaSD3r2qZ+08T+0D1BNJnzr7rmtyPY2QtJekf5b0iKSnScM94/Kb60HAloh4vrJI9fE4BDhZ0pO9N9JewqQcmH8E/BmwQdINkt5QT5uyevreRnrM+j5H1f6dKWm1pKdy+/ZjgOdP0gmSbpO0Jdc/sVK/v9fXIcBBfR6Hs9n2ubQR5gAogKS3kN7ctjvVL38CPjMiDgXeC3xK0rG9s/tZ5WB7CFMr0weTPgVuBp4jDZv0tmsM6Q2p3vWuJ71xVNe9Fdg4yHJ9bc5t6ruuX9a5/JmkT8RvjYjXAO/M5QI2ABMk7VWpX3081pH2kMZVbntHxAUAEXFjRLyb9Kn556RPxY3UQ3rM+j5HqQPSO0gnDXwQGB8R40jHcpSrbPMcSdqdNMz3VeDAXH95b/0BXl/rgIf7PA77RsSJtbZjI8MB0MIkvUbSe0hjvldExH016rxH0uskiTT+/GK+QXpjPXQYmz5V0uH5TfA84OqIeJE0zr6HpJMk7Uo68Lp7ZbmNQLuk/l6X3wP+StI0SfsAfwdcFRFbh9K43JalwPmS9pV0CPAp4Io6V7Ev6UDxk5ImAOdU1v0IaUjnXEm7SXob6Y2v1xXAeyUdL2mMpD0kHS1piqQDJb1P0t6kMfxneeW5aIjc92tz+/aSdDjp4HG1b1tJQTFW0ueB11Tm932OdiM9hz3AVkknAMf1Vh7g9XUH8LSkv5W0Z34s3pQ/rNTajo0AP7it6d8kPUP6lPUZ4ELSmRq1TCedqfIs6eDlNyPiljzvy8Bn8y76p4ew/ctJB5ofA/YAPg4QEU8BfwF8m/Rp+zmgu7Lc9/PfxyXdVWO9i/O6f0I6a+T/gL8cQruq/jJvfw1pz+i7ef31+AfSwdbNwG3AD/vM/xDwNtLB3S8BV5He0ImIdcAc0nBHD+k5+mvS/+IupL2L9aThut8jPV6NdgZpeOgx0vP0L5V5NwL/TgrrR0iPcXW4aJvnKB8D+TgpUJ8A/phtj3nUfH3lIHovMIP0XG4mvS72q7WdHeyv9aP3bA8zGyGSrgJ+HhHnDFrZbBR5D8CswSS9RdJrJe0iaTbpE/+/NrtdZn216jc5zZrpt0jj7PuThrj+PCJ+1twmmW3PQ0BmZoXyEJCZWaFe1UNAEydOjPb29mY3w8xsp7Jy5crNEdE2WL1XdQC0t7fT2dnZ7GaYme1UJD0yeC0PAZmZFWvQAMjfVLxD0j2SVkn6Qi6fJul2SQ9JukrSbrl893y/K89vr6zrrFz+oKTjR6pTZmY2uHr2AF4AjomII0jf2pstaRbwFeCiiJhO+gbg6bn+6cATEfE64KJcj/yV87nAG0nX+/5mg64QaWZmwzBoAORLvD6b7/Ze0zuAY4Crc/kS4P15eg6vXF72auDYf
B2QOcCVEfFCRDxMukLgzIb0wszMhqyuYwD5Qk13k36FaQXwC+DJykW4unnlUrqTydcOyfOfIn0h5uXyGsuYmdkoqysAIuLFiJgBTCF9aj+sVrX8V/3M6698G5LmS+qU1NnT01NP88zMbBiGdBZQRDxJ+om+WaQfwOg9jXQK6QqGkD7ZT4WXfzR8P9KVDV8ur7FMdRuLIqIjIjra2gY9jdXMzIapnrOA2iSNy9N7kn40ezVwM/CBXG0e6fdVIV0Ktvf64h8A/iP/BNwy0s/37S5pGukysXc0qiNmZjY09XwRbBKwJJ+xswuwNCKul/QAcKWkLwE/Ay7J9S8BLpfURfrkPxcgIlZJWgo8QPrBiQX5muBmZtYEr+qLwXV0dMTO+E3g9oU3NG3bay84qWnbNrNXB0krI6JjsHr+JrCZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVatAAkDRV0s2SVktaJekTufxcSb+UdHe+nVhZ5ixJXZIelHR8pXx2LuuStHBkumRmZvUYW0edrcCZEXGXpH2BlZJW5HkXRcRXq5UlHQ7MBd4IHAT8WNJv59nfAN4NdAN3SloWEQ80oiNmZjY0gwZARGwANuTpZyStBiYPsMgc4MqIeAF4WFIXMDPP64qINQCSrsx1HQBmZk0wpGMAktqBNwO356IzJN0rabGk8blsMrCuslh3LuuvvO825kvqlNTZ09MzlOaZmdkQ1B0AkvYBrgE+GRFPAxcDrwVmkPYQvtZbtcbiMUD5tgURiyKiIyI62tra6m2emZkNUT3HAJC0K+nN/zsRcS1ARGyszP8WcH2+2w1MrSw+BVifp/srNzOzUVbPWUACLgFWR8SFlfJJlWp/ANyfp5cBcyXtLmkaMB24A7gTmC5pmqTdSAeKlzWmG2ZmNlT17AEcBXwYuE/S3bnsbOAUSTNIwzhrgY8BRMQqSUtJB3e3Agsi4kUASWcANwJjgMURsaqBfTEzsyGo5yygW6k9fr98gGXOB86vUb58oOXMzGz0+JvAZmaFcgCYmRXKAWBmVigHgJlZoRwAZmaFcgCYmRXKAWBmVigHgJlZoRwAZmaFcgCYmRXKAWBmVigHgJlZoRwAZmaFcgCYmRXKAWBmVigHgJlZoer6TWDbebQvvKEp2117wUlN2a6ZDZ/3AMzMCuUAMDMrlAPAzKxQDgAzs0I5AMzMCuUAMDMrlAPAzKxQDgAzs0I5AMzMCjVoAEiaKulmSaslrZL0iVw+QdIKSQ/lv+NzuSR9XVKXpHslHVlZ17xc/yFJ80auW2ZmNph69gC2AmdGxGHALGCBpMOBhcBNETEduCnfBzgBmJ5v84GLIQUGcA7wVmAmcE5vaJiZ2egbNAAiYkNE3JWnnwFWA5OBOcCSXG0J8P48PQe4LJLbgHGSJgHHAysiYktEPAGsAGY3tDdmZla3IR0DkNQOvBm4HTgwIjZACgnggFxtMrCuslh3LuuvvO825kvqlNTZ09MzlOaZmdkQ1B0AkvYBrgE+GRFPD1S1RlkMUL5tQcSiiOiIiI62trZ6m2dmZkNUVwBI2pX05v+diLg2F2/MQzvkv5tyeTcwtbL4FGD9AOVmZtYE9ZwFJOASYHVEXFiZtQzoPZNnHnBdpfwj+WygWcBTeYjoRuA4SePzwd/jcpmZmTVBPT8IcxTwYeA+SXfnsrOBC4Clkk4HHgVOzvOWAycCXcDzwGkAEbFF0heBO3O98yJiS0N6YWZmQzZoAETErdQevwc4tkb9ABb0s67FwOKhNNDMzEaGvwlsZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcA
GZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEcAGZmhXIAmJkVygFgZlYoB4CZWaEGDQBJiyVtknR/pexcSb+UdHe+nViZd5akLkkPSjq+Uj47l3VJWtj4rpiZ2VDUswdwKTC7RvlFETEj35YDSDocmAu8MS/zTUljJI0BvgGcABwOnJLrmplZk4wdrEJE/ERSe53rmwNcGREvAA9L6gJm5nldEbEGQNKVue4DQ26xmZk1xI4cAzhD0r15iGh8LpsMrKvU6c5l/ZVvR9J8SZ2SOnt6enageWZmNpDhBsDFwGuBGcAG4Gu5XDXqxgDl2xdGLIqIjojoaGtrG2bzzMxsMIMOAdUSERt7pyV9C7g+3+0GplaqTgHW5+n+ys3MrAmGtQcgaVLl7h8AvWcILQPmStpd0jRgOnAHcCcwXdI0SbuRDhQvG36zzcxsRw26ByDpe8DRwERJ3cA5wNGSZpCGcdYCHwOIiFWSlpIO7m4FFkTEi3k9ZwA3AmOAxRGxquG9MTOzutVzFtApNYovGaD++cD5NcqXA8uH1DozMxsx/iawmVmhHABmZoVyAJiZFcoBYGZWKAeAmVmhHABmZoVyAJiZFcoBYGZWqGFdC8isr/aFNzRlu2svOKkp2zVrBd4DMDMrlAPAzKxQDgAzs0I5AMzMCuUAMDMrVEufBdSsM1PMzHYG3gMwMyuUA8DMrFAOADOzQjkAzMwK5QAwMyuUA8DMrFAOADOzQjkAzMwK5QAwMyuUA8DMrFAOADOzQg0aAJIWS9ok6f5K2QRJKyQ9lP+Oz+WS9HVJXZLulXRkZZl5uf5DkuaNTHfMzKxe9ewBXArM7lO2ELgpIqYDN+X7ACcA0/NtPnAxpMAAzgHeCswEzukNDTMza45BAyAifgJs6VM8B1iSp5cA76+UXxbJbcA4SZOA44EVEbElIp4AVrB9qJiZ2Sga7jGAAyNiA0D+e0Aunwysq9TrzmX9lW9H0nxJnZI6e3p6htk8MzMbTKMPAqtGWQxQvn1hxKKI6IiIjra2toY2zszMXjHcANiYh3bIfzfl8m5gaqXeFGD9AOVmZtYkww2AZUDvmTzzgOsq5R/JZwPNAp7KQ0Q3AsdJGp8P/h6Xy8zMrEkG/UlISd8DjgYmSuomnc1zAbBU0unAo8DJufpy4ESgC3geOA0gIrZI+iJwZ653XkT0PbBsZmajaNAAiIhT+pl1bI26ASzoZz2LgcVDap2ZmY0YfxPYzKxQDgAzs0I5AMzMCuUAMDMrlAPAzKxQDgAzs0I5AMzMCuUAMDMrlAPAzKxQDgAzs0I5AMzMCjXotYDMXs3aF97QtG2vveCkpm3brBG8B2BmVigHgJlZoRwAZmaFcgCYmRXKAWBmVigHgJlZoRwAZmaFcgCYmRXKAWBmVigHgJlZoRwAZmaFcgCYmRXKAWBmVigHgJlZoXYoACStlXSfpLsldeayCZJWSHoo/x2fyyXp65K6JN0r6chGdMDMzIanEXsAvx8RMyKiI99fCNwUEdOBm/J9gBOA6fk2H7i4Ads2M7NhGokhoDnAkjy9BHh/pfyySG4DxkmaNALbNzOzOuxoAATwI0krJc3PZQdGxAaA/PeAXD4ZWFdZtjuXmZlZE+zoT0IeFRHrJR0ArJD08wHqqkZZbFcpBcl8gIMPPngHm2dmZv3ZoQCIiPX57yZJPwBmAhslTYqIDXmIZ1Ou3g1MrSw+BVhfY52LgEUAHR0d2wWE2atFs36P2L9FbI0y7CEgSXtL2rd3GjgOuB9YBszL1eYB1+XpZcBH8tlAs4CneoeKzMxs9O3IHsCBwA8k9a7nuxHxQ0l3AkslnQ48Cpyc6y8HTgS6gOeB03Zg22ZmtoOGHQARsQY4okb548CxNcoDWDDc7ZmZWWP5m8BmZoVyAJiZFcoBYGZWKAeAmVmhHABmZoVyAJiZFcoBYGZWKAeAmVmhdvRic
GY2ypp1DSLwdYhajfcAzMwK5QAwMyuUA8DMrFAOADOzQjkAzMwK5QAwMyuUA8DMrFD+HoCZ1c2/g9xavAdgZlYoB4CZWaEcAGZmhfIxADN71fP1j0aG9wDMzArlADAzK5QDwMysUA4AM7NC+SCwmdkAWvnLb94DMDMr1KgHgKTZkh6U1CVp4Whv38zMklENAEljgG8AJwCHA6dIOnw022BmZslo7wHMBLoiYk1E/Bq4Epgzym0wMzNG/yDwZGBd5X438NZqBUnzgfn57rOSHhyltgFMBDaP4vaaqaS+gvvbylqyr/pKv7Pq6e8h9WxjtANANcpimzsRi4BFo9OcbUnqjIiOZmx7tJXUV3B/W1lJfYXG9ne0h4C6gamV+1OA9aPcBjMzY/QD4E5guqRpknYD5gLLRrkNZmbGKA8BRcRWSWcANwJjgMURsWo02zCIpgw9NUlJfQX3t5WV1FdoYH8VEYPXMjOzluNvApuZFcoBYGZWqCIDQNJUSTdLWi1plaRP5PIJklZIeij/Hd/stjaCpD0k3SHpntzfL+TyaZJuz/29Kh+YbwmSxkj6maTr8/1W7utaSfdJultSZy5rydcygKRxkq6W9PP8P/y2VuyvpNfn57T39rSkTzayr0UGALAVODMiDgNmAQvyJSkWAjdFxHTgpny/FbwAHBMRRwAzgNmSZgFfAS7K/X0COL2JbWy0TwCrK/dbua8Avx8RMyrnh7fqaxngH4EfRsQbgCNIz3PL9TciHszP6Qzgd4HngR/QyL5GRPE34Drg3cCDwKRcNgl4sNltG4G+7gXcRfoG9mZgbC5/G3Bjs9vXoD5Oyf8YxwDXk76A2JJ9zf1ZC0zsU9aSr2XgNcDD5BNYWr2/lf4dB/xXo/ta6h7AyyS1A28GbgcOjIgNAPnvAc1rWWPlIZG7gU3ACuAXwJMRsTVX6SZdqqMV/APwN8BL+f7+tG5fIX2b/keSVuZLqUDrvpYPBXqAf8lDfN+WtDet299ec4Hv5emG9bXoAJC0D3AN8MmIeLrZ7RlJEfFipF3JKaSL8h1Wq9rotqrxJL0H2BQRK6vFNaru9H2tOCoijiRdZXeBpHc2u0EjaCxwJHBxRLwZeI4WGO4ZSD5e9T7g+41ed7EBIGlX0pv/dyLi2ly8UdKkPH8S6dNyS4mIJ4FbSMc+xknq/TJgq1yW4yjgfZLWkq42ewxpj6AV+wpARKzPfzeRxohn0rqv5W6gOyJuz/evJgVCq/YXUrDfFREb8/2G9bXIAJAk4BJgdURcWJm1DJiXp+eRjg3s9CS1SRqXp/cE3kU6cHYz8IFcrSX6GxFnRcSUiGgn7Tb/R0R8iBbsK4CkvSXt2ztNGiu+nxZ9LUfEY8A6Sa/PRccCD9Ci/c1O4ZXhH2hgX4v8JrCktwM/Be7jlXHis0nHAZYCBwOPAidHxJamNLKBJP0OsIR0+Y1dgKURcZ6kQ0mfkicAPwNOjYgXmtfSxpJ0NPDpiHhPq/Y19+sH+e5Y4LsRcb6k/WnB1zKApBnAt4HdgDXAaeTXNS3WX0l7kS6hf2hEPJXLGvbcFhkAZmZW6BCQmZk5AMzMiuUAMDMrlAPAzKxQDgAzs0I5AMzMCuUAMDMr1P8D0e+nX35w5FAAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "_ = plt.hist(age)\n", "_ = plt.title(\"Distribution of ages in dataset\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53518 False\n", "24790 True\n", "21691 True\n", "18310 False\n", "36621 True\n", "Name: age, dtype: bool" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# In most applications, you probably don't need super\n", "# fine precision, i.e., someone's exact age\n", "\n", "# Here, we wil \"discretize\" age into a categorical variable:\n", "\n", "# Binary definition; i.e., \"is 30 yrs old or younger\"\n", "age_30 = (age <= 30)\n", "age_30.head()" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53518 (50, 100]\n", "24790 (20, 30]\n", "21691 (20, 30]\n", "18310 (30, 40]\n", "36621 (20, 30]\n", "Name: age, dtype: category\n", "Categories (5, object): [(0, 20] < (20, 30] < (30, 40] < (40, 50] < (50, 100]]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Categorical definition:\n", "\n", "# Define bin boundaries\n", "bins = [0,20,30,40,50,100]\n", "\n", "# Use pd.cut function to bin the data\n", "category = pd.cut(age,bins)\n", "age_bins = category.apply(lambda x: str(x))\n", "age_bins.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The magic: \"machine learning\"!" 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,\n", " intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,\n", " penalty='l2', random_state=None, solver='liblinear', tol=0.0001,\n", " verbose=0, warm_start=False)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Building a basic logistic regression classifier\n", "# using profile features to predict age\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(profile_features, age_30)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictionground_truthcorrect
53518FalseFalseTrue
24790TrueTrueTrue
21691FalseTrueFalse
18310TrueFalseFalse
36621FalseTrueFalse
50834FalseFalseTrue
7486TrueFalseFalse
22184FalseFalseTrue
11898FalseFalseTrue
26979TrueTrueTrue
\n", "
" ], "text/plain": [ " prediction ground_truth correct\n", "53518 False False True\n", "24790 True True True\n", "21691 False True False\n", "18310 True False False\n", "36621 False True False\n", "50834 False False True\n", "7486 True False False\n", "22184 False False True\n", "11898 False False True\n", "26979 True True True" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logit_predictions = pd.DataFrame({\n", " \"prediction\": age_logit.predict(profile_features),\n", " \"ground_truth\": age_30\n", "})\n", "\n", "logit_predictions['correct'] = (logit_predictions.prediction == logit_predictions.ground_truth)\n", "logit_predictions.head(n=10)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
predictionground_truthcorrect
53518001
24790111
21691010
18310100
36621010
\n", "
" ], "text/plain": [ " prediction ground_truth correct\n", "53518 0 0 1\n", "24790 1 1 1\n", "21691 0 1 0\n", "18310 1 0 0\n", "36621 0 1 0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# We usually think of \"True\" as 1 and \"False\" as 0\n", "logit_predictions.astype(int).head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Logistic regression accuracy: 68.15%\n" ] } ], "source": [ "# Evaluate overall accuracy:\n", "logit_accuracy = logit_predictions.correct.mean()\n", "print(\"Logistic regression accuracy: {:.2f}%\".format(logit_accuracy*100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model comparison\n", "\n", "We'll try making the same prediction, using different machine learning models:\n", "\n", "- Logistic regression\n", "- Decision tree\n", "- Random forest" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "68.15" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(profile_features, age_30)\n", "round((age_logit.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "69.83" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Decision Tree\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "age_dt = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)\n", "age_dt.fit(profile_features, age_30)\n", "round((age_dt.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", 
"execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "70.77" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Random forest\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "age_rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)\n", "age_rf.fit(profile_features, age_30)\n", "round((age_rf.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A few takeaways:\n", "\n", "- Accuracy isn't *amazingly* better using a fancy method like random forest\n", "- Fancy ML methods often only shine with truly *big data* (10k, 100k, 1m+ observations)\n", "    - Not common in most organizations (outside Google, FB, Amazon, Twitter, etc.)\n", "    - Lots of news is biased toward breakthroughs at these big companies... rarely relevant for business practitioners\n", "- The code to run different algorithms is remarkably similar\n", "    - With tools like Python/scikit-learn, ML coding is a commodity!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated Accuracy (skip for class)\n", "\n", "If you know what cross-validation is, this is just a short demonstration of how to compare the various models using out-of-sample, cross-validated accuracy measures." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " accuracy: 0.668\n", " precision: 0.683\n", " recall: 0.713\n", " f1: 0.664\n" ] } ], "source": [ "from sklearn.model_selection import cross_validate\n", "\n", "scoring = {\n", " \"accuracy\": \"accuracy\",\n", " \"precision\": \"precision\",\n", " \"recall\": \"recall\",\n", " \"f1\": \"f1_macro\"\n", "}\n", "\n", "logit_clf = LogisticRegression()\n", "\n", "scoring_obj = cross_validate(logit_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " accuracy: 0.634\n", " precision: 0.653\n", " recall: 0.683\n", " f1: 0.630\n" ] } ], "source": [ "dt_clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(dt_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " accuracy: 0.668\n", " precision: 0.672\n", " recall: 0.749\n", " f1: 0.661\n" ] } ], "source": [ "rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(rf_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": 
[] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Working with Text and Word Embeddings\n", "\n", "How can we improve performance? One idea: use text inputs from user profiles." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
essay0essay1essay2essay3essay4essay5essay6essay7essay8essay9
53518i'm a very lighthearted person. i prefer laugh...i'm making a point of not dwelling on what i m...making friends out of strangers. i seem to be ...my smile.i prefer books that are about true events and ...1. my family<br />\\n2. food<br />\\n3. my compu...just what to think about.NaNit's not very private if you plan on admitting...if you have read my profile and think we may b...
24790i love to travel, check out coffee shops, sals...meeting people! making the most out of life......having an open mind, hosting parties, going wi...my smile :-)books- eat pray love, kite runner, homage to c...music<br />\\nphone<br />\\ninternet<br />\\nhot ...what i want to be when i grow up, the world we...maybe perusing the streets of the city with my...i like it when it's foggy and raining outsideyou define success as happiness.<br />\\n<br />...
21691NaNworking as marketing consultant helping compan...NaNmy eyesNaNgood food, friends, family, books, travel and ...everythinggoing to dinner with friendsNaNyou think we have a lot in common and want to ...
18310NaNNaNNaNNaNNaNNaNNaNNaNNaNyou are cool - no pressure - like to sing - li...
36621hey! i am sanket. about me? here it goes:<br /...i am trying to make it worth by having lots of...well i am good at understanding people. i am g...you will notice that i am kind of cute (atleas...dont read books a lot. i like thrillers though...my family.<br />\\nsome of my friends.<br />\\nm...dont waste time in thinking too much. i am kin...definitely outside my apartment somewhere. i h...i am going to be a millionaire!you like what you see... you can be sure\\nwhat...
\n", "
" ], "text/plain": [ " essay0 \\\n", "53518 i'm a very lighthearted person. i prefer laugh... \n", "24790 i love to travel, check out coffee shops, sals... \n", "21691 NaN \n", "18310 NaN \n", "36621 hey! i am sanket. about me? here it goes:
\\n2. food
\\n3. my compu... \n", "24790 music
\\nphone
\\ninternet
\\nhot ... \n", "21691 good food, friends, family, books, travel and ... \n", "18310 NaN \n", "36621 my family.
\\nsome of my friends.
\\nm... \n", "\n", " essay6 \\\n", "53518 just what to think about. \n", "24790 what i want to be when i grow up, the world we... \n", "21691 everything \n", "18310 NaN \n", "36621 dont waste time in thinking too much. i am kin... \n", "\n", " essay7 \\\n", "53518 NaN \n", "24790 maybe perusing the streets of the city with my... \n", "21691 going to dinner with friends \n", "18310 NaN \n", "36621 definitely outside my apartment somewhere. i h... \n", "\n", " essay8 \\\n", "53518 it's not very private if you plan on admitting... \n", "24790 i like it when it's foggy and raining outside \n", "21691 NaN \n", "18310 NaN \n", "36621 i am going to be a millionaire! \n", "\n", " essay9 \n", "53518 if you have read my profile and think we may b... \n", "24790 you define success as happiness.
\\n
... \n", "21691 you think we have a lot in common and want to ... \n", "18310 you are cool - no pressure - like to sing - li... \n", "36621 you like what you see... you can be sure\\nwhat... " ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dating_data[[c for c in dating_data.columns if c.startswith(\"essay\")]].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using word embeddings on dating profiles\n", "\n", "### Pre-processing\n", "\n", "Working with text is messy and training vector models can take a long time. I've done essentially all the hard work ahead of time. Details on what I've done:\n", "\n", "- Take all text input from users and identify all the unique words used\n", "- Get embeddings of all words from a pre-trained word-embedding model\n", "    - GloVe, [source here](https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models)\n", "    - Trained on 6 billion tokens from Wikipedia and the Gigaword corpus\n", "- Average the vectors of all the words used by a given user\n", "- Save the output in its own file\n", "\n", "Result below:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
12345678910111213141516171819202122232425...276277278279280281282283284285286287288289290291292293294295296297298299300
53518-0.039890-0.011629-0.115810-0.1623400.053081-0.008670-3.6163210.156632-0.019611-0.4386790.118975-0.076947-0.013487-0.101833-0.019616-0.086686-0.121745-0.0807350.0485040.0497630.131275-0.1057130.0204470.073554-0.075424...0.0061130.1179340.0650100.001210-0.0503610.557383-0.0575720.0575710.1486030.0066650.0914290.0026360.082663-0.079894-0.0463750.000297-0.066259-0.074132-0.0671820.0509710.042647-0.043682-0.0600140.053473-0.014896
24790-0.058771-0.028866-0.074170-0.097341-0.019257-0.002513-2.756003-0.057133-0.004204-0.3625550.181805-0.0167910.0183600.0070970.009290-0.032806-0.065018-0.0312230.0447160.0578830.083528-0.0595990.001198-0.034852-0.027820...0.0277630.072245-0.034236-0.005366-0.0521450.421351-0.0095060.0254520.0967890.0064310.102477-0.035993-0.012258-0.068204-0.0942260.039112-0.092943-0.058902-0.0276920.0571580.068052-0.041389-0.0793190.0473640.040582
21691-0.0039540.022642-0.143292-0.1534590.0698950.018823-3.710687-0.0156250.008424-0.4211690.083162-0.0184550.084521-0.172999-0.0301020.028515-0.084798-0.0528070.0475340.0515490.108513-0.1344340.0210670.144914-0.098754...0.0208420.0546970.002339-0.060946-0.0648250.382083-0.0867610.0700380.1511880.0112270.169332-0.0380400.098214-0.074552-0.0846180.026273-0.082545-0.095551-0.0422690.0977900.046094-0.1088590.0550970.158306-0.020009
18310-0.132708-0.037007-0.130888-0.2826700.1406510.141868-3.4968200.030340-0.127458-0.4920440.1696950.318739-0.108982-0.024348-0.114518-0.035411-0.1875750.086961-0.122467-0.0284960.0515180.115598-0.0282410.0035360.130245...-0.0797590.1911020.082972-0.011620-0.1425330.615899-0.1778810.0420160.217753-0.1033920.1473390.1611170.0074580.007311-0.1198260.055219-0.010782-0.181100-0.1515740.243818-0.018370-0.1730400.0391940.0674090.075485
36621-0.0650780.007312-0.098709-0.1374510.0407890.009936-3.5731690.0508550.018114-0.4519260.132201-0.0670890.079714-0.086178-0.021331-0.087012-0.115125-0.0343710.0353690.0634610.126464-0.1067760.0211120.047072-0.061139...0.0086260.0995380.0796640.011097-0.0518860.577595-0.0971790.0691240.1179980.0211030.1037210.0266850.126203-0.115861-0.047948-0.015657-0.106407-0.120128-0.0563580.0520790.057502-0.055104-0.0246850.0475490.020424
\n", "

5 rows × 300 columns

\n", "
" ], "text/plain": [ " 1 2 3 4 5 6 7 \\\n", "53518 -0.039890 -0.011629 -0.115810 -0.162340 0.053081 -0.008670 -3.616321 \n", "24790 -0.058771 -0.028866 -0.074170 -0.097341 -0.019257 -0.002513 -2.756003 \n", "21691 -0.003954 0.022642 -0.143292 -0.153459 0.069895 0.018823 -3.710687 \n", "18310 -0.132708 -0.037007 -0.130888 -0.282670 0.140651 0.141868 -3.496820 \n", "36621 -0.065078 0.007312 -0.098709 -0.137451 0.040789 0.009936 -3.573169 \n", "\n", " 8 9 10 11 12 13 14 \\\n", "53518 0.156632 -0.019611 -0.438679 0.118975 -0.076947 -0.013487 -0.101833 \n", "24790 -0.057133 -0.004204 -0.362555 0.181805 -0.016791 0.018360 0.007097 \n", "21691 -0.015625 0.008424 -0.421169 0.083162 -0.018455 0.084521 -0.172999 \n", "18310 0.030340 -0.127458 -0.492044 0.169695 0.318739 -0.108982 -0.024348 \n", "36621 0.050855 0.018114 -0.451926 0.132201 -0.067089 0.079714 -0.086178 \n", "\n", " 15 16 17 18 19 20 21 \\\n", "53518 -0.019616 -0.086686 -0.121745 -0.080735 0.048504 0.049763 0.131275 \n", "24790 0.009290 -0.032806 -0.065018 -0.031223 0.044716 0.057883 0.083528 \n", "21691 -0.030102 0.028515 -0.084798 -0.052807 0.047534 0.051549 0.108513 \n", "18310 -0.114518 -0.035411 -0.187575 0.086961 -0.122467 -0.028496 0.051518 \n", "36621 -0.021331 -0.087012 -0.115125 -0.034371 0.035369 0.063461 0.126464 \n", "\n", " 22 23 24 25 ... 276 277 \\\n", "53518 -0.105713 0.020447 0.073554 -0.075424 ... 0.006113 0.117934 \n", "24790 -0.059599 0.001198 -0.034852 -0.027820 ... 0.027763 0.072245 \n", "21691 -0.134434 0.021067 0.144914 -0.098754 ... 0.020842 0.054697 \n", "18310 0.115598 -0.028241 0.003536 0.130245 ... -0.079759 0.191102 \n", "36621 -0.106776 0.021112 0.047072 -0.061139 ... 
0.008626 0.099538 \n", "\n", " 278 279 280 281 282 283 284 \\\n", "53518 0.065010 0.001210 -0.050361 0.557383 -0.057572 0.057571 0.148603 \n", "24790 -0.034236 -0.005366 -0.052145 0.421351 -0.009506 0.025452 0.096789 \n", "21691 0.002339 -0.060946 -0.064825 0.382083 -0.086761 0.070038 0.151188 \n", "18310 0.082972 -0.011620 -0.142533 0.615899 -0.177881 0.042016 0.217753 \n", "36621 0.079664 0.011097 -0.051886 0.577595 -0.097179 0.069124 0.117998 \n", "\n", " 285 286 287 288 289 290 291 \\\n", "53518 0.006665 0.091429 0.002636 0.082663 -0.079894 -0.046375 0.000297 \n", "24790 0.006431 0.102477 -0.035993 -0.012258 -0.068204 -0.094226 0.039112 \n", "21691 0.011227 0.169332 -0.038040 0.098214 -0.074552 -0.084618 0.026273 \n", "18310 -0.103392 0.147339 0.161117 0.007458 0.007311 -0.119826 0.055219 \n", "36621 0.021103 0.103721 0.026685 0.126203 -0.115861 -0.047948 -0.015657 \n", "\n", " 292 293 294 295 296 297 298 \\\n", "53518 -0.066259 -0.074132 -0.067182 0.050971 0.042647 -0.043682 -0.060014 \n", "24790 -0.092943 -0.058902 -0.027692 0.057158 0.068052 -0.041389 -0.079319 \n", "21691 -0.082545 -0.095551 -0.042269 0.097790 0.046094 -0.108859 0.055097 \n", "18310 -0.010782 -0.181100 -0.151574 0.243818 -0.018370 -0.173040 0.039194 \n", "36621 -0.106407 -0.120128 -0.056358 0.052079 0.057502 -0.055104 -0.024685 \n", "\n", " 299 300 \n", "53518 0.053473 -0.014896 \n", "24790 0.047364 0.040582 \n", "21691 0.158306 -0.020009 \n", "18310 0.067409 0.075485 \n", "36621 0.047549 0.020424 \n", "\n", "[5 rows x 300 columns]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_features = pd.read_csv(\"./dating_data/text_features.csv\", index_col=0)\n", "text_features.head()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7411446972057755" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Using embedding of text data to predict 
age:\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(text_features, age_30)\n", "(age_logit.predict(text_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7643087150721928" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What happens if we combine the profile characteristics and text features?\n", "\n", "combined_features = np.hstack((text_features.values, profile_features.values))\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(combined_features, age_30)\n", "(age_logit.predict(combined_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9553339565804508" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What about using fancy methods with fancy word embeddings?\n", "\n", "age_rf = RandomForestClassifier(n_estimators=50, max_depth=40, min_samples_leaf=10)\n", "age_rf.fit(text_features, age_30)\n", "(age_rf.predict(text_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "# BE WARY! 
This is \"in-sample\" fit; predictions on \"out-of-sample\"\n", "# data are actually no better than logistic regression in this case" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated accuracy scores (skip for class)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " accuracy: 0.725\n", " precision: 0.730\n", " recall: 0.775\n", " f1: 0.721\n" ] } ], "source": [ "logit_clf = LogisticRegression()\n", "scoring_obj = cross_validate(logit_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " accuracy: 0.724\n", " precision: 0.710\n", " recall: 0.824\n", " f1: 0.716\n" ] } ], "source": [ "rf_clf = RandomForestClassifier(n_estimators=100, max_depth=40, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(rf_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping up\n", "\n", "This code ([source](https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook/49199019#49199019)) lists all required packages used in this notebook, making it easy to share this code to run in your own environment." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "scikit-learn==0.19.1\n", "pandas==0.23.4\n", "numpy==1.15.0\n", "matplotlib==2.2.3\n" ] } ], "source": [ "\n", "import pkg_resources\n", "import types\n", "def get_imports():\n", "    for name, val in globals().items():\n", "        if isinstance(val, types.ModuleType):\n", "            # Split ensures you get root package, \n", "            # not just imported function\n", "            name = val.__name__.split(\".\")[0]\n", "\n", "        elif isinstance(val, type):\n", "            name = val.__module__.split(\".\")[0]\n", "\n", "        # Some packages are weird and have different\n", "        # imported names vs. system/pip names. Unfortunately,\n", "        # there is no systematic way to get pip names from\n", "        # a package's imported name. You'll have to add\n", "        # exceptions to this list manually!\n", "        poorly_named_packages = {\n", "            \"PIL\": \"Pillow\",\n", "            \"sklearn\": \"scikit-learn\"\n", "        }\n", "        if name in poorly_named_packages.keys():\n", "            name = poorly_named_packages[name]\n", "\n", "        yield name\n", "imports = list(set(get_imports()))\n", "\n", "# The only way I found to get the version of the root package\n", "# from only the name of the package is to cross-check the names \n", "# of installed packages vs. 
imported packages\n", "requirements = []\n", "for m in pkg_resources.working_set:\n", "    if m.project_name in imports and m.project_name!=\"pip\":\n", "        requirements.append((m.project_name, m.version))\n", "\n", "for r in requirements:\n", "    print(\"{}=={}\".format(*r))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }