{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Data Science/Machine Learning Code Walkthrough

\n", "
\n", "\n", "\"Wharton\n", "
\n", "\n", "

Fall 2018, OIDD314/662

\n", "

Alex P. Miller, Kartik Hosanagar

\n", "\n", "

{alexmill,kartikh}@wharton.upenn.edu

\n", "

@alexpmil, @KHosanagar

\n", "\n", "

https://github.com/alexmill/machine-learning-wharton

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Main goals:\n", "- Understand basics of working with raw data in ML\n", "- Understand what \"machine learning\" looks like in practice\n", "- Get a sense of where fancy methods help and where they don't\n", "- Give you a jumping off point if you want to learn more\n", "\n", "(I will be walking through the code for illustrative purposes, but I can't teach you how to program in 20 minutes!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import basic functions\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from copy import deepcopy\n", "\n", "pd.set_option('display.max_columns', 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset: Online Dating Profiles\n", "\n", "This is a useful, publicly available dataset for demonstrating some common data science techniques ([data source](https://github.com/rudeboybert/JSE_OkCupid)). We'll build some toy examples here, but the methods/principles are easily generalizable to other datasets.\n", "\n", "# Part 1: Basic Data Processing and Prediction\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load in raw profiles\n", "dating_data = pd.read_csv(\"./dating_data/profiles_sample.csv\", index_col=0)\n", "dating_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dating_data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question: Can we predict a person's age from their profile characteristics?\n", "\n", "In business contexts: similar methods can be used to use somebody's profile on your website to predict whether they would be interested in your product." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Let's use just these features to try to predict a person's age\n", "# (I'm excluding variables like \"kids\", which might be dead giveaways.)\n", "prof_cols = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'job', 'orientation', 'sex', 'smokes', 'speaks']\n", "dating_data[prof_cols].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### But wait...\n", "**Question:** How do we get a computer to \"understand\" a person's dating profile?\n", "\n", "**Answer:** Math! (matrices, linear algebra)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Most columns are \"categorical\"\n", "# e.g., for whether or not someone drinks alcohol, they\n", "# can choose from among the following categories:\n", "dating_data.drinks.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To convert this data into a matrix, we will take each \n", "# category and convert it into a binary column:\n", "dating_data.drinks.str.get_dummies().head(n=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: data is often very messy\n", "# Lots of work in data science is just cleaning/processing data\n", "\n", "# Example:\n", "dating_data.pets.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# I've done the processing work ahead of time for\n", "# the rest of the columns in the dataset\n", "\n", "# Load in pre-processed data:\n", "profile_features = pd.read_csv(\"./dating_data/profile_features.csv\", index_col=0)\n", "profile_features.head(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outcome variable: Age" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# How to define outcome variable (age)?\n", "age = dating_data.age\n", "age.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = plt.hist(age)\n", "_ = plt.title(\"Distribution of ages in dataset\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In most applications, you probably don't need super\n", "# fine precision, i.e., someone's exact age\n", "\n", "# Here, we wil \"discretize\" age into a categorical variable:\n", "\n", "# Binary definition; i.e., \"is 30 yrs old or younger\"\n", "age_30 = (age <= 30)\n", "age_30.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Categorical definition:\n", "\n", "# Define bin boundaries\n", "bins = [0,20,30,40,50,100]\n", "\n", "# Use pd.cut function to bin the data\n", "category = pd.cut(age,bins)\n", "age_bins = category.apply(lambda x: str(x))\n", "age_bins.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The magic: \"machine learning\"!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Building a basic logistic regression classifier\n", "# using profile features to predict age\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(profile_features, age_30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_predictions = pd.DataFrame({\n", " \"prediction\": age_logit.predict(profile_features),\n", " \"ground_truth\": age_30\n", "})\n", "\n", "logit_predictions['correct'] = (logit_predictions.prediction == logit_predictions.ground_truth)\n", "logit_predictions.head(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We usually think of \"True\" as 1 and \"False\" as 0\n", "logit_predictions.astype(int).head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Evaluate overall accuracy:\n", "logit_accuracy = logit_predictions.correct.mean()\n", "print(\"Logistic regression accuracy: {:.2f}%\".format(logit_accuracy*100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model comparison\n", "\n", "We'll try making the same prediction, using different machine learning models:\n", "\n", "- Logistic regression\n", "- Decision tree\n", "- Random forest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(profile_features, age_30)\n", "round((age_logit.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Decision Tree\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "age_dt = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)\n", "age_dt.fit(profile_features, age_30)\n", "round((age_dt.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Random forest\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "age_rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)\n", "age_rf.fit(profile_features, age_30)\n", "round((age_rf.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A few takeaways:\n", "\n", "- Accuracy isn't *amazingly* better using fancy method like random forest\n", "- Fancy ML methods often only shine with truly *big data* (10k, 100k, 1m+ observations)\n", " - Not common in most organizations (outside Google, FB, Amazon, Twitter, etc.)\n", " - Lots of news is biased toward breakthroughs at these big comapnies... 
rarely relevant for business practitioners\n", "- The code to run different algorithms is remarkably similar\n", " - With tools like Python/SciKit-Learn, ML coding is a commodity!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated Accuracy (skip for class)\n", "\n", "If you know what cross-validation is, this is just a short demonstration on how to compare the various models using out-of-sample, cross-validated accuracy measures." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n", "\n", "scoring = {\n", " \"accuracy\": \"accuracy\",\n", " \"precision\": \"precision\",\n", " \"recall\": \"recall\",\n", " \"f1\": \"f1_macro\"\n", "}\n", "\n", "logit_clf = LogisticRegression()\n", "\n", "scoring_obj = cross_validate(logit_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dt_clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(dt_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(rf_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Working with Text and Word Embeddings\n", "\n", "How can we improve performance? One idea: use text inputs from user profiles." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dating_data[[c for c in dating_data.columns if c.startswith(\"essay\")]].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using word embeddings on dating profiles\n", "\n", "### Pre-processing\n", "\n", "Working with text is messy and training vector models can take a long time. I've done essentially all the hard work ahead of time. 
Details on what I've done:\n", "\n", "- Take all text input from users and identify all the unique words used\n", "- Get embeddings of all words from a pre-trained word-embedding model\n", " - GloVe, [source here](https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models)\n", " - Trained on roughly 6 billion tokens from Wikipedia and the Gigaword corpus\n", "- Average the vectors of all the words used by a given user (a rough sketch of this step appears a few cells below)\n", "- Save the output in its own file\n", "\n", "Result below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_features = pd.read_csv(\"./dating_data/text_features.csv\", index_col=0)\n", "text_features.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using embeddings of the text data to predict age:\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(text_features, age_30)\n", "(age_logit.predict(text_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What happens if we combine the profile characteristics and text features?\n", "\n", "combined_features = np.hstack((text_features.values, profile_features.values))\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(combined_features, age_30)\n", "(age_logit.predict(combined_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What about using fancy methods with fancy word embeddings?\n", "\n", "age_rf = RandomForestClassifier(n_estimators=50, max_depth=40, min_samples_leaf=10)\n", "age_rf.fit(text_features, age_30)\n", "(age_rf.predict(text_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# BE WARY! This is \"in-sample\" fit; predictions on \"out-of-sample\"\n", "# data are actually no better than logistic regression in this case" ] },
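{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is a rough sketch of the per-user averaging step described in the pre-processing notes above. This is *not* the exact code used to build `text_features.csv`; the `toy_glove` dictionary below is a stand-in for a real pre-trained GloVe model, which maps each word to a much longer vector (e.g., 50 or 300 dimensions)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rough sketch of the averaging step (illustrative only, not the actual preprocessing script)\n", "def embed_essays(text, embeddings, dim):\n", "    # `embeddings` is assumed to map lowercase words to numpy arrays of length `dim`\n", "    words = str(text).lower().split()\n", "    vectors = [embeddings[w] for w in words if w in embeddings]\n", "    if len(vectors) == 0:\n", "        return np.zeros(dim)\n", "    return np.mean(vectors, axis=0)\n", "\n", "# Tiny made-up \"embedding model\" just to show the mechanics:\n", "toy_glove = {\n", "    \"hiking\": np.array([0.1, 0.3]),\n", "    \"music\": np.array([0.5, -0.2]),\n", "}\n", "embed_essays(\"i love hiking and music\", toy_glove, dim=2)" ] },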
This is \"in-sample\" fit; predictions on \"out-of-sample\"\n", "# data are actually no better than logistic regression in this case" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated accuracy scores (skip for class)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_clf = LogisticRegression()\n", "scoring_obj = cross_validate(logit_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_clf = RandomForestClassifier(n_estimators=100, max_depth=40, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(rf_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping up\n", "\n", "This code ([source](https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook/49199019#49199019)) lists all required packages used in this notebook, making it easy to share this code to run in your own environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "import pkg_resources\n", "import types\n", "def get_imports():\n", " for name, val in globals().items():\n", " if isinstance(val, types.ModuleType):\n", " # Split ensures you get root package, \n", " # not just imported function\n", " name = val.__name__.split(\".\")[0]\n", "\n", " elif isinstance(val, type):\n", " name = val.__module__.split(\".\")[0]\n", "\n", " # Some packages are weird and have different\n", " # imported names vs. system/pip names. Unfortunately,\n", " # there is no systematic way to get pip names from\n", " # a package's imported name. You'll have to had\n", " # exceptions to this list manually!\n", " poorly_named_packages = {\n", " \"PIL\": \"Pillow\",\n", " \"sklearn\": \"scikit-learn\"\n", " }\n", " if name in poorly_named_packages.keys():\n", " name = poorly_named_packages[name]\n", "\n", " yield name\n", "imports = list(set(get_imports()))\n", "\n", "# The only way I found to get the version of the root package\n", "# from only the name of the package is to cross-check the names \n", "# of installed packages vs. imported packages\n", "requirements = []\n", "for m in pkg_resources.working_set:\n", " if m.project_name in imports and m.project_name!=\"pip\":\n", " requirements.append((m.project_name, m.version))\n", "\n", "for r in requirements:\n", " print(\"{}=={}\".format(*r))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }