{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Data Science/Machine Learning Code Walkthrough

\n", "
\n", "\n", "\"Wharton\n", "
\n", "\n", "

Fall 2018, OIDD314/662

\n", "

Alex P. Miller, Kartik Hosanagar

\n", "\n", "

{alexmill,kartikh}@wharton.upenn.edu

\n", "

@alexpmil, @KHosanagar

\n", "\n", "

https://github.com/alexmill/machine-learning-wharton

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Main goals:\n", "- Understand basics of working with raw data in ML\n", "- Understand what \"machine learning\" looks like in practice\n", "- Get a sense of where fancy methods help and where they don't\n", "- Give you a jumping off point if you want to learn more\n", "\n", "(I will be walking through the code for illustrative purposes, but I can't teach you how to program in 20 minutes!)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import basic functions\n", "\n", "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "from copy import deepcopy\n", "\n", "pd.set_option('display.max_columns', 50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset: Online Dating Profiles\n", "\n", "This is a useful, publicly available dataset for demonstrating some common data science techniques ([data source](https://github.com/rudeboybert/JSE_OkCupid)). We'll build some toy examples here, but the methods/principles are easily generalizable to other datasets.\n", "\n", "# Part 1: Basic Data Processing and Prediction\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load in raw profiles\n", "dating_data = pd.read_csv(\"./dating_data/profiles_sample.csv\", index_col=0)\n", "dating_data.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "dating_data.shape" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Question: Can we predict a person's age from their profile characteristics?\n", "\n", "In business contexts: similar methods can be used to use somebody's profile on your website to predict whether they would be interested in your product." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Let's use just these features to try to predict a person's age\n", "# (I'm excluding variables like \"kids\", which might be dead giveaways.)\n", "prof_cols = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'job', 'orientation', 'sex', 'smokes', 'speaks']\n", "dating_data[prof_cols].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### But wait...\n", "**Question:** How do we get a computer to \"understand\" a person's dating profile?\n", "\n", "**Answer:** Math! (matrices, linear algebra)." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Most columns are \"categorical\"\n", "# e.g., for whether or not someone drinks alcohol, they\n", "# can choose from among the following categories:\n", "dating_data.drinks.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To convert this data into a matrix, we will take each \n", "# category and convert it into a binary column:\n", "dating_data.drinks.str.get_dummies().head(n=20)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Note: data is often very messy\n", "# Lots of work in data science is just cleaning/processing data\n", "\n", "# Example:\n", "dating_data.pets.unique()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# I've done the processing work ahead of time for\n", "# the rest of the columns in the dataset\n", "\n", "# Load in pre-processed data:\n", "profile_features = pd.read_csv(\"./dating_data/profile_features.csv\", index_col=0)\n", "profile_features.head(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Outcome variable: Age" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# How to define outcome variable (age)?\n", "age = dating_data.age\n", "age.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = plt.hist(age)\n", "_ = plt.title(\"Distribution of ages in dataset\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# In most applications, you probably don't need super\n", "# fine precision, i.e., someone's exact age\n", "\n", "# Here, we wil \"discretize\" age into a categorical variable:\n", "\n", "# Binary definition; i.e., \"is 30 yrs old or younger\"\n", "age_30 = (age <= 30)\n", "age_30.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Categorical definition:\n", "\n", "# Define bin boundaries\n", "bins = [0,20,30,40,50,100]\n", "\n", "# Use pd.cut function to bin the data\n", "category = pd.cut(age,bins)\n", "age_bins = category.apply(lambda x: str(x))\n", "age_bins.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The magic: \"machine learning\"!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Building a basic logistic regression classifier\n", "# using profile features to predict age\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(profile_features, age_30)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_predictions = pd.DataFrame({\n", " \"prediction\": age_logit.predict(profile_features),\n", " \"ground_truth\": age_30\n", "})\n", "\n", "logit_predictions['correct'] = (logit_predictions.prediction == logit_predictions.ground_truth)\n", "logit_predictions.head(n=10)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We usually think of \"True\" as 1 and \"False\" as 0\n", "logit_predictions.astype(int).head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Evaluate overall accuracy:\n", "logit_accuracy = logit_predictions.correct.mean()\n", "print(\"Logistic regression accuracy: {:.2f}%\".format(logit_accuracy*100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model comparison\n", "\n", "We'll try making the same prediction, using different machine learning models:\n", "\n", "- Logistic regression\n", "- Decision tree\n", "- Random forest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Logistic regression\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(profile_features, age_30)\n", "round((age_logit.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Decision Tree\n", "from sklearn.tree import DecisionTreeClassifier\n", "\n", "age_dt = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)\n", "age_dt.fit(profile_features, age_30)\n", "round((age_dt.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Random forest\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "age_rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)\n", "age_rf.fit(profile_features, age_30)\n", "round((age_rf.predict(profile_features)==age_30).mean()*100, 2)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A few takeaways:\n", "\n", "- Accuracy isn't *amazingly* better using fancy method like random forest\n", "- Fancy ML methods often only shine with truly *big data* (10k, 100k, 1m+ observations)\n", " - Not common in most organizations (outside Google, FB, Amazon, Twitter, etc.)\n", " - Lots of news is biased toward breakthroughs at these big comapnies... 
rarely relevant for business practitioners\n", "- The code to run different algorithms is remarkably similar\n", " - With tools like Python/SciKit-Learn, ML coding is a commodity!\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated Accuracy (skip for class)\n", "\n", "If you know what cross-validation is, this is just a short demonstration on how to compare the various models using out-of-sample, cross-validated accuracy measures." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import cross_validate\n", "\n", "scoring = {\n", " \"accuracy\": \"accuracy\",\n", " \"precision\": \"precision\",\n", " \"recall\": \"recall\",\n", " \"f1\": \"f1_macro\"\n", "}\n", "\n", "logit_clf = LogisticRegression()\n", "\n", "scoring_obj = cross_validate(logit_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dt_clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(dt_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(rf_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 2: Working with Text and Word Embeddings\n", "\n", "How can we improve performance? One idea: use text inputs from user profiles." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dating_data[[c for c in dating_data.columns if c.startswith(\"essay\")]].head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using word embeddings on dating profiles\n", "\n", "### Pre-processing\n", "\n", "Working with text is messy and training vector models can take a long time. I've done essentially all the hard work ahead of time. 
Details on what I've done:\n", "\n", "- Take all text input from users and identify all the unique words used\n", "- Get embeddings of all words from a pre-trained word-embedding model\n", " - GloVe, [source here](https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models)\n", " - Trained on roughly 6 billion tokens from Wikipedia and the Gigaword corpus\n", "- Average the vectors of all the words used by a given user (a rough sketch of this step appears a few cells below)\n", "- Save the output in its own file\n", "\n", "Result below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "text_features = pd.read_csv(\"./dating_data/text_features.csv\", index_col=0)\n", "text_features.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Using embeddings of the text data to predict age:\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(text_features, age_30)\n", "(age_logit.predict(text_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What happens if we combine the profile characteristics and text features?\n", "\n", "combined_features = np.hstack((text_features.values, profile_features.values))\n", "\n", "age_logit = LogisticRegression()\n", "age_logit.fit(combined_features, age_30)\n", "(age_logit.predict(combined_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# What about using fancy methods with fancy word embeddings?\n", "\n", "age_rf = RandomForestClassifier(n_estimators=50, max_depth=40, min_samples_leaf=10)\n", "age_rf.fit(text_features, age_30)\n", "(age_rf.predict(text_features)==age_30).mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# BE WARY! This is \"in-sample\" fit; predictions on \"out-of-sample\"\n", "# data are actually no better than logistic regression in this case" ] },
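{ "cell_type": "markdown", "metadata": {}, "source": [ "For reference, here is a rough sketch of the per-user averaging step described in the pre-processing notes above. This is *not* the exact code used to build `text_features.csv`; the `toy_glove` dictionary below is a stand-in for a real pre-trained GloVe model, which maps each word to a much longer vector (e.g., 50 or 300 dimensions)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rough sketch of the averaging step (illustrative only, not the actual preprocessing script)\n", "def embed_essays(text, embeddings, dim):\n", "    # `embeddings` is assumed to map lowercase words to numpy arrays of length `dim`\n", "    words = str(text).lower().split()\n", "    vectors = [embeddings[w] for w in words if w in embeddings]\n", "    if len(vectors) == 0:\n", "        return np.zeros(dim)\n", "    return np.mean(vectors, axis=0)\n", "\n", "# Tiny made-up \"embedding model\" just to show the mechanics:\n", "toy_glove = {\n", "    \"hiking\": np.array([0.1, 0.3]),\n", "    \"music\": np.array([0.5, -0.2]),\n", "}\n", "embed_essays(\"i love hiking and music\", toy_glove, dim=2)" ] },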
This is \"in-sample\" fit; predictions on \"out-of-sample\"\n", "# data are actually no better than logistic regression in this case" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Cross-validated accuracy scores (skip for class)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logit_clf = LogisticRegression()\n", "scoring_obj = cross_validate(logit_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rf_clf = RandomForestClassifier(n_estimators=100, max_depth=40, min_samples_leaf=5)\n", "\n", "scoring_obj = cross_validate(rf_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)\n", "for sc in scoring.keys():\n", " print(\"{: >10}: {:.3f}\".format(sc, scoring_obj[\"test_\"+sc].mean()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapping up\n", "\n", "This code ([source](https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook/49199019#49199019)) lists all required packages used in this notebook, making it easy to share this code to run in your own environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "import pkg_resources\n", "import types\n", "def get_imports():\n", " for name, val in globals().items():\n", " if isinstance(val, types.ModuleType):\n", " # Split ensures you get root package, \n", " # not just imported function\n", " name = val.__name__.split(\".\")[0]\n", "\n", " elif isinstance(val, type):\n", " name = val.__module__.split(\".\")[0]\n", "\n", " # Some packages are weird and have different\n", " # imported names vs. system/pip names. Unfortunately,\n", " # there is no systematic way to get pip names from\n", " # a package's imported name. You'll have to had\n", " # exceptions to this list manually!\n", " poorly_named_packages = {\n", " \"PIL\": \"Pillow\",\n", " \"sklearn\": \"scikit-learn\"\n", " }\n", " if name in poorly_named_packages.keys():\n", " name = poorly_named_packages[name]\n", "\n", " yield name\n", "imports = list(set(get_imports()))\n", "\n", "# The only way I found to get the version of the root package\n", "# from only the name of the package is to cross-check the names \n", "# of installed packages vs. imported packages\n", "requirements = []\n", "for m in pkg_resources.working_set:\n", " if m.project_name in imports and m.project_name!=\"pip\":\n", " requirements.append((m.project_name, m.version))\n", "\n", "for r in requirements:\n", " print(\"{}=={}\".format(*r))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }