<h1 style="text-align: center;"> Data Science/Machine Learning Code Walkthrough</h1>
<br/>

<img style="max-height: 80px; position: relative; left: -30px" src="./img/wharton-logo.png" alt="Wharton School logo"/>
<br/>

<h3 style="text-align: center; margin: 5px;">Fall 2018, OIDD314/662 </h3>
<h3 style="text-align: center; margin: 5px;">Alex P. Miller, Kartik Hosanagar</h3>

<h4 style="text-align: center; margin: 5px;">{alexmill,kartikh}@wharton.upenn.edu</h4>
<h4 style="text-align: center; margin: 5px;"><a href="https://twitter.com/alexpmil">@alexpmil</a>, <a href="https://twitter.com/khosanagar">@KHosanagar</a></h4>

<h4 style="text-align: center; margin: 5px; font-weight: normal;"><a href="https://github.com/alexmill/machine-learning-wharton">https://github.com/alexmill/machine-learning-wharton</a></h4>

---

Main goals:
- Understand basics of working with raw data in ML
- Understand what "machine learning" looks like in practice
- Get a sense of where fancy methods help and where they don't
- Give you a jumping-off point if you want to learn more

(I will be walking through the code for illustrative purposes, but I can't teach you how to program in 20 minutes!)

In [None]:
# Import basic functions

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from copy import deepcopy

pd.set_option('display.max_columns', 50)

# Dataset: Online Dating Profiles

This is a useful, publicly available dataset for demonstrating some common data science techniques ([data source](https://github.com/rudeboybert/JSE_OkCupid)). We'll build some toy examples here, but the methods/principles are easily generalizable to other datasets.

# Part 1: Basic Data Processing and Prediction


In [None]:
# Load in raw profiles
dating_data = pd.read_csv("./dating_data/profiles_sample.csv", index_col=0)
dating_data.head()

In [None]:
dating_data.shape

### Question: Can we predict a person's age from their profile characteristics?

In business contexts: similar methods can take somebody's profile on your website and predict whether they would be interested in your product.

In [None]:
# Let's use just these features to try to predict a person's age
# (I'm excluding variables like "kids", which might be dead giveaways.)
prof_cols = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'job', 'orientation', 'sex', 'smokes', 'speaks']
dating_data[prof_cols].head()

### But wait...
**Question:** How do we get a computer to "understand" a person's dating profile?

**Answer:** Math! (matrices, linear algebra).

In [None]:
# Most columns are "categorical"
# e.g., for whether or not someone drinks alcohol, they
# can choose from among the following categories:
dating_data.drinks.unique()

In [None]:
# To convert this data into a matrix, we will take each 
# category and convert it into a binary column:
dating_data.drinks.str.get_dummies().head(n=20)
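
To make the idea concrete, here is a minimal sketch on made-up data (the `toy` DataFrame and its values are invented for illustration and are not taken from the dataset). It uses `pd.get_dummies` to encode several categorical columns at once, which is one plausible way to build a feature matrix like the pre-processed one loaded later; the actual pre-processing may have differed.

```python
import pandas as pd

# Toy data, invented for illustration (not from the dating dataset)
toy = pd.DataFrame({
    "drinks": ["socially", "often", "not at all", "socially"],
    "smokes": ["no", "no", "yes", "no"],
})

# One binary column per (column, category) pair
toy_matrix = pd.get_dummies(toy, columns=["drinks", "smokes"], dtype=int)

print(toy_matrix.columns.tolist())
print(toy_matrix.values)
```

Each row ends up with exactly one `1` per original column, so the whole profile becomes a row of numbers a model can work with.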

In [None]:
# Note: data is often very messy
# Lots of work in data science is just cleaning/processing data

# Example:
dating_data.pets.unique()

In [None]:
# I've done the processing work ahead of time for
# the rest of the columns in the dataset

# Load in pre-processed data:
profile_features = pd.read_csv("./dating_data/profile_features.csv", index_col=0)
profile_features.head(n=10)

### Outcome variable: Age

In [None]:
# How to define outcome variable (age)?
age = dating_data.age
age.head()

In [None]:
_ = plt.hist(age)
_ = plt.title("Distribution of ages in dataset")

In [None]:
# In most applications, you probably don't need super
# fine precision, i.e., someone's exact age

# Here, we will "discretize" age into a categorical variable:

# Binary definition; i.e., "is 30 yrs old or younger"
age_30 = (age <= 30)
age_30.head()

In [None]:
# Categorical definition:

# Define bin boundaries
bins = [0,20,30,40,50,100]

# Use pd.cut function to bin the data
category = pd.cut(age,bins)
age_bins = category.apply(lambda x: str(x))
age_bins.head()

## The magic: "machine learning"!

In [None]:
# Building a basic logistic regression classifier
# using profile features to predict age

from sklearn.linear_model import LogisticRegression

age_logit = LogisticRegression()
age_logit.fit(profile_features, age_30)

In [None]:
logit_predictions = pd.DataFrame({
    "prediction": age_logit.predict(profile_features),
    "ground_truth": age_30
})

logit_predictions['correct'] = (logit_predictions.prediction == logit_predictions.ground_truth)
logit_predictions.head(n=10)

In [None]:
# We usually think of "True" as 1 and "False" as 0
logit_predictions.astype(int).head()

In [None]:
# Evaluate overall accuracy:
logit_accuracy = logit_predictions.correct.mean()
print("Logistic regression accuracy: {:.2f}%".format(logit_accuracy*100))

## Model comparison

We'll try making the same prediction, using different machine learning models:

- Logistic regression
- Decision tree
- Random forest

In [None]:
# Logistic regression
from sklearn.linear_model import LogisticRegression

age_logit = LogisticRegression()
age_logit.fit(profile_features, age_30)
round((age_logit.predict(profile_features)==age_30).mean()*100, 2)

In [None]:
# Decision Tree
from sklearn.tree import DecisionTreeClassifier

age_dt = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)
age_dt.fit(profile_features, age_30)
round((age_dt.predict(profile_features)==age_30).mean()*100, 2)

In [None]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

age_rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
age_rf.fit(profile_features, age_30)
round((age_rf.predict(profile_features)==age_30).mean()*100, 2)

## A few takeaways:

- Accuracy isn't *amazingly* better using a fancy method like random forest
- Fancy ML methods often only shine with truly *big data* (10k, 100k, 1m+ observations)
    - Not common in most organizations (outside Google, FB, Amazon, Twitter, etc.)
    - Lots of news is biased toward breakthroughs at these big companies... rarely relevant for business practitioners
- The code to run different algorithms is remarkably similar
    - With tools like Python/SciKit-Learn, ML coding is a commodity!


### Cross-validated Accuracy (skip for class)

If you know what cross-validation is, this is just a short demonstration of how to compare the various models using out-of-sample, cross-validated accuracy measures.
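
For readers new to the idea, the sketch below spells out roughly what a 5-fold cross-validation does for the accuracy metric: split the data into five parts, train on four, score on the fifth, and repeat. It uses synthetic data from `make_classification` as a stand-in for `profile_features` and `age_30`; the data, `random_state`, and `max_iter` values are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption: not the dating dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

fold_scores = []
splitter = StratifiedKFold(n_splits=5)
for train_idx, test_idx in splitter.split(X, y):
    # Fit on four folds, then score on the held-out fold
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append((clf.predict(X[test_idx]) == y[test_idx]).mean())

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean CV accuracy: {:.3f}".format(np.mean(fold_scores)))
```

Because every score comes from data the model did not see during fitting, the mean is an out-of-sample estimate rather than an in-sample one.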

In [None]:
from sklearn.model_selection import cross_validate

scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1_macro"
}

logit_clf = LogisticRegression()

scoring_obj = cross_validate(logit_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))

In [None]:
dt_clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)

scoring_obj = cross_validate(dt_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)

scoring_obj = cross_validate(rf_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))

# Part 2: Working with Text and Word Embeddings

How can we improve performance? One idea: use text inputs from user profiles.

In [None]:
dating_data[[c for c in dating_data.columns if c.startswith("essay")]].head()

## Using word embeddings on dating profiles

### Pre-processing

Working with text is messy and training vector models can take a long time. I've done essentially all the hard work ahead of time. Details on what I've done:

- Take all text input from users and identify all the unique words used
- Get embeddings of all words from a pre-trained word-embedding model
    - GloVe, [source here](https://github.com/3Top/word2vec-api#where-to-get-a-pretrained-models)
    - Trained on 6 billion tokens from Wikipedia and the Gigaword corpus
- Average the vectors of all the words used by a given user
- Save the output in its own file

Result below:

In [None]:
text_features = pd.read_csv("./dating_data/text_features.csv", index_col=0)
text_features.head()
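
The averaging step from the pre-processing list can be sketched as follows. This is a toy illustration, not the actual pre-processing code: the three-dimensional `toy_embeddings` are made up (real pre-trained vectors have far more dimensions), and the whitespace tokenization and zero-vector fallback for unknown words are assumptions.

```python
import numpy as np

# Tiny made-up 3-dimensional "embeddings" (assumption: real
# pre-trained vectors are loaded from a file and are much longer)
toy_embeddings = {
    "love":   np.array([0.9, 0.1, 0.0]),
    "hiking": np.array([0.2, 0.8, 0.1]),
    "coffee": np.array([0.1, 0.3, 0.7]),
}

def average_embedding(text, embeddings, dim=3):
    """Average the vectors of all known words in `text`."""
    words = text.lower().split()
    vectors = [embeddings[w] for w in words if w in embeddings]
    if not vectors:
        # No known words: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

profile_vector = average_embedding("I love hiking and coffee", toy_embeddings)
print(profile_vector)
```

Every user ends up with one fixed-length vector regardless of how much they wrote, which is what lets the text be fed to the same classifiers used in Part 1.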

In [None]:
# Using embedding of text data to predict age:

age_logit = LogisticRegression()
age_logit.fit(text_features, age_30)
(age_logit.predict(text_features)==age_30).mean()

In [None]:
# What happens if we combine the profile characteristics and text features?

combined_features = np.hstack((text_features.values, profile_features.values))

age_logit = LogisticRegression()
age_logit.fit(combined_features, age_30)
(age_logit.predict(combined_features)==age_30).mean()

In [None]:
# What about using fancy methods with fancy word embeddings?

age_rf = RandomForestClassifier(n_estimators=50, max_depth=40, min_samples_leaf=10)
age_rf.fit(text_features, age_30)
(age_rf.predict(text_features)==age_30).mean()

In [None]:
# BE WARY! This is "in-sample" fit; predictions on "out-of-sample"
# data are actually no better than logistic regression in this case
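
The gap between in-sample and out-of-sample accuracy is easy to reproduce. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption; it is not the dating dataset) and holds out 30% of the rows as a test set. The hyperparameters and `random_state` values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with some label noise (assumption)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

train_acc = rf.score(X_train, y_train)  # in-sample accuracy
test_acc = rf.score(X_test, y_test)     # out-of-sample accuracy
print("In-sample:     {:.3f}".format(train_acc))
print("Out-of-sample: {:.3f}".format(test_acc))
```

A flexible model like a random forest can largely memorize its training rows, so only the held-out number says anything about how it would do on new profiles.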

### Cross-validated accuracy scores (skip for class)

In [None]:
logit_clf = LogisticRegression()
scoring_obj = cross_validate(logit_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))

In [None]:
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=40, min_samples_leaf=5)

scoring_obj = cross_validate(rf_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))

## Wrapping up

This code ([source](https://stackoverflow.com/questions/40428931/package-for-listing-version-of-packages-used-in-a-jupyter-notebook/49199019#49199019)) lists all required packages used in this notebook, making it easy to share this code to run in your own environment.

In [None]:

import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get root package, 
            # not just imported function
            name = val.__name__.split(".")[0]

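        elif isinstance(val, type):
            name = val.__module__.split(".")[0]

        else:
            # Skip ordinary variables; only modules and classes
            # tell us which packages were imported
            continue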

        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]

        yield name
imports = list(set(get_imports()))

# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names 
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name!="pip":
        requirements.append((m.project_name, m.version))

for r in requirements:
    print("{}=={}".format(*r))
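
On Python 3.8 and later, the standard library's `importlib.metadata` offers another way to look up versions when you already know the pip names. The following is a minimal sketch of that alternative, not a replacement for the approach above: the hard-coded `pip_names` list and the `pinned_requirements` helper are assumptions introduced here for illustration.

```python
from importlib.metadata import version, PackageNotFoundError

# Assumption: pip names of the packages this notebook imports
pip_names = ["numpy", "pandas", "matplotlib", "scikit-learn"]

def pinned_requirements(names):
    """Return 'name==version' lines for the packages that are installed."""
    lines = []
    for name in names:
        try:
            lines.append("{}=={}".format(name, version(name)))
        except PackageNotFoundError:
            # Skip anything not installed in the current environment
            continue
    return lines

for line in pinned_requirements(pip_names):
    print(line)
```

The trade-off is that the package list has to be maintained by hand, whereas the code above discovers imports automatically.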