{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Predicting Ad Spending on Snapchat\n", "\n", "## Cyril Gorlla\n", "### University of California, San Diego" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Summary of Findings\n", "\n", "\n", "### Introduction\n", "This project is a follow-up to the [previous](https://github.com/cgorlla/Snapchat-Ads/blob/master/Targeted%20Advertising%20in%20Snapchat%20Political%20Ads.ipynb) analysis of this dataset. The Snapchat ads dataset contains data on political ads run on Snapchat, one of the largest social media networks in the world. A key feature of the dataset is how much money an organization spends on a particular ad, found in the `Spend` column. It is reasonable to assume that this amount varies with certain factors, but can we use those factors to figure out how much is spent on an ad? We explore this by predicting ad spending with machine learning. Specifically, we fit a regression model that uses other columns of the dataset to predict the `Spend` column, our target variable. We evaluate the model with $R^2$, the coefficient of determination, which measures the fraction of the variance in `Spend` that the model's predictions account for and so tells us how well the model replicates the outcomes observed in the data.\n", "\n", "### Baseline Model\n", "For our initial model, we choose to include six features: `Impressions` (quantitative), `StartMonth` (ordinal), `StartDay` (ordinal), `EndMonth` (ordinal), `EndDay` (ordinal), and `PayingAdvertiserName` (nominal). The date-related features were chosen as there may be a correlation between the amount spent and the day or month on which the ad started or ended. The number of impressions is important as an ad with more impressions likely had a higher budget behind it. 
Lastly, the advertiser name (which was converted into one-hot encoding) may also be useful as certain advertisers may tend to spend more. In total, we have one quantitative, four ordinal, and one nominal feature. With this linear regression model, we achieve an $R^2$ of .66 on held-out test data, meaning the model explains about 66% of the variation in `Spend`. This is decent, but a large portion of the variation remains unexplained, so there is room for improvement.\n", "\n", "### Final Model\n", "We can improve our model by engineering two new features. First, we standardize the number of impressions by z-scoring, $z = (x - \\mu)/\\sigma$, where $\\mu$ is the mean and $\\sigma$ the standard deviation of the impressions. This yields data with a mean of 0 and a standard deviation of 1, making deviations from the mean number of impressions easier to see. Second, we use Pandas' datetime functionality to calculate the difference in time between when the ad started and when it ended, giving us the total duration of the ad. It's reasonable to assume that the longer an ad runs, the more was spent on it, so this should be a useful feature in our model. We remove the other date features in favor of this one. After trying various other regression models, it was determined that linear regression was still the best in terms of $R^2$, so it was chosen as the final model. `GridSearchCV` was used to determine the optimal parameters for the linear regression model; these were `fit_intercept = True` and `normalize = False`, which are also the defaults, so they were left unchanged. The remaining parameters (`copy_X` and `n_jobs`) do not affect the model's predictions. Our final model yielded an $R^2$ of .85, a large improvement over the baseline's .66. Note, however, that the code below computes this figure on the training split, whereas the baseline's .66 is computed on held-out test data, so the two numbers are not directly comparable.\n", "\n", "### Fairness Evaluation\n", "We now wish to ascertain how well our model performs on certain portions of the data. 
Specifically, we look at ads with a low number of impressions. We define \"low\" as at most 18,000 impressions, slightly below the 25th percentile (about 18,515). To determine how fair our model is to ads with low impressions vs. regular and higher impressions, we separate the dataset into two subsets: one with the low-impression ads as defined above and one with the rest of the ads. We then score the model on 500 random splits of each subset to get a clearer picture of its fairness to ads with low vs. other amounts of impressions. We use $\\alpha$ = 0.1, with the null hypothesis that the model treats both subsets the same in terms of $R^2$ and the alternative hypothesis that it treats ads with low impressions more poorly, with a lower $R^2$. The model has an average $R^2$ of .45 on the low subset but .73 on the rest of the data. Using the Kolmogorov–Smirnov test, we can compare the distributions of the model's scores on the two subsets. We obtain an extremely small p-value of $3.65 \\times 10^{-153}$, so we reject the null hypothesis and conclude that the model is likely unfair in that it performs worse on ads with lower impressions." 
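, "\n", "\n", "For reference, a sketch of what the test measures: the two-sample Kolmogorov–Smirnov statistic is the largest gap between the empirical distribution functions of the two sets of scores. The symbols $F_{\\text{low}}$ and $F_{\\text{rest}}$ are notation introduced here (an assumption for illustration) for the empirical CDFs of the $R^2$ scores on the low-impression subset and on the rest of the data:\n", "\n", "$$D = \\sup_x \\left| F_{\\text{low}}(x) - F_{\\text{rest}}(x) \\right|$$\n", "\n", "The value reported in the code below is $D = 0.832$, meaning that at some score threshold the two empirical CDFs differ by about 83 percentage points."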
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Code" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import os\n", "import pandas as pd\n", "import seaborn as sns\n", "from sklearn import *\n", "from sklearn.preprocessing import *\n", "from sklearn.linear_model import *\n", "from sklearn.metrics import *\n", "from sklearn.model_selection import *\n", "from sklearn.pipeline import *\n", "from sklearn.compose import *\n", "import scipy.stats\n", "%matplotlib inline\n", "%config InlineBackend.figure_format = 'retina' # Higher resolution figures" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Baseline Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#currency was converted and months/days were added in previous project\n", "sc = pd.read_csv('sc.csv').drop('Unnamed: 0',axis=1)\n", "#we drop null values, the only null values are in EndMonth/Day\n", "base = sc[['Impressions','StartMonth','StartDay','EndMonth','EndDay','PayingAdvertiserName','Spend']].dropna()\n", "#select the columns we want to use" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Numeric columns and associated transformers\n", "num_feat = ['Impressions']\n", "num_transformer = Pipeline(steps=[\n", " ('passthrough', FunctionTransformer(lambda x:x)) # passthrough\n", "])\n", "\n", "# Categorical columns and associated transformers\n", "cat_feat = ['StartMonth','StartDay','EndMonth','EndDay','PayingAdvertiserName']\n", "cat_transformer = Pipeline(steps=[\n", " ('intenc', OrdinalEncoder()), # converts to int\n", " ('onehot', OneHotEncoder()) # output from Ordinal becomes input to OneHot\n", "])\n", "\n", "# preprocessing pipeline (put them together)\n", "preproc = ColumnTransformer(transformers=[('num', num_transformer, num_feat), ('cat', cat_transformer, 
cat_feat)])\n", "\n", "pl = Pipeline(steps=[('regressor',LinearRegression())])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X = preproc.fit_transform(base.drop('Spend', axis=1)) #process the dataset\n", "y = base.Spend" ] }, { "cell_type": "code", "execution_count": 437, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.6611158867619407\n" ] } ], "source": [ "score = []\n", "for x in range(100):\n", " X_train, X_test, y_train, y_test = train_test_split(X, y)\n", " pl.fit(X_train, y_train)\n", " score.append(pl.score(X_test, y_test))\n", "print(np.mean(score)) #average R^2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Final Model" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Numeric columns and associated transformers\n", "num_feat = ['Impressions']\n", "num_transformer = Pipeline(steps=[\n", " ('scaler', StandardScaler()) # z-scale impressions\n", "])\n", "\n", "def get_hours(x):\n", "    # total ad duration in hours: elapsed time between start and end,\n", "    # not the difference between the two hour-of-day fields\n", "    duration = pd.to_datetime(x['EndDate']) - pd.to_datetime(x['StartDate'])\n", "    return pd.DataFrame(duration.dt.total_seconds() / 3600)\n", "dates = ['StartDate','EndDate']\n", "date_transformer = Pipeline(steps=[\n", " ('duration', FunctionTransformer(get_hours))\n", " #time duration\n", "])\n", "\n", "# Categorical columns and associated transformers\n", "cat_feat = ['PayingAdvertiserName']\n", "cat_transformer = Pipeline(steps=[\n", " ('intenc', OrdinalEncoder()), # converts to int\n", " ('onehot', OneHotEncoder()) # output from Ordinal becomes input to OneHot\n", "])\n", "\n", "# preprocessing pipeline (put them together)\n", "preproc2 = ColumnTransformer(transformers=[('num', num_transformer, num_feat), ('dates',date_transformer,dates), ('cat', cat_transformer, cat_feat)])\n", "\n", "pl2 = Pipeline(steps=[('regressor', LinearRegression())])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": 
[], "source": [ "#select features for new model\n", "improved = sc[['Impressions','StartDate','EndDate','PayingAdvertiserName','Spend']].dropna()" ] }, { "cell_type": "code", "execution_count": 420, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['memory', 'steps', 'verbose', 'regressor', 'regressor__copy_X', 'regressor__fit_intercept', 'regressor__n_jobs', 'regressor__normalize'])" ] }, "execution_count": 420, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#get parameters to optimize\n", "pl2.get_params().keys()" ] }, { "cell_type": "code", "execution_count": 423, "metadata": {}, "outputs": [], "source": [ "params = {'regressor__fit_intercept': [True, False], 'regressor__normalize': [True, False]}\n", "grids = GridSearchCV(pl2, param_grid=params, cv=5)" ] }, { "cell_type": "code", "execution_count": 424, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=Pipeline(steps=[('regressor', LinearRegression())]),\n", " param_grid={'regressor__fit_intercept': [True, False],\n", " 'regressor__normalize': [True, False]})" ] }, "execution_count": 424, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_tr, X_ts, y_tr, y_ts = train_test_split(X, y)\n", "grids.fit(X_tr, y_tr)" ] }, { "cell_type": "code", "execution_count": 425, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'regressor__fit_intercept': True, 'regressor__normalize': False}" ] }, "execution_count": 425, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grids.best_params_ #these are the defaults" ] }, { "cell_type": "code", "execution_count": 426, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7636818802834835" ] }, "execution_count": 426, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grids.best_score_" ] }, { "cell_type": "code", "execution_count": 532, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"0.8493351530307777\n" ] } ], "source": [ "X = preproc2.fit_transform(improved.drop('Spend', axis=1)) #process the dataset\n", "y = improved.Spend\n", "score = []\n", "for x in range(100):\n", " X_train, X_test, y_train, y_test = train_test_split(X, y)\n", " pl2.fit(X_train, y_train)\n", " score.append(pl2.score(X_train, y_train)) #scored on the training split, not on X_test\n", "print(np.mean(score)) #average training R^2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fairness Evaluation" ] }, { "cell_type": "code", "execution_count": 460, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 460, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 263, "width": 410 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.scatterplot(x=improved['Impressions'], y=improved['Spend']) #keyword arguments work across seaborn versions\n", "#There is some correlation between spend and impressions" ] }, { "cell_type": "code", "execution_count": 461, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Impressions</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>4.184000e+03</td></tr>\n",
"    <tr><th>mean</th><td>9.269965e+05</td></tr>\n",
"    <tr><th>std</th><td>5.490533e+06</td></tr>\n",
"    <tr><th>min</th><td>1.000000e+00</td></tr>\n",
"    <tr><th>25%</th><td>1.851525e+04</td></tr>\n",
"    <tr><th>50%</th><td>1.017510e+05</td></tr>\n",
"    <tr><th>75%</th><td>4.533338e+05</td></tr>\n",
"    <tr><th>max</th><td>2.349018e+08</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " Impressions\n", "count 4.184000e+03\n", "mean 9.269965e+05\n", "std 5.490533e+06\n", "min 1.000000e+00\n", "25% 1.851525e+04\n", "50% 1.017510e+05\n", "75% 4.533338e+05\n", "max 2.349018e+08" ] }, "execution_count": 461, "metadata": {}, "output_type": "execute_result" } ], "source": [ "improved[['Impressions']].describe()\n", "#what if we looked at ads with low impressions, below the 25%?" ] }, { "cell_type": "code", "execution_count": 536, "metadata": {}, "outputs": [], "source": [ "improved['Binarized'] = Binarizer(threshold=18000).fit_transform(improved[['Impressions']])" ] }, { "cell_type": "code", "execution_count": 555, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Impressions</th><th>StartDate</th><th>EndDate</th><th>PayingAdvertiserName</th><th>Spend</th><th>Binarized</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1183287</td><td>2019/09/27 12:29:18Z</td><td>2019/10/05 14:00:00Z</td><td>Federal National Council</td><td>4187.000000</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>190847</td><td>2019/03/20 13:00:00Z</td><td>2019/04/04 03:59:59Z</td><td>Ben &amp; Jerry's</td><td>1576.000000</td><td>1</td></tr>\n",
"    <tr><th>2</th><td>84687140</td><td>2019/10/23 13:00:00Z</td><td>2019/11/16 07:59:59Z</td><td>Recreational Equipment, Inc.</td><td>99361.000000</td><td>1</td></tr>\n",
"    <tr><th>3</th><td>2555940</td><td>2019/09/30 14:00:00Z</td><td>2020/06/29 03:59:00Z</td><td>truth</td><td>10360.000000</td><td>1</td></tr>\n",
"    <tr><th>4</th><td>323890</td><td>2019/06/03 07:00:00Z</td><td>2019/09/04 06:59:59Z</td><td>Plan International Canada</td><td>260.219623</td><td>1</td></tr>\n",
"    <tr><th>5</th><td>3231</td><td>2019/11/26 00:05:10Z</td><td>2019/11/26 23:00:00Z</td><td>HOPE not hate Charitable Trust</td><td>7.760268</td><td>0</td></tr>\n",
"    <tr><th>6</th><td>2762599</td><td>2019/11/12 13:11:17Z</td><td>2019/11/18 23:59:59Z</td><td>The Labour Party</td><td>6466.890193</td><td>1</td></tr>\n",
"    <tr><th>7</th><td>8779</td><td>2019/09/13 09:32:01Z</td><td>2019/09/14 09:32:01Z</td><td>NCDHD</td><td>23.000000</td><td>0</td></tr>\n",
"    <tr><th>8</th><td>2585</td><td>2019/12/13 00:29:58Z</td><td>2020/01/01 04:59:59Z</td><td>Warren for President</td><td>48.000000</td><td>0</td></tr>\n",
"    <tr><th>9</th><td>50946</td><td>2019/10/17 20:09:20Z</td><td>2019/11/06 00:00:00Z</td><td>ACRONYM</td><td>365.000000</td><td>1</td></tr>\n",
"  </tbody>\n",
"</table>
\n", "
" ], "text/plain": [ " Impressions StartDate EndDate \\\n", "0 1183287 2019/09/27 12:29:18Z 2019/10/05 14:00:00Z \n", "1 190847 2019/03/20 13:00:00Z 2019/04/04 03:59:59Z \n", "2 84687140 2019/10/23 13:00:00Z 2019/11/16 07:59:59Z \n", "3 2555940 2019/09/30 14:00:00Z 2020/06/29 03:59:00Z \n", "4 323890 2019/06/03 07:00:00Z 2019/09/04 06:59:59Z \n", "5 3231 2019/11/26 00:05:10Z 2019/11/26 23:00:00Z \n", "6 2762599 2019/11/12 13:11:17Z 2019/11/18 23:59:59Z \n", "7 8779 2019/09/13 09:32:01Z 2019/09/14 09:32:01Z \n", "8 2585 2019/12/13 00:29:58Z 2020/01/01 04:59:59Z \n", "9 50946 2019/10/17 20:09:20Z 2019/11/06 00:00:00Z \n", "\n", " PayingAdvertiserName Spend Binarized \n", "0 Federal National Council 4187.000000 1 \n", "1 Ben & Jerry's 1576.000000 1 \n", "2 Recreational Equipment, Inc. 99361.000000 1 \n", "3 truth 10360.000000 1 \n", "4 Plan International Canada 260.219623 1 \n", "5 HOPE not hate Charitable Trust 7.760268 0 \n", "6 The Labour Party 6466.890193 1 \n", "7 NCDHD 23.000000 0 \n", "8 Warren for President 48.000000 0 \n", "9 ACRONYM 365.000000 1 " ] }, "execution_count": 555, "metadata": {}, "output_type": "execute_result" } ], "source": [ "improved.head(10)" ] }, { "cell_type": "code", "execution_count": 538, "metadata": {}, "outputs": [], "source": [ "#separate data (Binarized is 0 for ads with at most 18,000 impressions)\n", "low = improved[improved['Binarized'] == 0]\n", "rest = improved[improved['Binarized'] == 1]" ] }, { "cell_type": "code", "execution_count": 541, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.4502236073057152\n" ] } ], "source": [ "#reuse the preprocessor fitted on the full dataset so the columns match the fitted model\n", "X_low = preproc2.transform(low.drop(['Spend', 'Binarized'], axis=1))\n", "y_low = low.Spend\n", "score_low = []\n", "for x in range(500):\n", " X_train, X_test, y_train, y_test = train_test_split(X_low, y_low)\n", " score_low.append(pl2.score(X_test, y_test))\n", "print(np.mean(score_low)) #low R^2" ] }, { "cell_type": "code", "execution_count": 542, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"0.7288819175587933\n" ] } ], "source": [ "#reuse the preprocessor fitted on the full dataset so the columns match the fitted model\n", "X_rest = preproc2.transform(rest.drop(['Spend', 'Binarized'], axis=1))\n", "y_rest = rest.Spend\n", "score_rest = []\n", "for x in range(500):\n", " X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest)\n", " score_rest.append(pl2.score(X_test, y_test))\n", "print(np.mean(score_rest)) #higher R^2" ] }, { "cell_type": "code", "execution_count": 558, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 558, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "image/png": { "height": 250, "width": 364 }, "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "sns.distplot(score_low)\n", "sns.distplot(score_rest)" ] }, { "cell_type": "code", "execution_count": 544, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Ks_2sampResult(statistic=0.832, pvalue=3.6552424432544507e-153)" ] }, "execution_count": 544, "metadata": {}, "output_type": "execute_result" } ], "source": [ "scipy.stats.ks_2samp(score_low,score_rest) #very low P-val, dist. are not similar" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 2 }