{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Gmail Experiment: Data Analysis\n", "Author: Leon Yin and Surya Mattu
\n", "Links: [GitHub](https://github.com/the-markup/investigation-wheres-my-email/blob/master/notebooks/2-analysis.ipynb) | [nbviewer](https://nbviewer.jupyter.org/github/the-markup/investigation-wheres-my-email/blob/master/notebooks/2-analysis.ipynb) | [Story](https://themarkup.org/google-the-giant/2020/02/26/wheres-my-email)
\n", "\n", "This notebook shows how we aggregate the data we preprocessed in the [last notebook](https://github.com/the-markup/investigation-wheres-my-email/blob/master/notebooks/1-data-preprocessing.ipynb) ([see in nbviewer](https://github.com/the-markup/investigation-wheres-my-email/blob/master/notebooks/1-data-preprocessing.ipynb)). The tables in this notebook are featured in our [experiment](https://themarkup.org/google-the-giant/2020/02/26/show-your-work-wheres-my-email) and [story](https://themarkup.org/google-the-giant/2020/02/26/wheres-my-email)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import os\n", "\n", "import numpy as np\n", "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# inputs\n", "fn_input = '../data/intermediates/email_metadata_merged_with_newsletters.csv.gz'\n", "fn_newsletter_categories = '../data/input/newsletter_categories.csv'\n", "\n", "# outputs\n", "appedix_dir = '../data/output/appendix'\n", "table_dir = '../data/output/tables'\n", "\n", "for d in [appedix_dir, table_dir]:\n", " os.makedirs(d, exist_ok=True)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5134" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# read the dataset we preprocessed in the last notebook.\n", "df = pd.read_csv(fn_input)\n", "len(df)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall our dataset looks like this:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'X-GM-THRID': 1655113375086340206,\n", " 'X-Gmail-Labels': 'Spam,Category Updates,Unread',\n", " 'Date': '2020-01-07 23:18:33+00:00',\n", " 'Subject': 're: your 2020 membership',\n", " 'text': 'Become a card-carrying member of the Democratic Party by renewing your membership today.\\nHi --\\nWe’ve been talking about the importance of preparing for 2020 for months now, and it’s finally here. We’re in the thick of it. The Iowa caucuses are less than a month away, and before we know it we’ll have our Democratic nominee.\\nIt’s more clear now than ever: This is not the year to sit on the sidelines. As a member of our party, we know you are committed to defeating Trump and his allies -- and now is the time to prove it.\\nBecome a card-carrying member of the Democratic Party by making a $7 donation to renew your membership today.\\nRENEW MY MEMBERSHIP\\nIf we’re going to win up and down the ballot this year, we need you in this fight. I hope you’ll carry your 2020 DNC membership card with pride -- it’s a reflection of all you’ve done for our party as we prepare for the fight ahead.\\nRenew your membership today to receive your Official 2020 DNC Membership Card.\\nThanks as always,\\nLauren\\nLauren Williams\\nDirect Marketing Director\\nDemocratic National Committee\\nIf you no longer wish to receive emails from the DNC, submit this form to unsubscribe. If you’d only like to receive our most important messages, sign up to receive less email.\\nIf you’re ready to elect Democrats in all 50 states, make a contribution today.\\nContributions or gifts to the Democratic National Committee are not tax deductible. Paid for by the Democratic National Committee, www.Democrats.org, and not authorized by any candidate or candidate’s committee.',\n", " 'markup_id': '5a0fe9753c37a010ae045117cc77e261',\n", " 'Category Personal': 0,\n", " 'Category Promotions': 0,\n", " 'Category Updates': 1,\n", " 'Inbox': 0,\n", " 'Spam': 1,\n", " 'Primary': 0,\n", " 'From_Email': 'feedback@act.democrats.org',\n", " 'From_Domain': 'democrats.org',\n", " 'From_Name': 'DNC Headquarters',\n", " 'To_Email': 'ajcorrigan88+dnc@gmail.com',\n", " 'Name': 'DNC',\n", " 'Website': 'https://democrats.org/',\n", " 'Category': 'Advocacy organization or think tank',\n", " 'Entity_ID': 'dbc25ff0242fc30a08228f14e658ad48'}" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.iloc[-1].to_dict()" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Presidential candidate', 'House battleground campaign',\n", " 'Advocacy organization or think tank',\n", " 'House battleground official'], dtype=object)" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories = df.Category.unique()\n", "categories" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Appendix 1: Taking inventory\n", "How many emails did we get from each email sender? Which newsletters didn't send us anything?" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameEntity_IDEmails receivedDate of first emailDate of last emailNewsletter Category
0Sharice Davids0440493f2ad4348bc652f41cb1ff5d49152019-10-222020-02-06House battleground official
1Harley Rouda04dfd84c37ccdf9421ba355e25529adc582019-10-212020-02-10House battleground campaign
\n", "
" ], "text/plain": [ " Name Entity_ID Emails received \\\n", "0 Sharice Davids 0440493f2ad4348bc652f41cb1ff5d49 15 \n", "1 Harley Rouda 04dfd84c37ccdf9421ba355e25529adc 58 \n", "\n", " Date of first email Date of last email Newsletter Category \n", "0 2019-10-22 2020-02-06 House battleground official \n", "1 2019-10-21 2020-02-10 House battleground campaign " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = []\n", "for eID, _df in df.groupby('Entity_ID'):\n", " first_email = _df.Date.min().split()[0]\n", " last_email = _df.Date.max().split()[0]\n", " n_email = _df.markup_id.nunique()\n", " cat = _df.Category.iloc[0]\n", " name = _df.Name.iloc[0]\n", " \n", " row = {\n", " 'Name' : name,\n", " 'Entity_ID' : eID,\n", " 'Emails received' : n_email,\n", " 'Date of first email' : first_email,\n", " 'Date of last email' : last_email,\n", " 'Newsletter Category' : cat\n", " }\n", " \n", " data.append(row)\n", " \n", "emails_received = pd.DataFrame(data)\n", "emails_received.head(2)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NameWebsiteEmailCategoryEntity_ID
0AARPhttps://www.aarp.org/ajcorrigan88+aarp@gmail.comAdvocacy organization or think tank555b07f95e3dc347516af4e713ed190b
1Center for Responsive Politicshttp://www.opensecrets.org/ajcorrigan88+cfrp@gmail.comAdvocacy organization or think tank77751658e640d8c56dbda5d7501dfd5d
\n", "
" ], "text/plain": [ " Name Website \\\n", "0 AARP https://www.aarp.org/ \n", "1 Center for Responsive Politics http://www.opensecrets.org/ \n", "\n", " Email Category \\\n", "0 ajcorrigan88+aarp@gmail.com Advocacy organization or think tank \n", "1 ajcorrigan88+cfrp@gmail.com Advocacy organization or think tank \n", "\n", " Entity_ID \n", "0 555b07f95e3dc347516af4e713ed190b \n", "1 77751658e640d8c56dbda5d7501dfd5d " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# newsletters we signed up for and which alias we used.\n", "newsletter_categories = pd.read_csv(fn_newsletter_categories)\n", "newsletter_categories.head(2)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Presidential candidate:\n", "we signed up for 17\n", "got emails from 16\n", "\n", "House battleground campaign:\n", "we signed up for 71\n", "got emails from 44\n", "\n", "Advocacy organization or think tank:\n", "we signed up for 73\n", "got emails from 50\n", "\n", "House battleground official:\n", "we signed up for 70\n", "got emails from 61\n", "\n" ] } ], "source": [ "A1_output_cols = [\n", " 'Name', 'Website', \n", " 'Emails received',\n", " 'Date of first email', \n", " 'Date of last email',\n", "]\n", "# this prevents duplicate rows\n", "cols_to_use = emails_received.columns.difference(\n", " newsletter_categories.columns\n", ")\n", "\n", "no_shows = []\n", "for cat in categories:\n", " fn_out = os.path.join(\n", " appedix_dir, f\"A1-{cat.replace(' ', '-')}.csv\"\n", " )\n", " \n", " newsletter_categories_cat = newsletter_categories[\n", " newsletter_categories.Category == cat].set_index('Name')\n", " \n", " emails_received_cat = emails_received[\n", " emails_received['Newsletter Category'] == cat].set_index('Name')\n", "\n", " # here we combine all emails we opt-d in for, with all recieved\n", " email_tracker = newsletter_categories_cat.merge(emails_received_cat[cols_to_use],\n", " on= ['Name'], \n", " how='left') \\\n", " .sort_values(by= 'Name') \\\n", " .reset_index()\n", " \n", " # For email senders with no match, set filler values.\n", " email_tracker['Emails received'].fillna(0, inplace=True)\n", " for col in ['Date of first email', 'Date of last email']:\n", " email_tracker[col].fillna('-', inplace=True)\n", " \n", " # let's record who we didn't get!\n", " no_shows.extend(\n", " email_tracker[email_tracker['Emails received'] == 0][[\n", " 'Name', 'Category', 'Entity_ID'\n", " ]].to_dict(orient='records')\n", " )\n", " \n", " # save this for the appendix\n", " email_tracker[A1_output_cols].to_csv(fn_out, index=False)\n", " del fn_out\n", " \n", " # print some summary stats\n", " n_signed_up = len(newsletter_categories_cat)\n", " n_received = len(emails_received_cat)\n", " print(f\"{cat}:\\nwe signed up for {n_signed_up}\\n\"\n", " f\"got emails from {n_received}\\n\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get the stats for all senders:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Out of the 231 email senders we opted into we received 171 of them and didn't recieve 60.\n" ] } ], "source": [ "n_no_shows = len(no_shows)\n", "n_received = df.Entity_ID.nunique()\n", "n_subscribed = newsletter_categories.Entity_ID.nunique()\n", "\n", "print(f\"Out of the {n_subscribed} email senders we opted into \"\n", " f\"we received {n_received} of them and didn't recieve {n_no_shows}.\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# Who didn't send us emails?\n", "fn_out = os.path.join(appedix_dir, 'Amisc_email_senders_who_didnt_send.csv')\n", "pd.DataFrame(no_shows).sort_values('Name').to_csv(fn_out, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table 1: How Gmail classified our emails" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
CategoryPrimaryPromotionsSpamTotal EmailsSTD SpamSTD PromotionsSTD Primary
0Advocacy organization or think tank0.0884720.4638070.44772118650.3404190.3501580.280486
1House battleground campaign0.0071170.4946620.49822111240.3803900.3768410.178170
2House battleground official0.4478830.2491860.3029326140.3102480.2750130.302288
3Presidential candidate0.0640100.6381450.29784515310.3350970.3660650.179952
4All emails0.1063500.4968840.39676751340.3368990.3277920.280078
\n", "
" ], "text/plain": [ " Category Primary Promotions Spam \\\n", "0 Advocacy organization or think tank 0.088472 0.463807 0.447721 \n", "1 House battleground campaign 0.007117 0.494662 0.498221 \n", "2 House battleground official 0.447883 0.249186 0.302932 \n", "3 Presidential candidate 0.064010 0.638145 0.297845 \n", "4 All emails 0.106350 0.496884 0.396767 \n", "\n", " Total Emails STD Spam STD Promotions STD Primary \n", "0 1865 0.340419 0.350158 0.280486 \n", "1 1124 0.380390 0.376841 0.178170 \n", "2 614 0.310248 0.275013 0.302288 \n", "3 1531 0.335097 0.366065 0.179952 \n", "4 5134 0.336899 0.327792 0.280078 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fn_out = os.path.join(table_dir, 'T1-overview.csv')\n", "\n", "data = []\n", "for cat, _df in df.groupby('Category'):\n", " spam = []\n", " promos = []\n", " primary = []\n", " for user, __df in _df.groupby('To_Email'):\n", " n_emails = __df.markup_id.nunique()\n", " n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails\n", " n_promo = __df[(__df['Category Promotions'] == 1) &\n", " (__df['Spam'] != 1)].markup_id.nunique()/ n_emails\n", " n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails\n", " \n", " spam.append(n_spam)\n", " promos.append(n_promo)\n", " primary.append(n_primary)\n", " \n", " # this is to get the variance\n", " spam = np.std(spam)\n", " promos = np.std(promos)\n", " primary = np.std(primary)\n", " \n", " n_emails = _df.markup_id.nunique()\n", " n_spam = _df[_df['Spam'] == 1].markup_id.nunique() / n_emails\n", " n_promo = _df[(_df['Category Promotions'] == 1) &\n", " (_df['Spam'] != 1)].markup_id.nunique()/ n_emails\n", " n_primary = _df[_df['Primary'] == 1].markup_id.nunique()/ n_emails\n", " \n", " row = {\n", " 'Category' : cat,\n", " 'Primary' : n_primary,\n", " 'Promotions' : n_promo,\n", " 'Spam' : n_spam,\n", " 'Total Emails' : n_emails,\n", " 'STD Spam' : spam,\n", " 'STD Promotions' :promos,\n", " 'STD Primary' : primary\n", " }\n", " \n", " data.append(row)\n", " \n", "# Average for all\n", "spam = []\n", "promos = []\n", "primary = []\n", "\n", "for user, __df in df.groupby('To_Email'):\n", " n_emails = __df.markup_id.nunique()\n", " n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails\n", " n_promo = __df[(__df['Category Promotions'] == 1) &\n", " (__df['Spam'] != 1)].markup_id.nunique()/ n_emails\n", " n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails\n", "\n", " spam.append(n_spam)\n", " promos.append(n_promo)\n", " primary.append(n_primary)\n", "\n", "spam = np.std(spam)\n", "promos = np.std(promos)\n", "primary = np.std(primary)\n", "\n", "n_emails = df.markup_id.nunique()\n", "n_spam = df[df['Spam'] == 1].markup_id.nunique() / n_emails\n", "n_promo = df[(df['Category Promotions'] == 1) &\n", " (df['Spam'] != 1)].markup_id.nunique()/ n_emails\n", "n_primary = df[df['Primary'] == 1].markup_id.nunique()/ n_emails\n", "\n", "row = {\n", " 'Category' : 'All emails',\n", " 'Primary' : n_primary,\n", " 'Promotions' : n_promo,\n", " 'Spam' : n_spam,\n", " 'Total Emails' : n_emails,\n", " 'STD Spam' : spam,\n", " 'STD Promotions' :promos,\n", " 'STD Primary' : primary\n", "}\n", "\n", "data.append(row)\n", "\n", "overview = pd.DataFrame(data)\n", "overview.to_csv(fn_out, index=False)\n", "del fn_out\n", "overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Table 2: Gmail classification statistics for each email sender" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "cat2stats = {}\n", "data_all = []\n", "for cat in categories:\n", " stats = []\n", " for eID, messages in df[df.Category == cat].groupby('Entity_ID'):\n", " name = messages.iloc[0].Name\n", " n_cat = messages.markup_id.nunique()\n", " n_pro = messages[(messages['Category Promotions'] == 1) &\n", " (messages['Spam'] != 1) ].markup_id.nunique()\n", " n_u = messages[messages['Category Updates'] == 1].markup_id.nunique()\n", " n_s = messages[messages['Spam'] == 1].markup_id.nunique()\n", " n_pri = messages[messages['Primary'] == 1].markup_id.nunique()\n", " \n", " start_date = messages.Date.min()\n", " end_date = messages.Date.max()\n", " website = messages.Website.iloc[0]\n", " \n", " row = {\n", " 'Entity_ID' : eID,\n", " 'Name' : name,\n", " 'Website' : website,\n", " 'n_promos' : n_pro,\n", " 'n_updates' : n_u,\n", " 'n_spam' : n_s,\n", " 'n_primary' : n_pri,\n", " 'start_date' : start_date,\n", " 'end_date' : end_date,\n", " 'n_messages' : len(messages)\n", " }\n", " stats.append(row)\n", " \n", " # assign all sender stats to the category\n", " cat2stats[cat] = stats\n", " # add all sender stats to the list of everything\n", " data_all.extend(stats)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# names of columns we're using\n", "metrics = [ 'n_primary', 'n_promos', 'n_spam']\n", "ratios = [col.replace('n_', 'perc_') for col in metrics]\n", "\n", "# columns we want for our output table\n", "t2_cols = ['Name'] + ratios + ['n_messages']\n", "\n", "# rename these columns for the appendix\n", "col2col = {\n", " 'Name' : 'Name',\n", " 'perc_primary' : 'Primary',\n", " 'perc_promos' : 'Promotions',\n", " 'perc_spam' : 'Spam',\n", " 'n_messages' : 'Total Emails'\n", "}" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "def get_inbox_ratios(df):\n", " \"\"\"Gets some summary stats per sender and sorts by primary inbox\"\"\"\n", " df[ratios] = df[metrics].div(df['n_messages'], \n", " axis=0).round(4)\n", " output = df[t2_cols]\n", " output.columns = [col2col.get(c, c) for c in output.columns]\n", " output = output.set_index('Name').sort_values(\n", " by=['Primary', 'Total Emails'], ascending=False\n", " ).reset_index()\n", " \n", " return output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How is Gmail classifying emails from each of these presidential candidates?" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Presidential candidate\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrimaryPromotionsSpamTotal Emails
0Pete Buttigieg0.62790.09300.279143
1Andrew Yang0.46380.04350.492869
2Michael Bloomberg0.16670.83330.00006
3Julián Castro0.12040.74070.1389216
4Bill Weld0.02910.29130.6796103
5Tulsi Gabbard0.02220.77780.200045
6Bernie Sanders0.01800.96400.0180111
7Cory Booker0.01450.76330.2222207
8Amy Klobuchar0.00960.90710.0833312
9Kamala Harris0.00000.34960.6504123
10Joe Walsh0.00000.05490.945191
11Joe Biden0.00000.85710.142991
12Elizabeth Warren0.00000.84850.151566
13Beto O’Rourke0.00000.19570.804346
14Tom Steyer0.00001.00000.00001
15Donald J. Trump0.00000.00000.00000
\n", "
" ], "text/plain": [ " Name Primary Promotions Spam Total Emails\n", "0 Pete Buttigieg 0.6279 0.0930 0.2791 43\n", "1 Andrew Yang 0.4638 0.0435 0.4928 69\n", "2 Michael Bloomberg 0.1667 0.8333 0.0000 6\n", "3 Julián Castro 0.1204 0.7407 0.1389 216\n", "4 Bill Weld 0.0291 0.2913 0.6796 103\n", "5 Tulsi Gabbard 0.0222 0.7778 0.2000 45\n", "6 Bernie Sanders 0.0180 0.9640 0.0180 111\n", "7 Cory Booker 0.0145 0.7633 0.2222 207\n", "8 Amy Klobuchar 0.0096 0.9071 0.0833 312\n", "9 Kamala Harris 0.0000 0.3496 0.6504 123\n", "10 Joe Walsh 0.0000 0.0549 0.9451 91\n", "11 Joe Biden 0.0000 0.8571 0.1429 91\n", "12 Elizabeth Warren 0.0000 0.8485 0.1515 66\n", "13 Beto O’Rourke 0.0000 0.1957 0.8043 46\n", "14 Tom Steyer 0.0000 1.0000 0.0000 1\n", "15 Donald J. Trump 0.0000 0.0000 0.0000 0" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fn_out = os.path.join(table_dir, 'T2a-presidental-candidates.csv')\n", "cat = 'Presidential candidate'\n", "print(cat)\n", "\n", "df_cat = pd.DataFrame(cat2stats[cat])\n", "output = get_inbox_ratios(df_cat)\n", "\n", "# remove Trump rallies and add Trump with zero emails.\n", "output = output[output.Name != 'Trump rallies'].append(\n", " pd.DataFrame([{\n", " 'Name' : 'Donald J. Trump',\n", " 'Primary' : 0.0,\n", " 'Promotions' : 0.0,\n", " 'Spam' : 0.0,\n", " 'Total Emails' : 0\n", " }])).reset_index(drop=True)\n", "# output = publ?ication_ready(output)\n", "output.to_csv(fn_out, index=False)\n", "del fn_out\n", "output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also get this for the coalition of advocacy groups plus Change.org." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "coalition_plus_change = [\n", " 'SumOfUs',\n", " 'Democracy for America',\n", " 'CREDO Action',\n", " 'Change.org'\n", "]" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrimaryPromotionsSpamTotal Emails
0SumOfUs0.09760.90240.000041
1Change.org0.03850.84620.115426
2Democracy for America0.00000.60230.397788
3CREDO Action0.00000.72730.272744
\n", "
" ], "text/plain": [ " Name Primary Promotions Spam Total Emails\n", "0 SumOfUs 0.0976 0.9024 0.0000 41\n", "1 Change.org 0.0385 0.8462 0.1154 26\n", "2 Democracy for America 0.0000 0.6023 0.3977 88\n", "3 CREDO Action 0.0000 0.7273 0.2727 44" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fn_out = os.path.join(table_dir, 'T2b-coalition-groups.csv')\n", "cat = 'Advocacy organization or think tank'\n", "\n", "df_cat = pd.DataFrame(cat2stats[cat])\n", "df_cat = df_cat[df_cat.Name.isin(coalition_plus_change)]\n", "output = get_inbox_ratios(df_cat)\n", "output.to_csv(fn_out, index=False)\n", "del fn_out\n", "output" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Appendix 2\n", "The same stats as Table 2a and 2b, but for all the categories." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
NamePrimaryPromotionsSpamTotal Emails
0Antonio Delgado1.00.00.011
1Troy Balderson1.00.00.04
\n", "
" ], "text/plain": [ " Name Primary Promotions Spam Total Emails\n", "0 Antonio Delgado 1.0 0.0 0.0 11\n", "1 Troy Balderson 1.0 0.0 0.0 4" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "for cat in categories:\n", " fn_out = os.path.join(appedix_dir, f\"A2-{cat.replace(' ', '-')}.csv\")\n", " df_cat = pd.DataFrame(cat2stats[cat])\n", " output = get_inbox_ratios(df_cat)\n", " output.to_csv(fn_out, index=False)\n", " del fn_out\n", " \n", "output.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Misc Stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is the percentage of email senders that never got an email in primary?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we created the `cat2stats` dictionary, we also added each row to `data_all`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "171" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "email_sender_stats = pd.DataFrame(data_all)\n", "len(email_sender_stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's only look at senders who sent us at least two emails." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "149" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "senders = email_sender_stats[email_sender_stats.n_messages >= 2]\n", "len(senders)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What percentage these senders never got to the primary inbox?" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.4161073825503356" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(senders[senders.n_primary == 0]) / len(senders)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What percentage of Spam emails are also labelled as promotions?" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6921944035346097" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[(df['Category Promotions'] == 1) & (df['Spam'] == 1)]) / len(df[df['Spam'] == 1])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### How many emails _would_ have been in Updates if Updates was active?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "540" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[(df['Category Updates'] == 1) & (df['Spam'] != 1)])" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.1" } }, "nbformat": 4, "nbformat_minor": 4 }