{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gmail Experiment: Data Analysis\n",
    "Author: Leon Yin and Surya Mattu<br>\n",
    "Links: [GitHub](https://github.com/the-markup/investigation-wheres-my-email/blob/master/notebooks/2-analysis.ipynb) | [nbviewer](https://nbviewer.jupyter.org/github/the-markup/investigation-wheres-my-email/blob/master/notebooks/2-analysis.ipynb) | [Story](https://themarkup.org/google-the-giant/2020/02/26/wheres-my-email) <br>\n",
    "\n",
    "This notebook shows how we aggregate the data we preprocessed in the [last notebook](https://github.com/the-markup/investigation-wheres-my-email/blob/master/notebooks/1-data-preprocessing.ipynb) ([see in nbviewer](https://github.com/the-markup/investigation-wheres-my-email/blob/master/notebooks/1-data-preprocessing.ipynb)). The tables in this notebook are featured in our [experiment](https://themarkup.org/google-the-giant/2020/02/26/show-your-work-wheres-my-email) and [story](https://themarkup.org/google-the-giant/2020/02/26/wheres-my-email)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "import os\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# inputs\n",
    "fn_input = '../data/intermediates/email_metadata_merged_with_newsletters.csv.gz'\n",
    "fn_newsletter_categories = '../data/input/newsletter_categories.csv'\n",
    "\n",
    "# outputs\n",
    "appedix_dir = '../data/output/appendix'\n",
    "table_dir = '../data/output/tables'\n",
    "\n",
    "for d in [appedix_dir, table_dir]:\n",
    "    os.makedirs(d, exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5134"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# read the dataset we preprocessed in the last notebook.\n",
    "df = pd.read_csv(fn_input)\n",
    "len(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Recall our dataset looks like this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'X-GM-THRID': 1655113375086340206,\n",
       " 'X-Gmail-Labels': 'Spam,Category Updates,Unread',\n",
       " 'Date': '2020-01-07 23:18:33+00:00',\n",
       " 'Subject': 're: your 2020 membership',\n",
       " 'text': 'Become a card-carrying member of the Democratic Party by renewing your membership today.\\nHi --\\nWe’ve been talking about the importance of preparing for 2020 for months now, and it’s finally here. We’re in the thick of it. The Iowa caucuses are less than a month away, and before we know it we’ll have our Democratic nominee.\\nIt’s more clear now than ever: This is not the year to sit on the sidelines. As a member of our party, we know you are committed to defeating Trump and his allies -- and now is the time to prove it.\\nBecome a card-carrying member of the Democratic Party by making a $7 donation to renew your membership today.\\nRENEW MY MEMBERSHIP\\nIf we’re going to win up and down the ballot this year, we need you in this fight. I hope you’ll carry your 2020 DNC membership card with pride -- it’s a reflection of all you’ve done for our party as we prepare for the fight ahead.\\nRenew your membership today to receive your Official 2020 DNC Membership Card.\\nThanks as always,\\nLauren\\nLauren Williams\\nDirect Marketing Director\\nDemocratic National Committee\\nIf you no longer wish to receive emails from the DNC, submit this form to unsubscribe. If you’d only like to receive our most important messages, sign up to receive less email.\\nIf you’re ready to elect Democrats in all 50 states, make a contribution today.\\nContributions or gifts to the Democratic National Committee are not tax deductible. Paid for by the Democratic National Committee, www.Democrats.org, and not authorized by any candidate or candidate’s committee.',\n",
       " 'markup_id': '5a0fe9753c37a010ae045117cc77e261',\n",
       " 'Category Personal': 0,\n",
       " 'Category Promotions': 0,\n",
       " 'Category Updates': 1,\n",
       " 'Inbox': 0,\n",
       " 'Spam': 1,\n",
       " 'Primary': 0,\n",
       " 'From_Email': 'feedback@act.democrats.org',\n",
       " 'From_Domain': 'democrats.org',\n",
       " 'From_Name': 'DNC Headquarters',\n",
       " 'To_Email': 'ajcorrigan88+dnc@gmail.com',\n",
       " 'Name': 'DNC',\n",
       " 'Website': 'https://democrats.org/',\n",
       " 'Category': 'Advocacy organization or think tank',\n",
       " 'Entity_ID': 'dbc25ff0242fc30a08228f14e658ad48'}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.iloc[-1].to_dict()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['Presidential candidate', 'House battleground campaign',\n",
       "       'Advocacy organization or think tank',\n",
       "       'House battleground official'], dtype=object)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "categories = df.Category.unique()\n",
    "categories"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Appendix 1: Taking inventory\n",
    "How many emails did we get from each email sender? Which newsletters didn't send us anything?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Entity_ID</th>\n",
       "      <th>Emails received</th>\n",
       "      <th>Date of first email</th>\n",
       "      <th>Date of last email</th>\n",
       "      <th>Newsletter Category</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Sharice Davids</td>\n",
       "      <td>0440493f2ad4348bc652f41cb1ff5d49</td>\n",
       "      <td>15</td>\n",
       "      <td>2019-10-22</td>\n",
       "      <td>2020-02-06</td>\n",
       "      <td>House battleground official</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Harley Rouda</td>\n",
       "      <td>04dfd84c37ccdf9421ba355e25529adc</td>\n",
       "      <td>58</td>\n",
       "      <td>2019-10-21</td>\n",
       "      <td>2020-02-10</td>\n",
       "      <td>House battleground campaign</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             Name                         Entity_ID  Emails received  \\\n",
       "0  Sharice Davids  0440493f2ad4348bc652f41cb1ff5d49               15   \n",
       "1    Harley Rouda  04dfd84c37ccdf9421ba355e25529adc               58   \n",
       "\n",
       "  Date of first email Date of last email          Newsletter Category  \n",
       "0          2019-10-22         2020-02-06  House battleground official  \n",
       "1          2019-10-21         2020-02-10  House battleground campaign  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = []\n",
    "for eID, _df in df.groupby('Entity_ID'):\n",
    "    first_email = _df.Date.min().split()[0]\n",
    "    last_email = _df.Date.max().split()[0]\n",
    "    n_email = _df.markup_id.nunique()\n",
    "    cat = _df.Category.iloc[0]\n",
    "    name = _df.Name.iloc[0]\n",
    "    \n",
    "    row = {\n",
    "        'Name' : name,\n",
    "        'Entity_ID' : eID,\n",
    "        'Emails received' : n_email,\n",
    "        'Date of first email' : first_email,\n",
    "        'Date of last email' : last_email,\n",
    "        'Newsletter Category' : cat\n",
    "    }\n",
    "    \n",
    "    data.append(row)\n",
    "    \n",
    "emails_received = pd.DataFrame(data)\n",
    "emails_received.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Website</th>\n",
       "      <th>Email</th>\n",
       "      <th>Category</th>\n",
       "      <th>Entity_ID</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AARP</td>\n",
       "      <td>https://www.aarp.org/</td>\n",
       "      <td>ajcorrigan88+aarp@gmail.com</td>\n",
       "      <td>Advocacy organization or think tank</td>\n",
       "      <td>555b07f95e3dc347516af4e713ed190b</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Center for Responsive Politics</td>\n",
       "      <td>http://www.opensecrets.org/</td>\n",
       "      <td>ajcorrigan88+cfrp@gmail.com</td>\n",
       "      <td>Advocacy organization or think tank</td>\n",
       "      <td>77751658e640d8c56dbda5d7501dfd5d</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                             Name                      Website  \\\n",
       "0                            AARP        https://www.aarp.org/   \n",
       "1  Center for Responsive Politics  http://www.opensecrets.org/   \n",
       "\n",
       "                         Email                             Category  \\\n",
       "0  ajcorrigan88+aarp@gmail.com  Advocacy organization or think tank   \n",
       "1  ajcorrigan88+cfrp@gmail.com  Advocacy organization or think tank   \n",
       "\n",
       "                          Entity_ID  \n",
       "0  555b07f95e3dc347516af4e713ed190b  \n",
       "1  77751658e640d8c56dbda5d7501dfd5d  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# newsletters we signed up for and which alias we used.\n",
    "newsletter_categories = pd.read_csv(fn_newsletter_categories)\n",
    "newsletter_categories.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Presidential candidate:\n",
      "we signed up for 17\n",
      "got emails from 16\n",
      "\n",
      "House battleground campaign:\n",
      "we signed up for 71\n",
      "got emails from 44\n",
      "\n",
      "Advocacy organization or think tank:\n",
      "we signed up for 73\n",
      "got emails from 50\n",
      "\n",
      "House battleground official:\n",
      "we signed up for 70\n",
      "got emails from 61\n",
      "\n"
     ]
    }
   ],
   "source": [
    "A1_output_cols = [\n",
    "    'Name', 'Website', \n",
    "    'Emails received',\n",
    "    'Date of first email', \n",
    "    'Date of last email',\n",
    "]\n",
    "# this prevents duplicate rows\n",
    "cols_to_use = emails_received.columns.difference(\n",
    "    newsletter_categories.columns\n",
    ")\n",
    "\n",
    "no_shows = []\n",
    "for cat in categories:\n",
    "    fn_out = os.path.join(\n",
    "        appedix_dir, f\"A1-{cat.replace(' ', '-')}.csv\"\n",
    "    )\n",
    "    \n",
    "    newsletter_categories_cat = newsletter_categories[\n",
    "        newsletter_categories.Category == cat].set_index('Name')\n",
    "    \n",
    "    emails_received_cat = emails_received[\n",
    "        emails_received['Newsletter Category'] == cat].set_index('Name')\n",
    "\n",
    "    # here we combine all emails we opt-d in for, with all recieved\n",
    "    email_tracker = newsletter_categories_cat.merge(emails_received_cat[cols_to_use],\n",
    "                                                    on= ['Name'], \n",
    "                                                    how='left') \\\n",
    "                                             .sort_values(by= 'Name') \\\n",
    "                                             .reset_index()\n",
    "    \n",
    "    # For email senders with no match, set filler values.\n",
    "    email_tracker['Emails received'].fillna(0, inplace=True)\n",
    "    for col in ['Date of first email', 'Date of last email']:\n",
    "        email_tracker[col].fillna('-', inplace=True)\n",
    "    \n",
    "    # let's record who we didn't get!\n",
    "    no_shows.extend(\n",
    "        email_tracker[email_tracker['Emails received'] == 0][[\n",
    "            'Name', 'Category', 'Entity_ID'\n",
    "        ]].to_dict(orient='records')\n",
    "    )\n",
    "    \n",
    "    # save this for the appendix\n",
    "    email_tracker[A1_output_cols].to_csv(fn_out, index=False)\n",
    "    del fn_out\n",
    "    \n",
    "    # print some summary stats\n",
    "    n_signed_up = len(newsletter_categories_cat)\n",
    "    n_received = len(emails_received_cat)\n",
    "    print(f\"{cat}:\\nwe signed up for {n_signed_up}\\n\"\n",
    "          f\"got emails from {n_received}\\n\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get the stats for all senders:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Out of the 231 email senders we opted into we received 171 of them and didn't recieve 60.\n"
     ]
    }
   ],
   "source": [
    "n_no_shows = len(no_shows)\n",
    "n_received = df.Entity_ID.nunique()\n",
    "n_subscribed = newsletter_categories.Entity_ID.nunique()\n",
    "\n",
    "print(f\"Out of the {n_subscribed} email senders we opted into \"\n",
    "      f\"we received {n_received} of them and didn't recieve {n_no_shows}.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Who didn't send us emails?\n",
    "fn_out = os.path.join(appedix_dir, 'Amisc_email_senders_who_didnt_send.csv')\n",
    "pd.DataFrame(no_shows).sort_values('Name').to_csv(fn_out, index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Table 1: How Gmail classified our emails"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Category</th>\n",
       "      <th>Primary</th>\n",
       "      <th>Promotions</th>\n",
       "      <th>Spam</th>\n",
       "      <th>Total Emails</th>\n",
       "      <th>STD Spam</th>\n",
       "      <th>STD Promotions</th>\n",
       "      <th>STD Primary</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Advocacy organization or think tank</td>\n",
       "      <td>0.088472</td>\n",
       "      <td>0.463807</td>\n",
       "      <td>0.447721</td>\n",
       "      <td>1865</td>\n",
       "      <td>0.340419</td>\n",
       "      <td>0.350158</td>\n",
       "      <td>0.280486</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>House battleground campaign</td>\n",
       "      <td>0.007117</td>\n",
       "      <td>0.494662</td>\n",
       "      <td>0.498221</td>\n",
       "      <td>1124</td>\n",
       "      <td>0.380390</td>\n",
       "      <td>0.376841</td>\n",
       "      <td>0.178170</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>House battleground official</td>\n",
       "      <td>0.447883</td>\n",
       "      <td>0.249186</td>\n",
       "      <td>0.302932</td>\n",
       "      <td>614</td>\n",
       "      <td>0.310248</td>\n",
       "      <td>0.275013</td>\n",
       "      <td>0.302288</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Presidential candidate</td>\n",
       "      <td>0.064010</td>\n",
       "      <td>0.638145</td>\n",
       "      <td>0.297845</td>\n",
       "      <td>1531</td>\n",
       "      <td>0.335097</td>\n",
       "      <td>0.366065</td>\n",
       "      <td>0.179952</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>All emails</td>\n",
       "      <td>0.106350</td>\n",
       "      <td>0.496884</td>\n",
       "      <td>0.396767</td>\n",
       "      <td>5134</td>\n",
       "      <td>0.336899</td>\n",
       "      <td>0.327792</td>\n",
       "      <td>0.280078</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                              Category   Primary  Promotions      Spam  \\\n",
       "0  Advocacy organization or think tank  0.088472    0.463807  0.447721   \n",
       "1          House battleground campaign  0.007117    0.494662  0.498221   \n",
       "2          House battleground official  0.447883    0.249186  0.302932   \n",
       "3               Presidential candidate  0.064010    0.638145  0.297845   \n",
       "4                           All emails  0.106350    0.496884  0.396767   \n",
       "\n",
       "   Total Emails  STD Spam  STD Promotions  STD Primary  \n",
       "0          1865  0.340419        0.350158     0.280486  \n",
       "1          1124  0.380390        0.376841     0.178170  \n",
       "2           614  0.310248        0.275013     0.302288  \n",
       "3          1531  0.335097        0.366065     0.179952  \n",
       "4          5134  0.336899        0.327792     0.280078  "
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fn_out = os.path.join(table_dir, 'T1-overview.csv')\n",
    "\n",
    "data = []\n",
    "for cat, _df in df.groupby('Category'):\n",
    "    spam = []\n",
    "    promos = []\n",
    "    primary = []\n",
    "    for user, __df in _df.groupby('To_Email'):\n",
    "        n_emails = __df.markup_id.nunique()\n",
    "        n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails\n",
    "        n_promo = __df[(__df['Category Promotions'] == 1) &\n",
    "                       (__df['Spam'] != 1)].markup_id.nunique()/ n_emails\n",
    "        n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails\n",
    "        \n",
    "        spam.append(n_spam)\n",
    "        promos.append(n_promo)\n",
    "        primary.append(n_primary)\n",
    "    \n",
    "    # this is to get the variance\n",
    "    spam = np.std(spam)\n",
    "    promos = np.std(promos)\n",
    "    primary = np.std(primary)\n",
    "    \n",
    "    n_emails = _df.markup_id.nunique()\n",
    "    n_spam = _df[_df['Spam'] == 1].markup_id.nunique() / n_emails\n",
    "    n_promo = _df[(_df['Category Promotions'] == 1) &\n",
    "                  (_df['Spam'] != 1)].markup_id.nunique()/ n_emails\n",
    "    n_primary = _df[_df['Primary'] == 1].markup_id.nunique()/ n_emails\n",
    "    \n",
    "    row = {\n",
    "        'Category' : cat,\n",
    "        'Primary' : n_primary,\n",
    "        'Promotions' : n_promo,\n",
    "        'Spam' : n_spam,\n",
    "        'Total Emails' : n_emails,\n",
    "        'STD Spam' : spam,\n",
    "        'STD Promotions' :promos,\n",
    "        'STD Primary' : primary\n",
    "    }\n",
    "    \n",
    "    data.append(row)\n",
    "    \n",
    "# Average for all\n",
    "spam = []\n",
    "promos = []\n",
    "primary = []\n",
    "\n",
    "for user, __df in df.groupby('To_Email'):\n",
    "    n_emails = __df.markup_id.nunique()\n",
    "    n_spam = __df[__df['Spam'] == 1].markup_id.nunique()/ n_emails\n",
    "    n_promo = __df[(__df['Category Promotions'] == 1) &\n",
    "                   (__df['Spam'] != 1)].markup_id.nunique()/ n_emails\n",
    "    n_primary = __df[__df['Primary'] == 1].markup_id.nunique() / n_emails\n",
    "\n",
    "    spam.append(n_spam)\n",
    "    promos.append(n_promo)\n",
    "    primary.append(n_primary)\n",
    "\n",
    "spam = np.std(spam)\n",
    "promos = np.std(promos)\n",
    "primary = np.std(primary)\n",
    "\n",
    "n_emails = df.markup_id.nunique()\n",
    "n_spam = df[df['Spam'] == 1].markup_id.nunique() / n_emails\n",
    "n_promo = df[(df['Category Promotions'] == 1) &\n",
    "             (df['Spam'] != 1)].markup_id.nunique()/ n_emails\n",
    "n_primary = df[df['Primary'] == 1].markup_id.nunique()/ n_emails\n",
    "\n",
    "row = {\n",
    "    'Category' : 'All emails',\n",
    "    'Primary' : n_primary,\n",
    "    'Promotions' : n_promo,\n",
    "    'Spam' : n_spam,\n",
    "    'Total Emails' : n_emails,\n",
    "    'STD Spam' : spam,\n",
    "    'STD Promotions' :promos,\n",
    "    'STD Primary' : primary\n",
    "}\n",
    "\n",
    "data.append(row)\n",
    "\n",
    "overview = pd.DataFrame(data)\n",
    "overview.to_csv(fn_out, index=False)\n",
    "del fn_out\n",
    "overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Table 2: Gmail classification statistics for each email sender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "cat2stats = {}\n",
    "data_all = []\n",
    "for cat in categories:\n",
    "    stats = []\n",
    "    for eID, messages in df[df.Category == cat].groupby('Entity_ID'):\n",
    "        name = messages.iloc[0].Name\n",
    "        n_cat = messages.markup_id.nunique()\n",
    "        n_pro = messages[(messages['Category Promotions'] == 1) &\n",
    "                         (messages['Spam'] != 1) ].markup_id.nunique()\n",
    "        n_u = messages[messages['Category Updates'] == 1].markup_id.nunique()\n",
    "        n_s = messages[messages['Spam'] == 1].markup_id.nunique()\n",
    "        n_pri = messages[messages['Primary'] == 1].markup_id.nunique()\n",
    "        \n",
    "        start_date = messages.Date.min()\n",
    "        end_date = messages.Date.max()\n",
    "        website = messages.Website.iloc[0]\n",
    "        \n",
    "        row = {\n",
    "            'Entity_ID' : eID,\n",
    "            'Name' : name,\n",
    "            'Website' : website,\n",
    "            'n_promos' : n_pro,\n",
    "            'n_updates' : n_u,\n",
    "            'n_spam' : n_s,\n",
    "            'n_primary' : n_pri,\n",
    "            'start_date' : start_date,\n",
    "            'end_date' : end_date,\n",
    "            'n_messages' : len(messages)\n",
    "        }\n",
    "        stats.append(row)\n",
    "        \n",
    "    # assign all sender stats to the category\n",
    "    cat2stats[cat] = stats\n",
    "    # add all sender stats to the list of everything\n",
    "    data_all.extend(stats)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# names of columns we're using\n",
    "metrics = [ 'n_primary', 'n_promos', 'n_spam']\n",
    "ratios = [col.replace('n_', 'perc_') for col in metrics]\n",
    "\n",
    "# columns we want for our output table\n",
    "t2_cols = ['Name'] + ratios +  ['n_messages']\n",
    "\n",
    "# rename these columns for the appendix\n",
    "col2col = {\n",
    "    'Name' : 'Name',\n",
    "    'perc_primary' : 'Primary',\n",
    "    'perc_promos' : 'Promotions',\n",
    "    'perc_spam' : 'Spam',\n",
    "    'n_messages' : 'Total Emails'\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_inbox_ratios(df):\n",
    "    \"\"\"Gets some summary stats per sender and sorts by primary inbox\"\"\"\n",
    "    df[ratios] = df[metrics].div(df['n_messages'], \n",
    "                                 axis=0).round(4)\n",
    "    output = df[t2_cols]\n",
    "    output.columns = [col2col.get(c, c) for c in output.columns]\n",
    "    output = output.set_index('Name').sort_values(\n",
    "        by=['Primary', 'Total Emails'], ascending=False\n",
    "    ).reset_index()\n",
    "    \n",
    "    return output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How is Gmail classifying emails from each of these presidential candidates?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Presidential candidate\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Primary</th>\n",
       "      <th>Promotions</th>\n",
       "      <th>Spam</th>\n",
       "      <th>Total Emails</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Pete Buttigieg</td>\n",
       "      <td>0.6279</td>\n",
       "      <td>0.0930</td>\n",
       "      <td>0.2791</td>\n",
       "      <td>43</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Andrew Yang</td>\n",
       "      <td>0.4638</td>\n",
       "      <td>0.0435</td>\n",
       "      <td>0.4928</td>\n",
       "      <td>69</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Michael Bloomberg</td>\n",
       "      <td>0.1667</td>\n",
       "      <td>0.8333</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>6</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Julián Castro</td>\n",
       "      <td>0.1204</td>\n",
       "      <td>0.7407</td>\n",
       "      <td>0.1389</td>\n",
       "      <td>216</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bill Weld</td>\n",
       "      <td>0.0291</td>\n",
       "      <td>0.2913</td>\n",
       "      <td>0.6796</td>\n",
       "      <td>103</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Tulsi Gabbard</td>\n",
       "      <td>0.0222</td>\n",
       "      <td>0.7778</td>\n",
       "      <td>0.2000</td>\n",
       "      <td>45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Bernie Sanders</td>\n",
       "      <td>0.0180</td>\n",
       "      <td>0.9640</td>\n",
       "      <td>0.0180</td>\n",
       "      <td>111</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Cory Booker</td>\n",
       "      <td>0.0145</td>\n",
       "      <td>0.7633</td>\n",
       "      <td>0.2222</td>\n",
       "      <td>207</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Amy Klobuchar</td>\n",
       "      <td>0.0096</td>\n",
       "      <td>0.9071</td>\n",
       "      <td>0.0833</td>\n",
       "      <td>312</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Kamala Harris</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.3496</td>\n",
       "      <td>0.6504</td>\n",
       "      <td>123</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Joe Walsh</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.0549</td>\n",
       "      <td>0.9451</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Joe Biden</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.8571</td>\n",
       "      <td>0.1429</td>\n",
       "      <td>91</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Elizabeth Warren</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.8485</td>\n",
       "      <td>0.1515</td>\n",
       "      <td>66</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Beto O’Rourke</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.1957</td>\n",
       "      <td>0.8043</td>\n",
       "      <td>46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Tom Steyer</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>1.0000</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Donald J. Trump</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 Name  Primary  Promotions    Spam  Total Emails\n",
       "0      Pete Buttigieg   0.6279      0.0930  0.2791            43\n",
       "1         Andrew Yang   0.4638      0.0435  0.4928            69\n",
       "2   Michael Bloomberg   0.1667      0.8333  0.0000             6\n",
       "3       Julián Castro   0.1204      0.7407  0.1389           216\n",
       "4           Bill Weld   0.0291      0.2913  0.6796           103\n",
       "5       Tulsi Gabbard   0.0222      0.7778  0.2000            45\n",
       "6      Bernie Sanders   0.0180      0.9640  0.0180           111\n",
       "7         Cory Booker   0.0145      0.7633  0.2222           207\n",
       "8       Amy Klobuchar   0.0096      0.9071  0.0833           312\n",
       "9       Kamala Harris   0.0000      0.3496  0.6504           123\n",
       "10          Joe Walsh   0.0000      0.0549  0.9451            91\n",
       "11          Joe Biden   0.0000      0.8571  0.1429            91\n",
       "12   Elizabeth Warren   0.0000      0.8485  0.1515            66\n",
       "13      Beto O’Rourke   0.0000      0.1957  0.8043            46\n",
       "14         Tom Steyer   0.0000      1.0000  0.0000             1\n",
       "15    Donald J. Trump   0.0000      0.0000  0.0000             0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fn_out = os.path.join(table_dir, 'T2a-presidental-candidates.csv')\n",
    "cat = 'Presidential candidate'\n",
    "print(cat)\n",
    "\n",
    "df_cat = pd.DataFrame(cat2stats[cat])\n",
    "output = get_inbox_ratios(df_cat)\n",
    "\n",
    "# remove Trump rallies and add Trump with zero emails.\n",
    "output = output[output.Name != 'Trump rallies'].append(\n",
    "    pd.DataFrame([{\n",
    "        'Name' : 'Donald J. Trump',\n",
    "        'Primary' : 0.0,\n",
    "        'Promotions' : 0.0,\n",
    "        'Spam' : 0.0,\n",
    "        'Total Emails' : 0\n",
    "    }])).reset_index(drop=True)\n",
    "# output = publ?ication_ready(output)\n",
    "output.to_csv(fn_out, index=False)\n",
    "del fn_out\n",
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also get this for the coalition of advocacy groups plus Change.org."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "coalition_plus_change = [\n",
    "    'SumOfUs',\n",
    "    'Democracy for America',\n",
    "    'CREDO Action',\n",
    "    'Change.org'\n",
    "]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Primary</th>\n",
       "      <th>Promotions</th>\n",
       "      <th>Spam</th>\n",
       "      <th>Total Emails</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>SumOfUs</td>\n",
       "      <td>0.0976</td>\n",
       "      <td>0.9024</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Change.org</td>\n",
       "      <td>0.0385</td>\n",
       "      <td>0.8462</td>\n",
       "      <td>0.1154</td>\n",
       "      <td>26</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Democracy for America</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.6023</td>\n",
       "      <td>0.3977</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CREDO Action</td>\n",
       "      <td>0.0000</td>\n",
       "      <td>0.7273</td>\n",
       "      <td>0.2727</td>\n",
       "      <td>44</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                    Name  Primary  Promotions    Spam  Total Emails\n",
       "0                SumOfUs   0.0976      0.9024  0.0000            41\n",
       "1             Change.org   0.0385      0.8462  0.1154            26\n",
       "2  Democracy for America   0.0000      0.6023  0.3977            88\n",
       "3           CREDO Action   0.0000      0.7273  0.2727            44"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "fn_out = os.path.join(table_dir, 'T2b-coalition-groups.csv')\n",
    "cat = 'Advocacy organization or think tank'\n",
    "\n",
    "df_cat = pd.DataFrame(cat2stats[cat])\n",
    "df_cat = df_cat[df_cat.Name.isin(coalition_plus_change)]\n",
    "output = get_inbox_ratios(df_cat)\n",
    "output.to_csv(fn_out, index=False)\n",
    "del fn_out\n",
    "output"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Appendix 2\n",
    "The same stats as Table 2a and 2b, but for all the categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Primary</th>\n",
       "      <th>Promotions</th>\n",
       "      <th>Spam</th>\n",
       "      <th>Total Emails</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Antonio Delgado</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Troy Balderson</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "              Name  Primary  Promotions  Spam  Total Emails\n",
       "0  Antonio Delgado      1.0         0.0   0.0            11\n",
       "1   Troy Balderson      1.0         0.0   0.0             4"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "for cat in categories:\n",
    "    fn_out = os.path.join(appedix_dir, f\"A2-{cat.replace(' ', '-')}.csv\")\n",
    "    df_cat = pd.DataFrame(cat2stats[cat])\n",
    "    output = get_inbox_ratios(df_cat)\n",
    "    output.to_csv(fn_out, index=False)\n",
    "    del fn_out\n",
    "    \n",
    "output.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Misc Stats"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What is the percentage of email senders that never got an email in primary?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When we created the `cat2stats` dictionary, we also added each row to `data_all`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "171"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "email_sender_stats = pd.DataFrame(data_all)\n",
    "len(email_sender_stats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's only look at senders who sent us at least two emails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "149"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "senders = email_sender_stats[email_sender_stats.n_messages >= 2]\n",
    "len(senders)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What percentage these senders never got to the primary inbox?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.4161073825503356"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(senders[senders.n_primary == 0]) / len(senders)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### What percentage of Spam emails are also labelled as promotions?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.6921944035346097"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df[(df['Category Promotions'] == 1) & (df['Spam'] == 1)]) / len(df[df['Spam'] == 1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### How many emails _would_ have been in Updates if Updates was active?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "540"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(df[(df['Category Updates'] == 1) & (df['Spam'] != 1)])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}