{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Section ID Count and Event Data\n",
    "- limited to WP:M pages\n",
    "\n",
    "Section ID data was not limited to top-level (H2) sections during data capture, requiring post-capture processing for section ID click data.\n",
    "\n",
    "Example of capture issue: https://en.wikipedia.org/wiki/Hepatitis#Signs_and_symptoms.\n",
    "Clicks on links under \"Acute hepatitis\" were captured with section_id Acute_hepatitis, not Signs_and_symptoms.\n",
    "\n",
    "Post-capture data augmentation: click event section_ids where mapped to parent H2 section headings. See [populate-section-table.ipynb](populate-section-table.ipynb) for extraction details.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# basic defaults, including study dates, common SQL exclusions and parquet files for anonymized data\n",
    "%run -i 'data-defaults.py'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Raw section data from captured events\n",
    "- Raw total count of events (by all event types) for each section ID for WP:M pages only.\n",
    "- Limited to >= 3000 events.\n",
    "- shows the extent of the capture issue described above\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# \"raw\" section data from captured events to show extent of the capture issue described above\n",
    "pm_section_events_raw_query = \"\"\"\n",
    "SELECT section_id, action, count(*) count\n",
    "FROM \n",
    "    citationusage \n",
    "WHERE page_id IN (SELECT page_id FROM ryanmax.population_wpm_pages_with_extlinks)\n",
    "    AND wiki = 'enwiki'\n",
    "    {}\n",
    "    AND to_date(event_time) >= '{}'\n",
    "    AND to_date(event_time) <= '{}'\n",
    "    AND useragent_is_bot = FALSE\n",
    "GROUP BY section_id, action\n",
    "ORDER BY count desc\n",
    "LIMIT 100\n",
    "\"\"\"\n",
    "pm_section_events_raw = spark.sql(\n",
    "    pm_section_events_raw_query.format(\n",
    "        event_exclusion_sql, start_date_string, end_date_string\n",
    "    ))\n",
    "pm_section_events_raw_pandas = pm_section_events_raw.toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>action</th>\n",
       "      <th>extClick</th>\n",
       "      <th>fnClick</th>\n",
       "      <th>fnHover</th>\n",
       "      <th>upClick</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>section_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-- missing --</th>\n",
       "      <td>143490</td>\n",
       "      <td>365209</td>\n",
       "      <td>523476</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Adverse_effects</th>\n",
       "      <td>None</td>\n",
       "      <td>4252</td>\n",
       "      <td>6343</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Background</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3496</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cause</th>\n",
       "      <td>None</td>\n",
       "      <td>3821</td>\n",
       "      <td>7787</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Causes</th>\n",
       "      <td>None</td>\n",
       "      <td>10930</td>\n",
       "      <td>18686</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Classification</th>\n",
       "      <td>None</td>\n",
       "      <td>3029</td>\n",
       "      <td>6620</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Criminal_charges</th>\n",
       "      <td>None</td>\n",
       "      <td>3025</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Diagnosis</th>\n",
       "      <td>None</td>\n",
       "      <td>7353</td>\n",
       "      <td>14525</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Downfall</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3523</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Early_life</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3192</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Epidemiology</th>\n",
       "      <td>None</td>\n",
       "      <td>8049</td>\n",
       "      <td>16396</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>External_links</th>\n",
       "      <td>114496</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Function</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3082</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Further_reading</th>\n",
       "      <td>12920</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Genetics</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>6477</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>History</th>\n",
       "      <td>None</td>\n",
       "      <td>23552</td>\n",
       "      <td>48210</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Management</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>4420</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>4187</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism_of_action</th>\n",
       "      <td>None</td>\n",
       "      <td>4698</td>\n",
       "      <td>10453</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_uses</th>\n",
       "      <td>None</td>\n",
       "      <td>5672</td>\n",
       "      <td>9901</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medications</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3138</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Notes</th>\n",
       "      <td>14437</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathophysiology</th>\n",
       "      <td>None</td>\n",
       "      <td>3882</td>\n",
       "      <td>10161</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Personal_life</th>\n",
       "      <td>None</td>\n",
       "      <td>6076</td>\n",
       "      <td>5398</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacodynamics</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>7132</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacokinetics</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3705</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prevention</th>\n",
       "      <td>None</td>\n",
       "      <td>3202</td>\n",
       "      <td>5437</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prognosis</th>\n",
       "      <td>None</td>\n",
       "      <td>6368</td>\n",
       "      <td>7950</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>References</th>\n",
       "      <td>488685</td>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>45536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Research</th>\n",
       "      <td>None</td>\n",
       "      <td>3922</td>\n",
       "      <td>6242</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Risk_factors</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3273</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Side_effects</th>\n",
       "      <td>None</td>\n",
       "      <td>4008</td>\n",
       "      <td>5390</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Signs_and_symptoms</th>\n",
       "      <td>None</td>\n",
       "      <td>17120</td>\n",
       "      <td>30088</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Society_and_culture</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3369</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Transmission</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3536</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatment</th>\n",
       "      <td>None</td>\n",
       "      <td>12053</td>\n",
       "      <td>15874</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Types</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>3559</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>United_States</th>\n",
       "      <td>None</td>\n",
       "      <td>None</td>\n",
       "      <td>4951</td>\n",
       "      <td>None</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "action              extClick fnClick fnHover upClick\n",
       "section_id                                          \n",
       "-- missing --         143490  365209  523476    None\n",
       "Adverse_effects         None    4252    6343    None\n",
       "Background              None    None    3496    None\n",
       "Cause                   None    3821    7787    None\n",
       "Causes                  None   10930   18686    None\n",
       "Classification          None    3029    6620    None\n",
       "Criminal_charges        None    3025    None    None\n",
       "Diagnosis               None    7353   14525    None\n",
       "Downfall                None    None    3523    None\n",
       "Early_life              None    None    3192    None\n",
       "Epidemiology            None    8049   16396    None\n",
       "External_links        114496    None    None    None\n",
       "Function                None    None    3082    None\n",
       "Further_reading        12920    None    None    None\n",
       "Genetics                None    None    6477    None\n",
       "History                 None   23552   48210    None\n",
       "Management              None    None    4420    None\n",
       "Mechanism               None    None    4187    None\n",
       "Mechanism_of_action     None    4698   10453    None\n",
       "Medical_uses            None    5672    9901    None\n",
       "Medications             None    None    3138    None\n",
       "Notes                  14437    None    None    None\n",
       "Pathophysiology         None    3882   10161    None\n",
       "Personal_life           None    6076    5398    None\n",
       "Pharmacodynamics        None    None    7132    None\n",
       "Pharmacokinetics        None    None    3705    None\n",
       "Prevention              None    3202    5437    None\n",
       "Prognosis               None    6368    7950    None\n",
       "References            488685    None    None   45536\n",
       "Research                None    3922    6242    None\n",
       "Risk_factors            None    None    3273    None\n",
       "Side_effects            None    4008    5390    None\n",
       "Signs_and_symptoms      None   17120   30088    None\n",
       "Society_and_culture     None    None    3369    None\n",
       "Transmission            None    None    3536    None\n",
       "Treatment               None   12053   15874    None\n",
       "Types                   None    None    3559    None\n",
       "United_States           None    None    4951    None"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "section_pda_raw = pm_section_events_raw_pandas.copy()\n",
    "# replace 'NaN' section_id with 'missing'\n",
    "section_pda_raw.section_id.fillna(value='-- missing --', inplace=True)\n",
    "# limit to counts of 1K or more\n",
    "df_filtered_raw = section_pda_raw.query('count>3000').copy()\n",
    "# set precision before pivot\n",
    "df_filtered_raw['count'] = df_filtered_raw['count'].map(lambda x: '{0:.0f}'.format(x))\n",
    "df_filtered_raw.pivot(index='section_id', columns='action', values='count')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Count of WP:M pages by top-level (H2) section ID"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>section_h2</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>References</td>\n",
       "      <td>30862</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>External_links</td>\n",
       "      <td>18460</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>See_also</td>\n",
       "      <td>13327</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>History</td>\n",
       "      <td>5666</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Diagnosis</td>\n",
       "      <td>4654</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Treatment</td>\n",
       "      <td>4263</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Further_reading</td>\n",
       "      <td>3165</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Signs_and_symptoms</td>\n",
       "      <td>2839</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Causes</td>\n",
       "      <td>2149</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>Epidemiology</td>\n",
       "      <td>2066</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>Prognosis</td>\n",
       "      <td>1300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>Career</td>\n",
       "      <td>1253</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>Notes</td>\n",
       "      <td>1244</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>Research</td>\n",
       "      <td>1221</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>Cause</td>\n",
       "      <td>1209</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>Pathophysiology</td>\n",
       "      <td>1145</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>Medical_uses</td>\n",
       "      <td>1016</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>Society_and_culture</td>\n",
       "      <td>937</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>Biography</td>\n",
       "      <td>842</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>Prevention</td>\n",
       "      <td>827</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>20</th>\n",
       "      <td>Management</td>\n",
       "      <td>796</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>21</th>\n",
       "      <td>Early_life</td>\n",
       "      <td>729</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>22</th>\n",
       "      <td>Symptoms</td>\n",
       "      <td>624</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>Side_effects</td>\n",
       "      <td>607</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>24</th>\n",
       "      <td>Genetics</td>\n",
       "      <td>597</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25</th>\n",
       "      <td>Mechanism</td>\n",
       "      <td>593</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>26</th>\n",
       "      <td>Types</td>\n",
       "      <td>588</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>27</th>\n",
       "      <td>Personal_life</td>\n",
       "      <td>587</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>28</th>\n",
       "      <td>Pharmacology</td>\n",
       "      <td>553</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>29</th>\n",
       "      <td>Bibliography</td>\n",
       "      <td>543</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>30</th>\n",
       "      <td>Background</td>\n",
       "      <td>536</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>31</th>\n",
       "      <td>Early_life_and_education</td>\n",
       "      <td>526</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>32</th>\n",
       "      <td>Publications</td>\n",
       "      <td>502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>33</th>\n",
       "      <td>Presentation</td>\n",
       "      <td>493</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>34</th>\n",
       "      <td>Classification</td>\n",
       "      <td>486</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>35</th>\n",
       "      <td>Education</td>\n",
       "      <td>480</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>36</th>\n",
       "      <td>Chemistry</td>\n",
       "      <td>470</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>37</th>\n",
       "      <td>Life</td>\n",
       "      <td>427</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>38</th>\n",
       "      <td>Mechanism_of_action</td>\n",
       "      <td>396</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>39</th>\n",
       "      <td>Adverse_effects</td>\n",
       "      <td>388</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>40</th>\n",
       "      <td>Sources</td>\n",
       "      <td>387</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>41</th>\n",
       "      <td>Contraindications</td>\n",
       "      <td>351</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>42</th>\n",
       "      <td>Works</td>\n",
       "      <td>340</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>43</th>\n",
       "      <td>Awards</td>\n",
       "      <td>338</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>44</th>\n",
       "      <td>Interactions</td>\n",
       "      <td>311</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>45</th>\n",
       "      <td>Structure</td>\n",
       "      <td>310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>Uses</td>\n",
       "      <td>291</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>47</th>\n",
       "      <td>Selected_publications</td>\n",
       "      <td>290</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>48</th>\n",
       "      <td>Function</td>\n",
       "      <td>279</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>49</th>\n",
       "      <td>Overview</td>\n",
       "      <td>279</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  section_h2  count\n",
       "0                 References  30862\n",
       "1             External_links  18460\n",
       "2                   See_also  13327\n",
       "3                    History   5666\n",
       "4                  Diagnosis   4654\n",
       "5                  Treatment   4263\n",
       "6            Further_reading   3165\n",
       "7         Signs_and_symptoms   2839\n",
       "8                     Causes   2149\n",
       "9               Epidemiology   2066\n",
       "10                 Prognosis   1300\n",
       "11                    Career   1253\n",
       "12                     Notes   1244\n",
       "13                  Research   1221\n",
       "14                     Cause   1209\n",
       "15           Pathophysiology   1145\n",
       "16              Medical_uses   1016\n",
       "17       Society_and_culture    937\n",
       "18                 Biography    842\n",
       "19                Prevention    827\n",
       "20                Management    796\n",
       "21                Early_life    729\n",
       "22                  Symptoms    624\n",
       "23              Side_effects    607\n",
       "24                  Genetics    597\n",
       "25                 Mechanism    593\n",
       "26                     Types    588\n",
       "27             Personal_life    587\n",
       "28              Pharmacology    553\n",
       "29              Bibliography    543\n",
       "30                Background    536\n",
       "31  Early_life_and_education    526\n",
       "32              Publications    502\n",
       "33              Presentation    493\n",
       "34            Classification    486\n",
       "35                 Education    480\n",
       "36                 Chemistry    470\n",
       "37                      Life    427\n",
       "38       Mechanism_of_action    396\n",
       "39           Adverse_effects    388\n",
       "40                   Sources    387\n",
       "41         Contraindications    351\n",
       "42                     Works    340\n",
       "43                    Awards    338\n",
       "44              Interactions    311\n",
       "45                 Structure    310\n",
       "46                      Uses    291\n",
       "47     Selected_publications    290\n",
       "48                  Function    279\n",
       "49                  Overview    279"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# count of top-level (H2) section IDs for WP:M pages only\n",
    "pm_sections_query = \"\"\"\n",
    "SELECT section_h2, count(distinct page_id) count\n",
    "FROM\n",
    "    ryanmax.population_wpm_sections \n",
    "GROUP BY section_h2\n",
    "ORDER BY count desc, section_h2\n",
    "\"\"\"\n",
    "pm_sections = spark.sql(pm_sections_query)\n",
    "pm_sections.toPandas().head(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Total count of events (all types) for each top-level (H2) section ID for WP:M pages only\n",
    "- Limited to sections with more than 1K fnClick events\n",
    "- **missing** values are largely because section IDs were not recorded \"if the section is the Main Section\" as per [Schema:CitationUsage](https://meta.wikimedia.org/wiki/Schema:CitationUsage)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Total count of events (by all event types) for each top-level (H2) section ID for WP:M pages only\n",
    "pm_section_events_query = \"\"\"\n",
    "SELECT population_wpm_sections.section_h2, action, count(*) count, count(*)/{} AS daily_average\n",
    "FROM \n",
    "    citationusage\n",
    "    LEFT JOIN ryanmax.population_wpm_sections \n",
    "        ON \n",
    "        population_wpm_sections.page_id = citationusage.page_id \n",
    "        AND population_wpm_sections.section_id = citationusage.section_id\n",
    "WHERE\n",
    "    wiki = 'enwiki'\n",
    "    AND citationusage.page_id IN (SELECT page_id FROM ryanmax.population_wpm_pages_with_extlinks)\n",
    "    {}\n",
    "    AND to_date(citationusage.event_time) >= '{}'\n",
    "    AND to_date(citationusage.event_time) <= '{}'\n",
    "    AND useragent_is_bot = FALSE\n",
    "GROUP BY population_wpm_sections.section_h2, action\n",
    "ORDER BY count desc\n",
    "\"\"\"\n",
    "pm_section_events = spark.sql(\n",
    "    pm_section_events_query.format(\n",
    "        days_in_study,\n",
    "        event_exclusion_sql, start_date_string, end_date_string\n",
    "    ))\n",
    "pm_section_events_pandas = pm_section_events.toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>action</th>\n",
       "      <th>extClick</th>\n",
       "      <th>fnClick</th>\n",
       "      <th>fnHover</th>\n",
       "      <th>upClick</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>section_h2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-- missing --</th>\n",
       "      <td>144124.0</td>\n",
       "      <td>367446.0</td>\n",
       "      <td>527633.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Adverse_effects</th>\n",
       "      <td>206.0</td>\n",
       "      <td>8017.0</td>\n",
       "      <td>14158.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Applications</th>\n",
       "      <td>18.0</td>\n",
       "      <td>1187.0</td>\n",
       "      <td>2429.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Background</th>\n",
       "      <td>53.0</td>\n",
       "      <td>1853.0</td>\n",
       "      <td>4216.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Biography</th>\n",
       "      <td>60.0</td>\n",
       "      <td>1700.0</td>\n",
       "      <td>3864.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Career</th>\n",
       "      <td>124.0</td>\n",
       "      <td>1801.0</td>\n",
       "      <td>3876.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cause</th>\n",
       "      <td>174.0</td>\n",
       "      <td>8959.0</td>\n",
       "      <td>21897.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Causes</th>\n",
       "      <td>386.0</td>\n",
       "      <td>20480.0</td>\n",
       "      <td>39966.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Characteristics</th>\n",
       "      <td>66.0</td>\n",
       "      <td>1624.0</td>\n",
       "      <td>3835.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Chemistry</th>\n",
       "      <td>14.0</td>\n",
       "      <td>2062.0</td>\n",
       "      <td>4100.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Classification</th>\n",
       "      <td>1602.0</td>\n",
       "      <td>3771.0</td>\n",
       "      <td>8233.0</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Clinical_significance</th>\n",
       "      <td>19.0</td>\n",
       "      <td>1210.0</td>\n",
       "      <td>2708.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Criticism</th>\n",
       "      <td>3.0</td>\n",
       "      <td>1263.0</td>\n",
       "      <td>2204.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Definition</th>\n",
       "      <td>42.0</td>\n",
       "      <td>1644.0</td>\n",
       "      <td>3126.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Definitions</th>\n",
       "      <td>21.0</td>\n",
       "      <td>1299.0</td>\n",
       "      <td>2501.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Description</th>\n",
       "      <td>173.0</td>\n",
       "      <td>1234.0</td>\n",
       "      <td>2597.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Diagnosis</th>\n",
       "      <td>1836.0</td>\n",
       "      <td>15648.0</td>\n",
       "      <td>34727.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Early_life</th>\n",
       "      <td>28.0</td>\n",
       "      <td>1431.0</td>\n",
       "      <td>3047.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Effects</th>\n",
       "      <td>10.0</td>\n",
       "      <td>2519.0</td>\n",
       "      <td>4717.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Epidemiology</th>\n",
       "      <td>47.0</td>\n",
       "      <td>9952.0</td>\n",
       "      <td>21259.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Examples</th>\n",
       "      <td>243.0</td>\n",
       "      <td>1022.0</td>\n",
       "      <td>1544.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Experiments</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1096.0</td>\n",
       "      <td>1573.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Function</th>\n",
       "      <td>16.0</td>\n",
       "      <td>1538.0</td>\n",
       "      <td>4510.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Genetics</th>\n",
       "      <td>239.0</td>\n",
       "      <td>1293.0</td>\n",
       "      <td>3632.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Health_effects</th>\n",
       "      <td>NaN</td>\n",
       "      <td>2426.0</td>\n",
       "      <td>5195.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>History</th>\n",
       "      <td>536.0</td>\n",
       "      <td>34264.0</td>\n",
       "      <td>75457.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Interactions</th>\n",
       "      <td>5.0</td>\n",
       "      <td>1042.0</td>\n",
       "      <td>2046.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Life</th>\n",
       "      <td>212.0</td>\n",
       "      <td>1532.0</td>\n",
       "      <td>4646.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Management</th>\n",
       "      <td>11.0</td>\n",
       "      <td>7595.0</td>\n",
       "      <td>18470.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism</th>\n",
       "      <td>71.0</td>\n",
       "      <td>2875.0</td>\n",
       "      <td>7266.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism_of_action</th>\n",
       "      <td>79.0</td>\n",
       "      <td>3827.0</td>\n",
       "      <td>8596.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_use</th>\n",
       "      <td>7.0</td>\n",
       "      <td>1859.0</td>\n",
       "      <td>3987.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_uses</th>\n",
       "      <td>27.0</td>\n",
       "      <td>14913.0</td>\n",
       "      <td>29189.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Notable_cases</th>\n",
       "      <td>6.0</td>\n",
       "      <td>2097.0</td>\n",
       "      <td>1742.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Overview</th>\n",
       "      <td>80.0</td>\n",
       "      <td>1726.0</td>\n",
       "      <td>3172.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathogenesis</th>\n",
       "      <td>1.0</td>\n",
       "      <td>1024.0</td>\n",
       "      <td>2869.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathophysiology</th>\n",
       "      <td>68.0</td>\n",
       "      <td>5911.0</td>\n",
       "      <td>15846.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Personal_life</th>\n",
       "      <td>6.0</td>\n",
       "      <td>6333.0</td>\n",
       "      <td>5641.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacology</th>\n",
       "      <td>22.0</td>\n",
       "      <td>7245.0</td>\n",
       "      <td>19622.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Physiology</th>\n",
       "      <td>4.0</td>\n",
       "      <td>1220.0</td>\n",
       "      <td>2799.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prevention</th>\n",
       "      <td>90.0</td>\n",
       "      <td>5387.0</td>\n",
       "      <td>10765.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prognosis</th>\n",
       "      <td>74.0</td>\n",
       "      <td>7022.0</td>\n",
       "      <td>9220.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Research</th>\n",
       "      <td>121.0</td>\n",
       "      <td>6413.0</td>\n",
       "      <td>11408.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Risk_factors</th>\n",
       "      <td>7.0</td>\n",
       "      <td>3796.0</td>\n",
       "      <td>7167.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Safety</th>\n",
       "      <td>16.0</td>\n",
       "      <td>1321.0</td>\n",
       "      <td>2336.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Science_and_technology</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1135.0</td>\n",
       "      <td>2123.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Side_effects</th>\n",
       "      <td>21.0</td>\n",
       "      <td>5930.0</td>\n",
       "      <td>9299.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Signs_and_symptoms</th>\n",
       "      <td>810.0</td>\n",
       "      <td>27628.0</td>\n",
       "      <td>56668.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Society_and_culture</th>\n",
       "      <td>345.0</td>\n",
       "      <td>10760.0</td>\n",
       "      <td>20100.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Structure</th>\n",
       "      <td>195.0</td>\n",
       "      <td>1159.0</td>\n",
       "      <td>3023.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Symptoms</th>\n",
       "      <td>78.0</td>\n",
       "      <td>1622.0</td>\n",
       "      <td>2472.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Theranos</th>\n",
       "      <td>NaN</td>\n",
       "      <td>5974.0</td>\n",
       "      <td>10060.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatment</th>\n",
       "      <td>443.0</td>\n",
       "      <td>19942.0</td>\n",
       "      <td>29533.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatments</th>\n",
       "      <td>9.0</td>\n",
       "      <td>1016.0</td>\n",
       "      <td>1394.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Types</th>\n",
       "      <td>345.0</td>\n",
       "      <td>4721.0</td>\n",
       "      <td>9512.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Uses</th>\n",
       "      <td>34.0</td>\n",
       "      <td>4604.0</td>\n",
       "      <td>9748.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "action                  extClick   fnClick   fnHover  upClick\n",
       "section_h2                                                   \n",
       "-- missing --           144124.0  367446.0  527633.0      6.0\n",
       "Adverse_effects            206.0    8017.0   14158.0      NaN\n",
       "Applications                18.0    1187.0    2429.0      NaN\n",
       "Background                  53.0    1853.0    4216.0      1.0\n",
       "Biography                   60.0    1700.0    3864.0      NaN\n",
       "Career                     124.0    1801.0    3876.0      NaN\n",
       "Cause                      174.0    8959.0   21897.0      NaN\n",
       "Causes                     386.0   20480.0   39966.0      1.0\n",
       "Characteristics             66.0    1624.0    3835.0      NaN\n",
       "Chemistry                   14.0    2062.0    4100.0      NaN\n",
       "Classification            1602.0    3771.0    8233.0      2.0\n",
       "Clinical_significance       19.0    1210.0    2708.0      NaN\n",
       "Criticism                    3.0    1263.0    2204.0      NaN\n",
       "Definition                  42.0    1644.0    3126.0      NaN\n",
       "Definitions                 21.0    1299.0    2501.0      NaN\n",
       "Description                173.0    1234.0    2597.0      NaN\n",
       "Diagnosis                 1836.0   15648.0   34727.0      NaN\n",
       "Early_life                  28.0    1431.0    3047.0      NaN\n",
       "Effects                     10.0    2519.0    4717.0      NaN\n",
       "Epidemiology                47.0    9952.0   21259.0      NaN\n",
       "Examples                   243.0    1022.0    1544.0      NaN\n",
       "Experiments                  NaN    1096.0    1573.0      NaN\n",
       "Function                    16.0    1538.0    4510.0      NaN\n",
       "Genetics                   239.0    1293.0    3632.0      NaN\n",
       "Health_effects               NaN    2426.0    5195.0      NaN\n",
       "History                    536.0   34264.0   75457.0      NaN\n",
       "Interactions                 5.0    1042.0    2046.0      NaN\n",
       "Life                       212.0    1532.0    4646.0      NaN\n",
       "Management                  11.0    7595.0   18470.0      NaN\n",
       "Mechanism                   71.0    2875.0    7266.0      NaN\n",
       "Mechanism_of_action         79.0    3827.0    8596.0      NaN\n",
       "Medical_use                  7.0    1859.0    3987.0      NaN\n",
       "Medical_uses                27.0   14913.0   29189.0      NaN\n",
       "Notable_cases                6.0    2097.0    1742.0      NaN\n",
       "Overview                    80.0    1726.0    3172.0      NaN\n",
       "Pathogenesis                 1.0    1024.0    2869.0      NaN\n",
       "Pathophysiology             68.0    5911.0   15846.0      NaN\n",
       "Personal_life                6.0    6333.0    5641.0      NaN\n",
       "Pharmacology                22.0    7245.0   19622.0      NaN\n",
       "Physiology                   4.0    1220.0    2799.0      NaN\n",
       "Prevention                  90.0    5387.0   10765.0      NaN\n",
       "Prognosis                   74.0    7022.0    9220.0      NaN\n",
       "Research                   121.0    6413.0   11408.0      NaN\n",
       "Risk_factors                 7.0    3796.0    7167.0      NaN\n",
       "Safety                      16.0    1321.0    2336.0      NaN\n",
       "Science_and_technology       NaN    1135.0    2123.0      NaN\n",
       "Side_effects                21.0    5930.0    9299.0      NaN\n",
       "Signs_and_symptoms         810.0   27628.0   56668.0      1.0\n",
       "Society_and_culture        345.0   10760.0   20100.0      NaN\n",
       "Structure                  195.0    1159.0    3023.0      NaN\n",
       "Symptoms                    78.0    1622.0    2472.0      NaN\n",
       "Theranos                     NaN    5974.0   10060.0      NaN\n",
       "Treatment                  443.0   19942.0   29533.0      NaN\n",
       "Treatments                   9.0    1016.0    1394.0      NaN\n",
       "Types                      345.0    4721.0    9512.0      NaN\n",
       "Uses                        34.0    4604.0    9748.0      NaN"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "section_pda = pm_section_events_pandas.copy()\n",
    "# replace 'NaN' section_h2 with 'missing'\n",
    "section_pda.section_h2.fillna(value='-- missing --', inplace=True)\n",
    "pivot=section_pda.pivot(index='section_h2', columns='action', values='count').query('fnClick > 1000')\n",
    "section_h2s=pivot.index\n",
    "pivot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Daily averages of above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>action</th>\n",
       "      <th>extClick</th>\n",
       "      <th>fnClick</th>\n",
       "      <th>fnHover</th>\n",
       "      <th>upClick</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>section_h2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-- missing --</th>\n",
       "      <td>4503.87500</td>\n",
       "      <td>11482.68750</td>\n",
       "      <td>16488.53125</td>\n",
       "      <td>0.18750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Adverse_effects</th>\n",
       "      <td>6.43750</td>\n",
       "      <td>250.53125</td>\n",
       "      <td>442.43750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Applications</th>\n",
       "      <td>0.56250</td>\n",
       "      <td>37.09375</td>\n",
       "      <td>75.90625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Background</th>\n",
       "      <td>1.65625</td>\n",
       "      <td>57.90625</td>\n",
       "      <td>131.75000</td>\n",
       "      <td>0.03125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Biography</th>\n",
       "      <td>1.87500</td>\n",
       "      <td>53.12500</td>\n",
       "      <td>120.75000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Career</th>\n",
       "      <td>3.87500</td>\n",
       "      <td>56.28125</td>\n",
       "      <td>121.12500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cause</th>\n",
       "      <td>5.43750</td>\n",
       "      <td>279.96875</td>\n",
       "      <td>684.28125</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Causes</th>\n",
       "      <td>12.06250</td>\n",
       "      <td>640.00000</td>\n",
       "      <td>1248.93750</td>\n",
       "      <td>0.03125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Characteristics</th>\n",
       "      <td>2.06250</td>\n",
       "      <td>50.75000</td>\n",
       "      <td>119.84375</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Chemistry</th>\n",
       "      <td>0.43750</td>\n",
       "      <td>64.43750</td>\n",
       "      <td>128.12500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Classification</th>\n",
       "      <td>50.06250</td>\n",
       "      <td>117.84375</td>\n",
       "      <td>257.28125</td>\n",
       "      <td>0.06250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Clinical_significance</th>\n",
       "      <td>0.59375</td>\n",
       "      <td>37.81250</td>\n",
       "      <td>84.62500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Criticism</th>\n",
       "      <td>0.09375</td>\n",
       "      <td>39.46875</td>\n",
       "      <td>68.87500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Definition</th>\n",
       "      <td>1.31250</td>\n",
       "      <td>51.37500</td>\n",
       "      <td>97.68750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Definitions</th>\n",
       "      <td>0.65625</td>\n",
       "      <td>40.59375</td>\n",
       "      <td>78.15625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Description</th>\n",
       "      <td>5.40625</td>\n",
       "      <td>38.56250</td>\n",
       "      <td>81.15625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Diagnosis</th>\n",
       "      <td>57.37500</td>\n",
       "      <td>489.00000</td>\n",
       "      <td>1085.21875</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Early_life</th>\n",
       "      <td>0.87500</td>\n",
       "      <td>44.71875</td>\n",
       "      <td>95.21875</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Effects</th>\n",
       "      <td>0.31250</td>\n",
       "      <td>78.71875</td>\n",
       "      <td>147.40625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Epidemiology</th>\n",
       "      <td>1.46875</td>\n",
       "      <td>311.00000</td>\n",
       "      <td>664.34375</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Examples</th>\n",
       "      <td>7.59375</td>\n",
       "      <td>31.93750</td>\n",
       "      <td>48.25000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Experiments</th>\n",
       "      <td>NaN</td>\n",
       "      <td>34.25000</td>\n",
       "      <td>49.15625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Function</th>\n",
       "      <td>0.50000</td>\n",
       "      <td>48.06250</td>\n",
       "      <td>140.93750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Genetics</th>\n",
       "      <td>7.46875</td>\n",
       "      <td>40.40625</td>\n",
       "      <td>113.50000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Health_effects</th>\n",
       "      <td>NaN</td>\n",
       "      <td>75.81250</td>\n",
       "      <td>162.34375</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>History</th>\n",
       "      <td>16.75000</td>\n",
       "      <td>1070.75000</td>\n",
       "      <td>2358.03125</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Interactions</th>\n",
       "      <td>0.15625</td>\n",
       "      <td>32.56250</td>\n",
       "      <td>63.93750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Life</th>\n",
       "      <td>6.62500</td>\n",
       "      <td>47.87500</td>\n",
       "      <td>145.18750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Management</th>\n",
       "      <td>0.34375</td>\n",
       "      <td>237.34375</td>\n",
       "      <td>577.18750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism</th>\n",
       "      <td>2.21875</td>\n",
       "      <td>89.84375</td>\n",
       "      <td>227.06250</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism_of_action</th>\n",
       "      <td>2.46875</td>\n",
       "      <td>119.59375</td>\n",
       "      <td>268.62500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_use</th>\n",
       "      <td>0.21875</td>\n",
       "      <td>58.09375</td>\n",
       "      <td>124.59375</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_uses</th>\n",
       "      <td>0.84375</td>\n",
       "      <td>466.03125</td>\n",
       "      <td>912.15625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Notable_cases</th>\n",
       "      <td>0.18750</td>\n",
       "      <td>65.53125</td>\n",
       "      <td>54.43750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Overview</th>\n",
       "      <td>2.50000</td>\n",
       "      <td>53.93750</td>\n",
       "      <td>99.12500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathogenesis</th>\n",
       "      <td>0.03125</td>\n",
       "      <td>32.00000</td>\n",
       "      <td>89.65625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathophysiology</th>\n",
       "      <td>2.12500</td>\n",
       "      <td>184.71875</td>\n",
       "      <td>495.18750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Personal_life</th>\n",
       "      <td>0.18750</td>\n",
       "      <td>197.90625</td>\n",
       "      <td>176.28125</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacology</th>\n",
       "      <td>0.68750</td>\n",
       "      <td>226.40625</td>\n",
       "      <td>613.18750</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Physiology</th>\n",
       "      <td>0.12500</td>\n",
       "      <td>38.12500</td>\n",
       "      <td>87.46875</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prevention</th>\n",
       "      <td>2.81250</td>\n",
       "      <td>168.34375</td>\n",
       "      <td>336.40625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prognosis</th>\n",
       "      <td>2.31250</td>\n",
       "      <td>219.43750</td>\n",
       "      <td>288.12500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Research</th>\n",
       "      <td>3.78125</td>\n",
       "      <td>200.40625</td>\n",
       "      <td>356.50000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Risk_factors</th>\n",
       "      <td>0.21875</td>\n",
       "      <td>118.62500</td>\n",
       "      <td>223.96875</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Safety</th>\n",
       "      <td>0.50000</td>\n",
       "      <td>41.28125</td>\n",
       "      <td>73.00000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Science_and_technology</th>\n",
       "      <td>NaN</td>\n",
       "      <td>35.46875</td>\n",
       "      <td>66.34375</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Side_effects</th>\n",
       "      <td>0.65625</td>\n",
       "      <td>185.31250</td>\n",
       "      <td>290.59375</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Signs_and_symptoms</th>\n",
       "      <td>25.31250</td>\n",
       "      <td>863.37500</td>\n",
       "      <td>1770.87500</td>\n",
       "      <td>0.03125</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Society_and_culture</th>\n",
       "      <td>10.78125</td>\n",
       "      <td>336.25000</td>\n",
       "      <td>628.12500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Structure</th>\n",
       "      <td>6.09375</td>\n",
       "      <td>36.21875</td>\n",
       "      <td>94.46875</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Symptoms</th>\n",
       "      <td>2.43750</td>\n",
       "      <td>50.68750</td>\n",
       "      <td>77.25000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Theranos</th>\n",
       "      <td>NaN</td>\n",
       "      <td>186.68750</td>\n",
       "      <td>314.37500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatment</th>\n",
       "      <td>13.84375</td>\n",
       "      <td>623.18750</td>\n",
       "      <td>922.90625</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatments</th>\n",
       "      <td>0.28125</td>\n",
       "      <td>31.75000</td>\n",
       "      <td>43.56250</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Types</th>\n",
       "      <td>10.78125</td>\n",
       "      <td>147.53125</td>\n",
       "      <td>297.25000</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Uses</th>\n",
       "      <td>1.06250</td>\n",
       "      <td>143.87500</td>\n",
       "      <td>304.62500</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "action                    extClick      fnClick      fnHover  upClick\n",
       "section_h2                                                           \n",
       "-- missing --           4503.87500  11482.68750  16488.53125  0.18750\n",
       "Adverse_effects            6.43750    250.53125    442.43750      NaN\n",
       "Applications               0.56250     37.09375     75.90625      NaN\n",
       "Background                 1.65625     57.90625    131.75000  0.03125\n",
       "Biography                  1.87500     53.12500    120.75000      NaN\n",
       "Career                     3.87500     56.28125    121.12500      NaN\n",
       "Cause                      5.43750    279.96875    684.28125      NaN\n",
       "Causes                    12.06250    640.00000   1248.93750  0.03125\n",
       "Characteristics            2.06250     50.75000    119.84375      NaN\n",
       "Chemistry                  0.43750     64.43750    128.12500      NaN\n",
       "Classification            50.06250    117.84375    257.28125  0.06250\n",
       "Clinical_significance      0.59375     37.81250     84.62500      NaN\n",
       "Criticism                  0.09375     39.46875     68.87500      NaN\n",
       "Definition                 1.31250     51.37500     97.68750      NaN\n",
       "Definitions                0.65625     40.59375     78.15625      NaN\n",
       "Description                5.40625     38.56250     81.15625      NaN\n",
       "Diagnosis                 57.37500    489.00000   1085.21875      NaN\n",
       "Early_life                 0.87500     44.71875     95.21875      NaN\n",
       "Effects                    0.31250     78.71875    147.40625      NaN\n",
       "Epidemiology               1.46875    311.00000    664.34375      NaN\n",
       "Examples                   7.59375     31.93750     48.25000      NaN\n",
       "Experiments                    NaN     34.25000     49.15625      NaN\n",
       "Function                   0.50000     48.06250    140.93750      NaN\n",
       "Genetics                   7.46875     40.40625    113.50000      NaN\n",
       "Health_effects                 NaN     75.81250    162.34375      NaN\n",
       "History                   16.75000   1070.75000   2358.03125      NaN\n",
       "Interactions               0.15625     32.56250     63.93750      NaN\n",
       "Life                       6.62500     47.87500    145.18750      NaN\n",
       "Management                 0.34375    237.34375    577.18750      NaN\n",
       "Mechanism                  2.21875     89.84375    227.06250      NaN\n",
       "Mechanism_of_action        2.46875    119.59375    268.62500      NaN\n",
       "Medical_use                0.21875     58.09375    124.59375      NaN\n",
       "Medical_uses               0.84375    466.03125    912.15625      NaN\n",
       "Notable_cases              0.18750     65.53125     54.43750      NaN\n",
       "Overview                   2.50000     53.93750     99.12500      NaN\n",
       "Pathogenesis               0.03125     32.00000     89.65625      NaN\n",
       "Pathophysiology            2.12500    184.71875    495.18750      NaN\n",
       "Personal_life              0.18750    197.90625    176.28125      NaN\n",
       "Pharmacology               0.68750    226.40625    613.18750      NaN\n",
       "Physiology                 0.12500     38.12500     87.46875      NaN\n",
       "Prevention                 2.81250    168.34375    336.40625      NaN\n",
       "Prognosis                  2.31250    219.43750    288.12500      NaN\n",
       "Research                   3.78125    200.40625    356.50000      NaN\n",
       "Risk_factors               0.21875    118.62500    223.96875      NaN\n",
       "Safety                     0.50000     41.28125     73.00000      NaN\n",
       "Science_and_technology         NaN     35.46875     66.34375      NaN\n",
       "Side_effects               0.65625    185.31250    290.59375      NaN\n",
       "Signs_and_symptoms        25.31250    863.37500   1770.87500  0.03125\n",
       "Society_and_culture       10.78125    336.25000    628.12500      NaN\n",
       "Structure                  6.09375     36.21875     94.46875      NaN\n",
       "Symptoms                   2.43750     50.68750     77.25000      NaN\n",
       "Theranos                       NaN    186.68750    314.37500      NaN\n",
       "Treatment                 13.84375    623.18750    922.90625      NaN\n",
       "Treatments                 0.28125     31.75000     43.56250      NaN\n",
       "Types                     10.78125    147.53125    297.25000      NaN\n",
       "Uses                       1.06250    143.87500    304.62500      NaN"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "section_pda.pivot(index='section_h2', columns='action', values='daily_average').loc[section_h2s]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# count of events by event type and access mode for each top-level (H2) section ID for WP:M pages only\n",
    "pm_section_events_mode_query = \"\"\"\n",
    "SELECT population_wpm_sections.section_h2, action, mode, count(*) count\n",
    "FROM \n",
    "    citationusage\n",
    "    LEFT JOIN ryanmax.population_wpm_sections \n",
    "        ON \n",
    "        population_wpm_sections.page_id = citationusage.page_id \n",
    "        AND population_wpm_sections.section_id = citationusage.section_id\n",
    "WHERE\n",
    "    wiki = 'enwiki'\n",
    "    AND citationusage.page_id IN (SELECT page_id FROM ryanmax.population_wpm_pages_with_extlinks)\n",
    "    {}\n",
    "    AND to_date(citationusage.event_time) >= '{}'\n",
    "    AND to_date(citationusage.event_time) <= '{}'\n",
    "    AND useragent_is_bot = FALSE\n",
    "GROUP BY population_wpm_sections.section_h2, action, mode\n",
    "ORDER BY count desc\n",
    "\"\"\"\n",
    "pm_section_events_mode = spark.sql(\n",
    "    pm_section_events_mode_query.format(\n",
    "        event_exclusion_sql, start_date_string, end_date_string\n",
    "    ))\n",
    "pm_section_events_mode_pandas = pm_section_events_mode.toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Desktop event counts for each top-level (H2) section ID for WP:M pages only\n",
    "- limits above apply here as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>action</th>\n",
       "      <th>extClick</th>\n",
       "      <th>fnClick</th>\n",
       "      <th>fnHover</th>\n",
       "      <th>upClick</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>section_h2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-- missing --</th>\n",
       "      <td>65483.0</td>\n",
       "      <td>123684.0</td>\n",
       "      <td>485050.0</td>\n",
       "      <td>6.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Adverse_effects</th>\n",
       "      <td>49.0</td>\n",
       "      <td>2487.0</td>\n",
       "      <td>13283.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Background</th>\n",
       "      <td>25.0</td>\n",
       "      <td>1024.0</td>\n",
       "      <td>4057.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cause</th>\n",
       "      <td>96.0</td>\n",
       "      <td>3624.0</td>\n",
       "      <td>20839.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Causes</th>\n",
       "      <td>207.0</td>\n",
       "      <td>7194.0</td>\n",
       "      <td>37401.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Chemistry</th>\n",
       "      <td>9.0</td>\n",
       "      <td>1232.0</td>\n",
       "      <td>3941.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Classification</th>\n",
       "      <td>377.0</td>\n",
       "      <td>1697.0</td>\n",
       "      <td>7789.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Diagnosis</th>\n",
       "      <td>722.0</td>\n",
       "      <td>6724.0</td>\n",
       "      <td>32950.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Epidemiology</th>\n",
       "      <td>33.0</td>\n",
       "      <td>5265.0</td>\n",
       "      <td>20627.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Function</th>\n",
       "      <td>9.0</td>\n",
       "      <td>1096.0</td>\n",
       "      <td>4392.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Health_effects</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1019.0</td>\n",
       "      <td>4856.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>History</th>\n",
       "      <td>330.0</td>\n",
       "      <td>18914.0</td>\n",
       "      <td>72711.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Management</th>\n",
       "      <td>5.0</td>\n",
       "      <td>3046.0</td>\n",
       "      <td>17377.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism</th>\n",
       "      <td>40.0</td>\n",
       "      <td>1564.0</td>\n",
       "      <td>6992.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism_of_action</th>\n",
       "      <td>54.0</td>\n",
       "      <td>2371.0</td>\n",
       "      <td>8291.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_uses</th>\n",
       "      <td>14.0</td>\n",
       "      <td>5291.0</td>\n",
       "      <td>27188.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Overview</th>\n",
       "      <td>45.0</td>\n",
       "      <td>1003.0</td>\n",
       "      <td>3055.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathophysiology</th>\n",
       "      <td>38.0</td>\n",
       "      <td>3170.0</td>\n",
       "      <td>15170.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Personal_life</th>\n",
       "      <td>3.0</td>\n",
       "      <td>1389.0</td>\n",
       "      <td>5229.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacology</th>\n",
       "      <td>14.0</td>\n",
       "      <td>4039.0</td>\n",
       "      <td>18964.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prevention</th>\n",
       "      <td>38.0</td>\n",
       "      <td>2140.0</td>\n",
       "      <td>10100.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prognosis</th>\n",
       "      <td>48.0</td>\n",
       "      <td>2126.0</td>\n",
       "      <td>8709.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Research</th>\n",
       "      <td>70.0</td>\n",
       "      <td>2977.0</td>\n",
       "      <td>10980.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Risk_factors</th>\n",
       "      <td>4.0</td>\n",
       "      <td>1524.0</td>\n",
       "      <td>6698.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Side_effects</th>\n",
       "      <td>15.0</td>\n",
       "      <td>1775.0</td>\n",
       "      <td>8654.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Signs_and_symptoms</th>\n",
       "      <td>240.0</td>\n",
       "      <td>8395.0</td>\n",
       "      <td>52838.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Society_and_culture</th>\n",
       "      <td>167.0</td>\n",
       "      <td>3998.0</td>\n",
       "      <td>19013.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Theranos</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1840.0</td>\n",
       "      <td>9671.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatment</th>\n",
       "      <td>135.0</td>\n",
       "      <td>5844.0</td>\n",
       "      <td>27318.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Types</th>\n",
       "      <td>170.0</td>\n",
       "      <td>2025.0</td>\n",
       "      <td>8863.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Uses</th>\n",
       "      <td>16.0</td>\n",
       "      <td>1978.0</td>\n",
       "      <td>9257.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "action               extClick   fnClick   fnHover  upClick\n",
       "section_h2                                                \n",
       "-- missing --         65483.0  123684.0  485050.0      6.0\n",
       "Adverse_effects          49.0    2487.0   13283.0      NaN\n",
       "Background               25.0    1024.0    4057.0      NaN\n",
       "Cause                    96.0    3624.0   20839.0      NaN\n",
       "Causes                  207.0    7194.0   37401.0      NaN\n",
       "Chemistry                 9.0    1232.0    3941.0      NaN\n",
       "Classification          377.0    1697.0    7789.0      1.0\n",
       "Diagnosis               722.0    6724.0   32950.0      NaN\n",
       "Epidemiology             33.0    5265.0   20627.0      NaN\n",
       "Function                  9.0    1096.0    4392.0      NaN\n",
       "Health_effects            NaN    1019.0    4856.0      NaN\n",
       "History                 330.0   18914.0   72711.0      NaN\n",
       "Management                5.0    3046.0   17377.0      NaN\n",
       "Mechanism                40.0    1564.0    6992.0      NaN\n",
       "Mechanism_of_action      54.0    2371.0    8291.0      NaN\n",
       "Medical_uses             14.0    5291.0   27188.0      NaN\n",
       "Overview                 45.0    1003.0    3055.0      NaN\n",
       "Pathophysiology          38.0    3170.0   15170.0      NaN\n",
       "Personal_life             3.0    1389.0    5229.0      NaN\n",
       "Pharmacology             14.0    4039.0   18964.0      NaN\n",
       "Prevention               38.0    2140.0   10100.0      NaN\n",
       "Prognosis                48.0    2126.0    8709.0      NaN\n",
       "Research                 70.0    2977.0   10980.0      NaN\n",
       "Risk_factors              4.0    1524.0    6698.0      NaN\n",
       "Side_effects             15.0    1775.0    8654.0      NaN\n",
       "Signs_and_symptoms      240.0    8395.0   52838.0      1.0\n",
       "Society_and_culture     167.0    3998.0   19013.0      NaN\n",
       "Theranos                  NaN    1840.0    9671.0      NaN\n",
       "Treatment               135.0    5844.0   27318.0      NaN\n",
       "Types                   170.0    2025.0    8863.0      NaN\n",
       "Uses                     16.0    1978.0    9257.0      NaN"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "desktop_pda = pm_section_events_mode_pandas.query('mode == \"desktop\"').copy()\n",
    "# replace 'NaN' section_h2 with 'missing'\n",
    "desktop_pda.section_h2.fillna(value='-- missing --', inplace=True)\n",
    "desktop_pda.pivot(index='section_h2', columns='action', values='count').query('fnClick > 1000')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Mobile event counts for each top-level (H2) section ID for WP:M pages only\n",
    "- limits above apply here as well"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>action</th>\n",
       "      <th>extClick</th>\n",
       "      <th>fnClick</th>\n",
       "      <th>fnHover</th>\n",
       "      <th>upClick</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>section_h2</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>-- missing --</th>\n",
       "      <td>78641.0</td>\n",
       "      <td>243762.0</td>\n",
       "      <td>42583.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Adverse_effects</th>\n",
       "      <td>157.0</td>\n",
       "      <td>5530.0</td>\n",
       "      <td>875.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Career</th>\n",
       "      <td>38.0</td>\n",
       "      <td>1044.0</td>\n",
       "      <td>132.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cause</th>\n",
       "      <td>78.0</td>\n",
       "      <td>5335.0</td>\n",
       "      <td>1058.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Causes</th>\n",
       "      <td>179.0</td>\n",
       "      <td>13286.0</td>\n",
       "      <td>2565.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Characteristics</th>\n",
       "      <td>4.0</td>\n",
       "      <td>1009.0</td>\n",
       "      <td>204.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Classification</th>\n",
       "      <td>1225.0</td>\n",
       "      <td>2074.0</td>\n",
       "      <td>444.0</td>\n",
       "      <td>1.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Diagnosis</th>\n",
       "      <td>1114.0</td>\n",
       "      <td>8924.0</td>\n",
       "      <td>1777.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Early_life</th>\n",
       "      <td>12.0</td>\n",
       "      <td>1090.0</td>\n",
       "      <td>193.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Effects</th>\n",
       "      <td>10.0</td>\n",
       "      <td>1641.0</td>\n",
       "      <td>305.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Epidemiology</th>\n",
       "      <td>14.0</td>\n",
       "      <td>4687.0</td>\n",
       "      <td>632.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Health_effects</th>\n",
       "      <td>NaN</td>\n",
       "      <td>1407.0</td>\n",
       "      <td>339.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>History</th>\n",
       "      <td>206.0</td>\n",
       "      <td>15350.0</td>\n",
       "      <td>2746.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Management</th>\n",
       "      <td>6.0</td>\n",
       "      <td>4549.0</td>\n",
       "      <td>1093.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism</th>\n",
       "      <td>31.0</td>\n",
       "      <td>1311.0</td>\n",
       "      <td>274.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Mechanism_of_action</th>\n",
       "      <td>25.0</td>\n",
       "      <td>1456.0</td>\n",
       "      <td>305.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_use</th>\n",
       "      <td>2.0</td>\n",
       "      <td>1095.0</td>\n",
       "      <td>232.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Medical_uses</th>\n",
       "      <td>13.0</td>\n",
       "      <td>9622.0</td>\n",
       "      <td>2001.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Notable_cases</th>\n",
       "      <td>3.0</td>\n",
       "      <td>1633.0</td>\n",
       "      <td>126.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pathophysiology</th>\n",
       "      <td>30.0</td>\n",
       "      <td>2741.0</td>\n",
       "      <td>676.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Personal_life</th>\n",
       "      <td>3.0</td>\n",
       "      <td>4944.0</td>\n",
       "      <td>412.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacology</th>\n",
       "      <td>8.0</td>\n",
       "      <td>3206.0</td>\n",
       "      <td>658.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prevention</th>\n",
       "      <td>52.0</td>\n",
       "      <td>3247.0</td>\n",
       "      <td>665.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Prognosis</th>\n",
       "      <td>26.0</td>\n",
       "      <td>4896.0</td>\n",
       "      <td>511.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Research</th>\n",
       "      <td>51.0</td>\n",
       "      <td>3436.0</td>\n",
       "      <td>428.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Risk_factors</th>\n",
       "      <td>3.0</td>\n",
       "      <td>2272.0</td>\n",
       "      <td>469.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Side_effects</th>\n",
       "      <td>6.0</td>\n",
       "      <td>4155.0</td>\n",
       "      <td>645.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Signs_and_symptoms</th>\n",
       "      <td>570.0</td>\n",
       "      <td>19233.0</td>\n",
       "      <td>3830.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Society_and_culture</th>\n",
       "      <td>178.0</td>\n",
       "      <td>6762.0</td>\n",
       "      <td>1087.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Symptoms</th>\n",
       "      <td>55.0</td>\n",
       "      <td>1138.0</td>\n",
       "      <td>168.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Theranos</th>\n",
       "      <td>NaN</td>\n",
       "      <td>4134.0</td>\n",
       "      <td>389.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Treatment</th>\n",
       "      <td>308.0</td>\n",
       "      <td>14098.0</td>\n",
       "      <td>2215.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Types</th>\n",
       "      <td>175.0</td>\n",
       "      <td>2696.0</td>\n",
       "      <td>649.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Uses</th>\n",
       "      <td>18.0</td>\n",
       "      <td>2626.0</td>\n",
       "      <td>491.0</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "action               extClick   fnClick  fnHover  upClick\n",
       "section_h2                                               \n",
       "-- missing --         78641.0  243762.0  42583.0      NaN\n",
       "Adverse_effects         157.0    5530.0    875.0      NaN\n",
       "Career                   38.0    1044.0    132.0      NaN\n",
       "Cause                    78.0    5335.0   1058.0      NaN\n",
       "Causes                  179.0   13286.0   2565.0      1.0\n",
       "Characteristics           4.0    1009.0    204.0      NaN\n",
       "Classification         1225.0    2074.0    444.0      1.0\n",
       "Diagnosis              1114.0    8924.0   1777.0      NaN\n",
       "Early_life               12.0    1090.0    193.0      NaN\n",
       "Effects                  10.0    1641.0    305.0      NaN\n",
       "Epidemiology             14.0    4687.0    632.0      NaN\n",
       "Health_effects            NaN    1407.0    339.0      NaN\n",
       "History                 206.0   15350.0   2746.0      NaN\n",
       "Management                6.0    4549.0   1093.0      NaN\n",
       "Mechanism                31.0    1311.0    274.0      NaN\n",
       "Mechanism_of_action      25.0    1456.0    305.0      NaN\n",
       "Medical_use               2.0    1095.0    232.0      NaN\n",
       "Medical_uses             13.0    9622.0   2001.0      NaN\n",
       "Notable_cases             3.0    1633.0    126.0      NaN\n",
       "Pathophysiology          30.0    2741.0    676.0      NaN\n",
       "Personal_life             3.0    4944.0    412.0      NaN\n",
       "Pharmacology              8.0    3206.0    658.0      NaN\n",
       "Prevention               52.0    3247.0    665.0      NaN\n",
       "Prognosis                26.0    4896.0    511.0      NaN\n",
       "Research                 51.0    3436.0    428.0      NaN\n",
       "Risk_factors              3.0    2272.0    469.0      NaN\n",
       "Side_effects              6.0    4155.0    645.0      NaN\n",
       "Signs_and_symptoms      570.0   19233.0   3830.0      NaN\n",
       "Society_and_culture     178.0    6762.0   1087.0      NaN\n",
       "Symptoms                 55.0    1138.0    168.0      NaN\n",
       "Theranos                  NaN    4134.0    389.0      NaN\n",
       "Treatment               308.0   14098.0   2215.0      NaN\n",
       "Types                   175.0    2696.0    649.0      NaN\n",
       "Uses                     18.0    2626.0    491.0      NaN"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mobile_pda = pm_section_events_mode_pandas.query('mode == \"mobile\"').copy()\n",
    "# replace 'NaN' section_h2 with 'missing'\n",
    "mobile_pda.section_h2.fillna(value='-- missing --', inplace=True)\n",
    "mobile_pda.pivot(index='section_h2', columns='action', values='count').query('fnClick > 1000')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Total event counts for Adverse_effects section H2s for WP:M pages\n",
    " - 20 pages with most events"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# event counts for pages with Adverse_effects section H2s\n",
    "adverse_events_query = \"\"\"\n",
    "SELECT citationusage.page_id, population_page_titles_20190420.page_title, count(*) AS event_count\n",
    "FROM \n",
    "    citationusage\n",
    "    LEFT JOIN ryanmax.population_wpm_sections \n",
    "        ON \n",
    "        population_wpm_sections.page_id = citationusage.page_id \n",
    "        AND population_wpm_sections.section_id = citationusage.section_id\n",
    "    LEFT JOIN ryanmax.population_page_titles_20190420\n",
    "        ON\n",
    "        population_page_titles_20190420.page_id = citationusage.page_id \n",
    "WHERE\n",
    "    wiki = 'enwiki'\n",
    "    AND population_wpm_sections.section_h2 = 'Adverse_effects'\n",
    "    AND citationusage.page_id IN (SELECT page_id FROM ryanmax.population_wpm_pages_with_extlinks)\n",
    "    {}\n",
    "    AND to_date(citationusage.event_time) >= '{}'\n",
    "    AND to_date(citationusage.event_time) <= '{}'\n",
    "    AND useragent_is_bot = FALSE\n",
    "GROUP BY citationusage.page_id, population_page_titles_20190420.page_title\n",
    "ORDER BY event_count desc\n",
    "\"\"\"\n",
    "adverse_events = spark.sql(\n",
    "    adverse_events_query.format(\n",
    "        event_exclusion_sql, start_date_string, end_date_string\n",
    "    ))\n",
    "adverse_events_pandas = adverse_events.toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total events for Adverse_effects:  22381\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>page_id</th>\n",
       "      <th>page_title</th>\n",
       "      <th>event_count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1481886</td>\n",
       "      <td>Cannabis (drug)</td>\n",
       "      <td>1198</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>10024</td>\n",
       "      <td>MDMA</td>\n",
       "      <td>1099</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1245311</td>\n",
       "      <td>Finasteride</td>\n",
       "      <td>511</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>17537</td>\n",
       "      <td>Lysergic acid diethylamide</td>\n",
       "      <td>464</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>83406</td>\n",
       "      <td>Paracetamol</td>\n",
       "      <td>433</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>141915</td>\n",
       "      <td>Fentanyl</td>\n",
       "      <td>393</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>724730</td>\n",
       "      <td>Clonazepam</td>\n",
       "      <td>375</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>8718425</td>\n",
       "      <td>Circumcision</td>\n",
       "      <td>339</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>155627</td>\n",
       "      <td>Ibuprofen</td>\n",
       "      <td>332</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>178197</td>\n",
       "      <td>Statin</td>\n",
       "      <td>315</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>253720</td>\n",
       "      <td>Metformin</td>\n",
       "      <td>310</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>229985</td>\n",
       "      <td>Isotretinoin</td>\n",
       "      <td>280</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>22071</td>\n",
       "      <td>Nonsteroidal anti-inflammatory drug</td>\n",
       "      <td>263</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>1525</td>\n",
       "      <td>Aspirin</td>\n",
       "      <td>262</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>4781</td>\n",
       "      <td>Benzodiazepine</td>\n",
       "      <td>245</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>185272</td>\n",
       "      <td>Atypical antipsychotic</td>\n",
       "      <td>237</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>201310</td>\n",
       "      <td>Citalopram</td>\n",
       "      <td>229</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>520574</td>\n",
       "      <td>Venlafaxine</td>\n",
       "      <td>220</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>31690663</td>\n",
       "      <td>Quinolone antibiotic</td>\n",
       "      <td>219</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>875202</td>\n",
       "      <td>Mitragyna speciosa</td>\n",
       "      <td>216</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     page_id                           page_title  event_count\n",
       "0    1481886                      Cannabis (drug)         1198\n",
       "1      10024                                 MDMA         1099\n",
       "2    1245311                          Finasteride          511\n",
       "3      17537           Lysergic acid diethylamide          464\n",
       "4      83406                          Paracetamol          433\n",
       "5     141915                             Fentanyl          393\n",
       "6     724730                           Clonazepam          375\n",
       "7    8718425                         Circumcision          339\n",
       "8     155627                            Ibuprofen          332\n",
       "9     178197                               Statin          315\n",
       "10    253720                            Metformin          310\n",
       "11    229985                         Isotretinoin          280\n",
       "12     22071  Nonsteroidal anti-inflammatory drug          263\n",
       "13      1525                              Aspirin          262\n",
       "14      4781                       Benzodiazepine          245\n",
       "15    185272               Atypical antipsychotic          237\n",
       "16    201310                           Citalopram          229\n",
       "17    520574                          Venlafaxine          220\n",
       "18  31690663                 Quinolone antibiotic          219\n",
       "19    875202                   Mitragyna speciosa          216"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print('total events for Adverse_effects: ',adverse_events_pandas['event_count'].sum())\n",
    "adverse_events_pandas.head(20)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "PySpark - YARN (large)",
   "language": "python",
   "name": "spark_yarn_pyspark_large"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}