{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MIMIC Notes Pre-Processing\n",
    "\n",
    "Pre-processing MIMIC notes for further use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Below is a list of redacted items with an example and the replacement token. Replacement tokens are changeable. Check `preprocess_notes.py` for more details.\n",
    "\n",
    "Redacted items:\n",
    "* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`\n",
    "* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`\n",
    "* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`\n",
    "* [x] Name: `[**Name (NI) **]`, `xxname`\n",
    "* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`\n",
    "* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`\n",
    "* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`\n",
    "* [x] Hospital: `[**Hospital1 **]`, `xxhosp`\n",
    "* [x] Hospital Unit Name: `**Hospital Unit Name 10**`, `xxhosp`\n",
    "* [x] Company: `[**Company 12924**]`, `xxwork`\n",
    "* [x] University/College: `[**University/College **]`, `xxwork`\n",
    "* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`\n",
    "* [x] Year: `[**Year (4 digits) **]`, `xxyear`\n",
    "* [x] Year YYYY format: `[**2119**]`, `xxyear` - I use a regex `\\b\\d{4}\\b` that will match **any** 4 digits which might be problematic, but for the most part 4 digits by itself seems to indicate a year.\n",
    "* [x] Date of format M-DD: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`\n",
    "* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`\n",
    "* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`\n",
    "* [x] Holiday: `[**Holiday 3470**]`, `xxhols`\n",
    "* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`\n",
    "* [x] Country: `[**Country 9958**]`, `xxcntry`\n",
    "* [x] State: `[**State 3283**]`, `xxstate`\n",
    "* [x] Location: `**Location (un) 2432**`, `xxloc`\n",
    "* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`\n",
    "* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`\n",
    "* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`\n",
    "* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`\n",
    "* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`\n",
    "* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`\n",
    "* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`\n",
    "* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`\n",
    "* [x] Age over 90: `[**Age over 90 **]`, `xxage90`\n",
    "* [x] Contact Info: `[**Contact Info **]`, `xxcontact`\n",
    "* [x] Job Number: `[**Job Number **]`, `xxjobno`\n",
    "* [x] Dictator Number: `[**Dictator Info **]`, `xxdict`\n",
    "* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`\n",
    "* [x] Time: `12:52 PM`, split into 6 segments by the hour and replace with the following tokens: `midnight, dawn, forenoon, afternoon, dusk, night`\n",
    "* 2-digit Numbers: `[** 84 **]`, `xx2digit`\n",
    "* 3-digit Numbers: `[** 834 **]`, `xx3digit`\n",
    "* Wardname"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`886` notes are marked incorrect with `iserror` flag set to 1. Thus, there are total of `2,082,294` notes. I have set up a `view` called `correctnotes` in the database, which only includese the correct notes. All the data I grab is from that `view`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports and Inits"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import psycopg2\n",
    "import numpy as np\n",
    "import re\n",
    "import random\n",
    "import datetime\n",
    "from pathlib import Path\n",
    "import pickle\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Softlink `ln -s` your data path to a `data` variable in the current folder. That way we don't need to change the path in the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "PATH = Path('data')\n",
    "!ls {PATH}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from preprocess_notes import *"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Grab Data from MIMIC"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### From Database"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here the data is grabbed from the MIMIC database. Data can also be grabbed from other sources"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-03-18T23:05:14.536531Z",
     "start_time": "2018-03-18T23:04:17.500997Z"
    },
    "hidden": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2082294, 10)\n",
      "CPU times: user 12.1 s, sys: 2.58 s, total: 14.7 s\n",
      "Wall time: 1min 24s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "cats = pd.read_csv('note_categories.csv')\n",
    "max_limit = -1\n",
    "\n",
    "queries = []\n",
    "for category, n_notes in zip(cats['category'], cats['number_of_notes']):\n",
    "    limit = min(max_limit, n_notes) if max_limit > 0 else n_notes\n",
    "    if limit == max_limit:\n",
    "        q = f\"\"\"\n",
    "        select * from correctnotes where category=\\'{category}\\' order by random() limit {limit};\n",
    "        \"\"\"\n",
    "    else:\n",
    "        q = f\"\"\"\n",
    "        select * from correctnotes where category=\\'{category}\\';\n",
    "        \"\"\"\n",
    "    queries.append(q)\n",
    "\n",
    "dfs = []\n",
    "\n",
    "con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')\n",
    "for q in queries:\n",
    "    df = pd.read_sql_query(q, con)\n",
    "    dfs.append(df)\n",
    "con.close()\n",
    "    \n",
    "df = pd.concat(dfs)\n",
    "print(df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### From Notes File"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')\n",
    "print(df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocess"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(2082294, 9)\n"
     ]
    }
   ],
   "source": [
    "df.columns = map(str.lower, df.columns)\n",
    "df.set_index('row_id', inplace=True)\n",
    "print(df.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Confirm that the number of notes match the actual number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Case Management</th>\n",
       "      <td>953</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Consult</th>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Discharge summary</th>\n",
       "      <td>59652</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ECG</th>\n",
       "      <td>209051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Echo</th>\n",
       "      <td>45794</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>General</th>\n",
       "      <td>8236</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing</th>\n",
       "      <td>223182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing/other</th>\n",
       "      <td>822497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nutrition</th>\n",
       "      <td>9400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacy</th>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Physician</th>\n",
       "      <td>141281</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Radiology</th>\n",
       "      <td>522279</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Rehab Services</th>\n",
       "      <td>5408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Respiratory</th>\n",
       "      <td>31701</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Social Work</th>\n",
       "      <td>2661</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     text\n",
       "                    count\n",
       "category                 \n",
       "Case Management       953\n",
       "Consult                98\n",
       "Discharge summary   59652\n",
       "ECG                209051\n",
       "Echo                45794\n",
       "General              8236\n",
       "Nursing            223182\n",
       "Nursing/other      822497\n",
       "Nutrition            9400\n",
       "Pharmacy              101\n",
       "Physician          141281\n",
       "Radiology          522279\n",
       "Rehab Services       5408\n",
       "Respiratory         31701\n",
       "Social Work          2661"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[['category', 'text']].groupby(['category']).agg(['count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 24min 5s, sys: 3.86 s, total: 24min 9s\n",
      "Wall time: 24min 55s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "df['proc_text'] = df['text'].apply(preprocess_note)\n",
    "\n",
    "with open(PATH/'preprocessed_noteevents.pkl', 'wb') as f:\n",
    "    pickle.dump(df, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create datasets for Language Modeling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To follow the FastAI language modeling lesson, I've created a subset of the original dataframe to sample for the datasets. In particular, I've included the `description` and `preprocessed_text` fields in the datasets. The `description` column is composed of free-text and has `3840` unique descriptions. I consider the description as a unique `field` which will be marked as such during tokenization as done in the FastAI library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>labels</th>\n",
       "      <th>category</th>\n",
       "      <th>description</th>\n",
       "      <th>proc_text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>row_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>2056836</th>\n",
       "      <td>0</td>\n",
       "      <td>Nursing/other</td>\n",
       "      <td>Report</td>\n",
       "      <td>NPN\\n\\n\\n#2 Resp: Breathing comfortably on roo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>854474</th>\n",
       "      <td>0</td>\n",
       "      <td>Radiology</td>\n",
       "      <td>CHEST (PA &amp; LAT)</td>\n",
       "      <td>xxdate xxafter\\n CHEST (PA &amp; LAT)             ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>116729</th>\n",
       "      <td>0</td>\n",
       "      <td>ECG</td>\n",
       "      <td>Report</td>\n",
       "      <td>Baseline artifact is present. Sinus rhythm. Ve...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>722400</th>\n",
       "      <td>0</td>\n",
       "      <td>Nursing</td>\n",
       "      <td>Nursing Progress Note</td>\n",
       "      <td>Altered mental status (not Delirium)\\n   Asses...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>921412</th>\n",
       "      <td>0</td>\n",
       "      <td>Radiology</td>\n",
       "      <td>CT HEAD W/O CONTRAST</td>\n",
       "      <td>xxdate xxafter\\n CT HEAD W/O CONTRAST         ...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         labels       category            description  \\\n",
       "row_id                                                  \n",
       "2056836       0  Nursing/other                 Report   \n",
       "854474        0      Radiology       CHEST (PA & LAT)   \n",
       "116729        0            ECG                 Report   \n",
       "722400        0        Nursing  Nursing Progress Note   \n",
       "921412        0      Radiology   CT HEAD W/O CONTRAST   \n",
       "\n",
       "                                                 proc_text  \n",
       "row_id                                                      \n",
       "2056836  NPN\\n\\n\\n#2 Resp: Breathing comfortably on roo...  \n",
       "854474   xxdate xxafter\\n CHEST (PA & LAT)             ...  \n",
       "116729   Baseline artifact is present. Sinus rhythm. Ve...  \n",
       "722400   Altered mental status (not Delirium)\\n   Asses...  \n",
       "921412   xxdate xxafter\\n CT HEAD W/O CONTRAST         ...  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sub_df = pd.DataFrame({'proc_text': df['proc_text'], 'category': df['category'], 'description': df['description'], 'labels': [0]*len(df)},\\\n",
    "                      columns=['labels', 'category', 'description', 'proc_text'])\n",
    "sub_df.sample(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can just do a train/test split on the entire dataset for getting a 90/10 training and testing dataset. However, I would like the train/test set have a 90%/10% split in **each category**. So I chose to iterate over each entry of the `category` column and create masks to split data with a 90/10 split for training and testing so that I grab 10% of texts in each category for testing instead of a global 10%."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set random seed for reproducible results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1874270 1874065 205\n",
      "208024 208229 -205\n",
      "CPU times: user 3.17 s, sys: 104 ms, total: 3.27 s\n",
      "Wall time: 2.65 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "np.random.seed(42)\n",
    "\n",
    "dfs = [sub_df.loc[df['category'] == c] for c in sub_df['category'].unique()]\n",
    "msks = [np.random.rand(len(d)) < 0.9 for d in dfs]\n",
    "\n",
    "train_dfs = [None] * len(dfs)\n",
    "val_dfs = [None] * len(dfs)\n",
    "\n",
    "for i in range(len(dfs)):\n",
    "    idf = dfs[i]\n",
    "    mask = msks[i]\n",
    "    train_dfs[i] = idf[mask]\n",
    "    val_dfs[i] = idf[~mask]\n",
    "    \n",
    "train_df = pd.concat(train_dfs)\n",
    "val_df = pd.concat(val_dfs)\n",
    "\n",
    "print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))\n",
    "print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10))    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Sanity check the aggregate count for each category over the 3 dataframes. Then write the `train` and `val` dataframes to disk."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>proc_text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Case Management</th>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Consult</th>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Discharge summary</th>\n",
       "      <td>5902</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ECG</th>\n",
       "      <td>20977</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Echo</th>\n",
       "      <td>4446</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>General</th>\n",
       "      <td>801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing</th>\n",
       "      <td>22307</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing/other</th>\n",
       "      <td>82222</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nutrition</th>\n",
       "      <td>963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacy</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Physician</th>\n",
       "      <td>13982</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Radiology</th>\n",
       "      <td>52337</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Rehab Services</th>\n",
       "      <td>522</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Respiratory</th>\n",
       "      <td>3194</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Social Work</th>\n",
       "      <td>260</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  proc_text\n",
       "                      count\n",
       "category                   \n",
       "Case Management          96\n",
       "Consult                  11\n",
       "Discharge summary      5902\n",
       "ECG                   20977\n",
       "Echo                   4446\n",
       "General                 801\n",
       "Nursing               22307\n",
       "Nursing/other         82222\n",
       "Nutrition               963\n",
       "Pharmacy                  4\n",
       "Physician             13982\n",
       "Radiology             52337\n",
       "Rehab Services          522\n",
       "Respiratory            3194\n",
       "Social Work             260"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>proc_text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Case Management</th>\n",
       "      <td>857</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Consult</th>\n",
       "      <td>87</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Discharge summary</th>\n",
       "      <td>53750</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ECG</th>\n",
       "      <td>188074</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Echo</th>\n",
       "      <td>41348</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>General</th>\n",
       "      <td>7435</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing</th>\n",
       "      <td>200875</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing/other</th>\n",
       "      <td>740275</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nutrition</th>\n",
       "      <td>8437</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacy</th>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Physician</th>\n",
       "      <td>127299</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Radiology</th>\n",
       "      <td>469942</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Rehab Services</th>\n",
       "      <td>4886</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Respiratory</th>\n",
       "      <td>28507</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Social Work</th>\n",
       "      <td>2401</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  proc_text\n",
       "                      count\n",
       "category                   \n",
       "Case Management         857\n",
       "Consult                  87\n",
       "Discharge summary     53750\n",
       "ECG                  188074\n",
       "Echo                  41348\n",
       "General                7435\n",
       "Nursing              200875\n",
       "Nursing/other        740275\n",
       "Nutrition              8437\n",
       "Pharmacy                 97\n",
       "Physician            127299\n",
       "Radiology            469942\n",
       "Rehab Services         4886\n",
       "Respiratory           28507\n",
       "Social Work            2401"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>proc_text</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>category</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Case Management</th>\n",
       "      <td>953</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Consult</th>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Discharge summary</th>\n",
       "      <td>59652</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>ECG</th>\n",
       "      <td>209051</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Echo</th>\n",
       "      <td>45794</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>General</th>\n",
       "      <td>8236</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing</th>\n",
       "      <td>223182</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nursing/other</th>\n",
       "      <td>822497</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Nutrition</th>\n",
       "      <td>9400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Pharmacy</th>\n",
       "      <td>101</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Physician</th>\n",
       "      <td>141281</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Radiology</th>\n",
       "      <td>522279</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Rehab Services</th>\n",
       "      <td>5408</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Respiratory</th>\n",
       "      <td>31701</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Social Work</th>\n",
       "      <td>2661</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                  proc_text\n",
       "                      count\n",
       "category                   \n",
       "Case Management         953\n",
       "Consult                  98\n",
       "Discharge summary     59652\n",
       "ECG                  209051\n",
       "Echo                  45794\n",
       "General                8236\n",
       "Nursing              223182\n",
       "Nursing/other        822497\n",
       "Nutrition              9400\n",
       "Pharmacy                101\n",
       "Physician            141281\n",
       "Radiology            522279\n",
       "Rehab Services         5408\n",
       "Respiratory           31701\n",
       "Social Work            2661"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1min 7s, sys: 2.64 s, total: 1min 10s\n",
      "Wall time: 1min 13s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "train_df[['labels', 'description', 'proc_text']].to_csv(PATH/'train.csv', header=False, index=False)\n",
    "val_df[['labels', 'description', 'proc_text']].to_csv(PATH/'test.csv', header=False, index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}