{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# MIMIC Notes Pre-Processing\n",
"\n",
"Pre-processing MIMIC notes for further use."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Below is a list of redacted items with an example and the replacement token. Replacement tokens are changeable. Check `preprocess_notes.py` for more details.\n",
"\n",
"Redacted items:\n",
"* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`\n",
"* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`\n",
"* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`\n",
"* [x] Name: `[**Name (NI) **]`, `xxname`\n",
"* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`\n",
"* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`\n",
"* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`\n",
"* [x] Hospital: `[**Hospital1 **]`, `xxhosp`\n",
"* [x] Hospital Unit Name: `**Hospital Unit Name 10**`, `xxhosp`\n",
"* [x] Company: `[**Company 12924**]`, `xxwork`\n",
"* [x] University/College: `[**University/College **]`, `xxwork`\n",
"* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`\n",
"* [x] Year: `[**Year (4 digits) **]`, `xxyear`\n",
"* [x] Year YYYY format: `[**2119**]`, `xxyear` - I use a regex `\\b\\d{4}\\b` that will match **any** 4 digits which might be problematic, but for the most part 4 digits by itself seems to indicate a year.\n",
"* [x] Date of format M-DD: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`\n",
"* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`\n",
"* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`\n",
"* [x] Holiday: `[**Holiday 3470**]`, `xxhols`\n",
"* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`\n",
"* [x] Country: `[**Country 9958**]`, `xxcntry`\n",
"* [x] State: `[**State 3283**]`, `xxstate`\n",
"* [x] Location: `**Location (un) 2432**`, `xxloc`\n",
"* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`\n",
"* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`\n",
"* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`\n",
"* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`\n",
"* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`\n",
"* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`\n",
"* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`\n",
"* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`\n",
"* [x] Age over 90: `[**Age over 90 **]`, `xxage90`\n",
"* [x] Contact Info: `[**Contact Info **]`, `xxcontact`\n",
"* [x] Job Number: `[**Job Number **]`, `xxjobno`\n",
"* [x] Dictator Number: `[**Dictator Info **]`, `xxdict`\n",
"* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`\n",
"* [x] Time: `12:52 PM`, split into 6 segments by the hour and replace with the following tokens: `midnight, dawn, forenoon, afternoon, dusk, night`\n",
"* 2-digit Numbers: `[** 84 **]`, `xx2digit`\n",
"* 3-digit Numbers: `[** 834 **]`, `xx3digit`\n",
"* Wardname"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`886` notes are marked incorrect with `iserror` flag set to 1. Thus, there are total of `2,082,294` notes. I have set up a `view` called `correctnotes` in the database, which only includese the correct notes. All the data I grab is from that `view`."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Imports and Inits"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import psycopg2\n",
"import numpy as np\n",
"import re\n",
"import random\n",
"import datetime\n",
"from pathlib import Path\n",
"import pickle\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Softlink `ln -s` your data path to a `data` variable in the current folder. That way we don't need to change the path in the notebook."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"PATH = Path('data')\n",
"!ls {PATH}"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from preprocess_notes import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Grab Data from MIMIC"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From Database"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here the data is grabbed from the MIMIC database. Data can also be grabbed from other sources"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-03-18T23:05:14.536531Z",
"start_time": "2018-03-18T23:04:17.500997Z"
},
"hidden": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2082294, 10)\n",
"CPU times: user 12.1 s, sys: 2.58 s, total: 14.7 s\n",
"Wall time: 1min 24s\n"
]
}
],
"source": [
"%%time\n",
"cats = pd.read_csv('note_categories.csv')\n",
"max_limit = -1\n",
"\n",
"queries = []\n",
"for category, n_notes in zip(cats['category'], cats['number_of_notes']):\n",
" limit = min(max_limit, n_notes) if max_limit > 0 else n_notes\n",
" if limit == max_limit:\n",
" q = f\"\"\"\n",
" select * from correctnotes where category=\\'{category}\\' order by random() limit {limit};\n",
" \"\"\"\n",
" else:\n",
" q = f\"\"\"\n",
" select * from correctnotes where category=\\'{category}\\';\n",
" \"\"\"\n",
" queries.append(q)\n",
"\n",
"dfs = []\n",
"\n",
"con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')\n",
"for q in queries:\n",
" df = pd.read_sql_query(q, con)\n",
" dfs.append(df)\n",
"con.close()\n",
" \n",
"df = pd.concat(dfs)\n",
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### From Notes File"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%time\n",
"df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')\n",
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocess"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(2082294, 9)\n"
]
}
],
"source": [
"df.columns = map(str.lower, df.columns)\n",
"df.set_index('row_id', inplace=True)\n",
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Confirm that the number of notes match the actual number."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" text | \n",
"
\n",
" \n",
" | \n",
" count | \n",
"
\n",
" \n",
" category | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Case Management | \n",
" 953 | \n",
"
\n",
" \n",
" Consult | \n",
" 98 | \n",
"
\n",
" \n",
" Discharge summary | \n",
" 59652 | \n",
"
\n",
" \n",
" ECG | \n",
" 209051 | \n",
"
\n",
" \n",
" Echo | \n",
" 45794 | \n",
"
\n",
" \n",
" General | \n",
" 8236 | \n",
"
\n",
" \n",
" Nursing | \n",
" 223182 | \n",
"
\n",
" \n",
" Nursing/other | \n",
" 822497 | \n",
"
\n",
" \n",
" Nutrition | \n",
" 9400 | \n",
"
\n",
" \n",
" Pharmacy | \n",
" 101 | \n",
"
\n",
" \n",
" Physician | \n",
" 141281 | \n",
"
\n",
" \n",
" Radiology | \n",
" 522279 | \n",
"
\n",
" \n",
" Rehab Services | \n",
" 5408 | \n",
"
\n",
" \n",
" Respiratory | \n",
" 31701 | \n",
"
\n",
" \n",
" Social Work | \n",
" 2661 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" text\n",
" count\n",
"category \n",
"Case Management 953\n",
"Consult 98\n",
"Discharge summary 59652\n",
"ECG 209051\n",
"Echo 45794\n",
"General 8236\n",
"Nursing 223182\n",
"Nursing/other 822497\n",
"Nutrition 9400\n",
"Pharmacy 101\n",
"Physician 141281\n",
"Radiology 522279\n",
"Rehab Services 5408\n",
"Respiratory 31701\n",
"Social Work 2661"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df[['category', 'text']].groupby(['category']).agg(['count'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 24min 5s, sys: 3.86 s, total: 24min 9s\n",
"Wall time: 24min 55s\n"
]
}
],
"source": [
"%%time\n",
"df['proc_text'] = df['text'].apply(preprocess_note)\n",
"\n",
"with open(PATH/'preprocessed_noteevents.pkl', 'wb') as f:\n",
" pickle.dump(df, f)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create datasets for Language Modeling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To follow the FastAI language modeling lesson, I've created a subset of the original dataframe to sample for the datasets. In particular, I've included the `description` and `preprocessed_text` fields in the datasets. The `description` column is composed of free-text and has `3840` unique descriptions. I consider the description as a unique `field` which will be marked as such during tokenization as done in the FastAI library."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" labels | \n",
" category | \n",
" description | \n",
" proc_text | \n",
"
\n",
" \n",
" row_id | \n",
" | \n",
" | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" 2056836 | \n",
" 0 | \n",
" Nursing/other | \n",
" Report | \n",
" NPN\\n\\n\\n#2 Resp: Breathing comfortably on roo... | \n",
"
\n",
" \n",
" 854474 | \n",
" 0 | \n",
" Radiology | \n",
" CHEST (PA & LAT) | \n",
" xxdate xxafter\\n CHEST (PA & LAT) ... | \n",
"
\n",
" \n",
" 116729 | \n",
" 0 | \n",
" ECG | \n",
" Report | \n",
" Baseline artifact is present. Sinus rhythm. Ve... | \n",
"
\n",
" \n",
" 722400 | \n",
" 0 | \n",
" Nursing | \n",
" Nursing Progress Note | \n",
" Altered mental status (not Delirium)\\n Asses... | \n",
"
\n",
" \n",
" 921412 | \n",
" 0 | \n",
" Radiology | \n",
" CT HEAD W/O CONTRAST | \n",
" xxdate xxafter\\n CT HEAD W/O CONTRAST ... | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" labels category description \\\n",
"row_id \n",
"2056836 0 Nursing/other Report \n",
"854474 0 Radiology CHEST (PA & LAT) \n",
"116729 0 ECG Report \n",
"722400 0 Nursing Nursing Progress Note \n",
"921412 0 Radiology CT HEAD W/O CONTRAST \n",
"\n",
" proc_text \n",
"row_id \n",
"2056836 NPN\\n\\n\\n#2 Resp: Breathing comfortably on roo... \n",
"854474 xxdate xxafter\\n CHEST (PA & LAT) ... \n",
"116729 Baseline artifact is present. Sinus rhythm. Ve... \n",
"722400 Altered mental status (not Delirium)\\n Asses... \n",
"921412 xxdate xxafter\\n CT HEAD W/O CONTRAST ... "
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sub_df = pd.DataFrame({'proc_text': df['proc_text'], 'category': df['category'], 'description': df['description'], 'labels': [0]*len(df)},\\\n",
" columns=['labels', 'category', 'description', 'proc_text'])\n",
"sub_df.sample(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we can just do a train/test split on the entire dataset for getting a 90/10 training and testing dataset. However, I would like the train/test set have a 90%/10% split in **each category**. So I chose to iterate over each entry of the `category` column and create masks to split data with a 90/10 split for training and testing so that I grab 10% of texts in each category for testing instead of a global 10%."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set random seed for reproducible results."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1874270 1874065 205\n",
"208024 208229 -205\n",
"CPU times: user 3.17 s, sys: 104 ms, total: 3.27 s\n",
"Wall time: 2.65 s\n"
]
}
],
"source": [
"%%time\n",
"np.random.seed(42)\n",
"\n",
"dfs = [sub_df.loc[df['category'] == c] for c in sub_df['category'].unique()]\n",
"msks = [np.random.rand(len(d)) < 0.9 for d in dfs]\n",
"\n",
"train_dfs = [None] * len(dfs)\n",
"val_dfs = [None] * len(dfs)\n",
"\n",
"for i in range(len(dfs)):\n",
" idf = dfs[i]\n",
" mask = msks[i]\n",
" train_dfs[i] = idf[mask]\n",
" val_dfs[i] = idf[~mask]\n",
" \n",
"train_df = pd.concat(train_dfs)\n",
"val_df = pd.concat(val_dfs)\n",
"\n",
"print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))\n",
"print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10)) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sanity check the aggregate count for each category over the 3 dataframes. Then write the `train` and `val` dataframes to disk."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" proc_text | \n",
"
\n",
" \n",
" | \n",
" count | \n",
"
\n",
" \n",
" category | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Case Management | \n",
" 96 | \n",
"
\n",
" \n",
" Consult | \n",
" 11 | \n",
"
\n",
" \n",
" Discharge summary | \n",
" 5902 | \n",
"
\n",
" \n",
" ECG | \n",
" 20977 | \n",
"
\n",
" \n",
" Echo | \n",
" 4446 | \n",
"
\n",
" \n",
" General | \n",
" 801 | \n",
"
\n",
" \n",
" Nursing | \n",
" 22307 | \n",
"
\n",
" \n",
" Nursing/other | \n",
" 82222 | \n",
"
\n",
" \n",
" Nutrition | \n",
" 963 | \n",
"
\n",
" \n",
" Pharmacy | \n",
" 4 | \n",
"
\n",
" \n",
" Physician | \n",
" 13982 | \n",
"
\n",
" \n",
" Radiology | \n",
" 52337 | \n",
"
\n",
" \n",
" Rehab Services | \n",
" 522 | \n",
"
\n",
" \n",
" Respiratory | \n",
" 3194 | \n",
"
\n",
" \n",
" Social Work | \n",
" 260 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" proc_text\n",
" count\n",
"category \n",
"Case Management 96\n",
"Consult 11\n",
"Discharge summary 5902\n",
"ECG 20977\n",
"Echo 4446\n",
"General 801\n",
"Nursing 22307\n",
"Nursing/other 82222\n",
"Nutrition 963\n",
"Pharmacy 4\n",
"Physician 13982\n",
"Radiology 52337\n",
"Rehab Services 522\n",
"Respiratory 3194\n",
"Social Work 260"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" proc_text | \n",
"
\n",
" \n",
" | \n",
" count | \n",
"
\n",
" \n",
" category | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Case Management | \n",
" 857 | \n",
"
\n",
" \n",
" Consult | \n",
" 87 | \n",
"
\n",
" \n",
" Discharge summary | \n",
" 53750 | \n",
"
\n",
" \n",
" ECG | \n",
" 188074 | \n",
"
\n",
" \n",
" Echo | \n",
" 41348 | \n",
"
\n",
" \n",
" General | \n",
" 7435 | \n",
"
\n",
" \n",
" Nursing | \n",
" 200875 | \n",
"
\n",
" \n",
" Nursing/other | \n",
" 740275 | \n",
"
\n",
" \n",
" Nutrition | \n",
" 8437 | \n",
"
\n",
" \n",
" Pharmacy | \n",
" 97 | \n",
"
\n",
" \n",
" Physician | \n",
" 127299 | \n",
"
\n",
" \n",
" Radiology | \n",
" 469942 | \n",
"
\n",
" \n",
" Rehab Services | \n",
" 4886 | \n",
"
\n",
" \n",
" Respiratory | \n",
" 28507 | \n",
"
\n",
" \n",
" Social Work | \n",
" 2401 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" proc_text\n",
" count\n",
"category \n",
"Case Management 857\n",
"Consult 87\n",
"Discharge summary 53750\n",
"ECG 188074\n",
"Echo 41348\n",
"General 7435\n",
"Nursing 200875\n",
"Nursing/other 740275\n",
"Nutrition 8437\n",
"Pharmacy 97\n",
"Physician 127299\n",
"Radiology 469942\n",
"Rehab Services 4886\n",
"Respiratory 28507\n",
"Social Work 2401"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" proc_text | \n",
"
\n",
" \n",
" | \n",
" count | \n",
"
\n",
" \n",
" category | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" Case Management | \n",
" 953 | \n",
"
\n",
" \n",
" Consult | \n",
" 98 | \n",
"
\n",
" \n",
" Discharge summary | \n",
" 59652 | \n",
"
\n",
" \n",
" ECG | \n",
" 209051 | \n",
"
\n",
" \n",
" Echo | \n",
" 45794 | \n",
"
\n",
" \n",
" General | \n",
" 8236 | \n",
"
\n",
" \n",
" Nursing | \n",
" 223182 | \n",
"
\n",
" \n",
" Nursing/other | \n",
" 822497 | \n",
"
\n",
" \n",
" Nutrition | \n",
" 9400 | \n",
"
\n",
" \n",
" Pharmacy | \n",
" 101 | \n",
"
\n",
" \n",
" Physician | \n",
" 141281 | \n",
"
\n",
" \n",
" Radiology | \n",
" 522279 | \n",
"
\n",
" \n",
" Rehab Services | \n",
" 5408 | \n",
"
\n",
" \n",
" Respiratory | \n",
" 31701 | \n",
"
\n",
" \n",
" Social Work | \n",
" 2661 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" proc_text\n",
" count\n",
"category \n",
"Case Management 953\n",
"Consult 98\n",
"Discharge summary 59652\n",
"ECG 209051\n",
"Echo 45794\n",
"General 8236\n",
"Nursing 223182\n",
"Nursing/other 822497\n",
"Nutrition 9400\n",
"Pharmacy 101\n",
"Physician 141281\n",
"Radiology 522279\n",
"Rehab Services 5408\n",
"Respiratory 31701\n",
"Social Work 2661"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1min 7s, sys: 2.64 s, total: 1min 10s\n",
"Wall time: 1min 13s\n"
]
}
],
"source": [
"%%time\n",
"train_df[['labels', 'description', 'proc_text']].to_csv(PATH/'train.csv', header=False, index=False)\n",
"val_df[['labels', 'description', 'proc_text']].to_csv(PATH/'test.csv', header=False, index=False)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}