{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# MIMIC Notes Pre-Processing\n", "\n", "Pre-processing MIMIC notes for further use." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a list of redacted items with an example and the replacement token. Replacement tokens are changeable. Check `preprocess_notes.py` for more details.\n", "\n", "Redacted items:\n", "* [x] First Name: `[**First Name (Titles) 137**]`, `xxname`\n", "* [x] Last Name: `[**Last Name (Titles) **]`, `xxln`\n", "* [x] Initials: `[**Initials (NamePattern4) **]`, `xxinit`\n", "* [x] Name: `[**Name (NI) **]`, `xxname`\n", "* [x] Doctor First Name: `[**Doctor First Name 1266**]`, `xxdocfn`\n", "* [x] Doctor Last Name: `[**Doctor Last Name 1266**]`, `xxdocln`\n", "* [x] Known Last Name: `[**Known lastname 658**]`, `xxln`\n", "* [x] Hospital: `[**Hospital1 **]`, `xxhosp`\n", "* [x] Hospital Unit Name: `**Hospital Unit Name 10**`, `xxhosp`\n", "* [x] Company: `[**Company 12924**]`, `xxwork`\n", "* [x] University/College: `[**University/College **]`, `xxwork`\n", "* [x] Date of format YYYY-M-DD: `[**2112-4-18**]`, `xxdate`\n", "* [x] Year: `[**Year (4 digits) **]`, `xxyear`\n", "* [x] Year YYYY format: `[**2119**]`, `xxyear` - I use a regex `\\b\\d{4}\\b` that will match **any** 4 digits which might be problematic, but for the most part 4 digits by itself seems to indicate a year.\n", "* [x] Date of format M-DD: `[**6-12**]`, `[**12/2151**]`, `xxmmdd`\n", "* [x] Month/Day: `[**Month/Day (2) 509**]`, `xxmmdd`\n", "* [x] Month (only): `[**Month (only) 51**]`, `xxmonth`\n", "* [x] Holiday: `[**Holiday 3470**]`, `xxhols`\n", "* [x] Date Range: `[**Date range (1) 7610**]`, `xxdtrnge`\n", "* [x] Country: `[**Country 9958**]`, `xxcntry`\n", "* [x] State: `[**State 3283**]`, `xxstate`\n", "* [x] Location: `**Location (un) 2432**`, `xxloc`\n", "* [x] Telephone/Fax: `[**Telephone/Fax (3) 8049**]`, `xxph`\n", "* [x] Clip Number: `[**Clip Number (Radiology) 29923**]`, `xxradclip`\n", "* [x] Pager Numeric Identifier: `[**Numeric Identifier 6403**]`, `xxpager`\n", "* [x] Pager Number: `[**Pager number 13866**]`, `xxpager`\n", "* [x] Social Security Number: `[**Security Number 10198**]`, `xxssn`\n", "* [x] Serial Number: `[**Serial Number 3567**]`, `xxsno`\n", "* [x] Medical Record Number: `[**Medical Record Number **]`, `xxmrno`\n", "* [x] Provider Number: `[**Provider Number 12521**]`, `xxpno`\n", "* [x] Age over 90: `[**Age over 90 **]`, `xxage90`\n", "* [x] Contact Info: `[**Contact Info **]`, `xxcontact`\n", "* [x] Job Number: `[**Job Number **]`, `xxjobno`\n", "* [x] Dictator Number: `[**Dictator Info **]`, `xxdict`\n", "* [x] Pharmacy MD Number/MD number: `[**Pharmacy MD Number **]`, `xxmdno`\n", "* [x] Time: `12:52 PM`, split into 6 segments by the hour and replace with the following tokens: `midnight, dawn, forenoon, afternoon, dusk, night`\n", "* 2-digit Numbers: `[** 84 **]`, `xx2digit`\n", "* 3-digit Numbers: `[** 834 **]`, `xx3digit`\n", "* Wardname" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`886` notes are marked incorrect with `iserror` flag set to 1. Thus, there are total of `2,082,294` notes. I have set up a `view` called `correctnotes` in the database, which only includese the correct notes. All the data I grab is from that `view`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Imports and Inits" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import psycopg2\n", "import numpy as np\n", "import re\n", "import random\n", "import datetime\n", "from pathlib import Path\n", "import pickle\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Softlink `ln -s` your data path to a `data` variable in the current folder. That way we don't need to change the path in the notebook." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "PATH = Path('data')\n", "!ls {PATH}" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from preprocess_notes import *" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grab Data from MIMIC" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From Database" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here the data is grabbed from the MIMIC database. Data can also be grabbed from other sources" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2018-03-18T23:05:14.536531Z", "start_time": "2018-03-18T23:04:17.500997Z" }, "hidden": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2082294, 10)\n", "CPU times: user 12.1 s, sys: 2.58 s, total: 14.7 s\n", "Wall time: 1min 24s\n" ] } ], "source": [ "%%time\n", "cats = pd.read_csv('note_categories.csv')\n", "max_limit = -1\n", "\n", "queries = []\n", "for category, n_notes in zip(cats['category'], cats['number_of_notes']):\n", " limit = min(max_limit, n_notes) if max_limit > 0 else n_notes\n", " if limit == max_limit:\n", " q = f\"\"\"\n", " select * from correctnotes where category=\\'{category}\\' order by random() limit {limit};\n", " \"\"\"\n", " else:\n", " q = f\"\"\"\n", " select * from correctnotes where category=\\'{category}\\';\n", " \"\"\"\n", " queries.append(q)\n", "\n", "dfs = []\n", "\n", "con = psycopg2.connect(dbname='mimic', user='sudarshan', host='/var/run/postgresql')\n", "for q in queries:\n", " df = pd.read_sql_query(q, con)\n", " dfs.append(df)\n", "con.close()\n", " \n", "df = pd.concat(dfs)\n", "print(df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### From Notes File" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%time\n", "df = pd.read_csv(PATH/'NOTEEVENTS.csv.gz')\n", "print(df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2082294, 9)\n" ] } ], "source": [ "df.columns = map(str.lower, df.columns)\n", "df.set_index('row_id', inplace=True)\n", "print(df.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Confirm that the number of notes match the actual number." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
text
count
category
Case Management953
Consult98
Discharge summary59652
ECG209051
Echo45794
General8236
Nursing223182
Nursing/other822497
Nutrition9400
Pharmacy101
Physician141281
Radiology522279
Rehab Services5408
Respiratory31701
Social Work2661
\n", "
" ], "text/plain": [ " text\n", " count\n", "category \n", "Case Management 953\n", "Consult 98\n", "Discharge summary 59652\n", "ECG 209051\n", "Echo 45794\n", "General 8236\n", "Nursing 223182\n", "Nursing/other 822497\n", "Nutrition 9400\n", "Pharmacy 101\n", "Physician 141281\n", "Radiology 522279\n", "Rehab Services 5408\n", "Respiratory 31701\n", "Social Work 2661" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['category', 'text']].groupby(['category']).agg(['count'])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 24min 5s, sys: 3.86 s, total: 24min 9s\n", "Wall time: 24min 55s\n" ] } ], "source": [ "%%time\n", "df['proc_text'] = df['text'].apply(preprocess_note)\n", "\n", "with open(PATH/'preprocessed_noteevents.pkl', 'wb') as f:\n", " pickle.dump(df, f)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create datasets for Language Modeling" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To follow the FastAI language modeling lesson, I've created a subset of the original dataframe to sample for the datasets. In particular, I've included the `description` and `preprocessed_text` fields in the datasets. The `description` column is composed of free-text and has `3840` unique descriptions. I consider the description as a unique `field` which will be marked as such during tokenization as done in the FastAI library." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labelscategorydescriptionproc_text
row_id
20568360Nursing/otherReportNPN\\n\\n\\n#2 Resp: Breathing comfortably on roo...
8544740RadiologyCHEST (PA & LAT)xxdate xxafter\\n CHEST (PA & LAT) ...
1167290ECGReportBaseline artifact is present. Sinus rhythm. Ve...
7224000NursingNursing Progress NoteAltered mental status (not Delirium)\\n Asses...
9214120RadiologyCT HEAD W/O CONTRASTxxdate xxafter\\n CT HEAD W/O CONTRAST ...
\n", "
" ], "text/plain": [ " labels category description \\\n", "row_id \n", "2056836 0 Nursing/other Report \n", "854474 0 Radiology CHEST (PA & LAT) \n", "116729 0 ECG Report \n", "722400 0 Nursing Nursing Progress Note \n", "921412 0 Radiology CT HEAD W/O CONTRAST \n", "\n", " proc_text \n", "row_id \n", "2056836 NPN\\n\\n\\n#2 Resp: Breathing comfortably on roo... \n", "854474 xxdate xxafter\\n CHEST (PA & LAT) ... \n", "116729 Baseline artifact is present. Sinus rhythm. Ve... \n", "722400 Altered mental status (not Delirium)\\n Asses... \n", "921412 xxdate xxafter\\n CT HEAD W/O CONTRAST ... " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sub_df = pd.DataFrame({'proc_text': df['proc_text'], 'category': df['category'], 'description': df['description'], 'labels': [0]*len(df)},\\\n", " columns=['labels', 'category', 'description', 'proc_text'])\n", "sub_df.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can just do a train/test split on the entire dataset for getting a 90/10 training and testing dataset. However, I would like the train/test set have a 90%/10% split in **each category**. So I chose to iterate over each entry of the `category` column and create masks to split data with a 90/10 split for training and testing so that I grab 10% of texts in each category for testing instead of a global 10%." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Set random seed for reproducible results." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1874270 1874065 205\n", "208024 208229 -205\n", "CPU times: user 3.17 s, sys: 104 ms, total: 3.27 s\n", "Wall time: 2.65 s\n" ] } ], "source": [ "%%time\n", "np.random.seed(42)\n", "\n", "dfs = [sub_df.loc[df['category'] == c] for c in sub_df['category'].unique()]\n", "msks = [np.random.rand(len(d)) < 0.9 for d in dfs]\n", "\n", "train_dfs = [None] * len(dfs)\n", "val_dfs = [None] * len(dfs)\n", "\n", "for i in range(len(dfs)):\n", " idf = dfs[i]\n", " mask = msks[i]\n", " train_dfs[i] = idf[mask]\n", " val_dfs[i] = idf[~mask]\n", " \n", "train_df = pd.concat(train_dfs)\n", "val_df = pd.concat(val_dfs)\n", "\n", "print(len(train_df), (len(df) - len(df)//10), len(train_df)-(len(df) - len(df)//10))\n", "print(len(val_df), (len(df)//10), len(val_df)-(len(df)//10)) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sanity check the aggregate count for each category over the 3 dataframes. Then write the `train` and `val` dataframes to disk." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
proc_text
count
category
Case Management96
Consult11
Discharge summary5902
ECG20977
Echo4446
General801
Nursing22307
Nursing/other82222
Nutrition963
Pharmacy4
Physician13982
Radiology52337
Rehab Services522
Respiratory3194
Social Work260
\n", "
" ], "text/plain": [ " proc_text\n", " count\n", "category \n", "Case Management 96\n", "Consult 11\n", "Discharge summary 5902\n", "ECG 20977\n", "Echo 4446\n", "General 801\n", "Nursing 22307\n", "Nursing/other 82222\n", "Nutrition 963\n", "Pharmacy 4\n", "Physician 13982\n", "Radiology 52337\n", "Rehab Services 522\n", "Respiratory 3194\n", "Social Work 260" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "val_df[['category', 'proc_text']].groupby(['category']).agg(['count'])" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
proc_text
count
category
Case Management857
Consult87
Discharge summary53750
ECG188074
Echo41348
General7435
Nursing200875
Nursing/other740275
Nutrition8437
Pharmacy97
Physician127299
Radiology469942
Rehab Services4886
Respiratory28507
Social Work2401
\n", "
" ], "text/plain": [ " proc_text\n", " count\n", "category \n", "Case Management 857\n", "Consult 87\n", "Discharge summary 53750\n", "ECG 188074\n", "Echo 41348\n", "General 7435\n", "Nursing 200875\n", "Nursing/other 740275\n", "Nutrition 8437\n", "Pharmacy 97\n", "Physician 127299\n", "Radiology 469942\n", "Rehab Services 4886\n", "Respiratory 28507\n", "Social Work 2401" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "train_df[['category', 'proc_text']].groupby(['category']).agg(['count'])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
proc_text
count
category
Case Management953
Consult98
Discharge summary59652
ECG209051
Echo45794
General8236
Nursing223182
Nursing/other822497
Nutrition9400
Pharmacy101
Physician141281
Radiology522279
Rehab Services5408
Respiratory31701
Social Work2661
\n", "
" ], "text/plain": [ " proc_text\n", " count\n", "category \n", "Case Management 953\n", "Consult 98\n", "Discharge summary 59652\n", "ECG 209051\n", "Echo 45794\n", "General 8236\n", "Nursing 223182\n", "Nursing/other 822497\n", "Nutrition 9400\n", "Pharmacy 101\n", "Physician 141281\n", "Radiology 522279\n", "Rehab Services 5408\n", "Respiratory 31701\n", "Social Work 2661" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sub_df[['category', 'proc_text']].groupby(['category']).agg(['count'])" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1min 7s, sys: 2.64 s, total: 1min 10s\n", "Wall time: 1min 13s\n" ] } ], "source": [ "%%time\n", "train_df[['labels', 'description', 'proc_text']].to_csv(PATH/'train.csv', header=False, index=False)\n", "val_df[['labels', 'description', 'proc_text']].to_csv(PATH/'test.csv', header=False, index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }