{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## AI for Medicine Course 3 Week 2 lecture notebook - Cleaning Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this notebook you'll be using the `re` module, which is part of Python's Standard Library and provides support for regular expressions (which you may know as `regexp`). \n", "- If you aren't familiar with `regexp`, we recommend checking the [documentation](https://docs.python.org/3/library/re.html).\n", "\n", "Regular expressions allow you to perform searches and replacements in strings based on patterns. Let's start by looking at some examples.\n", "\n", "You'll be using the search method, which looks like this: \n", "\n", "```python\n", "search(pattern, text)\n", "```\n", "It will output a match if one is found, or None otherwise.\n", "\n", "For the following three examples, you'll try to match the pattern to the string \"Pleural Effusion.\" Take note of these special characters:\n", "\n", "- ^ denotes \"starts with\" followed by the pattern\n", "- $ denotes \"ends with\" preceded by the pattern\n", "- | denotes \"or\" followed by another pattern\n", "\n", "Go ahead and import the `re` library, then run the following three cells. Can you see why the first two examples output a match, while the third one does not?" 
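, "\n",
"\n",
"Before running them, here is a minimal sketch of how `^`, `$` and `|` combine. The pattern and the three report snippets are hypothetical, chosen only for illustration. Because `|` binds more loosely than the anchors, `^No|mal$` reads as \"starts with 'No'\" or \"ends with 'mal'\".\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical snippets, not taken from the course data\n",
"for text in ['Normal heart size', 'Heart size is normal', 'Abnormal heart']:\n",
"    m = re.search('^No|mal$', text)\n",
"    # group(0) is the matched substring; m is None when nothing matches\n",
"    print(text, '->', m.group(0) if m else None)\n",
"```\n",
"\n",
"This should print `No`, `mal` and `None` respectively: the last snippet neither starts with 'No' nor ends with 'mal'."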
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import Library" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examples of Search Patterns" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Pl\n" ] } ], "source": [ "# Search if string starts with 'Pl' or ends with 'ion'\n", "m = re.search(pattern = \"^Pl|ion$\", string = \"Pleural Effusion\")\n", "\n", "# return the matched string\n", "if m:\n", " print(m.group(0))\n", "else:\n", " print(None)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ion\n" ] } ], "source": [ "# Search if string starts with 'Sa' or ends in 'ion'\n", "m = re.search(pattern = \"^Sa|ion$\", string = \"Pleural Effusion\")\n", "\n", "# return the matched string\n", "if m:\n", " print(m.group(0))\n", "else:\n", " print(None)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "None\n" ] } ], "source": [ "# Search if string starts with 'Eff'\n", "m = re.search(pattern=\"^Eff\", string=\"Pleural Effusion\")\n", "\n", "# return the matched string\n", "if m:\n", " print(m.group(0))\n", "else:\n", " print(None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that even though 'Eff' exists in the string, the string does not begin with 'Eff', so there is no match." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's try a more advanced example. 
Your goal here is to match the pattern \"a single letter of the alphabet\" followed by a number.\n", "\n", "### Characters in a set [a-zA-Z]\n", "[ ]\n", "\n", "- [a-z] matches lowercase letters\n", "- [A-Z] matches uppercase letters\n", "- [a-zA-Z] matches lowercase and uppercase letters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can match any lowercase or uppercase single letter followed by '123' like this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "C123\n" ] } ], "source": [ "# Match a letter followed by 123\n", "m = re.search(pattern ='[a-zA-Z]123', string = \"99C123\")\n", "print(f\"{m.group(0)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how the match includes the letter 'C' in C123." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 'Lookbehind' assertion\n", "\n", "If you want to match the single letter but not include it in the returned match, you can use the lookbehind assertion\n", "\n", "(?<=...)\n", "\n", "Here is the documentation for lookbehind:\n", "\n", ">Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in 'abcdef', since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. 
Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:\n", "\n", "Note that you'll need to put `?<=[a-zA-Z]` within parentheses, like this: `(?<=[a-zA-Z])`; otherwise you'll see an error message.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "123\n" ] } ], "source": [ "# Match a letter followed by 123, exclude the letter.\n", "m = re.search(pattern = '(?<=[a-zA-Z])123', string = \"99C12399\")\n", "print(f\"{m.group(0)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice the difference here. The match is returned because '123' is preceded by the letter 'C', but 'C' is not returned as part of the matched substring. You're 'looking behind' without including what you find there in the returned substring." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Match 123 followed by a letter\n", "\n", "Similarly, you can match '123' followed by a letter, including the letter." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "123C\n" ] } ], "source": [ "# Match 123 followed by a letter\n", "m = re.search(pattern = '123[a-zA-Z]', string = \"99123C99\")\n", "print(f\"{m.group(0)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the letter 'C' is included in the returned match." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 'Lookahead' assertion\n", "\n", "Similarly, you can match '123' followed by a letter, but exclude the letter from the match. \n", "- You can do this by using the lookahead assertion:\n", "\n", "(?=...)\n", "\n", "Here is the documentation:\n", ">Matches if ... matches next, but doesn’t consume any of the string. 
This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.\n", "\n", "As with the lookbehind, you'll need to wrap `?=[a-zA-Z]` in parentheses, like this: `(?=[a-zA-Z])`, to avoid an error message." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "123\n" ] } ], "source": [ "# Match 123 followed by a letter, exclude the letter from returned match.\n", "m = re.search(pattern = '123(?=[a-zA-Z])', string = \"99123C99\")\n", "print(f\"{m.group(0)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the returned match does not include the letter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### String Cleaning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's implement a `clean()` function. It should receive a sentence as input, clean it up, and return the cleaned version. \"Cleaning\" in this case refers to:\n", "\n", " 1. Convert to lowercase\n", " 2. Change \"and/or\" to \"or\"\n", " 3. Change \"/\" to \"or\" when it separates two alternative words, such as tomatos/tomatoes\n", " 4. Replace double periods \"..\" with a single period \".\"\n", " 5. Insert the appropriate space after periods or commas\n", " 6. 
Convert multiple whitespaces to a single whitespace\n", "\n", "Let's take this one step at a time, and pay attention to how the sentence changes along the way.\n", "\n", "Here's the sample sentence:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Choose a sentence to be cleaned\n", "sentence = \" BIBASILAR OPACITIES,likely representing bilateral pleural effusions with ATELECTASIS and/or PNEUMONIA/bronchopneumonia..\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 1: lowercase\n", "Now, use the built-in `lower()` method to change all characters of a string to lowercase. Quick and easy!" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' bibasilar opacities,likely representing bilateral pleural effusions with atelectasis and/or pneumonia/bronchopneumonia..'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Convert to all lowercase letters\n", "sentence = sentence.lower()\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 2: and/or -> or\n", "The `re` module provides the `sub()` method, which substitutes patterns in a string with another string. Here you'll be looking for 'and/or' and replacing it with just 'or'." 
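, "\n",
"\n",
"As a rough sketch of how `sub()` behaves (the sample text is hypothetical): it returns a new string with every non-overlapping match replaced, and the optional `count` argument limits how many replacements are made.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"text = 'fever and/or cough and/or chills'  # hypothetical sample\n",
"print(re.sub('and/or', 'or', text))           # fever or cough or chills\n",
"print(re.sub('and/or', 'or', text, count=1))  # fever or cough and/or chills\n",
"print(text)  # the original string is left unchanged\n",
"```"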
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' bibasilar opacities,likely representing bilateral pleural effusions with atelectasis or pneumonia/bronchopneumonia..'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence = re.sub('and/or', 'or', sentence)\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 3: / -> or\n", "\n", "Combine the lookbehind and lookahead assertions from earlier so that only a '/' sitting directly between two letters is replaced with ' or '." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' bibasilar opacities,likely representing bilateral pleural effusions with atelectasis or pneumonia or bronchopneumonia..'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sentence = re.sub('(?<=[a-zA-Z])/(?=[a-zA-Z])', ' or ', sentence)\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 4: .. -> .\n", "\n", "To replace a fixed string, you can also use Python's built-in `replace()` method, which needs no escaping.\n", "\n", "With `re.sub()`, a special character such as `.` must be escaped with a backslash to match the actual period character: `\\.`" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " bibasilar opacities,likely representing bilateral pleural effusions with atelectasis or pneumonia or bronchopneumonia.\n", " bibasilar opacities,likely representing bilateral pleural effusions with atelectasis or pneumonia or bronchopneumonia.\n" ] } ], "source": [ "# Replace .. with . using re.sub (option 1)\n", "tmp1 = re.sub(r\"\\.\\.\", \".\", sentence)\n", "print(tmp1)\n", "\n", "# Replace .. with . 
using string.replace (option 2)\n", "tmp2 = sentence.replace('..','.')\n", "print(tmp2)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' bibasilar opacities,likely representing bilateral pleural effusions with atelectasis or pneumonia or bronchopneumonia.'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Replace .. with . using string.replace\n", "sentence = sentence.replace(\"..\", \".\")\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 5: add whitespace after punctuation\n", "\n", "For step 5, let's use a built-in string method, `translate()`. \n", "- It returns a copy of your string in which each character is mapped through a translation table that you define. It's usually used alongside the `maketrans()` method, and you can read about both of them [here](https://docs.python.org/3/library/stdtypes.html#str.translate). " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'!': '!!!', 'z': 's'}\n", "{33: '!!!', 122: 's'}\n" ] } ], "source": [ "# Define a dictionary to specify that '!' is replaced by '!!!'\n", "# and 'z' is replaced by 's'\n", "translation_dict = {'!': '!!!',\n", " 'z': 's'\n", " }\n", "print(translation_dict)\n", "# Create the translation table\n", "translation_tbl = str.maketrans(translation_dict)\n", "print(translation_tbl)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that each key in the dictionary must be a single character, that is, a string of length 1 (the key can't be a word of length 2 or more)."
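, "\n",
"\n",
"A minimal sketch of that restriction, using a hypothetical two-character key: `str.maketrans()` is expected to raise a `ValueError` here, so multi-character substitutions are better left to `str.replace()` or `re.sub()`.\n",
"\n",
"```python\n",
"# A two-character key is rejected when the table is built\n",
"try:\n",
"    str.maketrans({'zz': 's'})\n",
"except ValueError as err:\n",
"    print(f'ValueError: {err}')\n",
"\n",
"# str.replace() has no such limit on the length of the search string\n",
"print('jazz'.replace('zz', 'ss'))  # jass\n",
"```"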
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "colonization, realization, organization!\n", "colonisation, realisation, organisation!!!\n" ] } ], "source": [ "# Choose a string to be translated\n", "tmp_str = \"colonization, realization, organization!\"\n", "print(tmp_str)\n", "\n", "# Translate the string using the translation table\n", "tmp_str2 = tmp_str.translate(translation_tbl)\n", "print(tmp_str2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice how 'z' is replaced by 's', and '!' is replaced by '!!!'." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Add whitespace after punctuation.\n", "- Now apply this to replace '.' with '. ', so that the period is followed by a space.\n", "- Similarly, replace ',' with ', ', so that the comma is followed by a space." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'.': '. ', ',': ', '}" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "key: |.| \tval:|. |\n", "key: |,| \tval:|, |\n" ] } ], "source": [ "# Create the translation dictionary using a dictionary comprehension\n", "translation_dict = {key: f\"{key} \" for key in \".,\"}\n", "\n", "# View the translation dictionary\n", "display(translation_dict)\n", "\n", "# View the translation dictionary with some formatting for easier reading\n", "# Use vertical bars to help see the whitespace more easily.\n", "for key, val in translation_dict.items():\n", "    print(f\"key: |{key}| \\tval:|{val}|\")" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "' bibasilar opacities, likely representing bilateral pleural effusions with atelectasis or pneumonia or bronchopneumonia. 
'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Create the translation table using the translation dictionary\n", "punctuation_spacer = str.maketrans(translation_dict)\n", "\n", "# Apply the translation table to add whitespace after punctuation\n", "sentence = sentence.translate(punctuation_spacer)\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Step 6: trim whitespace\n", "\n", "Nice! For step 6, you can collapse runs of whitespace by splitting the string with `split()` and rejoining the tokens with `join()`. Side note: this can also be done with a regular expression." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['bibasilar',\n", " 'opacities,',\n", " 'likely',\n", " 'representing',\n", " 'bilateral',\n", " 'pleural',\n", " 'effusions',\n", " 'with',\n", " 'atelectasis',\n", " 'or',\n", " 'pneumonia',\n", " 'or',\n", " 'bronchopneumonia.']" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Split the string using whitespace as the delimiter\n", "# This removes all whitespace between words\n", "sentence_list = sentence.split()\n", "sentence_list" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'bibasilar opacities, likely representing bilateral pleural effusions with atelectasis or pneumonia or bronchopneumonia.'" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Join the tokens with a single whitespace.\n", "# This ensures that there is a single whitespace between words\n", "sentence = ' '.join(sentence_list)\n", "sentence" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sentence is now cleaner and easier to work with!"
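, "\n",
"\n",
"Step 6 can also be written with a regular expression. A minimal sketch, assuming a hypothetical input string: `\\s+` matches any run of whitespace, and `strip()` removes what is left at the two ends.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"messy = '  clear lungs,   no   effusion.  '  # hypothetical input\n",
"# Collapse each run of whitespace to one space, then trim the ends\n",
"print(re.sub(r'\\s+', ' ', messy).strip())  # clear lungs, no effusion.\n",
"```"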
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting it all together\n", "\n", "Now you can put all that together into a function and test this implementation on more sentences." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "##########################\n", "\n", "Sentence number: 1\n", "Raw sentence: \n", " BIBASILAR OPACITIES,likely representing bilateral pleural effusions with ATELECTASIS and/or PNEUMONIA..\n", "Cleaned sentence: \n", "bibasilar opacities, likely representing bilateral pleural effusions with atelectasis or pneumonia.\n", "\n", "##########################\n", "\n", "Sentence number: 2\n", "Raw sentence: \n", "Small left pleural effusion/decreased lung volumes bilaterally.left RetroCardiac Atelectasis.\n", "Cleaned sentence: \n", "small left pleural effusion or decreased lung volumes bilaterally. left retrocardiac atelectasis.\n", "\n", "##########################\n", "\n", "Sentence number: 3\n", "Raw sentence: \n", "PA and lateral views of the chest demonstrate clear lungs,with NO focal air space opacity and/or pleural effusion.\n", "Cleaned sentence: \n", "pa and lateral views of the chest demonstrate clear lungs, with no focal air space opacity or pleural effusion.\n", "\n", "##########################\n", "\n", "Sentence number: 4\n", "Raw sentence: \n", "worrisome nodule in the Right Upper lobe.CANNOT exclude neoplasm..\n", "Cleaned sentence: \n", "worrisome nodule in the right upper lobe. 
cannot exclude neoplasm.\n" ] } ], "source": [ "def clean(sentence):\n", "    # Step 1: convert to lowercase\n", "    lower_sentence = sentence.lower()\n", "    # Step 2: and/or -> or\n", "    corrected_sentence = re.sub('and/or', 'or', lower_sentence)\n", "    # Step 3: a '/' between two letters -> ' or '\n", "    corrected_sentence = re.sub('(?<=[a-zA-Z])/(?=[a-zA-Z])', ' or ', corrected_sentence)\n", "    # Step 4: .. -> .\n", "    clean_sentence = corrected_sentence.replace(\"..\", \".\")\n", "    # Step 5: add a space after each period and comma\n", "    punctuation_spacer = str.maketrans({key: f\"{key} \" for key in \".,\"})\n", "    clean_sentence = clean_sentence.translate(punctuation_spacer)\n", "    # Step 6: collapse repeated whitespace and trim the ends\n", "    clean_sentence = ' '.join(clean_sentence.split())\n", "    return clean_sentence\n", "\n", "sentences = [\" BIBASILAR OPACITIES,likely representing bilateral pleural effusions with ATELECTASIS and/or PNEUMONIA..\",\n", "             \"Small left pleural effusion/decreased lung volumes bilaterally.left RetroCardiac Atelectasis.\",\n", "             \"PA and lateral views of the chest demonstrate clear lungs,with NO focal air space opacity and/or pleural effusion.\",\n", "             \"worrisome nodule in the Right Upper lobe.CANNOT exclude neoplasm..\"]\n", "\n", "for n, sentence in enumerate(sentences):\n", "    print(\"\\n##########################\\n\")\n", "    print(f\"Sentence number: {n+1}\")\n", "    print(f\"Raw sentence: \\n{sentence}\")\n", "    print(f\"Cleaned sentence: \\n{clean(sentence)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Congratulations \n", "Congratulations on finishing this lecture notebook! By now, you should have a better grasp of `regexp` along with some built-in Python methods for cleaning text. You'll be seeing the `clean()` function again in the upcoming graded assignment. Good luck and have fun! 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 4 }