{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biomedical NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rule-based TNM Extraction\n", "\n", "This example shows a simplistic and somewhat problematic regular expression for matching TNM expressions.\n", "A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "tnm_pattern = r\"T\\d+[a-zA-Z]*N\\d+[a-zA-Z]*M\\d+[a-zA-Z]*\"\n", "\n", "def check_valid(text):\n", " print(\"valid\" if re.match(tnm_pattern, text) else \"not valid\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "valid\n" ] } ], "source": [ "check_valid('T1N0M1')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "valid\n" ] } ], "source": [ "check_valid('T1aN2M0')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } ], "source": [ "check_valid('T123')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } ], "source": [ "check_valid('pT1N0M1')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } ], "source": [ "check_valid('T1')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "valid\n" ] } ], "source": [ "check_valid('T8N9M9')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } 
], "source": [ "check_valid('T1 N0 M1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A more complex NLP Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we are using the spaCy library with [scispaCy](https://allenai.github.io/scispacy/) models for domain-specific entity extraction. We also use scispaCy's entity linker to map entities to the MeSH vocabulary for normalization." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Note: on some systems, installing scispaCy fails due to build errors of nmslib. This can usually be circumvented by installing a pre-built nmslib version from conda\n", "#!conda install nmslib" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "!pip install -q scispacy==0.5.1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/phlobo/miniconda3/envs/dm4dh/lib/python3.11/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator TfidfTransformer from version 0.22.2.post1 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n", "/Users/phlobo/miniconda3/envs/dm4dh/lib/python3.11/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator TfidfVectorizer from version 0.22.2.post1 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. 
For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spacy\n", "from scispacy.linking import EntityLinker\n", "\n", "nlp = spacy.load('en_core_sci_sm')\n", "nlp.add_pipe(\"scispacy_linker\", config={\"resolve_abbreviations\": True, \"linker_name\": \"mesh\", \"k\": 5})" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "text = \"The patient underwent a CT scan in April. It did not reveal any abnormalities.\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "doc = nlp(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linguistic Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boundary detection / sentence splitting" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The patient underwent a CT scan in April.\n", "It did not reveal any abnormalities.\n" ] } ], "source": [ "for s in doc.sents:\n", "    print(s)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "sentence = list(doc.sents)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokenization" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n", "patient\n", "underwent\n", "a\n", "CT\n", "scan\n", "in\n", "April\n", ".\n" ] } ], "source": [ "for token in sentence:\n", "    print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part-of-speech tagging" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The DET\n", 
"patient NOUN\n", "underwent VERB\n", "a DET\n", "CT PROPN\n", "scan NOUN\n", "in ADP\n", "April PROPN\n", ". PUNCT\n" ] } ], "source": [ "for token in sentence:\n", " print(token, token.pos_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Noun chunking" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The patient\n", "a CT scan\n" ] } ], "source": [ "for token in sentence.noun_chunks:\n", " print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dependency parsing" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from spacy import displacy" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " The\n", " DET\n", "\n", "\n", "\n", " patient\n", " NOUN\n", "\n", "\n", "\n", " underwent\n", " VERB\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " CT\n", " PROPN\n", "\n", "\n", "\n", " scan\n", " NOUN\n", "\n", "\n", "\n", " in\n", " ADP\n", "\n", "\n", "\n", " April.\n", " PROPN\n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " case\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nmod\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(sentence, style=\"dep\", jupyter=True, options={'distance' : 100})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Information Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entity extraction" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Entity: patient\n", "Entity: CT scan\n" ] } ], "source": [ "for e in sentence.ents:\n", " print('Entity:', e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entity normalization / linking" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display_markdown" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "linker = nlp.get_pipe(\"scispacy_linker\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "__Entity: patient__" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Probability: 0.8386321067810059\n", "CUI: D019727, Name: Proxy\n", "Definition: A person authorized to decide or act for another person, for example, a person having durable power of attorney.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Patient Agent, Proxy\n", "Probability: 0.7973071336746216\n", "CUI: D010361, Name: Patients\n", "Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Patients, Clients\n", "Probability: 0.7851048707962036\n", "CUI: D005791, Name: Patient Care\n", "Definition: Care rendered by non-professionals.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Informal care, Patient Care\n", "Probability: 0.7439237833023071\n", "CUI: D000070659, Name: Patient Comfort\n", "Definition: Patient care intended to prevent or relieve suffering in conditions that ensure optimal quality living.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Comfort Care, Patient Comfort\n", "Probability: 0.7175934910774231\n", "CUI: D064406, Name: Patient Harm\n", "Definition: A measure of PATIENT SAFETY considering errors or mistakes which result in harm to the patient. 
They include errors in the administration of drugs and other medications (MEDICATION ERRORS), errors in the performance of procedures or the use of other types of therapy, in the use of equipment, and in the interpretation of laboratory findings and preventable accidents involving patients.\n", "TUI(s): \n", "Aliases: (total: 1): \n", "\t Patient Harm\n" ] }, { "data": { "text/markdown": [ "__Entity: CT scan__" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Probability: 0.8230447173118591\n", "CUI: D000072098, Name: Single Photon Emission Computed Tomography Computed Tomography\n", "Definition: An imaging technique using a device which combines TOMOGRAPHY, EMISSION-COMPUTED, SINGLE-PHOTON and TOMOGRAPHY, X-RAY COMPUTED in the same session.\n", "TUI(s): \n", "Aliases: (total: 5): \n", "\t CT SPECT Scan, Single Photon Emission Computed Tomography Computed Tomography, CT SPECT, SPECT CT Scan, SPECT CT\n", "Probability: 0.8186503648757935\n", "CUI: D000072078, Name: Positron Emission Tomography Computed Tomography\n", "Definition: An imaging technique that combines a POSITRON-EMISSION TOMOGRAPHY (PET) scanner and a CT X RAY scanner. 
This establishes a precise anatomic localization in the same session.\n", "TUI(s): \n", "Aliases: (total: 7): \n", "\t PET-CT Scan, PET-CT, CT PET Scan, Positron Emission Tomography Computed Tomography, PET CT Scan, Positron Emission Tomography-Computed Tomography, CT PET\n", "Probability: 0.7265672087669373\n", "CUI: D056973, Name: Four-Dimensional Computed Tomography\n", "Definition: Three-dimensional computed tomographic imaging with the added dimension of time, to follow motion during imaging.\n", "TUI(s): \n", "Aliases: (total: 8): \n", "\t 4D CT Scan, 4D CT, Four-Dimensional CT, Four-Dimensional CT Scan, Four-Dimensional Computed Tomography, 4D Computed Tomography, 4D CAT Scan, Four-Dimensional CAT Scan\n" ] } ], "source": [ "for e in sentence.ents:\n", "    display_markdown(f'__Entity: {e}__', raw=True)\n", "    for entity_id, prob in e._.kb_ents:\n", "        mesh_term = linker.kb.cui_to_entity[entity_id]\n", "        print('Probability:', prob)\n", "        print(mesh_term)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene Named Entity Recognition" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "text = \"\"\"Dual MAPK pathway inhibition with BRAF and MEK inhibitors in BRAF(V600E)-mutant NSCLC \n", "might improve efficacy over BRAF inhibitor monotherapy based on observations in BRAF(V600)-mutant melanoma\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specialized model for biological entities" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "bionlp = spacy.load('en_ner_bionlp13cg_md')\n", "biodoc = bionlp(text)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"Entity: MAPK , Label: GENE_OR_GENE_PRODUCT\n", "Entity: BRAF , Label: GENE_OR_GENE_PRODUCT\n", "Entity: MEK , Label: GENE_OR_GENE_PRODUCT\n", "Entity: BRAF(V600E)-mutant NSCLC , Label: CANCER\n", "Entity: BRAF , Label: GENE_OR_GENE_PRODUCT\n", "Entity: melanoma , Label: CELL\n" ] } ], "source": [ "for e in biodoc.ents:\n", " print('Entity:', e, ', Label:', e.label_)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<displacy entity visualization; HTML markup stripped. Highlighted spans: MAPK, BRAF, MEK (GENE_OR_GENE_PRODUCT), BRAF(V600E)-mutant NSCLC (CANCER), melanoma (CELL)>\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(biodoc, style='ent', jupyter=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:dm4dh]", "language": "python", "name": "conda-env-dm4dh-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }