{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biomedical NLP" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Rule-based TNM Extraction\n", "\n", "This example shows a simplistic and somewhat problematic regular expression for matching TNM expressions.\n", "A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re\n", "\n", "tnm_pattern = r\"T\\d+[a-zA-Z]*N\\d+[a-zA-Z]*M\\d+[a-zA-Z]*\"\n", "\n", "def check_valid(text):\n", " print(\"valid\" if re.match(tnm_pattern, text) else \"not valid\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "valid\n" ] } ], "source": [ "check_valid('T1N0M1')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "valid\n" ] } ], "source": [ "check_valid('T1aN2M0')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } ], "source": [ "check_valid('T123')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } ], "source": [ "check_valid('pT1N0M1')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } ], "source": [ "check_valid('T1')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "valid\n" ] } ], "source": [ "check_valid('T8N9M9')" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "not valid\n" ] } 
], "source": [ "check_valid('T1 N0 M1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A more complex NLP Pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we are using the spaCy library with [scispaCy](https://allenai.github.io/scispacy/) models for domain-specific entity extraction. We also use scispaCy's entity linker to map entities to the MeSH vocabulary for normalization." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Note: on some systems, installing scispaCy fails due to build errors of nmslib. This can usually be circumvented by installing a pre-built nmslib version from conda\n", "#!conda install nmslib" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "!pip install -q scispacy==0.5.1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/phlobo/miniconda3/envs/dm4dh/lib/python3.11/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator TfidfTransformer from version 0.22.2.post1 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n", "/Users/phlobo/miniconda3/envs/dm4dh/lib/python3.11/site-packages/sklearn/base.py:348: InconsistentVersionWarning: Trying to unpickle estimator TfidfVectorizer from version 0.22.2.post1 when using version 1.3.1. This might lead to breaking code or invalid results. Use at your own risk. 
For more info please refer to:\n", "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n", " warnings.warn(\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spacy\n", "from scispacy.linking import EntityLinker\n", "\n", "nlp = spacy.load('en_core_sci_sm')\n", "nlp.add_pipe(\"scispacy_linker\", config={\"resolve_abbreviations\": True, \"linker_name\": \"mesh\", \"k\": 5})" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "text = \"The patient underwent a CT scan in April. It did not reveal any abnormalities.\"" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "doc = nlp(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linguistic Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boundary detection / sentence splitting" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The patient underwent a CT scan in April.\n", "It did not reveal any abnormalities.\n" ] } ], "source": [ "for s in doc.sents:\n", "    print(s)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "sentence = list(doc.sents)[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tokenization" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The\n", "patient\n", "underwent\n", "a\n", "CT\n", "scan\n", "in\n", "April\n", ".\n" ] } ], "source": [ "for token in sentence:\n", "    print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Part-of-speech tagging" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The DET\n", 
"patient NOUN\n", "underwent VERB\n", "a DET\n", "CT PROPN\n", "scan NOUN\n", "in ADP\n", "April PROPN\n", ". PUNCT\n" ] } ], "source": [ "for token in sentence:\n", " print(token, token.pos_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Noun chunking" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The patient\n", "a CT scan\n" ] } ], "source": [ "for token in sentence.noun_chunks:\n", " print(token)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dependency parsing" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "from spacy import displacy" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", " The\n", " DET\n", "\n", "\n", "\n", " patient\n", " NOUN\n", "\n", "\n", "\n", " underwent\n", " VERB\n", "\n", "\n", "\n", " a\n", " DET\n", "\n", "\n", "\n", " CT\n", " PROPN\n", "\n", "\n", "\n", " scan\n", " NOUN\n", "\n", "\n", "\n", " in\n", " ADP\n", "\n", "\n", "\n", " April.\n", " PROPN\n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nsubj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " det\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " compound\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " dobj\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " case\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " nmod\n", " \n", " \n", "\n", "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(sentence, style=\"dep\", jupyter=True, options={'distance' : 100})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Information Extraction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entity extraction" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "Entity: patient\n", "Entity: CT scan\n" ] } ], "source": [ "for e in sentence.ents:\n", " print('Entity:', e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Entity normalization / linking" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from IPython.display import display_markdown" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "linker = nlp.get_pipe(\"scispacy_linker\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "__Entity: patient__" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Probability: 0.8386321067810059\n", "CUI: D019727, Name: Proxy\n", "Definition: A person authorized to decide or act for another person, for example, a person having durable power of attorney.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Patient Agent, Proxy\n", "Probability: 0.7973071336746216\n", "CUI: D010361, Name: Patients\n", "Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Patients, Clients\n", "Probability: 0.7851048707962036\n", "CUI: D005791, Name: Patient Care\n", "Definition: Care rendered by non-professionals.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Informal care, Patient Care\n", "Probability: 0.7439237833023071\n", "CUI: D000070659, Name: Patient Comfort\n", "Definition: Patient care intended to prevent or relieve suffering in conditions that ensure optimal quality living.\n", "TUI(s): \n", "Aliases: (total: 2): \n", "\t Comfort Care, Patient Comfort\n", "Probability: 0.7175934910774231\n", "CUI: D064406, Name: Patient Harm\n", "Definition: A measure of PATIENT SAFETY considering errors or mistakes which result in harm to the patient. 
They include errors in the administration of drugs and other medications (MEDICATION ERRORS), errors in the performance of procedures or the use of other types of therapy, in the use of equipment, and in the interpretation of laboratory findings and preventable accidents involving patients.\n", "TUI(s): \n", "Aliases: (total: 1): \n", "\t Patient Harm\n" ] }, { "data": { "text/markdown": [ "__Entity: CT scan__" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Probability: 0.8230447173118591\n", "CUI: D000072098, Name: Single Photon Emission Computed Tomography Computed Tomography\n", "Definition: An imaging technique using a device which combines TOMOGRAPHY, EMISSION-COMPUTED, SINGLE-PHOTON and TOMOGRAPHY, X-RAY COMPUTED in the same session.\n", "TUI(s): \n", "Aliases: (total: 5): \n", "\t CT SPECT Scan, Single Photon Emission Computed Tomography Computed Tomography, CT SPECT, SPECT CT Scan, SPECT CT\n", "Probability: 0.8186503648757935\n", "CUI: D000072078, Name: Positron Emission Tomography Computed Tomography\n", "Definition: An imaging technique that combines a POSITRON-EMISSION TOMOGRAPHY (PET) scanner and a CT X RAY scanner. 
This establishes a precise anatomic localization in the same session.\n", "TUI(s): \n", "Aliases: (total: 7): \n", "\t PET-CT Scan, PET-CT, CT PET Scan, Positron Emission Tomography Computed Tomography, PET CT Scan, Positron Emission Tomography-Computed Tomography, CT PET\n", "Probability: 0.7265672087669373\n", "CUI: D056973, Name: Four-Dimensional Computed Tomography\n", "Definition: Three-dimensional computed tomographic imaging with the added dimension of time, to follow motion during imaging.\n", "TUI(s): \n", "Aliases: (total: 8): \n", "\t 4D CT Scan, 4D CT, Four-Dimensional CT, Four-Dimensional CT Scan, Four-Dimensional Computed Tomography, 4D Computed Tomography, 4D CAT Scan, Four-Dimensional CAT Scan\n" ] } ], "source": [ "for e in sentence.ents:\n", "    display_markdown(f'__Entity: {e}__', raw=True)\n", "    for entity_id, prob in e._.kb_ents:\n", "        mesh_term = linker.kb.cui_to_entity[entity_id]\n", "        print('Probability:', prob)\n", "        print(mesh_term)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene Named Entity Recognition" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "text = \"\"\"Dual MAPK pathway inhibition with BRAF and MEK inhibitors in BRAF(V600E)-mutant NSCLC \n", "might improve efficacy over BRAF inhibitor monotherapy based on observations in BRAF(V600)-mutant melanoma\"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Specialized model for biological entities" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "bionlp = spacy.load('en_ner_bionlp13cg_md')\n", "biodoc = bionlp(text)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"Entity: MAPK , Label: GENE_OR_GENE_PRODUCT\n", "Entity: BRAF , Label: GENE_OR_GENE_PRODUCT\n", "Entity: MEK , Label: GENE_OR_GENE_PRODUCT\n", "Entity: BRAF(V600E)-mutant NSCLC , Label: CANCER\n", "Entity: BRAF , Label: GENE_OR_GENE_PRODUCT\n", "Entity: melanoma , Label: CELL\n" ] } ], "source": [ "for e in biodoc.ents:\n", " print('Entity:', e, ', Label:', e.label_)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<displacy entity visualization; HTML markup stripped. Highlighted spans: MAPK, BRAF, MEK (GENE_OR_GENE_PRODUCT), BRAF(V600E)-mutant NSCLC (CANCER), melanoma (CELL)>\n", "
" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "displacy.render(biodoc, style='ent', jupyter=True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda env:dm4dh]", "language": "python", "name": "conda-env-dm4dh-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.5" } }, "nbformat": 4, "nbformat_minor": 4 }