# Biomedical NLP

## Rule-based TNM Extraction

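This example shows a deliberately simple regular expression for matching TNM expressions. It has several shortcomings, which the test cases below expose: it rejects prefixed (`pT1N0M1`) and whitespace-separated (`T1 N0 M1`) forms, and it accepts implausible category values such as `T8N9M9`.
A more realistic solution can be found here: https://github.com/hpi-dhc/onco-nlp/blob/master/onconlp/classification/rulebased_tnm.py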

In [1]:
import re

tnm_pattern = r"T\d+[a-zA-Z]*N\d+[a-zA-Z]*M\d+[a-zA-Z]*"

def check_valid(text):
    print("valid" if re.match(tnm_pattern, text) else "not valid")

In [2]:
check_valid('T1N0M1')

valid


In [3]:
check_valid('T1aN2M0')

valid


In [4]:
check_valid('T123')

not valid


In [5]:
check_valid('pT1N0M1')

not valid


In [6]:
check_valid('T1')

not valid


In [7]:
check_valid('T8N9M9')

valid


In [8]:
check_valid('T1 N0 M1')

not valid
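
The failing cases above suggest a few refinements. The following is a minimal sketch, not the approach taken by the linked onco-nlp implementation. It assumes, as a simplification, that T takes the values 0-4, `is` or `X`, N takes 0-3 or `X`, M takes 0-1 or `X`, that optional staging prefixes are drawn from `c`, `p`, `y`, `r` and `a`, and that whitespace may separate the components. The names `TNM_STRICT` and `parse_tnm` are hypothetical. It uses `fullmatch` because `re.match` anchors only at the start of the string and therefore tolerates trailing text.

```python
import re

# Assumed grammar (illustrative only, not a complete TNM specification).
TNM_STRICT = re.compile(
    r"""
    (?P<prefix>[cpyra]{0,3})      # optional staging prefix, e.g. 'p'
    T(?P<t>[0-4][a-d]?|is|X)      # T0-T4 with optional subcategory, Tis, TX
    \s*
    N(?P<n>[0-3][a-c]?|X)         # N0-N3 with optional subcategory, NX
    \s*
    M(?P<m>[01][a-c]?|X)          # M0 or M1 with optional subcategory, MX
    """,
    re.VERBOSE,
)


def parse_tnm(text):
    """Return prefix and T, N, M categories as a dict, or None if there is no full match."""
    match = TNM_STRICT.fullmatch(text.strip())
    return match.groupdict() if match else None


for example in ["T1N0M1", "pT1N0M1", "T1 N0 M1", "T8N9M9", "T123"]:
    print(example, "->", parse_tnm(example))
```

Under these assumptions, `pT1N0M1` and `T1 N0 M1` are accepted, while `T8N9M9` and `T123` are rejected. Named groups also make the individual categories available for downstream use, instead of only a valid/invalid verdict.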


## A more complex NLP Pipeline

Here, we use the spaCy library with [scispaCy](https://allenai.github.io/scispacy/) models for domain-specific entity extraction. We also use scispaCy's entity linker to normalize the extracted entities by mapping them to the MeSH vocabulary.

In [9]:
# Note: on some systems, installing scispaCy fails because nmslib does not build.
# Installing a pre-built nmslib version from conda usually works around this.
#!conda install nmslib

In [10]:
!pip install -q scispacy==0.5.1

In [11]:
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

In [12]:
import spacy
# Importing EntityLinker registers the "scispacy_linker" pipeline factory used below
from scispacy.linking import EntityLinker

nlp = spacy.load('en_core_sci_sm')
nlp.add_pipe("scispacy_linker", config={"resolve_abbreviations": True, "linker_name": "mesh", "k" : 5})


<scispacy.linking.EntityLinker at 0x1664cbe50>

In [13]:
text = "The patient underwent a CT scan in April. It did not reveal any abnormalities."

In [14]:
doc = nlp(text)

### Linguistic Analysis

Boundary detection / sentence splitting

In [15]:
for s in doc.sents:
    print(s)

The patient underwent a CT scan in April.
It did not reveal any abnormalities.


In [16]:
sentence = list(doc.sents)[0]

Tokenization

In [17]:
for token in sentence:
    print(token)

The
patient
underwent
a
CT
scan
in
April
.


Part-of-speech tagging

In [18]:
for token in sentence:
    print(token, token.pos_)

The DET
patient NOUN
underwent VERB
a DET
CT PROPN
scan NOUN
in ADP
April PROPN
. PUNCT


Noun chunking

In [19]:
for chunk in sentence.noun_chunks:
    print(chunk)

The patient
a CT scan


Dependency parsing

In [20]:
from spacy import displacy

In [21]:
displacy.render(sentence, style="dep", jupyter=True, options={'distance' : 100})

### Information Extraction

Entity extraction

In [22]:
for e in sentence.ents:
    print('Entity:', e)

Entity: patient
Entity: CT scan


Entity normalization / linking

In [23]:
from IPython.display import display_markdown

In [24]:
linker = nlp.get_pipe("scispacy_linker")

In [25]:
for e in sentence.ents:
    display_markdown(f'__Entity: {e}__', raw=True)
    # kb_ents holds (concept ID, score) pairs for the linked candidates
    for entity_id, prob in e._.kb_ents:
        mesh_term = linker.kb.cui_to_entity[entity_id]
        print('Probability:', prob)
        print(mesh_term)

__Entity: patient__

Probability: 0.8386321067810059
CUI: D019727, Name: Proxy
Definition: A person authorized to decide or act for another person, for example, a person having durable power of attorney.
TUI(s): 
Aliases: (total: 2): 
	 Patient Agent, Proxy
Probability: 0.7973071336746216
CUI: D010361, Name: Patients
Definition: Individuals participating in the health care system for the purpose of receiving therapeutic, diagnostic, or preventive procedures.
TUI(s): 
Aliases: (total: 2): 
	 Patients, Clients
Probability: 0.7851048707962036
CUI: D005791, Name: Patient Care
Definition: Care rendered by non-professionals.
TUI(s): 
Aliases: (total: 2): 
	 Informal care, Patient Care
Probability: 0.7439237833023071
CUI: D000070659, Name: Patient Comfort
Definition: Patient care intended to prevent or relieve suffering in conditions that ensure optimal quality living.
TUI(s): 
Aliases: (total: 2): 
	 Comfort Care, Patient Comfort
Probability: 0.7175934910774231
CUI: D064406, Name: Patient Harm
Definition: A meas

__Entity: CT scan__

Probability: 0.8230447173118591
CUI: D000072098, Name: Single Photon Emission Computed Tomography Computed Tomography
Definition: An imaging technique using a device which combines TOMOGRAPHY, EMISSION-COMPUTED, SINGLE-PHOTON and TOMOGRAPHY, X-RAY COMPUTED in the same session.
TUI(s): 
Aliases: (total: 5): 
	 CT SPECT Scan, Single Photon Emission Computed Tomography Computed Tomography, CT SPECT, SPECT CT Scan, SPECT CT
Probability: 0.8186503648757935
CUI: D000072078, Name: Positron Emission Tomography Computed Tomography
Definition: An imaging technique that combines a POSITRON-EMISSION TOMOGRAPHY (PET) scanner and a CT X RAY scanner. This establishes a precise anatomic localization in the same session.
TUI(s): 
Aliases: (total: 7): 
	 PET-CT Scan, PET-CT, CT PET Scan, Positron Emission Tomography Computed Tomography, PET CT Scan, Positron Emission Tomography-Computed Tomography, CT PET
Probability: 0.7265672087669373
CUI: D056973, Name: Four-Dimensional Computed Tomography
Definition
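
Note that the top-ranked candidate for "patient" is `Proxy` rather than `Patients`, so taking the first candidate blindly can give a wrong normalization. The following is a minimal sketch of filtering candidates by score. The helper `select_candidates` and its default threshold are hypothetical and not part of scispaCy. The scores are copied (rounded) from the output above.

```python
def select_candidates(kb_ents, threshold=0.8, max_candidates=1):
    """Keep the highest-scoring (concept_id, score) pairs at or above the threshold."""
    ranked = sorted(kb_ents, key=lambda pair: pair[1], reverse=True)
    kept = [(cui, score) for cui, score in ranked if score >= threshold]
    return kept[:max_candidates]


# Candidates for the mention "patient", taken from the linker output above.
patient_candidates = [
    ("D019727", 0.8386),  # Proxy
    ("D010361", 0.7973),  # Patients
    ("D005791", 0.7851),  # Patient Care
]

print(select_candidates(patient_candidates))
print(select_candidates(patient_candidates, threshold=0.75, max_candidates=2))
```

With real linker output, the same function could be called as `select_candidates(e._.kb_ents)`. A score threshold alone does not fix the `Proxy` ranking here, which is why keeping several candidates for later disambiguation may be preferable.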

## Gene Named Entity Recognition

In [26]:
!pip install -q https://s3-us-west-2.amazonaws.com/ai2-s2-scispacy/releases/v0.5.1/en_ner_bionlp13cg_md-0.5.1.tar.gz

In [27]:
text = """Dual MAPK pathway inhibition with BRAF and MEK inhibitors in BRAF(V600E)-mutant NSCLC 
might improve efficacy over BRAF inhibitor monotherapy based on observations in BRAF(V600)-mutant melanoma"""

Specialized model for biological entities

In [28]:
bionlp = spacy.load('en_ner_bionlp13cg_md')
biodoc = bionlp(text)

In [29]:
for e in biodoc.ents:
    print('Entity:', e, ', Label:', e.label_)

Entity: MAPK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: MEK , Label: GENE_OR_GENE_PRODUCT
Entity: BRAF(V600E)-mutant NSCLC , Label: CANCER
Entity: BRAF , Label: GENE_OR_GENE_PRODUCT
Entity: melanoma , Label: CELL
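
For downstream use, it can help to group the recognized mentions by label. The following is a minimal sketch using only the standard library. The helper `group_by_label` is hypothetical, and the input pairs are copied from the output above so that the example runs without the model.

```python
from collections import defaultdict


def group_by_label(entities):
    """Map each label to its distinct mention texts, in order of first appearance."""
    grouped = defaultdict(list)
    for text, label in entities:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# (text, label) pairs as printed by the model above
entities = [
    ("MAPK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("MEK", "GENE_OR_GENE_PRODUCT"),
    ("BRAF(V600E)-mutant NSCLC", "CANCER"),
    ("BRAF", "GENE_OR_GENE_PRODUCT"),
    ("melanoma", "CELL"),
]

for label, mentions in group_by_label(entities).items():
    print(label, mentions)
```

With the spaCy document from the cell above, the input could be built as `[(e.text, e.label_) for e in biodoc.ents]`.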


In [30]:
displacy.render(biodoc, style='ent', jupyter=True)