# Drug responses - Background traits, PharmGKB

## Table of contents

1. [ClinVar](#Data-from-ClinVar)
    1. [Thoughts](#Thoughts)
2. [PharmGKB](#PharmGKB-data)
    1. [Clinical annotations](#Clinical-annotations)
    2. [Example extraction](#Example-extraction)
    3. [Connecting with ClinVar](#Connecting-with-ClinVar)
    4. [Star alleles](#Star-alleles)
    5. [Notes](#Notes)
3. [General](#General)
    1. [Meeting notes](#Meeting-notes)

In [1]:
from collections import Counter
import sys

sys.path.append('..')

In [2]:
from filter_clinvar_xml import filter_xml, pprint, iterate_cvs_from_xml
from clinvar_xml_io.clinvar_xml_io import *

## Data from ClinVar

[Top of page](#Table-of-contents)

Questions to address:

* Can we reliably get the background trait, i.e. the disease that the drug acts on?
* How many records are explicitly reporting efficacy phenotypes?

In [3]:
# July 2022 data
drug_xml = '/home/april/projects/opentargets/drug-response.xml.gz'

In [4]:
dataset = ClinVarDataset(drug_xml)

In [237]:
# Entire CVS record (RCV + SCV) for reference
for raw_cvs_xml in iterate_cvs_from_xml(drug_xml):
    pprint(raw_cvs_xml)
    break

<ClinVarSet ID="74627773">
  <RecordStatus>current</RecordStatus>
  <Title>NM_000769.4(CYP2C19):c.-806C&gt;A AND clopidogrel response - Dosage, Efficacy, Toxicity/ADR</Title>
  <ReferenceClinVarAssertion DateCreated="2016-05-18" DateLastUpdated="2021-09-29" ID="503964">
    <ClinVarAccession Acc="RCV000211201" Version="2" Type="RCV" DateUpdated="2021-09-29" />
    <RecordStatus>current</RecordStatus>
    <ClinicalSignificance DateLastEvaluated="2016-06-14">
      <ReviewStatus>reviewed by expert panel</ReviewStatus>
      <Description>drug response</Description>
    </ClinicalSignificance>
    <Assertion Type="variation to disease" />
    <ObservedIn>
      <Sample>
        <Origin>germline</Origin>
        <Species TaxonomyId="9606">human</Species>
        <AffectedStatus>yes</AffectedStatus>
      </Sample>
      <Method>
        <MethodType>curation</MethodType>
      </Method>
      <ObservedData ID="74435109">
        <Attribute Type="Description">not provided</Attribute>
      </

Example [RCV000211201](https://www.ncbi.nlm.nih.gov/clinvar/RCV000211201/): the trait relationship between drug and disease is present, but only in the SCV, not in the RCV. (Note that this RCV has only one SCV.)

**SCV:**

```
<TraitSet Type="DrugResponse">
  <Trait Type="DrugResponse">
    <Name>
      <ElementValue Type="Preferred">clopidogrel response - Dosage, Efficacy, Toxicity/ADR</ElementValue>
    </Name>
    <TraitRelationship Type="DrugResponseAndDisease">
      <Name>
        <ElementValue Type="Preferred">Acute coronary syndrome</ElementValue>
      </Name>
    </TraitRelationship>
    <TraitRelationship Type="DrugResponseAndDisease">
      <Name>
        <ElementValue Type="Preferred">Coronary Artery Disease</ElementValue>
      </Name>
    </TraitRelationship>
    <TraitRelationship Type="DrugResponseAndDisease">
      <Name>
        <ElementValue Type="Preferred">Myocardial Infarction</ElementValue>
      </Name>
    </TraitRelationship>
  </Trait>
</TraitSet>
```

**RCV:**
```
<TraitSet Type="DrugResponse" ID="26824">
  <Trait ID="35423" Type="DrugResponse">
    <Name>
      <ElementValue Type="Preferred">clopidogrel response - Dosage, Efficacy, Toxicity/ADR</ElementValue>
    </Name>
    <XRef ID="CN236507" DB="MedGen" />
  </Trait>
</TraitSet>
```
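
As a rough illustration of how the background traits could be read out of an SCV trait set like the one above, here is a minimal sketch using only the standard library instead of the `clinvar_xml_io` helpers. The function name `background_traits` is hypothetical, and the XML is an abbreviated copy of the SCV snippet (two of the three relationships).

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the SCV TraitSet shown above
SCV_TRAIT_SET = """
<TraitSet Type="DrugResponse">
  <Trait Type="DrugResponse">
    <Name>
      <ElementValue Type="Preferred">clopidogrel response - Dosage, Efficacy, Toxicity/ADR</ElementValue>
    </Name>
    <TraitRelationship Type="DrugResponseAndDisease">
      <Name>
        <ElementValue Type="Preferred">Acute coronary syndrome</ElementValue>
      </Name>
    </TraitRelationship>
    <TraitRelationship Type="DrugResponseAndDisease">
      <Name>
        <ElementValue Type="Preferred">Coronary Artery Disease</ElementValue>
      </Name>
    </TraitRelationship>
  </Trait>
</TraitSet>
"""


def background_traits(trait_set_xml):
    """Map each drug response trait name to the disease names related to it."""
    root = ET.fromstring(trait_set_xml)
    result = {}
    for trait in root.findall("./Trait[@Type='DrugResponse']"):
        drug_response = trait.findtext("./Name/ElementValue[@Type='Preferred']")
        diseases = [
            rel.findtext("./Name/ElementValue[@Type='Preferred']")
            for rel in trait.findall("./TraitRelationship[@Type='DrugResponseAndDisease']")
        ]
        result[drug_response] = diseases
    return result


traits = background_traits(SCV_TRAIT_SET)
print(traits)
```

Run on the RCV trait set instead, the same function would return an empty disease list, since the RCV has no `TraitRelationship` elements.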

In [19]:
# Check whether any of the RCV records have this kind of information
for record in dataset:
    if len(record.trait_set) > 1:
        # No trait set with both a drug and a disease
        print(record.accession)
        print([trait.preferred_or_other_valid_name for trait in record.trait_set])
    for trait in record.trait_set:
        # No traits in RCV with relationship element
        relationships = find_elements(trait.trait_xml, './TraitRelationship')
        if relationships:
            print(record.accession)
            pprint(trait.trait_xml)

RCV001824998
['Cabozantinib resistance', 'Entrectinib resistance', 'Larotrectinib resistance', 'Repotrectinib resistance', 'Selitrectinib resistance']


In [49]:
def get_name(x):
    return ClinVarTrait(x, None).preferred_or_other_valid_name


def is_pgkb(raw_cvs_xml):
    scvs = find_elements(raw_cvs_xml, './ClinVarAssertion/ClinVarSubmissionID')
    submitters = {scv.attrib.get('submitter') for scv in scvs}
    return 'PharmGKB' in submitters

In [239]:
# Check whether all the SCV records have this kind of information
n = 0
count_all = 0
count_pgkb = 0
all_strs = set()
for raw_cvs_xml in iterate_cvs_from_xml(drug_xml):
    n += 1
    elts = find_elements(raw_cvs_xml, './ClinVarAssertion/TraitSet/Trait')
    for e in elts:
        if e.attrib['Type'] == 'DrugResponse':
            name = get_name(e)
            # Collect the disease names attached to this drug response trait
            background_traits = [
                get_name(r)
                for r in find_elements(e, './TraitRelationship')
                if r.attrib['Type'] == 'DrugResponseAndDisease'
            ]
            if background_traits:
                # Note: this counts matching SCV traits, so a record with
                # several matching SCVs is counted more than once
                count_all += 1
                # PharmGKB submissions are marked with a leading asterisk
                if is_pgkb(raw_cvs_xml):
                    count_pgkb += 1
                    all_strs.add(f'*{name} => {background_traits}')
                else:
                    all_strs.add(f'{name} => {background_traits}')

for s in all_strs:
    print(s)

*hmg coa reductase inhibitors response - Toxicity => ['statin-related myopathy']
*nicotine response - Toxicity => ['Tobacco Use Disorder']
*azathioprine response - Toxicity => ['Inflammatory Bowel Diseases', 'Myelosuppression']
Piroxicam response => ['Pain', 'Inflammation', 'Osteoarthritis', 'Rheumatoid arthritis']
*halothane response - Toxicity => ['Malignant Hyperthermia']
*warfarin response - Toxicity/ADR => ['Over-anticoagulation']
*efavirenz response - Metabolism/PK => ['HIV Infections']
Prednisolone response => ['Minimal change disease']
efavirenz response => ['HIV']
Deutetrabenazine response => ['Chorea', 'Huntington disease', 'Tardive dyskinesia']
Lesinurad response => ['Gout']
*rosuvastatin response - Efficacy => ['Hypercholesterolemia', 'Myocardial Infarction']
Dabrafenib response => ['Pancreatic Adenocarcinoma']
*tobramycin response - Toxicity => ['Ototoxicity']
*peginterferon alfa-2b and ribavirin response - Toxicity => ['Anemia', 'Hepatitis C, Chronic']
*captopril response

In [60]:
print(f'Out of {n} records, found {count_all} with drug response & disease relationship ({count_pgkb} from PharmGKB).')

Out of 4970 records, found 576 with drug response & disease relationship (361 from PharmGKB).


In [235]:
count_all = 0
count_pgkb = 0
for raw_cvs_xml in iterate_cvs_from_xml(drug_xml):
    elts = find_elements(raw_cvs_xml, './ClinVarAssertion/TraitSet/Trait')
    for e in elts:
        if e.attrib['Type'] == 'DrugResponse':
            name = get_name(e)
            if name and 'efficacy' in name.lower():
                count_all += 1
                if is_pgkb(raw_cvs_xml):
                    count_pgkb += 1

In [236]:
print(f'Out of {n} records, found {count_all} with efficacy phenotype ({count_pgkb} from PharmGKB).')

Out of 4970 records, found 54 with efficacy phenotype (54 from PharmGKB).
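
The count above relies on a substring match on the trait name. A slightly stricter alternative would be to split the name into the drug part and its phenotype categories first. The sketch below assumes the `<drug> response - <category>, <category>` naming pattern seen in the PharmGKB-submitted names printed earlier; the helper names are hypothetical.

```python
def split_trait_name(name):
    """Split a drug response trait name into (drug part, set of lower-cased categories)."""
    drug_part, sep, categories = name.partition(" - ")
    if not sep:
        # No " - " separator: the name carries no explicit categories
        return drug_part.strip(), set()
    return drug_part.strip(), {c.strip().lower() for c in categories.split(",")}


def has_efficacy(name):
    """True if 'efficacy' is one of the explicit categories in the trait name."""
    _, categories = split_trait_name(name)
    return "efficacy" in categories


examples = [
    "clopidogrel response - Dosage, Efficacy, Toxicity/ADR",
    "peginterferon alfa-2b and ribavirin response - Toxicity",
    "Piroxicam response",
]
for example in examples:
    print(split_trait_name(example), has_efficacy(example))
```

Names without a category suffix (typically the non-PharmGKB submissions) come back with an empty category set, which is consistent with all 54 efficacy hits coming from PharmGKB.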


### Thoughts

[Top of page](#Table-of-contents)

* Is it worth starting to parse SCV for drug response / disease trait relationships?
    * Might be relatively straightforward to do in this restricted case
    * Opens up a can of worms, e.g. what happens if SCVs don't agree?  Do we end up redoing the work of aggregation?
* Why does ClinVar exclude this info from the RCV anyway?
* Is it worth trying other ways of linking drug & disease within ClinVar?
    * e.g. different RCV with same VCV, one for drug and one for disease
    * same SCV associated with different RCVs via different traits?
* Counts summary: **4970** drug response records
    * **401** with PharmGKB submission (previous notebook)
    * **576** with drug response & disease relationship (in SCV only)
        * Of these, **361** from PharmGKB
    * **54** with explicit efficacy phenotype, all from PharmGKB
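
On the question of SCVs not agreeing: a minimal sketch of what a first-pass comparison could look like, assuming we already have the background traits per SCV. The accessions `SCV_A` and `SCV_B` are placeholders rather than real records, and the name normalisation (strip and lower-case) is deliberately naive.

```python
def summarise_background_traits(scv_traits):
    """Compare background traits across SCVs.

    scv_traits: dict mapping an SCV accession to an iterable of disease names.
    Returns the union, the traits shared by every SCV, and a consistency flag.
    """
    normalised = [
        {trait.strip().lower() for trait in traits}
        for traits in scv_traits.values()
    ]
    union = set().union(*normalised) if normalised else set()
    shared = set.intersection(*normalised) if normalised else set()
    return {"union": union, "shared": shared, "consistent": union == shared}


# Placeholder accessions; the disease strings mirror the two efavirenz names printed above
summary = summarise_background_traits({
    "SCV_A": ["HIV Infections"],
    "SCV_B": ["HIV"],
})
print(summary)
```

Even this toy case comes out as inconsistent, because the two free-text names differ. Any real aggregation would probably need ontology mapping before comparison.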

## PharmGKB data

[Top of page](#Table-of-contents)

* Compare this with what PharmGKB submissions contain in ClinVar
* Also consider how we would get consequences and how we'd connect to ClinVar data

General PharmGKB notes:
* [Multiple datasets](https://www.pharmgkb.org/downloads) that we could cross-reference
    * I looked at some of the others but the clinical annotations are probably all we need/can use
* "PharmGKB submits Level 1 & 2 Clinical Annotations PGx into ClinVar" - see [levels](https://www.pharmgkb.org/page/clinAnnLevels)

In [115]:
import pandas as pd
import os
from IPython.display import display

In [84]:
pd.set_option('display.max_colwidth', None)

In [64]:
pharmgkb_root = '/home/april/projects/opentargets/pharmgkb'

### Clinical annotations

[Top of page](#Table-of-contents)

In [72]:
clinical_annotations = pd.read_csv(os.path.join(pharmgkb_root, 'clinical', 'clinical_annotations.tsv'), sep='\t')
clinical_alleles = pd.read_csv(os.path.join(pharmgkb_root, 'clinical', 'clinical_ann_alleles.tsv'), sep='\t')
clinical_evidence = pd.read_csv(os.path.join(pharmgkb_root, 'clinical', 'clinical_ann_evidence.tsv'), sep='\t')

In [132]:
len(clinical_annotations)

5013

In [112]:
def show_id(i):
    for t in (clinical_annotations[clinical_annotations['Clinical Annotation ID'] == i],
              clinical_alleles[clinical_alleles['Clinical Annotation ID'] == i],
              clinical_evidence[clinical_evidence['Clinical Annotation ID'] == i]):
        display(t)

Two examples: one with RS ID (981755803) and one with star allele only (1451243980)

In [113]:
show_id(981755803)

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Level Override,Level Modifiers,Score,Phenotype Category,PMID Count,Evidence Count,Drug(s),Phenotype(s),Latest History Date (YYYY-MM-DD),URL,Specialty Population
0,981755803,rs75527207,CFTR,1A,,Rare Variant; Tier 1 VIP,234.875,Efficacy,28,30,ivacaftor,Cystic Fibrosis,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/981755803,Pediatric


Unnamed: 0,Clinical Annotation ID,Genotype/Allele,Annotation Text,Allele Function
0,981755803,AA,"Patients with the rs75527207 AA genotype (two copies of the CFTR G551D variant) and cystic fibrosis may respond to ivacaftor treatment. FDA-approved drug labeling information and CPIC guidelines indicate use of ivacaftor in cystic fibrosis patients with at least one copy of a list of 33 CFTR genetic variants, including G551D. Other genetic and clinical factors may also influence response to ivacaftor.",
1,981755803,AG,"Patients with the rs75527207 AG genotype (one copy of the CFTR G551D variant) and cystic fibrosis may respond to ivacaftor treatment. FDA-approved drug labeling information and CPIC guidelines indicate use of ivacaftor in cystic fibrosis patients with at least one copy of a list of 33 CFTR genetic variants, including G551D. Other genetic and clinical factors may also influence response to ivacaftor.",
2,981755803,GG,"Patients with the rs75527207 GG genotype (do not have a copy of the CFTR G551D variant) and cystic fibrosis have an unknown response to ivacaftor treatment, as response may depend on the presence of other CFTR variants. FDA-approved drug labeling information and CPIC guidelines indicate use of ivacaftor in cystic fibrosis patients with at least one copy of a list of 33 CFTR genetic variants, including G551D. Other genetic and clinical factors may also influence response to ivacaftor.",


Unnamed: 0,Clinical Annotation ID,Evidence ID,Evidence Type,Evidence URL,PMID,Summary,Score
0,981755803,PA166114461,Guideline Annotation,https://www.pharmgkb.org/guidelineAnnotation/PA166114461,,Annotation of CPIC Guideline for ivacaftor and CFTR,100
1,981755803,PA166104890,Label Annotation,https://www.pharmgkb.org/labelAnnotation/PA166104890,,Annotation of FDA Label for ivacaftor and CFTR,100
2,981755803,981755665,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/981755665,21083385.0,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,0.25
3,981755803,981755678,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/981755678,22047557.0,Genotypes AA + AG are associated with response to ivacaftor in people with Cystic Fibrosis.,2.0
4,981755803,982006840,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/982006840,23313410.0,Allele A is associated with response to ivacaftor in men with Cystic Fibrosis.,0.25
5,981755803,982009991,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/982009991,23590265.0,Allele A is associated with response to ivacaftor in children with Cystic Fibrosis.,2.25
6,981755803,1043737597,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/1043737597,23757359.0,Allele A is associated with response to ivacaftor in people with Cystic Fibrosis.,2.0
7,981755803,1043737620,Variant Functional Assay Annotation,https://www.pharmgkb.org/variantAnnotation/1043737620,23757361.0,Allele A is associated with increased activity of CFTR when treated with ivacaftor in transfected CHO cells.,0.0
8,981755803,1043737636,Variant Functional Assay Annotation,https://www.pharmgkb.org/variantAnnotation/1043737636,23891399.0,Allele A is associated with activity of CFTR when treated with ivacaftor in FRT cell lines.,0.0
9,981755803,1183629335,Variant Drug Annotation,https://www.pharmgkb.org/variantAnnotation/1183629335,24066763.0,Genotype AA is associated with response to ivacaftor in women with Cystic Fibrosis.,0.25


In [114]:
show_id(1451243980)

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Level Override,Level Modifiers,Score,Phenotype Category,PMID Count,Evidence Count,Drug(s),Phenotype(s),Latest History Date (YYYY-MM-DD),URL,Specialty Population
4996,1451243980,"CYP2B6*1, CYP2B6*2, CYP2B6*6, CYP2B6*18, CYP2B6*38",CYP2B6,1A,,Tier 1 VIP,211.5,Toxicity,12,14,efavirenz,HIV Infections,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1451243980,


Unnamed: 0,Clinical Annotation ID,Genotype/Allele,Annotation Text,Allele Function
15404,1451243980,*1,"The CYP2B6*1 allele is assigned as a normal function allele by CPIC. Patients carrying CYP2B6*1 allele in combination with another normal function allele may have decreased risk of adverse events (eg. liver toxicity or CNS side effects) when treated with efavirenz as compared to patients with a no or decreased function allele in combination with a normal or increased function allele or with two no or decreased function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence the toxicity of efavirenz.",Normal function
15405,1451243980,*2,"The CYP2B6*2 allele is assigned as a normal function allele by CPIC. Patients carrying CYP2B6*2 allele in combination with another normal function allele may have decreased risk of adverse events (eg. liver toxicity or CNS side effects) when treated with efavirenz as compared to patients with a no or decreased function allele in combination with a normal or increased function allele or with two no or decreased function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence the toxicity of efavirenz.",Normal function
15406,1451243980,*6,"The CYP2B6*6 allele is assigned as a decreased function allele by CPIC. Patients carrying the CYP2B6*6 allele in combination with a normal, decreased, no, or increased function allele may have increased risk of adverse events (eg. liver toxicity or CNS side effects) when treated with efavirenz as compared to patients with two normal function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence toxicity of efavirenz.",Decreased function
15407,1451243980,*18,"The CYP2B6*18 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2B6*18 allele in combination with a normal, decreased, no, or increased function allele may have increased risk of adverse events (eg. liver toxicity or CNS side effects) when treated with efavirenz as compared to patients with two normal function alleles. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence toxicity of efavirenz.",No function
15408,1451243980,*38,"The CYP2B6*38 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2B6*38 allele in combination with a normal, decreased, no, or increased function allele may have increased risk of adverse events (eg. liver toxicity or CNS side effects) when treated with efavirenz as compared to patients with two normal function alleles. Other genetic and clinical factors may also influence toxicity of efavirenz.",No function


Unnamed: 0,Clinical Annotation ID,Evidence ID,Evidence Type,Evidence URL,PMID,Summary,Score
14695,1451243980,PA166182603,Guideline Annotation,https://www.pharmgkb.org/guidelineAnnotation/PA166182603,,Annotation of CPIC Guideline for efavirenz and CYP2B6,100.0
14696,1451243980,PA166182846,Guideline Annotation,https://www.pharmgkb.org/guidelineAnnotation/PA166182846,,Annotation of DPWG Guideline for efavirenz and CYP2B6,100.0
14697,1451243980,1451289240,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1451289240,25889207.0,Allele C is not associated with increased likelihood of Central Nervous System Diseases when treated with efavirenz in people with HIV Infections as compared to allele T.,-1.5
14698,1451243980,1183634232,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1183634232,24080498.0,Genotypes CC + CT are not associated with risk of Neurotoxicity Syndromes when treated with efavirenz in people with HIV Infections as compared to genotype TT.,-1.75
14699,1451243980,1184473287,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1184473287,24517233.0,Genotype TT is associated with increased risk of Central Nervous System Diseases when treated with efavirenz in people with HIV Infections.,2.0
14700,1451243980,1448636199,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1448636199,28692529.0,Genotype CC is associated with decreased likelihood of Drug Toxicity when treated with efavirenz in people with HIV Infections as compared to genotype TT.,2.0
14701,1451243980,1448993810,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1448993810,26715213.0,Genotypes CC + CT are associated with decreased risk of Central Nervous System Diseases when treated with efavirenz in people with HIV Infections as compared to genotype TT.,3.5
14702,1451243980,827707534,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/827707534,21862974.0,CYP2B6 *6/*6 is associated with increased risk of drug-induced liver injury when treated with efavirenz in people with HIV as compared to CYP2B6 *1/*1.,2.5
14703,1451243980,1184168515,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1184168515,23734829.0,CYP2B6 *1 is not associated with Neurotoxicity Syndromes when treated with efavirenz in people with HIV as compared to CYP2B6 *6.,-1.5
14704,1451243980,1448993721,Variant Phenotype Annotation,https://www.pharmgkb.org/variantAnnotation/1448993721,22808112.0,CYP2B6 *6 is associated with increased risk of Toxic liver disease when treated with efavirenz in people with HIV as compared to CYP2B6 *1/*1.,2.25


### Example extraction

[Top of page](#Table-of-contents)

New data model extracted from the PharmGKB clinical annotations download file:
* The trait in the evidence will be PharmGKB's `Phenotype(s)`
* The drug will be extracted from PharmGKB's `Drug(s)`
* The target will be the gene associated with the variant, PharmGKB's `Gene`
* Keep only rows whose `Phenotype Category` is `Efficacy` and that have associated `Phenotype(s)`
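
The data model above could be sketched roughly as follows. This uses plain dicts and a dataclass rather than pandas so that it is self-contained. `EfficacyEvidence` and `rows_to_evidence` are hypothetical names, and splitting `Drug(s)` on `;` is an assumption (the download output in this notebook only shows `;` as a separator in `Phenotype(s)`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EfficacyEvidence:
    annotation_id: int
    variant: str
    target: str  # PharmGKB "Gene"
    drug: str    # one entry of "Drug(s)"
    trait: str   # one entry of "Phenotype(s)"


def split_values(value):
    """Split a ';'-separated field, tolerating None or empty strings."""
    return [v.strip() for v in (value or "").split(";") if v.strip()]


def rows_to_evidence(rows):
    """Keep efficacy rows that have phenotypes; emit one item per drug/trait pair."""
    evidence = []
    for row in rows:
        if row.get("Phenotype Category") != "Efficacy":
            continue
        for drug in split_values(row.get("Drug(s)")):
            for trait in split_values(row.get("Phenotype(s)")):
                evidence.append(EfficacyEvidence(
                    annotation_id=row["Clinical Annotation ID"],
                    variant=row["Variant/Haplotypes"],
                    target=row["Gene"],
                    drug=drug,
                    trait=trait,
                ))
    return evidence


# Rows trimmed from the clinical annotations shown in this notebook
rows = [
    {"Clinical Annotation ID": 981755803, "Variant/Haplotypes": "rs75527207", "Gene": "CFTR",
     "Phenotype Category": "Efficacy", "Drug(s)": "ivacaftor", "Phenotype(s)": "Cystic Fibrosis"},
    {"Clinical Annotation ID": 1451245360, "Variant/Haplotypes": "rs1051266", "Gene": "SLC19A1",
     "Phenotype Category": "Efficacy", "Drug(s)": "methotrexate", "Phenotype(s)": "Arthritis, Rheumatoid"},
    {"Clinical Annotation ID": 1451237940, "Variant/Haplotypes": "rs9923231", "Gene": "VKORC1",
     "Phenotype Category": "Dosage", "Drug(s)": "phenprocoumon", "Phenotype(s)": ""},
]
evidence = rows_to_evidence(rows)
for item in evidence:
    print(item)
```

To feed the real dataframe through this, missing values would first need converting to empty strings (for example with `fillna('')`), since pandas represents them as NaN rather than `None`.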

In [134]:
clinical_annotations.columns

Index(['Clinical Annotation ID', 'Variant/Haplotypes', 'Gene',
       'Level of Evidence', 'Level Override', 'Level Modifiers', 'Score',
       'Phenotype Category', 'PMID Count', 'Evidence Count', 'Drug(s)',
       'Phenotype(s)', 'Latest History Date (YYYY-MM-DD)', 'URL',
       'Specialty Population'],
      dtype='object')

In [139]:
# Filter by efficacy
efficacy_annotations = clinical_annotations[clinical_annotations['Phenotype Category'] == 'Efficacy']

In [150]:
# Keep relevant columns
efficacy_annotations = efficacy_annotations[
    ['Clinical Annotation ID', 'Variant/Haplotypes', 'Gene',
     'Level of Evidence', 'Drug(s)', 'Phenotype(s)']]

In [162]:
len(efficacy_annotations)

1931

In [153]:
# Join on alleles data
efficacy_with_alleles = efficacy_annotations.set_index('Clinical Annotation ID').join(clinical_alleles.set_index('Clinical Annotation ID'))

In [154]:
efficacy_with_alleles

Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Drug(s),Phenotype(s),Genotype/Allele,Annotation Text,Allele Function
613979021,rs1042714,ADRB2,3,carvedilol,Heart Failure,CC,Patients with the CC genotype and heart failure may have a poorer response to carvedilol treatment as compared to patients with the CG or GG genotype. Other genetic and clinical factors may also influence a patient's chance of response.,
613979021,rs1042714,ADRB2,3,carvedilol,Heart Failure,CG,Patients with the CG genotype and heart failure may have a poorer response to carvedilol treatment as compared to patients with the GG genotype and a better response as compared to patients with the CC genotype. Patients with the CG genotype may still be at risk for non-response to carvedilol treatment based on their genotype. Other genetic and clinical factors may also influence a patient's chance of response.,
613979021,rs1042714,ADRB2,3,carvedilol,Heart Failure,GG,Patients with the GG genotype and heart failure may have a better response to carvedilol treatment as compared to patients with the CC or CG genotype. Patients with the GG genotype may still be at risk for non-response to carvedilol treatment based on their genotype. Other genetic and clinical factors may also influence a patient's chance of response.,
613979403,rs5443,GNB3,3,sumatriptan,Cluster Headache,CC,Patients with the CC genotype and cluster headache who are treated with triptans may be less likely to have reduced pain or attack frequency as compared to patients with the CT genotype. Other genetic and clinical factors may also influence a patient's response to sumatriptan.,
613979403,rs5443,GNB3,3,sumatriptan,Cluster Headache,CT,Patients with the CT genotype and cluster headache who are treated with triptans may be more likely to have reduced pain or attack frequency as compared to patients with the CC genotype. Other genetic and clinical factors may also influence a patient's response to sumatriptan.,
...,...,...,...,...,...,...,...,...
1451868520,rs11198893,GRK5,3,Beta Blocking Agents,Coronary Artery Disease,AG,Patients with the rs11198893 AG genotype and coronary artery disease may have decreased response when treated with beta blocking agents as compared to patients with the GG genotype. Other genetic and clinical factors may also influence response to beta blocking agents.,
1451868520,rs11198893,GRK5,3,Beta Blocking Agents,Coronary Artery Disease,GG,Patients with the rs11198893 GG genotype and coronary artery disease may have increased response when treated with beta blocking agents as compared to patients with the AA or AG genotypes. Other genetic and clinical factors may also influence response to beta blocking agents.,
1451868540,rs4752292,GRK5,3,Beta Blocking Agents,Coronary Artery Disease,GG,Patients with the rs4752292 GG genotype and coronary artery disease may have increased response when treated with beta blocking agents as compared to patients with the TT or GT genotypes. Other genetic and clinical factors may also influence response to beta blocking agents.,
1451868540,rs4752292,GRK5,3,Beta Blocking Agents,Coronary Artery Disease,GT,Patients with the rs4752292 GT genotype and coronary artery disease may have decreased response when treated with beta blocking agents as compared to patients with the GG genotype. Other genetic and clinical factors may also influence response to beta blocking agents.,


In [161]:
# Number of alleles (as opposed to variants)
len(efficacy_with_alleles)

5881

In [158]:
# Number of entries with allele function
len(efficacy_with_alleles[pd.notna(efficacy_with_alleles['Allele Function'])])

126

In [160]:
# Number of rows whose variant field contains an rs ID (plain substring match on 'rs')
len(efficacy_with_alleles[efficacy_with_alleles['Variant/Haplotypes'].str.contains('rs')])

5659
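
The substring match on `'rs'` works for this dataset, but it would also match any haplotype name that happens to contain those two letters. A stricter check, sketched below with the standard `re` module, looks for `rs` followed by digits as a whole token. The third example string is made up purely to show the difference.

```python
import re

# "rs" followed by one or more digits, as a whole token
RS_ID = re.compile(r"\brs\d+\b")


def rs_ids(variant_field):
    """Return all rs IDs found in a Variant/Haplotypes value."""
    return RS_ID.findall(variant_field)


examples = [
    "rs1042714",                    # from the table above
    "CYP2D6*1, CYP2D6*1xN",         # star alleles only
    "EXAMPLE*1 (reverse strand)",   # made-up value: contains 'rs' but no rs ID
]
for value in examples:
    print(value, "->", rs_ids(value), "| substring match:", "rs" in value)
```

The same pattern should also work as the argument to `.str.contains`, which treats its pattern as a regular expression by default.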

### Connecting with ClinVar

[Top of page](#Table-of-contents)

In [203]:
import re

In [230]:
# Can use Clinical Annotation ID which should appear in xrefs
all_pgkb_ids = []
for raw_cvs_xml in iterate_cvs_from_xml(drug_xml):
    if is_pgkb(raw_cvs_xml):
        record = ClinVarRecord(find_mandatory_unique_element(raw_cvs_xml, 'ReferenceClinVarAssertion'))
        if record.measure:
            # this is the soundest approach
            pgkb_ids = [
                int(elem.attrib['ID']) 
                for elem in find_elements(record.measure.measure_xml, './XRef[@DB="PharmGKB Clinical Annotation"]')
            ]
            if not pgkb_ids:
                # this yields a lot of redundancy
                pgkb_ids = [
                    int(re.split(r'[a-zA-Z]+', elem.attrib['ID'])[0])
                    for elem in find_elements(record.measure.measure_xml, './XRef[@DB="PharmGKB"]')
                ]
                if not pgkb_ids:
                    # last resort: take the ID from the citation URL - fragile, probably best avoided
                    pgkb_ids = [
                        int(elem.text.split('/')[-1])
                        for elem in find_elements(raw_cvs_xml, './ClinVarAssertion/ClinicalSignificance/Citation/URL')
                    ]
                    if not pgkb_ids:
                        pprint(raw_cvs_xml)
                        break
            all_pgkb_ids.extend(pgkb_ids)

In [231]:
len(all_pgkb_ids)

2000

In [232]:
# Cf. 401 records with PGKB submissions
len(set(all_pgkb_ids))

167
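
The gap between 2000 extracted IDs and 167 unique ones shows how much redundancy the fallbacks introduce. Below is a sketch of slightly more defensive ID parsing plus a duplicate count, using only the standard library. The helper names are hypothetical. `id_from_xref` assumes the XRef IDs start with the numeric annotation ID, which is what the `re.split` call above implies.

```python
import re
from collections import Counter

LEADING_DIGITS = re.compile(r"^\d+")
TRAILING_NUMERIC_SEGMENT = re.compile(r"/(\d+)/?$")


def id_from_xref(xref_id):
    """Return the leading numeric part of an XRef ID, or None if there is none."""
    match = LEADING_DIGITS.match(xref_id)
    return int(match.group()) if match else None


def id_from_url(url):
    """Return the final path segment of a URL if it is purely numeric, else None."""
    match = TRAILING_NUMERIC_SEGMENT.search(url.strip())
    return int(match.group(1)) if match else None


def duplicated_ids(ids):
    """Return {id: count} for IDs seen more than once."""
    return {i: n for i, n in Counter(ids).items() if n > 1}


candidates = [
    id_from_xref("981755803"),
    id_from_url("https://www.pharmgkb.org/clinicalAnnotation/981755803"),
    id_from_url("https://www.pharmgkb.org/clinicalAnnotation/1451243980"),
    # Guideline annotation IDs are not numeric, so this yields None instead of raising
    id_from_url("https://www.pharmgkb.org/guidelineAnnotation/PA166114461"),
]
ids = [i for i in candidates if i is not None]
print(duplicated_ids(ids))
```

Returning `None` for non-numeric IDs avoids the `ValueError` that a bare `int(...)` would raise on something like a `PA...` identifier.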

In [234]:
clinical_annotations[clinical_annotations['Clinical Annotation ID'].isin(set(all_pgkb_ids))]

Unnamed: 0,Clinical Annotation ID,Variant/Haplotypes,Gene,Level of Evidence,Level Override,Level Modifiers,Score,Phenotype Category,PMID Count,Evidence Count,Drug(s),Phenotype(s),Latest History Date (YYYY-MM-DD),URL,Specialty Population
0,981755803,rs75527207,CFTR,1A,,Rare Variant; Tier 1 VIP,234.875,Efficacy,28,30,ivacaftor,Cystic Fibrosis,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/981755803,Pediatric
3,1449191690,rs141033578,CFTR,1A,,Rare Variant; Tier 1 VIP,200.000,Efficacy,1,3,ivacaftor,Cystic Fibrosis,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1449191690,
4,1449191746,rs78769542,CFTR,1A,,Rare Variant; Tier 1 VIP,200.000,Efficacy,1,3,ivacaftor,Cystic Fibrosis,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1449191746,
27,655386913,"CYP2C19*1, CYP2C19*17",CYP2C19,3,,Tier 1 VIP,6.000,Toxicity,15,16,clopidogrel,Acute coronary syndrome;Coronary Artery Disease;Hemorrhage;Myocardial Infarction,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/655386913,
159,981201854,rs28399499,CYP2B6,3,,Tier 1 VIP,5.250,Metabolism/PK,7,7,nevirapine,HIV Infections,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/981201854,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4531,1451237940,rs9923231,VKORC1,1A,,Tier 1 VIP,117.000,Dosage,10,11,phenprocoumon,,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1451237940,Pediatric
4533,1451243676,rs9923231,VKORC1,2A,,Tier 1 VIP,8.250,Toxicity,3,4,phenprocoumon,Hemorrhage;over-anticoagulation;time above therapeutic range,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1451243676,
4535,1451245360,rs1051266,SLC19A1,2A,,Tier 1 VIP,14.125,Efficacy,9,10,methotrexate,"Arthritis, Rheumatoid",2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1451245360,
4762,1449191758,rs75541969,CFTR,1A,,Rare Variant; Tier 1 VIP,200.000,Efficacy,1,3,ivacaftor,Cystic Fibrosis,2021-03-24,https://www.pharmgkb.org/clinicalAnnotation/1449191758,


### Star alleles

[Top of page](#Table-of-contents)

Star alleles, e.g. for [CYP2D6](https://www.ncbi.nlm.nih.gov/books/NBK574601/), correspond to
> specific combinations of single nucleotide polymorphisms (SNPs) and/or small insertions and deletions (indels).... In addition, the CYP2D6 gene locus contains a number of complex structural variants including full gene deletions, gene duplications and multiplications [[via](https://www.nature.com/articles/s41525-020-0135-2)]

`CYP2D6*1` is the reference allele; `CYP2D6*(gene variant)xN` refers to `N` copies of the gene.
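
A minimal sketch of parsing this notation into gene, allele number and copy number. The regular expression only covers the forms that appear in the PharmGKB `Variant/Haplotypes` values in this notebook (`GENE*N`, `GENE*NxN`, `GENE*Nx2`); it is not a general star allele grammar, and the function name is hypothetical.

```python
import re

STAR_ALLELE = re.compile(
    r"^(?P<gene>[A-Za-z0-9]+)\*(?P<allele>\d+)(?:x(?P<copies>N|\d+))?$"
)


def parse_star_alleles(haplotypes):
    """Parse a comma-separated haplotype string into (gene, allele, copies) tuples.

    copies is None when no copy number suffix is present, 'N' for an
    unspecified number of copies, or a digit string such as '2'.
    """
    parsed = []
    for item in haplotypes.split(","):
        match = STAR_ALLELE.match(item.strip())
        if match is None:
            raise ValueError(f"Unrecognised star allele: {item.strip()!r}")
        parsed.append((match["gene"], match["allele"], match["copies"]))
    return parsed


print(parse_star_alleles("CYP2D6*1, CYP2D6*1xN, CYP2D6*10x2"))
```

Raising on anything unrecognised is a deliberate choice here: given how heterogeneous the nomenclature is, silent partial parses seem more dangerous than a loud failure.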

Nomenclature is really heterogeneous, compare [HLA](http://hla.alleles.org/nomenclature/naming.html) - there are lots of rabbit holes we could go down!!

Conversion to rs / hgvs? e.g. in [PharmVar](https://www.pharmvar.org/gene/CYP2D6)
* has [data download](https://www.pharmvar.org/download)
* also has an [API](https://www.pharmvar.org/documentation)!

In [170]:
no_rs = efficacy_with_alleles[~efficacy_with_alleles['Variant/Haplotypes'].str.contains('rs')]['Variant/Haplotypes'].tolist()

In [238]:
set(no_rs)

{'CYP2B6*1, CYP2B6*4, CYP2B6*5, CYP2B6*6, CYP2B6*7',
 'CYP2B6*1, CYP2B6*5',
 'CYP2B6*1, CYP2B6*6',
 'CYP2C19*1, CYP2C19*2',
 'CYP2C19*1, CYP2C19*2, CYP2C19*3',
 'CYP2C19*1, CYP2C19*2, CYP2C19*3, CYP2C19*17',
 'CYP2C8*1, CYP2C8*2, CYP2C8*3, CYP2C8*4',
 'CYP2C8*1, CYP2C8*3',
 'CYP2C9*1, CYP2C9*2, CYP2C9*3',
 'CYP2C9*1, CYP2C9*2, CYP2C9*3, CYP2C9*13, CYP2C9*14',
 'CYP2C9*1, CYP2C9*3',
 'CYP2D6*1, CYP2D6*10',
 'CYP2D6*1, CYP2D6*1xN',
 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*2xN, CYP2D6*3, CYP2D6*4, CYP2D6*6',
 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*2xN, CYP2D6*4, CYP2D6*5, CYP2D6*10, CYP2D6*35xN',
 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2xN',
 'CYP2D6*1, CYP2D6*2, CYP2D6*2xN, CYP2D6*3, CYP2D6*4, CYP2D6*6',
 'CYP2D6*1, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*7, CYP2D6*9, CYP2D6*10, CYP2D6*10x2, CYP2D6*11, CYP2D6*17, CYP2D6*21, CYP2D6*36, CYP2D6*41',
 'CYP2D6*1, CYP2D6*3, CYP2D6*4',
 'CYP2D6*1, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*17',
 'CYP2D6*1, CYP2D6*4',
 'C

In [172]:
import requests

In [173]:
def get_pharmvar_result(allele):
    # Returns a list of allele records, or an error dict if the allele is not found
    return requests.get(f'https://www.pharmvar.org/api-service/alleles/{allele}', timeout=30).json()

In [174]:
get_pharmvar_result('CYP2C9*2')

[{'geneSymbol': 'CYP2C9',
  'alleleName': 'CYP2C9*2',
  'pvId': 'PV00538',
  'legacyLabel': None,
  'coreAllele': None,
  'evidenceLevel': '0',
  'description': None,
  'function': 'decreased function',
  'activeInd': True,
  'references': [{'citation': 'Rettie et al. 1994',
    'url': 'http://www.ncbi.nlm.nih.gov/pubmed/8004131'},
   {'citation': 'Crespi et al. 1997',
    'url': 'http://www.ncbi.nlm.nih.gov/pubmed/9241660'},
   {'citation': 'deposited by Gaedigk et al.', 'url': None},
   {'citation': 'King et al. 2004',
    'url': 'http://www.ncbi.nlm.nih.gov/pubmed/15608560'},
   {'citation': 'Takahashi et al. 2004',
    'url': 'http://www.ncbi.nlm.nih.gov/pubmed/15070684'},
   {'citation': 'deposited by Campos et al.', 'url': None}],
  'variants': [{'referenceSequence': 'NC_000010.11',
    'referenceLocation': 'Sequence Start',
    'referenceCollections': ['GRCh38'],
    'hgvs': 'NC_000010.11:g.94942290C>T',
    'rsId': 'rs1799853',
    'impact': 'R144C',
    'variantFrequency': [{'

In [178]:
get_pharmvar_result('NAT2*6')

{'errorMessage': 'Allele NAT2*6 could not be located in the PharmVar database.',
 'errorCode': 404}
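
Since the response is a list on success and an error dict on failure, any downstream code needs to handle both shapes. A sketch of collecting rs IDs from a result, based only on the fields visible in the two outputs above (`variants`, `rsId`, `errorCode`); the sample data is trimmed from those outputs and the function name is hypothetical.

```python
def rs_ids_from_pharmvar(result):
    """Collect unique rs IDs from a PharmVar allele lookup result.

    Returns an empty list for an error response such as a 404.
    """
    if isinstance(result, dict) and "errorCode" in result:
        return []
    rs_ids = []
    for allele in result:
        for variant in allele.get("variants") or []:
            rs_id = variant.get("rsId")
            if rs_id and rs_id not in rs_ids:
                rs_ids.append(rs_id)
    return rs_ids


# Trimmed from the CYP2C9*2 and NAT2*6 outputs above
found = [{
    "geneSymbol": "CYP2C9",
    "alleleName": "CYP2C9*2",
    "variants": [{"hgvs": "NC_000010.11:g.94942290C>T", "rsId": "rs1799853"}],
}]
not_found = {
    "errorMessage": "Allele NAT2*6 could not be located in the PharmVar database.",
    "errorCode": 404,
}
print(rs_ids_from_pharmvar(found))
print(rs_ids_from_pharmvar(not_found))
```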

### Notes

[Top of page](#Table-of-contents)

* PharmGKB has more data than it submits to ClinVar
    * only Level 1 and 2 clinical annotations are submitted, while most of the data is at Level 3
* Data is richer than ClinVar, but a fair amount of it is buried in free text annotations
    * in particular direction of effect
* Can connect with ClinVar RCVs via their internal identifiers
* Most data seems to use rs IDs
    * in theory we can get consequences via the alleles data (assuming we can determine the reference allele)
* Pharmacogenes with star alleles are few but important
    * will need some special treatment and possibly use of more resources like PharmVar
    * maybe parallels with how we handle other complex events in ClinVar
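
On direction of effect being buried in free text: a very crude keyword heuristic, sketched below, gives a feel for how much could be recovered automatically from efficacy annotations. The cue words are taken from the annotation texts shown earlier in this notebook; this is an illustration only, not a validated classifier, and it would not make sense for toxicity texts, where "increased" refers to risk rather than response.

```python
import re

# Cue phrases observed in the efficacy annotation texts above
CUES = {
    "decreased": re.compile(r"\b(poorer|decreased|less likely)\b", re.IGNORECASE),
    "increased": re.compile(r"\b(better|increased|more likely)\b", re.IGNORECASE),
}


def direction_cues(annotation_text):
    """Return the set of direction labels whose cue words appear in the text."""
    return {label for label, pattern in CUES.items() if pattern.search(annotation_text)}


cc_text = ("Patients with the CC genotype and heart failure may have a poorer response "
           "to carvedilol treatment as compared to patients with the CG or GG genotype.")
cg_text = ("Patients with the CG genotype and heart failure may have a poorer response "
           "to carvedilol treatment as compared to patients with the GG genotype and a "
           "better response as compared to patients with the CC genotype.")
gg_ivacaftor_text = ("Patients with the rs75527207 GG genotype and cystic fibrosis have an "
                     "unknown response to ivacaftor treatment.")

for text in (cc_text, cg_text, gg_ivacaftor_text):
    print(direction_cues(text))
```

The heterozygous case returns both labels, which illustrates the underlying problem: direction is always relative to a comparison genotype, so a single label per row loses information.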

## General

[Top of page](#Table-of-contents)

Thinking both about PharmGKB data and the more general question of other data sources. Options:

* Add a new data source pipeline
    * most likely more data even from submitters to ClinVar
    * can also generalise to sources that don't submit to ClinVar at all
    * can be used as additional annotations to ClinVar or entirely separate submissions
    * probably more work for us
* Start parsing submitted records in ClinVar
    * beneficial if it's common that SCVs have more info than in RCV
    * potentially can get data from multiple upstream sources with a single SCV parser
    * lends itself to enriching "core" ClinVar data - ClinVar takes care of linkage
    * potential for extra/duplicate work aggregating submissions to ClinVar


### Questions for 29/9 meeting

* Should we start to parse SCVs in ClinVar?
* Is it worth trying other ways of linking drug & disease within ClinVar?
* What would be useful to get directly from PharmGKB (besides just more data)?
* Does anyone have familiarity with pharmacogenes, star alleles and other nomenclature?
* Other questions you have, other info that would be helpful for decision making

### Meeting notes

[Top of page](#Table-of-contents)

* existence of the drugResponse field changes the meaning of disease from source - check OT is ok with this
* maybe this is why CV doesn't include disease traits in RCV - can't confidently associate the variant with the disease, only the drug response
* disease traits are potentially more ambiguous - free text, not annotated by CV with xrefs
    * probably extra manual curation for us
* are there other terms for efficacy we can consider - depends on how efficacy is measured
* same question as for ClinVar - if drug & disease occur in the same record, does it mean the drug is specifically targeting that disease?
* other things to highlight - really low number of exact efficacy terms, can provide evidence levels from pharmgkb
* next steps - basically investigation into PharmGKB and/or SCV, but pending some questions for OT to raise at next meeting