{ "cells": [ { "cell_type": "markdown", "id": "e5e313ad-5396-4729-9b97-5fbbeb2c726b", "metadata": {}, "source": [ "# Star alleles\n", "\n", "## Table of contents\n", "\n", "1. [Non-rsID records](#Non-rsID-records)\n", "2. [Genotype/allele annotations](#Genotype/allele-annotations)\n", "3. [Allele definition tables](#Allele-definition-tables)\n", "    * [Comparison with PharmVar](#Comparison-with-PharmVar)\n", "    * [Informativeness](#Informativeness)\n", "4. [Summary and questions](#Summary)" ] }, { "cell_type": "code", "execution_count": 1, "id": "a637b49e-beea-446c-806e-b6a1b5e74e69", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import requests\n", "\n", "from opentargets_pharmgkb.pandas_utils import read_tsv_to_df" ] }, { "cell_type": "code", "execution_count": 2, "id": "c45560db-eccc-407e-8ac8-e9fde0529e22", "metadata": {}, "outputs": [], "source": [ "work_dir = '/home/april/projects/opentargets/pharmgkb/star-alleles'" ] }, { "cell_type": "code", "execution_count": null, "id": "7d16c67b-991f-4edf-b2cc-44a4932c2ab7", "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Rerun to refresh data\n", "# %cd (not !cd) is needed here: each ! command runs in its own subshell, so a !cd does not persist\n", "%cd {work_dir}\n", "!wget -q https://api.pharmgkb.org/v1/download/file/data/clinicalAnnotations.zip\n", "!unzip -qj clinicalAnnotations.zip \"*.tsv\" -d {work_dir}\n", "!rm clinicalAnnotations.zip" ] }, { "cell_type": "markdown", "id": "91dc47ce-fcff-490c-b2de-25f74d595762", "metadata": {}, "source": [ "## Non-rsID records\n", "\n", "[Top of page](#Table-of-contents)" ] }, { "cell_type": "code", "execution_count": 3, "id": "4975eb84-944a-49a5-b226-82c011281334", "metadata": {}, "outputs": [], "source": [ "annotations_df = read_tsv_to_df(f'{work_dir}/clinical_annotations.tsv')\n", "alleles_df = read_tsv_to_df(f'{work_dir}/clinical_ann_alleles.tsv')" ] }, { "cell_type": "code", "execution_count": 4, "id": "0b25a03e-7790-4056-8767-08fff2ee0fcb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5101" ] }, "execution_count": 4, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "len(annotations_df)" ] }, { "cell_type": "code", "execution_count": 5, "id": "44d292b9-aeaf-4912-b750-710c6cf6a9c0", "metadata": {}, "outputs": [], "source": [ "no_rs_annotations = annotations_df[~annotations_df['Variant/Haplotypes'].str.contains('rs')]" ] }, { "cell_type": "code", "execution_count": 6, "id": "1046110e-8fb9-41a5-a5b0-2a1715623888", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "596" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(no_rs_annotations)" ] }, { "cell_type": "code", "execution_count": 7, "id": "a845d217-10da-4909-8df0-a1b5ee444e21", "metadata": {}, "outputs": [], "source": [ "# Check names to see if there's anything truly bizarre\n", "names = no_rs_annotations['Variant/Haplotypes'].unique()" ] }, { "cell_type": "code", "execution_count": 8, "id": "4b5c33f1-e39f-4ed5-9fc0-7302cb3212a2", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array(['HLA-B*15:02',\n", " 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41',\n", " 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2xN, CYP2D6*4, CYP2D6*5',\n", " 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*2xN',\n", " 'CYP2D6*1, CYP2D6*3, CYP2D6*4, CYP2D6*4xN, CYP2D6*5, CYP2D6*6',\n", " 'CYP2D6*1, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*7, CYP2D6*9, CYP2D6*10, CYP2D6*10x2, CYP2D6*11, CYP2D6*17, CYP2D6*21, CYP2D6*36, CYP2D6*41',\n", " 'UGT1A3*1, UGT1A3*2, UGT1A3*3', 'HLA-B*55:01',\n", " 'CYP2C19*1, CYP2C19*17',\n", " 'NAT2*4, NAT2*5, NAT2*6, NAT2*7, NAT2*12, NAT2*13',\n", " 'CYP3A5*1, CYP3A5*3', 'CYP2C9*1, CYP2C9*3',\n", " 'CYP2C19*1, CYP2C19*2, CYP2C19*3', 'UGT1A1*1, UGT1A1*28',\n", " 'CYP2B6*1, CYP2B6*6', 'NUDT15*1, NUDT15*4, NUDT15*5, NUDT15*6',\n", " 'NUDT15*1, NUDT15*6', 'CYP2D6*1, CYP2D6*10', 'UGT1A1*1, UGT1A1*6',\n", " 'CYP2C9*1, CYP2C9*2, CYP2C9*3', 'HLA-B*48:01',\n", " 'CYP2C19*1, CYP2C19*2, CYP2C19*17', 
'HLA-B*15:12',\n", " 'CYP2D6*1, CYP2D6*2, CYP2D6*2xN, CYP2D6*3, CYP2D6*4, CYP2D6*6',\n", " 'CYP2C19*1, CYP2C19*2',\n", " 'CYP2C19*1, CYP2C19*2, CYP2C19*3, CYP2C19*8, CYP2C19*9, CYP2C19*17',\n", " 'CYP2C8*1, CYP2C8*2, CYP2C8*3, CYP2C8*4',\n", " 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*2xN, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*7',\n", " 'HLA-B*38:02', 'CYP2C19*2', 'HLA-B*13:01', 'HLA-B*58:01',\n", " 'CYP2D6*2, CYP2D6*10', 'CYP2D6*1, CYP2D6*4', 'HLA-B*51:01',\n", " 'CYP2D6*1, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*17, CYP2D6*29, CYP2D6*35, CYP2D6*41',\n", " 'CYP2D6*1, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*14',\n", " 'CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*2xN, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*17, CYP2D6*29, CYP2D6*36, CYP2D6*41',\n", " 'CYP3A4*1, CYP3A4*18, CYP3A4*20, CYP3A4*22',\n", " 'CYP2D6*1, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*10',\n", " 'CYP2B6*1, CYP2B6*5',\n", " 'NAT2*4, NAT2*5A, NAT2*5B, NAT2*5C, NAT2*6A, NAT2*6B, NAT2*6J, NAT2*6O, NAT2*7A, NAT2*7B, NAT2*7G, NAT2*12A, NAT2*13A, NAT2*14A',\n", " 'CYP2D6*1, CYP2D6*3, CYP2D6*4', 'HLA-A*33:03',\n", " 'CYP2D6*1, CYP2D6*3, CYP2D6*4, CYP2D6*5',\n", " 'CYP2C9*1, CYP2C9*2, CYP2C9*3, CYP2C9*5, CYP2C9*6, CYP2C9*8, CYP2C9*11, CYP2C9*13, CYP2C9*14, CYP2C9*16, CYP2C9*29, CYP2C9*31, CYP2C9*33, CYP2C9*37, CYP2C9*39, CYP2C9*42, CYP2C9*43, CYP2C9*45, CYP2C9*50, CYP2C9*52, CYP2C9*55',\n", " 'CYP2B6*1, CYP2B6*4, CYP2B6*5, CYP2B6*6, CYP2B6*7',\n", " 'G6PD A- 202A_376G, G6PD B (reference)',\n", " 'CYP2D6*1, CYP2D6*5, CYP2D6*10',\n", " 'CYP2D6*1, CYP2D6*4, CYP2D6*4xN, CYP2D6*5, CYP2D6*10, CYP2D6*17, CYP2D6*92, CYP2D6*96'],\n", " dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Note that the \"variant/haplotype name\" is a listing of which alleles are annotated in the specific record\n", "names[:50]" ] }, { "cell_type": "code", "execution_count": 9, "id": "744854a8-dec3-4f92-92ae-7b2f10b943d2", 
"metadata": {}, "outputs": [], "source": [ "# Not necessarily an important distinction, but just to check...\n", "star_allele_names = [n for n in names if '*' in n]\n", "no_star_names = [n for n in names if '*' not in n]" ] }, { "cell_type": "code", "execution_count": 10, "id": "86cd0116-10c9-4e4b-8a3b-dbf666b56c1b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['G6PD A- 202A_376G, G6PD B (reference)',\n", " 'GSTT1 non-null, GSTT1 null',\n", " 'GSTM1 non-null, GSTM1 null',\n", " 'G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham',\n", " 'SLC6A4 HTTLPR long form (L allele), SLC6A4 HTTLPR short form (S allele)',\n", " 'G6PD B (reference), G6PD Mediterranean Haplotype',\n", " 'G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham',\n", " 'G6PD B (reference), G6PD Canton, Taiwan-Hakka, Gifu-like, Agrigento-like',\n", " 'G6PD B (reference), G6PD Mediterranean Haplotype, G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham',\n", " 'G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean Haplotype, G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham',\n", " 'G6PD A- 202A_376G']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "no_star_names" ] }, { "cell_type": "markdown", "id": "239d6533-0934-4a8f-ac13-3f6d5e7ac6dc", "metadata": {}, "source": [ "No star allele observations:\n", "* [G6PD](https://www.pharmgkb.org/gene/PA28469/haplotype) seems well-defined though the naming is idiosyncratic (e.g. is it safe to just comma-split these strings?)\n", " * may be a bit clearer in the alleles tables, e.g. 
[here](https://www.pharmgkb.org/clinicalAnnotation/1183621000)\n", "* [GSTT1](https://www.pharmgkb.org/gene/PA183/haplotype) and [GSTM1](https://www.pharmgkb.org/gene/PA182/haplotype): null/non-null simply denotes absence or presence of the entire gene; if this naming convention is standard, we can work with it\n", "* [SLC6A4](https://www.pharmgkb.org/gene/PA312/haplotype) seems to be a special case (HTTLPR long form vs. short form)\n", "\n", "Note that we can still get the affected gene for every one of these alleles directly from PharmGKB." ] }, { "cell_type": "code", "execution_count": 11, "id": "6f6706cf-0a9c-4263-a83d-6fcc212a554e", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Confirming there are no missing genes in any of these\n", "no_rs_annotations['Gene'].isna().any()" ] }, { "cell_type": "markdown", "id": "6f1b68fa-ee27-4934-9fab-ff89d0d93dcf", "metadata": {}, "source": [ "## Genotype/allele annotations\n", "\n", "[Top of page](#Table-of-contents)" ] }, { "cell_type": "code", "execution_count": 12, "id": "2dfb38c5-3670-4acd-b325-265584d0a466", "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_colwidth', None)" ] }, { "cell_type": "code", "execution_count": 13, "id": "fbc07fe0-8ef5-4478-be22-41dae966c41b", "metadata": {}, "outputs": [], "source": [ "# Inner join: keep allele-level rows only for the non-rsID annotations\n", "joined_df = alleles_df.merge(no_rs_annotations, on='Clinical Annotation ID')" ] }, { "cell_type": "code", "execution_count": 14, "id": "0aae460a-0f32-4373-8cc4-6d4ddd816500", "metadata": {}, "outputs": [], "source": [ "# Remove some columns to make things easier to read...\n", "joined_df = joined_df[['Clinical Annotation ID', 'Genotype/Allele', 'Annotation Text',\n", " 'Allele Function', 'Variant/Haplotypes', 'Gene', 'Level of Evidence',\n", " 'Phenotype Category', 'Drug(s)', 'Phenotype(s)']]" ] }, { "cell_type": "code", "execution_count": 15, "id": "25ca829f-7bdd-4909-8ab2-19df6edde47a", "metadata": {}, "outputs": [ { "data": 
{ "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Clinical Annotation ID</th><th>Genotype/Allele</th><th>Annotation Text</th><th>Allele Function</th><th>Variant/Haplotypes</th><th>Gene</th><th>Level of Evidence</th><th>Phenotype Category</th><th>Drug(s)</th><th>Phenotype(s)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1</th><td>1451259580</td><td>*1</td><td>The CYP2D6*1 allele is assigned as a normal function allele by CPIC. Patients carrying the CYP2D6*1 allele in combination with alleles that result in a normal metabolizer phenotype who are treated with amitriptyline may have decreased likelihood of side effects as compared to patients with a combination of alleles that result in intermediate or poor metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>Normal function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>2</th><td>1451259580</td><td>*1xN</td><td>The CYP2D6*1xN alleles (*1x2 and *1x≥3) have been assigned as increased function alleles by CPIC. Patients carrying the CYP2D6*1xN allele in combination with alleles that result in a normal metabolizer phenotype who are treated with amitriptyline may have decreased likelihood of side effects as compared to patients with a combination of alleles that result in intermediate or poor metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>Increased function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>3</th><td>1451259580</td><td>*2</td><td>The CYP2D6*2 allele is assigned as a normal function allele by CPIC. Patients carrying the CYP2D6*2 allele in combination with alleles that result in a normal metabolizer phenotype who are treated with amitriptyline may have decreased likelihood of side effects as compared to patients with a combination of alleles that result in intermediate or poor metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>Normal function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>4</th><td>1451259580</td><td>*3</td><td>The CYP2D6*3 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*3 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>No function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>5</th><td>1451259580</td><td>*4</td><td>The CYP2D6*4 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*4 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>No function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>6</th><td>1451259580</td><td>*5</td><td>The CYP2D6*5 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*5 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>No function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>7</th><td>1451259580</td><td>*6</td><td>The CYP2D6*6 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*6 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>No function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>8</th><td>1451259580</td><td>*10</td><td>The CYP2D6*10 allele is assigned as a decreased function allele with an activity value of 0.25 by CPIC. Patients carrying the CYP2D6*10 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>Decreased function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"    <tr><th>9</th><td>1451259580</td><td>*41</td><td>The CYP2D6*41 allele is assigned as a decreased function allele with an activity value of 0.5 by CPIC. Patients carrying the CYP2D6*41 allele in combination with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline.</td><td>Decreased function</td><td>CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41</td><td>CYP2D6</td><td>1A</td><td>Toxicity</td><td>amitriptyline</td><td>Depressive Disorder</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Clinical Annotation ID Genotype/Allele \\\n", "1 1451259580 *1 \n", "2 1451259580 *1xN \n", "3 1451259580 *2 \n", "4 1451259580 *3 \n", "5 1451259580 *4 \n", "6 1451259580 *5 \n", "7 1451259580 *6 \n", "8 1451259580 *10 \n", "9 1451259580 *41 \n", "\n", " Annotation Text \\\n", "1 The CYP2D6*1 allele is assigned as a normal function allele by CPIC. Patients carrying the CYP2D6*1 allele in combination with alleles that result in a normal metabolizer phenotype who are treated with amitriptyline may have decreased likelihood of side effects as compared to patients with a combination of alleles that result in intermediate or poor metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "2 The CYP2D6*1xN alleles (*1x2 and *1x≥3) have been assigned as increased function alleles by CPIC. Patients carrying the CYP2D6*1xN allele in combination with alleles that result in a normal metabolizer phenotype who are treated with amitriptyline may have decreased likelihood of side effects as compared to patients with a combination of alleles that result in intermediate or poor metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "3 The CYP2D6*2 allele is assigned as a normal function allele by CPIC. Patients carrying the CYP2D6*2 allele in combination with alleles that result in a normal metabolizer phenotype who are treated with amitriptyline may have decreased likelihood of side effects as compared to patients with a combination of alleles that result in intermediate or poor metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "4 The CYP2D6*3 allele is assigned as a no function allele by CPIC. 
Patients carrying the CYP2D6*3 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "5 The CYP2D6*4 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*4 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "6 The CYP2D6*5 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*5 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "7 The CYP2D6*6 allele is assigned as a no function allele by CPIC. Patients carrying the CYP2D6*6 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "8 The CYP2D6*10 allele is assigned as a decreased function allele with an activity value of 0.25 by CPIC. 
Patients carrying the CYP2D6*10 allele in combination with with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "9 The CYP2D6*41 allele is assigned as a decreased function allele with an activity value of 0.5 by CPIC. Patients carrying the CYP2D6*41 allele in combination with alleles that result in intermediate or poor metabolizer phenotype who are treated with amitriptyline may have increased likelihood of side effects as compared to patients with alleles that result in a normal metabolizer phenotype. Other genetic and clinical factors may also influence response to amitriptyline. \n", "\n", " Allele Function \\\n", "1 Normal function \n", "2 Increased function \n", "3 Normal function \n", "4 No function \n", "5 No function \n", "6 No function \n", "7 No function \n", "8 Decreased function \n", "9 Decreased function \n", "\n", " Variant/Haplotypes \\\n", "1 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "2 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "3 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "4 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "5 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "6 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "7 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "8 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "9 CYP2D6*1, CYP2D6*1xN, CYP2D6*2, CYP2D6*3, CYP2D6*4, 
CYP2D6*5, CYP2D6*6, CYP2D6*10, CYP2D6*41 \n", "\n", " Gene Level of Evidence Phenotype Category Drug(s) \\\n", "1 CYP2D6 1A Toxicity amitriptyline \n", "2 CYP2D6 1A Toxicity amitriptyline \n", "3 CYP2D6 1A Toxicity amitriptyline \n", "4 CYP2D6 1A Toxicity amitriptyline \n", "5 CYP2D6 1A Toxicity amitriptyline \n", "6 CYP2D6 1A Toxicity amitriptyline \n", "7 CYP2D6 1A Toxicity amitriptyline \n", "8 CYP2D6 1A Toxicity amitriptyline \n", "9 CYP2D6 1A Toxicity amitriptyline \n", "\n", " Phenotype(s) \n", "1 Depressive Disorder \n", "2 Depressive Disorder \n", "3 Depressive Disorder \n", "4 Depressive Disorder \n", "5 Depressive Disorder \n", "6 Depressive Disorder \n", "7 Depressive Disorder \n", "8 Depressive Disorder \n", "9 Depressive Disorder " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# https://www.pharmgkb.org/clinicalAnnotation/1451259580\n", "joined_df[joined_df['Clinical Annotation ID'] == '1451259580']" ] }, { "cell_type": "code", "execution_count": 16, "id": "81f81678-fee7-4dac-a8f8-fbe0a701af2d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Clinical Annotation ID</th><th>Genotype/Allele</th><th>Annotation Text</th><th>Allele Function</th><th>Variant/Haplotypes</th><th>Gene</th><th>Level of Evidence</th><th>Phenotype Category</th><th>Drug(s)</th><th>Phenotype(s)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>1551</th><td>1448427588</td><td>non-null/non-null</td><td>Patients with the non-null/non-null genotype may have a decreased risk for neutropenia when treated with clozapine as compared to patients with the null/null genotype. Other genetic and clinical factors may also influence neutropenia risk.</td><td>NaN</td><td>GSTT1 non-null, GSTT1 null</td><td>GSTT1</td><td>3</td><td>Toxicity</td><td>clozapine</td><td>NaN</td></tr>\n",
"    <tr><th>1552</th><td>1448427588</td><td>null/non-null</td><td>Patients with the null/non-null genotype may have a decreased risk for neutropenia when treated with clozapine as compared to patients with the null/null genotype. Other genetic and clinical factors may also influence neutropenia risk.</td><td>NaN</td><td>GSTT1 non-null, GSTT1 null</td><td>GSTT1</td><td>3</td><td>Toxicity</td><td>clozapine</td><td>NaN</td></tr>\n",
"    <tr><th>1553</th><td>1448427588</td><td>null/null</td><td>Patients with the null/null genotype may have an increased risk for neutropenia when treated with clozapine as compared to patients with the null/non-null or non-null/non-null genotype. Other genetic and clinical factors may also influence neutropenia risk.</td><td>NaN</td><td>GSTT1 non-null, GSTT1 null</td><td>GSTT1</td><td>3</td><td>Toxicity</td><td>clozapine</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Clinical Annotation ID Genotype/Allele \\\n", "1551 1448427588 non-null/non-null \n", "1552 1448427588 null/non-null \n", "1553 1448427588 null/null \n", "\n", " Annotation Text \\\n", "1551 Patients with the non-null/non-null genotype may have a decreased risk for neutropenia when treated with clozapine as compared to patients with the null/null genotype. Other genetic and clinical factors may also influence neutropenia risk. \n", "1552 Patients with the null/non-null genotype may have a decreased risk for neutropenia when treated with clozapine as compared to patients with the null/null genotype. Other genetic and clinical factors may also influence neutropenia risk. \n", "1553 Patients with the null/null genotype may have an increased risk for neutropenia when treated with clozapine as compared to patients with the null/non-null or non-null/non-null genotype. Other genetic and clinical factors may also influence neutropenia risk. \n", "\n", " Allele Function Variant/Haplotypes Gene Level of Evidence \\\n", "1551 NaN GSTT1 non-null, GSTT1 null GSTT1 3 \n", "1552 NaN GSTT1 non-null, GSTT1 null GSTT1 3 \n", "1553 NaN GSTT1 non-null, GSTT1 null GSTT1 3 \n", "\n", " Phenotype Category Drug(s) Phenotype(s) \n", "1551 Toxicity clozapine NaN \n", "1552 Toxicity clozapine NaN \n", "1553 Toxicity clozapine NaN " ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# https://www.pharmgkb.org/clinicalAnnotation/1448427588\n", "joined_df[joined_df['Clinical Annotation ID'] == '1448427588']" ] }, { "cell_type": "code", "execution_count": 17, "id": "4f766749-6127-4729-80b8-94336bc924c5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Clinical Annotation ID</th><th>Genotype/Allele</th><th>Annotation Text</th><th>Allele Function</th><th>Variant/Haplotypes</th><th>Gene</th><th>Level of Evidence</th><th>Phenotype Category</th><th>Drug(s)</th><th>Phenotype(s)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>498</th><td>981419263</td><td>*15:02</td><td>Patients with one or two copies of the HLA-B*15:02 allele may have an increased risk of Severe Cutaneous Adverse Reactions when treated with carbamazepine as compared to patients with no HLA-B*15:02 alleles or negative for the HLA-B*15:02 test. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence risk of carbamazepine-induced adverse reactions.</td><td>Presence</td><td>HLA-B*15:02, HLA-B*15:11</td><td>HLA-B</td><td>1A</td><td>Toxicity</td><td>carbamazepine</td><td>drug reaction with eosinophilia and systemic symptoms;Epidermal Necrolysis, Toxic;Maculopapular Exanthema;severe cutaneous adverse reactions;Stevens-Johnson Syndrome</td></tr>\n",
"    <tr><th>499</th><td>981419263</td><td>*15:11</td><td>Patients with one or two copies of the HLA-B*15:11 allele may have an increased risk of Severe Cutaneous Adverse Reactions, such as Stevens-Johnson Syndrome and Toxic Epidermal Necrolysis, when treated with carbamazepine as compared to patients with no HLA-B*15:11 alleles or negative for the HLA-B*15:11 test. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence risk of carbamazepine-induced adverse reactions.</td><td>NaN</td><td>HLA-B*15:02, HLA-B*15:11</td><td>HLA-B</td><td>1A</td><td>Toxicity</td><td>carbamazepine</td><td>drug reaction with eosinophilia and systemic symptoms;Epidermal Necrolysis, Toxic;Maculopapular Exanthema;severe cutaneous adverse reactions;Stevens-Johnson Syndrome</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Clinical Annotation ID Genotype/Allele \\\n", "498 981419263 *15:02 \n", "499 981419263 *15:11 \n", "\n", " Annotation Text \\\n", "498 Patients with one or two copies of the HLA-B*15:02 allele may have an increased risk of Severe Cutaneous Adverse Reactions when treated with carbamazepine as compared to patients with no HLA-B*15:02 alleles or negative for the HLA-B*15:02 test. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence risk of carbamazepine-induced adverse reactions. \n", "499 Patients with one or two copies of the HLA-B*15:11 allele may have an increased risk of Severe Cutaneous Adverse Reactions, such as Stevens-Johnson Syndrome and Toxic Epidermal Necrolysis, when treated with carbamazepine as compared to patients with no HLA-B*15:11 alleles or negative for the HLA-B*15:11 test. However, conflicting evidence has been reported. Other genetic and clinical factors may also influence risk of carbamazepine-induced adverse reactions. 
\n", "\n", " Allele Function Variant/Haplotypes Gene Level of Evidence \\\n", "498 Presence HLA-B*15:02, HLA-B*15:11 HLA-B 1A \n", "499 NaN HLA-B*15:02, HLA-B*15:11 HLA-B 1A \n", "\n", " Phenotype Category Drug(s) \\\n", "498 Toxicity carbamazepine \n", "499 Toxicity carbamazepine \n", "\n", " Phenotype(s) \n", "498 drug reaction with eosinophilia and systemic symptoms;Epidermal Necrolysis, Toxic;Maculopapular Exanthema;severe cutaneous adverse reactions;Stevens-Johnson Syndrome \n", "499 drug reaction with eosinophilia and systemic symptoms;Epidermal Necrolysis, Toxic;Maculopapular Exanthema;severe cutaneous adverse reactions;Stevens-Johnson Syndrome " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# https://www.pharmgkb.org/clinicalAnnotation/981419263\n", "joined_df[joined_df['Clinical Annotation ID'] == '981419263']" ] }, { "cell_type": "code", "execution_count": 18, "id": "c8c65e1c-150e-4bb5-bb69-26b46f0443bb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Clinical Annotation ID</th><th>Genotype/Allele</th><th>Annotation Text</th><th>Allele Function</th><th>Variant/Haplotypes</th><th>Gene</th><th>Level of Evidence</th><th>Phenotype Category</th><th>Drug(s)</th><th>Phenotype(s)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>524</th><td>1183621000</td><td>A- 202A_376G</td><td>Patients with one X-chromosome and the A- 202A_376G allele who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with the reference B allele (non-deficient, class IV). Patients with two X-chromosomes and the A- 202A_376G allele in combination with another deficient class I-III allele who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Patients with two X-chromosomes and the A- 202A_376G allele in combination with a non-deficient allele who are treated with rasburicase have an unknown risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Other genetic and clinical factors may also influence risk of drug-induced hemolysis.</td><td>III/Deficient</td><td>G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham</td><td>G6PD</td><td>1A</td><td>Toxicity</td><td>rasburicase</td><td>Hemolysis;Methemoglobinemia</td></tr>\n",
"    <tr><th>525</th><td>1183621000</td><td>B (reference)</td><td>Patients with one X-chromosome and the reference B (reference) allele (non-deficient, class IV) who are treated with rasburicase may have a decreased risk of methemoglobinemia and/or hemolysis as compared to patients with a deficient class I-III allele. Patients with two X-chromosomes and two copies of the reference B allele (non-deficient, class IV) who are treated with rasburicase may have a decreased risk of methemoglobinemia and/or hemolysis as compared to patients with a deficient class I-III allele. Patients with two X-chromosomes, one copy of the reference B allele (non-deficient, class IV) and one deficient class I-III allele who are treated with rasburicase have an unknown risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Other genetic and clinical factors may also influence risk of drug-induced hemolysis.</td><td>IV/Normal</td><td>G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham</td><td>G6PD</td><td>1A</td><td>Toxicity</td><td>rasburicase</td><td>Hemolysis;Methemoglobinemia</td></tr>\n",
"    <tr><th>526</th><td>1183621000</td><td>Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham</td><td>Patients with one X-chromosome and the Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham allele (rs5030868 allele A) who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with the reference B allele (non-deficient, class IV)(rs5030868 allele G). Patients with two X-chromosomes and the Mediterranean, Dallas, Panama' Sassari, Cagliari, Birmingham variant (rs5030868 allele A) in combination with another deficient class I-III allele who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV)(rs5030868 allele G). Patients with two X-chromosomes and the Mediterranean, Dallas, Panama' Sassari, Cagliari, Birmingham variant (rs5030868 allele A) in combination with a non-deficient allele who are treated with rasburicase have an unknown risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Other genetic and clinical factors may also influence risk of drug-induced hemolysis.</td><td>II/Deficient</td><td>G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham</td><td>G6PD</td><td>1A</td><td>Toxicity</td><td>rasburicase</td><td>Hemolysis;Methemoglobinemia</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " Clinical Annotation ID \\\n", "524 1183621000 \n", "525 1183621000 \n", "526 1183621000 \n", "\n", " Genotype/Allele \\\n", "524 A- 202A_376G \n", "525 B (reference) \n", "526 Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham \n", "\n", " Annotation Text \\\n", "524 Patients with one X-chromosome and the A- 202A_376G allele who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with the reference B allele (non-deficient, class IV). Patients with two X-chromosomes and the A- 202A_376G allele in combination with another deficient class I-III allele who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Patients with two X-chromosomes and the A- 202A_376G allele in combination with a non-deficient allele who are treated with rasburicase have an unknown risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Other genetic and clinical factors may also influence risk of drug-induced hemolysis. \n", "525 Patients with one X-chromosome and the reference B (reference) allele (non-deficient, class IV) who are treated with rasburicase may have a decreased risk of methemoglobinemia and/or hemolysis as compared to patients with a deficient class I-III allele. Patients with two X-chromosomes and two copies of the reference B allele (non-deficient, class IV) who are treated with rasburicase may have a decreased risk of methemoglobinemia and/or hemolysis as compared to patients with a deficient class I-III allele. 
Patients with two X-chromosomes, one copy of the reference B allele (non-deficient, class IV) and one deficient class I-III allele who are treated with rasburicase have an unknown risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Other genetic and clinical factors may also influence risk of drug-induced hemolysis. \n", "526 Patients with one X-chromosome and the Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham allele (rs5030868 allele A) who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with the reference B allele (non-deficient, class IV)(rs5030868 allele G). Patients with two X-chromosomes and the Mediterranean, Dallas, Panama' Sassari, Cagliari, Birmingham variant (rs5030868 allele A) in combination with another deficient class I-III allele who are treated with rasburicase may have an increased risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV)(rs5030868 allele G). Patients with two X-chromosomes and the Mediterranean, Dallas, Panama' Sassari, Cagliari, Birmingham variant (rs5030868 allele A) in combination with a non-deficient allele who are treated with rasburicase have an unknown risk of methemoglobinemia and/or hemolysis as compared to patients with two copies of the reference B allele (non-deficient, class IV). Other genetic and clinical factors may also influence risk of drug-induced hemolysis. 
\n", "\n", " Allele Function \\\n", "524 III/Deficient \n", "525 IV/Normal \n", "526 II/Deficient \n", "\n", " Variant/Haplotypes \\\n", "524 G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham \n", "525 G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham \n", "526 G6PD A- 202A_376G, G6PD B (reference), G6PD Mediterranean, Dallas, Panama, Sassari, Cagliari, Birmingham \n", "\n", " Gene Level of Evidence Phenotype Category Drug(s) \\\n", "524 G6PD 1A Toxicity rasburicase \n", "525 G6PD 1A Toxicity rasburicase \n", "526 G6PD 1A Toxicity rasburicase \n", "\n", " Phenotype(s) \n", "524 Hemolysis;Methemoglobinemia \n", "525 Hemolysis;Methemoglobinemia \n", "526 Hemolysis;Methemoglobinemia " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# https://www.pharmgkb.org/clinicalAnnotation/1183621000\n", "joined_df[joined_df['Clinical Annotation ID'] == '1183621000']" ] }, { "cell_type": "markdown", "id": "1d115955-3a7b-4485-ba9b-725a164a9ecf", "metadata": {}, "source": [ "Notes:\n", "* These are mostly allele-level annotations, though some are genotype-level (as opposed to rsID records which are nearly all genotype-level)\n", " * Genotype-specific information is sometimes buried in the annotation text for each allele...\n", "* `*1xN` means `N` copies of the `*1` version of the gene" ] }, { "cell_type": "markdown", "id": "cb7bdef9-5d0d-4ceb-b7fc-af48b8c9d5a4", "metadata": {}, "source": [ "## Allele definition tables\n", "\n", "[Top of page](#Table-of-contents)" ] }, { "cell_type": "code", "execution_count": 19, "id": "ad564aef-5c58-4b06-9b7e-50391ab748db", "metadata": {}, "outputs": [], "source": [ "# Try to automatically get PGKB spreadsheet definitions\n", "allele_definition_url = 'https://api.pharmgkb.org/v1/download/file/attachment/{gene}_allele_definition_table.xlsx'" ] }, { "cell_type": "code", "execution_count": 20, 
"id": "b5d42cfb-68dc-471f-bc14-04203de26843", "metadata": {}, "outputs": [], "source": [ "genes = no_rs_annotations['Gene'].unique()" ] }, { "cell_type": "code", "execution_count": 21, "id": "dd272ffe-239e-4548-81d1-823b97317825", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['HLA-B', 'CYP2D6', 'UGT1A3', 'CYP2C19', 'NAT2', 'CYP3A5', 'CYP2C9',\n", " 'UGT1A1', 'CYP2B6', 'NUDT15', 'CYP2C8', 'CYP3A4', 'HLA-A', 'G6PD',\n", " 'UGT2B15', 'SLCO1B1', 'GSTT1', 'GSTM1', 'TPMT', 'SLC6A4', 'HLA-C',\n", " 'HLA-DRB1', 'HLA-DQB1', 'HLA-DPB1', 'CYP3A7', 'CYP2A6', 'HLA-DRB3',\n", " 'CYP1A2', 'UGT1A6', 'CYP2E1', 'UGT1A7', 'HLA-DQA1', 'UGT1A4',\n", " 'CYP1A1', 'CYP4F2'], dtype=object)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genes" ] }, { "cell_type": "code", "execution_count": 38, "id": "bcdfe29a-3436-4150-be26-1d4306bdaf64", "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for UGT1A3: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", 
"text": [ "Error for NAT2: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", 
"/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for UGT2B15: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for GSTT1: HTTP Error 404: \n", "Error for GSTM1: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for SLC6A4: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default 
style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for CYP3A7: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n", "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for CYP1A2: HTTP Error 404: \n", "Error for UGT1A6: HTTP Error 404: \n", "Error for CYP2E1: HTTP Error 404: \n", "Error for UGT1A7: HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Error for UGT1A4: HTTP Error 404: \n", "Error for CYP1A1: 
HTTP Error 404: \n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/home/april/projects/opentargets-pharmgkb/venv/lib/python3.8/site-packages/openpyxl/styles/stylesheet.py:226: UserWarning: Workbook contains no default style, apply openpyxl's default\n", " warn(\"Workbook contains no default style, apply openpyxl's default\")\n" ] } ], "source": [ "allele_def_tables = {}\n", "for gene in genes:\n", " try:\n", " allele_def_tables[gene] = pd.read_excel(allele_definition_url.format(gene=gene), \n", " storage_options={'User-Agent': 'Mozilla/5.0'},\n", " header=None)\n", " except Exception as e:\n", " print(f'Error for {gene}: {e}')" ] }, { "cell_type": "code", "execution_count": 23, "id": "e2e00bb9-ce4e-43ae-97fc-a44024ef8271", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'CYP2A6',\n", " 'CYP2B6',\n", " 'CYP2C19',\n", " 'CYP2C8',\n", " 'CYP2C9',\n", " 'CYP2D6',\n", " 'CYP3A4',\n", " 'CYP3A5',\n", " 'CYP4F2',\n", " 'G6PD',\n", " 'HLA-A',\n", " 'HLA-B',\n", " 'HLA-C',\n", " 'HLA-DPB1',\n", " 'HLA-DQA1',\n", " 'HLA-DQB1',\n", " 'HLA-DRB1',\n", " 'HLA-DRB3',\n", " 'NUDT15',\n", " 'SLCO1B1',\n", " 'TPMT',\n", " 'UGT1A1'}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Genes with allele tables\n", "pharmgkb_genes = set(allele_def_tables.keys())\n", "pharmgkb_genes" ] }, { "cell_type": "code", "execution_count": 24, "id": "6a9c031d-4d3a-4137-a12f-d190890abf53", "metadata": {}, "outputs": [], "source": [ "no_allele_def_table_genes = set(genes) - pharmgkb_genes" ] }, { "cell_type": "code", "execution_count": 25, "id": "931a666e-0db2-4605-86ae-1bee8957a6d2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'CYP1A1',\n", " 'CYP1A2',\n", " 'CYP2E1',\n", " 'CYP3A7',\n", " 'GSTM1',\n", " 'GSTT1',\n", " 'NAT2',\n", " 'SLC6A4',\n", " 'UGT1A3',\n", " 'UGT1A4',\n", " 'UGT1A6',\n", " 'UGT1A7',\n", " 'UGT2B15'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"no_allele_def_table_genes" ] }, { "cell_type": "markdown", "id": "f4076e48-7137-4d91-9304-cf4b11ea1c1d", "metadata": {}, "source": [ "Checked a few of this list and they indeed don't have definition tables in PharmGKB, categories I see:\n", "* Refer to another resource: [some (but not all) CYP](https://www.pharmgkb.org/gene/PA129), [NAT](https://www.pharmgkb.org/gene/PA18/haplotype), [UGT](https://www.pharmgkb.org/gene/PA37179/haplotype)\n", "* Null/non-null: [GSTT1](https://www.pharmgkb.org/gene/PA183/haplotype), [GSTM1](https://www.pharmgkb.org/gene/PA182/haplotype)\n", "* Special: [SLC6A4](https://www.pharmgkb.org/gene/PA312/haplotype)\n", "\n", "For now we'll skip these and look at those with allele definition tables (covers about 90% of no-RS records in PGKB)." ] }, { "cell_type": "code", "execution_count": 26, "id": "df1f5680-0396-4a8a-9c04-cce197e06db2", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "53" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What do we lose if we skip these?\n", "len(no_rs_annotations[no_rs_annotations['Gene'].isin(no_allele_def_table_genes)])" ] }, { "cell_type": "markdown", "id": "036fd18f-28c5-4e40-94d0-a2470c2fb7cb", "metadata": {}, "source": [ "Note that the allele definition tables vary in informativeness, so just because one is present doesn't mean we can necessarily use it.\n", "* More informative example: [CYP2D6](https://docs.google.com/spreadsheets/d/1tIovgq2w7FXv6g2ASiQd5EtYjxEbSzYrCRpg9dIlrEY/edit?usp=sharing)\n", "* Less informative example: [HLA-A](https://docs.google.com/spreadsheets/d/1Wz0F74sdY-hG0LJP1Rg5ePsZddVQ1pZrHDySOYgrOhI/edit?usp=sharing)\n", "\n", "Understanding the allele definition table:\n", "* First few rows give various definitions of variants: protein/chromosome/gene-level HGVS, and rsID if present\n", "* Each subsequent row gives what alleles are present for each of these variants for a particular named allele\n", " * In theory should be 
able to use the \"Genotype/Allele\" column from the clinical allele annotations to index into this table\n", "* The final column is \"structural variation\" and contains text describing the nature of the variant, e.g. `CYP2D7::CYP2D6 hybrid gene`\n", "* Missing values = reference? Or is e.g. *1/first row the reference? If so what does missing value mean?" ] }, { "cell_type": "markdown", "id": "f1d28cc9-0d6b-4d5a-a3df-3c17ab9c9b82", "metadata": {}, "source": [ "### Comparison with PharmVar\n", "\n", "[Top of page](#Table-of-contents)" ] }, { "cell_type": "code", "execution_count": 27, "id": "71d9cd9c-f93f-4b00-8f35-bc7fc3bba80b", "metadata": {}, "outputs": [], "source": [ "# Compare with what we would get from PharmVar\n", "pharmvar_url = 'https://www.pharmvar.org/api-service/alleles?exclude-sub-alleles=false&include-reference-variants=false&include-retired-alleles=false&include-retired-reference-sequences=false'\n", "response = requests.get(pharmvar_url)\n", "pharmvar_data = response.json()" ] }, { "cell_type": "code", "execution_count": 28, "id": "a1ce5f04-4990-478d-98db-2febb9270619", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1945" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 1 per allele\n", "len(pharmvar_data)" ] }, { "cell_type": "code", "execution_count": 29, "id": "be8553eb-0abb-43a6-abae-5ad5f0e083e7", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "{'geneSymbol': 'CYP2C9',\n", " 'alleleName': 'CYP2C9*49.001',\n", " 'pvId': 'PV00001',\n", " 'legacyLabel': 'CYP2C9*49',\n", " 'coreAllele': 'CYP2C9*49',\n", " 'evidenceLevel': 'L',\n", " 'description': None,\n", " 'function': 'uncertain function',\n", " 'activeInd': True,\n", " 'references': [{'citation': 'Dai et al. 
2013',\n", " 'url': 'http://www.ncbi.nlm.nih.gov/pubmed/23400009'}],\n", " 'variants': [{'referenceSequence': 'NG_008385.2',\n", " 'referenceLocation': 'ATG Start',\n", " 'referenceCollections': ['RefSeqGene'],\n", " 'hgvs': 'NG_008385.2:g.15972A>G',\n", " 'rsId': None,\n", " 'impact': 'I222V',\n", " 'variantFrequency': [],\n", " 'url': 'https://www.pharmvar.org/variant/13610',\n", " 'variantId': '49',\n", " 'position': 'NG_008385.2:g.10447A>G'},\n", " {'referenceSequence': 'NC_000010.10',\n", " 'referenceLocation': 'Sequence Start',\n", " 'referenceCollections': ['GRCh37'],\n", " 'hgvs': 'NC_000010.10:g.96708886A>G',\n", " 'rsId': None,\n", " 'impact': 'I222V',\n", " 'variantFrequency': [],\n", " 'url': 'https://www.pharmvar.org/variant/195',\n", " 'variantId': '49',\n", " 'position': 'NC_000010.10:g.96708886A>G'},\n", " {'referenceSequence': 'NM_000771.4',\n", " 'referenceLocation': 'ATG Start',\n", " 'referenceCollections': ['RefSeqTranscript'],\n", " 'hgvs': 'NM_000771.4:c.664A>G',\n", " 'rsId': None,\n", " 'impact': 'I222V',\n", " 'variantFrequency': [],\n", " 'url': 'https://www.pharmvar.org/variant/13765',\n", " 'variantId': '49',\n", " 'position': 'NM_000771.4:c.664A>G'},\n", " {'referenceSequence': 'NC_000010.11',\n", " 'referenceLocation': 'Sequence Start',\n", " 'referenceCollections': ['GRCh38'],\n", " 'hgvs': 'NC_000010.11:g.94949129A>G',\n", " 'rsId': None,\n", " 'impact': 'I222V',\n", " 'variantFrequency': [],\n", " 'url': 'https://www.pharmvar.org/variant/193',\n", " 'variantId': '49',\n", " 'position': 'NC_000010.11:g.94949129A>G'},\n", " {'referenceSequence': 'NG_008385.2',\n", " 'referenceLocation': 'Sequence Start',\n", " 'referenceCollections': ['RefSeqGene'],\n", " 'hgvs': 'NG_008385.2:g.15972A>G',\n", " 'rsId': None,\n", " 'impact': 'I222V',\n", " 'variantFrequency': [],\n", " 'url': 'https://www.pharmvar.org/variant/13609',\n", " 'variantId': '49',\n", " 'position': 'NG_008385.2:g.15972A>G'},\n", " {'referenceSequence': 'NM_000771.4',\n", " 
'referenceLocation': 'Sequence Start',\n", " 'referenceCollections': ['RefSeqTranscript'],\n", " 'hgvs': 'NM_000771.4:c.664A>G',\n", " 'rsId': None,\n", " 'impact': 'I222V',\n", " 'variantFrequency': [],\n", " 'url': 'https://www.pharmvar.org/variant/13766',\n", " 'variantId': '49',\n", " 'position': 'NM_000771.4:c.689A>G'}],\n", " 'alleleType': 'Sub',\n", " 'url': 'https://www.pharmvar.org/haplotype/PV00001',\n", " 'hgvs': 'NG_008385.2:g.15972A>G',\n", " 'variantGroups': []}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pharmvar_data[0]" ] }, { "cell_type": "code", "execution_count": 30, "id": "6723adcf-006d-4daf-b9b4-c59f4916a65b", "metadata": {}, "outputs": [], "source": [ "pharmvar_genes = {d['geneSymbol'] for d in pharmvar_data}" ] }, { "cell_type": "code", "execution_count": 31, "id": "96bd16e8-5261-461d-bf27-f3e843ff2dba", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'G6PD',\n", " 'HLA-A',\n", " 'HLA-B',\n", " 'HLA-C',\n", " 'HLA-DPB1',\n", " 'HLA-DQA1',\n", " 'HLA-DQB1',\n", " 'HLA-DRB1',\n", " 'HLA-DRB3',\n", " 'TPMT',\n", " 'UGT1A1'}" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pharmgkb_genes - pharmvar_genes" ] }, { "cell_type": "code", "execution_count": 32, "id": "694ee1f3-6223-429e-8f35-0454972c4eae", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'CYP2A13', 'DPYD'}" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pharmvar_genes - pharmgkb_genes" ] }, { "cell_type": "code", "execution_count": 33, "id": "4c9f2dd2-aa65-4305-90da-d61ab51a03cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(no_rs_annotations[no_rs_annotations['Gene'].isin(pharmvar_genes - pharmgkb_genes)])" ] }, { "cell_type": "markdown", "id": "3b4a3f58-f725-42c5-9837-54de2d4a8476", "metadata": {}, "source": 
[ "Conclusion from this is that PharmVar probably has less information than PharmGKB; though most of the genes covered by PGKB and not by PV have \"uninformative\" tables, there are at least 3 exceptions (G6PD, TPMT and UGT1A1). In contrast, the PV genes not covered by PGKB (CYP2A13 and DPYD) do not appear in any of the non-rsID PGKB annotations.\n", "\n", "I haven't compared the actual content of the PV vs. PGKB data but I'm assuming it's similar since the PGKB tables are mostly sourced from PV.\n", "\n", "Implementation-wise, PV does have the advantage in that it has an actual API with JSON responses rather than spreadsheets." ] }, { "cell_type": "markdown", "id": "52926fb5-96dc-4a8b-852b-f18e35302d11", "metadata": {}, "source": [ "### Informativeness\n", "\n", "[Top of page](#Table-of-contents)" ] }, { "cell_type": "code", "execution_count": 41, "id": "962886d1-b14e-4afa-9f5b-7f8b70adf7d5", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>...</th><th>141</th><th>142</th><th>143</th><th>144</th><th>145</th><th>146</th><th>147</th><th>148</th><th>149</th><th>150</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>GENE: CYP2D6</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>NG_008376.4 (ATG start)</td><td>14C&gt;T</td><td>19G&gt;A</td><td>31G&gt;A</td><td>64delC</td><td>73C&gt;T</td><td>77G&gt;A</td><td>82C&gt;T</td><td>100C&gt;T</td><td>122C&gt;T</td><td>...</td><td>4165T&gt;G</td><td>4167T&gt;C</td><td>4168G&gt;A</td><td>4169C&gt;G</td><td>4170T&gt;C</td><td>4173C&gt;T</td><td>4181G&gt;C</td><td>4187C&gt;T</td><td>4214G&gt;A</td><td>Structural Variation</td></tr>\n",
"    <tr><th>2</th><td>Effect on protein (NP_000097.3)</td><td>p.A5V</td><td>p.V7M</td><td>p.V11M</td><td>p.L22X</td><td>p.R25W</td><td>p.R26H</td><td>p.R28C</td><td>p.P34S</td><td>p.P41L</td><td>...</td><td>p.F481V</td><td>NaN</td><td>p.A482T</td><td>p.A482G</td><td>NaN</td><td>NaN</td><td>p.S486T</td><td>p.S488F</td><td>p.R497H</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>Position at NC_000022.11 (Homo sapiens chromosome 22, GRCh38.p13)</td><td>g.42130778G&gt;A</td><td>g.42130773C&gt;T</td><td>g.42130761C&gt;T</td><td>g.42130729del</td><td>g.42130719G&gt;A</td><td>g.42130715C&gt;T</td><td>g.42130710G&gt;A</td><td>g.42130692G&gt;A</td><td>g.42130670G&gt;A</td><td>...</td><td>g.42126627A&gt;C</td><td>g.42126625A&gt;G</td><td>g.42126624C&gt;T</td><td>g.42126623G&gt;C</td><td>g.42126622A&gt;G</td><td>g.42126619G&gt;A</td><td>g.42126611C&gt;G</td><td>g.42126605G&gt;A</td><td>g.42126578C&gt;T</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>Position at NG_008376.4 (CYP2D6 RefSeqGene; reverse relative to chromosome)</td><td>g.5033C&gt;T</td><td>g.5038G&gt;A</td><td>g.5050G&gt;A</td><td>g.5083del</td><td>g.5092C&gt;T</td><td>g.5096G&gt;A</td><td>g.5101C&gt;T</td><td>g.5119C&gt;T</td><td>g.5141C&gt;T</td><td>...</td><td>g.9184T&gt;G</td><td>g.9186T&gt;C</td><td>g.9187G&gt;A</td><td>g.9188C&gt;G</td><td>g.9189T&gt;C</td><td>g.9192C&gt;T</td><td>g.9200G&gt;C</td><td>g.9206C&gt;T</td><td>g.9233G&gt;A</td><td>NaN</td></tr>\n",
"    <tr><th>5</th><td>rsID</td><td>rs773790593</td><td>rs72549358</td><td>rs769258</td><td>NaN</td><td>rs267608313</td><td>rs28371696</td><td>rs138100349</td><td>rs1065852</td><td>rs373243894</td><td>...</td><td>NaN</td><td>NaN</td><td>rs74478221</td><td>rs75467367</td><td>rs747998333</td><td>rs28371736</td><td>rs1135840</td><td>rs568495591</td><td>rs1440526469</td><td>NaN</td></tr>\n",
"    <tr><th>6</th><td>CYP2D6 Allele</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>7</th><td>*1</td><td>G</td><td>C</td><td>C</td><td>G</td><td>G</td><td>C</td><td>G</td><td>G</td><td>G</td><td>...</td><td>A</td><td>A</td><td>C</td><td>G</td><td>A</td><td>G</td><td>C</td><td>G</td><td>C</td><td>NaN</td></tr>\n",
"    <tr><th>8</th><td>*2</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>G</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>9</th><td>*3</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>10 rows × 151 columns</p>\n",
"</div>
" ], "text/plain": [ " 0 \\\n", "0 GENE: CYP2D6 \n", "1 NG_008376.4 (ATG start) \n", "2 Effect on protein (NP_000097.3) \n", "3 Position at NC_000022.11 (Homo sapiens chromosome 22, GRCh38.p13) \n", "4 Position at NG_008376.4 (CYP2D6 RefSeqGene; reverse relative to chromosome) \n", "5 rsID \n", "6 CYP2D6 Allele \n", "7 *1 \n", "8 *2 \n", "9 *3 \n", "\n", " 1 2 3 4 5 \\\n", "0 NaN NaN NaN NaN NaN \n", "1 14C>T 19G>A 31G>A 64delC 73C>T \n", "2 p.A5V p.V7M p.V11M p.L22X p.R25W \n", "3 g.42130778G>A g.42130773C>T g.42130761C>T g.42130729del g.42130719G>A \n", "4 g.5033C>T g.5038G>A g.5050G>A g.5083del g.5092C>T \n", "5 rs773790593 rs72549358 rs769258 NaN rs267608313 \n", "6 NaN NaN NaN NaN NaN \n", "7 G C C G G \n", "8 NaN NaN NaN NaN NaN \n", "9 NaN NaN NaN NaN NaN \n", "\n", " 6 7 8 9 ... \\\n", "0 NaN NaN NaN NaN ... \n", "1 77G>A 82C>T 100C>T 122C>T ... \n", "2 p.R26H p.R28C p.P34S p.P41L ... \n", "3 g.42130715C>T g.42130710G>A g.42130692G>A g.42130670G>A ... \n", "4 g.5096G>A g.5101C>T g.5119C>T g.5141C>T ... \n", "5 rs28371696 rs138100349 rs1065852 rs373243894 ... \n", "6 NaN NaN NaN NaN ... \n", "7 C G G G ... \n", "8 NaN NaN NaN NaN ... \n", "9 NaN NaN NaN NaN ... 
\n", "\n", " 141 142 143 144 145 \\\n", "0 NaN NaN NaN NaN NaN \n", "1 4165T>G 4167T>C 4168G>A 4169C>G 4170T>C \n", "2 p.F481V NaN p.A482T p.A482G NaN \n", "3 g.42126627A>C g.42126625A>G g.42126624C>T g.42126623G>C g.42126622A>G \n", "4 g.9184T>G g.9186T>C g.9187G>A g.9188C>G g.9189T>C \n", "5 NaN NaN rs74478221 rs75467367 rs747998333 \n", "6 NaN NaN NaN NaN NaN \n", "7 A A C G A \n", "8 NaN NaN NaN NaN NaN \n", "9 NaN NaN NaN NaN NaN \n", "\n", " 146 147 148 149 \\\n", "0 NaN NaN NaN NaN \n", "1 4173C>T 4181G>C 4187C>T 4214G>A \n", "2 NaN p.S486T p.S488F p.R497H \n", "3 g.42126619G>A g.42126611C>G g.42126605G>A g.42126578C>T \n", "4 g.9192C>T g.9200G>C g.9206C>T g.9233G>A \n", "5 rs28371736 rs1135840 rs568495591 rs1440526469 \n", "6 NaN NaN NaN NaN \n", "7 G C G C \n", "8 NaN G NaN NaN \n", "9 NaN NaN NaN NaN \n", "\n", " 150 \n", "0 NaN \n", "1 Structural Variation \n", "2 NaN \n", "3 NaN \n", "4 NaN \n", "5 NaN \n", "6 NaN \n", "7 NaN \n", "8 NaN \n", "9 NaN \n", "\n", "[10 rows x 151 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# What we get \"out of the box\" - note first 7 rows & first column are headers\n", "allele_def_tables['CYP2D6'].head(10)" ] }, { "cell_type": "code", "execution_count": 42, "id": "bc4a0a33-7b35-416c-8647-b1a192cd52e0", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>0</th><th>1</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>GENE: HLA-A</td><td>NaN</td></tr>\n",
"    <tr><th>1</th><td>NaN</td><td>NaN</td></tr>\n",
"    <tr><th>2</th><td>Effect on protein</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>Position on chromosomal sequence</td><td>NaN</td></tr>\n",
"    <tr><th>4</th><td>Position on gene sequence</td><td>NaN</td></tr>\n",
"    <tr><th>5</th><td>rsID</td><td>NaN</td></tr>\n",
"    <tr><th>6</th><td>HLA-A Allele</td><td>NaN</td></tr>\n",
"    <tr><th>7</th><td>*01:01</td><td>Not Callable</td></tr>\n",
"    <tr><th>8</th><td>*01:02</td><td>Not Callable</td></tr>\n",
"    <tr><th>9</th><td>*01:03</td><td>Not Callable</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " 0 1\n", "0 GENE: HLA-A NaN\n", "1 NaN NaN\n", "2 Effect on protein NaN\n", "3 Position on chromosomal sequence NaN\n", "4 Position on gene sequence NaN\n", "5 rsID NaN\n", "6 HLA-A Allele NaN\n", "7 *01:01 Not Callable\n", "8 *01:02 Not Callable\n", "9 *01:03 Not Callable" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "allele_def_tables['HLA-A'].head(10)" ] }, { "cell_type": "code", "execution_count": 53, "id": "8f641e74-57c8-447e-a1f9-d7a062ed14d6", "metadata": {}, "outputs": [], "source": [ "# If there are more than 2 columns we'll assume the table is informative\n", "allele_def_metrics = []\n", "informative_tables = []\n", "for gene, table in allele_def_tables.items():\n", " allele_def_metrics.append({\n", " 'gene': gene,\n", " 'num_alleles': table.shape[0]-7,\n", " 'num_variants': table.shape[1]-1\n", " })\n", " if table.shape[1] > 2:\n", " informative_tables.append(gene)" ] }, { "cell_type": "code", "execution_count": 54, "id": "b09c10bc-4205-4329-a47e-3c9d6bc26fe8", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'gene': 'HLA-B', 'num_alleles': 1793, 'num_variants': 1},\n", " {'gene': 'CYP2D6', 'num_alleles': 163, 'num_variants': 150},\n", " {'gene': 'CYP2C19', 'num_alleles': 36, 'num_variants': 35},\n", " {'gene': 'CYP3A5', 'num_alleles': 6, 'num_variants': 5},\n", " {'gene': 'CYP2C9', 'num_alleles': 85, 'num_variants': 80},\n", " {'gene': 'UGT1A1', 'num_alleles': 9, 'num_variants': 4},\n", " {'gene': 'CYP2B6', 'num_alleles': 48, 'num_variants': 48},\n", " {'gene': 'NUDT15', 'num_alleles': 20, 'num_variants': 17},\n", " {'gene': 'CYP2C8', 'num_alleles': 18, 'num_variants': 17},\n", " {'gene': 'CYP3A4', 'num_alleles': 45, 'num_variants': 42},\n", " {'gene': 'HLA-A', 'num_alleles': 1332, 'num_variants': 1},\n", " {'gene': 'G6PD', 'num_alleles': 187, 'num_variants': 173},\n", " {'gene': 'SLCO1B1', 'num_alleles': 44, 'num_variants': 32},\n", " {'gene': 'TPMT', 'num_alleles': 46, 
'num_variants': 43},\n", " {'gene': 'HLA-C', 'num_alleles': 955, 'num_variants': 1},\n", " {'gene': 'HLA-DRB1', 'num_alleles': 763, 'num_variants': 1},\n", " {'gene': 'HLA-DQB1', 'num_alleles': 106, 'num_variants': 1},\n", " {'gene': 'HLA-DPB1', 'num_alleles': 127, 'num_variants': 1},\n", " {'gene': 'CYP2A6', 'num_alleles': 51, 'num_variants': 64},\n", " {'gene': 'HLA-DRB3', 'num_alleles': 45, 'num_variants': 1},\n", " {'gene': 'HLA-DQA1', 'num_alleles': 24, 'num_variants': 1},\n", " {'gene': 'CYP4F2', 'num_alleles': 16, 'num_variants': 14}]" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "allele_def_metrics" ] }, { "cell_type": "code", "execution_count": 55, "id": "114ad6bf-d6a1-4974-aeac-9af6e0486ee7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.6363636363636364" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(informative_tables) / len(allele_def_metrics)" ] }, { "cell_type": "code", "execution_count": 56, "id": "4fa92e0d-49b3-46c6-a007-2479e6444909", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "381" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Count of non-rsID clinical annotation records involving genes with informative allele definition tables\n", "len(no_rs_annotations[no_rs_annotations['Gene'].isin(informative_tables)])" ] }, { "cell_type": "markdown", "id": "3434cf75-70f3-4d1d-b3c2-5fe378cbee71", "metadata": {}, "source": [ "## Summary\n", "\n", "* Records without an rsID make up 596 / 5101 = 12% of the clinical annotations\n", "* The affected gene is easy to get for all named alleles - we can rely on the \"Gene\" column in the PGKB data\n", "* Annotations tend to be per-allele rather than per-genotype\n", "* Most records have an allele definition table from PGKB that we can download\n", " * 53 / 596 = 8.9% do not\n", " * Not all of these tables contain specific variants - see 
[CYP2D6](https://docs.google.com/spreadsheets/d/1tIovgq2w7FXv6g2ASiQd5EtYjxEbSzYrCRpg9dIlrEY/edit?usp=sharing) vs. [HLA-A](https://docs.google.com/spreadsheets/d/1Wz0F74sdY-hG0LJP1Rg5ePsZddVQ1pZrHDySOYgrOhI/edit?usp=sharing)\n", " * 381 / 596 = 64% of records have an informative and discoverable allele definition table from PGKB\n", "* PGKB allele definition tables mostly (but not entirely) come from PharmVar\n", "\n", "### Questions\n", "\n", "* What identifier should we use for these records?\n", " * PGKB uses the list of annotated haplotypes as the \"variant\" identifier in its annotations table\n", " * Note that our PGx schema uses genotype IDs, not variant IDs\n", "* Do we want to resolve named alleles to variants and, if so, how should we convey this information?\n", "* Are we interested in functional consequences, or is the affected gene enough?\n", " * We could try [Haplosaurus](https://www.ensembl.org/info/docs/tools/vep/haplo/index.html), using the allele definition table and the annotated allele to create the VCF input file\n", "\n", "[Top of page](#Table-of-contents)" ] }, { "cell_type": "code", "execution_count": null, "id": "43412dbd-cec2-4801-9de6-c2c59c9b908b", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 5 }