## Access ClinVar Data from MyVariant.info Services

[ClinVar](http://www.clinvar.com/) is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. It is commonly used in genomics research and is one of the key data sources included in [MyVariant.info](http://myvariant.info).

[myvariant.py](https://pypi.python.org/pypi/myvariant) is an easy-to-use Python wrapper for the MyVariant.info services. With myvariant.py, you can access ClinVar data in only a few lines of code.

In this demo, we walk through a few use cases that show how to query ClinVar data with myvariant.py.

### Install myvariant.py
Installing myvariant.py is easy with pip:

    pip install myvariant

You can find more usage examples on the [myvariant.py PyPI page](https://pypi.python.org/pypi/myvariant). The detailed API documentation is available at http://myvariant-py.readthedocs.org.

Now you just need to import it and instantiate the **MyVariantInfo** class:

In [1]:
import myvariant
mv = myvariant.MyVariantInfo()

### Retrieve ClinVar data of a variant or variants
myvariant.py lets you retrieve the annotation of one or more variants by calling the **getvariant** or **getvariants** method. You can customize the output by passing one or more field names through the ***fields*** parameter. For the field names available in MyVariant.info, see [the detailed documentation](http://docs.myvariant.info/en/latest/doc/data.html#available-fields).

* To get the available ClinVar annotation of a variant, pass an *HGVS id* to the **getvariant** method and set the ***fields*** parameter to *'clinvar'*. Without the ***fields*** parameter, you get back all annotations available for that variant, including ClinVar.

In [2]:
mv.getvariant('chr17:g.7578532A>G', fields='clinvar')

{u'_id': u'chr17:g.7578532A>G',
 u'_version': 1,
 u'clinvar': {u'allele_id': 27396,
 u'alt': u'G',
 u'chrom': u'17',
 u'cytogenic': u'17p13.1',
 u'gene': {u'id': u'7157', u'symbol': u'TP53'},
 u'hg19': {u'end': 7578532, u'start': 7578532},
 u'hg38': {u'end': 7675214, u'start': 7675214},
 u'hgvs': {u'coding': [u'LRG_321t8:c.281T>C',
 u'LRG_321t5:c.2T>C',
 u'LRG_321t6:c.2T>C',
 u'LRG_321t7:c.2T>C',
 u'LRG_321t1:c.398T>C',
 u'LRG_321t2:c.398T>C',
 u'LRG_321t3:c.398T>C',
 u'LRG_321t4:c.398T>C',
 u'NM_001276697.1:c.-80T>C',
 u'NM_001126118.1:c.281T>C',
 u'NM_001126115.1:c.2T>C',
 u'NM_001126116.1:c.2T>C',
 u'NM_001126117.1:c.2T>C',
 u'NM_000546.5:c.398T>C',
 u'NM_001126112.2:c.398T>C',
 u'NM_001126113.2:c.398T>C',
 u'NM_001126114.2:c.398T>C'],
 u'genomic': [u'LRG_321:g.17337T>C',
 u'NG_017013.2:g.17337T>C',
 u'NC_000017.11:g.7675214A>G',
 u'NC_000017.10:g.7578532A>G']},
 u'omim': u'191170.0011',
 u'rcv': {u'accession': u'RCV000013151',
 u'clinical_significance': u'Pathogenic',
 u'conditions

* To retrieve only the *ClinVar variant id* of a variant, pass the HGVS id and the specific field name to the **getvariant** method.

In [3]:
mv.getvariant('chr17:g.7578532A>G', fields='clinvar.variant_id')

{u'_id': u'chr17:g.7578532A>G',
 u'_version': 1,
 u'clinvar': {u'variant_id': 12357}}
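
The returned object is a plain nested dictionary, so a dotted field name such as `clinvar.variant_id` maps onto a chain of dictionary keys. The sketch below shows one way to read a value back out by its dotted name. `get_field` is a hypothetical helper (it is not part of myvariant.py), and the sample is copied from the output above with the Python 2 `u` prefixes dropped, so no network access is needed.

```python
def get_field(doc, dotted_path, default=None):
    """Follow a dotted field name (e.g. 'clinvar.variant_id') through nested dicts.

    Hypothetical helper for illustration; returns `default` if any key is missing.
    """
    current = doc
    for key in dotted_path.split('.'):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Sample copied from the getvariant output shown above.
sample = {
    '_id': 'chr17:g.7578532A>G',
    '_version': 1,
    'clinvar': {'variant_id': 12357},
}

variant_id = get_field(sample, 'clinvar.variant_id')
# This field was not requested, so the lookup falls back to the default.
missing = get_field(sample, 'clinvar.rcv.accession', default='n/a')
print(variant_id, missing)
```

Returning a default instead of raising `KeyError` is a design choice that suits this data, because a given variant may simply have no annotation for the field you asked for.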

* You can also retrieve several fields for the same variant by passing a list of field names to the **getvariant** method.

In [4]:
out = mv.getvariant('chr17:g.7578532A>G', fields=['clinvar.variant_id', 'clinvar.rcv.accession'])

 In this case, the output contains the ClinVar *variant id* and *RCV accession number* of the variant.

In [5]:
out

{u'_id': u'chr17:g.7578532A>G',
 u'_version': 1,
 u'clinvar': {u'rcv': {u'accession': u'RCV000013151'}, u'variant_id': 12357}}

* One major advantage of using MyVariant.info is that you can query fields from different data sources in a single call.

In [6]:
out = mv.getvariant('chr6:g.26093141G>A', fields=['clinvar.rcv.accession', 'exac.af', 'dbnsfp.sift.converted_rankscore'])

 Here, the output contains the *RCV accession numbers* from ClinVar, the *allele frequency* from ExAC, and the *SIFT converted_rankscore* from dbNSFP for your target variant.

In [7]:
out

{u'_id': u'chr6:g.26093141G>A',
 u'_version': 1,
 u'clinvar': {u'rcv': [{u'accession': u'RCV000000019'},
 {u'accession': u'RCV000000020'},
 {u'accession': u'RCV000000021'},
 {u'accession': u'RCV000000022'},
 {u'accession': u'RCV000000023'},
 {u'accession': u'RCV000000024'},
 {u'accession': u'RCV000000025'},
 {u'accession': u'RCV000117222'},
 {u'accession': u'RCV000178096'}]},
 u'dbnsfp': {u'sift': {u'converted_rankscore': 0.91219}},
 u'exac': {u'af': 0.032}}
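
Note that `clinvar.rcv` came back as a single dictionary for chr17:g.7578532A>G but as a list of dictionaries for chr6:g.26093141G>A. Code that consumes these results may want to handle both shapes. The sketch below is one way to do that. `rcv_accessions` is a hypothetical helper, not part of myvariant.py, and the samples are shortened copies of the outputs above.

```python
def rcv_accessions(doc):
    """Return every RCV accession in a variant document as a list.

    Hypothetical helper: accepts 'rcv' as either one dict or a list of dicts.
    """
    rcv = doc.get('clinvar', {}).get('rcv', [])
    if isinstance(rcv, dict):
        rcv = [rcv]  # wrap the single-record case so both shapes look alike
    return [entry['accession'] for entry in rcv if 'accession' in entry]


# Shortened copies of the two outputs shown earlier.
single = {
    '_id': 'chr17:g.7578532A>G',
    'clinvar': {'rcv': {'accession': 'RCV000013151'}, 'variant_id': 12357},
}
multiple = {
    '_id': 'chr6:g.26093141G>A',
    'clinvar': {'rcv': [{'accession': 'RCV000000019'},
                        {'accession': 'RCV000000020'}]},
}

single_accessions = rcv_accessions(single)
multiple_accessions = rcv_accessions(multiple)
print(single_accessions)
print(multiple_accessions)
```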

* To retrieve annotation objects for a list of HGVS ids, you can call the **getvariants** method.

In [8]:
out = mv.getvariants(['chr6:g.26093141G>A', 'chr11:g.118896012A>G'], fields='clinvar.gene.symbol')

querying 1-2...done.


 Here, your output is the *gene symbol* of both variants.

In [9]:
out

[{u'_id': u'chr6:g.26093141G>A',
 u'_score': 1.0,
 u'clinvar': {u'gene': {u'symbol': u'HFE'}},
 u'query': u'chr6:g.26093141G>A'},
 {u'_id': u'chr11:g.118896012A>G',
 u'_score': 1.0,
 u'clinvar': {u'gene': {u'symbol': u'SLC37A4'}},
 u'query': u'chr11:g.118896012A>G'}]
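
Because each item in the returned list carries the id it was queried with under the `query` key, the list can be turned into a lookup table. The following is a minimal sketch that works on a copy of the output above (with the `u` prefixes dropped) rather than on a live call. The variable names are illustrative only.

```python
# Copy of the getvariants output shown above.
results = [
    {'_id': 'chr6:g.26093141G>A',
     '_score': 1.0,
     'clinvar': {'gene': {'symbol': 'HFE'}},
     'query': 'chr6:g.26093141G>A'},
    {'_id': 'chr11:g.118896012A>G',
     '_score': 1.0,
     'clinvar': {'gene': {'symbol': 'SLC37A4'}},
     'query': 'chr11:g.118896012A>G'},
]

# Map each queried id to its ClinVar gene symbol (None if the field is absent).
symbol_by_query = {
    item['query']: item.get('clinvar', {}).get('gene', {}).get('symbol')
    for item in results
}

for hgvs_id, symbol in symbol_by_query.items():
    print(hgvs_id, symbol)
```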

### Make queries based on a specific ClinVar field or fields
myvariant.py also lets you search MyVariant.info by the value of one or more specific fields, using the **query** method.

* You can make a query based on a single RCV accession number.

In [10]:
mv.query('clinvar.rcv.accession:RCV000013151')

{u'hits': [{u'_id': u'chr17:g.7578532A>G',
 u'_score': 16.53367,
 u'cadd': {u'_license': u'http://goo.gl/bkpNhq',
 u'alt': u'G',
 u'anc': u'A',
 u'annotype': u'CodingTranscript',
 u'bstatistic': 473,
 u'chmm': {u'bivflnk': 0.0,
 u'enh': 0.008,
 u'enhbiv': 0.0,
 u'het': 0.0,
 u'quies': 0.008,
 u'reprpc': 0.0,
 u'reprpcwk': 0.0,
 u'tssa': 0.0,
 u'tssaflnk': 0.0,
 u'tssbiv': 0.0,
 u'tx': 0.819,
 u'txflnk': 0.0,
 u'txwk': 0.11,
 u'znfrpts': 0.055},
 u'chrom': 17,
 u'consdetail': u'missense',
 u'consequence': u'NON_SYNONYMOUS',
 u'consscore': 7,
 u'cpg': 0.07,
 u'dna': {u'helt': 0.01, u'mgw': 0.24, u'prot': 4.04, u'roll': 1.29},
 u'encode': {u'exp': 1100.13,
 u'h3k27ac': 6.2,
 u'h3k4me1': 16.04,
 u'h3k4me3': 3.08,
 u'nucleo': 1.8},
 u'exon': u'5/11',
 u'fitcons': 0.726386,
 u'gc': 0.58,
 u'gene': {u'ccds_id': u'CCDS11118.1',
 u'cds': {u'cdna_pos': 588,
 u'cds_pos': 398,
 u'rel_cdna_pos': 0.23,
 u'rel_cds_pos': 0.34},
 u'feature_id': u'ENST00000269305',
 u'gene_id': u'ENSG00000141510',
 u'ge

* Now, let's assume you are working on cancer genomics and BRCA1 is your target gene. To get all annotation information for the BRCA1 variants recorded in ClinVar, pass a *clinvar.gene.symbol* query term to the **query** method.

In [11]:
out = mv.query('clinvar.gene.symbol:BRCA1')

 By default, the output lists only the top 10 hits out of the 3532 unique BRCA1-related variants recorded in ClinVar.

In [12]:
out

{u'hits': [{u'_id': u'NM_007294.3:c.4358-?_5277+?del',
 u'_score': 12.69064,
 u'clinvar': {u'allele_id': 94606,
 u'chrom': u'17',
 u'coding_hgvs_only': True,
 u'cytogenic': u'17q21.31',
 u'gene': {u'id': u'672', u'symbol': u'BRCA1'},
 u'hgvs': {u'coding': [u'LRG_292t1:c.4358-?_5277+?del',
 u'NM_007294.3:c.4358-?_5277+?del'],
 u'genomic': [u'LRG_292:g.(?_141370_160932_?)del',
 u'NC_000017.11:g.(?_43057052)_(43076614_?)del',
 u'NC_000017.10:g.(?_41209069)_(41228631_?)del']},
 u'rcv': [{u'accession': u'RCV000119187',
 u'clinical_significance': u'Pathogenic',
 u'conditions': {u'identifiers': {u'medgen': u'C0677776'},
 u'name': u'BRCA1 and BRCA2 Hereditary Breast and Ovarian Cancer (HBOC)',
 u'synonyms': [u'q', u'Hereditary breast and ovarian cancer syndrome']},
 u'last_evaluated': u'2014-03-27',
 u'number_submitters': 1,
 u'origin': u'germline',
 u'preferred_name': u'NM_007294.3(BRCA1):c.4358-?_5277+?del',
 u'review_status': u'no assertion criteria provided'},
 {u'accession': u'RCV00007459
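
A full hit is long, so it often helps to reduce each one to the few values you care about. The sketch below summarizes the hits of a query response as (variant id, clinical significances) pairs. `summarize_hits` is a hypothetical helper, and the sample is a trimmed copy of the first hit above that keeps only its first RCV record.

```python
def summarize_hits(response):
    """Reduce a query response to (variant id, sorted clinical significances) pairs.

    Hypothetical helper: tolerates 'rcv' being one dict or a list of dicts.
    """
    rows = []
    for hit in response.get('hits', []):
        rcv = hit.get('clinvar', {}).get('rcv', [])
        if isinstance(rcv, dict):
            rcv = [rcv]
        significances = sorted(
            {r['clinical_significance'] for r in rcv if 'clinical_significance' in r}
        )
        rows.append((hit['_id'], significances))
    return rows


# Trimmed copy of the first hit shown above (first RCV record only).
response = {
    'hits': [
        {'_id': 'NM_007294.3:c.4358-?_5277+?del',
         'clinvar': {'gene': {'id': '672', 'symbol': 'BRCA1'},
                     'rcv': [{'accession': 'RCV000119187',
                              'clinical_significance': 'Pathogenic'}]}},
    ]
}

rows = summarize_hits(response)
for variant_id, significances in rows:
    print(variant_id, ', '.join(significances))
```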

* You can also change the number of returned hits from the default of 10 by passing the ***size*** parameter.

In [13]:
mv.query('clinvar.gene.symbol:BRCA1', size=5)

{u'hits': [{u'_id': u'NM_007294.3:c.4358-?_5277+?del',
 u'_score': 12.69064,
 u'clinvar': {u'allele_id': 94606,
 u'chrom': u'17',
 u'coding_hgvs_only': True,
 u'cytogenic': u'17q21.31',
 u'gene': {u'id': u'672', u'symbol': u'BRCA1'},
 u'hgvs': {u'coding': [u'LRG_292t1:c.4358-?_5277+?del',
 u'NM_007294.3:c.4358-?_5277+?del'],
 u'genomic': [u'LRG_292:g.(?_141370_160932_?)del',
 u'NC_000017.11:g.(?_43057052)_(43076614_?)del',
 u'NC_000017.10:g.(?_41209069)_(41228631_?)del']},
 u'rcv': [{u'accession': u'RCV000119187',
 u'clinical_significance': u'Pathogenic',
 u'conditions': {u'identifiers': {u'medgen': u'C0677776'},
 u'name': u'BRCA1 and BRCA2 Hereditary Breast and Ovarian Cancer (HBOC)',
 u'synonyms': [u'q', u'Hereditary breast and ovarian cancer syndrome']},
 u'last_evaluated': u'2014-03-27',
 u'number_submitters': 1,
 u'origin': u'germline',
 u'preferred_name': u'NM_007294.3(BRCA1):c.4358-?_5277+?del',
 u'review_status': u'no assertion criteria provided'},
 {u'accession': u'RCV00007459

* Alternatively, you can pass ***fetch_all=True*** to get back a [generator](https://docs.python.org/3.5/library/stdtypes.html#generator-types) that retrieves all the query results.

In [14]:
mv.query('clinvar.gene.symbol:BRCA1', fetch_all=True)
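
A generator yields its items one at a time and can only be consumed once, which matters when the result set holds thousands of variants. The sketch below illustrates that behaviour with a stand-in generator so it runs without network access. `fake_hits` and `first_n` are hypothetical names, and the stand-in only imitates the shape of the real results.

```python
from itertools import islice


def fake_hits(total):
    """Hypothetical stand-in for the generator returned with fetch_all=True."""
    for i in range(total):
        yield {'_id': 'variant-%d' % i}


def first_n(hits, n):
    """Take at most n items from a generator without consuming the rest."""
    return list(islice(hits, n))


hits = fake_hits(3532)           # same size as the BRCA1 result set above
preview = first_n(hits, 3)       # only three items are pulled from the generator
remaining = sum(1 for _ in hits) # the generator resumes where it left off

print([hit['_id'] for hit in preview])
print(remaining)
```

If you need to go over the results more than once, collect them into a list first, keeping in mind that this holds every hit in memory.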



* If you want to keep only the BRCA1-related variants classified as pathogenic in ClinVar, combine the *gene symbol* and *clinical significance* terms in the query string passed to the **query** method.

In [15]:
out = mv.query('clinvar.gene.symbol:BRCA1 AND clinvar.rcv.clinical_significance:pathogenic')

 Now only 1126 variants match both criteria.

In [16]:
out

{u'hits': [{u'_id': u'chr17:g.41197590_41197593del',
 u'_score': 15.629933,
 u'clinvar': {u'allele_id': 102794,
 u'alt': u'-',
 u'chrom': u'17',
 u'cytogenic': u'17q21.31',
 u'gene': {u'id': u'672', u'symbol': u'BRCA1'},
 u'hg19': {u'end': 41197593, u'start': 41197590},
 u'hg38': {u'end': 43045576, u'start': 43045573},
 u'hgvs': {u'coding': [u'LRG_292t1:c.*102_*105delCTGT',
 u'NM_007294.3:c.*102_*105delCTGT'],
 u'genomic': [u'LRG_292:g.172408_172411delCTGT',
 u'NG_005905.2:g.172408_172411delCTGT',
 u'NC_000017.11:g.43045573_43045576delACAG',
 u'NC_000017.10:g.41197590_41197593delACAG']},
 u'rcv': {u'accession': u'RCV000083012',
 u'clinical_significance': u'Pathogenic',
 u'conditions': {u'age_of_onset': u'All ages',
 u'identifiers': {u'medgen': u'C2676676',
 u'omim': u'604370',
 u'orphanet': u'145'},
 u'name': u'Breast-ovarian cancer, familial 1 (BROVCA1)',
 u'synonyms': [u'BREAST-OVARIAN CANCER, FAMILIAL, SUSCEPTIBILITY TO, 1',
 u'OVARIAN CANCER, SUSCEPTIBILITY TO',
 u'BREAST CANCER, F
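
When the criteria come from configuration or user input, it can be convenient to assemble the query string programmatically. This is a minimal sketch that joins field:value pairs with AND in the same form as the query above. `build_query` is a hypothetical helper, and it does not quote or escape values, so values containing spaces or special characters would likely need extra handling.

```python
def build_query(criteria):
    """Join (field, value) pairs into a 'field:value AND field:value' string.

    Hypothetical helper; values are inserted as-is, without quoting or escaping.
    """
    return ' AND '.join('%s:%s' % (field, value) for field, value in criteria)


criteria = [
    ('clinvar.gene.symbol', 'BRCA1'),
    ('clinvar.rcv.clinical_significance', 'pathogenic'),
]

query_string = build_query(criteria)
print(query_string)
```

The resulting string could then be passed to the **query** method in place of the hand-written one.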

* You can also query for multiple terms at once. For example, to get annotation data for several RCV accession numbers, call the **querymany** method.

 Here is a list of *RCV accession numbers* you might want to query.

In [17]:
xli = ['RCV000059118',
 'RCV000059119',
 'RCV000057234',
 'RCV000037456',
 'RCV000000019',
 'RCV000000134']

In [18]:
out = mv.querymany(xli, scopes='clinvar.rcv.accession')

querying 1-6...done.
Finished.


 The output contains all annotation records in MyVariant.info that match the *RCV accession numbers* in xli.

In [19]:
out

[{u'_id': u'chr11:g.118895925C>T',
 u'_score': 10.333494,
 u'cadd': {u'_license': u'http://goo.gl/bkpNhq',
 u'alt': u'T',
 u'anc': u'C',
 u'annotype': [u'CodingTranscript', u'Intergenic'],
 u'bstatistic': 476,
 u'chmm': {u'bivflnk': 0.0,
 u'enh': 0.197,
 u'enhbiv': 0.0,
 u'het': 0.0,
 u'quies': 0.0,
 u'reprpc': 0.0,
 u'reprpcwk': 0.0,
 u'tssa': 0.0,
 u'tssaflnk': 0.0,
 u'tssbiv': 0.0,
 u'tx': 0.622,
 u'txflnk': 0.055,
 u'txwk': 0.11,
 u'znfrpts': 0.0},
 u'chrom': 11,
 u'consdetail': [u'missense', u'downstream'],
 u'consequence': [u'NON_SYNONYMOUS', u'DOWNSTREAM'],
 u'consscore': [7, 1],
 u'cpg': 0.04,
 u'dna': {u'helt': 2.03, u'mgw': 0.26, u'prot': -2.64, u'roll': -0.64},
 u'encode': {u'exp': 136.25,
 u'h3k27ac': 13.84,
 u'h3k4me1': 33.84,
 u'h3k4me3': 6.4,
 u'nucleo': 2.7,
 u'occ': 2,
 u'p_val': {u'comb': 0.95,
 u'ctcf': 0.0,
 u'dnas': 1.62,
 u'faire': 0.0,
 u'mycp': 0.0,
 u'polii': 0.0},
 u'sig': {u'ctcf': 0.03,
 u'dnase': 0.04,
 u'faire': 0.01,
 u'myc': 0.0,
 u'polii': 0.0}},
 u'exo

### Can I query a very large list of ids?

Yes, you can. If you pass an id list (e.g., xli above) with more than 1000 ids, we run the queries in batches of 1000 ids at a time and then concatenate the results for you. From the user's point of view, it works exactly like passing a shorter list, and you don't need to worry about saturating our backend servers.
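
To make the batching idea concrete, here is a rough sketch of splitting a long id list into chunks of 1000 and concatenating the per-batch results. It illustrates the general pattern only and is not the library's actual implementation. `chunked`, `query_in_batches`, and the `run_batch` callback are hypothetical names, and the ids are made up for the example.

```python
def chunked(ids, size=1000):
    """Yield successive slices of at most `size` ids."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def query_in_batches(ids, run_batch, size=1000):
    """Call run_batch on each chunk and concatenate the results in order.

    run_batch is a hypothetical callback that takes a list of ids and
    returns a list of result records.
    """
    results = []
    for batch in chunked(ids, size):
        results.extend(run_batch(batch))
    return results


# Made-up ids and a stand-in callback, so the sketch runs offline.
fake_ids = ['RCV%09d' % i for i in range(2500)]
batch_sizes = [len(batch) for batch in chunked(fake_ids)]
echoed = query_in_batches(fake_ids, lambda batch: [{'query': x} for x in batch])

print(batch_sizes)
print(len(echoed))
```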

### To read more

* [MyVariant.info](http://myvariant.info)
* [MyVariant.info Documentation](http://docs.myvariant.info/en/latest/index.html)
 * [available fields](http://docs.myvariant.info/en/latest/doc/data.html#available-fields)
 * [data sources](http://docs.myvariant.info/en/latest/doc/data.html#data-sources)
* [MyVariant.py's Documentation](http://myvariant-py.readthedocs.org/en/latest/)
* [MyVariant.py's PyPI page](https://pypi.python.org/pypi/myvariant)