README, COSMIC

COSMIC Download Files
====================================
Version 87, 13th November 2018

---------------------------
Classification Information
---------------------------
A comma separated table of COSMIC cancer classification information.
[http://cancer.sanger.ac.uk/cancergenome/assets/classification.csv ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Cosmic_Phenotype_id    Unique COSMIC identifier for the classification.
[2:B] Site_Primary    Primary tissue specified in the publication.
[3:C] Site_Subtype1    Sub tissue specified in the publication.
[4:D] Site_Subtype2    Sub tissue specified in the publication.
[5:E] Site_Subtype3    Sub tissue specified in the publication.
[6:F] Histology    Primary histology specified in the publication.
[7:G] Hist_Subtype1    Sub histology specified in the publication.
[8:H] Hist_Subtype2    Sub histology specified in the publication.
[9:I] Hist_Subtype3    Sub histology specified in the publication.
[10:J] Site_Primary_COSMIC    Primary tissue specified in COSMIC.
[11:K] Site_Subtype1_COSMIC    Sub tissue specified in COSMIC.
[12:L] Site_Subtype2_COSMIC    Sub tissue specified in COSMIC.
[13:M] Site_Subtype3_COSMIC    Sub tissue specified in COSMIC.
[14:N] Histology_COSMIC    Primary histology specified in COSMIC.
[15:O] Hist_Subtype1_COSMIC    Sub histology specified in COSMIC.
[16:P] Hist_Subtype2_COSMIC    Sub histology specified in COSMIC.
[17:Q] Hist_Subtype3_COSMIC    Sub histology specified in COSMIC.
[18:R] NCI code    NCI thesaurus code for tumour histological classification.
    For details see https://ncit.nci.nih.gov
[19:S] EFO code    Experimental Factor Ontology (EFO), for details see http://www.ebi.ac.uk/efo/

-------------------------------------------------
COSMIC Complete Mutation Data (Targeted Screens)
-------------------------------------------------
A tab separated table of the complete curated COSMIC dataset (targeted screens) from the current release. It includes all coding point mutations, and the negative data set.
[ /cosmic/grch37/cosmic/v87/CosmicCompleteTargetedScreensMutantExport.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Gene name    The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.
[2:B] Accession Number    The transcript identifier of the gene.
[3:C] Gene CDS length    Length of the gene's coding sequence (CDS), in base pairs.
[4:D] HGNC id    Unique HGNC identifier, if the gene is in HGNC.
[5:E] Sample name,Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[8:H] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[9:I] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[10:J] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[11:K] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[12:L] Primary Histology    The histological classification of the sample.
[13:M] Histology Subtype 1    Further histological classification (level 1) of the sample.
[14:N] Histology Subtype 2    Further histological classification (level 2) of the sample.
[15:O] Histology Subtype 3    Further histological classification (level 3) of the sample.
[16:P] Genome-wide screen    Indicates whether the entire genome/exome was sequenced.
[17:Q] Mutation Id    Unique mutation identifier.
[18:R] Mutation CDS    The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.
[19:S] Mutation AA    The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to the Mutation Overview page.
[20:T] Mutation Description    Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown etc.)
[21:U] Mutation zygosity    Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.
[22:V] LOH    Information on whether the gene was reported to have loss of heterozygosity (LOH) in the sample: yes, no or unknown.
[23:W] GRCh    The coordinate system used:
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[24:X] Mutation genome position    The genomic coordinates of the mutation.
[25:Y] Mutation strand    Positive or negative.
[26:Z] SNP    All known SNPs are flagged as 'y', as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.
[27:AA] Resistance Mutation    The mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).
[28:AB] FATHMM prediction    More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from http://fathmm.biocompute.org.uk. FATHMM descriptors:
    * Pathogenic = Defined as Cancer or Damaging.
    * Neutral = Defined as Passenger or Tolerated.
[29:AC] FATHMM Score    The scores are in the form of p-values ranging from 0 to 1. P-values greater than 0.5 are pathogenic while those less than 0.5 are benign. P-values close to 0 or 1 are the high-confidence results, which are more accurate. The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.
[30:AD] Mutation somatic status    Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin:
    * Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    * Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    * Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
[31:AE] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in, linking to PubMed to provide more details of the publication.
[32:AF] Id Study    Lists the unique Ids of studies that have involved this sample.
[33:AG] Sample Type,Tumour origin    Describes where the sample has originated from, including the tumour type.
[35:AI] Age    Age of the individual (if this information is provided with the publications).

--------------------------------------
COSMIC Mutation Data (Genome Screens)
--------------------------------------
A tab separated table of coding point mutations from genome wide screens (including whole exome sequencing).
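The tab separated exports described in this README can be read with any TSV parser. The sketch below is one possible way to do it in Python, not an official COSMIC tool. The helper names (open_export, read_export) are hypothetical, and it assumes that the first row of the file holds the column headings and that the file is UTF-8 encoded; check both against the downloaded file. The in-memory sample row is made-up data used only to show the mechanics.

```python
import csv
import gzip
import io


def open_export(path):
    """Open a gzip-compressed export as text (assumes UTF-8 encoding)."""
    return gzip.open(path, mode="rt", encoding="utf-8", newline="")


def read_export(stream):
    """Yield one dict per data row, keyed by the headings in the first row."""
    reader = csv.DictReader(stream, delimiter="\t")
    for row in reader:
        yield row


# Made-up two-column sample standing in for a real export file.
sample = io.StringIO("Gene name\tMutation CDS\nGENE1\tc.1A>T\n")
rows = list(read_export(sample))
print(rows[0]["Gene name"])  # prints GENE1
```

For a downloaded file, pass the result of open_export(<local path to the .tsv.gz file>) to read_export instead of the in-memory sample. Reading row by row keeps memory use low, which matters because the exports can be large.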
[ /cosmic/grch37/cosmic/v87/CosmicGenomeScreensMutantExport.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Gene name    The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.
[2:B] Accession Number    The transcript identifier of the gene.
[3:C] Gene CDS length    Length of the gene's coding sequence (CDS), in base pairs.
[4:D] HGNC id    Unique HGNC identifier, if the gene is in HGNC.
[5:E] Sample name,Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[8:H] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[9:I] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[10:J] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[11:K] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[12:L] Primary Histology    The histological classification of the sample.
[13:M] Histology Subtype 1    Further histological classification (level 1) of the sample.
[14:N] Histology Subtype 2    Further histological classification (level 2) of the sample.
[15:O] Histology Subtype 3    Further histological classification (level 3) of the sample.
[16:P] Genome-wide screen    Indicates whether the entire genome/exome was sequenced.
[17:Q] Mutation Id    Unique mutation identifier.
[18:R] Mutation CDS    The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.
[19:S] Mutation AA    The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to the Mutation Overview page.
[20:T] Mutation Description    Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown etc.)
[21:U] Mutation zygosity    Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.
[22:V] LOH    Information on whether the gene was reported to have loss of heterozygosity (LOH) in the sample: yes, no or unknown.
[23:W] GRCh    The coordinate system used:
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[24:X] Mutation genome position    The genomic coordinates of the mutation.
[25:Y] Mutation strand    Positive or negative.
[26:Z] SNP    All known SNPs are flagged as 'y', as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.
[27:AA] FATHMM prediction    More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from http://fathmm.biocompute.org.uk. FATHMM descriptors:
    * Pathogenic = Defined as Cancer or Damaging.
    * Neutral = Defined as Passenger or Tolerated.
[28:AB] FATHMM Score    The scores are in the form of p-values ranging from 0 to 1. P-values greater than 0.5 are pathogenic while those less than 0.5 are benign. P-values close to 0 or 1 are the high-confidence results, which are more accurate.
    The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.
[29:AC] Mutation somatic status    Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin:
    * Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    * Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    * Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
[30:AD] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in, linking to PubMed to provide more details of the publication.
[31:AE] Id Study    Lists the unique Ids of studies that have involved this sample.
[32:AF] Sample Type,Tumour origin    Describes where the sample has originated from, including the tumour type.
[34:AH] Age    Age of the individual (if this information is provided with the publications).

---------------------
COSMIC Mutation Data
---------------------
A tab separated table of all COSMIC coding point mutations from targeted and genome wide screens from the current release.
[ /cosmic/grch37/cosmic/v87/CosmicMutantExport.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Gene name    The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.
[2:B] Accession Number    The transcript identifier of the gene.
[3:C] Gene CDS length    Length of the gene's coding sequence (CDS), in base pairs.
[4:D] HGNC id    If the gene is in HGNC, this id helps link it to HGNC.
[5:E] Sample name,Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations.
    The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[8:H] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[9:I] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[10:J] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[11:K] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[12:L] Primary Histology    The histological classification of the sample.
[13:M] Histology Subtype 1    Further histological classification (level 1) of the sample.
[14:N] Histology Subtype 2    Further histological classification (level 2) of the sample.
[15:O] Histology Subtype 3    Further histological classification (level 3) of the sample.
[16:P] Genome-wide screen    Indicates whether the entire genome/exome was sequenced.
[17:Q] Mutation Id    Unique mutation identifier.
[18:R] Mutation CDS    The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.
[19:S] Mutation AA    The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to the Mutation Overview page.
[20:T] Mutation Description    Type of mutation at the amino acid level (substitution, deletion, insertion, complex, fusion, unknown etc.)
[21:U] Mutation zygosity    Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.
[22:V] LOH    Information on whether the gene was reported to have loss of heterozygosity (LOH) in the sample: yes, no or unknown.
[23:W] GRCh    The coordinate system used:
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[24:X] Mutation genome position    The genomic coordinates of the mutation.
[25:Y] Mutation strand    Positive or negative.
[26:Z] SNP    All known SNPs are flagged as 'y', as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.
[27:AA] Resistance Mutation    The mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).
[28:AB] FATHMM prediction    More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from http://fathmm.biocompute.org.uk. FATHMM descriptors:
    * Pathogenic = Defined as Cancer or Damaging.
    * Neutral = Defined as Passenger or Tolerated.
[29:AC] FATHMM Score    The scores are in the form of p-values ranging from 0 to 1. P-values greater than 0.5 are pathogenic while those less than 0.5 are benign. P-values close to 0 or 1 are the high-confidence results, which are more accurate. The results are annotated as 10 feature groups (separately for coding and non-coding variants), details of which can be found in the original FATHMM-MKL paper.
[30:AD] Mutation somatic status    Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin:
    * Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    * Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    * Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
[31:AE] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in, linking to PubMed to provide more details of the publication.
[32:AF] Id Study    Lists the unique Ids of studies that have involved this sample.
[33:AG] Sample Type,Tumour origin    Describes where the sample has originated from, including the tumour type.
[35:AI] Age    Age of the individual (if this information is provided with the publications).

----------------------------------
Structural Genomic Rearrangements
----------------------------------

STRUCTURAL VARIANTS

All structural variants from the current release in a tab separated table.
[ /cosmic/grch37/cosmic/v87/CosmicStructExport.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample name,Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[4:D] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[5:E] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[6:F] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[7:G] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[8:H] Primary Histology    The histological classification of the sample.
[9:I] Histology Subtype 1    Further histological classification (level 1) of the sample.
[10:J] Histology Subtype 2    Further histological classification (level 2) of the sample.
[11:K] Histology Subtype 3    Further histological classification (level 3) of the sample.
[12:L] Mutation Id    Unique mutation identifier.
[13:M] Mutation Type    Type of mutation: Intra/Inter (chromosomal), tandem duplication, deletion, inversion, complex substitutions, complex amplicons.
[14:N] GRCh    The coordinate system used:
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[15:O] Description    A syntax which describes the structural variant, based on HGVS recommendations.
[16:P] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in.
[17:Q] ID_STUDY    Lists the unique Ids of studies that have involved this structural mutation.

BREAKPOINTS

All breakpoint data from the current release in a tab separated table.
[ /cosmic/grch37/cosmic/v87/CosmicBreakpointsExport.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample name,Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database.
    There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[4:D] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[5:E] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[6:F] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[7:G] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[8:H] Primary Histology    The histological classification of the sample.
[9:I] Histology Subtype 1    Further histological classification (level 1) of the sample.
[10:J] Histology Subtype 2    Further histological classification (level 2) of the sample.
[11:K] Histology Subtype 3    Further histological classification (level 3) of the sample.
[12:L] Mutation Type    Type of mutation: Intra/Inter (chromosomal), tandem duplication, deletion, inversion, complex substitutions, complex amplicons.
[13:M] Mutation Id    Unique mutation identifier.
[14:N] Breakpoint Order    For variants involving multiple breakpoints, the predicted order along chromosome(s). Otherwise '0'.
[15:O] GRCh    The coordinate system used:
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[16:P] Chrom From    The chromosome where the first variant/breakpoint occurs.
[17:Q] Location From min    The first position in breakpoint range.
[18:R] Location From max    The last position in breakpoint range.
[19:S] Strand From    Positive or negative.
[20:T] Chrom To    The chromosome where the last variant/breakpoint occurs.
[21:U] Location To min    The first position in breakpoint range.
[22:V] Location To max    The last position in breakpoint range.
[23:W] Strand To    Positive or negative.
[24:X] Non-templated ins seq    Non Templated Sequence (if any) which is inserted at the breakpoint. The sequence is not encoded.
[25:Y] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in.
[26:Z] Id Study    Lists the unique Ids of studies that have involved this structural mutation.

-----------------------
Complete Fusion Export
-----------------------
All gene fusion mutation data from the current release in a tab separated table.
[ /cosmic/grch37/cosmic/v87/CosmicFusionExport.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample id,Sample name    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[3:C] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[4:D] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[5:E] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[6:F] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[7:G] Primary Histology    The histological classification of the sample.
[8:H] Histology Subtype 1    Further histological classification (level 1) of the sample.
[9:I] Histology Subtype 2    Further histological classification (level 2) of the sample.
[10:J] Histology Subtype 3    Further histological classification (level 3) of the sample.
[11:K] Fusion Id    Unique fusion mutation identifier.
[12:L] Translocation Name    Syntax describing the portions of mRNA present (in HGVS 'r.' format) from each gene (allows representation of UTR sequences).
[13:M] Fusion type    Type of mutation.
[14:N] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in.
[15:O] Id Study    Lists the unique Ids of studies that have involved this fusion mutation.

------------------------------
All Mutations in Census Genes
------------------------------
All coding mutations in genes listed in the Cancer Gene Census ( http://cancer.sanger.ac.uk/census ) in a tab separated table.
[ /cosmic/grch37/cosmic/v87/CosmicMutantExportCensus.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Gene name    The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC symbol.
[2:B] Accession Number    The transcript identifier of the gene.
[3:C] Gene CDS length    Length of the gene's coding sequence (CDS), in base pairs.
[4:D] HGNC id    If the gene is in HGNC, this id helps link it to HGNC.
[5:E] Sample name,Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database.
    There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[8:H] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[9:I] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[10:J] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[11:K] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[12:L] Primary Histology    The histological classification of the sample.
[13:M] Histology Subtype 1    Further histological classification (level 1) of the sample.
[14:N] Histology Subtype 2    Further histological classification (level 2) of the sample.
[15:O] Histology Subtype 3    Further histological classification (level 3) of the sample.
[16:P] Genome-wide screen    Indicates whether the entire genome/exome was sequenced.
[17:Q] Mutation Id    Unique mutation identifier.
[18:R] Mutation CDS    The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.
[19:S] Mutation AA    The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society. The description of each type can be found by following the link to the Mutation Overview page.
[20:T] Mutation Description    Type of mutation (substitution, deletion, insertion, complex, fusion etc.)
[21:U] Mutation zygosity    Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.
[22:V] LOH    Information on whether the gene was reported to have loss of heterozygosity (LOH) in the sample: yes, no or unknown.
[23:W] GRCh    The coordinate system used:
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[24:X] Mutation genome position    The genomic coordinates of the mutation.
[25:Y] Mutation strand    Positive or negative.
[26:Z] SNP    All known SNPs are flagged as 'y', as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing.
[27:AA] Resistance Mutation    The mutation confers drug resistance (see CosmicResistanceMutations.tsv.gz for gene/drug details).
[28:AB] FATHMM prediction    More information about FATHMM (Functional Analysis through Hidden Markov Models) is available from http://fathmm.biocompute.org.uk. FATHMM descriptors:
    * Pathogenic = Defined as Cancer or Damaging.
    * Neutral = Defined as Passenger or Tolerated.
[29:AC] FATHMM score    The FATHMM-MKL functional score is a p-value, ranging from 0 to 1. Scores above 0.5 are deleterious, but in order to highlight the most significant data in COSMIC, only scores >= 0.7 are classified as 'Pathogenic'. Mutations are classed as 'Neutral' if the score is <= 0.5.
[30:AD] Mutation somatic status    Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin:
    * Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    * Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    * Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
[31:AE] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in, linking to PubMed to provide more details of the publication.
[32:AF] Id Study    Lists the unique Ids of studies that have involved this sample.
[33:AG] Sample Type,Tumour origin    Describes where the sample has originated from, including the tumour type.
[35:AI] Age    Age of the individual (if this information is provided with the publications).
[36:AJ] Tier    1 or 2 [see http://cancer.sanger.ac.uk/census for details of Tier 1 and 2]

--------------------
Non coding variants
--------------------
A tab separated table of all non-coding mutations from the current release.
[ /cosmic/grch37/cosmic/v87/CosmicNCV.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample name,Sample id,Tumour id    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids if the same sample has been entered into the database multiple times from different papers.
[4:D] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cell_lines/classification. COSMIC uses a standard classification system for tissue types and subtypes because they vary a lot between different papers.
[5:E] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[6:F] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[7:G] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[8:H] Primary Histology    The histological classification of the sample.
[9:I] Histology Subtype 1    Further histological classification (level 1) of the sample.
[10:J] Histology Subtype 2    Further histological classification (level 2) of the sample.
[11:K] Histology Subtype 3    Further histological classification (level 3) of the sample.
[12:L] Id NCV    Unique non-coding variant identifier.
[13:M] Zygosity    Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.
[14:N] GRCh    The coordinate system used -
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[15:O] Genome position    The genomic coordinate of the mutation.
[16:P] Mutation somatic status    Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -
    * Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    * Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    * Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
[17:Q] WT SEQ    Wild type sequence.
[18:R] MUT SEQ    Mutated sequence.
[19:S] SNP    Known SNPs, as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing, are flagged as 'y'.
[20:T] FATHMM_MKL_NON_CODING_SCORE    FATHMM-MKL non-coding score. A p-value ranging from 0 to 1 where >= 0.7 is functionally significant.
[21:U] FATHMM_MKL_NON_CODING_GROUPS    FATHMM-MKL group classification. More details from http://cancer.sanger.ac.uk/cosmic/analyses.
[22:V] FATHMM_MKL_CODING_SCORE    FATHMM-MKL coding score (p-value ranging from 0 to 1).
[23:W] FATHMM_MKL_CODING_GROUPS    FATHMM-MKL group classification (coding). More details from http://cancer.sanger.ac.uk/cosmic/analyses.
[24:X] Whole Genome Reseq    If the entire genome is sequenced.
[25:Y] Whole_Exome    If the entire exome is sequenced.
[26:Z] Id Study    Lists the unique Ids of studies that have involved this non coding mutation.
[27:AA] Pubmed_PMID    The PUBMED ID for the paper that the sample was noted in.

---------------------
Copy Number Variants
---------------------

All copy number aberrations from the current release in a tab separated table. For more information on copy number data, please see http://cancer.sanger.ac.uk/cosmic/help/cnv/overview.
[ /cosmic/grch37/cosmic/v87/CosmicCompleteCNA.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] CNV_ID    The unique identifier for the variant (not stable, differs between releases).
[2:B] Id gene,Gene name    The ID and symbol of the gene which overlaps the copy number segment (or '-' where there is no overlapping gene).
[4:D] Sample id,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.
[6:F] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.
[7:G] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[8:H] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[9:I] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[10:J] Primary Histology    The histological classification of the sample.
[11:K] Histology Subtype 1    Further histological classification (level 1) of the sample.
[12:L] Histology Subtype 2    Further histological classification (level 2) of the sample.
[13:M] Histology Subtype 3    Further histological classification (level 3) of the sample.
[14:N] Sample Name    The name of the sample.
[15:O] Total_CN    The sum of the major and minor allele counts, e.g. if ABB, total copy number = 3.
[16:P] Minor Allele    The number of copies of the least frequent allele, e.g. if ABB, minor allele = A (1 copy) and major allele = B (2 copies).
[17:Q] Mut Type    Defined as Gain or Loss. For ICGC samples, as defined in the original data. For TCGA samples reanalysed with ASCAT -
    * GAIN = (average genome ploidy <= 2.7 AND total copy number >= 5) OR (average genome ploidy > 2.7 AND total copy number >= 9)
    * LOSS = (average genome ploidy <= 2.7 AND total copy number = 0) OR (average genome ploidy > 2.7 AND total copy number < (average genome ploidy - 2.7))
[18:R] Id Study    Lists the unique Ids of studies that have involved this copy number variation.
[19:S] GRCh    The coordinate system used -
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[20:T] Chromosome:G_Start..G_Stop    The genomic coordinates of the variation.

----------------
Gene Expression
----------------

All gene expression level 3 data from the TCGA portal for the current release in a tab separated table.
Please note: The platform codes currently used to produce the COSMIC gene expression values are: IlluminaGA_RNASeqV2, IlluminaHiSeq_RNASeqV2, AgilentG4502A_07_2, AgilentG4502A_07_3.
For more information on the gene expression data, please see http://cancer.sanger.ac.uk/cosmic/analyses.
[ /cosmic/grch37/cosmic/v87/CosmicCompleteGeneExpression.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample id,Sample name    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.
[3:C] Gene name    The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.
[4:D] Regulation    Over or under, depending on whether the scores from the different platforms are above or below the threshold.
[5:E] Z-score    An indicative score taken from the gene expression data of the different platforms, in order of preference: IlluminaHiSeq_RNASeqV2, IlluminaGA_RNASeqV2, AgilentG4502A_07_3.
[6:F] Id Study    Lists the unique Ids of studies that have involved this gene expression data.

------------
Methylation
------------

TCGA Level 3 methylation data from the ICGC portal for the current release in a tab separated table. More information on the methylation data is available from http://cancer.sanger.ac.uk/cosmic/analyses.
[ /cosmic/grch37/cosmic/v87/CosmicCompleteDifferentialMethylation.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Study_ID    The study Id for these data.
[2:B] Id Sample,Sample name,Id tumour    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the TCGA.
[5:E] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.
[6:F] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[7:G] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[8:H] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[9:I] Primary Histology    The histological classification of the sample.
[10:J] Histology Subtype 1    Further histological classification (level 1) of the sample.
[11:K] Histology Subtype 2    Further histological classification (level 2) of the sample.
[12:L] Histology Subtype 3    Further histological classification (level 3) of the sample.
[13:M] Fragment Id    The unique probe Id for a specific CpG.
[14:N] Genome Version    The coordinate system used -
    * 38 = GRCh38/Hg38
    * 37 = GRCh37/Hg19
[15:O] Chromosome    The chromosome location of the probe (1-22, X or Y).
[16:P] Position    The genome location of the CpG targeted by the probe (1-based coordinates).
[17:Q] Strand    Positive or negative.
[18:R] Gene Name    The gene name (if the probe falls within the coding region of a COSMIC gene) or the probe annotation as described by Illumina.
[19:S] Methylation    The methylation level; H (High, beta-value > 0.8) or L (Low, beta-value < 0.2).
[20:T] Avg Beta Value Normal    The average beta-value across the normal population. The beta-value of the tumour must differ from this value by > 0.5 to be considered a variant.
[21:U] Beta Value    The beta-value for the probe in the tumour sample. Only values > 0.8 (High) or < 0.2 (Low) are included.
[22:V] Two Sided P-Value    The two sided p-value.

-------------------
Cancer Gene Census
-------------------

A list of all cancer census genes from the current release in a comma separated table. The census table is exported from http://cancer.sanger.ac.uk/census and the format is the same.
[ /cosmic/grch37/cosmic/v87/cancer_gene_census.csv ]

-----------------------
COSMIC Sample Features
-----------------------

All the features related to a sample from the current release in a tab separated file.
[ /cosmic/grch37/cosmic/v87/CosmicSample.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample id,Sample name,Id tumour,Id Individual    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process. A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers. These samples are from the ICGC and TCGA.
[5:E] Primary Site    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.
[6:F] Site Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[7:G] Site Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[8:H] Site Subtype 3    Further sub classification (level 3) of the sample's tissue of origin.
[9:I] Primary Histology    The histological classification of the sample.
[10:J] Histology Subtype 1    Further histological classification (level 1) of the sample.
[11:K] Histology Subtype 2    Further histological classification (level 2) of the sample.
[12:L] Histology Subtype 3    Further histological classification (level 3) of the sample.
[13:M] Therapy Relationship    Relates the time-point of tissue sampling to the drug therapy used to treat the tumour.
[14:N] Sample Differentiator    Gives additional information if more than one sample (e.g. carcinomatous and sarcomatous components) from a tumour has been screened for mutations or if samples from a tumour were taken at different time points.
[15:O] Mutation Allele Specification    Where a publication has information on more than one mutation for one gene in a sample and reports whether or not the mutations occurred on the same or different chromosomes.
[16:P] Msi    If microsatellite instability data is given in the publication per sample then High, Low, Stable/Low, MSI or Stable is reported in COSMIC. Unknown is the default.
[17:Q] Average Ploidy    The average ploidy of the sample, calculated from copy number data (where available).
[18:R] Whole Genome Screen    'y' if the sample was whole genome screened.
[19:S] Whole Exome Screen    'y' if the sample was whole exome sequenced.
[20:T] Sample Remark    Any additional sample information e.g.
% mutant allele burden.
[21:U] Drug Response    Clinical and in vitro responses to drugs (particularly targeted drugs). Phrasing based on RECIST guidelines. Note that in COSMIC, SD (stable disease) and PD (progressive disease) = clinical primary non response.
[22:V] Grade    Grade of tumour. The phrase 'Some Grade data are given in publication' is used when publication reports grade data or when data hasn't been given per sample. More detailed data follow commonly used grading systems in tumours.
[23:W] Age at tumour recurrence    Where both primary and recurrent tumour samples from an individual have been screened for mutations and the age (in years) of the patient at the time of the recurrence is different to that at diagnosis.
[24:X] Stage    Stage of tumour. The phrase 'Some Stage data are given in publication' is used when publication reports stage data or when data hasn't been given per sample. More detailed data follow commonly used staging systems in tumours.
[25:Y] Cytogenetics    Karyotype of the tumour.
[26:Z] Metastatic Site    Tissue site of any metastases identified in an individual.
[27:AA] Tumour Source    Source of tumour tissue sample e.g. primary, metastasis.
[28:AB] Tumour Remark    Any additional tumour information e.g. metachronous tumour.
[29:AC] Age    Age (in years) of individual at diagnosis.
[30:AD] Ethnicity    Ethnicity (e.g. Caucasian) of individual.
[31:AE] Environmental Variables    Environmental variables to which an individual has been exposed (e.g. viral exposure, smoking status).
[32:AF] Germline Mutation    Gene name/mutation if a germline mutation as well as a somatic mutation has been detected in the same gene in the same tumour sample.
[33:AG] Therapy    Any significant treatment an individual has received prior to mutation screening.
[34:AH] Family    Any familial cancer history for an individual or familial relationships of individuals screened for mutations in the same publication.
[35:AI] Normal tissue tested    If normal tissue from the same individual has been screened for mutations.
[36:AJ] Gender    Sex of individual.
[37:AK] Individual Remark    Any additional individual information (e.g. age group, hereditary syndromes).
[38:AL] NCI code    NCI thesaurus code for tumour histological classification.
[39:AM] Sample Type    Describes where the sample originated from.

------------
COSMIC HGNC
------------

A tab separated table showing the relationship between the Cancer Gene Census, COSMIC ID, Gene Name, HGNC ID and Entrez ID.
[ /cosmic/grch37/cosmic/v87/CosmicHGNC.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] COSMIC_ID    COSMIC Gene ID (COSG*).
[2:B] COSMIC_GENE_NAME    Gene name used in COSMIC.
[3:C] Entrez_id    Entrez ID mapping.
[4:D] HGNC_ID    HGNC mapping.
[5:E] Mutated?    Does the gene have coding mutations y/n.
[6:F] Cancer_census?    Is the gene in the Cancer gene census y/n.
[7:G] Expert Curated?    Has the gene been manually curated by the team of expert curators y/n.

----------------------------
COSMIC Resistance Mutations
----------------------------

A tab separated table listing the details of all mutations in COSMIC which are known to confer drug resistance.
[ /cosmic/grch37/cosmic/v87/CosmicResistanceMutations.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Sample name,Sample id    A sample is an instance of a portion of a tumour being examined for mutations. The sample name can be derived from a number of sources. In many cases it originates from the cell line name. Other sources include names assigned by the annotators, or an incremented number assigned during an anonymisation process.
A number of samples can be taken from a single tumour and a number of tumours can be obtained from one individual. A sample id is used to identify a sample within the COSMIC database. There can be multiple ids, if the same sample has been entered into the database multiple times from different papers.
[3:C] Gene Name    The gene name for which the data has been curated in COSMIC. In most cases this is the accepted HGNC identifier.
[4:D] Transcript    The transcript identifier (accession number) of the gene.
[5:E] Census Gene    Is the gene in the Cancer Gene Census (Yes, or No).
[6:F] Drug Name    The name of the drug which the mutation confers resistance to.
[7:G] ID Mutation    The unique mutation identifier (COSM).
[8:H] AA Mutation    The change that has occurred in the peptide sequence. Formatting is based on the recommendations made by the Human Genome Variation Society.
[9:I] CDS Mutation    The change that has occurred in the nucleotide sequence. Formatting is identical to the method used for the peptide sequence.
[10:J] Primary Tissue    The primary tissue/cancer from which the sample originated. More details on the tissue classification are available from http://cancer.sanger.ac.uk/cosmic/classification. In COSMIC we have a standard classification system for tissue types and sub types because they vary a lot between different papers.
[11:K] Tissue Subtype 1    Further sub classification (level 1) of the sample's tissue of origin.
[12:L] Tissue Subtype 2    Further sub classification (level 2) of the sample's tissue of origin.
[13:M] Histology    The histological classification of the sample.
[14:N] Histology Subtype 1    Further histological classification (level 1) of the sample.
[15:O] Histology Subtype 2    Further histological classification (level 2) of the sample.
[16:P] Pubmed ID    The PUBMED ID for the paper that the sample was noted in, linking to pubmed to provide more details of the publication.
[17:Q] CGP Study    Lists the unique Ids of studies that have involved this sample.
[18:R] Somatic Status    Information on whether the sample was reported to be Confirmed Somatic, Previously Reported or Variant of unknown origin -
    * Variant of unknown origin = when the mutation is known to be somatic but the tumour was sequenced without a matched normal.
    * Confirmed Somatic = if the mutation has been confirmed to be somatic in the experiment by sequencing both the tumour and a matched normal from the same patient.
    * Previously observed = when the mutation has been reported as somatic previously but not in the current paper.
[19:S] Sample Type    Describes where the sample has originated from including the tumour type.
[20:T] Zygosity    Information on whether the mutation was reported to be homozygous, heterozygous or unknown within the sample.
[21:U] Genome Coordinates (GRCh37/38)    The genome location of the mutation (chr:start..end), on the specified genome version.
[22:V] Tier    1 or 2 [see http://cancer.sanger.ac.uk/census for details of Tier 1 and 2]

----------------------------------
ASCAT Ploidy and Purity Estimates
----------------------------------

A tab separated table listing the ploidy and aberrant cell fraction (purity estimate), for TCGA samples re-analysed using ASCAT.
[ /cosmic/grch37/cosmic/v87/ascat_acf_ploidy.tsv ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Cancer_Type_Code    The disease code (decode available from https://tcga-data.nci.nih.gov/datareports/codeTablesReport.htm).
[2:B] Sample    The name of the sample.
[3:C] Aberrant_Cell_Fraction(Purity)    The aberrant cell fraction (purity estimate).
[4:D] Ploidy    The ploidy of the genome.

--------------------------------------------
VCF Files (coding and non-coding mutations)
--------------------------------------------

CODING MUTATIONS
VCF file of all coding mutations in the current release.
[ /cosmic/grch37/cosmic/v87/VCF/CosmicCodingMuts.vcf.gz ]

NON-CODING VARIANTS
VCF file of all non coding mutations in the current release.
[ /cosmic/grch37/cosmic/v87/VCF/CosmicNonCodingVariants.vcf.gz ]

-------------------
Fasta File (genes)
-------------------

CDS sequence for all the genes in COSMIC.
[ /cosmic/grch37/cosmic/v87/All_COSMIC_Genes.fasta.gz ]

-------------------
COSMIC Transcripts
-------------------

A tab separated table listing the gene name and transcript accession for each gene ID.
[ /cosmic/grch37/cosmic/v87/CosmicTranscripts.tsv.gz ]

File Description

[column number:label] Heading    Description
--------------------------------------------------------------------------------------------------------
[1:A] Gene ID    The unique ID of the gene.
[2:B] Gene_NAME    The name of the gene.
[3:C] Transcript ID    The accession of the transcript.

---------------------
Oracle Database Dump
---------------------

The Oracle database dump of the current release. Please see the help document OracleSchemaDocumentation.pdf for a description of the database schema.
[ /cosmic/grch37/cosmic/v87/COSMIC_ORACLE_EXPORT.dmp.gz ]
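
The gzip-compressed tab separated tables listed in this file can be read row by row without unpacking them to disk first. The following Python sketch is illustrative only and is not part of the COSMIC distribution. It assumes that the first line of each file is a header row, that the header names match the headings documented above (for example Gene ID, Gene_NAME and Transcript ID in CosmicTranscripts.tsv.gz) and that the file is UTF-8 encoded; all three should be checked against the downloaded file. The function name iter_cosmic_rows is hypothetical.

```python
import csv
import gzip
from typing import Dict, Iterator


def iter_cosmic_rows(path: str) -> Iterator[Dict[str, str]]:
    """Yield each data row of a gzip-compressed, tab separated file as a dict.

    Assumption: the first line of the file is a header row, so every
    yielded dict is keyed by the column headings found in that line.
    """
    # "rt" opens the compressed stream as text; newline="" leaves line
    # ending handling to the csv module.
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE treats quote characters as ordinary data, which is
        # the safer choice for free-text remark columns.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row
```

Because the function is a generator, a large export such as the targeted screens table can be filtered one row at a time, for example by keeping only the rows whose GRCh column equals '37', without holding the whole file in memory.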