{ "metadata": { "name": "", "signature": "sha256:3bbbcfb0d62b99e5755186306ff0cf10e99bf3c5a78c08cf76b27d3ad4da5512" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import sys\n", "sys.version_info" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 1, "text": [ "sys.version_info(major=3, minor=4, micro=2, releaselevel='final', serial=0)" ] } ], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "[petl](http://petl.readthedocs.org) Case Study 1 - Comparing Tables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This case study illustrates the use of the [petl](http://petl.readthedocs.org) package for doing some simple profiling and comparison of data from\n", "two tables.\n" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The files used in this case study can be downloaded from the following\n", "link:\n", "\n", "* http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip\n", "\n", "Download and unzip the files:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!wget http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip\n", "!unzip -o petl-case-study-1-files.zip" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "--2015-01-19 17:37:39-- http://aliman.s3.amazonaws.com/petl/petl-case-study-1-files.zip\r\n", "Resolving aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)... 54.231.9.241\r\n", "Connecting to aliman.s3.amazonaws.com (aliman.s3.amazonaws.com)|54.231.9.241|:80... " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "connected.\r\n", "HTTP request sent, awaiting response... " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "200 OK\r\n", "Length: 3076773 (2.9M) [application/zip]\r\n", "Saving to: \u2018petl-case-study-1-files.zip\u2019\r\n", "\r\n", "\r", " 0% [ ] 0 --.-K/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 2% [ ] 75,696 276KB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", " 8% [==> ] 265,496 484KB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "22% [=======> ] 688,896 838KB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "50% [==================> ] 1,567,816 1.39MB/s " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r", "100%[======================================>] 3,076,773 2.34MB/s in 1.3s \r\n", "\r\n", "2015-01-19 17:37:41 (2.34 MB/s) - \u2018petl-case-study-1-files.zip\u2019 saved [3076773/3076773]\r\n", "\r\n" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "Archive: petl-case-study-1-files.zip\r\n", " inflating: popdata.csv " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n", " inflating: snpdata.csv " ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\r\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first file (`snpdata.csv`) contains a list of locations in the\n", "genome of the malaria parasite *P. falciparum*, along with some basic\n", "data about genetic variations found at those locations.\n", "\n", "The second file (`popdata.csv`) is supposed to contain the same list\n", "of genome locations, along with some additional data such as allele\n", "frequencies in different populations.\n", "\n", "The main point for this case study is that the first file\n", "(`snpdata.csv`) contains the canonical list of genome locations, and\n", "the second file (`popdata.csv`) contains some additional data about\n", "the same genome locations and therefore should be consistent with the\n", "first file. We want to check whether this second file is in fact\n", "consistent with the first file." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Preparing the data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Start by importing the petl package:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import petl as etl\n", "etl.__version__" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ "'1.0.0'" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "To save some typing, let ***a*** be the table of data extracted from the\n", "first file (`snpdata.csv`), and let ***b*** be the table of data extracted\n", "from the second file (`popdata.csv`), using the `fromcsv()`\n", "function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a = etl.fromtsv('snpdata.csv')\n", "b = etl.fromtsv('popdata.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Examine the header from each file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a.header()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "('Chr',\n", " 'Pos',\n", " 'Ref',\n", " 'Nref',\n", " 'Der',\n", " 'Mut',\n", " 'isTypable',\n", " 'GeneId',\n", " 'GeneAlias',\n", " 'GeneDescr')" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "b.header()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "('Chromosome',\n", " 'Coordinates',\n", " 'Ref. Allele',\n", " 'Non-Ref. Allele',\n", " 'Outgroup Allele',\n", " 'Ancestral Allele',\n", " 'Derived Allele',\n", " 'Ref. Aminoacid',\n", " 'Non-Ref. Aminoacid',\n", " 'Private Allele',\n", " 'Private population',\n", " 'maf AFR',\n", " 'maf PNG',\n", " 'maf SEA',\n", " 'daf AFR',\n", " 'daf PNG',\n", " 'daf SEA',\n", " 'nraf AFR',\n", " 'nraf PNG',\n", " 'nraf SEA',\n", " 'Mutation type',\n", " 'Gene',\n", " 'Gene Aliases',\n", " 'Gene Description',\n", " 'Gene Information')" ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is a common set of 9 fields that is present in both tables, and\n", "we would like focus on comparing these common fields, however\n", "different field names have been used in the two files. To simplify\n", "comparison, use `rename()` to rename some fields in the\n", "second file:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "b_renamed = b.rename({'Chromosome': 'Chr', \n", " 'Coordinates': 'Pos', \n", " 'Ref. Allele': 'Ref', \n", " 'Non-Ref. Allele': 'Nref', \n", " 'Derived Allele': 'Der', \n", " 'Mutation type': 'Mut', \n", " 'Gene': 'GeneId', \n", " 'Gene Aliases': 'GeneAlias', \n", " 'Gene Description': 'GeneDescr'})\n", "b_renamed.header()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ "('Chr',\n", " 'Pos',\n", " 'Ref',\n", " 'Nref',\n", " 'Outgroup Allele',\n", " 'Ancestral Allele',\n", " 'Der',\n", " 'Ref. Aminoacid',\n", " 'Non-Ref. Aminoacid',\n", " 'Private Allele',\n", " 'Private population',\n", " 'maf AFR',\n", " 'maf PNG',\n", " 'maf SEA',\n", " 'daf AFR',\n", " 'daf PNG',\n", " 'daf SEA',\n", " 'nraf AFR',\n", " 'nraf PNG',\n", " 'nraf SEA',\n", " 'Mut',\n", " 'GeneId',\n", " 'GeneAlias',\n", " 'GeneDescr',\n", " 'Gene Information')" ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use `cut()` to extract only the fields we're interested in\n", "from both tables:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "common_fields = ['Chr', 'Pos', 'Ref', 'Nref', 'Der', 'Mut', 'GeneId', 'GeneAlias', 'GeneDescr']\n", "a_common = a.cut(common_fields)\n", "b_common = b_renamed.cut(common_fields)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Inspect the data:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a_common" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
MAL191099GA-SPFA0095cMAL1P1.10rifin
MAL191104AT-NPFA0095cMAL1P1.10rifin
MAL193363TA-NPFA0100cMAL1P1.11hypothetical protein, conserved in P. falciparum
MAL193382TG-NPFA0100cMAL1P1.11hypothetical protein, conserved in P. falciparum
MAL193384GA-NPFA0100cMAL1P1.11hypothetical protein, conserved in P. falciparum
\n", "

...

" ], "metadata": {}, "output_type": "pyout", "prompt_number": 9, "text": [ "+--------+---------+-----+------+-----+-----+------------+-------------+----------------------------------------------------+\n", "| Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |\n", "+========+=========+=====+======+=====+=====+============+=============+====================================================+\n", "| 'MAL1' | '91099' | 'G' | 'A' | '-' | 'S' | 'PFA0095c' | 'MAL1P1.10' | 'rifin' |\n", "+--------+---------+-----+------+-----+-----+------------+-------------+----------------------------------------------------+\n", "| 'MAL1' | '91104' | 'A' | 'T' | '-' | 'N' | 'PFA0095c' | 'MAL1P1.10' | 'rifin' |\n", "+--------+---------+-----+------+-----+-----+------------+-------------+----------------------------------------------------+\n", "| 'MAL1' | '93363' | 'T' | 'A' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'hypothetical protein, conserved in P. falciparum' |\n", "+--------+---------+-----+------+-----+-----+------------+-------------+----------------------------------------------------+\n", "| 'MAL1' | '93382' | 'T' | 'G' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'hypothetical protein, conserved in P. falciparum' |\n", "+--------+---------+-----+------+-----+-----+------------+-------------+----------------------------------------------------+\n", "| 'MAL1' | '93384' | 'G' | 'A' | '-' | 'N' | 'PFA0100c' | 'MAL1P1.11' | 'hypothetical protein, conserved in P. falciparum' |\n", "+--------+---------+-----+------+-----+-----+------------+-------------+----------------------------------------------------+\n", "..." ] } ], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "b_common" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
MAL191099GA-SYNPFA0095cMAL1P1.10,RIFrifin
MAL191104AT-NONPFA0095cMAL1P1.10,RIFrifin
MAL193363TA-NONPFA0100cMAL1P1.11Plasmodium exported protein (PHISTa), unknown function
MAL193382TG-NONPFA0100cMAL1P1.11Plasmodium exported protein (PHISTa), unknown function
MAL193384GA-NONPFA0100cMAL1P1.11Plasmodium exported protein (PHISTa), unknown function
\n", "

...

" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ "+--------+---------+-----+------+-----+-------+------------+-----------------+----------------------------------------------------------+\n", "| Chr | Pos | Ref | Nref | Der | Mut | GeneId | GeneAlias | GeneDescr |\n", "+========+=========+=====+======+=====+=======+============+=================+==========================================================+\n", "| 'MAL1' | '91099' | 'G' | 'A' | '-' | 'SYN' | 'PFA0095c' | 'MAL1P1.10,RIF' | 'rifin' |\n", "+--------+---------+-----+------+-----+-------+------------+-----------------+----------------------------------------------------------+\n", "| 'MAL1' | '91104' | 'A' | 'T' | '-' | 'NON' | 'PFA0095c' | 'MAL1P1.10,RIF' | 'rifin' |\n", "+--------+---------+-----+------+-----+-------+------------+-----------------+----------------------------------------------------------+\n", "| 'MAL1' | '93363' | 'T' | 'A' | '-' | 'NON' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |\n", "+--------+---------+-----+------+-----+-------+------------+-----------------+----------------------------------------------------------+\n", "| 'MAL1' | '93382' | 'T' | 'G' | '-' | 'NON' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |\n", "+--------+---------+-----+------+-----+-------+------------+-----------------+----------------------------------------------------------+\n", "| 'MAL1' | '93384' | 'G' | 'A' | '-' | 'NON' | 'PFA0100c' | 'MAL1P1.11' | 'Plasmodium exported protein (PHISTa), unknown function' |\n", "+--------+---------+-----+------+-----+-------+------------+-----------------+----------------------------------------------------------+\n", "..." ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `fromucsv()` function does not attempt to parse any of the\n", "values from the underlying CSV file, so all values are reported as\n", "strings:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "b_common.display(vrepr=repr)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
'MAL1''91099''G''A''-''SYN''PFA0095c''MAL1P1.10,RIF''rifin'
'MAL1''91104''A''T''-''NON''PFA0095c''MAL1P1.10,RIF''rifin'
'MAL1''93363''T''A''-''NON''PFA0100c''MAL1P1.11''Plasmodium exported protein (PHISTa), unknown function'
'MAL1''93382''T''G''-''NON''PFA0100c''MAL1P1.11''Plasmodium exported protein (PHISTa), unknown function'
'MAL1''93384''G''A''-''NON''PFA0100c''MAL1P1.11''Plasmodium exported protein (PHISTa), unknown function'
\n", "

...

" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the 'Pos' field should be interpreted as an integer.\n", "\n", "Also, the 'Mut' field has a different representation in the two\n", "tables, which needs to be converted before the data can be compared:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a_common.valuecounts('Mut')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Mutcountfrequency
N711620.6865804123611875
S315350.30425386166507473
-9500.009165725973737783
\n" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "+-----+-------+----------------------+\n", "| Mut | count | frequency |\n", "+=====+=======+======================+\n", "| 'N' | 71162 | 0.6865804123611875 |\n", "+-----+-------+----------------------+\n", "| 'S' | 31535 | 0.30425386166507473 |\n", "+-----+-------+----------------------+\n", "| '-' | 950 | 0.009165725973737783 |\n", "+-----+-------+----------------------+" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "b_common.valuecounts('Mut')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Mutcountfrequency
NON708800.6840510336042
SYN327380.31594896639579995
\n" ], "metadata": {}, "output_type": "pyout", "prompt_number": 13, "text": [ "+-------+-------+---------------------+\n", "| Mut | count | frequency |\n", "+=======+=======+=====================+\n", "| 'NON' | 70880 | 0.6840510336042 |\n", "+-------+-------+---------------------+\n", "| 'SYN' | 32738 | 0.31594896639579995 |\n", "+-------+-------+---------------------+" ] } ], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the `convert()` function to convert the type of the 'Pos'\n", "field in both tables and the representation of the 'Mut' field in\n", "table ***b***:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a_conv = a_common.convert('Pos', int)\n", "b_conv = (\n", " b_common\n", " .convert('Pos', int)\n", " .convert('Mut', {'SYN': 'S', 'NON': 'N'})\n", ")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "highlight = 'background-color: yellow'\n", "a_conv.display(caption='a', vrepr=repr, td_styles={'Pos': highlight})" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
a
ChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
'MAL1'91099'G''A''-''S''PFA0095c''MAL1P1.10''rifin'
'MAL1'91104'A''T''-''N''PFA0095c''MAL1P1.10''rifin'
'MAL1'93363'T''A''-''N''PFA0100c''MAL1P1.11''hypothetical protein, conserved in P. falciparum'
'MAL1'93382'T''G''-''N''PFA0100c''MAL1P1.11''hypothetical protein, conserved in P. falciparum'
'MAL1'93384'G''A''-''N''PFA0100c''MAL1P1.11''hypothetical protein, conserved in P. falciparum'
\n", "

...

" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "b_conv.display(caption='b', vrepr=repr, td_styles={'Pos': highlight, 'Mut': highlight})" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
b
ChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
'MAL1'91099'G''A''-''S''PFA0095c''MAL1P1.10,RIF''rifin'
'MAL1'91104'A''T''-''N''PFA0095c''MAL1P1.10,RIF''rifin'
'MAL1'93363'T''A''-''N''PFA0100c''MAL1P1.11''Plasmodium exported protein (PHISTa), unknown function'
'MAL1'93382'T''G''-''N''PFA0100c''MAL1P1.11''Plasmodium exported protein (PHISTa), unknown function'
'MAL1'93384'G''A''-''N''PFA0100c''MAL1P1.11''Plasmodium exported protein (PHISTa), unknown function'
\n", "

...

" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 16 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the tables are ready for comparison." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Looking for missing or unexpected rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because both tables should contain the same list of genome locations,\n", "they should have the same number of rows. Use `nrows()` to\n", "compare:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a_conv.nrows()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 17, "text": [ "103647" ] } ], "prompt_number": 17 }, { "cell_type": "code", "collapsed": false, "input": [ "b_conv.nrows()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "103618" ] } ], "prompt_number": 18 }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is some discrepancy. First investigate by comparing just the\n", "genomic locations, defined by the 'Chr' and 'Pos' fields, using\n", "`complement()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "a_locs = a_conv.cut('Chr', 'Pos')\n", "b_locs = b_conv.cut('Chr', 'Pos')\n", "locs_only_in_a = a_locs.complement(b_locs)\n", "locs_only_in_a.nrows()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 19, "text": [ "29" ] } ], "prompt_number": 19 }, { "cell_type": "code", "collapsed": false, "input": [ "locs_only_in_a.displayall(caption='a only')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
a only
ChrPos
MAL1216961
MAL10538210
MAL10548779
MAL101432969
MAL11500289
MAL111119809
MAL111278859
MAL1251827
MAL13183727
MAL13398404
MAL13627342
MAL131216664
MAL132750149
MAL141991758
MAL142297918
MAL142372268
MAL142994810
MAL238577
MAL264017
MAL41094258
MAL51335335
MAL51338718
MAL7670602
MAL7690509
MAL8489937
MAL9416116
MAL9868677
MAL91201970
MAL91475245
\n" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "locs_only_in_b = b_locs.complement(a_locs)\n", "locs_only_in_b.nrows()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 21, "text": [ "0" ] } ], "prompt_number": 21 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So it appears that 29 locations are missing from table ***b***. Export\n", "these missing locations to a CSV file using `toucsv()`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "locs_only_in_a.tocsv('missing_locations.csv')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 22 }, { "cell_type": "markdown", "metadata": {}, "source": [ "An alternative method for finding rows in one table where some key\n", "value is not present in another table is to use the `antijoin()`\n", "function:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "locs_only_in_a = a_conv.antijoin(b_conv, key=('Chr', 'Pos'))\n", "locs_only_in_a.nrows()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 23, "text": [ "29" ] } ], "prompt_number": 23 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Finding conflicts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'd also like to compare the values given in the other fields, to\n", "find any discrepancies between the two tables.\n", "\n", "The simplest way to find conflicts is to `merge()` both tables under a given key:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ab_merge = etl.merge(a_conv, b_conv, key=('Chr', 'Pos'))\n", "ab_merge.display(caption='ab_merge', \n", " td_styles=lambda v: highlight if isinstance(v, etl.Conflict) else '')" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
ab_merge
ChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
MAL191099GA-SPFA0095cConflict({'MAL1P1.10', 'MAL1P1.10,RIF'})rifin
MAL191104AT-NPFA0095cConflict({'MAL1P1.10', 'MAL1P1.10,RIF'})rifin
MAL193363TA-NPFA0100cMAL1P1.11Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
MAL193382TG-NPFA0100cMAL1P1.11Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
MAL193384GA-NPFA0100cMAL1P1.11Conflict({'Plasmodium exported protein (PHISTa), unknown function', 'hypothetical protein, conserved in P. falciparum'})
\n", "

...

" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 24 }, { "cell_type": "markdown", "metadata": {}, "source": [ "From a glance at the conflicts above, it appears there are\n", "discrepancies in the 'GeneAlias' and 'GeneDescr' fields. There may\n", "also be conflicts in other fields, so we need to investigate further.\n", "\n", "Note that the table ***ab_merge*** will contain all rows, not only those containing conflicts. To find only conflicting rows, use `cat()` then `conflicts()`, e.g.:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ab = etl.cat(a_conv.addfield('source', 'a', index=0), \n", " b_conv.addfield('source', 'b', index=0))\n", "ab_conflicts = ab.conflicts(key=('Chr', 'Pos'), exclude='source')\n", "ab_conflicts.display(10)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
sourceChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
aMAL191099GA-SPFA0095cMAL1P1.10rifin
bMAL191099GA-SPFA0095cMAL1P1.10,RIFrifin
aMAL191104AT-NPFA0095cMAL1P1.10rifin
bMAL191104AT-NPFA0095cMAL1P1.10,RIFrifin
aMAL193363TA-NPFA0100cMAL1P1.11hypothetical protein, conserved in P. falciparum
bMAL193363TA-NPFA0100cMAL1P1.11Plasmodium exported protein (PHISTa), unknown function
aMAL193382TG-NPFA0100cMAL1P1.11hypothetical protein, conserved in P. falciparum
bMAL193382TG-NPFA0100cMAL1P1.11Plasmodium exported protein (PHISTa), unknown function
aMAL193384GA-NPFA0100cMAL1P1.11hypothetical protein, conserved in P. falciparum
bMAL193384GA-NPFA0100cMAL1P1.11Plasmodium exported protein (PHISTa), unknown function
\n", "

...

" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 25 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's find conflicts in a specific field:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ab_conflicts_mut = ab.conflicts(key=('Chr', 'Pos'), include='Mut')\n", "ab_conflicts_mut.display(10, caption='Mut conflicts', td_styles={'Mut': highlight})" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Mut conflicts
sourceChrPosRefNrefDerMutGeneIdGeneAliasGeneDescr
aMAL199099GT--PFA0110wMAL1P1.13,Pf155ring-infected erythrocyte surface antigen
bMAL199099GT-NPFA0110wMAL1P1.13,Pf155,RESAring-infected erythrocyte surface antigen
aMAL199211CT--PFA0110wMAL1P1.13,Pf155ring-infected erythrocyte surface antigen
bMAL199211CT-NPFA0110wMAL1P1.13,Pf155,RESAring-infected erythrocyte surface antigen
aMAL1197903CAASPFA0220wMAL1P1.34bubiquitin carboxyl-terminal hydrolase, putative
bMAL1197903CAANPFA0220wPFA0215w,MAL1P1.34bubiquitin carboxyl-terminal hydrolase, putative
aMAL1384429CT-NPFA0485wMAL1P2.26dolichol kinase
bMAL1384429CT-S---
aMAL1513268AG-NPFA0650wMAL1P3.12,MAL1P3.12a,PFA0655wsurface-associated interspersed gene pseudogene, (SURFIN) pseudogene
bMAL1513268AG-SPFA0650wMAL1P3.12,PFA0655,MAL1P3.12a,3D7surf1.2,PFA0655w,MAL1P12asurface-associated interspersed gene (SURFIN), pseudogene
\n", "

...

" ], "metadata": {}, "output_type": "display_data" } ], "prompt_number": 26 }, { "cell_type": "code", "collapsed": false, "input": [ "ab_conflicts_mut.nrows()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 27, "text": [ "3592" ] } ], "prompt_number": 27 }, { "cell_type": "markdown", "metadata": {}, "source": [ "For more information about the `petl` package see the [petl online documentation](http://petl.readthedocs.org)." ] } ], "metadata": {} } ] }