{ "metadata": { "name": "", "signature": "sha256:b2f9a236477bb40dfd4ba3009d61d6ce5e32fa8ddd86ed796f5116ff90e873dc" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "To reiterate, the Qi code provided with the 2006 paper does not appear to include anything for the classification step. I guess they didn't see the point, since classification can be performed with plenty of different toolsets and they probably used one they couldn't share.\n", "In any case, I want to replicate their results, so I need to get the feature vectors they used and then run some Python classification algorithms to see if I can get results similar to the paper's.\n", "\n", "## Running Perl scripts\n", "\n", "The code is basically just a couple of Perl scripts, so I should be able to run them without much hassle, and hopefully that will give me the feature vectors.\n", "Unfortunately, if anything goes wrong it's going to take a while to work out, because I don't know Perl." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "0yeast_gene_list/  12homology-PPI/  2tf-binding/  5essentiality/  8nature-compare-sequence/  Batch_feature_summary_ExtractWrapper.pl  README\r\n", "10mips-phenotype/  13domain-interaction/  3gene-ontology/  6HighExp-PPI/  9mips-pclass/  Investigating qi_evaluation_2006.ipynb  Replicating Qi 2006.ipynb\r\n", "11sequence-similarity/  1gene-expression/  4protein-expression/  7genetic-interaction/  Batch_feature_ExtractWrapper.pl  Investigating qi_evaluation_2006.md  train-set/\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "head -n 30 Batch_feature_ExtractWrapper.pl" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "######################################################################3\r\n", "#\r\n", "# copyright @ Yanjun Qi , qyj@cs.cmu.edu\r\n", "# Please cite: \r\n", "# Y. Qi, Z. Bar-Joseph, J. 
Klein-Seetharaman, \"Evaluation of different biological data and computational classification methods for use in protein interaction prediction\", PROTEINS: Structure, Function, and Bioinformatics. 63(3):490-500. 2006\r\n", "# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, \"A mixture of feature experts approach for protein-protein interaction prediction\", BMC Bioinformatics 8 (S10):S6, 2007 \r\n", "# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, \"Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple source\", Pacific Symposium on Biocomputing 10: (PSB 2005) Jan. 2005. \r\n", "# \r\n", "######################################################################3\r\n", "\r\n", "\r\n", "# This program is a yeast PPI feature extraction wrapper \r\n", "# perl command inputPairlist\r\n", "\r\n", "\r\n", "use strict; \r\n", "die \"Usage: command inputPairFile \\n\" if scalar(@ARGV) < 1;\r\n", "my ($inputPair ) = @ARGV;\r\n", "\r\n", "\r\n", "print \"\\n--------------------------- 1gene-expression ----------------------------------------\\n\"; \r\n", "\r\n", "# ------------------- 1gene-expression ------------------------------\r\n", "\r\n", "my $cmdPre = \"perl ./1gene-expression/get_gene_expression.pl \"; \r\n", "my $cmdPro = \"./1gene-expression/YeastGeneListOrfGeneName-106_pval_v9.0.txt ./1gene-expression/all_expression_fixed_s4_csv.txt ./1gene-expression/expressionYanjunSplit.txt 0.6 \"; \r\n", "\r\n", "my $cmd = $cmdPre.\" \".$inputPair.\" \".$cmdPro.\" \".$inputPair.\".genexp\" ; \r\n", "print \"$cmd\\n\"; \r\n", "system($cmd); \r\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, so the documentation isn't very extensive; the important part is the usage line, `perl command inputPairlist`, where I guess `command` is the name of the wrapper script?\n", "\n", "Anyway, which file is the input pair list? I think it's in the `0yeast_gene_list/` directory." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "cd 0yeast_gene_list/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "/home/gavin/Documents/MRes/YeastPPI-shared-08/0yeast_gene_list\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "ls" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "make_fullpair_4protein.pl  YeastGeneListOrfGeneName-106_pval_v9.0.txt\r\n" ] } ], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "head YeastGeneListOrfGeneName-106_pval_v9.0.txt" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "YAL001C\tTFC3\r\n", "YAL002W\tVPS8\r\n", "YAL003W\tEFB1\r\n", "YAL004W\tYAL004W\r\n", "YAL005C\tSSA1\r\n", "YAL007C\tERP2\r\n", "YAL008W\tFUN14\r\n", "YAL009W\tSPO7\r\n", "YAL010C\tMDM10\r\n", "YAL011W\tFUN36\r\n" ] } ], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I guess this maps each ORF identifier (first column) to its gene name (second column)?" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "head -n 30 make_fullpair_4protein.pl" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "######################################################################3\r\n", "#\r\n", "# copyright @ Yanjun Qi , qyj@cs.cmu.edu\r\n", "# Please cite: \r\n", "# Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, \"Evaluation of different biological data and computational classification methods for use in protein interaction prediction\", PROTEINS: Structure, Function, and Bioinformatics. 63(3):490-500. 2006\r\n", "# Y. Qi, J. Klein-Seetharaman, Z. 
Bar-Joseph, \"A mixture of feature experts approach for protein-protein interaction prediction\", BMC Bioinformatics 8 (S10):S6, 2007 \r\n", "# Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, \"Random Forest Similarity for Protein-Protein Interaction Prediction from Multiple source\", Pacific Symposium on Biocomputing 10: (PSB 2005) Jan. 2005. \r\n", "# \r\n", "######################################################################3\r\n", "\r\n", "\r\n", "# \r\n", "# This program is to make the pair list from the protein list \r\n", "#\r\n", "# perl make_fullpair_4protein.pl protein_list.txt pos_list output_pos output_other \r\n", "# \r\n", "# \r\n", "# $ perl make_fullpair_4protein.pl YeastGeneListOrfGeneName-106_pval_v9.0.txt Science03-pos_MIPS_complexes.txt mipsPosPair.txt mipsRandpair.txt\r\n", "#==> There are 6270 unique proteins in original list.\r\n", "#==> There should have 19653315 pairs possibly generated totally !\r\n", "# There are 8617 POS pairs originally .\r\n", "# fullpairs has: 7390 POS pairs.\r\n", "# fullpairs has: 19645925 RAND pairs.\r\n", "# ==> There are 19653315 pairs generated !\r\n", "# \r\n", "\r\n", "\r\n", "use strict; \r\n", "die \"Usage: command gene_name_file pos_pairFile outPosPairFile outRandPairFile \\n\" if scalar(@ARGV) < 4; \r\n", "\r\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "I'm slightly confused by this. I'm guessing it produces the input pair list for the main wrapper, but I don't see where to get `pos_pairFile`, `outPosPairFile` or `outRandPairFile`. Trying to find `Science03-pos_MIPS_complexes.txt`, `mipsPosPair.txt` and `mipsRandpair.txt`:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "cd .." ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "find . -name 'Science*'\n", "find . 
-name 'mips*'" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 31 }, { "cell_type": "markdown", "metadata": {}, "source": [ "__No sign of them.__\n", "\n", "Maybe I'm supposed to get them from elsewhere? That doesn't make much sense though: why tarball up all the code but not make it runnable?\n", "\n", "Ok, maybe I'm just being stupid and two of those are output files (the usage line names them `outPosPairFile` and `outRandPairFile`). So we'd only need the list of positive pairs? Is that in here somewhere? I don't see it.\n", "\n", "It looks like I can get the files online at [this page on the Qi site][qifeatureset].\n", "\n", "I downloaded them and put them in a directory called `featuresets`, so I can have a look at them now:\n", "\n", "[qifeatureset]: http://www.cs.cmu.edu/~qyj/papers_sulp/proteins05_pages/feature-download.html" ] }, { "cell_type": "code", "collapsed": false, "input": [ "ls -lh featuresets/phyInteract/" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "total 188M\r\n", "-rw-r--r-- 1 gavin users 51K May 28 16:19 dipsPosPair\r\n", "-rw-r--r-- 1 gavin users 2.1M May 28 16:19 dipsPosPair.feature\r\n", "-rw-r--r-- 1 gavin users 4.2M May 28 16:19 dipsRandpairSub23w\r\n", "-rw-r--r-- 1 gavin users 181M May 28 16:19 dipsRandpairSub23w.feature\r\n", "-rw-r--r-- 1 gavin users 624 May 28 16:19 readme.txt\r\n" ] } ], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "head -n 1 featuresets/phyInteract/dipsPosPair.feature" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ 
"0.793388763224369,0.00420804252006518,0.425688981443356,0.050368298677632,-100,-0.431032656998849,-100,0.219336773277164,-0.260415665650041,0.197204579148454,0.561672012832894,-0.0228811714495604,0.226695830426883,-0.498251467771468,0.191892679025485,-0.0442896786511169,-0.0410836070811429,0.31939401640312,0.201702280568662,-0.206576006538866,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5.42227951866507,1,-100,0,-100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,1\n" ] } ], "prompt_number": 16 }, { "cell_type": "code", "collapsed": false, "input": [ "%%bash\n", "head -n 1 featuresets/phyInteract/dipsRandpairSub23w.feature" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "-100,-0.451693610174967,-0.0287451930023522,0.199960245640947,-100,-0.132785992866974,-100,0.0113596052584383,-0.0184840432669403,-0.332990558018139,0.231135300459446,0.568965723612462,-0.00592435004576543,0.212805215910624,-0.934391682945072,-0.0356978663758904,0.635341056917029,0.507997427846247,0.449270105867988,0.376562615235622,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.47217114669236,0,-100,0,-100,-100,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,-100,0\n" ] } ], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Opening up these files:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import csv,glob\n", "nipfiles = 
sorted(glob.glob(\"featuresets/phyInteract/dips*.feature\"))  # sorted so nipfiles[0] is always the positive set\n", "print(nipfiles)\n", "# TODO: write a generator function for each file so rows can be read lazily" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "['featuresets/phyInteract/dipsPosPair.feature', 'featuresets/phyInteract/dipsRandpairSub23w.feature']\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "One option would be to write a generator function that opens these files and yields rows only as they are required, which saves RAM. If I get round to writing it, I'll put it here:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# TODO: generator function goes here" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another option would be to use `pandas`, as it's pretty much designed to solve this problem." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import pandas as pd\n", "# test importing one of these CSVs (no header row in the file)\n", "pd.read_csv(nipfiles[0], header=None).head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | ... |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 0.793389 | 0.004208 | 0.425689 | 0.050368 | -100.000000 | -0.431033 | -100.000000 | 0.219337 | -0.260416 | 0.197205 | 0.561672 | -0.022881 | 0.226696 | -0.498251 | 0.191893 | -0.044290 | -0.041084 | 0.319394 | 0.201702 | -0.206576 | ... |\n",
"| 1 | -0.060023 | -0.210066 | -0.091081 | 0.057263 | 0.615934 | 0.000000 | -0.270493 | 0.459723 | 0.283593 | -0.820271 | 0.768494 | 0.476355 | 0.100546 | -0.458742 | 0.179573 | -0.001993 | 0.670944 | 0.523979 | 0.031387 | 0.144759 | ... |\n",
"| 2 | -0.139235 | 0.233024 | -0.246856 | -0.197301 | -0.735368 | -0.848902 | -0.242361 | -0.274482 | -0.260667 | 0.968193 | 0.512238 | 0.116022 | 0.087863 | -0.000231 | -0.999968 | 0.018011 | -0.331740 | 0.006124 | -0.144496 | -0.287701 | ... |\n",
"| 3 | -100.000000 | 0.516909 | 0.489173 | 0.382931 | -100.000000 | 0.365115 | 0.055116 | 0.382403 | -0.396365 | 0.106891 | 0.658290 | 0.533645 | 0.096992 | 0.496594 | 0.976505 | -0.014651 | -0.257374 | 0.015310 | 0.194071 | -0.033708 | ... |\n",
"| 4 | 0.654579 | 0.657707 | 0.114509 | 0.293567 | 0.096530 | 0.000000 | -0.557642 | -0.108332 | 0.079067 | 0.801406 | 0.234897 | 0.414738 | 0.370008 | -0.273466 | -0.949009 | 0.043061 | 0.343765 | 0.329410 | 0.320448 | -0.165973 | ... |\n",
"\n",
"5 rows \u00d7 163 columns\n", "