{
 "metadata": {
  "name": "",
  "signature": "sha256:203b254988e60dc3140f5ca4c5767861830776747b2e2c877c60b863adf8398a"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Generating the training and test set involves using the `ocbio.extract` module with the chosen gold standard positive and negative datasets.\n",
      "This notebook is supposed to act like a script to do this, with documentation inline.\n",
      "\n",
      "First, the datasource table must be regenerated at the _top directory_ containing the data:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "cd ../../"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "/data/opencast/MRes\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import csv"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As the data repository has now been annexed the datasource table must first be unlocked:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!git annex unlock datasource.tab"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "unlock datasource.tab (copying...) ok\r\n"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "#this script should be updated to add new features when available\n",
      "f = open(\"datasource.tab\", \"w\")\n",
      "c = csv.writer(f,delimiter=\"\\t\")\n",
      "# Gene Ontology features\n",
      "c.writerow([\"Gene_Ontology\",\"Gene_Ontology\",\"generator=geneontology/testgen.pickle\"])\n",
      "# Y2H SVM feature\n",
      "c.writerow([\"Y2H/Y2H.txt\",\"Y2H/Y2H.db\",\"valindexes=(4);ignoreheader=1;zeromissing=1\"])\n",
      "# ENTS feature\n",
      "c.writerow([\"ENTS\",\"ENTS\",\"generator=ents/human.ENTS.features.pickle\"])\n",
      "# ENTS summary feature\n",
      "c.writerow([\"ENTS_summary\",\"ENTS_summary\",\"generator=ents/human.Entrez.ENTS.summary.pickle\"])\n",
      "f.close()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Importing ocbio.extract\n",
      "\n",
      "Next, `ocbio.extract` must be added to the path and imported:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import sys"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "sys.path.append(\"opencast-bio/\")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import ocbio.extract"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "reload(ocbio.extract)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "<module 'ocbio.extract' from 'opencast-bio/ocbio/extract.pyc'>"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Unlocking databases\n",
      "\n",
      "Now that the data directory has been [annexed][gitannex] the database files must first be unlocked:\n",
      "\n",
      "[gitannex]: https://git-annex.branchable.com/"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "!git annex unlock Y2H/Y2H.db"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "unlock Y2H/Y2H.db (copying...) "
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "ok\r\n"
       ]
      }
     ],
     "prompt_number": 9
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Initialising assembler\n",
      "\n",
      "Then an assembler object must be initialised using the data source table:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assembler = ocbio.extract.FeatureVectorAssembler(\"datasource.tab\", verbose=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Using  from top data directory datasource.tab.\n",
        "Reading data source table:\n",
        "\tData source: Gene_Ontology to be processed to Gene_Ontology\n",
        "\tData source: Y2H/Y2H.txt to be processed to Y2H/Y2H.db\n",
        "\tData source: ENTS to be processed to ENTS\n",
        "\tData source: ENTS_summary to be processed to ENTS_summary\n",
        "Initialising parsers.\n",
        "Database Y2H/Y2H.db last updated 2014-06-25 12:15:04"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Finished Initialisation."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "## Regenerating features\n",
      "\n",
      "Then all the features should be regenerated to ensure they are up to date:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assembler.regenerate(verbose=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Regenerating parsers:\n",
        "\t parser 0\n",
        "Custom generator function, no database to regenerate.\n",
        "\t parser 1\n",
        "Database Y2H/Y2H.db last updated 2014-06-25 12:15:04\n",
        "\t parser 2\n",
        "Custom generator function, no database to regenerate.\n",
        "\t parser 3\n",
        "Custom generator function, no database to regenerate.\n"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Generating training set\n",
      "\n",
      "Using a set of positive interactions found through the [iRefIndex][] project created in [this notebook][gennotes] we can create a set of positive and negative feature vectors to train the classifier with:\n",
      "\n",
      "[irefindex]: http://irefindex.org/wiki/index.php?title=iRefIndex\n",
      "[gennotes]: http://nbviewer.ipython.org/github/ggray1729/opencast-bio/blob/master/notebooks/Creating%20training%20set.ipynb"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assembler.assemble(\"iRefIndex/human.iRefIndex.positive.pairs.txt\",\n",
      "                   \"features/human.iRefIndex.positive.vectors.txt\",verbose=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Reading pairfile: iRefIndex/human.iRefIndex.positive.pairs.txt\n",
        "Checking feature sizes:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\t Data source Gene_Ontology produces features of size 90.\n",
        "\t Data source Y2H/Y2H.txt produces features of size 1.\n",
        "\t Data source ENTS produces features of size 107.\n",
        "\t Data source ENTS_summary produces features of size 1.\n",
        "Writing feature vectors."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Wrote 188833 vectors.\n",
        "Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from Gene_Ontology\n",
        "Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from Y2H/Y2H.txt\n",
        "Matched 38.39 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from ENTS\n",
        "Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.positive.pairs.txt to features from ENTS_summary\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assembler.assemble(\"iRefIndex/human.iRefIndex.negative.pairs.txt\",\n",
      "                   \"features/human.iRefIndex.negative.vectors.txt\",verbose=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Reading pairfile: iRefIndex/human.iRefIndex.negative.pairs.txt\n",
        "Checking feature sizes:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\t Data source Gene_Ontology produces features of size 90.\n",
        "\t Data source Y2H/Y2H.txt produces features of size 1.\n",
        "\t Data source ENTS produces features of size 107.\n",
        "\t Data source ENTS_summary produces features of size 1.\n",
        "Writing feature vectors."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        ".."
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "Wrote 997760 vectors.\n",
        "Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from Gene_Ontology\n",
        "Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from Y2H/Y2H.txt\n",
        "Matched 29.69 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from ENTS\n",
        "Matched 100.00 % of protein pairs in iRefIndex/human.iRefIndex.negative.pairs.txt to features from ENTS_summary\n"
       ]
      }
     ],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "# Generating active zone vectors\n",
      "\n",
      "To apply our classifier to the interactions in the Active Zone network we will need feature vectors corresponding to those interactions.\n",
      "These can be found in the following file:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assembler.assemble(\"forGAVIN/mergecode/OUT/edgelist.txt\",\n",
      "                   \"features/human.activezone.txt\",verbose=Tfeatures/"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Reading pairfile: forGAVIN/mergecode/OUT/edgelist.txt\n",
        "Checking feature sizes:\n",
        "\t Data source Gene_Ontology produces features of size 90.\n",
        "\t Data source Y2H/Y2H.txt produces features of size 1.\n",
        "\t Data source ENTS produces features of size 107.\n",
        "\t Data source ENTS_summary produces features of size 1.\n",
        "Writing feature vectors\n"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Wrote 9375 vectors.\n",
        "Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Gene_Ontology\n",
        "Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from Y2H/Y2H.txt\n",
        "Matched 42.74 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS\n",
        "Matched 100.00 % of protein pairs in forGAVIN/mergecode/OUT/edgelist.txt to features from ENTS_summary\n"
       ]
      }
     ],
     "prompt_number": 14
    }
   ],
   "metadata": {}
  }
 ]
}