{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PyRNA stores and describes an RNA secondary structure using two different data structures:\n",
    "\n",
    "* a pandas Dataframe\n",
    "* a pyrna.features.SecondaryStructure object\n",
    "\n",
    "The pandas Dataframe lists all the base pairs (a.k.a interactions) making the secondary structure. If a pandas Dataframe restricts the exploration of a secondary structure to the level of the base pairs, a pyrna.features.SecondaryStructure gives us access to high-level objects like: \n",
    "\n",
    "* helices,\n",
    "* single-strands,\n",
    "* junctions.\n",
    "\n",
    "In both data structures, the edges of the base pairs are described with three different characters:\n",
    "\n",
    "    * '(' or ')' for Watson-Crick edges,\n",
    "    * '[' or ']' for Hoogsteen edges,\n",
    "    * '{' or '}' for Sugar edges.\n",
    "    \n",
    "A base pair can also be in a cis ('c') or trans ('t') orientation."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creation of secondary structures from scratch"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  orientation edge1 edge2  pos1  pos2\n",
      "0           c     (     )     4    10\n",
      "1           c     {     )     3    11\n",
      "2           c     (     )     2    12\n",
      "3           c     (     )     1    13\n"
     ]
    }
   ],
   "source": [
    "import json #to have a better output for dict describing pyrna.features.RNA objects (interactions, helices, single-strands,..)\n",
    "\n",
    "from pyrna.parsers import parse_bn\n",
    "rna = RNA(name = 'my_rna', sequence = 'GGGGGACAACCCC')\n",
    "bn = '(({(.....))))'\n",
    "base_pairs = parse_bn(bn)\n",
    "print base_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we transform this list of base-pairs into a pyrna.features.SecondaryStructure object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyrna.parsers import base_pairs_to_secondary_structure\n",
    "ss = base_pairs_to_secondary_structure(rna, base_pairs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can browse over the helices. An helix contains a field to store non-canonical base pairs (meaning base pairs different from c() AU, c() GU or c() GC. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"length\": 4, \n",
      "  \"interactions\": [\n",
      "    {\n",
      "      \"edge1\": \"{\", \n",
      "      \"edge2\": \")\", \n",
      "      \"orientation\": \"c\", \n",
      "      \"location\": [\n",
      "        [\n",
      "          3, \n",
      "          3\n",
      "        ], \n",
      "        [\n",
      "          11, \n",
      "          11\n",
      "        ]\n",
      "      ]\n",
      "    }\n",
      "  ], \n",
      "  \"name\": \"H1\", \n",
      "  \"location\": [\n",
      "    [\n",
      "      1, \n",
      "      4\n",
      "    ], \n",
      "    [\n",
      "      10, \n",
      "      13\n",
      "    ]\n",
      "  ]\n",
      "}\n",
      "\n",
      "Any non canonical base pairs in this helix?\n",
      "\n",
      "{\n",
      "  \"edge1\": \"{\", \n",
      "  \"edge2\": \")\", \n",
      "  \"orientation\": \"c\", \n",
      "  \"location\": [\n",
      "    [\n",
      "      3, \n",
      "      3\n",
      "    ], \n",
      "    [\n",
      "      11, \n",
      "      11\n",
      "    ]\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "for helix in ss.helices:\n",
    "    print json.dumps(helix, indent = 2)\n",
    "    print \"\\nAny non canonical base pairs in this helix?\\n\"\n",
    "    for interaction in helix['interactions']:\n",
    "        print json.dumps(interaction, indent = 2)    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "or over the single-strands..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"name\": \"SS1\", \n",
      "  \"location\": [\n",
      "    5, \n",
      "    9\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "for single_strand in ss.single_strands:\n",
    "    print json.dumps(single_strand, indent = 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Any high-level object from a pyrna.features.SecondaryStructure is described with a \"location\". A location is a N x 2 matrix where N is the number of blocks of contiguous residues, each block being described with its start and end positions.\n",
    "\n",
    "An helix location will be a matrix like [ [1,4] , [10,13] ], meaning that the first strand is between residues 1 and 4, and the second strand is between residues 10 and 13.\n",
    "\n",
    "A base pair location will be a matrix like [ [3,3] , [11,11] ], meaning a base pair between residues 3 and 11."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From a pyrna.features.SecondaryStructure object, we can easily go back to a list of base pairs stored in a pandas Dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  edge1 edge2 orientation  pos1  pos2\n",
      "0     (     )           c     1    13\n",
      "1     (     )           c     2    12\n",
      "2     {     )           c     3    11\n",
      "3     (     )           c     4    10\n"
     ]
    }
   ],
   "source": [
    "from pyrna.parsers import secondary_structure_to_base_pairs\n",
    "print secondary_structure_to_base_pairs(ss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can create a new secondary structure containing a single tertiary interaction. A tertiary interaction is defined as a base pair which is not contiguous to another base-pair."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  orientation edge1 edge2  pos1  pos2\n",
      "0           c     (     )     6    12\n",
      "1           c     (     )     4    15\n",
      "2           c     {     )     3    16\n",
      "3           c     (     )     2    17\n",
      "4           c     (     )     1    18\n"
     ]
    }
   ],
   "source": [
    "rna = RNA(name = 'my_rna', sequence = 'GGGGGACGCAGTAACCCC')\n",
    "bn = '(({(.(.....)..))))'\n",
    "base_pairs = parse_bn(bn)\n",
    "print base_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The single tertiary interaction is stored in a pyrna.features.SecondaryStructure object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\n",
      "  \"edge1\": \"(\", \n",
      "  \"edge2\": \")\", \n",
      "  \"orientation\": \"c\", \n",
      "  \"location\": [\n",
      "    [\n",
      "      6, \n",
      "      6\n",
      "    ], \n",
      "    [\n",
      "      12, \n",
      "      12\n",
      "    ]\n",
      "  ]\n",
      "}\n"
     ]
    }
   ],
   "source": [
    "ss = base_pairs_to_secondary_structure(rna, base_pairs)\n",
    "for tertiary_interaction in ss.tertiary_interactions:\n",
    "    print json.dumps(tertiary_interaction, indent = 2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we want to go back to a list of base pairs in a pandas Dataframe, we need to precise if we want to keep the tertiary interactions..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  edge1 edge2 orientation  pos1  pos2\n",
      "0     (     )           c     1    18\n",
      "1     (     )           c     2    17\n",
      "2     {     )           c     3    16\n",
      "3     (     )           c     4    15\n",
      "4     (     )           c     6    12\n"
     ]
    }
   ],
   "source": [
    "print secondary_structure_to_base_pairs(ss, keep_tertiaries=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "...or not"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  edge1 edge2 orientation  pos1  pos2\n",
      "0     (     )           c     1    18\n",
      "1     (     )           c     2    17\n",
      "2     {     )           c     3    16\n",
      "3     (     )           c     4    15\n"
     ]
    }
   ],
   "source": [
    "print secondary_structure_to_base_pairs(ss)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creation of secondary structures from files"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PyRNA is able to parse several file formats describing secondary structures. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">ft3100 FANTOM3\n",
      "TAACAATCTGCTGAAAGGTACCGTCGGAGGGAGCTTTGTTGCCAGCGCCA\n",
      "GAAACGCCGGTTTAACCAGCGCCGAAGTGAGCGCAGTGATTAAAGCCATG\n",
      "CAGTGGCAAATGGATTTCCGCAAACTGAAAAAAGGCGATGAATTTGCGGT\n",
      ".......((((.....(((((((.((...((.(((........))).)).\n",
      "....)).))))...)))..((((..(((..(((....((((...(((((.\n",
      "..))))).....))))..)))..))).......))))........)))).\n"
     ]
    }
   ],
   "source": [
    "from pyrna.parsers import parse_vienna\n",
    "\n",
    "h = open('../data/ft3100_2D_with_bracket_notation.fasta')\n",
    "vienna_content = h.read()\n",
    "h.close()\n",
    "\n",
    "print vienna_content"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a Vienna file, several molecules and bracket notations can be stored. Consequently, the function parse_vienna() returns a list of RNA objects and a list of pandas Dataframes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   orientation edge1 edge2  pos1  pos2\n",
      "0            c     (     )    35    44\n",
      "1            c     (     )    34    45\n",
      "2            c     (     )    33    46\n",
      "3            c     (     )    31    48\n",
      "4            c     (     )    30    49\n",
      "5            c     (     )    26    55\n",
      "6            c     (     )    25    56\n",
      "7            c     (     )    23    58\n",
      "8            c     (     )    22    59\n",
      "9            c     (     )    21    60\n",
      "10           c     (     )    20    61\n",
      "11           c     (     )    19    65\n",
      "12           c     (     )    18    66\n",
      "13           c     (     )    17    67\n",
      "14           c     (     )    99   103\n",
      "15           c     (     )    98   104\n",
      "16           c     (     )    97   105\n",
      "17           c     (     )    96   106\n",
      "18           c     (     )    95   107\n",
      "19           c     (     )    91   113\n",
      "20           c     (     )    90   114\n",
      "21           c     (     )    89   115\n",
      "22           c     (     )    88   116\n",
      "23           c     (     )    83   119\n",
      "24           c     (     )    82   120\n",
      "25           c     (     )    81   121\n",
      "26           c     (     )    78   124\n",
      "27           c     (     )    77   125\n",
      "28           c     (     )    76   126\n",
      "29           c     (     )    73   134\n",
      "30           c     (     )    72   135\n",
      "31           c     (     )    71   136\n",
      "32           c     (     )    70   137\n",
      "33           c     (     )    11   146\n",
      "34           c     (     )    10   147\n",
      "35           c     (     )     9   148\n",
      "36           c     (     )     8   149\n"
     ]
    }
   ],
   "source": [
    "all_molecules, all_base_pairs = parse_vienna(vienna_content)\n",
    "\n",
    "for base_pairs in all_base_pairs:\n",
    "    print base_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creation of secondary structures from databases"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "PyRNA allows to recover Rfam entries directly from the database website."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyrna.db import Rfam\n",
    "rfam = Rfam(use_website = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function get_entry() returns:\n",
    "    \n",
    "* a list of gapped RNA objects\n",
    "* a dictionary to map species labels to name/start-end\n",
    "* a pandas Dataframe listing the paired columns in the Rfam alignment (a.k.a a consensus secondary structure)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sequence for AE014074.1/464992-464896: UAUUUCACAAAGGAGUGCUU---------------------------------------------------------UG-GCUGAGAUCGCAA--------------------------------------------------UUGCGAAA-UCCUGAGG-ACCUGA-UCUUGUUAGUACAAGCG-UAGGGA--UUGUGACCAAUAAUCAA\n",
      "sequence for AL935263.2/99249-99354: UUUAAACACUAGGGGUGUCCAAAA---------------------------------------------------AUGG-GCUGAGAUGGUGCUGUAA---------------------------------------------GUACCGAU-CCCUUUGA-ACCUG--U-AAGCUCAAACUUGCG-UAGGAA--AGUGUCACAGCUAAUGU\n",
      "sequence for BX248356.1/232868-232999: UUUAUAAAUCACGGGUGCUGGACGGCAUACGUUUGCC-------------------------------------ACAAA-GCUGAGACAGGGCGAGAAGACGUGCACG-----------------------------------UCCCUGAA-CCGUUGA--ACCUGA-UCCGGGUAAUACCGGCGAUAGGAA--GAAUAAUGAACCGAUCG\n",
      "sequence for AK120238.1/2075-2184: CUAUGUUAGGAGGUGGCCUCUUGGCCUGGAUUGUUGUGA-----------------------------------AUUGG-GCUGAGAA------------------------------------------------------------AGU-CCCUUUGA-ACCUGA-ACAGGAUAAUGCCUGCG-AAGGGA-GUGUGCAUUUCUACUUUU\n",
      "sequence for X54035.1/52-153: GAUGACCACAAGGGGAGCAUU-------------------------------------------------------AAA-GCUGAGAGUGAGCGGUUUC--------------------------------------------GUUC-UGA-CCCUUUGA-ACCUG--U-UAGUUAACGCUGGCG-UAGGGA--UGUGGCAAAUGCAAUAG\n",
      "sequence for AE008691.1/1524796-1524694: CUCAAGUGCUAGGGGAGCCAAA-----------------------------------------------------AAUG-GCUGAGAGGGGAUUC------------------------------------------------GUUCCCGA-CCCUUAGA-ACCUGA-CCAGGGUAAUGCCUGCG-AAGGGA--AGCACGUUUUACGAAAU\n",
      "sequence for AB087258.1/15478-15568: CCGACUUAGUCGGGGUGCUGAU-----------------------------------------------------AACA-GCUGAGAUA-----------------------------------------------------------AUA-CCCGUGA--ACCUGA-UACAGUUAAUACUGACG-UAGGAA--ACUAAUGGUCUUACUCC\n",
      "sequence for AE008691.1/2140374-2140272: UCAAAGUGCUAGGGGAGCCAGA-----------------------------------------------------AAUG-GCUGAGAGGGGAUUC------------------------------------------------GUUCCCGA-CCCUUAGA-ACCUGA-CCAGGGUAAUGCCUGCG-AAGGGA--AGCACGUUUUACGGAAU\n",
      "sequence for AL939111.1/162719-162830: CAAGGCACUCGCGGGAGCCCGGACGC------------------------------------------------ACCGG-GCUGAGAGGGAGGCUGGCG--------------------------------------------GCCUCCGA-CCGUACGA-ACCUGA-UCCGGGUCAUGCCGGCG-AAGGGA--GGGGCUGGACGCCCAUG\n",
      "sequence for AE017221.1/302202-302311: GGGGCAGGCUAGGGGUGCCCGAAUG-------------------------------------------------GAAGG-GCUGAGAGCUGGGUUUCU---------------------------------------------CCCAGCAA-CCCUUGGA-ACCUGA-UCCGGGUAAUGCCGGCG-GAGGGA--AGCCUAUGCGGAAGACC\n"
     ]
    }
   ],
   "source": [
    "gapped_rnas, organism_names_2_nse, consensus_2d = rfam.get_entry(rfam_id='RF00059')\n",
    "\n",
    "for gapped_rna in gapped_rnas[:10]:\n",
    "    print \"sequence for %s: %s\"%(gapped_rna.name, gapped_rna.sequence)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The function consensus2d_to_base_pairs() allows to easily infer the secondary structure for a given RNA sequence"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "AE014074.1/464992-464896\n",
      "UAUUUCACAAAGGAGUGCUUUGGCUGAGAUCGCAAUUGCGAAAUCCUGAGGACCUGAUCUUGUUAGUACAAGCGUAGGGAUUGUGACCAAUAAUCAA\n",
      ".....[[[[[((((..(((()))).....(((((())))))..)))).....((((..((((......))))..))))..]]]]]............\n",
      "AL935263.2/99249-99354\n",
      "UUUAAACACUAGGGGUGUCCAAAAAUGGGCUGAGAUGGUGCUGUAAGUACCGAUCCCUUUGAACCUGUAAGCUCAAACUUGCGUAGGAAAGUGUCACAGCUAAUGU\n",
      ".....[[[[[((((..(((((....))))).....((((((.....))))))..)))).....((((.(((......)))...))))..]]]]]............\n",
      "BX248356.1/232868-232999\n",
      "UUUAUAAAUCACGGGUGCUGGACGGCAUACGUUUGCCACAAAGCUGAGACAGGGCGAGAAGACGUGCACGUCCCUGAACCGUUGAACCUGAUCCGGGUAAUACCGGCGAUAGGAAGAAUAAUGAACCGAUCG\n",
      ".....[[[[[((((..(((((..................))))).....((((((...............))))))..))))....((((..((((......))))...))))..]]]]]............\n",
      "AK120238.1/2075-2184\n",
      "CUAUGUUAGGAGGUGGCCUCUUGGCCUGGAUUGUUGUGAAUUGGGCUGAGAAAGUCCCUUUGAACCUGAACAGGAUAAUGCCUGCGAAGGGAGUGUGCAUUUCUACUUUU\n",
      ".....[[[[[((((..(((((....................))))).....()..)))).....((((..((((......))))..))))...]]]]]............\n",
      "X54035.1/52-153\n",
      "GAUGACCACAAGGGGAGCAUUAAAGCUGAGAGUGAGCGGUUUCGUUCUGACCCUUUGAACCUGUUAGUUAACGCUGGCGUAGGGAUGUGGCAAAUGCAAUAG\n",
      ".....[[[[[((((..((((())))).....(.((((......)))))..)))).....((((.(((......)))...))))..]]]]]............\n",
      "AE008691.1/1524796-1524694\n",
      "CUCAAGUGCUAGGGGAGCCAAAAAUGGCUGAGAGGGGAUUCGUUCCCGACCCUUAGAACCUGACCAGGGUAAUGCCUGCGAAGGGAAGCACGUUUUACGAAAU\n",
      ".....[[[[[((((..(((((..))))).....((((((..))))))..)))).....((((..((((......))))..))))..]]]]]............\n",
      "AB087258.1/15478-15568\n",
      "CCGACUUAGUCGGGGUGCUGAUAACAGCUGAGAUAAUACCCGUGAACCUGAUACAGUUAAUACUGACGUAGGAAACUAAUGGUCUUACUCC\n",
      ".....[[[[[((((..(((((..))))).....(.)..))))....((((..((((......))))..))))..]]]]]............\n",
      "AE008691.1/2140374-2140272\n",
      "UCAAAGUGCUAGGGGAGCCAGAAAUGGCUGAGAGGGGAUUCGUUCCCGACCCUUAGAACCUGACCAGGGUAAUGCCUGCGAAGGGAAGCACGUUUUACGGAAU\n",
      ".....[[[[[((((..(((((..))))).....((((((..))))))..)))).....((((..((((......))))..))))..]]]]]............\n",
      "AL939111.1/162719-162830\n",
      "CAAGGCACUCGCGGGAGCCCGGACGCACCGGGCUGAGAGGGAGGCUGGCGGCCUCCGACCGUACGAACCUGAUCCGGGUCAUGCCGGCGAAGGGAGGGGCUGGACGCCCAUG\n",
      ".....[[[[[((((..(((((.......))))).....((((((......))))))..)))).....((((..((((......))))..))))..]]]]]............\n",
      "AE017221.1/302202-302311\n",
      "GGGGCAGGCUAGGGGUGCCCGAAUGGAAGGGCUGAGAGCUGGGUUUCUCCCAGCAACCCUUGGAACCUGAUCCGGGUAAUGCCGGCGGAGGGAAGCCUAUGCGGAAGACC\n",
      ".....[[[[[((((..(((((......))))).....((((((.....))))))..)))).....((((..((((......))))..))))..]]]]]............\n"
     ]
    }
   ],
   "source": [
    "from pyrna.parsers import consensus2d_to_base_pairs, to_bn\n",
    "\n",
    "for gapped_rna in gapped_rnas[:10]:\n",
    "    rna = RNA(name = gapped_rna.name, sequence = gapped_rna.sequence.replace('-',''))\n",
    "    print rna.name\n",
    "    print rna.sequence\n",
    "    print to_bn(consensus2d_to_base_pairs(gapped_rna, consensus_2d), len(rna))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Creation of secondary structures from algorithms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   orientation edge1 edge2  pos1  pos2\n",
      "0            c     (     )    41    48\n",
      "1            c     (     )    40    49\n",
      "2            c     (     )    39    50\n",
      "3            c     (     )    38    51\n",
      "4            c     (     )    37    52\n",
      "5            c     (     )    56    67\n",
      "6            c     (     )    55    68\n",
      "7            c     (     )    54    69\n",
      "8            c     (     )    53    70\n",
      "9            c     (     )    35    72\n",
      "10           c     (     )    32    75\n",
      "11           c     (     )    31    76\n",
      "12           c     (     )    29    77\n",
      "13           c     (     )    28    78\n",
      "14           c     (     )    27    79\n",
      "15           c     (     )    24    80\n",
      "16           c     (     )    23    81\n",
      "17           c     (     )    20    84\n",
      "18           c     (     )    19    85\n",
      "19           c     (     )    18    86\n",
      "20           c     (     )    17    87\n",
      "21           c     (     )    14    90\n",
      "22           c     (     )    13    91\n",
      "23           c     (     )    11    93\n",
      "24           c     (     )    10    94\n",
      "25           c     (     )     5    96\n",
      "26           c     (     )     4    97\n",
      "27           c     (     )     3    98\n",
      "28           c     (     )     2    99\n",
      "29           c     (     )     1   100\n"
     ]
    }
   ],
   "source": [
    "from pyrna.computations import Rnafold\n",
    "from pyrna.features import RNA\n",
    "rna = RNA(name = 'my_rna', sequence = 'GGGGTAGGGACGGTAGGGGGACGCAGTGCAGTAACGTACCCGGTAGGGGGTAGGGGGACGCAGTAACCCCGGGGACGCAGTAACCCCACGCAGTAACCCC')\n",
    "rnafold = Rnafold() #the algorithm is launched locally, using a Docker image\n",
    "base_pairs = rnafold.fold(rna)\n",
    "print base_pairs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also compute the 2D coordinates for a plot."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "           x        y\n",
      "1     51.114  -18.631\n",
      "2     51.114   -3.631\n",
      "3     51.114   11.369\n",
      "4     51.114   26.369\n",
      "5     51.114   41.369\n",
      "6     43.659   46.738\n",
      "7     39.542   55.107\n",
      "8     39.850   64.568\n",
      "9     44.640   72.894\n",
      "10    52.900   78.063\n",
      "11    54.298   92.998\n",
      "12    47.780  106.391\n",
      "13    56.770  119.410\n",
      "14    58.167  134.345\n",
      "15    48.494  145.613\n",
      "16    49.699  160.779\n",
      "17    61.583  170.848\n",
      "18    62.981  185.783\n",
      "19    64.379  200.718\n",
      "20    65.777  215.652\n",
      "21    56.104  226.921\n",
      "22    57.308  242.086\n",
      "23    69.193  252.156\n",
      "24    70.590  267.091\n",
      "25    68.168  271.394\n",
      "26    68.550  277.290\n",
      "27    72.568  282.901\n",
      "28    74.891  297.719\n",
      "29    77.215  312.538\n",
      "30    75.104  319.761\n",
      "..       ...      ...\n",
      "71   121.623  385.454\n",
      "72   104.173  375.762\n",
      "73   112.813  362.801\n",
      "74   109.780  347.892\n",
      "75    97.371  339.735\n",
      "76    94.587  324.996\n",
      "77    92.034  310.215\n",
      "78    89.710  295.396\n",
      "79    87.387  280.577\n",
      "80    85.525  265.693\n",
      "81    84.128  250.758\n",
      "82    93.937  238.658\n",
      "83    92.307  223.532\n",
      "84    80.711  214.255\n",
      "85    79.314  199.320\n",
      "86    77.916  184.385\n",
      "87    76.518  169.451\n",
      "88    86.328  157.351\n",
      "89    84.698  142.225\n",
      "90    73.102  132.947\n",
      "91    71.704  118.012\n",
      "92    78.122  103.551\n",
      "93    69.233   91.600\n",
      "94    67.835   76.666\n",
      "95    78.142   58.016\n",
      "96    66.114   41.369\n",
      "97    66.114   26.369\n",
      "98    66.114   11.369\n",
      "99    66.114   -3.631\n",
      "100   66.114  -18.631\n",
      "\n",
      "[100 rows x 2 columns]\n"
     ]
    }
   ],
   "source": [
    "from pyrna.computations import Rnaplot\n",
    "rnaplot = Rnaplot()\n",
    "plot = rnaplot.plot(secondary_structure = base_pairs, rna = rna)\n",
    "print plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also compute a secondary structure by annotating a tertiary one. First, we need to import a 3D structure from the Protein Databank."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from pyrna.db import PDB\n",
    "from pyrna.parsers import parse_pdb\n",
    "pdb = PDB()\n",
    "tertiary_structures = parse_pdb(pdb.get_entry('1HR2'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With PyRNA, a pyrna.features.TertiaryStructure object is made with a single molecular chain. Consequently, the function parse_pdb returns a list of such objects. We can iterate over these pyrna.features.TertiaryStructure objects to annotate them with the algorithm Rnaview (Rnaview is available for computations from my own server)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Molecular chain A\n",
      "   edge1 edge2 orientation  pos1  pos2\n",
      "0      (     )           c     4   112\n",
      "1      (     )           c     5   111\n",
      "2      (     )           c     6   110\n",
      "3      (     )           c     7   109\n",
      "4      (     )           c     8   108\n",
      "5      (     )           c     9   107\n",
      "6      (     )           c    10   106\n",
      "7      {     ]           t    11   105\n",
      "8      [     }           t    12   104\n",
      "9      (     )           c    14   103\n",
      "10     (     )           c    15   102\n",
      "11     (     )           c    16   101\n",
      "12     (     )           c    17   100\n",
      "13     (     )           c    18    99\n",
      "14     (     )           c    19    98\n",
      "15     (     )           c    24    94\n",
      "16     (     )           c    25    93\n",
      "17     (     )           c    26    92\n",
      "18     (     )           c    27    91\n",
      "19     (     )           c    30    89\n",
      "20     (     )           c    31    88\n",
      "21     (     )           c    32    87\n",
      "22     (     )           c    34    80\n",
      "23     (     )           c    35    79\n",
      "24     (     )           c    36    78\n",
      "25     [     }           t    37    62\n",
      "26     [     }           t    38    61\n",
      "27     (     )           c    39    60\n",
      "28     (     )           c    40    59\n",
      "29     (     )           c    41    58\n",
      "..   ...   ...         ...   ...   ...\n",
      "53     (     ]           t    21    96\n",
      "54     (     }           t    22    99\n",
      "55     (     ]           c    33    85\n",
      "56     {     ]           t    48    51\n",
      "57     (     )           t    49   145\n",
      "58     {     }           t    51   147\n",
      "59     [     )           t    66    86\n",
      "60     {     ]           c    69    70\n",
      "61     {     }           t    79    84\n",
      "62     {     }           t    82   109\n",
      "63     {     }           c    95    98\n",
      "64     {     ]           c   114   115\n",
      "65     (     ]           t   121   145\n",
      "66     {     ]           c   122   123\n",
      "67     (     )           c   123   146\n",
      "68     !     !           !    21    98\n",
      "69     !     !           !   111   157\n",
      "70     !     !           !   124   146\n",
      "71     !     !           !     9    66\n",
      "72     !     !           !    19    21\n",
      "73     !     !           !    50   147\n",
      "74     !     !           !    61    75\n",
      "75     !     !           !     1   115\n",
      "76     !     !           !     8    81\n",
      "77     !     !           !    24   100\n",
      "78     !     !           !    50   121\n",
      "79     !     !           !    51   120\n",
      "80     !     !           !    52   148\n",
      "81     !     !           !    77    78\n",
      "82     !     !           !   112   157\n",
      "\n",
      "[83 rows x 5 columns]\n",
      "Molecular chain B\n",
      "   edge1 edge2 orientation  pos1  pos2\n",
      "0      (     )           c     2   115\n",
      "1      (     )           c     3   114\n",
      "2      (     )           c     4   113\n",
      "3      (     )           c     6   112\n",
      "4      (     )           c     7   111\n",
      "5      (     )           c     8   110\n",
      "6      (     )           c     9   109\n",
      "7      (     )           c    10   108\n",
      "8      (     )           c    11   107\n",
      "9      {     ]           t    12   106\n",
      "10     [     }           t    13   105\n",
      "11     (     )           c    15   104\n",
      "12     (     )           c    16   103\n",
      "13     (     )           c    17   102\n",
      "14     (     )           c    18   101\n",
      "15     (     )           c    19   100\n",
      "16     (     )           c    20    99\n",
      "17     (     )           c    25    95\n",
      "18     (     )           c    26    94\n",
      "19     (     )           c    27    93\n",
      "20     (     )           c    28    92\n",
      "21     (     )           c    30    91\n",
      "22     (     )           c    31    90\n",
      "23     (     )           c    32    89\n",
      "24     (     )           c    33    88\n",
      "25     (     )           c    35    81\n",
      "26     (     )           c    36    80\n",
      "27     (     )           c    37    79\n",
      "28     [     }           t    38    63\n",
      "29     [     }           t    39    62\n",
      "..   ...   ...         ...   ...   ...\n",
      "59     [     )           c     7   158\n",
      "60     {     }           c     9    82\n",
      "61     {     )           t    10    67\n",
      "62     (     }           t    23   100\n",
      "63     (     ]           c    34    86\n",
      "64     {     ]           t    49    52\n",
      "65     (     )           t    50   146\n",
      "66     {     }           t    52   148\n",
      "67     [     )           t    67    87\n",
      "68     {     ]           c    70    71\n",
      "69     {     }           t    80    85\n",
      "70     {     }           t    83   110\n",
      "71     {     }           c    96    99\n",
      "72     {     ]           c   116   117\n",
      "73     (     ]           t   122   146\n",
      "74     {     ]           c   123   124\n",
      "75     (     )           c   124   147\n",
      "76     !     !           !    14   105\n",
      "77     !     !           !   125   147\n",
      "78     !     !           !    10    87\n",
      "79     !     !           !    20    22\n",
      "80     !     !           !    23    24\n",
      "81     !     !           !    51   148\n",
      "82     !     !           !    62    76\n",
      "83     !     !           !     8    83\n",
      "84     !     !           !    25   101\n",
      "85     !     !           !    51   122\n",
      "86     !     !           !    52   121\n",
      "87     !     !           !    53   149\n",
      "88     !     !           !    98    99\n",
      "\n",
      "[89 rows x 5 columns]\n"
     ]
    }
   ],
   "source": [
    "secondary_structures = []\n",
    "for ts in tertiary_structures:\n",
    "    #the function annotate() from Rnaview returns a pyrna.features.SecondaryStructure object and its 3D counterpart as a pyrna.features.TertiaryStructure object\n",
    "    secondary_structure, tertiary_structure = Rnaview().annotate(ts)\n",
    "    secondary_structures.append(secondary_structure)\n",
    "    #the function secondary_structure_to_base_pairs() transform a pyrna.features.SecondaryStructure object into a list of base pairs stored in a pandas Dataframe\n",
    "    print \"Molecular chain %s\"%secondary_structure.rna.name\n",
    "    print secondary_structure_to_base_pairs(secondary_structure, keep_tertiaries = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Mining of a Secondary Structure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can search for all base pairs made with at least one Hoogsteen edge. Since the variable base_pairs stores a pandas Dataframe, we're using the operators available from pandas (http://pandas.pydata.org/pandas-docs/stable/indexing.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   edge1 edge2 orientation  pos1  pos2\n",
      "7      {     ]           t    11   105\n",
      "8      [     }           t    12   104\n",
      "25     [     }           t    37    62\n",
      "26     [     }           t    38    61\n",
      "52     [     )           c     4   156\n",
      "53     (     ]           t    21    96\n",
      "55     (     ]           c    33    85\n",
      "56     {     ]           t    48    51\n",
      "59     [     )           t    66    86\n",
      "60     {     ]           c    69    70\n",
      "64     {     ]           c   114   115\n",
      "65     (     ]           t   121   145\n",
      "66     {     ]           c   122   123\n"
     ]
    }
   ],
   "source": [
    "base_pairs = secondary_structure_to_base_pairs(secondary_structures[0], keep_tertiaries = True)\n",
    "print base_pairs[(base_pairs['edge1'] == '[') | (base_pairs['edge2'] == ']')]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A pyrna.features.SecondaryStructure can find all its junctions (meaning the group of single-strands linking helices). \n",
    "\n",
    "According to the number of single-strands making a junction, we distinguish:\n",
    "    \n",
    "* an apical looop (a single single-strand)\n",
    "* an inner loop (two single-strands)\n",
    "* an three-way junction (three single-strands)\n",
    "* a four-way junction (four single-strands)\n",
    "* a n-way junction (n single-strands)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "junctions = []\n",
    "for ss in secondary_structures:\n",
    "    ss.find_junctions()\n",
    "    junctions += ss.junctions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A junction is described with:\n",
    "\n",
    "* a list of single-strands making the junction\n",
    "* a location: the positions in sequence of the residues \n",
    "* a description: a string made with the single-strand sequences joined with single space characters. In the description of a junction, the character '-' means a direct link (phosphodiester bond) between two helices. Consequently, we have no single-strand linking these helices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'single_strands': [{'name': 'SS_2', 'location': [13, 13]}], 'description': 'A -', 'location': [[12, 14], [103, 104]]}\n"
     ]
    }
   ],
   "source": [
    "print junctions[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can search for all GNRA apical loops..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Apical loop with sequence GAAA\n",
      "Apical loop with sequence GAAA\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "for junction in junctions:\n",
    "    if re.match('G[AUGC][AG]A',junction['description']):\n",
    "        print \"Apical loop with sequence %s\"%junction['description']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or for all three-way junctions..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Three-way junctions with sequence GUAU - -\n",
      "Three-way junctions with sequence GUAU - -\n"
     ]
    }
   ],
   "source": [
    "for junction in junctions:\n",
    "    if len(junction['location']) == 3:\n",
    "        print \"Three-way junctions with sequence %s\"%junction['description']"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [Root]",
   "language": "python",
   "name": "Python [Root]"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}