{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "PyRNA stores and describes an RNA secondary structure using two different data structures:\n", "\n", "* a pandas Dataframe\n", "* a pyrna.features.SecondaryStructure object\n", "\n", "The pandas Dataframe lists all the base pairs (a.k.a interactions) making the secondary structure. If a pandas Dataframe restricts the exploration of a secondary structure to the level of the base pairs, a pyrna.features.SecondaryStructure gives us access to high-level objects like: \n", "\n", "* helices,\n", "* single-strands,\n", "* junctions.\n", "\n", "In both data structures, the edges of the base pairs are described with three different characters:\n", "\n", " * '(' or ')' for Watson-Crick edges,\n", " * '[' or ']' for Hoogsteen edges,\n", " * '{' or '}' for Sugar edges.\n", " \n", "A base pair can also be in a cis ('c') or trans ('t') orientation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creation of secondary structures from scratch" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " orientation edge1 edge2 pos1 pos2\n", "0 c ( ) 4 10\n", "1 c { ) 3 11\n", "2 c ( ) 2 12\n", "3 c ( ) 1 13\n" ] } ], "source": [ "import json #to have a better output for dict describing pyrna.features.RNA objects (interactions, helices, single-strands,..)\n", "\n", "from pyrna.parsers import parse_bn\n", "rna = RNA(name = 'my_rna', sequence = 'GGGGGACAACCCC')\n", "bn = '(({(.....))))'\n", "base_pairs = parse_bn(bn)\n", "print base_pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we transform this list of base-pairs into a pyrna.features.SecondaryStructure object." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pyrna.parsers import base_pairs_to_secondary_structure\n", "ss = base_pairs_to_secondary_structure(rna, base_pairs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can browse over the helices. An helix contains a field to store non-canonical base pairs (meaning base pairs different from c() AU, c() GU or c() GC. " ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"length\": 4, \n", " \"interactions\": [\n", " {\n", " \"edge1\": \"{\", \n", " \"edge2\": \")\", \n", " \"orientation\": \"c\", \n", " \"location\": [\n", " [\n", " 3, \n", " 3\n", " ], \n", " [\n", " 11, \n", " 11\n", " ]\n", " ]\n", " }\n", " ], \n", " \"name\": \"H1\", \n", " \"location\": [\n", " [\n", " 1, \n", " 4\n", " ], \n", " [\n", " 10, \n", " 13\n", " ]\n", " ]\n", "}\n", "\n", "Any non canonical base pairs in this helix?\n", "\n", "{\n", " \"edge1\": \"{\", \n", " \"edge2\": \")\", \n", " \"orientation\": \"c\", \n", " \"location\": [\n", " [\n", " 3, \n", " 3\n", " ], \n", " [\n", " 11, \n", " 11\n", " ]\n", " ]\n", "}\n" ] } ], "source": [ "for helix in ss.helices:\n", " print json.dumps(helix, indent = 2)\n", " print \"\\nAny non canonical base pairs in this helix?\\n\"\n", " for interaction in helix['interactions']:\n", " print json.dumps(interaction, indent = 2) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or over the single-strands..." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"name\": \"SS1\", \n", " \"location\": [\n", " 5, \n", " 9\n", " ]\n", "}\n" ] } ], "source": [ "for single_strand in ss.single_strands:\n", " print json.dumps(single_strand, indent = 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Any high-level object from a pyrna.features.SecondaryStructure is described with a \"location\". A location is a N x 2 matrix where N is the number of blocks of contiguous residues, each block being described with its start and end positions.\n", "\n", "An helix location will be a matrix like [ [1,4] , [10,13] ], meaning that the first strand is between residues 1 and 4, and the second strand is between residues 10 and 13.\n", "\n", "A base pair location will be a matrix like [ [3,3] , [11,11] ], meaning a base pair between residues 3 and 11." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From a pyrna.features.SecondaryStructure object, we can easily go back to a list of base pairs stored in a pandas Dataframe." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " edge1 edge2 orientation pos1 pos2\n", "0 ( ) c 1 13\n", "1 ( ) c 2 12\n", "2 { ) c 3 11\n", "3 ( ) c 4 10\n" ] } ], "source": [ "from pyrna.parsers import secondary_structure_to_base_pairs\n", "print secondary_structure_to_base_pairs(ss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can create a new secondary structure containing a single tertiary interaction. A tertiary interaction is defined as a base pair which is not contiguous to another base-pair." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " orientation edge1 edge2 pos1 pos2\n", "0 c ( ) 6 12\n", "1 c ( ) 4 15\n", "2 c { ) 3 16\n", "3 c ( ) 2 17\n", "4 c ( ) 1 18\n" ] } ], "source": [ "rna = RNA(name = 'my_rna', sequence = 'GGGGGACGCAGTAACCCC')\n", "bn = '(({(.(.....)..))))'\n", "base_pairs = parse_bn(bn)\n", "print base_pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The single tertiary interaction is stored in a pyrna.features.SecondaryStructure object." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{\n", " \"edge1\": \"(\", \n", " \"edge2\": \")\", \n", " \"orientation\": \"c\", \n", " \"location\": [\n", " [\n", " 6, \n", " 6\n", " ], \n", " [\n", " 12, \n", " 12\n", " ]\n", " ]\n", "}\n" ] } ], "source": [ "ss = base_pairs_to_secondary_structure(rna, base_pairs)\n", "for tertiary_interaction in ss.tertiary_interactions:\n", " print json.dumps(tertiary_interaction, indent = 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to go back to a list of base pairs in a pandas Dataframe, we need to precise if we want to keep the tertiary interactions..." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " edge1 edge2 orientation pos1 pos2\n", "0 ( ) c 1 18\n", "1 ( ) c 2 17\n", "2 { ) c 3 16\n", "3 ( ) c 4 15\n", "4 ( ) c 6 12\n" ] } ], "source": [ "print secondary_structure_to_base_pairs(ss, keep_tertiaries=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...or not" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " edge1 edge2 orientation pos1 pos2\n", "0 ( ) c 1 18\n", "1 ( ) c 2 17\n", "2 { ) c 3 16\n", "3 ( ) c 4 15\n" ] } ], "source": [ "print secondary_structure_to_base_pairs(ss)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creation of secondary structures from files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyRNA is able to parse several file formats describing secondary structures. " ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">ft3100 FANTOM3\n", "TAACAATCTGCTGAAAGGTACCGTCGGAGGGAGCTTTGTTGCCAGCGCCA\n", "GAAACGCCGGTTTAACCAGCGCCGAAGTGAGCGCAGTGATTAAAGCCATG\n", "CAGTGGCAAATGGATTTCCGCAAACTGAAAAAAGGCGATGAATTTGCGGT\n", ".......((((.....(((((((.((...((.(((........))).)).\n", "....)).))))...)))..((((..(((..(((....((((...(((((.\n", "..))))).....))))..)))..))).......))))........)))).\n" ] } ], "source": [ "from pyrna.parsers import parse_vienna\n", "\n", "h = open('../data/ft3100_2D_with_bracket_notation.fasta')\n", "vienna_content = h.read()\n", "h.close()\n", "\n", "print vienna_content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a Vienna file, several molecules and bracket notations can be stored. Consequently, the function parse_vienna() returns a list of RNA objects and a list of pandas Dataframes." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " orientation edge1 edge2 pos1 pos2\n", "0 c ( ) 35 44\n", "1 c ( ) 34 45\n", "2 c ( ) 33 46\n", "3 c ( ) 31 48\n", "4 c ( ) 30 49\n", "5 c ( ) 26 55\n", "6 c ( ) 25 56\n", "7 c ( ) 23 58\n", "8 c ( ) 22 59\n", "9 c ( ) 21 60\n", "10 c ( ) 20 61\n", "11 c ( ) 19 65\n", "12 c ( ) 18 66\n", "13 c ( ) 17 67\n", "14 c ( ) 99 103\n", "15 c ( ) 98 104\n", "16 c ( ) 97 105\n", "17 c ( ) 96 106\n", "18 c ( ) 95 107\n", "19 c ( ) 91 113\n", "20 c ( ) 90 114\n", "21 c ( ) 89 115\n", "22 c ( ) 88 116\n", "23 c ( ) 83 119\n", "24 c ( ) 82 120\n", "25 c ( ) 81 121\n", "26 c ( ) 78 124\n", "27 c ( ) 77 125\n", "28 c ( ) 76 126\n", "29 c ( ) 73 134\n", "30 c ( ) 72 135\n", "31 c ( ) 71 136\n", "32 c ( ) 70 137\n", "33 c ( ) 11 146\n", "34 c ( ) 10 147\n", "35 c ( ) 9 148\n", "36 c ( ) 8 149\n" ] } ], "source": [ "all_molecules, all_base_pairs = parse_vienna(vienna_content)\n", "\n", "for base_pairs in all_base_pairs:\n", " print base_pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creation of secondary structures from databases" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "PyRNA allows to recover Rfam entries directly from the database website." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pyrna.db import Rfam\n", "rfam = Rfam(use_website = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function get_entry() returns:\n", " \n", "* a list of gapped RNA objects\n", "* a dictionary to map species labels to name/start-end\n", "* a pandas Dataframe listing the paired columns in the Rfam alignment (a.k.a a consensus secondary structure)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "sequence for AE014074.1/464992-464896: UAUUUCACAAAGGAGUGCUU---------------------------------------------------------UG-GCUGAGAUCGCAA--------------------------------------------------UUGCGAAA-UCCUGAGG-ACCUGA-UCUUGUUAGUACAAGCG-UAGGGA--UUGUGACCAAUAAUCAA\n", "sequence for AL935263.2/99249-99354: UUUAAACACUAGGGGUGUCCAAAA---------------------------------------------------AUGG-GCUGAGAUGGUGCUGUAA---------------------------------------------GUACCGAU-CCCUUUGA-ACCUG--U-AAGCUCAAACUUGCG-UAGGAA--AGUGUCACAGCUAAUGU\n", "sequence for BX248356.1/232868-232999: UUUAUAAAUCACGGGUGCUGGACGGCAUACGUUUGCC-------------------------------------ACAAA-GCUGAGACAGGGCGAGAAGACGUGCACG-----------------------------------UCCCUGAA-CCGUUGA--ACCUGA-UCCGGGUAAUACCGGCGAUAGGAA--GAAUAAUGAACCGAUCG\n", "sequence for AK120238.1/2075-2184: CUAUGUUAGGAGGUGGCCUCUUGGCCUGGAUUGUUGUGA-----------------------------------AUUGG-GCUGAGAA------------------------------------------------------------AGU-CCCUUUGA-ACCUGA-ACAGGAUAAUGCCUGCG-AAGGGA-GUGUGCAUUUCUACUUUU\n", "sequence for X54035.1/52-153: GAUGACCACAAGGGGAGCAUU-------------------------------------------------------AAA-GCUGAGAGUGAGCGGUUUC--------------------------------------------GUUC-UGA-CCCUUUGA-ACCUG--U-UAGUUAACGCUGGCG-UAGGGA--UGUGGCAAAUGCAAUAG\n", "sequence for AE008691.1/1524796-1524694: CUCAAGUGCUAGGGGAGCCAAA-----------------------------------------------------AAUG-GCUGAGAGGGGAUUC------------------------------------------------GUUCCCGA-CCCUUAGA-ACCUGA-CCAGGGUAAUGCCUGCG-AAGGGA--AGCACGUUUUACGAAAU\n", "sequence for AB087258.1/15478-15568: CCGACUUAGUCGGGGUGCUGAU-----------------------------------------------------AACA-GCUGAGAUA-----------------------------------------------------------AUA-CCCGUGA--ACCUGA-UACAGUUAAUACUGACG-UAGGAA--ACUAAUGGUCUUACUCC\n", "sequence for AE008691.1/2140374-2140272: UCAAAGUGCUAGGGGAGCCAGA-----------------------------------------------------AAUG-GCUGAGAGGGGAUUC------------------------------------------------GUUCCCGA-CCCUUAGA-ACCUGA-CCAGGGUAAUGCCUGCG-AAGGGA--AGCACGUUUUACGGAAU\n", "sequence for AL939111.1/162719-162830: CAAGGCACUCGCGGGAGCCCGGACGC------------------------------------------------ACCGG-GCUGAGAGGGAGGCUGGCG--------------------------------------------GCCUCCGA-CCGUACGA-ACCUGA-UCCGGGUCAUGCCGGCG-AAGGGA--GGGGCUGGACGCCCAUG\n", "sequence for AE017221.1/302202-302311: GGGGCAGGCUAGGGGUGCCCGAAUG-------------------------------------------------GAAGG-GCUGAGAGCUGGGUUUCU---------------------------------------------CCCAGCAA-CCCUUGGA-ACCUGA-UCCGGGUAAUGCCGGCG-GAGGGA--AGCCUAUGCGGAAGACC\n" ] } ], "source": [ "gapped_rnas, organism_names_2_nse, consensus_2d = rfam.get_entry(rfam_id='RF00059')\n", "\n", "for gapped_rna in gapped_rnas[:10]:\n", " print \"sequence for %s: %s\"%(gapped_rna.name, gapped_rna.sequence)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function consensus2d_to_base_pairs() allows to easily infer the secondary structure for a given RNA sequence" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "AE014074.1/464992-464896\n", "UAUUUCACAAAGGAGUGCUUUGGCUGAGAUCGCAAUUGCGAAAUCCUGAGGACCUGAUCUUGUUAGUACAAGCGUAGGGAUUGUGACCAAUAAUCAA\n", ".....[[[[[((((..(((()))).....(((((())))))..)))).....((((..((((......))))..))))..]]]]]............\n", "AL935263.2/99249-99354\n", "UUUAAACACUAGGGGUGUCCAAAAAUGGGCUGAGAUGGUGCUGUAAGUACCGAUCCCUUUGAACCUGUAAGCUCAAACUUGCGUAGGAAAGUGUCACAGCUAAUGU\n", ".....[[[[[((((..(((((....))))).....((((((.....))))))..)))).....((((.(((......)))...))))..]]]]]............\n", "BX248356.1/232868-232999\n", "UUUAUAAAUCACGGGUGCUGGACGGCAUACGUUUGCCACAAAGCUGAGACAGGGCGAGAAGACGUGCACGUCCCUGAACCGUUGAACCUGAUCCGGGUAAUACCGGCGAUAGGAAGAAUAAUGAACCGAUCG\n", ".....[[[[[((((..(((((..................))))).....((((((...............))))))..))))....((((..((((......))))...))))..]]]]]............\n", "AK120238.1/2075-2184\n", "CUAUGUUAGGAGGUGGCCUCUUGGCCUGGAUUGUUGUGAAUUGGGCUGAGAAAGUCCCUUUGAACCUGAACAGGAUAAUGCCUGCGAAGGGAGUGUGCAUUUCUACUUUU\n", ".....[[[[[((((..(((((....................))))).....()..)))).....((((..((((......))))..))))...]]]]]............\n", "X54035.1/52-153\n", "GAUGACCACAAGGGGAGCAUUAAAGCUGAGAGUGAGCGGUUUCGUUCUGACCCUUUGAACCUGUUAGUUAACGCUGGCGUAGGGAUGUGGCAAAUGCAAUAG\n", ".....[[[[[((((..((((())))).....(.((((......)))))..)))).....((((.(((......)))...))))..]]]]]............\n", "AE008691.1/1524796-1524694\n", "CUCAAGUGCUAGGGGAGCCAAAAAUGGCUGAGAGGGGAUUCGUUCCCGACCCUUAGAACCUGACCAGGGUAAUGCCUGCGAAGGGAAGCACGUUUUACGAAAU\n", ".....[[[[[((((..(((((..))))).....((((((..))))))..)))).....((((..((((......))))..))))..]]]]]............\n", "AB087258.1/15478-15568\n", "CCGACUUAGUCGGGGUGCUGAUAACAGCUGAGAUAAUACCCGUGAACCUGAUACAGUUAAUACUGACGUAGGAAACUAAUGGUCUUACUCC\n", ".....[[[[[((((..(((((..))))).....(.)..))))....((((..((((......))))..))))..]]]]]............\n", "AE008691.1/2140374-2140272\n", "UCAAAGUGCUAGGGGAGCCAGAAAUGGCUGAGAGGGGAUUCGUUCCCGACCCUUAGAACCUGACCAGGGUAAUGCCUGCGAAGGGAAGCACGUUUUACGGAAU\n", ".....[[[[[((((..(((((..))))).....((((((..))))))..)))).....((((..((((......))))..))))..]]]]]............\n", "AL939111.1/162719-162830\n", "CAAGGCACUCGCGGGAGCCCGGACGCACCGGGCUGAGAGGGAGGCUGGCGGCCUCCGACCGUACGAACCUGAUCCGGGUCAUGCCGGCGAAGGGAGGGGCUGGACGCCCAUG\n", ".....[[[[[((((..(((((.......))))).....((((((......))))))..)))).....((((..((((......))))..))))..]]]]]............\n", "AE017221.1/302202-302311\n", "GGGGCAGGCUAGGGGUGCCCGAAUGGAAGGGCUGAGAGCUGGGUUUCUCCCAGCAACCCUUGGAACCUGAUCCGGGUAAUGCCGGCGGAGGGAAGCCUAUGCGGAAGACC\n", ".....[[[[[((((..(((((......))))).....((((((.....))))))..)))).....((((..((((......))))..))))..]]]]]............\n" ] } ], "source": [ "from pyrna.parsers import consensus2d_to_base_pairs, to_bn\n", "\n", "for gapped_rna in gapped_rnas[:10]:\n", " rna = RNA(name = gapped_rna.name, sequence = gapped_rna.sequence.replace('-',''))\n", " print rna.name\n", " print rna.sequence\n", " print to_bn(consensus2d_to_base_pairs(gapped_rna, consensus_2d), len(rna))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Creation of secondary structures from algorithms" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " orientation edge1 edge2 pos1 pos2\n", "0 c ( ) 41 48\n", "1 c ( ) 40 49\n", "2 c ( ) 39 50\n", "3 c ( ) 38 51\n", "4 c ( ) 37 52\n", "5 c ( ) 56 67\n", "6 c ( ) 55 68\n", "7 c ( ) 54 69\n", "8 c ( ) 53 70\n", "9 c ( ) 35 72\n", "10 c ( ) 32 75\n", "11 c ( ) 31 76\n", "12 c ( ) 29 77\n", "13 c ( ) 28 78\n", "14 c ( ) 27 79\n", "15 c ( ) 24 80\n", "16 c ( ) 23 81\n", "17 c ( ) 20 84\n", "18 c ( ) 19 85\n", "19 c ( ) 18 86\n", "20 c ( ) 17 87\n", "21 c ( ) 14 90\n", "22 c ( ) 13 91\n", "23 c ( ) 11 93\n", "24 c ( ) 10 94\n", "25 c ( ) 5 96\n", "26 c ( ) 4 97\n", "27 c ( ) 3 98\n", "28 c ( ) 2 99\n", "29 c ( ) 1 100\n" ] } ], "source": [ "from pyrna.computations import Rnafold\n", "from pyrna.features import RNA\n", "rna = RNA(name = 'my_rna', sequence = 'GGGGTAGGGACGGTAGGGGGACGCAGTGCAGTAACGTACCCGGTAGGGGGTAGGGGGACGCAGTAACCCCGGGGACGCAGTAACCCCACGCAGTAACCCC')\n", "rnafold = Rnafold() #the algorithm is launched locally, using a Docker image\n", "base_pairs = rnafold.fold(rna)\n", "print base_pairs" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute the 2D coordinates for a plot." ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " x y\n", "1 51.114 -18.631\n", "2 51.114 -3.631\n", "3 51.114 11.369\n", "4 51.114 26.369\n", "5 51.114 41.369\n", "6 43.659 46.738\n", "7 39.542 55.107\n", "8 39.850 64.568\n", "9 44.640 72.894\n", "10 52.900 78.063\n", "11 54.298 92.998\n", "12 47.780 106.391\n", "13 56.770 119.410\n", "14 58.167 134.345\n", "15 48.494 145.613\n", "16 49.699 160.779\n", "17 61.583 170.848\n", "18 62.981 185.783\n", "19 64.379 200.718\n", "20 65.777 215.652\n", "21 56.104 226.921\n", "22 57.308 242.086\n", "23 69.193 252.156\n", "24 70.590 267.091\n", "25 68.168 271.394\n", "26 68.550 277.290\n", "27 72.568 282.901\n", "28 74.891 297.719\n", "29 77.215 312.538\n", "30 75.104 319.761\n", ".. ... ...\n", "71 121.623 385.454\n", "72 104.173 375.762\n", "73 112.813 362.801\n", "74 109.780 347.892\n", "75 97.371 339.735\n", "76 94.587 324.996\n", "77 92.034 310.215\n", "78 89.710 295.396\n", "79 87.387 280.577\n", "80 85.525 265.693\n", "81 84.128 250.758\n", "82 93.937 238.658\n", "83 92.307 223.532\n", "84 80.711 214.255\n", "85 79.314 199.320\n", "86 77.916 184.385\n", "87 76.518 169.451\n", "88 86.328 157.351\n", "89 84.698 142.225\n", "90 73.102 132.947\n", "91 71.704 118.012\n", "92 78.122 103.551\n", "93 69.233 91.600\n", "94 67.835 76.666\n", "95 78.142 58.016\n", "96 66.114 41.369\n", "97 66.114 26.369\n", "98 66.114 11.369\n", "99 66.114 -3.631\n", "100 66.114 -18.631\n", "\n", "[100 rows x 2 columns]\n" ] } ], "source": [ "from pyrna.computations import Rnaplot\n", "rnaplot = Rnaplot()\n", "plot = rnaplot.plot(secondary_structure = base_pairs, rna = rna)\n", "print plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also compute a secondary structure by annotating a tertiary one. First, we need to import a 3D structure from the Protein Databank." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from pyrna.db import PDB\n", "from pyrna.parsers import parse_pdb\n", "pdb = PDB()\n", "tertiary_structures = parse_pdb(pdb.get_entry('1HR2'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With PyRNA, a pyrna.features.TertiaryStructure object is made with a single molecular chain. Consequently, the function parse_pdb returns a list of such objects. We can iterate over these pyrna.features.TertiaryStructure objects to annotate them with the algorithm Rnaview (Rnaview is available for computations from my own server)." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Molecular chain A\n", " edge1 edge2 orientation pos1 pos2\n", "0 ( ) c 4 112\n", "1 ( ) c 5 111\n", "2 ( ) c 6 110\n", "3 ( ) c 7 109\n", "4 ( ) c 8 108\n", "5 ( ) c 9 107\n", "6 ( ) c 10 106\n", "7 { ] t 11 105\n", "8 [ } t 12 104\n", "9 ( ) c 14 103\n", "10 ( ) c 15 102\n", "11 ( ) c 16 101\n", "12 ( ) c 17 100\n", "13 ( ) c 18 99\n", "14 ( ) c 19 98\n", "15 ( ) c 24 94\n", "16 ( ) c 25 93\n", "17 ( ) c 26 92\n", "18 ( ) c 27 91\n", "19 ( ) c 30 89\n", "20 ( ) c 31 88\n", "21 ( ) c 32 87\n", "22 ( ) c 34 80\n", "23 ( ) c 35 79\n", "24 ( ) c 36 78\n", "25 [ } t 37 62\n", "26 [ } t 38 61\n", "27 ( ) c 39 60\n", "28 ( ) c 40 59\n", "29 ( ) c 41 58\n", ".. ... ... ... ... ...\n", "53 ( ] t 21 96\n", "54 ( } t 22 99\n", "55 ( ] c 33 85\n", "56 { ] t 48 51\n", "57 ( ) t 49 145\n", "58 { } t 51 147\n", "59 [ ) t 66 86\n", "60 { ] c 69 70\n", "61 { } t 79 84\n", "62 { } t 82 109\n", "63 { } c 95 98\n", "64 { ] c 114 115\n", "65 ( ] t 121 145\n", "66 { ] c 122 123\n", "67 ( ) c 123 146\n", "68 ! ! ! 21 98\n", "69 ! ! ! 111 157\n", "70 ! ! ! 124 146\n", "71 ! ! ! 9 66\n", "72 ! ! ! 19 21\n", "73 ! ! ! 50 147\n", "74 ! ! ! 61 75\n", "75 ! ! ! 1 115\n", "76 ! ! ! 8 81\n", "77 ! ! ! 24 100\n", "78 ! ! ! 50 121\n", "79 ! ! ! 51 120\n", "80 ! ! ! 52 148\n", "81 ! ! ! 77 78\n", "82 ! ! ! 112 157\n", "\n", "[83 rows x 5 columns]\n", "Molecular chain B\n", " edge1 edge2 orientation pos1 pos2\n", "0 ( ) c 2 115\n", "1 ( ) c 3 114\n", "2 ( ) c 4 113\n", "3 ( ) c 6 112\n", "4 ( ) c 7 111\n", "5 ( ) c 8 110\n", "6 ( ) c 9 109\n", "7 ( ) c 10 108\n", "8 ( ) c 11 107\n", "9 { ] t 12 106\n", "10 [ } t 13 105\n", "11 ( ) c 15 104\n", "12 ( ) c 16 103\n", "13 ( ) c 17 102\n", "14 ( ) c 18 101\n", "15 ( ) c 19 100\n", "16 ( ) c 20 99\n", "17 ( ) c 25 95\n", "18 ( ) c 26 94\n", "19 ( ) c 27 93\n", "20 ( ) c 28 92\n", "21 ( ) c 30 91\n", "22 ( ) c 31 90\n", "23 ( ) c 32 89\n", "24 ( ) c 33 88\n", "25 ( ) c 35 81\n", "26 ( ) c 36 80\n", "27 ( ) c 37 79\n", "28 [ } t 38 63\n", "29 [ } t 39 62\n", ".. ... ... ... ... ...\n", "59 [ ) c 7 158\n", "60 { } c 9 82\n", "61 { ) t 10 67\n", "62 ( } t 23 100\n", "63 ( ] c 34 86\n", "64 { ] t 49 52\n", "65 ( ) t 50 146\n", "66 { } t 52 148\n", "67 [ ) t 67 87\n", "68 { ] c 70 71\n", "69 { } t 80 85\n", "70 { } t 83 110\n", "71 { } c 96 99\n", "72 { ] c 116 117\n", "73 ( ] t 122 146\n", "74 { ] c 123 124\n", "75 ( ) c 124 147\n", "76 ! ! ! 14 105\n", "77 ! ! ! 125 147\n", "78 ! ! ! 10 87\n", "79 ! ! ! 20 22\n", "80 ! ! ! 23 24\n", "81 ! ! ! 51 148\n", "82 ! ! ! 62 76\n", "83 ! ! ! 8 83\n", "84 ! ! ! 25 101\n", "85 ! ! ! 51 122\n", "86 ! ! ! 52 121\n", "87 ! ! ! 53 149\n", "88 ! ! ! 98 99\n", "\n", "[89 rows x 5 columns]\n" ] } ], "source": [ "secondary_structures = []\n", "for ts in tertiary_structures:\n", " #the function annotate() from Rnaview returns a pyrna.features.SecondaryStructure object and its 3D counterpart as a pyrna.features.TertiaryStructure object\n", " secondary_structure, tertiary_structure = Rnaview().annotate(ts)\n", " secondary_structures.append(secondary_structure)\n", " #the function secondary_structure_to_base_pairs() transform a pyrna.features.SecondaryStructure object into a list of base pairs stored in a pandas Dataframe\n", " print \"Molecular chain %s\"%secondary_structure.rna.name\n", " print secondary_structure_to_base_pairs(secondary_structure, keep_tertiaries = True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Mining of a Secondary Structure" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can search for all base pairs made with at least one Hoogsteen edge. Since the variable base_pairs stores a pandas Dataframe, we're using the operators available from pandas (http://pandas.pydata.org/pandas-docs/stable/indexing.html)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " edge1 edge2 orientation pos1 pos2\n", "7 { ] t 11 105\n", "8 [ } t 12 104\n", "25 [ } t 37 62\n", "26 [ } t 38 61\n", "52 [ ) c 4 156\n", "53 ( ] t 21 96\n", "55 ( ] c 33 85\n", "56 { ] t 48 51\n", "59 [ ) t 66 86\n", "60 { ] c 69 70\n", "64 { ] c 114 115\n", "65 ( ] t 121 145\n", "66 { ] c 122 123\n" ] } ], "source": [ "base_pairs = secondary_structure_to_base_pairs(secondary_structures[0], keep_tertiaries = True)\n", "print base_pairs[(base_pairs['edge1'] == '[') | (base_pairs['edge2'] == ']')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A pyrna.features.SecondaryStructure can find all its junctions (meaning the group of single-strands linking helices). \n", "\n", "According to the number of single-strands making a junction, we distinguish:\n", " \n", "* an apical looop (a single single-strand)\n", "* an inner loop (two single-strands)\n", "* an three-way junction (three single-strands)\n", "* a four-way junction (four single-strands)\n", "* a n-way junction (n single-strands)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "junctions = []\n", "for ss in secondary_structures:\n", " ss.find_junctions()\n", " junctions += ss.junctions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A junction is described with:\n", "\n", "* a list of single-strands making the junction\n", "* a location: the positions in sequence of the residues \n", "* a description: a string made with the single-strand sequences joined with single space characters. In the description of a junction, the character '-' means a direct link (phosphodiester bond) between two helices. Consequently, we have no single-strand linking these helices." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'single_strands': [{'name': 'SS_2', 'location': [13, 13]}], 'description': 'A -', 'location': [[12, 14], [103, 104]]}\n" ] } ], "source": [ "print junctions[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can search for all GNRA apical loops..." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Apical loop with sequence GAAA\n", "Apical loop with sequence GAAA\n" ] } ], "source": [ "import re\n", "for junction in junctions:\n", " if re.match('G[AUGC][AG]A',junction['description']):\n", " print \"Apical loop with sequence %s\"%junction['description']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or for all three-way junctions..." ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Three-way junctions with sequence GUAU - -\n", "Three-way junctions with sequence GUAU - -\n" ] } ], "source": [ "for junction in junctions:\n", " if len(junction['location']) == 3:\n", " print \"Three-way junctions with sequence %s\"%junction['description']" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [Root]", "language": "python", "name": "Python [Root]" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 0 }