{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Intermine-Python: Tutorial 3: More about Constraints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the previous tutorial, we learnt how to add constraints to a query in order to filter its results. In this tutorial we will look at some more constraints and at the different types of constraint that are available. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from intermine.webservice import Service" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Connect to the FlyMine web service and start a query rooted at the Gene class\n", "service = Service(\"www.flymine.org/flymine/service\")\n", "query = service.new_query(\"Gene\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Unary Constraint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first type of constraint that we will look at is the Unary Constraint. A unary constraint does not take a value; it checks whether a particular attribute is present or absent. The unary operators are IS NULL and IS NOT NULL. We can look at a small example. 
" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_constraint(\"primaryIdentifier\",\"IS NOT NULL\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gene: briefDescription=None cytoLocation='-' description=None id=1000415 length=12653 name='zydeco' primaryIdentifier='FBgn0265767' score=None scoreType=None secondaryIdentifier='CG2893' symbol='zyd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1004698 length=1951 name=None primaryIdentifier='FBgn0039942' score=None scoreType=None secondaryIdentifier='CG17163' symbol='CG17163'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1005938 length=12892 name='Rho GTPase activating protein at 1A' primaryIdentifier='FBgn0025836' score=None scoreType=None secondaryIdentifier='CG40494' symbol='RhoGAP1A'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1007519 length=21475 name='verthandi' primaryIdentifier='FBgn0260987' score=None scoreType=None secondaryIdentifier='CG17436' symbol='vtd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1015398 length=14286 name='Maf1' primaryIdentifier='FBgn0267861' score=None scoreType=None secondaryIdentifier='CG40196' symbol='Maf1'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018843 length=12844 name=None primaryIdentifier='FBgn0039941' score=None scoreType=None secondaryIdentifier='CG17167' symbol='CG17167'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018936 length=11613 name=None primaryIdentifier='FBgn0040031' score=None scoreType=None secondaryIdentifier='CG12061' symbol='CG12061'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019191 length=45356 name=None 
primaryIdentifier='FBgn0039959' score=None scoreType=None secondaryIdentifier='CG17514' symbol='CG17514'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019474 length=21673 name=None primaryIdentifier='FBgn0039955' score=None scoreType=None secondaryIdentifier='CG41099' symbol='CG41099'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1020233 length=11099 name=None primaryIdentifier='FBgn0040056' score=None scoreType=None secondaryIdentifier='CG17698' symbol='CG17698'\n" ] } ], "source": [ "for row in query.rows(size=10):\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Binary Constraint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next type of constraint is a Binary Constraint. This refers to constraints that take a value. Most of the constraints that we looked at in the second tutorial were binary constraints. Binary constraints are the largest group of constraints. The operators are =,<=,>=,<,>,!=" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "= 12000>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_constraint(\"length\",\">=\",\"12000\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gene: briefDescription=None cytoLocation='-' description=None id=1000415 length=12653 name='zydeco' primaryIdentifier='FBgn0265767' score=None scoreType=None secondaryIdentifier='CG2893' symbol='zyd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1005938 length=12892 name='Rho GTPase activating protein at 1A' primaryIdentifier='FBgn0025836' score=None scoreType=None secondaryIdentifier='CG40494' symbol='RhoGAP1A'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1007519 length=21475 name='verthandi' primaryIdentifier='FBgn0260987' 
score=None scoreType=None secondaryIdentifier='CG17436' symbol='vtd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1015398 length=14286 name='Maf1' primaryIdentifier='FBgn0267861' score=None scoreType=None secondaryIdentifier='CG40196' symbol='Maf1'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018843 length=12844 name=None primaryIdentifier='FBgn0039941' score=None scoreType=None secondaryIdentifier='CG17167' symbol='CG17167'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019191 length=45356 name=None primaryIdentifier='FBgn0039959' score=None scoreType=None secondaryIdentifier='CG17514' symbol='CG17514'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019474 length=21673 name=None primaryIdentifier='FBgn0039955' score=None scoreType=None secondaryIdentifier='CG41099' symbol='CG41099'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1020460 length=32088 name='kelch like family member 10' primaryIdentifier='FBgn0040038' score=None scoreType=None secondaryIdentifier='CG12423' symbol='klhl10'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1058972 length=133933 name=None primaryIdentifier='FBgn0058006' score=None scoreType=None secondaryIdentifier='CG40006' symbol='CG40006'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1060939 length=76790 name='neverland' primaryIdentifier='FBgn0259697' score=None scoreType=None secondaryIdentifier='CG40050' symbol='nvd'\n" ] } ], "source": [ "for row in query.rows(size=10):\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above constraint is an example of a binary constraint. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Ternary Constraint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now look at Ternary constraints. 
A ternary constraint has one required value and one optional value. Currently, InterMine supports only one such operator: LOOKUP. The LOOKUP operator searches through the fields of a particular class for the value specified by the user. In the example given below, it searches the Gene class for any field containing an occurrence of \"zen\". The advantage is that you do not need to remember whether zen is a symbol, a name or a primaryIdentifier. However, this may lead to ambiguous results, so you can use the optional extra_value parameter to narrow the search (for example, to genes of a particular organism). " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query2 = service.new_query()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query2.add_constraint(\"Gene\", \"LOOKUP\", \"zen\", extra_value=\"D. melanogaster\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gene: briefDescription=None cytoLocation='84A5-84A5' description=None id=1007678 length=1331 name='zerknullt' primaryIdentifier='FBgn0004053' score=None scoreType=None secondaryIdentifier='CG1046' symbol='zen'\n" ] } ], "source": [ "for row in query2.rows():\n", "    print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Multi-Value Constraint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next constraint type that we will look at is the Multi-Value constraint, which allows a constraint to take a list of values. The two operators that are allowed are ONE OF and NONE OF. 
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query3=service.new_query(\"Gene\")" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query3.add_constraint(\"symbol\",\"NONE OF\",['zen','eve'])" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gene: briefDescription=None cytoLocation='-' description=None id=1000415 length=12653 name='zydeco' primaryIdentifier='FBgn0265767' score=None scoreType=None secondaryIdentifier='CG2893' symbol='zyd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1004698 length=1951 name=None primaryIdentifier='FBgn0039942' score=None scoreType=None secondaryIdentifier='CG17163' symbol='CG17163'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1005938 length=12892 name='Rho GTPase activating protein at 1A' primaryIdentifier='FBgn0025836' score=None scoreType=None secondaryIdentifier='CG40494' symbol='RhoGAP1A'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1007519 length=21475 name='verthandi' primaryIdentifier='FBgn0260987' score=None scoreType=None secondaryIdentifier='CG17436' symbol='vtd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1015398 length=14286 name='Maf1' primaryIdentifier='FBgn0267861' score=None scoreType=None secondaryIdentifier='CG40196' symbol='Maf1'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018843 length=12844 name=None primaryIdentifier='FBgn0039941' score=None scoreType=None secondaryIdentifier='CG17167' symbol='CG17167'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018936 length=11613 name=None primaryIdentifier='FBgn0040031' score=None scoreType=None 
secondaryIdentifier='CG12061' symbol='CG12061'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019191 length=45356 name=None primaryIdentifier='FBgn0039959' score=None scoreType=None secondaryIdentifier='CG17514' symbol='CG17514'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019474 length=21673 name=None primaryIdentifier='FBgn0039955' score=None scoreType=None secondaryIdentifier='CG41099' symbol='CG41099'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1020233 length=11099 name=None primaryIdentifier='FBgn0040056' score=None scoreType=None secondaryIdentifier='CG17698' symbol='CG17698'\n" ] } ], "source": [ "for row in query3.rows(size=10):\n", "    print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### List Constraint" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "List constraints work with named lists of objects: once a list exists, the operators IN and NOT IN restrict a query to objects that are, or are not, members of that list. An example is given below. The path in such a constraint must always be a class (for example, Gene is a valid path). The lists available in FlyMine can be found at: http://www.flymine.org/flymine/bag.do?subtab=view ."
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query4=service.new_query()" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query4.add_constraint(\"Gene\",\"IN\",\"PL FlyAtlas_brain_top\")" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gene: briefDescription=None cytoLocation='10A3-10A3' description=None id=1107310 length=2075 name=None primaryIdentifier='FBgn0030259' score=None scoreType=None secondaryIdentifier='CG1545' symbol='CG1545'\n", "Gene: briefDescription=None cytoLocation='11D8-11D8' description=None id=1039485 length=90456 name='radish' primaryIdentifier='FBgn0265597' score=None scoreType=None secondaryIdentifier='CG44424' symbol='rad'\n", "Gene: briefDescription=None cytoLocation='14A1-14A3' description=None id=1040836 length=26224 name='mind-meld' primaryIdentifier='FBgn0259110' score=None scoreType=None secondaryIdentifier='CG42252' symbol='mmd'\n", "Gene: briefDescription=None cytoLocation='16F3-16F5' description=None id=1059829 length=138941 name='Shaker' primaryIdentifier='FBgn0003380' score=None scoreType=None secondaryIdentifier='CG12348' symbol='Sh'\n", "Gene: briefDescription=None cytoLocation='18C2-18C3' description=None id=1076481 length=21373 name='nicotinic Acetylcholine Receptor alpha7' primaryIdentifier='FBgn0086778' score=None scoreType=None secondaryIdentifier='CG32538' symbol='nAChRalpha7'\n", "Gene: briefDescription=None cytoLocation='19A4-19A4' description=None id=1174409 length=58027 name='Dopamine 2-like receptor' primaryIdentifier='FBgn0053517' score=None scoreType=None secondaryIdentifier='CG33517' symbol='Dop2R'\n", "Gene: briefDescription=None cytoLocation='24C9-24D1' description=None id=1008362 length=88786 
name='friend of echinoid' primaryIdentifier='FBgn0051774' score=None scoreType=None secondaryIdentifier='CG31774' symbol='fred'\n", "Gene: briefDescription=None cytoLocation='25B4-25B4' description=None id=1699681 length=1564 name=None primaryIdentifier='FBgn0031650' score=None scoreType=None secondaryIdentifier='CG14044' symbol='CG14044'\n", "Gene: briefDescription=None cytoLocation='30D1-30E1' description=None id=1433896 length=92934 name='nicotinic Acetylcholine Receptor alpha6' primaryIdentifier='FBgn0032151' score=None scoreType=None secondaryIdentifier='CG4128' symbol='nAChRalpha6'\n", "Gene: briefDescription=None cytoLocation='42D4-42D6' description=None id=1181524 length=10005 name=None primaryIdentifier='FBgn0033108' score=None scoreType=None secondaryIdentifier='CG15236' symbol='CG15236'\n" ] } ], "source": [ "for row in query4.rows(size=10):\n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Sub-Class Constraints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The intermine database is a hierarchical database. Sub-class constraints allow you to specify a sub-class of a class to constrain a path to. This basically allows us to constrain our results to only those items of the sub class. The example below is an example of a sub-class constraint. 
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query5=service.new_query(\"Gene\")" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query5.add_constraint(\"ontologyAnnotations\",\"GOAnnotation\")" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gene: briefDescription=None cytoLocation='-' description=None id=1000415 length=12653 name='zydeco' primaryIdentifier='FBgn0265767' score=None scoreType=None secondaryIdentifier='CG2893' symbol='zyd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1005938 length=12892 name='Rho GTPase activating protein at 1A' primaryIdentifier='FBgn0025836' score=None scoreType=None secondaryIdentifier='CG40494' symbol='RhoGAP1A'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1007519 length=21475 name='verthandi' primaryIdentifier='FBgn0260987' score=None scoreType=None secondaryIdentifier='CG17436' symbol='vtd'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1015398 length=14286 name='Maf1' primaryIdentifier='FBgn0267861' score=None scoreType=None secondaryIdentifier='CG40196' symbol='Maf1'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018843 length=12844 name=None primaryIdentifier='FBgn0039941' score=None scoreType=None secondaryIdentifier='CG17167' symbol='CG17167'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1018936 length=11613 name=None primaryIdentifier='FBgn0040031' score=None scoreType=None secondaryIdentifier='CG12061' symbol='CG12061'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019191 length=45356 name=None primaryIdentifier='FBgn0039959' score=None scoreType=None 
secondaryIdentifier='CG17514' symbol='CG17514'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1019474 length=21673 name=None primaryIdentifier='FBgn0039955' score=None scoreType=None secondaryIdentifier='CG41099' symbol='CG41099'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1020233 length=11099 name=None primaryIdentifier='FBgn0040056' score=None scoreType=None secondaryIdentifier='CG17698' symbol='CG17698'\n", "Gene: briefDescription=None cytoLocation='-' description=None id=1020460 length=32088 name='kelch like family member 10' primaryIdentifier='FBgn0040038' score=None scoreType=None secondaryIdentifier='CG12423' symbol='klhl10'\n" ] } ], "source": [ "for row in query5.rows(size=10):\n", "    print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Unlike most constraints, sub-class constraints do not take an operator: add_constraint is called with just the path and the name of the sub-class. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Loop Constraints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Loop constraints assert that two paths refer to the same object (IS) or to different objects (IS NOT). These are the only valid operators. The Path and LoopPath in such a constraint must always be a class (for example, Gene is a valid path). The operators IS and IS NOT map to = and != respectively when the query is serialised to XML. A Loop Constraint is shown in the example below. 
" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query = service.new_query(\"Gene\")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_view(\"homologues.gene.primaryIdentifier\", \"homologues.homologue.primaryIdentifier\")" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_constraint(\"Gene\", \"IN\", \"H. sapiens orthologues of FlyTF_site_specific_1\", code=\"A\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_constraint(\"homologues.homologue\", \"IS NOT\", \"Gene\", code=\"B\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1390 10488 \n", "1390 1385 \n", "1390 1388 \n", "1390 148327 \n", "1390 22926 \n", "1390 286319 \n", "1390 466 \n", "1390 64764 \n", "1390 84699 \n", "1390 90993 \n" ] } ], "source": [ "for row in query.rows(size=10):\n", "    print(row[\"homologues.gene.primaryIdentifier\"], row[\"homologues.homologue.primaryIdentifier\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Range Constraints" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Range constraints test where a value lies relative to a set of ranges. The value of the constrained path must stand in the relationship named by the operator to the set of ranges passed in. 
Valid operators are OVERLAPS, DOES NOT OVERLAP, WITHIN, OUTSIDE, CONTAINS and DOES NOT CONTAIN. Here is an example of a Range Constraint. " ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "query = service.new_query()" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_view(\"SequenceFeature.organism.shortName\", \"SequenceFeature.chromosomeLocation.locatedOn.primaryIdentifier\", \"SequenceFeature.chromosomeLocation.start\", \"SequenceFeature.chromosomeLocation.end\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "query.add_constraint(\"chromosomeLocation\", \"OVERLAPS\", [\"X:94248091..143371935\"])" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SequenceFeature: organism.shortName='H. sapiens' chromosomeLocation.locatedOn.primaryIdentifier='X' chromosomeLocation.start=62462543 chromosomeLocation.end=114281198 \n", "SequenceFeature: organism.shortName='H. sapiens' chromosomeLocation.locatedOn.primaryIdentifier='X' chromosomeLocation.start=94309037 chromosomeLocation.end=94319036 \n", "SequenceFeature: organism.shortName='H. sapiens' chromosomeLocation.locatedOn.primaryIdentifier='X' chromosomeLocation.start=94309037 chromosomeLocation.end=94320117 \n", "SequenceFeature: organism.shortName='H. 
sapiens' chromosomeLocation.locatedOn.primaryIdentifier='X' chromosomeLocation.start=94314037 chromosomeLocation.end=94319036 \n" ] } ], "source": [ "for row in query.rows(size=4): \n", " print(row)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial summed up some of the important constraint types. In the next tutorial we will look at some of the other features of a query. " ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }