{ "metadata": { "name": "", "signature": "sha256:e0a1dfaeb3f9fadf3cd3cfef57a2cdd460de15fc1d23030d58875e90ef48674e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Continuum's [Blaze](http://blaze.pydata.org/docs/latest/index.html) package provides an interesting way of working with datasets from within Python. Inspired by this [blog post](http://continuum.io/blog/blaze-datasets), here's a quick exploration of using Blaze Expressions" ] }, { "cell_type": "code", "collapsed": false, "input": [ "%pylab inline\n", "import blaze\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Populating the interactive namespace from numpy and matplotlib\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some technical details before getting started.\n", "\n", "This is the first post I've done using Python 3.4. I'm using the [anaconda](http://continuum.io/downloads) distribution and its excellent conda package manager to handle installing packages.\n", "\n", "The Blaze version I'm using is from the blaze channel: `conda install -c blaze blaze` and I got psycopg2 for Python3.4 (to allow connections to PostgreSQL) from the pandas channel: `conda install -c pandas psycopg2`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "chembl = blaze.Data('postgresql://localhost/chembl_19')" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Blaze collects information about the schema while connecting, so I have access to all the tables and their columns. This doesn't show in the static blog post, but I can use IPython's tab completion on table names and/or columns. \n", "\n", "Start with a basic join, blaze figures out the column(s) to use:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "docs_and_compounds = blaze.join(chembl.docs,chembl.compound_records)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "docs_and_compounds.fields" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ "['doc_id',\n", " 'journal',\n", " 'year',\n", " 'volume',\n", " 'issue',\n", " 'first_page',\n", " 'last_page',\n", " 'pubmed_id',\n", " 'doi',\n", " 'chembl_id',\n", " 'title',\n", " 'doc_type',\n", " 'authors',\n", " 'abstract',\n", " 'record_id',\n", " 'molregno',\n", " 'compound_key',\n", " 'compound_name',\n", " 'src_id',\n", " 'src_compound_id']" ] } ], "prompt_number": 4 }, { "cell_type": "code", "collapsed": false, "input": [ "docs_and_compounds.nrows" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "1637862" ], "metadata": {}, "output_type": "pyout", "prompt_number": 5, "text": [ "1637862" ] } ], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "blaze.sort(docs_and_compounds,key='year',ascending=False)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
doc_idjournalyearvolumeissuefirst_pagelast_pagepubmed_iddoichembl_idtitledoc_typeauthorsabstractrecord_idmolregnocompound_keycompound_namesrc_idsrc_compound_id
0 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979019 1626883 MMV672625 None 23 MMV672625
1 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979020 1626884 MMV672626 None 23 MMV672626
2 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979021 1626885 MMV672686 None 23 MMV672686
3 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979022 1626886 MMV672687 None 23 MMV672687
4 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979023 1626887 MMV672688 None 23 MMV672688
5 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979024 1626888 MMV672689 None 23 MMV672689
6 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979025 1626889 MMV672723 None 23 MMV672723
7 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979026 1626890 MMV672725 None 23 MMV672725
8 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979027 1626891 MMV672726 None 23 MMV672726
9 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979028 1626892 MMV672727 None 23 MMV672727
10 77449 None None None None None None None 10.6019/CHEMBL3137547 CHEMBL3137547 Open Source Malaria Deposition 2. http://malar... DATASET None None 1979029 1626893 MMV672730 None 23 MMV672730
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ " doc_id journal year volume issue first_page last_page pubmed_id \\\n", "0 77449 None None None None None None None \n", "1 77449 None None None None None None None \n", "2 77449 None None None None None None None \n", "3 77449 None None None None None None None \n", "4 77449 None None None None None None None \n", "5 77449 None None None None None None None \n", "6 77449 None None None None None None None \n", "7 77449 None None None None None None None \n", "8 77449 None None None None None None None \n", "9 77449 None None None None None None None \n", "10 77449 None None None None None None None \n", "\n", " doi chembl_id \\\n", "0 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "1 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "2 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "3 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "4 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "5 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "6 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "7 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "8 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "9 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "10 10.6019/CHEMBL3137547 CHEMBL3137547 \n", "\n", " title doc_type authors \\\n", "0 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "1 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "2 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "3 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "4 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "5 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "6 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "7 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "8 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "9 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "10 Open Source Malaria Deposition 2. http://malar... DATASET None \n", "\n", " abstract record_id molregno compound_key compound_name src_id \\\n", "0 None 1979019 1626883 MMV672625 None 23 \n", "1 None 1979020 1626884 MMV672626 None 23 \n", "2 None 1979021 1626885 MMV672686 None 23 \n", "3 None 1979022 1626886 MMV672687 None 23 \n", "4 None 1979023 1626887 MMV672688 None 23 \n", "5 None 1979024 1626888 MMV672689 None 23 \n", "6 None 1979025 1626889 MMV672723 None 23 \n", "7 None 1979026 1626890 MMV672725 None 23 \n", "8 None 1979027 1626891 MMV672726 None 23 \n", "9 None 1979028 1626892 MMV672727 None 23 \n", "10 None 1979029 1626893 MMV672730 None 23 \n", "\n", " src_compound_id \n", "0 MMV672625 \n", "1 MMV672626 \n", "2 MMV672686 \n", "3 MMV672687 \n", "4 MMV672688 \n", "5 MMV672689 \n", "6 MMV672723 \n", "7 MMV672725 \n", "8 MMV672726 \n", "9 MMV672727 \n", "..." ] } ], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get rid of the unpublished datasets (doc_id=-1) and datasets where the year is None by doing a query on the docs_and_compounds object:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "docs_and_compounds = docs_and_compounds[(docs_and_compounds['year']>0) & (docs_and_compounds['doc_id']>-1)]\n", "docs_and_compounds.nrows" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "1095936" ], "metadata": {}, "output_type": "pyout", "prompt_number": 6, "text": [ "1095936" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "docs_and_compounds" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
doc_idjournalyearvolumeissuefirst_pagelast_pagepubmed_iddoichembl_idtitledoc_typeauthorsabstractrecord_idmolregnocompound_keycompound_namesrc_idsrc_compound_id
0 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350140 292848 15 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((Z)-3-ph... 1 None
1 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350141 292849 16 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((E)-3-ph... 1 None
2 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350142 292858 17 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... 1 None
3 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350143 292858 18 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... 1 None
4 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350144 292870 20 4-[(2S,3R)-1-(4-Hydroxy-phenyl)-4-oxo-3-(3-phe... 1 None
5 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350145 292107 21 (3R,4S)-1,4-Bis-(4-hydroxy-phenyl)-3-(3-phenyl... 1 None
6 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350146 292254 22 4-[1-(4-formylphenyl)-3-[3-(4-hydroxyphenyl)pr... 1 None
7 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350147 292403 23 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... 1 None
8 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350148 292557 24 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... 1 None
9 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350149 292687 25 (3R,4S)-1,4-Bis-(4-acetyl-phenyl)-3-(3-oxo-3-p... 1 None
10 1 J. Med. Chem. 2004 47 1 1 9 14695813 None CHEMBL1139451 None PUBLICATION None None 350150 175106 Ezetimibe (3R,4S)-1-(4-Fluoro-phenyl)-3-[(S)-3-(4-fluoro... 1 None
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 7, "text": [ " doc_id journal year volume issue first_page last_page pubmed_id \\\n", "0 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "1 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "2 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "3 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "4 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "5 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "6 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "7 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "8 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "9 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "10 1 J. Med. Chem. 2004 47 1 1 9 14695813 \n", "\n", " doi chembl_id title doc_type authors abstract record_id \\\n", "0 None CHEMBL1139451 None PUBLICATION None None 350140 \n", "1 None CHEMBL1139451 None PUBLICATION None None 350141 \n", "2 None CHEMBL1139451 None PUBLICATION None None 350142 \n", "3 None CHEMBL1139451 None PUBLICATION None None 350143 \n", "4 None CHEMBL1139451 None PUBLICATION None None 350144 \n", "5 None CHEMBL1139451 None PUBLICATION None None 350145 \n", "6 None CHEMBL1139451 None PUBLICATION None None 350146 \n", "7 None CHEMBL1139451 None PUBLICATION None None 350147 \n", "8 None CHEMBL1139451 None PUBLICATION None None 350148 \n", "9 None CHEMBL1139451 None PUBLICATION None None 350149 \n", "10 None CHEMBL1139451 None PUBLICATION None None 350150 \n", "\n", " molregno compound_key compound_name \\\n", "0 292848 15 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((Z)-3-ph... \n", "1 292849 16 (3R,4S)-1,4-Bis-(4-methoxy-phenyl)-3-((E)-3-ph... \n", "2 292858 17 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... \n", "3 292858 18 (S)-7-(4-Chloro-phenyl)-3-(4-methoxy-phenyl)-2... \n", "4 292870 20 4-[(2S,3R)-1-(4-Hydroxy-phenyl)-4-oxo-3-(3-phe... \n", "5 292107 21 (3R,4S)-1,4-Bis-(4-hydroxy-phenyl)-3-(3-phenyl... \n", "6 292254 22 4-[1-(4-formylphenyl)-3-[3-(4-hydroxyphenyl)pr... \n", "7 292403 23 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... \n", "8 292557 24 4-[1-(4-formylphenyl)-3-(3-hydroxy-3-phenylpro... \n", "9 292687 25 (3R,4S)-1,4-Bis-(4-acetyl-phenyl)-3-(3-oxo-3-p... \n", "10 175106 Ezetimibe (3R,4S)-1-(4-Fluoro-phenyl)-3-[(S)-3-(4-fluoro... \n", "\n", " src_id src_compound_id \n", "0 1 None \n", "1 1 None \n", "2 1 None \n", "3 1 None \n", "4 1 None \n", "5 1 None \n", "6 1 None \n", "7 1 None \n", "8 1 None \n", "9 1 None \n", "..." ] } ], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find the first year each molregno appeared using `blaze.by`. This is analogous to SQLs \"group by\"." ] }, { "cell_type": "code", "collapsed": false, "input": [ "minyear = blaze.by(docs_and_compounds.molregno,yr=docs_and_compounds.year.min())\n", "minyear" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
molregnoyr
0 1 1983
1 2 1983
2 3 1983
3 4 1983
4 5 1983
5 6 1983
6 7 1983
7 8 1983
8 9 1983
9 10 1997
10 11 1993
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " molregno yr\n", "0 1 1983\n", "1 2 1983\n", "2 3 1983\n", "3 4 1983\n", "4 5 1983\n", "5 6 1983\n", "6 7 1983\n", "7 8 1983\n", "8 9 1983\n", "9 10 1997\n", "..." ] } ], "prompt_number": 8 }, { "cell_type": "code", "collapsed": false, "input": [ "year_counts = blaze.by(minyear.yr,cnt=minyear.molregno.count())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "blaze.sort(year_counts,key='yr',ascending=False)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
yrcnt
0 2014 15131
1 2013 70429
2 2012 69407
3 2011 67026
4 2010 84525
5 2009 65118
6 2008 61158
7 2007 52060
8 2006 33837
9 2005 30778
10 2004 31118
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ " yr cnt\n", "0 2014 15131\n", "1 2013 70429\n", "2 2012 69407\n", "3 2011 67026\n", "4 2010 84525\n", "5 2009 65118\n", "6 2008 61158\n", "7 2007 52060\n", "8 2006 33837\n", "9 2005 30778\n", "..." ] } ], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's what we had before. Excellent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try something else: the number of documents per compound. We'll use the limited set of docs that have years and doc_ids" ] }, { "cell_type": "code", "collapsed": false, "input": [ "docs_per_compound = blaze.by(docs_and_compounds.molregno,ndocs=docs_and_compounds.doc_id.nunique())" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "blaze.sort(docs_per_compound,key='ndocs',ascending=False)" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
molregnondocs
0 78759 1140
1 241 1045
2 6579 761
3 8062 660
4 8873 617
5 173 593
6 13758 564
7 305519 544
8 27307 539
9 365189 530
10 70140 501
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " molregno ndocs\n", "0 78759 1140\n", "1 241 1045\n", "2 6579 761\n", "3 8062 660\n", "4 8873 617\n", "5 173 593\n", "6 13758 564\n", "7 305519 544\n", "8 27307 539\n", "9 365189 530\n", "..." ] } ], "prompt_number": 12 }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's also more or less what we got before.\n", "\n", "Blaze also makes it easy to pull the data from the query (which is only executed in a lazy manner) into other formats. Here's an example of grabbing it as a Pandas DataFrame (for more on this, look at the [blaze migrations blog post](http://continuum.io/blog/blaze-migrations))" ] }, { "cell_type": "code", "collapsed": false, "input": [ "df = blaze.into(pd.DataFrame,docs_per_compound)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "_=hist(df['ndocs'],bins=20,log=True)\n", "_=xlabel('num docs per compound')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "display_data", "png": "iVBORw0KGgoAAAANSUhEUgAAAX8AAAESCAYAAAAVLtXjAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAFdxJREFUeJzt3X+0ZWV93/H3x0HBX0igTQw4roEIcWg1GOtgbRLOihqv\nIXFStYFpYhUbm7qCpj+iJMbFurOWbW1jV2OkxLbCxJg4YA1RacIaSevYqaBAFJjIDDIt084wy9E2\nUmOaBJBv/9j7wpmbGebcc/a9Z87Z79dad83ezzn72c9z7p3vfe53P/vZqSokSf3ypGk3QJK09gz+\nktRDBn9J6iGDvyT1kMFfknrI4C9JPWTwl6QeMvhLUg+d1HWFSQK8B3gmcEdV/WbX55AkTWY1Rv4/\nAZwFPAQcXIX6JUkTGin4J7k2yeEku5eVLyTZm+S+JFe0xecBn6uqXwDe2nF7JUkdGHXkvw1YGC5I\nsg64qi0/H9iSZCPNaP/B9m2PdtROSVKHRgr+VbUL+May4k3AvqraX1UPA9cBm4EbgFcl+TVgZ4dt\nlSR1ZJILvmcBB4b2DwIXVtWfAT8zUaskSatqkuA/0VrQSVxLWpLGUFWZtI5Jgv8DwPqh/fWscHbP\nmWeew6mnns6pp56+4pO//vULvOMd/3jFx62VxcVFFhcXp92MVTPP/ZvnvoH9m1U7d+5k586dbN26\ntZP6Jgn+dwDnJtkAHAIuAbaspIJDh67m0KFxTn0T55zz5XEOlKSZNBgMGAwGaxv8k2wHLgLOSHIA\nuLKqtiW5HNgBrAOuqao9Kzv9rcCg/VqJA8DnV3iMJM2upZF/V0YK/lV11BF9Vd0E3DT+6RfHP/QE\nNxgMpt2EVTXP/ZvnvoH9m1VTGfmvnkXGG/mf+Ob1B3DJPPdvnvsG9m9WdT3yz7Qe4N7M9hn33B/i\n0ks/z/btH+q0TZJ0okvSyWwfV/WUpB4y7SNJM8C0D2DaR1JfmfaRJI3NtI8kzQDTPoBpH0l9ZdpH\nkjQ20z6SNANM+wCmfST1lWkfSdLYDP6S1EMGf0nqIS/4StIM8IIv4AVfSX3lBV9J0tgM/pLUQwZ/\nSeohg78k9ZDBX5J6yKmekjQDnOoJONVTUl851VOSNDaDvyT1kMFfknqo8+CfZJBkV5JfT3JR1/VL\nkia3GiP/R4E/AU4GDq5C/ZKkCY0U/JNcm+Rwkt3LyheS7E1yX5Ir2uJdVfWjwC8CWzturySpA6OO\n/LcBC8MFSdYBV7Xl5wNbkmysx+eOPkgz+pcknWBGusmrqnYl2bCseBOwr6r2AyS5Dtic5PnAq4DT\ngA901lJJUmcmucP3LODA0P5B4MKqei/wuxO1SpK0qiYJ/h3cGrw4tD3AZR4k6UhdL+uwZJLg/wCw\nfmh/PSue3bM4weklaf4NBgMGg8Fj+1u3djOPZpLgfwdwbnst4BBwCbBlZVUs4ohfko6v678ARp3q\nuR24BTgvyYEkl1XVI8DlwA7gHuD6qtrTWcskSavGVT0laYa4qqckaWw+zEWSZoAPcwFM+0jqq67S\nPo78JWkGOPIHHPlL6isv+EqSxmbaR5JmgGkfwLSPpL4y7SNJGpvBX5J6yJy/JM0Ac/6AOX9JfWXO\nX5I0NoO/JPWQOX9JmgHm/AFz/pL6ypy/JGlsBn9J6iGDvyT1kMFfknrI4C9JPeRUT0maAU71BJzq\nKamvnOopSRqbwV+SesjgL0k9tCrBP8nTk9ye5OLVqF+SNJnVGvm/E7h+leqWJE1opOCf5Nokh5Ps\nXla+kGRvkvuSXNGWvRK4B/h6982VJHVh1JH/NmBhuCDJOuCqtvx8YEuSjcBFwEuBvwu8JcnEU5Ik\nSd0a6SavqtqVZMOy4k3AvqraD5DkOmBzVb273X8j8PWa1o0EkqRjmuQO37OAA0P7B4ELl3aq6sMT\n1C1JWkWTBP8ORvSLQ9sDXOZBko7U9bIOSyYJ/g8A64f219OM/ldgcYLTS9L8GwwGDAaDx/a3bt3a\nSb2TBP87gHPbawGHgEuALSurYhFH/JJ0fF3/BTDqVM/twC3AeUkOJLmsqh4BLgd20EztvL6q9qzs\n9IsY+CXp+AaDAYuLi53VN+psn6OO6KvqJuCm8U+/iCN/STo+l3QGXNJZUl91taSzD3ORpBngyB9w\n5C+prxz5S1KPOPIHHPlL6isf4yhJGptpH0maAaZ9ANM+kvrKtI8kaWwGf0nqIXP+kjQDzPkD5vwl\n9ZU5f0nS2Az+ktRD5vwlaQaY8wfM+UvqK3P+kqSxGfwlqYcM/pLUQwZ/Seohg78k9ZBTPSVpBjjV\nE3Cqp6S+cqqnJGlsBn9J6iGDvyT1kMFfknqo8+Cf5PlJfj3Jx5L8/a7rlyRNrvPgX1V7q+qtwKXA\nq7quX5I0uZGCf5JrkxxOsntZ+UKSvUnuS3LFUPmPA78HXNdtcyVJXRh15L8NWBguSLIOuKotPx/Y\nkmQjQFXdWFWvBt7YYVslSR0Z6Q7fqtqVZMOy4k3AvqraD5DkOmBzku8EXgucAnyms5ZKkjozyfIO\nZwEHhvYPAhdW1WeBz45WxeLQ9gCXeZCkI3W9rMOSSYJ/B+tCLE5ehSTNscFgwGAweGx/69atndQ7\nSfB/AFg/tL+eZvS/Aos44pek4+v6L4BJpnreAZybZEOSpwCXAJ9aWRWLGPgl6fgGgwGLi4ud1TfS\nyD/JduAi4IwkB4Arq2pbksuBHcA64Jqq2rOy0y/iyF+Sjs8lnQGXdJbUV10t6ezDXCRpBjjyBxz5\nS+orR/6S1COO/AFH/pL6ysc4SpLGZtpHkmaAaR/AtI+kvjLtI0kam8FfknrInL8kzQBz/oA5f0l9\nZc5fkjQ2g78k9ZDBX5J6yAu+kjQDvOALeMFXUl95wVeSNDaDvyT1kMFfknrI4C9JPeRsH0maAc72\nAeBDwFsmbsO0+i5J45qTZ/hOapLgPfFnJ0kzy5y/JPWQwV+SesjgL0k9tCo5/ySbgYuBU4Frqurm\n1TiPJGk8qxL8q+qTwCeTnAa8DzD4S9IJZOS0T5JrkxxOsntZ+UKSvUnuS3LFssPeDVzVRUMlSd1Z\nSc5/G7AwXJBkHU1wXwDOB7Yk2ZjGvwRuqqo7O2utJKkTI6d9qmpXkg3LijcB+6pqP0CS64DNwCuA\nlwOnJnleVf27TlorSerEpDn/s4ADQ/sHgQur6m3AB45/+OLQ9gCXeZCkI3W9rMOSSYP/hOsjLE54\nekmab4PBgMFg8Nj+1q1bO6l30uD/ALB+aH89zeh/RIs44pek45vqwm5tzv/GqnpBu38ScC9Nfv8Q\ncBuwpar2jFBXBwu7Tba2jwu7SZo1a76wW5LtwEXAGUkOAFdW1bYklwM7gHU0N3QdN/A/bhFH/pJ0\nfC7pDDjyl9RXc7Kk8yKO/CXp+Bz5A478JfVVVyN/V/WUpB7qddonmeyXp385SForpn2ArtI+po0k\nzRrTPpKksfU67SNJs8K0D2DaR1JfmfaRJI3NtI8kzQDTPoBpH0l9ZdpHkjQ2g78k9ZDBX5J6yAu+\nkjQDvOALeMFXUl/NyXr+s82F4STNKoP/RCb9y0OSpsMLvpLUQwZ/SeohZ/tI0gxwtg9wosz2cbaQ\npLXmbJ854GwhSdNi8J8qZwtJmg4v+EpSDxn8JamHOg/+Sc5O8qEk/7HruiVJ3eg8+FfV/VX1M13X\nq78syURfkvprpOCf5Nokh5PsXla+kGRvkvuSXLE6TdSx1QRfkvps1JH/NmBhuCDJOuCqtvx8YEuS\njd02T5K0GkYK/lW1C/jGsuJNwL6q2l9VDwPXAZuTnJ7kg8AF/jUgSSemSeb5nwUcGNo/CFxYVX8M\n/MOJWiVJWlWTBP8OEseLQ9sDXONHko7U9Zo+SyYJ/g8A64f219OM/ldogEF/OiaZ8ePSEtLaGAwG\nDAaD6S3slmQDcGNVvaDdPwm4F3g5cAi4DdhSVXtGrK/3C7vN7vEuSidNy5ou7JZkO3ARcEaSA8CV\nVbUtyeXADmAdcM2ogf9xizjyl6Tjc0lnwJH/tI935C9Ny5ws6byII39JOj5H/oAj/2kf78hfmpau\nRv6u6ilJPWTaR5JmgGkfwLTPtI837SNNi2kfSdLYTPtI0gww7QOY9pn28aZ9pGkx7SNJGptpH82c\nLh5B6V8umjWmfQDTPtM+frppnyb4T9Z3g79mlWkfSdLYDP6S1EMGf0nqIS/4as11ccFW6hsv+AJe\n8J328ZNdMO3igq0XfNVXXvCVJI3N4C9JPWTwl6QeMvhLUg8Z/CWph5zqqV6adLqps4W01pzqCTjV\nc9rHz/5UT6eKalY51VOSNDaDvyT1kMFfknqo8wu+SZ4OXA38BbCzqj7a9TkkSZNZjZH/a4GPVdU/\nAF6zCvVLkiY0UvBPcm2Sw0l2LytfSLI3yX1JrmiLzwIOtNvf7rCtM2bntBuwqrqccnbi2TntBqyq\n+f7ezX//ujLqyH8bsDBckGQdcFVbfj6wJclG4CCwfoX1z6Gd027Aqprv/2A7p92AVTXf37v5719X\nRgrOVbUL+May4k3AvqraX1UPA9cBm4EbgNcluRr4VJeNlSR1Y5ILvsPpHWhG/BdW1f8D3jxRqyRJ\nq2rkO3yTbABurKoXtPuvAxaq6i3t/k/TBP+3jVift0hK0hi6uMN3kpH/Azye26fdPjjqwV00XpI0\nnkkuyN4BnJtkQ5KnAJdgjl+SZsKoUz23A7cA5yU5kOSyqnoEuBzYAdwDXF9Ve1avqZKkrow622dL\nVZ1ZVSdX1fqq2taW31RV31tVz6uqfzFKXce4N2CmJFmf5DNJvpzkj5K8vS0/PcnNSb6S5NNJThs6\n5pfaPu9N8iPTa/3okqxL8qUkN7b7c9O/JKcl+XiSPUnuSXLhvPSvbeuXk+xO8tEkJ89y3452n9E4\n/Uny4vYzuS/J+9e6H8dyjP79SvuzeVeSG5I8a+i1bvpXVWv2BawD9gEbgCcDdwIb17INHfXj2cAF\n7fYzgHuBjcC/At7Zll8BvLfdPr/t65Pbvu8DnjTtfozQz38C/DbwqXZ/bvoHfBh4c7t9EvCseehf\n277/AZzc7l8PvHGW+wb8IPAiYPdQ2Ur6szSx5TZgU7v9+zQTVk7U/r1y6fsAvHc1+rfWN2Ed696A\nmVJVX62qO9vtbwF7aKa+voYmqND++xPt9mZge1U9XFX7ab5hm9a00SuU5DnAj9I8PGHp4vxc9K8d\nRf1gVV0LUFWPVNX/ZT76903gYeBpSU4CngYcYob7Vke/z2gl/bkwyXcDz6yq29r3/ebQMVN1tP5V\n1c1V9Wi7+wXgOe12Z/1b6+B/tHsDzlrjNnSqnQL7Ippv0HdV1eH2pcPAd7XbZ3LkTKhZ6Pe/Ad4B\nPDpUNi/9Oxv4epJtSb6Y5D+0CxLOfP+q6o+Bfw38L5qg/2BV3cwc9G2ZlfZnefkDzEY/oblv6vfb\n7c76t9bBf67m9id5BvA7wM9X1Z8Mv1bN315P1N8T9rNI8mPA16rqSzw+6j/CLPePJs3z/cDVVfX9\nwJ8Cvzj8hlntX5LvAf4RTUrgTOAZ7T04j5nVvh3LCP2ZWUl+GXioVmF15LUO/hPdG3AiSfJkmsD/\nkar6RFt8OMmz29e/G/haW768389py05ULwNek+R+YDvww0k+wvz07yBwsKpub/c/TvPL4Ktz0L+/\nAdxSVf+nmhl5NwB/k/no27CV/CwebMufs6z8hO5nkjfRpF5/aqi4s/6tdfCfi3sDkgS4Brinqn51\n6KVP0Vxco/33E0PllyZ5SpKzgXNpLs6ckKrqXdXM6jobuBT4L1X1Buanf18FDiQ5ry16BfBl4EZm\nv397gZcmeWr7c/oKmqnY89C3YSv6WWy/599sZ3UFeMPQMSecJAs0adfNVfXnQy91178pXNl+Nc3s\nmH3AL631+Tvqww/Q5MLvBL7Ufi0ApwN/AHwF+DRw2tAx72r7vBd41bT7sIK+XsTjs33mpn/A9wG3\nA3fRjI6fNS/9A95J88tsN83F0CfPct9o/vo8BDxEc83wsnH6A7y4/Uz2Ab827X49Qf/eDNwH/M+h\n+HJ11/0beW0fSdL86PF6+5LUXwZ/Seohg78k9ZDBX5J6yOAvST1k8JekHjL464SU5FvTbkOf+Hn3\nj8FfJ6oT8gaUJPP6f+aE/Ly1eub1B1kdaZfi2JPk36d5cM2OJKe0r+1M8uJ2+6+0awGR5E1JPtE+\nZOP+JJcn+YV2Bc1bk3zHUc5zdvva3UneM1Se9sEWu9vXfnLotSvasjuT/PO27O1pHmRyV5on0C0/\nz5uSfDLNw3i+kuTKodd+OskX0jzA5oNLgT7Jt5K8L8mdwEuX1fe8JH/QtuEP21vuOVqbkwySfLb9\nbP57kvcmeUOS29r3ndO+7zfa89+e5N4kF7flp6RZifTu9rMcDPXpA0Nt+k9Jfmio7e9p23drku98\nos9b/WHw1yieB1xVVX8deBB4XVv+RKsp/jXgbwMvAf4Z8M1qVtC8Ffh7R3n/+4F/W1UvpLnVfclr\naZZieCHNOjW/kuTZSV5Ns6b7pqq6gObhHtA82OOCqvo+4GeP0baXtPW+EPg7aZ6AtBH4SeBlVfUi\nmuU7lhbUehrw+aq6oKpuWVbXbwMfaNuwtIDa647W5vb9L2zbtZFm/ZXvqapNNM9NeNtQvc+tqpcA\nFwMfTHIy8HPAt9vPaAvw4bZ8+fdgeP9pwK1t+/4r8Ja2/Fift3rC4K9R3F9Vd7fbf0izXPDxfKaq\n/rSq/jfNL4wb2/Ldxzj+ZTRrnAD81lD5DwAfrcbXgM/SBO+XA9dWu+hVVT3Yvv9u4KNJfgr49jHa\n9umq+kZ77A3tOV5OszbKHUm+BPwwzbr/tPX8zvJKkjwTOLOqPtm24aGq+jPgbx2jzQXcXlWHq+oh\nmjVYdrTV/dHQ51LAx9o699E8mev5bb2/1ZbfS7P2y9LidMfyUFX9Xrs9/L071uetnjhp2g3QTPiL\noe1vA6e024/w+ADiFI40fMyjQ/uPsrKfu+IYzxQ4RvnFwA8BPw78cpIXVNXwL4Hlo+QMlX24qt51\nlDr/vFa+CNbyti0dP+7nsnT80eod/j7Akd+Lh5edz//zAhz5azxLAWg/zfrxAK9f4bHLfY5m+Wg4\ncv3yXcAlSZ6U5K/SBPYvADcDlyV5KkCS72iXsn1uVe2keTjLs4CnH+X8r2zf/1Sax+L9N+A/A69v\nz7H0gPDnPlFHqnmAz8Ekm9tjTm7rPFqbb3uCvi8XmnRU0jyc5RyaFRx3LX02aZajfi7NCrn7gQva\n969ntMcwHuvzVk8Y/DWKY+WU3we8NckXgTOGypdfC1i+fbRR9M8DP5fkbponUBVAVf0uTSrnLpoA\n/Y6q+lpV7aBZ23wpTfNPgXXAR9o6vgi8v6q+eZS230aTxrkL+HhVfbGq9gDvBj6d5C6aZYKfPXTM\nsbwBeHt7zOdoHi941DY/Qd+Xfy5F8xjG22ge3/ezbZroauBJbf+uA95YzbNcPwfcT7Nu//tp0jvD\n9R7tHEf9vNUfLumsXknzdKQXV9XbjvfeaUmyDbixqm6Ydls0vxz5q2/m9nmv0ko48pekHnLkL0k9\nZPCXpB4y+EtSDxn8JamHDP6S1EMGf0nqof8Puqs+zHh8z9cAAAAASUVORK5CYII=\n", "text": [ "" ] } ], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "len(df[df['ndocs']>10])" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 18, "text": [ "2381" ] } ], "prompt_number": 18 }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }