{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Lede Program\n", "## Data and databases\n", "## Dealing with craptastical data sources, text and otherwise\n", "\n", "Biggest bugbear: PDF\n", "\n", " but in general, any sort of image file, with something you want to extract from it\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sad part of story: \n", "\n", "the best current commercial tools are better than the open source tools in many cases\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##the lesser evil (but still very evil)\n", "###pdfs with text information included as instructions for drawing text\n", "####AKA\n", "###you can select the text, it gets highlighted, and you can copy it\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##helpful resources\n", "\n", "`https://www.propublica.org/nerds/item/turning-pdfs-to-text-doc-dollars-guide`\n", "\n", "`https://thomaslevine.com/!/parsing-pdfs/`\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## FIRST LINE OF DEFENSE: `TABULA`\n", "\n", "### free software by journalists for journalists and friends\n", "## *La Nación*, Knight Foundation funding, ProPublica\n", "\n", "download and run locally in browser\n", "\n", "http://tabula.technology/ \n", "\n", "requires a Java runtime (boo).\n", "\n", "OR\n", "\n", "access someone else's open server:\n", "\n", "http://tabula.dataninja.it/\n", "\n", "(this is an older, less capable version).\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "list=pd.read_csv(\"pdf_examples/tabula-AFD-130118-015.csv\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MAJCOM, FOA, EtcOrganizational LevelFinding TypeQuantityItem(s) discoveredLocation
0ACCStaffUnprofessional1photoWorkplace Common Area
1ACCStaffUnprofessional1newspaper with unprofessional coverWorkplace Common Area
2ACCSquadronUnprofessional1magazineWorkplace Common Area
3ACCSquadronInappropriate/Offensive1Bumper stickerCar
4ACCSquadronUnprofessional6signs with unproffesional languageWorkplace Common Area
\n", "
" ], "text/plain": [ " MAJCOM, FOA, Etc Organizational Level Finding Type Quantity \\\n", "0 ACC Staff Unprofessional 1 \n", "1 ACC Staff Unprofessional 1 \n", "2 ACC Squadron Unprofessional 1 \n", "3 ACC Squadron Inappropriate/Offensive 1 \n", "4 ACC Squadron Unprofessional 6 \n", "\n", " Item(s) discovered Location \n", "0 photo Workplace Common Area \n", "1 newspaper with unprofessional cover Workplace Common Area \n", "2 magazine Workplace Common Area \n", "3 Bumper sticker Car \n", "4 signs with unproffesional language Workplace Common Area " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
MAJCOM, FOA, EtcOrganizational LevelFinding TypeQuantityItem(s) discovered
Location
1inX1in deck of cards with \\rnude drawing; 2inX2.5in post \\rcard drawing depicting front of \\rairplane with female drawing11111
A-10 Ladder Doors11111
Acft Dock desk11111
Air Terminal Operations Bldg11111
Aircraft22222
Aircraft Parts Store11111
Airfield Server/network drive11111
Airman & Family Readiness \\rCenter11111
Airmen's common work area11111
Although \\rhumorous/functional…could be \\rperceived as offensive33333
Ammo Facility11111
Anti-religious sentiment does \\rnot promote a proper work \\renvironment11111
Auditorium22222
Auto Hobby Shop22222
Avionics Programs Office22222
Avionics section22222
Back of office door11111
Bar11111
Bar (class gift)11111
Bar (visiting unit gift)11111
Base Common Area11111
Base Library11111
Base operations men’s room11111
Bathroom88888
Bathroom (M)11111
Bathroom (W)11111
Bathroom Stall11111
Bathroom stalls66666
Bathroom wall11111
Bias against mentally \\rhandicapped people is not \\rappropriate in the work place11111
..................
computer44444
computer files3232323232
computer room11111
detrimental to good order and \\rdiscipline22222
explicit material66666
explicit material/mild nudity11111
explicit material/sexuality mild \\rnudity11111
explicit/mild nudity11111
explicit/violent/vulgar material11111
foyer11111
inside the drawer of a common \\ruse desk11111
latrine22222
member’s office22222
nudity/inappropriate subject \\rmatter11111
office cubicles11111
on latrine board11111
potential for inappropriate \\rcontent11111
server55555
sexually explicit item in \\rcommon area11111
sexually explicit/offensive11111
sexually explicit/profane item in \\rcommon area22222
share drive88888
shared drive7777777777
shared drive/history micro film r22222
shelf33333
storage closet11111
unprofessional comments22222
vulgar11111
vulgar or offensive language44444
workspace11111
\n", "

577 rows × 5 columns

\n", "
" ], "text/plain": [ " MAJCOM, FOA, Etc \\\n", "Location \n", "1inX1in deck of cards with \\rnude drawing; 2inX... 1 \n", "A-10 Ladder Doors 1 \n", "Acft Dock desk 1 \n", "Air Terminal Operations Bldg 1 \n", "Aircraft 2 \n", "Aircraft Parts Store 1 \n", "Airfield Server/network drive 1 \n", "Airman & Family Readiness \\rCenter 1 \n", "Airmen's common work area 1 \n", "Although \\rhumorous/functional…could be \\rperce... 3 \n", "Ammo Facility 1 \n", "Anti-religious sentiment does \\rnot promote a p... 1 \n", "Auditorium 2 \n", "Auto Hobby Shop 2 \n", "Avionics Programs Office 2 \n", "Avionics section 2 \n", "Back of office door 1 \n", "Bar 1 \n", "Bar (class gift) 1 \n", "Bar (visiting unit gift) 1 \n", "Base Common Area 1 \n", "Base Library 1 \n", "Base operations men’s room 1 \n", "Bathroom 8 \n", "Bathroom (M) 1 \n", "Bathroom (W) 1 \n", "Bathroom Stall 1 \n", "Bathroom stalls 6 \n", "Bathroom wall 1 \n", "Bias against mentally \\rhandicapped people is n... 1 \n", "... ... \n", "computer 4 \n", "computer files 32 \n", "computer room 1 \n", "detrimental to good order and \\rdiscipline 2 \n", "explicit material 6 \n", "explicit material/mild nudity 1 \n", "explicit material/sexuality mild \\rnudity 1 \n", "explicit/mild nudity 1 \n", "explicit/violent/vulgar material 1 \n", "foyer 1 \n", "inside the drawer of a common \\ruse desk 1 \n", "latrine 2 \n", "member’s office 2 \n", "nudity/inappropriate subject \\rmatter 1 \n", "office cubicles 1 \n", "on latrine board 1 \n", "potential for inappropriate \\rcontent 1 \n", "server 5 \n", "sexually explicit item in \\rcommon area 1 \n", "sexually explicit/offensive 1 \n", "sexually explicit/profane item in \\rcommon area 2 \n", "share drive 8 \n", "shared drive 77 \n", "shared drive/history micro film r 2 \n", "shelf 3 \n", "storage closet 1 \n", "unprofessional comments 2 \n", "vulgar 1 \n", "vulgar or offensive language 4 \n", "workspace 1 \n", "\n", " Organizational Level \\\n", "Location \n", "1inX1in deck of cards 
with \\rnude drawing; 2inX... 1 \n", "A-10 Ladder Doors 1 \n", "Acft Dock desk 1 \n", "Air Terminal Operations Bldg 1 \n", "Aircraft 2 \n", "Aircraft Parts Store 1 \n", "Airfield Server/network drive 1 \n", "Airman & Family Readiness \\rCenter 1 \n", "Airmen's common work area 1 \n", "Although \\rhumorous/functional…could be \\rperce... 3 \n", "Ammo Facility 1 \n", "Anti-religious sentiment does \\rnot promote a p... 1 \n", "Auditorium 2 \n", "Auto Hobby Shop 2 \n", "Avionics Programs Office 2 \n", "Avionics section 2 \n", "Back of office door 1 \n", "Bar 1 \n", "Bar (class gift) 1 \n", "Bar (visiting unit gift) 1 \n", "Base Common Area 1 \n", "Base Library 1 \n", "Base operations men’s room 1 \n", "Bathroom 8 \n", "Bathroom (M) 1 \n", "Bathroom (W) 1 \n", "Bathroom Stall 1 \n", "Bathroom stalls 6 \n", "Bathroom wall 1 \n", "Bias against mentally \\rhandicapped people is n... 1 \n", "... ... \n", "computer 4 \n", "computer files 32 \n", "computer room 1 \n", "detrimental to good order and \\rdiscipline 2 \n", "explicit material 6 \n", "explicit material/mild nudity 1 \n", "explicit material/sexuality mild \\rnudity 1 \n", "explicit/mild nudity 1 \n", "explicit/violent/vulgar material 1 \n", "foyer 1 \n", "inside the drawer of a common \\ruse desk 1 \n", "latrine 2 \n", "member’s office 2 \n", "nudity/inappropriate subject \\rmatter 1 \n", "office cubicles 1 \n", "on latrine board 1 \n", "potential for inappropriate \\rcontent 1 \n", "server 5 \n", "sexually explicit item in \\rcommon area 1 \n", "sexually explicit/offensive 1 \n", "sexually explicit/profane item in \\rcommon area 2 \n", "share drive 8 \n", "shared drive 77 \n", "shared drive/history micro film r 2 \n", "shelf 3 \n", "storage closet 1 \n", "unprofessional comments 2 \n", "vulgar 1 \n", "vulgar or offensive language 4 \n", "workspace 1 \n", "\n", " Finding Type Quantity \\\n", "Location \n", "1inX1in deck of cards with \\rnude drawing; 2inX... 
1 1 \n", "A-10 Ladder Doors 1 1 \n", "Acft Dock desk 1 1 \n", "Air Terminal Operations Bldg 1 1 \n", "Aircraft 2 2 \n", "Aircraft Parts Store 1 1 \n", "Airfield Server/network drive 1 1 \n", "Airman & Family Readiness \\rCenter 1 1 \n", "Airmen's common work area 1 1 \n", "Although \\rhumorous/functional…could be \\rperce... 3 3 \n", "Ammo Facility 1 1 \n", "Anti-religious sentiment does \\rnot promote a p... 1 1 \n", "Auditorium 2 2 \n", "Auto Hobby Shop 2 2 \n", "Avionics Programs Office 2 2 \n", "Avionics section 2 2 \n", "Back of office door 1 1 \n", "Bar 1 1 \n", "Bar (class gift) 1 1 \n", "Bar (visiting unit gift) 1 1 \n", "Base Common Area 1 1 \n", "Base Library 1 1 \n", "Base operations men’s room 1 1 \n", "Bathroom 8 8 \n", "Bathroom (M) 1 1 \n", "Bathroom (W) 1 1 \n", "Bathroom Stall 1 1 \n", "Bathroom stalls 6 6 \n", "Bathroom wall 1 1 \n", "Bias against mentally \\rhandicapped people is n... 1 1 \n", "... ... ... \n", "computer 4 4 \n", "computer files 32 32 \n", "computer room 1 1 \n", "detrimental to good order and \\rdiscipline 2 2 \n", "explicit material 6 6 \n", "explicit material/mild nudity 1 1 \n", "explicit material/sexuality mild \\rnudity 1 1 \n", "explicit/mild nudity 1 1 \n", "explicit/violent/vulgar material 1 1 \n", "foyer 1 1 \n", "inside the drawer of a common \\ruse desk 1 1 \n", "latrine 2 2 \n", "member’s office 2 2 \n", "nudity/inappropriate subject \\rmatter 1 1 \n", "office cubicles 1 1 \n", "on latrine board 1 1 \n", "potential for inappropriate \\rcontent 1 1 \n", "server 5 5 \n", "sexually explicit item in \\rcommon area 1 1 \n", "sexually explicit/offensive 1 1 \n", "sexually explicit/profane item in \\rcommon area 2 2 \n", "share drive 8 8 \n", "shared drive 77 77 \n", "shared drive/history micro film r 2 2 \n", "shelf 3 3 \n", "storage closet 1 1 \n", "unprofessional comments 2 2 \n", "vulgar 1 1 \n", "vulgar or offensive language 4 4 \n", "workspace 1 1 \n", "\n", " Item(s) discovered \n", "Location \n", "1inX1in deck of 
cards with \\rnude drawing; 2inX... 1 \n", "A-10 Ladder Doors 1 \n", "Acft Dock desk 1 \n", "Air Terminal Operations Bldg 1 \n", "Aircraft 2 \n", "Aircraft Parts Store 1 \n", "Airfield Server/network drive 1 \n", "Airman & Family Readiness \\rCenter 1 \n", "Airmen's common work area 1 \n", "Although \\rhumorous/functional…could be \\rperce... 3 \n", "Ammo Facility 1 \n", "Anti-religious sentiment does \\rnot promote a p... 1 \n", "Auditorium 2 \n", "Auto Hobby Shop 2 \n", "Avionics Programs Office 2 \n", "Avionics section 2 \n", "Back of office door 1 \n", "Bar 1 \n", "Bar (class gift) 1 \n", "Bar (visiting unit gift) 1 \n", "Base Common Area 1 \n", "Base Library 1 \n", "Base operations men’s room 1 \n", "Bathroom 8 \n", "Bathroom (M) 1 \n", "Bathroom (W) 1 \n", "Bathroom Stall 1 \n", "Bathroom stalls 6 \n", "Bathroom wall 1 \n", "Bias against mentally \\rhandicapped people is n... 1 \n", "... ... \n", "computer 4 \n", "computer files 32 \n", "computer room 1 \n", "detrimental to good order and \\rdiscipline 2 \n", "explicit material 6 \n", "explicit material/mild nudity 1 \n", "explicit material/sexuality mild \\rnudity 1 \n", "explicit/mild nudity 1 \n", "explicit/violent/vulgar material 1 \n", "foyer 1 \n", "inside the drawer of a common \\ruse desk 1 \n", "latrine 2 \n", "member’s office 2 \n", "nudity/inappropriate subject \\rmatter 1 \n", "office cubicles 1 \n", "on latrine board 1 \n", "potential for inappropriate \\rcontent 1 \n", "server 5 \n", "sexually explicit item in \\rcommon area 1 \n", "sexually explicit/offensive 1 \n", "sexually explicit/profane item in \\rcommon area 2 \n", "share drive 8 \n", "shared drive 77 \n", "shared drive/history micro film r 2 \n", "shelf 3 \n", "storage closet 1 \n", "unprofessional comments 2 \n", "vulgar 1 \n", "vulgar or offensive language 4 \n", "workspace 1 \n", "\n", "[577 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ 
"list.groupby(by=\"Location\").count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### still a lot of data munging to get into working form\n", "\n", " Hello *REGEX* my old friend,\n", " I've come to talk with you once again" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# second line of defense: `pdftotext`\n", "\n", "- part of the poppler-utils in most Linux flavors\n", "\n", "`apt-get install poppler-utils`\n", "\n", "\n", "- Mac or Windows download from:\n", "\n", "`http://www.foolabs.com/xpdf/home.html`\n", "\n", "\n", "Implementations *vary* a lot. Better on Linux than on Mac. \n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/home/mljones/repositories/courses/databases-2015/pdf_examples\n" ] } ], "source": [ "cd pdf_examples/" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# does basic conversion\n", "!pdftotext p5.pdf\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Keynote Talk\r\n", "\r\n", "The Mathematics of Causal Inference\r\n", "Judea Pearl\r\n", "Computer Science Department\r\n", "University of California Los Angeles\r\n", "Los Angeles, CA 90024, USA\r\n", "\r\n", "judea@cs.ucla.edu\r\n", "\r\n", "Abstract\r\n", "I will review concepts, principles, and mathematical tools that were found useful in applications involving\r\n", "causal and counterfactual relationships. This semantical framework, enriched with a few ideas from logic\r\n", "and graph theory, gives rise to a complete, coherent, and friendly calculus of causation that unifies the\r\n", "graphical and counterfactual approaches to causation and resolves many long-standing problems in several\r\n", "of the sciences. 
These include questions of causal effect estimation, policy analysis, and the integration of\r\n", "data from diverse studies. Of special interest to KDD researchers would be the following topics:\r\n", "1. The Mediation Formula, and what it tells us about direct and indirect effects.\r\n", "2. What mathematics can tell us about “external validity” or “generalizing from experiments”\r\n", "3. What can graph theory tell us about recovering from sample-selection bias.\r\n", "Categories and Subject Descriptors: G.m [Mathematics of Computing]: Miscellaneous\r\n", "General Terms: Theory\r\n", "\r\n", "Bio\r\n", "Judea Pearl is a professor of computer science and statistics at the University of California, Los Angeles. He is\r\n", "a graduate of the Technion, Israel, and has joined the faculty of UCLA in 1970, where he currently directs the\r\n", "Cognitive Systems Laboratory and conducts research in artificial intelligence, causal inference and philosophy\r\n", "of science. He has authored three books: Heuristics (1984), Probabilistic Reasoning (1988), and Causality\r\n", "(2000;2009). 
A member of the National Academy of Engineering, and a Founding Fellow the American\r\n", "Association for Artificial Intelligence (AAAI), Judea Pearl is the recipient of the 2008 Benjamin Franklin\r\n", "Medal for Computer and Cognitive Science and this year’s David Rumelhart Prize from the Cognitive Science\r\n", "Society.\r\n", "\r\n", "Copyright is held by the author/owner(s).\r\n", "KDD’11, August 21–24, 2011, San Diego, California, USA.\r\n", "ACM 978-1-4503-0813-7/11/08.\r\n", "\r\n", "5\r\n", "\r\n", "\f" ] } ], "source": [ "!cat p5.txt\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!pdftotext -layout p5.pdf\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Keynote Talk\r\n", " The Mathematics of Causal Inference\r\n", "\r\n", " Judea Pearl\r\n", " Computer Science Department\r\n", " University of California Los Angeles\r\n", " Los Angeles, CA 90024, USA\r\n", " judea@cs.ucla.edu\r\n", "\r\n", "\r\n", "Abstract\r\n", "I will review concepts, principles, and mathematical tools that were found useful in applications involving\r\n", "causal and counterfactual relationships. This semantical framework, enriched with a few ideas from logic\r\n", "and graph theory, gives rise to a complete, coherent, and friendly calculus of causation that unifies the\r\n", "graphical and counterfactual approaches to causation and resolves many long-standing problems in several\r\n", "of the sciences. These include questions of causal effect estimation, policy analysis, and the integration of\r\n", "data from diverse studies. Of special interest to KDD researchers would be the following topics:\r\n", "\r\n", " 1. The Mediation Formula, and what it tells us about direct and indirect effects.\r\n", " 2. What mathematics can tell us about “external validity” or “generalizing from experiments”\r\n", " 3. 
What can graph theory tell us about recovering from sample-selection bias.\r\n", "\r\n", "\r\n", "Categories and Subject Descriptors: G.m [Mathematics of Computing]: Miscellaneous\r\n", "General Terms: Theory\r\n", "\r\n", "Bio\r\n", "Judea Pearl is a professor of computer science and statistics at the University of California, Los Angeles. He is\r\n", "a graduate of the Technion, Israel, and has joined the faculty of UCLA in 1970, where he currently directs the\r\n", "Cognitive Systems Laboratory and conducts research in artificial intelligence, causal inference and philosophy\r\n", "of science. He has authored three books: Heuristics (1984), Probabilistic Reasoning (1988), and Causality\r\n", "(2000;2009). A member of the National Academy of Engineering, and a Founding Fellow the American\r\n", "Association for Artificial Intelligence (AAAI), Judea Pearl is the recipient of the 2008 Benjamin Franklin\r\n", "Medal for Computer and Cognitive Science and this year’s David Rumelhart Prize from the Cognitive Science\r\n", "Society.\r\n", "\r\n", "\r\n", "\r\n", "\r\n", "Copyright is held by the author/owner(s).\r\n", "KDD’11, August 21–24, 2011, San Diego, California, USA.\r\n", "ACM 978-1-4503-0813-7/11/08.\r\n", "\r\n", "\r\n", "\r\n", "\r\n", " 5\r\n", "\f" ] } ], "source": [ "!cat p5.txt" ] }, { "cell_type": "markdown", "metadata": { "collapsed": false }, "source": [ "Let's check out a yucky scanned then OCR'd table from our good friends at DARPA. 
(It doesn't work on Tabula, alas!)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "!pdftotext 12-F-1039_1999-DARPA-Funding-List.pdf" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "A\r\n", "1 FY\r\n", "2\r\n", "1420 1999\r\n", "1421\r\n", "1422\r\n", "1423\r\n", "1424\r\n", "1425\r\n", "1426\r\n" ] } ], "source": [ "!head 12-F-1039_1999-DARPA-Funding-List.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# key parameter: `-layout` OR `-fixed` (and a number say 2 or 10)\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "!pdftotext -layout 12-F-1039_1999-DARPA-Funding-List.pdf" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " A B c D E F G\r\n", " 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE AMOUNT\r\n", " 2\r\n", "1420 1999 MDA97292J 1029 GR20 CNRI INFORMATION MANAGEMENT 12/10/1998 $687,000.00\r\n", "1421 MDA97292J1 029 GR22 CNRI COMMUNICATOR 4/22/1999 $400,000.00\r\n", "1422 MDA97292J1 029 GR22 CNRI WEBINABOX 4122/1999 $360,000.00\r\n", "1423 MDA97292J1 029 P00025 CNRI WEBINABOX 8/24/1999 $0.00\r\n", "1424 MDA972931 0030 P00009 GEORGIATEC HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 $1 ,210,694.00\r\n", "1425 MDA9729320014 P00017 USDISPLAYC FLAT PANEL DISPLAYS 8116/1999 $5,794,000.00\r\n", "1426 MDA97293C0016 P00043 SYSPLANCOR CHPS: Combat Hybrid Power Systems 1nt1999 $79,441.00\r\n" ] } ], "source": [ "!head 12-F-1039_1999-DARPA-Funding-List.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That looks like something we might be able to struggle with!\n", "\n", "Let's try it!\n", "\n", "Lots of ways of tackling it but the easiest is probably `pandas`' `read_table` 
function." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# first, make sure we are in control of the encoding\n", "!pdftotext -layout -enc \"UTF-8\" 12-F-1039_1999-DARPA-Funding-List.pdf" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": true }, "outputs": [], "source": [ "darpa1999=pd.read_table(\"12-F-1039_1999-DARPA-Funding-List.txt\", sep=\"\\t\", encoding=\"UTF-8\", header=1)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE AMOUNT
02
11420 1999 MDA97292J 1029 GR20 CNRI ...
21421 MDA97292J1 029 GR22 CNRI ...
31422 MDA97292J1 029 GR22 CNRI ...
41423 MDA97292J1 029 P00025 CNRI ...
51424 MDA972931 0030 P00009 GEORGI...
61425 MDA9729320014 P00017 USDISP...
71426 MDA97293C0016 P00043 SYSPLA...
81427 MDA97294C0003 A00003 BELLAT...
91428 MDA97294C0003 P00026 BELLAT...
101429 MDA97294C0003 P00027 BELLAT...
111430 MDA97294C0003 P00028 BELLAT...
121431 MDA97294C0003 P00029 BELLAT...
131432 MDA97294C0003 P00030 BELLAT...
141433 MDA97294C0003 P00031 BELLAT...
151434 MDA97294C0003 P00032 BELLAT...
161435 MDA97294C0016 P00026 BDMFED...
171436 MDA97294C0016 P00027 BDMFED...
181437 MDA97294C0016 P00028 BDMFED...
191438 MDA97294C0016 P00029 BDMFED...
201439 MDA97294C0016 P00030 BDMFED...
211440 MDA97294D0001 D003/P16 VRT ...
221441 MDA97294D0001 0032/3 VRT ...
231442 MDA97294D0001 003202 VALLEY...
241443 MDA972951 0016 GR03 ARIZON...
251444 MDA9729530027 P00014 BELLCO...
261445 MDA9729530029 A00009 PLANAR...
271446 MDA9729530029 GR0008 PLANAR...
281447 MDA9729530036 GR06 ITNENE...
291448 MDA9729530042 GR011 CRAYRE...
......
4881880 MDA97299F0028 D001 DIGITSY...
4891881 MDA97299F0029 DO DTAI ...
4901882 MDA97299F0030 BASIC BOOZALL...
4911883 MDA97299F0031 BASIC SCHAFER...
4921884 MDA97299F0032 DO BRADSON...
4931885 MDA97299F0033 DO SYSPLAN...
4941886 MDA97299F0033 P00001 SYSPLAN...
4951887 MDA97299F0034 BASIC DIGITSY...
4961888 MDA97299M0002 DO INFOSYS...
4971889 MDA97299M0003 DO SRC ...
498A B c D ...
4991 FY CONTRACT NUMBER CONTRACT MOD PERFORME...
5002
5011890 MDA97299M0004 DO ARDAK ...
5021891 MDA97299M0004 P00001 ARDAK ...
5031892 MDA97299M0004 P00002 ARDAK ...
5041893 MDA97299M0005 DO SHA ...
5051894 MDA97299M0005 P00001 SHA ...
5061895 MDA97299M0006 DO VISTARE...
5071896 MDA97299M0007 DO VISUALE...
5081897 MDA97299M0008 BASIC BLUE RI...
5091898 MDA97299M0009 DO QRI ...
5101899 MDA97299M001 0 DO PRAJAIN...
5111900 MDA97299M0011 BASIC lVI ...
5121901 MDA97299M0012 BASIC JERRYCO...
5131902 MDA97299M0013 DO DIAMOND...
5141903 MDA9769630014 P00007 SDLINC ...
5151904 ...
5161905
517
\n", "

518 rows × 1 columns

\n", "
" ], "text/plain": [ " 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE AMOUNT\n", "0 2 \n", "1 1420 1999 MDA97292J 1029 GR20 CNRI ... \n", "2 1421 MDA97292J1 029 GR22 CNRI ... \n", "3 1422 MDA97292J1 029 GR22 CNRI ... \n", "4 1423 MDA97292J1 029 P00025 CNRI ... \n", "5 1424 MDA972931 0030 P00009 GEORGI... \n", "6 1425 MDA9729320014 P00017 USDISP... \n", "7 1426 MDA97293C0016 P00043 SYSPLA... \n", "8 1427 MDA97294C0003 A00003 BELLAT... \n", "9 1428 MDA97294C0003 P00026 BELLAT... \n", "10 1429 MDA97294C0003 P00027 BELLAT... \n", "11 1430 MDA97294C0003 P00028 BELLAT... \n", "12 1431 MDA97294C0003 P00029 BELLAT... \n", "13 1432 MDA97294C0003 P00030 BELLAT... \n", "14 1433 MDA97294C0003 P00031 BELLAT... \n", "15 1434 MDA97294C0003 P00032 BELLAT... \n", "16 1435 MDA97294C0016 P00026 BDMFED... \n", "17 1436 MDA97294C0016 P00027 BDMFED... \n", "18 1437 MDA97294C0016 P00028 BDMFED... \n", "19 1438 MDA97294C0016 P00029 BDMFED... \n", "20 1439 MDA97294C0016 P00030 BDMFED... \n", "21 1440 MDA97294D0001 D003/P16 VRT ... \n", "22 1441 MDA97294D0001 0032/3 VRT ... \n", "23 1442 MDA97294D0001 003202 VALLEY... \n", "24 1443 MDA972951 0016 GR03 ARIZON... \n", "25 1444 MDA9729530027 P00014 BELLCO... \n", "26 1445 MDA9729530029 A00009 PLANAR... \n", "27 1446 MDA9729530029 GR0008 PLANAR... \n", "28 1447 MDA9729530036 GR06 ITNENE... \n", "29 1448 MDA9729530042 GR011 CRAYRE... \n", ".. ... \n", "488 1880 MDA97299F0028 D001 DIGITSY... \n", "489 1881 MDA97299F0029 DO DTAI ... \n", "490 1882 MDA97299F0030 BASIC BOOZALL... \n", "491 1883 MDA97299F0031 BASIC SCHAFER... \n", "492 1884 MDA97299F0032 DO BRADSON... \n", "493 1885 MDA97299F0033 DO SYSPLAN... \n", "494 1886 MDA97299F0033 P00001 SYSPLAN... \n", "495 1887 MDA97299F0034 BASIC DIGITSY... \n", "496 1888 MDA97299M0002 DO INFOSYS... \n", "497 1889 MDA97299M0003 DO SRC ... \n", "498 \f", " A B c D ... \n", "499 1 FY CONTRACT NUMBER CONTRACT MOD PERFORME... \n", "500 2 \n", "501 1890 MDA97299M0004 DO ARDAK ... 
\n", "502 1891 MDA97299M0004 P00001 ARDAK ... \n", "503 1892 MDA97299M0004 P00002 ARDAK ... \n", "504 1893 MDA97299M0005 DO SHA ... \n", "505 1894 MDA97299M0005 P00001 SHA ... \n", "506 1895 MDA97299M0006 DO VISTARE... \n", "507 1896 MDA97299M0007 DO VISUALE... \n", "508 1897 MDA97299M0008 BASIC BLUE RI... \n", "509 1898 MDA97299M0009 DO QRI ... \n", "510 1899 MDA97299M001 0 DO PRAJAIN... \n", "511 1900 MDA97299M0011 BASIC lVI ... \n", "512 1901 MDA97299M0012 BASIC JERRYCO... \n", "513 1902 MDA97299M0013 DO DIAMOND... \n", "514 1903 MDA9769630014 P00007 SDLINC ... \n", "515 1904 ... \n", "516 1905 \n", "517 \f", " \n", "\n", "[518 rows x 1 columns]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "You'll recall passing `sep=\"\\t\"` (tabs) or `sep=\"|\"` (pipes) to tell `pd.read_csv` which delimiter to look for.\n", "\n", "The trick here is to look for spaces. Fortunately, we don't have to convert spaces to tabs. We just tell it that a run of whitespace is the delimiter, using a standard regex: `\\s+` matches one or more whitespace characters. Below we use `\\s\\s+` (two or more), so that a single space inside a field, as in `NEXT GENERATION INTERNET`, doesn't split it.\n", "\n", "Regex separators need the slower `python` parsing engine; pandas falls back to it with a warning unless you pass `engine='python'`.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mljones/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py:648: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators; you can avoid this warning by specifying engine='python'.\n", " ParserWarning)\n" ] } ], "source": [ "darpa1999=pd.read_table(\"12-F-1039_1999-DARPA-Funding-List.txt\", sep=\"\\s\\s+\", encoding=\"UTF-8\", header=0)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " A B c \\\n", "0 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE \n", "1 2 None None \n", "2 1420 1999 MDA97292J 1029 GR20 \n", "3 1421 MDA97292J1 029 GR22 \n", "4 1422 MDA97292J1 029 GR22 \n", "5 1423 MDA97292J1 029 P00025 \n", "6 1424 MDA972931 0030 P00009 \n", "7 1425 MDA9729320014 P00017 \n", "8 1426 MDA97293C0016 P00043 \n", "9 1427 MDA97294C0003 A00003 \n", "10 1428 MDA97294C0003 P00026 \n", "11 1429 MDA97294C0003 P00027 \n", "12 1430 MDA97294C0003 P00028 \n", "13 1431 MDA97294C0003 P00029 \n", "14 1432 MDA97294C0003 P00030 \n", "15 1433 MDA97294C0003 P00031 \n", "16 1434 MDA97294C0003 P00032 \n", "17 1435 MDA97294C0016 P00026 \n", "18 1436 MDA97294C0016 P00027 \n", "19 1437 MDA97294C0016 P00028 \n", "20 1438 MDA97294C0016 P00029 \n", "21 1439 MDA97294C0016 P00030 \n", "22 1440 MDA97294D0001 D003/P16 \n", "23 1441 MDA97294D0001 0032/3 \n", "24 1442 MDA97294D0001 003202 \n", "25 1443 MDA972951 0016 GR03 \n", "26 1444 MDA9729530027 P00014 \n", "27 1445 MDA9729530029 A00009 \n", "28 1446 MDA9729530029 GR0008 \n", "29 1447 MDA9729530036 GR06 \n", ".. ... ... ... 
\n", "488 1879 M DA97299F0028 DO \n", "489 1880 MDA97299F0028 D001 \n", "490 1881 MDA97299F0029 DO \n", "491 1882 MDA97299F0030 BASIC \n", "492 1883 MDA97299F0031 BASIC \n", "493 1884 MDA97299F0032 DO \n", "494 1885 MDA97299F0033 DO \n", "495 1886 MDA97299F0033 P00001 \n", "496 1887 MDA97299F0034 BASIC \n", "497 1888 MDA97299M0002 DO \n", "498 1889 MDA97299M0003 DO \n", "499 A B c \n", "500 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE \n", "501 2 None None \n", "502 1890 MDA97299M0004 DO \n", "503 1891 MDA97299M0004 P00001 \n", "504 1892 MDA97299M0004 P00002 \n", "505 1893 MDA97299M0005 DO \n", "506 1894 MDA97299M0005 P00001 \n", "507 1895 MDA97299M0006 DO \n", "508 1896 MDA97299M0007 DO \n", "509 1897 MDA97299M0008 BASIC \n", "510 1898 MDA97299M0009 DO \n", "511 1899 MDA97299M001 0 DO \n", "512 1900 MDA97299M0011 BASIC \n", "513 1901 MDA97299M0012 BASIC \n", "514 1902 MDA97299M0013 DO \n", "515 1903 MDA9769630014 P00007 \n", "516 1904 FY SUBTOTAL: $340,495,021.94 None \n", "517 1905 None None \n", "\n", " D E F \\\n", "0 AWARD DATE AMOUNT None \n", "1 None None None \n", "2 CNRI INFORMATION MANAGEMENT 12/10/1998 \n", "3 CNRI COMMUNICATOR 4/22/1999 \n", "4 CNRI WEBINABOX 4122/1999 \n", "5 CNRI WEBINABOX 8/24/1999 \n", "6 GEORGIATEC HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 \n", "7 USDISPLAYC FLAT PANEL DISPLAYS 8116/1999 \n", "8 SYSPLANCOR CHPS: Combat Hybrid Power Systems 1nt1999 \n", "9 BELLATLANT NEXT GENERATION INTERNET 8/28/1998 \n", "10 BELLATLANT NEXT GENERATION INTERNET 1/2011999 \n", "11 BELLATLANT NEXT GENERATION INTERNET 2/4/1999 \n", "12 BELLATLANT NEXT GENERATION INTERNET 2/22/1999 \n", "13 BELLATLANT NEXT GENERATION INTERNET 3/1/1999 \n", "14 BELLATLANT NEXT GENERATION INTERNET 4/1 2/1999 \n", "15 BELLATLANT NEXT GENERATION INTERNET 4/1 3/1999 \n", "16 BELLATLANT NEXT GENERATION INTERNET 9/8/1999 \n", "17 BDMFEDERAL STOWACTD 2/1 2/1999 \n", "18 BDMFEDERAL STOWACTD 3/1/1999 \n", "19 BDMFEDERAL IMAGE UNDERSTANDING 3/2911999 \n", "20 
BDMFEDERAL STOWACTD 5/27/1999 \n", "21 BDMFEDERAL STOWACTD 911 /1999 \n", "22 VRT BADD 12/9/1998 \n", "23 VRT AGILE INFO CONTROL ENVIRONMENT 2/12/1999 \n", "24 VALLEYELEC AGILE INFO CONTROL ENVIRONMENT 12/22/1998 \n", "25 ARIZONASTA VLSI PHOTONICS 3/1 5/1999 \n", "26 BELLCORE BROADBAND INFORMATION TECHNOLOGY 1/4/1999 \n", "27 PLANARAMER HIGH DEFINITION SYSTEMS (HDS) 5/4/1999 \n", "28 PLANARAMER HIGH DEFINITION SYSTEMS (HDS) 11/10/1998 \n", "29 ITNENERGYS PHOTOVOLTAICS (VP) 11/1 8/1998 \n", ".. ... ... ... \n", "488 DIGITSYSIN CONTRACT ADMINISTRATION 7/14/1999 \n", "489 DIGITSYSIN CONTRACTS MANAGEMENT 6/30/1999 \n", "490 DTAI TECH INTEGRATION CENTER/TECH DEV CENTER 8/4/1999 \n", "491 BOOZALLEN POLYMER MATERIALS (CONG ADD) 5/15/1999 \n", "492 SCHAFER CEROS (FENCED) 8/2/1999 \n", "493 BRADSONCOR ADVANCED SHIP/SENSOR SYSTEMS MRN-02 8/9/1999 \n", "494 SYSPLANCOR CONTRACTS MANAGEMENT 8/30/1999 \n", "495 SYSPLANCOR CONTRACTS MANAGEMENT 9/13/1999 \n", "496 DIGITSYSIN CONTRACTS MANAGEMENT 8/31/1999 \n", "497 INFOSYSLAB ADVANCED GROUND SURVELLIANCE 3/1211999 \n", "498 SRC ADVANCED MICROELECTRONICS 4/14/1999 \n", "499 D E F \n", "500 AWARD DATE AMOUNT None \n", "501 None None None \n", "502 ARDAK BW MEDICAL DIAGNOSTICS 3/30/1999 \n", "503 ARDAK BW MEDICAL DIAGNOSTICS 5/26/1999 \n", "504 ARDAK BW MEDICAL DIAGNOSTICS 8/4/1999 \n", "505 SHA SENSOR EMULATION 5/4/1999 \n", "506 SHA SENSOR EMULATION 5/12/1999 \n", "507 VISTARESEA UNDERSEA LITTORAL WARFARE 4/12/1999 \n", "508 VISUALEYES COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND 5/3/1999 \n", "509 BLUE RIDGE OFFICE/PROGRAM SUPPORT (related to VTAX4) 5/11/1999 \n", "510 QRI ADVANCED SIMULATION TECH 6/29/1999 \n", "511 PRAJAINC COUNTER MEASURES 6/14/1999 \n", "512 lVI COUNTER MEASURES 7/16/1999 \n", "513 JERRYCOOKE CONTRACT ADMINISTRATION 5/3/1999 \n", "514 DIAMONDBAC TECH INTEGRATION CENTER/TECH DEV CENTER 9/8/1999 \n", "515 SDLINC SOLAR BLIND DETECTORS 7/9/1999 \n", "516 None None None \n", "517 None None None \n", "\n", " G \n", "0 
None \n", "1 None \n", "2 $687,000.00 \n", "3 $400,000.00 \n", "4 $360,000.00 \n", "5 $0.00 \n", "6 $1 ,210,694.00 \n", "7 $5,794,000.00 \n", "8 $79,441.00 \n", "9 $0.00 \n", "10 $332,197.00 \n", "11 $94,750.00 \n", "12 $450,000.00 \n", "13 $254,750.00 \n", "14 $0.00 \n", "15 $254,750.00 \n", "16 $254,750.00 \n", "17 $117,000.00 \n", "18 $273,000.00 \n", "19 $150,166.00 \n", "20 $40,000.00 \n", "21 $55,930.00 \n", "22 $73,374.00 \n", "23 $100,095.00 \n", "24 $100,095.00 \n", "25 $149,984.00 \n", "26 $4,547,200.00 \n", "27 $0.00 \n", "28 $7,570,137.00 \n", "29 $558,900.00 \n", ".. ... \n", "488 $90,000.00 \n", "489 $4,422.00 \n", "490 $100,000.00 \n", "491 $423,916.45 \n", "492 $59,972.00 \n", "493 $43,425.18 \n", "494 $37,075.00 \n", "495 $0.00 \n", "496 $64,755.00 \n", "497 $99,729.00 \n", "498 $10,000.00 \n", "499 G \n", "500 None \n", "501 None \n", "502 $99,970.00 \n", "503 $0.00 \n", "504 $0.00 \n", "505 $100,000.00 \n", "506 $0.00 \n", "507 $74,827.00 \n", "508 $59,500.00 \n", "509 $48,566.00 \n", "510 $99,494.00 \n", "511 $80,460.00 \n", "512 $90,000.00 \n", "513 $100,000.00 \n", "514 $50,000.00 \n", "515 $0.00 \n", "516 None \n", "517 None \n", "\n", "[518 rows x 7 columns]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [], "source": [ "darpa1999.columns=[\"Number\", \"CONTRACT_NUMBER\", \"CONTRACT_MOD\", \"PERFORMER\",\"PROGRAM_TITLE\",\"AWARD_DATE\",\"AMOUNT\"]" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Number CONTRACT_NUMBER CONTRACT_MOD \\\n", "0 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE \n", "1 2 None None \n", "2 1420 1999 MDA97292J 1029 GR20 \n", "3 1421 MDA97292J1 029 GR22 \n", "4 1422 MDA97292J1 029 GR22 \n", "5 1423 MDA97292J1 029 P00025 \n", "6 1424 MDA972931 0030 P00009 \n", "7 1425 MDA9729320014 P00017 \n", "8 1426 MDA97293C0016 P00043 \n", "9 1427 MDA97294C0003 A00003 \n", "10 1428 MDA97294C0003 P00026 \n", "11 1429 MDA97294C0003 P00027 \n", "12 1430 MDA97294C0003 P00028 \n", "13 1431 MDA97294C0003 P00029 \n", "14 1432 MDA97294C0003 P00030 \n", "15 1433 MDA97294C0003 P00031 \n", "16 1434 MDA97294C0003 P00032 \n", "17 1435 MDA97294C0016 P00026 \n", "18 1436 MDA97294C0016 P00027 \n", "19 1437 MDA97294C0016 P00028 \n", "20 1438 MDA97294C0016 P00029 \n", "21 1439 MDA97294C0016 P00030 \n", "22 1440 MDA97294D0001 D003/P16 \n", "23 1441 MDA97294D0001 0032/3 \n", "24 1442 MDA97294D0001 003202 \n", "25 1443 MDA972951 0016 GR03 \n", "26 1444 MDA9729530027 P00014 \n", "27 1445 MDA9729530029 A00009 \n", "28 1446 MDA9729530029 GR0008 \n", "29 1447 MDA9729530036 GR06 \n", ".. ... ... ... 
\n", "488 1879 M DA97299F0028 DO \n", "489 1880 MDA97299F0028 D001 \n", "490 1881 MDA97299F0029 DO \n", "491 1882 MDA97299F0030 BASIC \n", "492 1883 MDA97299F0031 BASIC \n", "493 1884 MDA97299F0032 DO \n", "494 1885 MDA97299F0033 DO \n", "495 1886 MDA97299F0033 P00001 \n", "496 1887 MDA97299F0034 BASIC \n", "497 1888 MDA97299M0002 DO \n", "498 1889 MDA97299M0003 DO \n", "499 A B c \n", "500 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE \n", "501 2 None None \n", "502 1890 MDA97299M0004 DO \n", "503 1891 MDA97299M0004 P00001 \n", "504 1892 MDA97299M0004 P00002 \n", "505 1893 MDA97299M0005 DO \n", "506 1894 MDA97299M0005 P00001 \n", "507 1895 MDA97299M0006 DO \n", "508 1896 MDA97299M0007 DO \n", "509 1897 MDA97299M0008 BASIC \n", "510 1898 MDA97299M0009 DO \n", "511 1899 MDA97299M001 0 DO \n", "512 1900 MDA97299M0011 BASIC \n", "513 1901 MDA97299M0012 BASIC \n", "514 1902 MDA97299M0013 DO \n", "515 1903 MDA9769630014 P00007 \n", "516 1904 FY SUBTOTAL: $340,495,021.94 None \n", "517 1905 None None \n", "\n", " PERFORMER PROGRAM_TITLE AWARD_DATE \\\n", "0 AWARD DATE AMOUNT None \n", "1 None None None \n", "2 CNRI INFORMATION MANAGEMENT 12/10/1998 \n", "3 CNRI COMMUNICATOR 4/22/1999 \n", "4 CNRI WEBINABOX 4122/1999 \n", "5 CNRI WEBINABOX 8/24/1999 \n", "6 GEORGIATEC HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 \n", "7 USDISPLAYC FLAT PANEL DISPLAYS 8116/1999 \n", "8 SYSPLANCOR CHPS: Combat Hybrid Power Systems 1nt1999 \n", "9 BELLATLANT NEXT GENERATION INTERNET 8/28/1998 \n", "10 BELLATLANT NEXT GENERATION INTERNET 1/2011999 \n", "11 BELLATLANT NEXT GENERATION INTERNET 2/4/1999 \n", "12 BELLATLANT NEXT GENERATION INTERNET 2/22/1999 \n", "13 BELLATLANT NEXT GENERATION INTERNET 3/1/1999 \n", "14 BELLATLANT NEXT GENERATION INTERNET 4/1 2/1999 \n", "15 BELLATLANT NEXT GENERATION INTERNET 4/1 3/1999 \n", "16 BELLATLANT NEXT GENERATION INTERNET 9/8/1999 \n", "17 BDMFEDERAL STOWACTD 2/1 2/1999 \n", "18 BDMFEDERAL STOWACTD 3/1/1999 \n", "19 BDMFEDERAL IMAGE UNDERSTANDING 
3/2911999 \n", "20 BDMFEDERAL STOWACTD 5/27/1999 \n", "21 BDMFEDERAL STOWACTD 911 /1999 \n", "22 VRT BADD 12/9/1998 \n", "23 VRT AGILE INFO CONTROL ENVIRONMENT 2/12/1999 \n", "24 VALLEYELEC AGILE INFO CONTROL ENVIRONMENT 12/22/1998 \n", "25 ARIZONASTA VLSI PHOTONICS 3/1 5/1999 \n", "26 BELLCORE BROADBAND INFORMATION TECHNOLOGY 1/4/1999 \n", "27 PLANARAMER HIGH DEFINITION SYSTEMS (HDS) 5/4/1999 \n", "28 PLANARAMER HIGH DEFINITION SYSTEMS (HDS) 11/10/1998 \n", "29 ITNENERGYS PHOTOVOLTAICS (VP) 11/1 8/1998 \n", ".. ... ... ... \n", "488 DIGITSYSIN CONTRACT ADMINISTRATION 7/14/1999 \n", "489 DIGITSYSIN CONTRACTS MANAGEMENT 6/30/1999 \n", "490 DTAI TECH INTEGRATION CENTER/TECH DEV CENTER 8/4/1999 \n", "491 BOOZALLEN POLYMER MATERIALS (CONG ADD) 5/15/1999 \n", "492 SCHAFER CEROS (FENCED) 8/2/1999 \n", "493 BRADSONCOR ADVANCED SHIP/SENSOR SYSTEMS MRN-02 8/9/1999 \n", "494 SYSPLANCOR CONTRACTS MANAGEMENT 8/30/1999 \n", "495 SYSPLANCOR CONTRACTS MANAGEMENT 9/13/1999 \n", "496 DIGITSYSIN CONTRACTS MANAGEMENT 8/31/1999 \n", "497 INFOSYSLAB ADVANCED GROUND SURVELLIANCE 3/1211999 \n", "498 SRC ADVANCED MICROELECTRONICS 4/14/1999 \n", "499 D E F \n", "500 AWARD DATE AMOUNT None \n", "501 None None None \n", "502 ARDAK BW MEDICAL DIAGNOSTICS 3/30/1999 \n", "503 ARDAK BW MEDICAL DIAGNOSTICS 5/26/1999 \n", "504 ARDAK BW MEDICAL DIAGNOSTICS 8/4/1999 \n", "505 SHA SENSOR EMULATION 5/4/1999 \n", "506 SHA SENSOR EMULATION 5/12/1999 \n", "507 VISTARESEA UNDERSEA LITTORAL WARFARE 4/12/1999 \n", "508 VISUALEYES COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND 5/3/1999 \n", "509 BLUE RIDGE OFFICE/PROGRAM SUPPORT (related to VTAX4) 5/11/1999 \n", "510 QRI ADVANCED SIMULATION TECH 6/29/1999 \n", "511 PRAJAINC COUNTER MEASURES 6/14/1999 \n", "512 lVI COUNTER MEASURES 7/16/1999 \n", "513 JERRYCOOKE CONTRACT ADMINISTRATION 5/3/1999 \n", "514 DIAMONDBAC TECH INTEGRATION CENTER/TECH DEV CENTER 9/8/1999 \n", "515 SDLINC SOLAR BLIND DETECTORS 7/9/1999 \n", "516 None None None \n", "517 None None None \n", 
"\n", " AMOUNT \n", "0 None \n", "1 None \n", "2 $687,000.00 \n", "3 $400,000.00 \n", "4 $360,000.00 \n", "5 $0.00 \n", "6 $1 ,210,694.00 \n", "7 $5,794,000.00 \n", "8 $79,441.00 \n", "9 $0.00 \n", "10 $332,197.00 \n", "11 $94,750.00 \n", "12 $450,000.00 \n", "13 $254,750.00 \n", "14 $0.00 \n", "15 $254,750.00 \n", "16 $254,750.00 \n", "17 $117,000.00 \n", "18 $273,000.00 \n", "19 $150,166.00 \n", "20 $40,000.00 \n", "21 $55,930.00 \n", "22 $73,374.00 \n", "23 $100,095.00 \n", "24 $100,095.00 \n", "25 $149,984.00 \n", "26 $4,547,200.00 \n", "27 $0.00 \n", "28 $7,570,137.00 \n", "29 $558,900.00 \n", ".. ... \n", "488 $90,000.00 \n", "489 $4,422.00 \n", "490 $100,000.00 \n", "491 $423,916.45 \n", "492 $59,972.00 \n", "493 $43,425.18 \n", "494 $37,075.00 \n", "495 $0.00 \n", "496 $64,755.00 \n", "497 $99,729.00 \n", "498 $10,000.00 \n", "499 G \n", "500 None \n", "501 None \n", "502 $99,970.00 \n", "503 $0.00 \n", "504 $0.00 \n", "505 $100,000.00 \n", "506 $0.00 \n", "507 $74,827.00 \n", "508 $59,500.00 \n", "509 $48,566.00 \n", "510 $99,494.00 \n", "511 $80,460.00 \n", "512 $90,000.00 \n", "513 $100,000.00 \n", "514 $50,000.00 \n", "515 $0.00 \n", "516 None \n", "517 None \n", "\n", "[518 rows x 7 columns]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [], "source": [ "darpa1999=darpa1999[2:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A different problem! The column titles are repeated at the top of each sheet!\n", "\n", "There are lots of ways to find and eliminate the unnecessary rows.\n", "\n", "In many cases this means you'll have column names appearing as values. Pick out just those rows and drop them to clean your data.\n" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Number CONTRACT_NUMBER CONTRACT_MOD PERFORMER PROGRAM_TITLE AWARD_DATE \\\n", "49 A B c D E F \n", "99 A B c D E F \n", "149 A B c D E F \n", "199 A B c D E F \n", "249 A B c D E F \n", "299 A B c D E F \n", "349 A B c D E F \n", "399 A B c D E F \n", "449 A B c D E F \n", "499 A B c D E F \n", "\n", " AMOUNT \n", "49 G \n", "99 G \n", "149 G \n", "199 G \n", "249 G \n", "299 G \n", "349 G \n", "399 G \n", "449 G \n", "499 G " ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999[darpa1999[\"Number\"]==(\"A\")]" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Number CONTRACT_NUMBER CONTRACT_MOD PERFORMER \\\n", "50 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "100 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "150 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "200 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "250 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "300 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "350 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "400 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "500 1 FY CONTRACT NUMBER CONTRACT MOD PERFORMER PROGRAM TITLE AWARD DATE \n", "\n", " PROGRAM_TITLE AWARD_DATE AMOUNT \n", "50 AMOUNT None None \n", "100 AMOUNT None None \n", "150 AMOUNT None None \n", "200 AMOUNT None None \n", "250 AMOUNT None None \n", "300 AMOUNT None None \n", "350 AMOUNT None None \n", "400 AMOUNT None None \n", "500 AMOUNT None None " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999[darpa1999[\"Number\"]==(\"1 FY\")]" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [], "source": [ "rows_to_include=(darpa1999[\"Number\"]!=\"A\") & (darpa1999[\"Number\"]!=\"1 FY\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": true }, "outputs": [], "source": [ "darpa1999=darpa1999[rows_to_include]" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Number CONTRACT_NUMBER CONTRACT_MOD PERFORMER \\\n", "2 1420 1999 MDA97292J 1029 GR20 CNRI \n", "3 1421 MDA97292J1 029 GR22 CNRI \n", "4 1422 MDA97292J1 029 GR22 CNRI \n", "5 1423 MDA97292J1 029 P00025 CNRI \n", "6 1424 MDA972931 0030 P00009 GEORGIATEC \n", "7 1425 MDA9729320014 P00017 USDISPLAYC \n", "8 1426 MDA97293C0016 P00043 SYSPLANCOR \n", "9 1427 MDA97294C0003 A00003 BELLATLANT \n", "10 1428 MDA97294C0003 P00026 BELLATLANT \n", "11 1429 MDA97294C0003 P00027 BELLATLANT \n", "12 1430 MDA97294C0003 P00028 BELLATLANT \n", "13 1431 MDA97294C0003 P00029 BELLATLANT \n", "14 1432 MDA97294C0003 P00030 BELLATLANT \n", "15 1433 MDA97294C0003 P00031 BELLATLANT \n", "16 1434 MDA97294C0003 P00032 BELLATLANT \n", "17 1435 MDA97294C0016 P00026 BDMFEDERAL \n", "18 1436 MDA97294C0016 P00027 BDMFEDERAL \n", "19 1437 MDA97294C0016 P00028 BDMFEDERAL \n", "20 1438 MDA97294C0016 P00029 BDMFEDERAL \n", "21 1439 MDA97294C0016 P00030 BDMFEDERAL \n", "22 1440 MDA97294D0001 D003/P16 VRT \n", "23 1441 MDA97294D0001 0032/3 VRT \n", "24 1442 MDA97294D0001 003202 VALLEYELEC \n", "25 1443 MDA972951 0016 GR03 ARIZONASTA \n", "26 1444 MDA9729530027 P00014 BELLCORE \n", "27 1445 MDA9729530029 A00009 PLANARAMER \n", "28 1446 MDA9729530029 GR0008 PLANARAMER \n", "29 1447 MDA9729530036 GR06 ITNENERGYS \n", "30 1448 MDA9729530042 GR011 CRAYRESEAR \n", "31 1449 MDA97295C0004 P00008 UMASS \n", ".. ... ... ... ... 
\n", "486 1877 MDA97299F0025 BASIC SYSPLANCOR \n", "487 1878 MDA97299F0027 DO ORIONSCSYS \n", "488 1879 M DA97299F0028 DO DIGITSYSIN \n", "489 1880 MDA97299F0028 D001 DIGITSYSIN \n", "490 1881 MDA97299F0029 DO DTAI \n", "491 1882 MDA97299F0030 BASIC BOOZALLEN \n", "492 1883 MDA97299F0031 BASIC SCHAFER \n", "493 1884 MDA97299F0032 DO BRADSONCOR \n", "494 1885 MDA97299F0033 DO SYSPLANCOR \n", "495 1886 MDA97299F0033 P00001 SYSPLANCOR \n", "496 1887 MDA97299F0034 BASIC DIGITSYSIN \n", "497 1888 MDA97299M0002 DO INFOSYSLAB \n", "498 1889 MDA97299M0003 DO SRC \n", "501 2 None None None \n", "502 1890 MDA97299M0004 DO ARDAK \n", "503 1891 MDA97299M0004 P00001 ARDAK \n", "504 1892 MDA97299M0004 P00002 ARDAK \n", "505 1893 MDA97299M0005 DO SHA \n", "506 1894 MDA97299M0005 P00001 SHA \n", "507 1895 MDA97299M0006 DO VISTARESEA \n", "508 1896 MDA97299M0007 DO VISUALEYES \n", "509 1897 MDA97299M0008 BASIC BLUE RIDGE \n", "510 1898 MDA97299M0009 DO QRI \n", "511 1899 MDA97299M001 0 DO PRAJAINC \n", "512 1900 MDA97299M0011 BASIC lVI \n", "513 1901 MDA97299M0012 BASIC JERRYCOOKE \n", "514 1902 MDA97299M0013 DO DIAMONDBAC \n", "515 1903 MDA9769630014 P00007 SDLINC \n", "516 1904 FY SUBTOTAL: $340,495,021.94 None None \n", "517 1905 None None None \n", "\n", " PROGRAM_TITLE AWARD_DATE AMOUNT \n", "2 INFORMATION MANAGEMENT 12/10/1998 $687,000.00 \n", "3 COMMUNICATOR 4/22/1999 $400,000.00 \n", "4 WEBINABOX 4122/1999 $360,000.00 \n", "5 WEBINABOX 8/24/1999 $0.00 \n", "6 HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 $1 ,210,694.00 \n", "7 FLAT PANEL DISPLAYS 8116/1999 $5,794,000.00 \n", "8 CHPS: Combat Hybrid Power Systems 1nt1999 $79,441.00 \n", "9 NEXT GENERATION INTERNET 8/28/1998 $0.00 \n", "10 NEXT GENERATION INTERNET 1/2011999 $332,197.00 \n", "11 NEXT GENERATION INTERNET 2/4/1999 $94,750.00 \n", "12 NEXT GENERATION INTERNET 2/22/1999 $450,000.00 \n", "13 NEXT GENERATION INTERNET 3/1/1999 $254,750.00 \n", "14 NEXT GENERATION INTERNET 4/1 2/1999 $0.00 \n", "15 NEXT GENERATION INTERNET 
4/1 3/1999 $254,750.00 \n", "16 NEXT GENERATION INTERNET 9/8/1999 $254,750.00 \n", "17 STOWACTD 2/1 2/1999 $117,000.00 \n", "18 STOWACTD 3/1/1999 $273,000.00 \n", "19 IMAGE UNDERSTANDING 3/2911999 $150,166.00 \n", "20 STOWACTD 5/27/1999 $40,000.00 \n", "21 STOWACTD 911 /1999 $55,930.00 \n", "22 BADD 12/9/1998 $73,374.00 \n", "23 AGILE INFO CONTROL ENVIRONMENT 2/12/1999 $100,095.00 \n", "24 AGILE INFO CONTROL ENVIRONMENT 12/22/1998 $100,095.00 \n", "25 VLSI PHOTONICS 3/1 5/1999 $149,984.00 \n", "26 BROADBAND INFORMATION TECHNOLOGY 1/4/1999 $4,547,200.00 \n", "27 HIGH DEFINITION SYSTEMS (HDS) 5/4/1999 $0.00 \n", "28 HIGH DEFINITION SYSTEMS (HDS) 11/10/1998 $7,570,137.00 \n", "29 PHOTOVOLTAICS (VP) 11/1 8/1998 $558,900.00 \n", "30 SHOCC 6nt1999 $1 ,289,562.00 \n", "31 LARGE MILLIMETER TELESCOPE 8/30/1999 $1 ,151 ,500.00 \n", ".. ... ... ... \n", "486 COUNTER UNDERGROUND FACILITIES 6/25/1999 $251 ,924.00 \n", "487 COUNTER MEASURES 6/11/1999 $199,991 .00 \n", "488 CONTRACT ADMINISTRATION 7/14/1999 $90,000.00 \n", "489 CONTRACTS MANAGEMENT 6/30/1999 $4,422.00 \n", "490 TECH INTEGRATION CENTER/TECH DEV CENTER 8/4/1999 $100,000.00 \n", "491 POLYMER MATERIALS (CONG ADD) 5/15/1999 $423,916.45 \n", "492 CEROS (FENCED) 8/2/1999 $59,972.00 \n", "493 ADVANCED SHIP/SENSOR SYSTEMS MRN-02 8/9/1999 $43,425.18 \n", "494 CONTRACTS MANAGEMENT 8/30/1999 $37,075.00 \n", "495 CONTRACTS MANAGEMENT 9/13/1999 $0.00 \n", "496 CONTRACTS MANAGEMENT 8/31/1999 $64,755.00 \n", "497 ADVANCED GROUND SURVELLIANCE 3/1211999 $99,729.00 \n", "498 ADVANCED MICROELECTRONICS 4/14/1999 $10,000.00 \n", "501 None None None \n", "502 BW MEDICAL DIAGNOSTICS 3/30/1999 $99,970.00 \n", "503 BW MEDICAL DIAGNOSTICS 5/26/1999 $0.00 \n", "504 BW MEDICAL DIAGNOSTICS 8/4/1999 $0.00 \n", "505 SENSOR EMULATION 5/4/1999 $100,000.00 \n", "506 SENSOR EMULATION 5/12/1999 $0.00 \n", "507 UNDERSEA LITTORAL WARFARE 4/12/1999 $74,827.00 \n", "508 COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND 5/3/1999 $59,500.00 \n", "509 OFFICE/PROGRAM 
SUPPORT (related to VTAX4) 5/11/1999 $48,566.00 \n", "510 ADVANCED SIMULATION TECH 6/29/1999 $99,494.00 \n", "511 COUNTER MEASURES 6/14/1999 $80,460.00 \n", "512 COUNTER MEASURES 7/16/1999 $90,000.00 \n", "513 CONTRACT ADMINISTRATION 5/3/1999 $100,000.00 \n", "514 TECH INTEGRATION CENTER/TECH DEV CENTER 9/8/1999 $50,000.00 \n", "515 SOLAR BLIND DETECTORS 7/9/1999 $0.00 \n", "516 None None None \n", "517 None None None \n", "\n", "[497 rows x 7 columns]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Row 2 picked up a stray fiscal-year fragment (\"1420 1999\"); set the cell with a single .loc call,\n", "# since chained indexing (.loc[2][\"Number\"]) may write to a temporary copy instead of the frame.\n", "darpa1999.loc[2, \"Number\"]=\"1420\"" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Number CONTRACT_NUMBER CONTRACT_MOD PERFORMER \\\n", "2 1420 MDA97292J 1029 GR20 CNRI \n", "3 1421 MDA97292J1 029 GR22 CNRI \n", "4 1422 MDA97292J1 029 GR22 CNRI \n", "5 1423 MDA97292J1 029 P00025 CNRI \n", "6 1424 MDA972931 0030 P00009 GEORGIATEC \n", "7 1425 MDA9729320014 P00017 USDISPLAYC \n", "8 1426 MDA97293C0016 P00043 SYSPLANCOR \n", "9 1427 MDA97294C0003 A00003 BELLATLANT \n", "10 1428 MDA97294C0003 P00026 BELLATLANT \n", "11 1429 MDA97294C0003 P00027 BELLATLANT \n", "12 1430 MDA97294C0003 P00028 BELLATLANT \n", "13 1431 MDA97294C0003 P00029 BELLATLANT \n", "14 1432 MDA97294C0003 P00030 BELLATLANT \n", "15 1433 MDA97294C0003 P00031 BELLATLANT \n", "16 1434 MDA97294C0003 P00032 BELLATLANT \n", "17 1435 MDA97294C0016 P00026 BDMFEDERAL \n", "18 1436 MDA97294C0016 P00027 BDMFEDERAL \n", "19 1437 MDA97294C0016 P00028 BDMFEDERAL \n", "20 1438 MDA97294C0016 P00029 BDMFEDERAL \n", "21 1439 MDA97294C0016 P00030 BDMFEDERAL \n", "22 1440 MDA97294D0001 D003/P16 VRT \n", "23 1441 MDA97294D0001 0032/3 VRT \n", "24 1442 MDA97294D0001 003202 VALLEYELEC \n", "25 1443 MDA972951 0016 GR03 ARIZONASTA \n", "26 1444 MDA9729530027 P00014 BELLCORE \n", "27 1445 MDA9729530029 A00009 PLANARAMER \n", "28 1446 MDA9729530029 GR0008 PLANARAMER \n", "29 1447 MDA9729530036 GR06 ITNENERGYS \n", "30 1448 MDA9729530042 GR011 CRAYRESEAR \n", "31 1449 MDA97295C0004 P00008 UMASS \n", ".. ... ... ... ... 
\n", "486 1877 MDA97299F0025 BASIC SYSPLANCOR \n", "487 1878 MDA97299F0027 DO ORIONSCSYS \n", "488 1879 M DA97299F0028 DO DIGITSYSIN \n", "489 1880 MDA97299F0028 D001 DIGITSYSIN \n", "490 1881 MDA97299F0029 DO DTAI \n", "491 1882 MDA97299F0030 BASIC BOOZALLEN \n", "492 1883 MDA97299F0031 BASIC SCHAFER \n", "493 1884 MDA97299F0032 DO BRADSONCOR \n", "494 1885 MDA97299F0033 DO SYSPLANCOR \n", "495 1886 MDA97299F0033 P00001 SYSPLANCOR \n", "496 1887 MDA97299F0034 BASIC DIGITSYSIN \n", "497 1888 MDA97299M0002 DO INFOSYSLAB \n", "498 1889 MDA97299M0003 DO SRC \n", "501 2 None None None \n", "502 1890 MDA97299M0004 DO ARDAK \n", "503 1891 MDA97299M0004 P00001 ARDAK \n", "504 1892 MDA97299M0004 P00002 ARDAK \n", "505 1893 MDA97299M0005 DO SHA \n", "506 1894 MDA97299M0005 P00001 SHA \n", "507 1895 MDA97299M0006 DO VISTARESEA \n", "508 1896 MDA97299M0007 DO VISUALEYES \n", "509 1897 MDA97299M0008 BASIC BLUE RIDGE \n", "510 1898 MDA97299M0009 DO QRI \n", "511 1899 MDA97299M001 0 DO PRAJAINC \n", "512 1900 MDA97299M0011 BASIC lVI \n", "513 1901 MDA97299M0012 BASIC JERRYCOOKE \n", "514 1902 MDA97299M0013 DO DIAMONDBAC \n", "515 1903 MDA9769630014 P00007 SDLINC \n", "516 1904 FY SUBTOTAL: $340,495,021.94 None None \n", "517 1905 None None None \n", "\n", " PROGRAM_TITLE AWARD_DATE AMOUNT \n", "2 INFORMATION MANAGEMENT 12/10/1998 $687,000.00 \n", "3 COMMUNICATOR 4/22/1999 $400,000.00 \n", "4 WEBINABOX 4122/1999 $360,000.00 \n", "5 WEBINABOX 8/24/1999 $0.00 \n", "6 HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 $1 ,210,694.00 \n", "7 FLAT PANEL DISPLAYS 8116/1999 $5,794,000.00 \n", "8 CHPS: Combat Hybrid Power Systems 1nt1999 $79,441.00 \n", "9 NEXT GENERATION INTERNET 8/28/1998 $0.00 \n", "10 NEXT GENERATION INTERNET 1/2011999 $332,197.00 \n", "11 NEXT GENERATION INTERNET 2/4/1999 $94,750.00 \n", "12 NEXT GENERATION INTERNET 2/22/1999 $450,000.00 \n", "13 NEXT GENERATION INTERNET 3/1/1999 $254,750.00 \n", "14 NEXT GENERATION INTERNET 4/1 2/1999 $0.00 \n", "15 NEXT GENERATION INTERNET 
4/1 3/1999 $254,750.00 \n", "16 NEXT GENERATION INTERNET 9/8/1999 $254,750.00 \n", "17 STOWACTD 2/1 2/1999 $117,000.00 \n", "18 STOWACTD 3/1/1999 $273,000.00 \n", "19 IMAGE UNDERSTANDING 3/2911999 $150,166.00 \n", "20 STOWACTD 5/27/1999 $40,000.00 \n", "21 STOWACTD 911 /1999 $55,930.00 \n", "22 BADD 12/9/1998 $73,374.00 \n", "23 AGILE INFO CONTROL ENVIRONMENT 2/12/1999 $100,095.00 \n", "24 AGILE INFO CONTROL ENVIRONMENT 12/22/1998 $100,095.00 \n", "25 VLSI PHOTONICS 3/1 5/1999 $149,984.00 \n", "26 BROADBAND INFORMATION TECHNOLOGY 1/4/1999 $4,547,200.00 \n", "27 HIGH DEFINITION SYSTEMS (HDS) 5/4/1999 $0.00 \n", "28 HIGH DEFINITION SYSTEMS (HDS) 11/10/1998 $7,570,137.00 \n", "29 PHOTOVOLTAICS (VP) 11/1 8/1998 $558,900.00 \n", "30 SHOCC 6nt1999 $1 ,289,562.00 \n", "31 LARGE MILLIMETER TELESCOPE 8/30/1999 $1 ,151 ,500.00 \n", ".. ... ... ... \n", "486 COUNTER UNDERGROUND FACILITIES 6/25/1999 $251 ,924.00 \n", "487 COUNTER MEASURES 6/11/1999 $199,991 .00 \n", "488 CONTRACT ADMINISTRATION 7/14/1999 $90,000.00 \n", "489 CONTRACTS MANAGEMENT 6/30/1999 $4,422.00 \n", "490 TECH INTEGRATION CENTER/TECH DEV CENTER 8/4/1999 $100,000.00 \n", "491 POLYMER MATERIALS (CONG ADD) 5/15/1999 $423,916.45 \n", "492 CEROS (FENCED) 8/2/1999 $59,972.00 \n", "493 ADVANCED SHIP/SENSOR SYSTEMS MRN-02 8/9/1999 $43,425.18 \n", "494 CONTRACTS MANAGEMENT 8/30/1999 $37,075.00 \n", "495 CONTRACTS MANAGEMENT 9/13/1999 $0.00 \n", "496 CONTRACTS MANAGEMENT 8/31/1999 $64,755.00 \n", "497 ADVANCED GROUND SURVELLIANCE 3/1211999 $99,729.00 \n", "498 ADVANCED MICROELECTRONICS 4/14/1999 $10,000.00 \n", "501 None None None \n", "502 BW MEDICAL DIAGNOSTICS 3/30/1999 $99,970.00 \n", "503 BW MEDICAL DIAGNOSTICS 5/26/1999 $0.00 \n", "504 BW MEDICAL DIAGNOSTICS 8/4/1999 $0.00 \n", "505 SENSOR EMULATION 5/4/1999 $100,000.00 \n", "506 SENSOR EMULATION 5/12/1999 $0.00 \n", "507 UNDERSEA LITTORAL WARFARE 4/12/1999 $74,827.00 \n", "508 COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND 5/3/1999 $59,500.00 \n", "509 OFFICE/PROGRAM 
SUPPORT (related to VTAX4) 5/11/1999 $48,566.00 \n", "510 ADVANCED SIMULATION TECH 6/29/1999 $99,494.00 \n", "511 COUNTER MEASURES 6/14/1999 $80,460.00 \n", "512 COUNTER MEASURES 7/16/1999 $90,000.00 \n", "513 CONTRACT ADMINISTRATION 5/3/1999 $100,000.00 \n", "514 TECH INTEGRATION CENTER/TECH DEV CENTER 9/8/1999 $50,000.00 \n", "515 SOLAR BLIND DETECTORS 7/9/1999 $0.00 \n", "516 None None None \n", "517 None None None \n", "\n", "[497 rows x 7 columns]" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get rid of those yucky last two rows." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [], "source": [ "darpa1999=darpa1999[:-2]" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### `re.sub` works like find and replace all\n", "substitute\n", "\n", "`re.sub(\"REGULAR EXPRESSION\", replacement, string)`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "## map: `Series.map(f)` applies the function f to every value in the column\n", "## and returns a new Series; below, f is a lambda that calls re.sub." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "2 687000.00\n", "3 400000.00\n", "4 360000.00\n", "5 0.00\n", "6 1210694.00\n", "7 5794000.00\n", "8 79441.00\n", "9 0.00\n", "10 332197.00\n", "11 94750.00\n", "12 450000.00\n", "13 254750.00\n", "14 0.00\n", "15 254750.00\n", "16 254750.00\n", "17 117000.00\n", "18 273000.00\n", "19 150166.00\n", "20 40000.00\n", "21 55930.00\n", "22 73374.00\n", "23 100095.00\n", "24 100095.00\n", "25 149984.00\n", "26 4547200.00\n", "27 0.00\n", "28 7570137.00\n", "29 558900.00\n", "30 1289562.00\n", "31 1151500.00\n", " ... 
\n", "484 40000.00\n", "485 117000.00\n", "486 251924.00\n", "487 199991.00\n", "488 90000.00\n", "489 4422.00\n", "490 100000.00\n", "491 423916.45\n", "492 59972.00\n", "493 43425.18\n", "494 37075.00\n", "495 0.00\n", "496 64755.00\n", "497 99729.00\n", "498 10000.00\n", "501 \n", "502 99970.00\n", "503 0.00\n", "504 0.00\n", "505 100000.00\n", "506 0.00\n", "507 74827.00\n", "508 59500.00\n", "509 48566.00\n", "510 99494.00\n", "511 80460.00\n", "512 90000.00\n", "513 100000.00\n", "514 50000.00\n", "515 0.00\n", "Name: AMOUNT, dtype: object" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999[\"AMOUNT\"].astype(\"str\").map(lambda x: re.sub(\"[^\\d\\.\\(\\)]\", \"\", x))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mljones/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py:4: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n" ] } ], "source": [ "## First get rid of everything not a numeral, a period, or a `(`. 
\n", "## Note: for non-Anglo-American sources, you'll need to get rid of periods, not commas.\n", "\n", "darpa1999[\"AMOUNT\"]=darpa1999[\"AMOUNT\"].astype(\"str\").map(lambda x: re.sub(\"[^\\d\\.\\(]\", \"\", x))\n", "\n", "# [^\\d\\.\\(] matches any single character that is not a digit, \".\" or \"(\"\n", "\n", "## for European style, you'd use `re.sub(\"[^\\d\\,\\(]\", \"\", x)`\n" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['687000.00', '400000.00', '360000.00', '0.00', '1210694.00',\n", " '5794000.00', '79441.00', '0.00', '332197.00', '94750.00',\n", " '450000.00', '254750.00', '0.00', '254750.00', '254750.00',\n", " '117000.00', '273000.00', '150166.00', '40000.00', '55930.00',\n", " '73374.00', '100095.00', '100095.00', '149984.00', '4547200.00',\n", " '0.00', '7570137.00', '558900.00', '1289562.00', '1151500.00',\n", " '498383.00', '2000000.00', '290000.00', '1516319.00', '210500.00',\n", " '210500.00', '0.00', '490071.00', '48000.00', '1321000.00',\n", " '423212.00', '3417.00', '5286.00', '0.00', '(109494.00', '65572.00',\n", " '(10000.00', '', '60000.00', '566045.00', '0.00', '1926000.00',\n", " '808440.00', '244629.00', '0.00', '250000.00', '0.00', '810734.00',\n", " '1000000.00', '2790100.00', '3497712.00', '300000.00', '0.00',\n", " '1606470.00', '874.00', '0.00', '0.00', '450000.00', '97000.00',\n", " '0.00', '50000.00', '333929.00', '97000.00', '0.00', '100000.00',\n", " '0.00', '500000.00', '0.00', '599878.00', '980000.00', '70000.00',\n", " '30000.00', '0.00', '2100000.00', '0.00', '0.00', '0.00',\n", " '921400.00', '100000.00', '1790000.00', '267500.00', '0.00',\n", " '91071.00', '40056.00', '0.00', '', '741000.00', '300000.00',\n", " '30000.00', '900000.00', '1950000.00', '499052.00', '10523.00',\n", " '35528.00', '49905.00', '52158.00', '51861.00', '800000.00', '0.00',\n", " '0.00', '365396.00', '500000.00', '400000.00', '300000.00', '0.00',\n", " '3474136.00', 
'3400000.00', '1380000.00', '162618.00', '0.00',\n", " '0.00', '150000.00', '15000.00', '45000.00', '0.00', '205000.00',\n", " '2897000.00', '503630.00', '0.00', '395320.00', '0.00',\n", " '6510000.00', '4600545.00', '(402700.00', '2700000.00', '898000.00',\n", " '16000.00', '650000.00', '0.00', '0.00', '9901798.00', '0.00',\n", " '20000.00', '', '55000.00', '750000.00', '0.00', '345372.00',\n", " '0.00', '209431.00', '500000.00', '290569.00', '1188000.00',\n", " '200000.00', '400795.00', '176482.00', '350000.00', '117420.00',\n", " '430533.00', '912000.00', '500000.00', '480975.00', '0.00',\n", " '105408.00', '386493.00', '107142.00', '392858.00', '199309.00',\n", " '2610132.00', '150000.00', '103950.00', '119950.00', '200000.00',\n", " '0.00', '2130000.00', '1000000.00', '9250000.00', '30950.00',\n", " '174798.00', '0.00', '0.00', '199899.00', '550000.00', '350000.00',\n", " '0.00', '200000.00', '149999.00', '100000.00', '258264.00',\n", " '327206.00', '93780.00', '', '102706.00', '0.00', '0.00',\n", " '2135000.00', '3750000.00', '100000.00', '0.00', '2200000.00',\n", " '144000.00', '4384000.00', '0.00', '1785000.00', '1700000.00',\n", " '0.00', '0.00', '56000.00', '750000.00', '1853441.00', '2036559.00',\n", " '750000.00', '20526.00', '184428.00', '256638.00', '321768.00',\n", " '204435.00', '43650.00', '350000.00', '735760.00', '1819008.00',\n", " '1500000.00', '0.00', '0.00', '7925688.00', '0.00', '800000.00',\n", " '650000.00', '186083.00', '190842.00', '(500000.00', '6565000.00',\n", " '0.00', '874010.00', '0.00', '518453.00', '0.00', '847010.00',\n", " '27000.00', '', '0.00', '3660000.00', '5600000.00', '5600000.00',\n", " '5600000.00', '5000000.00', '0.00', '5000000.00', '1427526.00',\n", " '243774.00', '0.00', '0.00', '1064928.00', '556226.00', '867000.00',\n", " '(800000.00', '200000.00', '769226.00', '1650000.00', '1915045.00',\n", " '1642515.00', '800000.00', '370416.00', '900000.00', '100000.00',\n", " '', '', '', '3333000.00', '', '4082000.00', 
'1515000.00',\n", " '687927.00', '116667.00', '583333.00', '143433.00', '349909.00',\n", " '2053443.00', '108700.00', '300000.00', '391300.00', '0.00',\n", " '6119332.00', '0.00', '1490998.00', '26658.00', '114281.00', '',\n", " '39415.00', '700000.00', '1293841.00', '0.00', '302233.00', '0.00',\n", " '199789.00', '247900.00', '0.00', '1200000.00', '124983.00', '0.00',\n", " '0.00', '294996.00', '244295.00', '0.00', '76655.00', '0.00',\n", " '0.00', '413864.00', '324929.00', '79167.00', '194148.00',\n", " '375000.00', '2000000.00', '(120000.00', '', '2000000.00', '0.00',\n", " '2586066.00', '100000.00', '100000.00', '50000.00', '100000.00',\n", " '156000.00', '0.00', '50000.00', '0.00', '100000.00', '450000.00',\n", " '50000.00', '250000.00', '850000.00', '200000.00', '380492.00',\n", " '155992.00', '130000.00', '', '210000.00', '355947.00', '80000.00',\n", " '32839.00', '79846.00', '97966.00', '392856.00', '70778.00',\n", " '24864.00', '395859.00', '124999.00', '1800000.00', '',\n", " '5883520.00', '1169500.00', '', '698000.00', '3075000.00',\n", " '3075000.00', '2311497.00', '500000.00', '200000.00', '8826140.00',\n", " '0.00', '0.00', '95524.00', '25000.00', '400000.00', '150000.00',\n", " '100000.00', '65000.00', '68000.00', '771805.00', '', '222649.00',\n", " '0.00', '242880.00', '50000.00', '3282970.00', '0.00', '365731.00',\n", " '(365731.00', '200000.00', '195867.00', '417065.00', '666262.00',\n", " '493980.00', '', '374990.00', '833281.00', '357869.00', '95943.00',\n", " '0.00', '100000.00', '299997.00', '500000.00', '0.00', '400000.00',\n", " '114000.00', '0.00', '1000000.00', '499927.00', '497.00',\n", " '218000.00', '176491.00', '0.00', '0.00', '273501.00', '0.00',\n", " '100000.00', '0.00', '594401.00', '0.00', '0.00', '0.00', '0.00',\n", " '0.00', '200000.00', '200000.00', '250000.00', '534702.00', '',\n", " '31569.00', '124928.00', '0.00', '168421.00', '80000.00',\n", " '60669.00', '140000.00', '40000.00', '20000.00', '82109.00',\n", " '60000.00', 
'104539.00', '70000.00', '', '', '59000.00',\n", " '312360.00', '290580.00', '332000.00', '320000.00', '78000.00',\n", " '285000.00', '0.00', '31697.37', '114950.00', '0.00', '393486.00',\n", " '210262.00', '300000.00', '329987.00', '500386.00', '210262.00',\n", " '129850.00', '119732.00', '119732.00', '0.00', '81772.38',\n", " '73096.56', '789358.00', '49959.00', '100000.00', '306000.00',\n", " '306000.00', '600000.00', '75000.00', '183000.00', '179993.00',\n", " '40000.00', '117000.00', '251924.00', '199991.00', '90000.00',\n", " '4422.00', '100000.00', '423916.45', '59972.00', '43425.18',\n", " '37075.00', '0.00', '64755.00', '99729.00', '10000.00', '',\n", " '99970.00', '0.00', '0.00', '100000.00', '0.00', '74827.00',\n", " '59500.00', '48566.00', '99494.00', '80460.00', '90000.00',\n", " '100000.00', '50000.00', '0.00'], dtype=object)" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999[\"AMOUNT\"].values" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mljones/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py:3: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " app.launch_new_instance()\n" ] } ], "source": [ "#make all the ( into negatives\n", "\n", "darpa1999[\"AMOUNT\"]=darpa1999[\"AMOUNT\"].astype(\"str\").map(lambda x: re.sub(\"[\\(]\", \"-\", x))" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "2 687000.00\n", "3 400000.00\n", "4 360000.00\n", "5 0.00\n", "6 1210694.00\n", "7 5794000.00\n", "8 79441.00\n", "9 0.00\n", "10 332197.00\n", "11 94750.00\n", 
"12 450000.00\n", "13 254750.00\n", "14 0.00\n", "15 254750.00\n", "16 254750.00\n", "17 117000.00\n", "18 273000.00\n", "19 150166.00\n", "20 40000.00\n", "21 55930.00\n", "22 73374.00\n", "23 100095.00\n", "24 100095.00\n", "25 149984.00\n", "26 4547200.00\n", "27 0.00\n", "28 7570137.00\n", "29 558900.00\n", "30 1289562.00\n", "31 1151500.00\n", " ... \n", "484 40000.00\n", "485 117000.00\n", "486 251924.00\n", "487 199991.00\n", "488 90000.00\n", "489 4422.00\n", "490 100000.00\n", "491 423916.45\n", "492 59972.00\n", "493 43425.18\n", "494 37075.00\n", "495 0.00\n", "496 64755.00\n", "497 99729.00\n", "498 10000.00\n", "501 NaN\n", "502 99970.00\n", "503 0.00\n", "504 0.00\n", "505 100000.00\n", "506 0.00\n", "507 74827.00\n", "508 59500.00\n", "509 48566.00\n", "510 99494.00\n", "511 80460.00\n", "512 90000.00\n", "513 100000.00\n", "514 50000.00\n", "515 0.00\n", "Name: AMOUNT, dtype: float64" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# finally, convert the column to a numeric dtype.\n", "# pd.to_numeric will do the trick (convert_objects was removed from later pandas versions);\n", "# errors=\"coerce\" turns the empty strings into NaN\n", "pd.to_numeric(darpa1999[\"AMOUNT\"], errors=\"coerce\")" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/mljones/anaconda/lib/python2.7/site-packages/IPython/kernel/__main__.py:1: SettingWithCopyWarning: \n", "A value is trying to be set on a copy of a slice from a DataFrame.\n", "Try using .loc[row_indexer,col_indexer] = value instead\n", "\n", "See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy\n", " if __name__ == '__main__':\n" ] } ], "source": [ "darpa1999[\"AMOUNT\"]=pd.to_numeric(darpa1999[\"AMOUNT\"], errors=\"coerce\")" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Number CONTRACT_NUMBER CONTRACT_MOD PERFORMER \\\n", "2 1420 MDA97292J 1029 GR20 CNRI \n", "3 1421 MDA97292J1 029 GR22 CNRI \n", "4 1422 MDA97292J1 029 GR22 CNRI \n", "5 1423 MDA97292J1 029 P00025 CNRI \n", "6 1424 MDA972931 0030 P00009 GEORGIATEC \n", "7 1425 MDA9729320014 P00017 USDISPLAYC \n", "8 1426 MDA97293C0016 P00043 SYSPLANCOR \n", "9 1427 MDA97294C0003 A00003 BELLATLANT \n", "10 1428 MDA97294C0003 P00026 BELLATLANT \n", "11 1429 MDA97294C0003 P00027 BELLATLANT \n", "12 1430 MDA97294C0003 P00028 BELLATLANT \n", "13 1431 MDA97294C0003 P00029 BELLATLANT \n", "14 1432 MDA97294C0003 P00030 BELLATLANT \n", "15 1433 MDA97294C0003 P00031 BELLATLANT \n", "16 1434 MDA97294C0003 P00032 BELLATLANT \n", "17 1435 MDA97294C0016 P00026 BDMFEDERAL \n", "18 1436 MDA97294C0016 P00027 BDMFEDERAL \n", "19 1437 MDA97294C0016 P00028 BDMFEDERAL \n", "20 1438 MDA97294C0016 P00029 BDMFEDERAL \n", "21 1439 MDA97294C0016 P00030 BDMFEDERAL \n", "22 1440 MDA97294D0001 D003/P16 VRT \n", "23 1441 MDA97294D0001 0032/3 VRT \n", "24 1442 MDA97294D0001 003202 VALLEYELEC \n", "25 1443 MDA972951 0016 GR03 ARIZONASTA \n", "26 1444 MDA9729530027 P00014 BELLCORE \n", "27 1445 MDA9729530029 A00009 PLANARAMER \n", "28 1446 MDA9729530029 GR0008 PLANARAMER \n", "29 1447 MDA9729530036 GR06 ITNENERGYS \n", "30 1448 MDA9729530042 GR011 CRAYRESEAR \n", "31 1449 MDA97295C0004 P00008 UMASS \n", ".. ... ... ... ... 
\n", "484 1875 MDA97299F0024 BASIC GRCI \n", "485 1876 MDA97299F0024 BASIC GRCI \n", "486 1877 MDA97299F0025 BASIC SYSPLANCOR \n", "487 1878 MDA97299F0027 DO ORIONSCSYS \n", "488 1879 M DA97299F0028 DO DIGITSYSIN \n", "489 1880 MDA97299F0028 D001 DIGITSYSIN \n", "490 1881 MDA97299F0029 DO DTAI \n", "491 1882 MDA97299F0030 BASIC BOOZALLEN \n", "492 1883 MDA97299F0031 BASIC SCHAFER \n", "493 1884 MDA97299F0032 DO BRADSONCOR \n", "494 1885 MDA97299F0033 DO SYSPLANCOR \n", "495 1886 MDA97299F0033 P00001 SYSPLANCOR \n", "496 1887 MDA97299F0034 BASIC DIGITSYSIN \n", "497 1888 MDA97299M0002 DO INFOSYSLAB \n", "498 1889 MDA97299M0003 DO SRC \n", "501 2 None None None \n", "502 1890 MDA97299M0004 DO ARDAK \n", "503 1891 MDA97299M0004 P00001 ARDAK \n", "504 1892 MDA97299M0004 P00002 ARDAK \n", "505 1893 MDA97299M0005 DO SHA \n", "506 1894 MDA97299M0005 P00001 SHA \n", "507 1895 MDA97299M0006 DO VISTARESEA \n", "508 1896 MDA97299M0007 DO VISUALEYES \n", "509 1897 MDA97299M0008 BASIC BLUE RIDGE \n", "510 1898 MDA97299M0009 DO QRI \n", "511 1899 MDA97299M001 0 DO PRAJAINC \n", "512 1900 MDA97299M0011 BASIC lVI \n", "513 1901 MDA97299M0012 BASIC JERRYCOOKE \n", "514 1902 MDA97299M0013 DO DIAMONDBAC \n", "515 1903 MDA9769630014 P00007 SDLINC \n", "\n", " PROGRAM_TITLE AWARD_DATE AMOUNT \n", "2 INFORMATION MANAGEMENT 12/10/1998 687000.00 \n", "3 COMMUNICATOR 4/22/1999 400000.00 \n", "4 WEBINABOX 4122/1999 360000.00 \n", "5 WEBINABOX 8/24/1999 0.00 \n", "6 HIGH DEFINITION SYSTEMS (HDS) 1/29/1999 1210694.00 \n", "7 FLAT PANEL DISPLAYS 8116/1999 5794000.00 \n", "8 CHPS: Combat Hybrid Power Systems 1nt1999 79441.00 \n", "9 NEXT GENERATION INTERNET 8/28/1998 0.00 \n", "10 NEXT GENERATION INTERNET 1/2011999 332197.00 \n", "11 NEXT GENERATION INTERNET 2/4/1999 94750.00 \n", "12 NEXT GENERATION INTERNET 2/22/1999 450000.00 \n", "13 NEXT GENERATION INTERNET 3/1/1999 254750.00 \n", "14 NEXT GENERATION INTERNET 4/1 2/1999 0.00 \n", "15 NEXT GENERATION INTERNET 4/1 3/1999 254750.00 \n", "16 
NEXT GENERATION INTERNET 9/8/1999 254750.00 \n", "17 STOWACTD 2/1 2/1999 117000.00 \n", "18 STOWACTD 3/1/1999 273000.00 \n", "19 IMAGE UNDERSTANDING 3/2911999 150166.00 \n", "20 STOWACTD 5/27/1999 40000.00 \n", "21 STOWACTD 911 /1999 55930.00 \n", "22 BADD 12/9/1998 73374.00 \n", "23 AGILE INFO CONTROL ENVIRONMENT 2/12/1999 100095.00 \n", "24 AGILE INFO CONTROL ENVIRONMENT 12/22/1998 100095.00 \n", "25 VLSI PHOTONICS 3/1 5/1999 149984.00 \n", "26 BROADBAND INFORMATION TECHNOLOGY 1/4/1999 4547200.00 \n", "27 HIGH DEFINITION SYSTEMS (HDS) 5/4/1999 0.00 \n", "28 HIGH DEFINITION SYSTEMS (HDS) 11/10/1998 7570137.00 \n", "29 PHOTOVOLTAICS (VP) 11/1 8/1998 558900.00 \n", "30 SHOCC 6nt1999 1289562.00 \n", "31 LARGE MILLIMETER TELESCOPE 8/30/1999 1151500.00 \n", ".. ... ... ... \n", "484 OFFICE/PROGRAM SUPPORT (RELATED TO VSEE8) 3/1711999 40000.00 \n", "485 WARFIGHTERS INTERNET 3/17/1999 117000.00 \n", "486 COUNTER UNDERGROUND FACILITIES 6/25/1999 251924.00 \n", "487 COUNTER MEASURES 6/11/1999 199991.00 \n", "488 CONTRACT ADMINISTRATION 7/14/1999 90000.00 \n", "489 CONTRACTS MANAGEMENT 6/30/1999 4422.00 \n", "490 TECH INTEGRATION CENTER/TECH DEV CENTER 8/4/1999 100000.00 \n", "491 POLYMER MATERIALS (CONG ADD) 5/15/1999 423916.45 \n", "492 CEROS (FENCED) 8/2/1999 59972.00 \n", "493 ADVANCED SHIP/SENSOR SYSTEMS MRN-02 8/9/1999 43425.18 \n", "494 CONTRACTS MANAGEMENT 8/30/1999 37075.00 \n", "495 CONTRACTS MANAGEMENT 9/13/1999 0.00 \n", "496 CONTRACTS MANAGEMENT 8/31/1999 64755.00 \n", "497 ADVANCED GROUND SURVELLIANCE 3/1211999 99729.00 \n", "498 ADVANCED MICROELECTRONICS 4/14/1999 10000.00 \n", "501 None None NaN \n", "502 BW MEDICAL DIAGNOSTICS 3/30/1999 99970.00 \n", "503 BW MEDICAL DIAGNOSTICS 5/26/1999 0.00 \n", "504 BW MEDICAL DIAGNOSTICS 8/4/1999 0.00 \n", "505 SENSOR EMULATION 5/4/1999 100000.00 \n", "506 SENSOR EMULATION 5/12/1999 0.00 \n", "507 UNDERSEA LITTORAL WARFARE 4/12/1999 74827.00 \n", "508 COMBAT CASUALTY DIAGNOSTICS:ULTRASOUND 5/3/1999 59500.00 \n", "509 
OFFICE/PROGRAM SUPPORT (related to VTAX4) 5/11/1999 48566.00 \n", "510 ADVANCED SIMULATION TECH 6/29/1999 99494.00 \n", "511 COUNTER MEASURES 6/14/1999 80460.00 \n", "512 COUNTER MEASURES 7/16/1999 90000.00 \n", "513 CONTRACT ADMINISTRATION 5/3/1999 100000.00 \n", "514 TECH INTEGRATION CENTER/TECH DEV CENTER 9/8/1999 50000.00 \n", "515 SOLAR BLIND DETECTORS 7/9/1999 0.00 \n", "\n", "[495 rows x 7 columns]" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " AMOUNT\n", "PROGRAM_TITLE \n", "3-D MICRO ELECTRONICS NaN\n", "6/1711999 NaN\n", "AA V: Advanced Air Vehicle -500000.00\n", "AAV: Advanced Air Vehicle 10225000.00\n", "ACMPNIP 302233.00\n", "ACTIVE NETWORKS 50000.00\n", "ACTIVE TEMPLATES 329987.00\n", "ADAPTIVE COMPUTING SYSTEMS 850000.00\n", "ADMINISTRATIVE SUPPORT 290000.00\n", "ADVANCED FLEXIBLE MANUFACTURING 3400000.00\n", "ADVANCED GROUND SURVELLIANCE 124593.00\n", "ADVANCED LITHOGRAPHY 3030000.00\n", "ADVANCED LOGISTICS TECHNOLOGY 16518949.00\n", "ADVANCED MICROELECTRONICS 10000.00\n", "ADVANCED NETWORKING TECHNOLOGY 49959.00\n", "ADVANCED SHIP/SENSOR SYSTEMS MRN-02 328425.18\n", "ADVANCED SIMULATION TECH 2947837.00\n", "AG ILE INFO CONTROL ENVIRONMENT 199899.00\n", "AGENT ADMIN SUPPORT 769226.00\n", "AGILE INFO CONTROL ENVIRONMENT 3872700.00\n", "AIM : Advanced ISR Man'!9ement 600000.00\n", "AIRBORNE COMMS NODE 16800000.00\n", "AIRBORNE VIDEO SURVEILLANCE 468486.00\n", "AM3: Affordable Multi-Missile Manufacturing 18795907.00\n", "AMOUNT NaN\n", "ANTS SEEDLINGS 50000.00\n", "APLA: SELF HEALING/TAGS/MGM 20526.00\n", "ARRMD: Affordable Rapid Response Missile Demons... 500000.00\n", "ART: Advanced Rotorcraft Technology 1500000.00\n", "BADD 642161.00\n", "... ...\n", "SECURITY SUPPORT 1321000.00\n", "SECURITY SUPPORT . 
423212.00\n", "SENSOR EMULATION 332215.00\n", "SHOCC 1289562.00\n", "SKYLINK 100000.00\n", "SLID: Small Low-Cost Interceptor Device 1859000.00\n", "SMART MATERIALS/ACTUATORS 1633760.00\n", "SMART MATERIALS/DEMOS 5280951.00\n", "SOLAR BLIND DETECTORS 250000.00\n", "STARLIGHT SUPPORT COSTS-ITO 210500.00\n", "STOWACTD 831302.00\n", "SUB STUDY (SUBMARINE PAYLOADS AND SENSORS 6180000.00\n", "SUO: SITUATION AWARENESS SYS (SAS) 19138500.00\n", "SURVIVABLTY LARGE SCALE INFO SYS 119732.00\n", "Seedlings SGT-02 242880.00\n", "Seedlings TT-06 900000.00\n", "Seedlings TT-07 200000.00\n", "TACTICAL SENSORS 290580.00\n", "TECH INTEGRATION CENTER/TECH DEV CENTER 150000.00\n", "TMR: URBAN ROBOTICS 332000.00\n", "TRVS 500000.00\n", "UCAV: Unmanned Combat Air Vehicle 10017493.00\n", "UNDERSEA LITTORAL WARFARE 1324827.00\n", "VIRTUAL ELECTROMAGNETIC TEST RANGE 498383.00\n", "VLSI PHOTONICS 149984.00\n", "WARFIGHTERS INTERNET 327000.00\n", "WATER HAMMER 2887441.00\n", "WEBINABOX 360000.00\n", "lA INTEGRATED TESTBED (INFORMATION ASSURANCE) 909090.00\n", "lA INTEGRATED TESTBEDJINFORMATION ASSURANCE) 79167.00\n", "\n", "[164 rows x 1 columns]" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#finally can do some operations\n", "darpa1999.groupby(by=\"PROGRAM_TITLE\").sum()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PERFORMER | AMOUNT
ALPHATECH | 885851.00
ALPINECONS | 16110000.00
APT I | 2809441.00
APTI | 124999.00
ARDAK | 99970.00
ARIZONASTA | 345851.00
ART I | 2290000.00
ART! | 210500.00
ARTI | 4009102.00
ATTTECH | 200000.00
AUBURNU | 395320.00
AWARD DATE | NaN
BBN | 144000.00
BDMFEDERAL | 751046.00
BELLATLANT | 1641197.00
BELLCORE | 4547200.00
BLUE RIDGE | 48566.00
BOEING | 8082268.00
BOEINGDEFS | 4238010.00
BOEINGDESP | 5883520.00
BOEINGNAIN | 76655.00
BOOZALLEN | 3301594.45
BRADSONCOR | 885747.74
CALTECH | 256638.00
CENTRA | 1810672.00
CERIDIAN | 3797712.00
CFDRESCORP | 699919.00
CNRI | 5936135.00
COLOSTU | 100000.00
CRAYRESEAR | 1289562.00
... | ...
TRACORAERO | 1447900.00
TRITECHINC | 222649.00
TRW | 5600000.00
UALABAMA | 900000.00
UARIZONA | 2790100.00
UCBERKELEY | 3260000.00
UCIRVINE | 50000.00
UCLA | 350000.00
UCLON | 156000.00
UCSANTABAR | 100000.00
UFLA | 321768.00
UILLURBCHA | 100000.00
UMASS | 1151500.00
UMINN | 1915045.00
UNIVNEWORL | 3474136.00
USCISI | 299997.00
USDISPLAYC | 5794000.00
UTAHSTU | 327618.00
UTEXAS | 350000.00
UVA | 365396.00
UWISCONSIN | 43650.00
VALLEYELEC | 100095.00
VANDERBILT | 204435.00
VEDAINC | 666262.00
VISTARESEA | 74827.00
VISUALEYES | 59500.00
VRT | 173469.00
WALCOFF | 20526.00
XEROXPARC | 1642515.00
lVI | 90000.00
\n", "

152 rows × 1 columns

\n", "
" ], "text/plain": [ " AMOUNT\n", "PERFORMER \n", "ALPHATECH 885851.00\n", "ALPINECONS 16110000.00\n", "APT I 2809441.00\n", "APTI 124999.00\n", "ARDAK 99970.00\n", "ARIZONASTA 345851.00\n", "ART I 2290000.00\n", "ART! 210500.00\n", "ARTI 4009102.00\n", "ATTTECH 200000.00\n", "AUBURNU 395320.00\n", "AWARD DATE NaN\n", "BBN 144000.00\n", "BDMFEDERAL 751046.00\n", "BELLATLANT 1641197.00\n", "BELLCORE 4547200.00\n", "BLUE RIDGE 48566.00\n", "BOEING 8082268.00\n", "BOEINGDEFS 4238010.00\n", "BOEINGDESP 5883520.00\n", "BOEINGNAIN 76655.00\n", "BOOZALLEN 3301594.45\n", "BRADSONCOR 885747.74\n", "CALTECH 256638.00\n", "CENTRA 1810672.00\n", "CERIDIAN 3797712.00\n", "CFDRESCORP 699919.00\n", "CNRI 5936135.00\n", "COLOSTU 100000.00\n", "CRAYRESEAR 1289562.00\n", "... ...\n", "TRACORAERO 1447900.00\n", "TRITECHINC 222649.00\n", "TRW 5600000.00\n", "UALABAMA 900000.00\n", "UARIZONA 2790100.00\n", "UCBERKELEY 3260000.00\n", "UCIRVINE 50000.00\n", "UCLA 350000.00\n", "UCLON 156000.00\n", "UCSANTABAR 100000.00\n", "UFLA 321768.00\n", "UILLURBCHA 100000.00\n", "UMASS 1151500.00\n", "UMINN 1915045.00\n", "UNIVNEWORL 3474136.00\n", "USCISI 299997.00\n", "USDISPLAYC 5794000.00\n", "UTAHSTU 327618.00\n", "UTEXAS 350000.00\n", "UVA 365396.00\n", "UWISCONSIN 43650.00\n", "VALLEYELEC 100095.00\n", "VANDERBILT 204435.00\n", "VEDAINC 666262.00\n", "VISTARESEA 74827.00\n", "VISUALEYES 59500.00\n", "VRT 173469.00\n", "WALCOFF 20526.00\n", "XEROXPARC 1642515.00\n", "lVI 90000.00\n", "\n", "[152 rows x 1 columns]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "darpa1999.groupby(by=\"PERFORMER\").sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##note that we've not done any work on the dates, which are filled with badly OCR'd data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## back to pdftotext\n", "\n", "`pdftotext` not use wildcards.\n", "\n", "To run on all files in a directory within the unix bash shell 
(Mac OS X, most Linux):\n", "\n", "`for file in *.pdf; do pdftotext \"$file\" \"$file.txt\"; done`\n", "\n", "RUN in the shell, not in python\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#the greater evil\n", "## image text needing to be OCR'd--optical character recognition\n", "\n", "Here proprietary solutions rule the day. :(" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Easiest, if you trust people not to be evil, and severely limited\n", "\n", "Google Drive for files < 2 MB or 10 pages.\n", "\n", "Google probably has the best OCR out there, but there's no way to access it at scale.\n", "\n", "if you've found a pdf online, you can consult Google's OCR of it via Google cache:\n", "\n", "take yer url and prefix it with:\n", "\n", "`https://webcache.googleusercontent.com/search?q=cache:{{your URL}}`\n", "\n", "\n", "Doesn't always work, and the result is challenging html that reproduces the *position* of the text\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Key commercial products\n", "- Adobe Acrobat Pro\n", " - slow\n", " - not great on bulk operations, but does the job ok. \n", " - embeds the ocr'd text within a new pdf\n", " - extract using pdftotext or from a menu item. pdftotext is the better bet\n", "\n", "- Abbyy FineReader\n", " - can do multiple languages, tables\n", " - enterprise grade stuff\n", " - not horrendously expensive\n", " - fewer-featured version *not* available in US. Not sure why.\n", " \n", "Locally, can be used on library machines." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Open source alternatives\n", "\n", "Old form of Google tech: `tesseract`\n", "\n", "Futzy: requires pdfs to be divided into individual pages, then rendered as tiff.\n", "\n", "Very linux-y world of multiple dependencies, weird incompatibilities\n", "\n", "See https://apple.stackexchange.com/questions/128384/ocr-on-pdfs-in-os-x-with-free-open-source-tools\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Another potential evil: encryption\n", "\n", "*All* major utilities honor the pdf encryption schemes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For ebooks you \"own\" (i.e. have a license), such as Kindle books, use the Calibre application and the de-DRM add-ons to extract your licensed text in a more open format." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }