{ "metadata": { "language": "Julia", "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Let's take LibExpat.jl for a whirl\n", "\n", "We'll start with a very simple chunk of XML, and then move to a more realistic example." ] }, { "cell_type": "code", "collapsed": false, "input": [ "using LibExpat" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 160 }, { "cell_type": "code", "collapsed": false, "input": [ "names(LibExpat)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 161, "text": [ "14-element Array{Symbol,1}:\n", " :LibExpat \n", " :XPStreamHandler \n", " :free \n", " :xpath \n", " :pause \n", " :ETree \n", " symbol(\"@xpath_str\")\n", " :ParsedData \n", " :stop \n", " :resume \n", " :parse \n", " :XPCallbacks \n", " :parsefile \n", " :xp_parse " ] } ], "prompt_number": 161 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use xp_parse(string) to load a chunk of XML into an ETree" ] }, { "cell_type": "code", "collapsed": false, "input": [ "sm = \"\"\"hi\n", " hey\n", " yo\n", " \"\"\" " ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 162, "text": [ "\"hi\\n hey\\n yo\\n\"" ] } ], "prompt_number": 162 }, { "cell_type": "code", "collapsed": false, "input": [ "# s is the HTML chunk defined in the realistic example further down; run that cell first\n", " et=xp_parse(s);" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 162 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use LibExpat.find(et, element_path) to return an array of ETree objects matching an element path string \n", "\n", "The LibExpat.jl README describes the format of element_path.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the structure of a simple ETree\n", "\n", " * name = tag name of top element\n", " * attr = Dict of top-level attributes\n", " * elements = array of top 
level payload/content, including junk whitespace.\n", " * parent = parent ETree. (the root node is self-referential, causing it to be displayed multiple times)\n", " " ] }, { "cell_type": "code", "collapsed": false, "input": [ "esm = xp_parse(sm)\n", "dump(esm)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "ETree" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " \n", " name: ASCIIString \"blah\"\n", " attr: Dict{String,String} len 2\n", " class: ASCIIString \"top\"\n", " id: ASCIIString \"42\"\n", " elements: Array(Union(String,ETree),(8,)) [\"hi\",\"\\n\",\" \",hey,\"\\n\",\" \",yo,\"\\n\"]\n", " parent: ETree \n", " name: ASCIIString \"\"\n", " attr: Dict{String,String} len 0\n", " elements: Array(Union(String,ETree),(1,)) [hi\n", " hey\n", " yo\n", "]\n", " parent: ETree \n", " name: ASCIIString \"\"\n", " attr: Dict{String,String} len 0\n", " elements: Array(Union(String,ETree),(1,)) [hi\n", " hey\n", " yo\n", "]\n", " parent: ETree \n", " name: ASCIIString \"\"\n", " attr: Dict{String,String} len 0\n", " elements: Array(Union(String,ETree),(1,)) [hi\n", " hey\n", " yo\n", "]\n", " parent: ETree \n", " name: ASCIIString \"\"\n", " attr: Dict{String,String} len 0\n", " elements: Array(Union(String,ETree),(1,)) [hi\n", " hey\n", " yo\n", "]\n", " parent: ETree \n" ] } ], "prompt_number": 163 }, { "cell_type": "code", "collapsed": false, "input": [ "esm.name, esm.attr" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 164, "text": [ "(\"blah\",[\"class\"=>\"top\",\"id\"=>\"42\"])" ] } ], "prompt_number": 164 }, { "cell_type": "code", "collapsed": false, "input": [ "esm.elements" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 165, "text": [ "8-element Array{Union(String,ETree),1}:\n", " \"hi\" \n", " \"\\n\" \n", " \" \" \n", " hey\n", " \"\\n\" \n", " \" \" 
\n", " yo \n", " \"\\n\" " ] } ], "prompt_number": 165 }, { "cell_type": "code", "collapsed": false, "input": [ "typeof(esm.elements[1]) <: String" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 166, "text": [ "true" ] } ], "prompt_number": 166 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract payload/contents from an element, ignoring whitespace and sub-elements" ] }, { "cell_type": "code", "collapsed": false, "input": [ "for e in esm.elements\n", " stre = strip(string(e))\n", " if length(stre)>0\n", " println(stre, \" \", typeof(e))\n", " if typeof(e) <: String\n", " println(\"Payload: \",stre)\n", " end\n", " end\n", "end\n", " " ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "hi" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " ASCIIString\n", "Payload: hi\n", "hey ETree\n", "yo ETree\n" ] } ], "prompt_number": 167 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A more realistic example\n", "\n", "Here we are scraping data from a chunk of fairly clean HTML." ] }, { "cell_type": "code", "collapsed": false, "input": [ "s=\"\"\"
\n", "\t\n", "\t\t\t
\n", "\t\t\t\n", "\n", "\n", "\t\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
Flight Info - NXXXXXX(Rogers Bleeblah #)
DateOriginDestDepartArriveHobbsFlight TimeGround TimeFlight DistanceTaxi DistanceFuelFuel/hrFuel/nmAltitudeGnd Speed
Mon, May xx, 2010KMYFXXXX10:4412:431.92 hrs1.8 hrs (1:48)0.12 hrs (0:07)177.27 nm1.32 nm16.69 gal8.68 gal/hr0.09 gal/nm9511 msl95.21 kts
\n", "\n", "
\n", "
\n", "\"\"\";" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 167 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### The // in \"/div/table//table//td\" lets the path skip intermediate layers of elements, matching td elements at any depth under /div/table" ] }, { "cell_type": "code", "collapsed": false, "input": [ "et = xp_parse(s)  # parse the HTML chunk defined above\n", "tds = LibExpat.find(et, \"/div/table//table//td\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 168, "text": [ "31-element Array{ETree,1}:\n", " Flight Info - NXXXXXX(Rogers Bleeblah #) \n", " Date \n", " Origin \n", " Dest \n", " Depart \n", " Arrive \n", " Hobbs \n", " Flight Time \n", " Ground Time \n", " Flight Distance \n", " Taxi Distance \n", " Fuel \n", " Fuel/hr \n", " \u22ee \n", " 10:44 \n", " 12:43 \n", " 1.92 hrs \n", " 1.8 hrs (1:48) \n", " 0.12 hrs (0:07) \n", " 177.27 nm \n", " 1.32 nm \n", " 16.69 gal \n", " 8.68 gal/hr \n", " 0.09 gal/nm \n", " 9511 msl \n", " 95.21 kts " ] } ], "prompt_number": 168 }, { "cell_type": "code", "collapsed": false, "input": [ "el = tds[1]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 169, "text": [ "Flight Info - NXXXXXX(Rogers Bleeblah #) " ] } ], "prompt_number": 169 }, { "cell_type": "code", "collapsed": false, "input": [ "typeof(el)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 170, "text": [ "ETree (constructor with 2 methods)" ] } ], "prompt_number": 170 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just get the text of the element:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "string(el)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 171, "text": [ "\"Flight Info - NXXXXXX(Rogers Bleeblah #) \"" ] } ], "prompt_number": 171 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check the 
attribute Dict to identify elements by class" ] }, { "cell_type": "code", "collapsed": false, "input": [ "el.attr[\"class\"]" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 172, "text": [ "\"table_header\"" ] } ], "prompt_number": 172 }, { "cell_type": "code", "collapsed": false, "input": [ "get(el.attr, \"class\",\"\")" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 173, "text": [ "\"table_header\"" ] } ], "prompt_number": 173 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Build a dictionary of labels and values by parsing element payloads\n", "\n", "To extract from dirty HTML, it might make sense to match on class=\"table_td\" or class=\"table_row_header\" and then use expat to extract payloads." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Get the aircraft ID (acid) and aircraft type (actype) from the table header" ] }, { "cell_type": "code", "collapsed": false, "input": [ "function parse_header( hdr )\n", " # hdr is the header cell text, e.g. strip(td.elements[1])\n", " # keep what follows the '-': the aircraft ID and the parenthesized type\n", " hdr = strip( split(hdr,'-')[2] )\n", " (acid, actype) = [strip(s) for s in split(hdr,'(')]\n", " # drop the trailing \"#)\" from the type\n", " actype = strip(replace(actype, \"#)\",\"\"))\n", " return (acid, actype)\n", "end " ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 174, "text": [ "parse_header (generic function with 1 method)" ] } ], "prompt_number": 174 }, { "cell_type": "code", "collapsed": false, "input": [ "parse_header( \"Flight Info - NXXXXXX (Rogers Bleeblah #) \" )" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 175, "text": [ "(\"NXXXXXX\",\"Rogers Bleeblah\")" ] } ], "prompt_number": 175 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Extract element payloads" ] }, { "cell_type": "code", "collapsed": false, "input": [ "labels = ASCIIString[]\n", "values = ASCIIString[]\n", "hdr = \"\"\n", "acid, actype = \"\", \"\"  # defined up front so the loop's assignment is visible afterwards\n", "for td in 
tds\n", " # header cell: holds the aircraft ID and type\n", " if get(td.attr,\"class\",\"\")==\"table_header\" \n", " hdr = strip(td.elements[1])\n", " (acid, actype) = parse_header(hdr)\n", " end\n", " # data cell\n", " if get(td.attr,\"class\",\"\")==\"table_td\" \n", " push!(values, strip(td.elements[1]) )\n", " end\n", " # column label cell\n", " if get(td.attr,\"class\",\"\")==\"table_row_header\" \n", " push!(labels, strip(td.elements[1]) )\n", " end\n", "end " ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 176 }, { "cell_type": "code", "collapsed": false, "input": [ "acid, actype" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 177, "text": [ "(\"NXXXXXX\",\"Rogers Bleeblah\")" ] } ], "prompt_number": 177 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the labels and values into a Dict" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dmap = Dict()\n", "for (i,el) in enumerate(labels)\n", " v = values[i]\n", " # values ending in a digit (times, dates) are kept whole;\n", " # otherwise keep only the first space-separated token, dropping the unit\n", " if '0'<=v[end]<='9'\n", " dmap[el] = v\n", " else\n", " dmap[el] = split(v,' ')[1]\n", " end\n", "end\n", "dump(dmap)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Dict" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "{Any,Any} len 15\n", " Flight Time: ASCIIString \"1.8\"\n", " Fuel/hr: ASCIIString \"8.68\"\n", " Gnd Speed: ASCIIString \"95.21\"\n", " Fuel: ASCIIString \"16.69\"\n", " Fuel/nm: ASCIIString \"0.09\"\n", " Hobbs: ASCIIString \"1.92\"\n", " Flight Distance: ASCIIString \"177.27\"\n", " Date: ASCIIString \"Mon, May xx, 2010\"\n", " Ground Time: ASCIIString \"0.12\"\n", " Taxi Distance: ASCIIString \"1.32\"\n", " Dest: ASCIIString \"XXXX\"\n", " ...\n" ] } ], "prompt_number": 178 } ], "metadata": {} } ] }