{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "jupyter nbconvert PyParsing.ipynb --to slides --post serve" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Introduction to Pyparsing\n", "\n", "## Brian A. Fannin" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Installation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "http://pyparsing.wikispaces.com/\n", "http://infohost.nmt.edu/tcc/help/pubs/pyparsing/web/index.html\n", "\n", "Python versions older than 2.6 need a specific pyparsing release. For Python 2.6 and later, including 3.x, the standard install works:\n", "\n", "```\n", "pip install pyparsing\n", "```" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "from pyparsing import *" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "(['555', '-', '55', '-', '5555'], {})" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ssn = (\n", " Word(nums, exact=3) \n", " + Literal(\"-\") \n", " + Word(nums, exact=2) \n", " + Literal('-') \n", " + Word(nums, exact=4))\n", "\n", "ssn.parseString('555-55-5555')" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['555', '-', '55', '-', '5555']\n" ] } ], "source": [ "print(ssn.parseString('555-55-5555'))" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['555-55-5555']" ] }, "execution_count": 23, "metadata": {}, "output_type": 
"execute_result" } ], "source": [ "ssn = Combine(\n", " Word(nums, exact=3) \n", " + Literal(\"-\") \n", " + Word(nums, exact=2) \n", " + Literal('-') \n", " + Word(nums, exact=4))\n", "\n", "list(ssn.parseString('555-55-5555'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The value returned from a call to parseString is an object of class `pyparsing.ParseResults`." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "pyparsing.ParseResults" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mojo = ssn.parseString('555-55-5555')\n", "type(mojo)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(['555-55-5555'], {})" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mojo" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['555-55-5555']\n" ] } ], "source": [ "print(mojo)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[['888-88-8888'], ['333-33-3333']]\n" ] } ], "source": [ "some_text = \"\"\"\n", " Jane Doe's social security number is 888-88-8888 and \n", " Bob's is 333-33-3333. 
I'm not sure what Steve's number is.\n", " I think it starts with 123-45.\n", "\"\"\"\n", "\n", "print(ssn.searchString(some_text))" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['555555555']\n" ] } ], "source": [ "ssn.setParseAction(lambda toks: toks[0].replace('-', ''))\n", "print(ssn.parseString('555-55-5555'))" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "([555555555], {})" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "strip_dash = lambda toks: toks[0].replace(\"-\", \"\")\n", "convert_int = lambda toks: int(toks[0])\n", "ssn.setParseAction(strip_dash, convert_int)\n", "mojo = ssn.parseString('555-55-5555')\n", "mojo" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "555555555" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ssn = ssn.setResultsName(\"ssn\")\n", "mojo = ssn.parseString('555-55-5555')\n", "mojo.ssn" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The value for mojo is 555555555 and its type is .\n" ] } ], "source": [ "mojo = ssn.parseString('555-55-5555').ssn\n", "print('The value for mojo is {} and its type is {}.'.format(mojo, type(mojo)))" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[888888888, 333333333]" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mojo = ssn.searchString(some_text)\n", "ssns = [soc.ssn for soc in mojo]\n", "ssns" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "[32, 'ft.']\n" ] } ], "source": [ "strip_comma = lambda toks: toks[0].replace(\",\", \"\")\n", "\n", "decimal_number = (\n", " Word(nums, nums + \",\") \n", " + Optional(Literal(\".\") + Word(nums)))\n", " \n", "units = Word(alphas, alphas + \".\")\n", "\n", "observation = (\n", " decimal_number.setResultsName('measure').setParseAction(strip_comma, convert_int)\n", " + units.setResultsName('units')\n", ")\n", "\n", "test_strs = [\n", " '32 ft.',\n", " '48 feet',\n", " '14 meters',\n", " '1,000 yards',\n", "]\n", "\n", "print(observation.parseString(test_strs[0]))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "32" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "observation.parseString(test_strs[0]).measure" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[32, 48, 14, 1000]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "obs = [observation.parseString(ob).measure for ob in test_strs]\n", "obs" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['ft.', 'feet', 'meters', 'yards']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "units = [observation.parseString(ob).units for ob in test_strs]\n", "units" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data in the wild\n", "\n", "Let's use pyparsing to churn through some random HTML. I love Michael Caine, but I hate trolling the IMDB site. We'll build a parser that will strip interesting things from IMDB. 
(Yes, we could use Beautiful Soup for this.)\n", "\n", "[Michael Caine on IMDB](http://www.imdb.com/name/nm0000323/)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import requests\n", "\n", "response = requests.get('http://www.imdb.com/name/nm0000323/')\n", "response.status_code" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "261837" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(response.text)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "'\\n\\n\\n\\n\\n\\n\\n \\n \\n \\n\\n \\n \\n