{ "metadata": { "name": "", "signature": "sha256:013d27f914b48119267bd7bc6674346e01673f59bbf4884e6ee057f2d75ed8b3" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Yes, it's really named after Monty Python" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from IPython import display" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#load lessons learned from a life wasted" ] }, { "cell_type": "code", "collapsed": false, "input": [ "display.YouTubeVideo('csyL9EC0S0c?t=24m47s')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "%load python_tour.md" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Turtle Graphics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think substack's turtle graphics was JavaScript.\n", "\n", "Mine was Python; something about its elegance appeals to the mathematician inside me. 
It's much less about engineering.\n", "\n", "As with marijuana laws, the Dutch are leading the way in readable programming.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Philosophy" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import this" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So you might want to use Python if what you're doing is mostly about munging text, or for most things that aren't about making event-driven websites." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#So let's talk about whitespace.\n", "The __off-side rule__.\n", "\n", "Who can explain what that is in football?\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def is_even(a):\n", "    if a % 2 == 0:\n", "        print('Even!')\n", "        return True\n", "    print('Odd!')\n", "    return False" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#What is Python good at?\n", "\n", "+ multiple assignment\n", "+ list comprehension\n", "+ iterating over the dictionary\n", "+ dictionary comprehension\n", "\n", "\n", "##one way to do it!\n", "\n", "\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#multiple assignment\n", "\n", "a, b, c = 'spam', 'eggs', 'parrot'\n", "print(a, b, c)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Iteration Idiom" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#Setup so the loops in this cell can run (illustrative values):\n", "mylist = ['spam', 'eggs', 'parrot']\n", "mylist_length = len(mylist)\n", "\n", "def do_something(item):\n", "    print(item)\n", "\n", "#In a C-style language the loop would look like this:\n", "#\n", "#    for (i=0; i < mylist_length; i++) {\n", "#        do_something(mylist[i]);\n", "#    }\n", "\n", "#The direct equivalent in Python would be this:\n", "\n", "i = 0\n", "while i < mylist_length:\n", "    do_something(mylist[i])\n", "    i += 1\n", "\n", "#That, however, while it works, is not considered Pythonic. It's not an idiom the Python language encourages. We could improve it. 
A typical idiom in Python to generate all the index numbers of a list would be to use something like the built-in range() function:\n", "\n", "for i in range(mylist_length):\n", "    do_something(mylist[i])\n", "\n", "#This, however, is not Pythonic either. Here is the Pythonic way, encouraged by the language itself:\n", "\n", "for element in mylist:\n", "    do_something(element)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "numbers = [1,2,3,4]\n", "for number in numbers:\n", "    print(number)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#Hacker School" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rice_crispies = {3:'Crackle', 5:'Pop'}\n", "for i in range(101):\n", "    print(i)\n", "    for flake in rice_crispies.keys():\n", "        if i % flake == 0:\n", "            print(rice_crispies[flake])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "rice_crispies.items()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "ahundred = range(101)\n", "new_list = [i*2 for i in ahundred if i % 2 == 0]\n", "print(new_list)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#slicing" ] }, { "cell_type": "code", "collapsed": false, "input": [ "new_list[0:10]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ " There's an old programming proverb which goes something like this:\n", "\n", " Show me your algorithm,\n", " and I will remain puzzled,\n", " but show me your data structure,\n", " and I will be enlightened. \n", "\n", "This is a statement about software and coding, but first and foremost it is about human cognition. 
The way my brain works is to first visualize the data and then imagine what the algorithm does to it. \n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "tel = {'jack': 4098, 'sape': 4139}\n", "tel['guido'] #raises KeyError: 'guido' is not a key in tel" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I think dictionaries are handled particularly nicely in the Python tutorial:\n", "https://docs.python.org/3/tutorial/datastructures.html#dictionaries" ] }, { "cell_type": "code", "collapsed": false, "input": [ "\n", "words = ['spam', 'spam', 'eggs', 'spam']\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "from collections import defaultdict\n", "word_counter = defaultdict(int)\n", "\n", "words = ['spam', 'spam', 'eggs', 'spam', 'parrot']\n", "\n", "for word in words:\n", "    word_counter[word] += 1\n", "\n", "print(word_counter)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now I'll show you why you might write your algorithms in Python, or at least why Peter Norvig of Google does.\n", "\n", "#Spell checker" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!wget http://norvig.com/spell-correct.html http://norvig.com/big.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "#90% of the Google spelling corrector in 21 lines of Python" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "!head -n20 big.txt" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import re, collections\n", "\n", "def words(text):\n", "    return re.findall('[a-z]+', text.lower())\n", "\n", "def train(features):\n", "    model = collections.defaultdict(int)\n", "    for f in features:\n", "        model[f] += 1\n", "    return model\n", "\n", "NWORDS = train(words(open('big.txt').read()))\n", "\n", "alphabet = 
'abcdefghijklmnopqrstuvwxyz'\n", "\n", "#all strings one edit (delete, transpose, replace, insert) away from word\n", "def edits(word):\n", "    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n", "    deletes = [a + b[1:] for a, b in splits if b]\n", "    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]\n", "    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]\n", "    inserts = [a + c + b for a, b in splits for c in alphabet]\n", "    return set(deletes + transposes + replaces + inserts)\n", "\n", "#known words that are two edits away from word\n", "def known_edits(word):\n", "    edits_of_edits = set()\n", "    for e1 in edits(word):\n", "        for e2 in edits(e1):\n", "            if e2 in NWORDS:\n", "                edits_of_edits.add(e2)\n", "    return edits_of_edits\n", "\n", "    #Norvig's way:\n", "    #return set(e2 for e1 in edits(word) for e2 in edits(e1) if e2 in NWORDS)\n", "\n", "def known(words):\n", "    return set(w for w in words if w in NWORDS)\n", "\n", "#pick the most frequent known candidate, preferring fewer edits\n", "def correct(word):\n", "    candidates = known([word]) or known(edits(word)) or known_edits(word) or [word]\n", "    return max(candidates, key=NWORDS.get)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "correct('spam')" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ "'spasm'" ] } ], "prompt_number": 12 }, { "cell_type": "code", "collapsed": false, "input": [ "word = 'spam'\n", "splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]\n", "deletes = [a + b[1:] for a, b in splits if b]\n", "transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]\n", "replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]\n", "inserts = [a + c + b for a, b in splits for c in alphabet]\n", "print(splits)\n", "print(deletes)\n", "print(transposes)\n", "print(replaces)\n", "print(inserts)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Homework\n", "\n", "+ Download [this 
book](https://www.gutenberg.org/ebooks/468.txt.utf-8) file, run the spell checker's `correct` function on every word in that book, and save the result.\n", "\n", "Follow these steps:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "!wget https://www.gutenberg.org/cache/epub/468/pg468.txt" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "code", "collapsed": false, "input": [ "manon_string = open('pg468.txt','r').read()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "manon_words = manon_string.split()[1000:2000]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 20 }, { "cell_type": "code", "collapsed": false, "input": [ "#Exercise: write a list comprehension here that applies `correct` to every word in the list manon_words\n", "transform_words = [] #placeholder: replace with your list comprehension" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 24 }, { "cell_type": "code", "collapsed": false, "input": [ "transform_string = ' '.join(transform_words)\n", "with open('transformed_manon.txt','w') as outfile:\n", "    outfile.write(transform_string)\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 25 } ], "metadata": {} } ] }