{ "metadata": { "name": "", "signature": "sha256:5a829b8e30601945973e869a8c19a2d1d98695a6d27fdbc2b578ced0b5ed3cd1" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "#Data and Databases: Homework Assignment #2\n", "\n", "Please refer to [Homework #1](Homework_1.ipynb) for instructions on how to complete and turn in this homework assignment.\n", "\n", "##Problem set 1: String types\n", "\n", "In the cell below, I've defined a string in a variable called `start`. Your task is to write an expression (or series of statements) that creates a new string, which contains the character with the *next* ASCII value for each character in the string. For example, if you start with `\"abc\"`, you should end up with `\"bcd\"`: the ASCII value of `a` is 97, so the next ASCII character is `b` (98).\n", "\n", "> As a reminder: the `ord()` function returns the numerical ASCII value of a given character. (E.g., `ord('x')` evaluates to 120.) The `chr()` function does the converse: given an integer value, it returns the ASCII character corresponding to that value. (E.g., `chr(120)` evaluates to `'x'`)\n", "\n", "Hint: Try writing a list comprehension with `start` as its source list. (Or otherwise breaking `start` up into a list.)\n", "\n", "Expected output: `'beefs'`" ] }, { "cell_type": "code", "collapsed": false, "input": [ "start = \"adder\"\n", "# your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "This next problem is perhaps significantly less tricky! Add exactly one character to the code in the cell below so that the `len()` function evaluates to `5` (the expected value). (Hint: You need to somehow change the string's type.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "my_favorite_finnish_name = \"V\u00e4in\u00f6\"\n", "len(my_favorite_finnish_name)" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "7" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the next cell, there is a variable `surprise` that contains a series of bytes. (Just trust me on this---don't worry if you don't quite understand what all of the `\\\\x...` stuff means.) These bytes contain a UTF-8 encoded string. Write an expression that converts the string of bytes into a Python Unicode string and `print` the result of this expression.\n", "\n", "Expected output: Adorable cat emoji!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "surprise = '\\xf0\\x9f\\x98\\xbb'\n", "print #your expression here!" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Problem set 2: Character encoding\n", "\n", "For this problem set, you'll be working with [this text file](http://static.decontextualize.com/spanish_data.txt). This is a very important text file that contains the names of various important Spanish people and the cities in which they live. (Or rather: let's imagine that this is the case. In reality, I generated the text file randomly from a list of Spanish names and places.) Run the following cell to retrieve the data in a string variable called `sp_data`:" ] }, { "cell_type": "code", "collapsed": true, "input": [ "import urllib\n", "url = \"http://static.decontextualize.com/spanish_data.txt\"\n", "sp_data = urllib.urlopen(url).read()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see when you `print` the `sp_data` variable, there's a problem: the text file is in a strange encoding!" ] }, { "cell_type": "code", "collapsed": false, "input": [ "print sp_data" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Montoro,Adela,Legan\ufffds\n", "Ramos,Faustino,Seville\n", "Moya,Artemio,San Crist\ufffdbal de La Laguna\n", "Rom\ufffdn,Olga,Burgos\n", "G\ufffdmez,Fernando,Vitoria-Gasteiz\n", "Vargas,Pedro,Fuenlabrada\n", "Ortiz,Columbano,Terrassa\n", "M\ufffdndez,Gisela,M\ufffdlaga\n", "P\ufffdrez,Abel,Zaragoza\n", "Garc\ufffda,Agust\ufffdn,Logro\ufffdo\n", "Dur\ufffdn,Adalberto,Albacete\n", "Soler,Marcial,Santa Cruz de Tenerife\n", "Velasco,Natalia,Alicante\n", "Gallardo,Leandro,Getafe\n", "Dom\ufffdnguez,Faustino,M\ufffdlaga\n", "Vargas,Conrado,A Coru\ufffda\n", "Lozano,Mar,Albacete\n", "Dur\ufffdn,Octavio,Alicante\n", "Campos,Antonia,Getafe\n", "Campos,Baldomero,Granada\n", "Prieto,Faustino,Badajoz\n", "Pascual,Amadeo,L'Hospitalet de Llobregat\n", "Guerrero,Rita,Santander\n", "V\ufffdzquez,Acacio,Jerez de la Frontera\n", "Jim\ufffdnez,Remedios,Castell\ufffdn de la Plana\n", "\n" ] } ], "prompt_number": 5 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the following cell, write an expression that evaluates to a Unicode string containing the data in `sp_data`. Assign this expression to a variable called `sp_data_unicode` so that the `print` statement produces the expected result.\n", "\n", "Hint: In order to do this, you'll need to *determine the character encoding of the file*. You can do this in several ways: try different [codec names](https://docs.python.org/2/library/codecs.html#standard-encodings), or try looking at the file in a web browser and adjusting the web browser's encoding settings. (In Chrome, you can do this with `View > Encoding...`. Make sure to set it back to \"Auto-Detect\" when you're done.)\n", "\n", "Expected output: The same data as above, but with the Unicode question marks replaced by actual data. You can spot check this by looking for the line `Garc\u00eda,Agust\u00edn,Logro\u00f1o`." ] }, { "cell_type": "code", "collapsed": false, "input": [ "sp_data_unicode = sp_data.decode('macroman')\n", "sp_data_unicode = \"\" # replace \"\" here with your expression" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, you're doing great! For good measure, let's do a bit of data work. In the cell below, I've written a statement that assigns a list of lines from the `sp_data_unicode` variable to a list, so that `lines` is a list of strings, each with one line of data. Write a `for` loop that `print`s the name of every city name in the data that has *more than one* word. (E.g., `L'Hospitalet de Llobregat` should be included in the list, but `Albacete` should not.)\n", "\n", "Expected output:\n", "\n", "```\n", "San Crist\u00f3bal de La Laguna\n", "Santa Cruz de Tenerife\n", "A Coru\u00f1a\n", "L'Hospitalet de Llobregat\n", "Jerez de la Frontera\n", "Castell\u00f3n de la Plana\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [ "lines = [x for x in sp_data_unicode.split(\"\\n\") if len(x) > 0]\n", "for item in lines:\n", " pass # replace \"pass\" with one or more of your own statements" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "markdown", "metadata": {}, "source": [ "##Problem set 3: Regular Expressions\n", "\n", "In the following section, we're going to do a bit of digital humanities. (I guess this could also be journalism if you were... writing an investigative piece about... early 20th century American poetry?) We'll be working with the following text, Robert Frost's *The Road Not Taken*:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "poem_lines = ['Two roads diverged in a yellow wood,',\n", " 'And sorry I could not travel both',\n", " 'And be one traveler, long I stood',\n", " 'And looked down one as far as I could',\n", " 'To where it bent in the undergrowth;',\n", " '',\n", " 'Then took the other, as just as fair,',\n", " 'And having perhaps the better claim,',\n", " 'Because it was grassy and wanted wear;',\n", " 'Though as for that the passing there',\n", " 'Had worn them really about the same,',\n", " '',\n", " 'And both that morning equally lay',\n", " 'In leaves no step had trodden black.',\n", " 'Oh, I kept the first for another day!',\n", " 'Yet knowing how way leads on to way,',\n", " 'I doubted if I should ever come back.',\n", " '',\n", " 'I shall be telling this with a sigh',\n", " 'Somewhere ages and ages hence:',\n", " 'Two roads diverged in a wood, and I---',\n", " 'I took the one less travelled by,',\n", " 'And that has made all the difference.']" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 8 }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the cell above, I defined a variable `poem_lines` which has a list of lines in the poem.\n", "\n", "In the cell below, write a list comprehension (using `re.search()`) that evaluates to a list of lines containing at least one word that has exactly eight characters. (Hint: use the `\\b` anchor.)\n", "\n", "Expected result:\n", "\n", "```\n", "['Two roads diverged in a yellow wood,',\n", " 'And be one traveler, long I stood',\n", " 'Two roads diverged in a wood, and I---']\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import re\n", "# your code here" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Good! Now, in the following cell, write a list comprehension that evaluates to a list of lines in the poem that end with a four-letter word, regardless of whether or not there is punctuation following the word at the end of the line. (Hint: Try using the `?` quantifier. Is there an existing character class, or a way to *write* a character class, that matches non-alphanumeric characters?)\n", "\n", "Expected result:\n", "\n", "```\n", "['Two roads diverged in a yellow wood,',\n", " 'And sorry I could not travel both',\n", " 'Then took the other, as just as fair,',\n", " 'Because it was grassy and wanted wear;',\n", " 'Had worn them really about the same,',\n", " 'I doubted if I should ever come back.',\n", " 'I shall be telling this with a sigh']\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Okay, now a slightly trickier one. In the cell below, I've created a string `all_lines` which evaluates to the entire text of the poem in one string. Write an expression that evaluates to all of the words in the poem that are followed by a comma. (The strings in the resulting list should *not* include the comma.) Hint: Use grouping!\n", "\n", "Expected result:\n", "\n", "```\n", "['wood',\n", " 'traveler',\n", " 'other',\n", " 'fair',\n", " 'claim',\n", " 'same',\n", " 'Oh',\n", " 'way',\n", " 'wood',\n", " 'by']\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [ "all_lines = \" \".join(poem_lines)\n", "# your code here\n" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 10 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, something super tricky. Here's a list of strings that contains a restaurant menu. Your job is to wrangle this plain text, slightly-structured data into a list of dictionaries." ] }, { "cell_type": "code", "collapsed": false, "input": [ "entrees = [\n", " \"1. Yam, Rosemary and Chicken Bowl with Hot Sauce - $10.95\",\n", " \"2. Lavender and Pepperoni Sandwich - $8.49\",\n", " \"3. Water Chestnuts and Peas Power Lunch (with mayonnaise) - $12.95\",\n", " \"4v. Artichoke, Mustard Green and Arugula with Sesame Oil over noodles - $9.95\",\n", " \"5. Flank Steak with Lentils And Tabasco Pepper With Sweet Chilli Sauce - $19.95\",\n", " \"6v. Rutabaga And Cucumber Wrap - $8.49\"\n", "]" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll need to pull out the name of the dish and the price of the dish. The `v` after the number indicates that the dish is vegetarian---you'll need to include that information in your dictionary as well.\n", "\n", "Expected output:\n", "\n", "```\n", "[{'name': 'Yam, Rosemary and Chicken Bowl with Hot Sauce',\n", " 'price': '$10.95',\n", " 'vegetarian': False},\n", " {'name': 'Lavender and Pepperoni Sandwich',\n", " 'price': '$8.49',\n", " 'vegetarian': False},\n", " {'name': 'Water Chestnuts and Peas Power Lunch (with mayonnaise)',\n", " 'price': '$12.95',\n", " 'vegetarian': False},\n", " {'name': 'Artichoke, Mustard Green and Arugula with Sesame Oil over noodles',\n", " 'price': '$9.95',\n", " 'vegetarian': True},\n", " {'name': 'Flank Steak with Lentils And Tabasco Pepper With Sweet Chilli Sauce',\n", " 'price': '$19.95',\n", " 'vegetarian': False},\n", " {'name': 'Rutabaga And Cucumber Wrap', 'price': '$8.49', 'vegetarian': True}]\n", "```" ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "markdown", "metadata": {}, "source": [ "You're done! Great work.\n", "\n", "Extra credit: Write an expression below to get the name of the most expensive item on the menu." ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 } ], "metadata": {} } ] }