{
 "metadata": {
  "name": "",
  "signature": "sha256:f2ce8a9d9da6ebb4e9bf70b88b1a4a9e7e864beae9eede4df792a0f3ac6fdcf6"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "# Data Wrangle the OpenStreetMaps Data Set\n",
      "----------"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>As pointed out by [this](http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0) article from New York Times, the  data wrangle, or \u201cdata janitor work\u201d, requires from 50 percent to 80 percent of the time expended in a data science project. After downloading a sample of the S\u00e3o Paulo (Brazil) area, where I live, and start to clean it, I believe that I can understand what they mean by that.</b>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Importing needed libraries"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import data_wrangling as reshape\n",
      "from data_wrangling import *\n",
      "reload(reshape);"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Problems encountered in my map"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p> The first thing that I had to do was adapting the code provided by the classes to my data. For instance, the street type (Street, Avenue and so on) is located at the beginning of the street names in Brazil, not at the end, as in the sample that we used at the course. Also, as in the sample provided, the street type is usually shortened.</p>\n",
      "\n",
      "<p>As personal names are usually used to name streets in Brazil, another common feature related to street names is the appearance of a \"Title Name\" after the street type, as descibed in the documentation of the S\u00e3o Paulo street names data set provided by the <a href=\"http://www.fflch.usp.br/centrodametropole/en/716\">Center for metropolitan Studies</a>* of S\u00e3o Paulo (CMS). For example, in <em>Captain so-and-so street</em>, \"Captain\" would be the title name. As street types, title names usually are shortened as well.</p>\n",
      "\n",
      "<p><em><small>*a brazilian think tank that performs researches about the role of public policy in reducing poverty and inequality</small></em></p>\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Let's start exploring the data set. First , I checked the size of the file that I had to deal with:</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "fr_dbf=\"/Users/ucaiado/Documents/temp/LOG2014_CEM_RMSP.DBF\"\n",
      "filename='/Users/ucaiado/Dropbox/NEUTRINO/ALGO/DATA/sao-paulo_brazil.osm'"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 2
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"{} MB\".format(os.stat(filename).st_size/1024**2)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "315 MB\n"
       ]
      }
     ],
     "prompt_number": 396
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Then, I checked the tags found at the XML file, some possible problems and the number of unique users that contributed to the data set to feel what I would have to do.</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time reshape.initial_tests(filename)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "# of tags:\n",
        "{'bounds': 1,\n",
        " 'member': 38702,\n",
        " 'nd': 1826215,\n",
        " 'node': 1475441,\n",
        " 'osm': 1,\n",
        " 'relation': 4516,\n",
        " 'tag': 584178,\n",
        " 'way': 198120}\n",
        "\n",
        "\n",
        "# of potential problems on the data:"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "{'lower': 554488, 'lower_colon': 28594, 'other': 1091, 'problemchars': 5}\n",
        "\n",
        "\n",
        "# of unique users that contributed to this map:\n",
        " 1189"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "\n",
        "\n",
        "\n",
        "CPU times: user 48 s, sys: 464 ms, total: 48.4 s\n",
        "Wall time: 48.5 s\n"
       ]
      }
     ],
     "prompt_number": 387
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>To correct the street names, I had to create a dictionary of shortened words related to street types and another related to title names. The following street names are some examples of what was corrected in the OSM file:</p>\n",
      "<p>\n",
      "<pre>\n",
      "Av Jac\u00fa Pessego / Nova Trabalhadores => Avenida Jac\u00fa Pessego\n",
      "AV PEDROSO DE MORAIS => Avenida Pedroso De Morais\n",
      "Avenida Pres. Arthur Bernardes => Avenida Presidente Arthur Bernardes\n",
      "Al. Santos => Alameda Santos\n",
      "Av. Prof. L\u00facio Martins Rodrigues, travessas 4 e 5 => Avenida Professor L\u00facio Martins Rodrigues\n",
      "</pre>\n",
      "</p>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Another problem was the misspelled CITY names. There are 15 names that was wrote wrongly in the data set. In order to correct those, I adopted a similar approach to I used to correct street and title names: I iterated through the data set over and over until come up with a dictionary of wrong words (in this case, wrong names). It can be checked in the <em>data_wrangling.py</em> file in my <a href=\"https://github.com/ucaiado/OpenStreetMaps\">github</a> repository, as well as the dictionaries used to correct the street names. Here are some of the city names that were corrected:</p>\n",
      "<p>\n",
      "<pre>\n",
      "s\u00e3o paulo - sp => S\u00e3o Paulo\n",
      "s\u00e3o paulo/sp => S\u00e3o Paulo\n",
      "guaruja => Guaruj\u00e1\n",
      "</pre>\n",
      "</p>\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Finally, I reshaped all the OSM file with the aim of insert it into MongoDB. Here are some statistics about what was done on the process:<p> "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time reshape.process_map(filename, pretty = False)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "# of documents created: 1673561\n",
        "# of different street names: 1533\n",
        "# of docs where the street name was fixed: 6310\n",
        "# of docs where the city name was fixed: 5048\n",
        "\n",
        "CPU times: user 1min 27s, sys: 6.39 s, total: 1min 34s\n",
        "Wall time: 1min 34s\n"
       ]
      }
     ],
     "prompt_number": 356
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>I used the following code to insert the data into mongoDB. This method was quite fast...</p>\n",
      "\n",
      "<pre>\n",
      "\n",
      "mongoimport -db <DATABASE NAME> -c <COLLECTOIN NAME> --file /path/to/the/file.json\n",
      "</pre>\n"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>After putting the data in Mongo, I checked if it was done correctly. First, I connected to my local database:</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from pymongo import MongoClient\n",
      "client=MongoClient('localhost:27017')\n",
      "db=client.udacity"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Then, I checked if the city names are correct. It is expected 40 different city names:</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline=[\n",
      "            {\"$group\": {\"_id\": None, \"cities\":{\"$addToSet\":\"$address.city\"}}},\n",
      "            {\"$unwind\": \"$cities\"},\n",
      "            {\"$group\": { \"_id\": \"Total City Names in the data\", \"count\": { \"$sum\": 1 } } }\n",
      "          ]\n",
      "aggregate(db.osm, pipeline)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 375,
       "text": [
        "[{u'_id': u'Total City Names in the data', u'count': 40}]"
       ]
      }
     ],
     "prompt_number": 375
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Afterward, I computed the number of different street names in the database (1533 expected):</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline=[\n",
      "            {\"$group\": {\"_id\": None, \"street\":{\"$addToSet\":\"$address.street\"}}},\n",
      "            {\"$unwind\": \"$street\"},\n",
      "            {\"$group\": { \"_id\": \"Total number of different street names in the data\", \"count\": { \"$sum\": 1 } } }\n",
      "          ]\n",
      "pprint.pprint(aggregate(db.osm, pipeline)[0])\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{u'_id': u'Total number of different street names in the data', u'count': 1533}\n"
       ]
      }
     ],
     "prompt_number": 22
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Lastly, just to make sure that the OSM file was correctly process by my codes, I counted the number of unique users shown in database (1189 expected):</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline=[\n",
      "            {\"$match\": {\"type\": \"node\"}},\n",
      "            {\"$group\": {\"_id\": None, \"user\":{\"$addToSet\":\"$created.uid\"}}},\n",
      "            {\"$unwind\": \"$user\"},\n",
      "            {\"$group\": { \"_id\": \"Number of unique user in the data\", \"count\": { \"$sum\": 1 } } }\n",
      "          ]\n",
      "pprint.pprint(aggregate(db.osm, pipeline)[0])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "{u'_id': u'Number of unique user in the data', u'count': 1189}\n"
       ]
      }
     ],
     "prompt_number": 24
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Overview of the data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>As I showed earlier, there are 40 different cities in the database. The way that the unique street names was counted before is not correct for this data set , as far as the same street name in different cities is computed as the same street, but indeed they are different streets. In order to correct that, I had to group the street names by city and then counting the street names.</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline=[\n",
      "            {\"$group\": {\"_id\": \"$address.city\", \"street\":{\"$addToSet\":\"$address.street\"}}},\n",
      "            {\"$unwind\": \"$street\"},\n",
      "            {\"$group\": { \"_id\": \"Number of unique street names in the data, taking into account the city name\",\n",
      "                        \"count\": { \"$sum\": 1 } } }\n",
      "          ]\n",
      "pprint.pprint(aggregate(db.osm, pipeline))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[{u'_id': u'Number of unique street names in the data, taking into account the city name',\n",
        "  u'count': 1734}]\n"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>The number of streets, 1734, already seems small for 40 cities, given that one of them, S\u00e3o Paulo, is one of the biggest cities on the planet. Following the analysis, I computed the top 5 amenities on each city, querying it directly from MongoDB. It was tricky, as I did not find how I could \"pivot table\" the data directly from MongoDb using a single path. I solved it in the following way. First, I had to aggregate the data in order to create the needed data set structure for this analysis:</p> "
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline=[\n",
      "            {\"$match\":{\"amenity\":{\"$exists\":1}}},\n",
      "            {\"$match\":{\"address.city\":{\"$exists\":1}}},\n",
      "            {\"$group\": {\"_id\":{\"city\":\"$address.city\", \"amenity\":\"$amenity\"}, \"count\": {\"$sum\":1}}},\n",
      "            {\"$sort\" : {\"count\" : -1}},\n",
      "            {\"$project\": {\"city\": '$_id.city', 'amenity': '$_id.amenity', 'count':'$count'}},\n",
      "            {\"$group\":{'_id':\"$city\", 'result':{'$push':{\"amenity\":\"$amenity\",\"count\":\"$count\"}}}},\n",
      "            {\"$out\": \"osm_aux\"}\n",
      "          ]\n",
      "\n",
      "x=aggregate(db.osm, pipeline)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>I used two Group stages on the above pipeline *. First, I counted the data using the city name and the amenity as keys. Second, I reshaped the structure of this output in a new one, which I reshaped again using just the city as key while keeping the amenity and the count in a result list.</p>\n",
      "\n",
      "<p>Then, I inserted this output in a new collection called \"osm_aux\". Below I updated it to limit the size of each result list to 5.</p>\n",
      "\n",
      "<p><small><em>* I took this approach from [here](http://stackoverflow.com/questions/25711657/mongodb-limit-array-within-aggregate-query)</em></small></p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_collection=db.osm_aux\n",
      "my_collection.update({},{\"$push\" : {\"result\": {\"$each\": [], \"$slice\":5}}},multi=True)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 14,
       "text": [
        "{u'n': 29, u'nModified': 7, u'ok': 1, 'updatedExisting': True}"
       ]
      }
     ],
     "prompt_number": 14
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Finally, here is the top 5 amenities by city:</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "for x in my_collection.find():\n",
      "    print \"city: \" + (x[\"_id\"])\n",
      "    pprint.pprint(x[\"result\"])\n",
      "    print \"\""
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "city: Aruj\u00e1\n",
        "[{u'amenity': u'Restaurant', u'count': 1}]\n",
        "\n",
        "city: Jacare\u00ed\n",
        "[{u'amenity': u'Place_Of_Worship', u'count': 1}]\n",
        "\n",
        "city: Barueri\n",
        "[{u'amenity': u'Bank', u'count': 2}]\n",
        "\n",
        "city: Santos\n",
        "[{u'amenity': u'Pharmacy', u'count': 3},\n",
        " {u'amenity': u'Restaurant', u'count': 2},\n",
        " {u'amenity': u'Bank', u'count': 2},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 1},\n",
        " {u'amenity': u'Fast_Food', u'count': 1}]\n",
        "\n",
        "city: Mau\u00e1\n",
        "[{u'amenity': u'School', u'count': 1},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 1}]\n",
        "\n",
        "city: Itaquaquecetuba\n",
        "[{u'amenity': u'Fast_Food', u'count': 1}]\n",
        "\n",
        "city: Ferraz De Vasconcelos\n",
        "[{u'amenity': u'Place_Of_Worship', u'count': 1},\n",
        " {u'amenity': u'School', u'count': 1}]\n",
        "\n",
        "city: S\u00e3o Vicente\n",
        "[{u'amenity': u'Hospital', u'count': 1}]\n",
        "\n",
        "city: It\u00fa\n",
        "[{u'amenity': u'Place_Of_Worship', u'count': 1}]\n",
        "\n",
        "city: Cubat\u00e3o\n",
        "[{u'amenity': u'Place_Of_Worship', u'count': 1}]\n",
        "\n",
        "city: Jundia\u00ed\n",
        "[{u'amenity': u'Grave_Yard', u'count': 1},\n",
        " {u'amenity': u'Fuel', u'count': 1},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 1},\n",
        " {u'amenity': u'Clinic', u'count': 1},\n",
        " {u'amenity': u'Townhall', u'count': 1}]\n",
        "\n",
        "city: Ibi\u00fana\n",
        "[{u'amenity': u'Public_Building', u'count': 1}]\n",
        "\n",
        "city: Carapicu\u00edba\n",
        "[{u'amenity': u'College', u'count': 1}]\n",
        "\n",
        "city: Osasco\n",
        "[{u'amenity': u'Hospital', u'count': 1}, {u'amenity': u'School', u'count': 1}]\n",
        "\n",
        "city: Francisco Morato\n",
        "[{u'amenity': u'School', u'count': 1}]\n",
        "\n",
        "city: Mogi Das Cruzes\n",
        "[{u'amenity': u'Hospital', u'count': 1}, {u'amenity': u'Cafe', u'count': 1}]\n",
        "\n",
        "city: S\u00e3o Caetano Do Sul\n",
        "[{u'amenity': u'College', u'count': 2},\n",
        " {u'amenity': u'Restaurant', u'count': 1},\n",
        " {u'amenity': u'Pharmacy', u'count': 1},\n",
        " {u'amenity': u'Hospital', u'count': 1},\n",
        " {u'amenity': u'School', u'count': 1}]\n",
        "\n",
        "city: S\u00e3o Bernardo Do Campo\n",
        "[{u'amenity': u'Parking', u'count': 94},\n",
        " {u'amenity': u'School', u'count': 92},\n",
        " {u'amenity': u'Fuel', u'count': 61},\n",
        " {u'amenity': u'Restaurant', u'count': 61},\n",
        " {u'amenity': u'Bank', u'count': 55}]\n",
        "\n",
        "city: Embu Das Artes\n",
        "[{u'amenity': u'Bank', u'count': 4},\n",
        " {u'amenity': u'Parking', u'count': 4},\n",
        " {u'amenity': u'Arts_Centre', u'count': 1},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 1}]\n",
        "\n",
        "city: Guarulhos\n",
        "[{u'amenity': u'Fuel', u'count': 3},\n",
        " {u'amenity': u'School', u'count': 2},\n",
        " {u'amenity': u'University', u'count': 2},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 2},\n",
        " {u'amenity': u'Fire_Station', u'count': 1}]\n",
        "\n",
        "city: Mairipor\u00e3\n",
        "[{u'amenity': u'Parking', u'count': 1}]\n",
        "\n",
        "city: S\u00e3o Jos\u00e9 Dos Campos\n",
        "[{u'amenity': u'Community_Centre', u'count': 2},\n",
        " {u'amenity': u'Hospital', u'count': 1},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 1},\n",
        " {u'amenity': u'Cafe', u'count': 1},\n",
        " {u'amenity': u'Dentist', u'count': 1}]\n",
        "\n",
        "city: Cotia\n",
        "[{u'amenity': u'Place_Of_Worship', u'count': 2},\n",
        " {u'amenity': u'Pharmacy', u'count': 2},\n",
        " {u'amenity': u'Bank', u'count': 1},\n",
        " {u'amenity': u'Fuel', u'count': 1}]\n",
        "\n",
        "city: Suzano\n",
        "[{u'amenity': u'Fuel', u'count': 1}]\n",
        "\n",
        "city: Itanha\u00e9m\n",
        "[{u'amenity': u'Townhall', u'count': 1}]\n",
        "\n",
        "city: Diadema\n",
        "[{u'amenity': u'Fuel', u'count': 16},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 13},\n",
        " {u'amenity': u'Bank', u'count': 12},\n",
        " {u'amenity': u'School', u'count': 8},\n",
        " {u'amenity': u'Restaurant', u'count': 7}]\n",
        "\n",
        "city: Santo Andr\u00e9\n",
        "[{u'amenity': u'Bank', u'count': 28},\n",
        " {u'amenity': u'School', u'count': 9},\n",
        " {u'amenity': u'Post_Office', u'count': 8},\n",
        " {u'amenity': u'Restaurant', u'count': 5},\n",
        " {u'amenity': u'Place_Of_Worship', u'count': 5}]\n",
        "\n",
        "city: S\u00e3o Paulo\n",
        "[{u'amenity': u'Restaurant', u'count': 97},\n",
        " {u'amenity': u'School', u'count': 70},\n",
        " {u'amenity': u'Fuel', u'count': 57},\n",
        " {u'amenity': u'Bank', u'count': 50},\n",
        " {u'amenity': u'Parking', u'count': 49}]\n",
        "\n",
        "city: Tabo\u00e3o Da Serra\n",
        "[{u'amenity': u'Fuel', u'count': 1}, {u'amenity': u'School', u'count': 1}]\n",
        "\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>Summing up just the top 5 amenities by city, something curious came to the fore:</p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pipeline=[\n",
      "            {\"$unwind\": \"$result\"},\n",
      "            {\"$group\": {\"_id\":\"$_id\", \"sum\": {\"$sum\":\"$result.count\"}}},\n",
      "            {\"$sort\" : {\"sum\" : -1}},\n",
      "            {\"$limit\": 10}\n",
      "          ]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 16
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "x=aggregate(db.osm_aux, pipeline)\n",
      "pprint.pprint(x)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "[{u'_id': u'S\\xe3o Bernardo Do Campo', u'sum': 363},\n",
        " {u'_id': u'S\\xe3o Paulo', u'sum': 323},\n",
        " {u'_id': u'Diadema', u'sum': 56},\n",
        " {u'_id': u'Santo Andr\\xe9', u'sum': 55},\n",
        " {u'_id': u'Guarulhos', u'sum': 10},\n",
        " {u'_id': u'Embu Das Artes', u'sum': 10},\n",
        " {u'_id': u'Santos', u'sum': 9},\n",
        " {u'_id': u'S\\xe3o Caetano Do Sul', u'sum': 6},\n",
        " {u'_id': u'Cotia', u'sum': 6},\n",
        " {u'_id': u'S\\xe3o Jos\\xe9 Dos Campos', u'sum': 6}]\n"
       ]
      }
     ],
     "prompt_number": 17
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>The top 5 most frequently amenities are not in S\u00e3o Paulo, but in S\u00e3o Bernardo do Campo, what is near S\u00e3o Paulo. S\u00e3o Bernardo is not a small city, but for sure that it is much smaller then S\u00e3o Paulo. To draw valid conclusion from this data set I believe that I would need a data set more complete and proportional across the cities. Just to point out, the most frequent amenity in S\u00e3o Bernardo was parking.</p>\n",
      "\n",
      "<p>Latelly, as the new collection already fullfuield its mission, I can just discard it.<p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "my_collection.drop()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 18
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Other ideas about the data sets"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>In order to support what I have claimed about the size of the database,remembering that the total number of unique street names in the 40 cities in mongo was 1734, I used the streets data set provided by the CMS to count different street names just in S\u00e3o Paulo and S\u00e3o Bernardo: </p>"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%time set_aux=count_streets_in_dbf(fr_dbf)\n",
      "i_saopaulo=0\n",
      "i_saobernardo=0\n",
      "for x in set_aux:\n",
      "    if x.split(\"|\")[1].strip()=='sao paulo': i_saopaulo+=1\n",
      "    if x.split(\"|\")[1].strip()=='sao bernardo do campo': i_saobernardo+=1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "CPU times: user 30.1 s, sys: 844 ms, total: 31 s\n",
        "Wall time: 31 s\n"
       ]
      }
     ],
     "prompt_number": 58
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print \"Unique Street Names in S\u00e3o Paulo: {}\\nUnique Street Names in S\u00e3o Bernardo: {}\".format(i_saopaulo,i_saobernardo)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "Unique Street Names in S\u00e3o Paulo: 43103\n",
        "Unique Street Names in S\u00e3o Bernardo: 3289\n"
       ]
      }
     ],
     "prompt_number": 57
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>As I was expecting, the OSM file from S\u00e3o Paulo area has too few records. Just S\u00e3o Paulo city has 20 times more data points than what was computed using the OSM data set of 40 cities . </p>\n",
      "\n",
      "<p>I believe that this incompleteness in addresses can be a hurdle to new contributions, given that besides contributing with locations on the map, like coffees and shops, the user also would need to contribute with the addresses, something that people often have no idea. This problem can end up preventing apps use this database to provide their services, apps that could encourage contributions with new data.</p>\n",
      "\n",
      "<p>A way to overcome this issue would be cross the database with other sources to fill in the missing addresses, like the CMS streets data set used above. However, matching different data sets can be really challenging because open databases usually present a lot of misspelled street names, given that the portuguese language has a lot of grammar rules that people often bypass. If I had to do that, I probably would replace the special characters by their \"not-special\"* counterparts (<em>\u00e7</em> by <em>c</em>, <em>\u00e1</em> by <em>a</em>) and then I would create a dictionary of \"wrong words => right words\" before matching the existing data points.</p>\n",
      "\n",
      "\n",
      "\n",
      "\n",
      "<p><small><em>* An <a href=\"https://pypi.python.org/pypi/Unidecode/\">useful package</a>  to achieve that.</em></small></p>"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "###Conclusion"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<p>\n",
      "I believe that the Open Street Map is satisfatory for the purpose of the data wrangle exercise but too incomplete to drawing statistical conclusions. I suggested that other data sets could be used to input more addresses into the OpenStreetMap.org and that it could stimulate others to contribute with more locations.\n",
      "</p>\n",
      "\n",
      "                      "
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "<em>CSS Style</em>\n",
      "\n",
      "\n",
      "<style type=\"text/css\">\n",
      "    \n",
      "    .rendered_html ol {list-style:decimal; margin: 1em 2em;}\n",
      "    \n",
      "    div.input{\n",
      "    width: 115ex;\n",
      "    }\n",
      "    \n",
      "    div.text_cell{\n",
      "    width: 120ex;\n",
      "    }\n",
      "    \n",
      "    div.text_cell_render{\n",
      "    font-family: \"Charis SIL\",serif;\n",
      "    line-height: 145%;\n",
      "    width: 120ex;\n",
      "\n",
      "    }\n",
      "\n",
      "    div.text_cell_render h1 {\n",
      "    font-size: 18pt;\n",
      "    }\n",
      "\n",
      "    div.text_cell_render h2 {\n",
      "    font-size: 14pt;\n",
      "    }\n",
      "\n",
      "    .CodeMirror {\n",
      "    font-family: Consolas, monospace;\n",
      "    }\n",
      "\n",
      "</style>\n",
      "\n",
      "\n",
      "<style type=\"text/css\">\n",
      "    table {\n",
      "        overflow:hidden;\n",
      "        font-family: \"Lucida Sans Unicode\", \"Lucida Grande\", Sans-Serif;\n",
      "        font-size: 12px;\n",
      "        margin: 45px;\n",
      "        width: 480px;\n",
      "        text-align: left;\n",
      "        border-collapse: collapse;\n",
      "        border: 1px solid #d3d3d3;\n",
      "        -moz-border-radius:5px; FF1+;\n",
      "        -webkit-border-radius:5px; Saf3-4;\n",
      "        border-radius:5px;\n",
      "        -moz-box-shadow: 0 0 4px rgba(0, 0, 0, 0.01);    \n",
      "    }\n",
      "    th\n",
      "    {\n",
      "        padding: 12px 17px 12px 17px;\n",
      "        font-weight: normal;\n",
      "        font-size: 14px;\n",
      "        border-bottom: 1px dashed #69c;\n",
      "    }\n",
      "\n",
      "    td\n",
      "    {\n",
      "        padding: 7px 17px 7px 17px;\n",
      "\n",
      "    }\n",
      "\n",
      "    tbody tr:hover th\n",
      "    {\n",
      "\n",
      "        background:  #E9E9E9;\n",
      "    }\n",
      "\n",
      "    tbody tr:hover td\n",
      "    {\n",
      "\n",
      "        background:  #E9E9E9;\n",
      "    }\n",
      "</style>"
     ]
    }
   ],
   "metadata": {}
  }
 ]
}