{
 "metadata": {
  "name": "",
  "signature": "sha256:cc5bf5bf21f1f18570c530e4dc9438ba67c0111307a2ee0e3082bf17f0217120"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nltk\n",
      "import numpy as np\n",
      "import pandas as pd\n",
      "\n",
      "from csv import QUOTE_ALL\n",
      "from datetime import datetime\n",
      "from io import BytesIO\n",
      "from random import choice, random, randrange, sample, randint\n",
      "from urllib2 import urlopen\n",
      "from zipfile import ZipFile"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stderr",
       "text": [
        "/home/abarto/.virtualenvs/pandas/local/lib/python2.7/site-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.\n",
        "  .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "Generating Fake Blog Comments"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Introduction"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The purpose of this notebook is to show you how to generate a table with data that looks like comments on a blogpost (or posts to a forum if you're old like me). Although we could have generated purely random strings, we wanted the data to look as real as possible, so we make use of the data published by the [United States Census Bureau](http://www.census.gov/) to simulated the entries based on their real probability of occurrence."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The generated data will the form of a Pandas DataFrame with the following columns:"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "| Name       | Description                                           |\n",
      "|------------|-------------------------------------------------------|\n",
      "| id         | An autogenerated sequence number                      |\n",
      "| timestamp  | Timestamp of the date and time when the post was made |\n",
      "| email      | The e-mail of the user                                |\n",
      "| first_name | First name of the user                                |\n",
      "| last_name  | Last name of the user                                 |\n",
      "| place      | Place of residence of the user                        |\n",
      "| text       | The actual text of the post                           |"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In order to generate meaningful text, we'll make use of the NLTK library by choosing random sentences from any of the available corpora."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Requirements"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The following modules are required to run this notebook:\n",
      "    \n",
      "* pandas (0.14.0 or greater)\n",
      "* nltk (2.0.4 or greater)"
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Census Data"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "As we mentioned in the introduction, we'll use data from the US Census Bureau. For the place of residence of the user we'll take random entries from the [2013 Gazeteer files](http://www.census.gov/geo/maps-data/data/gazetteer2013.html). Although this is not statistically appropriate as the population density of Lost Springs, Wyoming is quite different from New York City, we wanted to keep the post as simple as possible. On the next section we'll use a proper method for generating names that can be adapted for places (you need to get the population estimates data sets for that to work)."
     ]
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "Places"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with ZipFile(BytesIO(urlopen('http://www2.census.gov/geo/gazetteer/2013_Gazetteer/2013_Gaz_place_national.zip').read())) as zip_file:\n",
      "    gaz_place_national_2013_df = pd.read_csv(zip_file.open('2013_Gaz_place_national.txt'), sep='\\t')\n",
      "gaz_place_national_2013_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>USPS</th>\n",
        "      <th>GEOID</th>\n",
        "      <th>ANSICODE</th>\n",
        "      <th>NAME</th>\n",
        "      <th>LSAD</th>\n",
        "      <th>FUNCSTAT</th>\n",
        "      <th>ALAND</th>\n",
        "      <th>AWATER</th>\n",
        "      <th>ALAND_SQMI</th>\n",
        "      <th>AWATER_SQMI</th>\n",
        "      <th>INTPTLAT</th>\n",
        "      <th>INTPTLONG                                                                                                 </th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> AL</td>\n",
        "      <td> 100100</td>\n",
        "      <td> 2582661</td>\n",
        "      <td>      Abanda CDP</td>\n",
        "      <td> 57</td>\n",
        "      <td> S</td>\n",
        "      <td>  7764034</td>\n",
        "      <td>  34284</td>\n",
        "      <td>  2.998</td>\n",
        "      <td> 0.013</td>\n",
        "      <td> 33.091627</td>\n",
        "      <td>-85.527029</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> AL</td>\n",
        "      <td> 100124</td>\n",
        "      <td> 2403054</td>\n",
        "      <td>  Abbeville city</td>\n",
        "      <td> 25</td>\n",
        "      <td> A</td>\n",
        "      <td> 40255362</td>\n",
        "      <td> 107642</td>\n",
        "      <td> 15.543</td>\n",
        "      <td> 0.042</td>\n",
        "      <td> 31.564689</td>\n",
        "      <td>-85.259124</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> AL</td>\n",
        "      <td> 100460</td>\n",
        "      <td> 2403063</td>\n",
        "      <td> Adamsville city</td>\n",
        "      <td> 25</td>\n",
        "      <td> A</td>\n",
        "      <td> 65064187</td>\n",
        "      <td>  29719</td>\n",
        "      <td> 25.121</td>\n",
        "      <td> 0.011</td>\n",
        "      <td> 33.605750</td>\n",
        "      <td>-86.974650</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> AL</td>\n",
        "      <td> 100484</td>\n",
        "      <td> 2405123</td>\n",
        "      <td>    Addison town</td>\n",
        "      <td> 43</td>\n",
        "      <td> A</td>\n",
        "      <td>  9753292</td>\n",
        "      <td>  83417</td>\n",
        "      <td>  3.766</td>\n",
        "      <td> 0.032</td>\n",
        "      <td> 34.202681</td>\n",
        "      <td>-87.178004</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> AL</td>\n",
        "      <td> 100676</td>\n",
        "      <td> 2405125</td>\n",
        "      <td>      Akron town</td>\n",
        "      <td> 43</td>\n",
        "      <td> A</td>\n",
        "      <td>  1776164</td>\n",
        "      <td>  13849</td>\n",
        "      <td>  0.686</td>\n",
        "      <td> 0.005</td>\n",
        "      <td> 32.879495</td>\n",
        "      <td>-87.741679</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 2,
       "text": [
        "  USPS   GEOID  ANSICODE             NAME LSAD FUNCSTAT     ALAND  AWATER  \\\n",
        "0   AL  100100   2582661       Abanda CDP   57        S   7764034   34284   \n",
        "1   AL  100124   2403054   Abbeville city   25        A  40255362  107642   \n",
        "2   AL  100460   2403063  Adamsville city   25        A  65064187   29719   \n",
        "3   AL  100484   2405123     Addison town   43        A   9753292   83417   \n",
        "4   AL  100676   2405125       Akron town   43        A   1776164   13849   \n",
        "\n",
        "   ALAND_SQMI  AWATER_SQMI   INTPTLAT  \\\n",
        "0       2.998        0.013  33.091627   \n",
        "1      15.543        0.042  31.564689   \n",
        "2      25.121        0.011  33.605750   \n",
        "3       3.766        0.032  34.202681   \n",
        "4       0.686        0.005  32.879495   \n",
        "\n",
        "   INTPTLONG                                                                                                   \n",
        "0                                         -85.527029                                                           \n",
        "1                                         -85.259124                                                           \n",
        "2                                         -86.974650                                                           \n",
        "3                                         -87.178004                                                           \n",
        "4                                         -87.741679                                                           "
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Notice that the state of each place appears abbreviated, if we wanted the full name, we can make use of the [ANSI State Codes](http://www.census.gov/geo/reference/ansi_statetables.html) provided by the US Census Bureau."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "state_df = pd.read_csv(urlopen('http://www.census.gov/geo/reference/docs/state.txt'), sep='|', dtype={'STATE': 'str'})\n",
      "state_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>STATE</th>\n",
        "      <th>STUSAB</th>\n",
        "      <th>STATE_NAME</th>\n",
        "      <th>STATENS</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> 01</td>\n",
        "      <td> AL</td>\n",
        "      <td>    Alabama</td>\n",
        "      <td> 1779775</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> 02</td>\n",
        "      <td> AK</td>\n",
        "      <td>     Alaska</td>\n",
        "      <td> 1785533</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> 04</td>\n",
        "      <td> AZ</td>\n",
        "      <td>    Arizona</td>\n",
        "      <td> 1779777</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> 05</td>\n",
        "      <td> AR</td>\n",
        "      <td>   Arkansas</td>\n",
        "      <td>   68085</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> 06</td>\n",
        "      <td> CA</td>\n",
        "      <td> California</td>\n",
        "      <td> 1779778</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 3,
       "text": [
        "  STATE STUSAB  STATE_NAME  STATENS\n",
        "0    01     AL     Alabama  1779775\n",
        "1    02     AK      Alaska  1785533\n",
        "2    04     AZ     Arizona  1779777\n",
        "3    05     AR    Arkansas    68085\n",
        "4    06     CA  California  1779778"
       ]
      }
     ],
     "prompt_number": 3
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "places_df = pd.merge(gaz_place_national_2013_df, state_df[['STATE_NAME', 'STUSAB']], left_on='USPS', right_on='STUSAB')[['USPS', 'NAME', 'STATE_NAME']]\n",
      "places_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>USPS</th>\n",
        "      <th>NAME</th>\n",
        "      <th>STATE_NAME</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> AL</td>\n",
        "      <td>      Abanda CDP</td>\n",
        "      <td> Alabama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> AL</td>\n",
        "      <td>  Abbeville city</td>\n",
        "      <td> Alabama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> AL</td>\n",
        "      <td> Adamsville city</td>\n",
        "      <td> Alabama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> AL</td>\n",
        "      <td>    Addison town</td>\n",
        "      <td> Alabama</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> AL</td>\n",
        "      <td>      Akron town</td>\n",
        "      <td> Alabama</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 4,
       "text": [
        "  USPS             NAME STATE_NAME\n",
        "0   AL       Abanda CDP    Alabama\n",
        "1   AL   Abbeville city    Alabama\n",
        "2   AL  Adamsville city    Alabama\n",
        "3   AL     Addison town    Alabama\n",
        "4   AL       Akron town    Alabama"
       ]
      }
     ],
     "prompt_number": 4
    },
    {
     "cell_type": "heading",
     "level": 3,
     "metadata": {},
     "source": [
      "People"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "For the names of the people we'll use the [1990](http://www.census.gov/genealogy/www/data/1990surnames/names_files.html) (for given names) and [2000](https://www.census.gov/genealogy/www/data/2000surnames/index.html) (for last names) census data. This time we won't be choosing entries willy-nilly. We want the names and last names frequency to mimic what happens in real life. In order to do that, we'll build a [Cumulative Frequency Distribution](http://www.statistics.com/glossary&term_id=222) for each data set. In order to save some memory, we'll only take the top 50 names and last names."
     ]
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Last names"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "with ZipFile(BytesIO(urlopen('https://www.census.gov/genealogy/www/data/2000surnames/names.zip').read())) as zip_file:\n",
      "    app_c_df = pd.read_csv(zip_file.open('app_c.csv'))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "app_c_df_50 = app_c_df[:50][['name', 'count']]\n",
      "app_c_df_50['prop'] = app_c_df_50['count'].apply(lambda x : x.astype(float) / app_c_df_50['count'].sum())\n",
      "app_c_df_50['cfd'] = app_c_df_50['prop'].cumsum()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 6
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Female names"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dist_female_first_df = pd.read_fwf(\n",
      "    urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.female.first'),\n",
      "    col_specs=((0, 15),(15, 20),(21, 27),(28, 35)), header=None,\n",
      "    names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')\n",
      ")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 7
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dist_female_first_df_50 = dist_female_first_df[:50][['name', 'freq_in_percent']]\n",
      "dist_female_first_df_50['prop'] = dist_female_first_df_50['freq_in_percent'].apply(lambda x : x.astype(float) / dist_female_first_df_50['freq_in_percent'].sum())\n",
      "dist_female_first_df_50['cfd'] = dist_female_first_df_50['prop'].cumsum()\n",
      "dist_female_first_df_50.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>name</th>\n",
        "      <th>freq_in_percent</th>\n",
        "      <th>prop</th>\n",
        "      <th>cfd</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>      MARY</td>\n",
        "      <td> 2.629</td>\n",
        "      <td> 0.088032</td>\n",
        "      <td> 0.088032</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>  PATRICIA</td>\n",
        "      <td> 1.073</td>\n",
        "      <td> 0.035930</td>\n",
        "      <td> 0.123962</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>     LINDA</td>\n",
        "      <td> 1.035</td>\n",
        "      <td> 0.034657</td>\n",
        "      <td> 0.158619</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>   BARBARA</td>\n",
        "      <td> 0.980</td>\n",
        "      <td> 0.032815</td>\n",
        "      <td> 0.191435</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> ELIZABETH</td>\n",
        "      <td> 0.937</td>\n",
        "      <td> 0.031376</td>\n",
        "      <td> 0.222810</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 8,
       "text": [
        "        name  freq_in_percent      prop       cfd\n",
        "0       MARY            2.629  0.088032  0.088032\n",
        "1   PATRICIA            1.073  0.035930  0.123962\n",
        "2      LINDA            1.035  0.034657  0.158619\n",
        "3    BARBARA            0.980  0.032815  0.191435\n",
        "4  ELIZABETH            0.937  0.031376  0.222810"
       ]
      }
     ],
     "prompt_number": 8
    },
    {
     "cell_type": "heading",
     "level": 4,
     "metadata": {},
     "source": [
      "Male names"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dist_male_first_df = pd.read_fwf(\n",
      "    urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'),\n",
      "    col_specs=((0, 15),(15, 20),(21, 27),(28, 35)), header=None,\n",
      "    names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')\n",
      ")"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 9
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "dist_male_first_df_50 = dist_male_first_df[:50][['name', 'freq_in_percent']]\n",
      "dist_male_first_df_50['prop'] = dist_male_first_df_50['freq_in_percent'].apply(lambda x : x.astype(float) / dist_male_first_df_50['freq_in_percent'].sum())\n",
      "dist_male_first_df_50['cfd'] = dist_male_first_df_50['prop'].cumsum()\n",
      "dist_male_first_df_50.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>name</th>\n",
        "      <th>freq_in_percent</th>\n",
        "      <th>prop</th>\n",
        "      <th>cfd</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>   JAMES</td>\n",
        "      <td> 3.318</td>\n",
        "      <td> 0.070376</td>\n",
        "      <td> 0.070376</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>    JOHN</td>\n",
        "      <td> 3.271</td>\n",
        "      <td> 0.069379</td>\n",
        "      <td> 0.139754</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>  ROBERT</td>\n",
        "      <td> 3.143</td>\n",
        "      <td> 0.066664</td>\n",
        "      <td> 0.206418</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> MICHAEL</td>\n",
        "      <td> 2.629</td>\n",
        "      <td> 0.055762</td>\n",
        "      <td> 0.262180</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> WILLIAM</td>\n",
        "      <td> 2.451</td>\n",
        "      <td> 0.051986</td>\n",
        "      <td> 0.314166</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 10,
       "text": [
        "      name  freq_in_percent      prop       cfd\n",
        "0    JAMES            3.318  0.070376  0.070376\n",
        "1     JOHN            3.271  0.069379  0.139754\n",
        "2   ROBERT            3.143  0.066664  0.206418\n",
        "3  MICHAEL            2.629  0.055762  0.262180\n",
        "4  WILLIAM            2.451  0.051986  0.314166"
       ]
      }
     ],
     "prompt_number": 10
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Generating the User Table"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In order to mimic what a real blogs or sites usually have, we'll generate a user table and then we'll choose one of them as the author of the simulated post. First we'll generate a list of first names (with a equal distribution of males and females) and last names (yes, I know that there's a correlation of first and last names if we take into account the ethnic origin, but we'll ignore that fact)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "users = []\n",
      "emails = set()\n",
      "email_domains = ('@gmail.com', '@yahoo.com', '@hotmail.com', '@outlook.com', '@mail.com', '@inbox.com', '@yandex.com')\n",
      "for i in range(500):\n",
      "    user = dict()\n",
      "    \n",
      "    # Name\n",
      "\n",
      "    random_lastname = random()\n",
      "    \n",
      "    user['last_name'] = app_c_df_50[random_lastname < app_c_df_50.cfd].iloc[0]['name'].capitalize()\n",
      "\n",
      "    random_gender = random()\n",
      "    random_name = random()\n",
      "    \n",
      "    if random_gender < 0.5:\n",
      "        user['first_name'] = dist_female_first_df_50[random_name < dist_female_first_df_50.cfd].iloc[0]['name'].capitalize()\n",
      "    else:\n",
      "        user['first_name'] = dist_male_first_df_50[random_name < dist_male_first_df_50.cfd].iloc[0]['name'].capitalize()\n",
      "    \n",
      "    # E-mail\n",
      "    \n",
      "    email_domain = choice(email_domains)\n",
      "    email = '{0}.{1}{2}'.format(user['first_name'].lower(), user['last_name'].lower(), email_domain)\n",
      "    \n",
      "    if not email in emails:\n",
      "        user['email'] = email\n",
      "    else:\n",
      "        user['email'] = '{0}.{1}_{2:4x}{3}'.format(\n",
      "            user['first_name'].lower(), user['last_name'].lower(), randrange(16**4), email_domain\n",
      "        )\n",
      "    \n",
      "    emails.add(user['email'])\n",
      "    \n",
      "    # Place\n",
      "    \n",
      "    place = places_df.ix[np.random.choice(places_df.index.values)]\n",
      "    \n",
      "    user['place'] = '{0}, {1}'.format(place['NAME'], place['STATE_NAME']).title()\n",
      "        \n",
      "    users.append(user)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 11
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "users_df = pd.DataFrame(users)\n",
      "users_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>email</th>\n",
        "      <th>first_name</th>\n",
        "      <th>last_name</th>\n",
        "      <th>place</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td>    anna.davis@yahoo.com</td>\n",
        "      <td>    Anna</td>\n",
        "      <td>   Davis</td>\n",
        "      <td> Climax Springs Village, Missouri</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> richard.davis@inbox.com</td>\n",
        "      <td> Richard</td>\n",
        "      <td>   Davis</td>\n",
        "      <td>         Moapa Valley Cdp, Nevada</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>  amanda.nelson@mail.com</td>\n",
        "      <td>  Amanda</td>\n",
        "      <td>  Nelson</td>\n",
        "      <td>      County Center Cdp, Virginia</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> david.jackson@gmail.com</td>\n",
        "      <td>   David</td>\n",
        "      <td> Jackson</td>\n",
        "      <td>              Decatur City, Texas</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td> carol.miller@yandex.com</td>\n",
        "      <td>   Carol</td>\n",
        "      <td>  Miller</td>\n",
        "      <td>          Marshall City, Michigan</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "                     email first_name last_name  \\\n",
        "0     anna.davis@yahoo.com       Anna     Davis   \n",
        "1  richard.davis@inbox.com    Richard     Davis   \n",
        "2   amanda.nelson@mail.com     Amanda    Nelson   \n",
        "3  david.jackson@gmail.com      David   Jackson   \n",
        "4  carol.miller@yandex.com      Carol    Miller   \n",
        "\n",
        "                              place  \n",
        "0  Climax Springs Village, Missouri  \n",
        "1          Moapa Valley Cdp, Nevada  \n",
        "2       County Center Cdp, Virginia  \n",
        "3               Decatur City, Texas  \n",
        "4           Marshall City, Michigan  "
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Generating comments"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "In order to generate the text of the comments, we'll take random sentences from [Mary Shelley's Frankenstein; Or, The Modern Prometheu](http://www.gutenberg.org/ebooks/84). In order NLTK's [sent_tokenize](http://www.nltk.org/api/nltk.tokenize.html) function. For the timestamps of the messages, we'll just pick a random point in time between September 4th 1994 and January 4th 1995 (right around the time [http://www.imdb.com/title/tt0109836/](that other Frankenstein was released)."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "frankenstein_sentences = nltk.sent_tokenize(urlopen('http://www.gutenberg.org/ebooks/84.txt.utf-8').read().replace('\\r\\n', ' '))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 13
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "start_datetime = datetime(year=1994,month=9,day=4).toordinal()\n",
      "end_datetime = datetime(year=1995,month=1,day=4).toordinal()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 14
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "comments = []\n",
      "for i in range(1000):\n",
      "    comment = dict()\n",
      "    \n",
      "    comment['timestamp'] = randrange(start_datetime, end_datetime)\n",
      "    comment['text'] = ' '.join(sample(frankenstein_sentences, randint(1, 5)))\n",
      "    \n",
      "    user = users_df.ix[np.random.choice(users_df.index.values)]\n",
      "    \n",
      "    comment['email'] = user['email']\n",
      "    comment['first_name'] = user['first_name']\n",
      "    comment['last_name'] = user['last_name']\n",
      "    comment['place'] = user['place']\n",
      "    \n",
      "    comments.append(comment)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "comments_df = pd.DataFrame(sorted(comments, key=lambda p: p['timestamp']))\n",
      "comments_df.index.name = 'id'\n",
      "comments_df.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>email</th>\n",
        "      <th>first_name</th>\n",
        "      <th>last_name</th>\n",
        "      <th>place</th>\n",
        "      <th>text</th>\n",
        "      <th>timestamp</th>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>id</th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "      <th></th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> sharon.anderson@yandex.com</td>\n",
        "      <td>  Sharon</td>\n",
        "      <td> Anderson</td>\n",
        "      <td>    Springville Village, New York</td>\n",
        "      <td> \"Felix had accidentally been present at the tr...</td>\n",
        "      <td> 728175</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td>   barbara.nelson@inbox.com</td>\n",
        "      <td> Barbara</td>\n",
        "      <td>   Nelson</td>\n",
        "      <td>        Prince Cdp, West Virginia</td>\n",
        "      <td> She welcomed me with the greatest affection. T...</td>\n",
        "      <td> 728175</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td>   timothy.wright@yahoo.com</td>\n",
        "      <td> Timothy</td>\n",
        "      <td>   Wright</td>\n",
        "      <td>          East Burke Cdp, Vermont</td>\n",
        "      <td> When it became noon, and the sun rose higher, ...</td>\n",
        "      <td> 728175</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td>     carol.jackson@mail.com</td>\n",
        "      <td>   Carol</td>\n",
        "      <td>  Jackson</td>\n",
        "      <td>      Montrose City, South Dakota</td>\n",
        "      <td> Nay, these are virtuous and immaculate beings!...</td>\n",
        "      <td> 728175</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td>  laura.jackson@hotmail.com</td>\n",
        "      <td>   Laura</td>\n",
        "      <td>  Jackson</td>\n",
        "      <td> Smithfield Borough, Pennsylvania</td>\n",
        "      <td> \"But my toils now drew near a close, and in tw...</td>\n",
        "      <td> 728175</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 16,
       "text": [
        "                         email first_name last_name  \\\n",
        "id                                                    \n",
        "0   sharon.anderson@yandex.com     Sharon  Anderson   \n",
        "1     barbara.nelson@inbox.com    Barbara    Nelson   \n",
        "2     timothy.wright@yahoo.com    Timothy    Wright   \n",
        "3       carol.jackson@mail.com      Carol   Jackson   \n",
        "4    laura.jackson@hotmail.com      Laura   Jackson   \n",
        "\n",
        "                               place  \\\n",
        "id                                     \n",
        "0      Springville Village, New York   \n",
        "1          Prince Cdp, West Virginia   \n",
        "2            East Burke Cdp, Vermont   \n",
        "3        Montrose City, South Dakota   \n",
        "4   Smithfield Borough, Pennsylvania   \n",
        "\n",
        "                                                 text  timestamp  \n",
        "id                                                                \n",
        "0   \"Felix had accidentally been present at the tr...     728175  \n",
        "1   She welcomed me with the greatest affection. T...     728175  \n",
        "2   When it became noon, and the sun rose higher, ...     728175  \n",
        "3   Nay, these are virtuous and immaculate beings!...     728175  \n",
        "4   \"But my toils now drew near a close, and in tw...     728175  "
       ]
      }
     ],
     "prompt_number": 16
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Conclusions"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We've generated a Pandas DataFrame with data that looks like comments to a blogpost. There are tons of ways to improve the quality of the data. For instance, we could have used bigger name and last name tables, generated the text using Markov chains (ideally trained from real comments), or distribute the posts unevenly across users. The last thing we need to do is save our work in CSV format:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "comments_df.to_csv('comments_df.csv', quoting=QUOTE_ALL)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}