{ "metadata": { "name": "", "signature": "sha256:cc5bf5bf21f1f18570c530e4dc9438ba67c0111307a2ee0e3082bf17f0217120" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "import nltk\n", "import numpy as np\n", "import pandas as pd\n", "\n", "from csv import QUOTE_ALL\n", "from datetime import datetime\n", "from io import BytesIO\n", "from random import choice, random, randrange, sample, randint\n", "from urllib2 import urlopen\n", "from zipfile import ZipFile" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "/home/abarto/.virtualenvs/pandas/local/lib/python2.7/site-packages/pandas/io/excel.py:626: UserWarning: Installed openpyxl is not supported at this time. Use >=1.6.1 and <2.0.0.\n", " .format(openpyxl_compat.start_ver, openpyxl_compat.stop_ver))\n" ] } ], "prompt_number": 1 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Generating Fake Blog Comments" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The purpose of this notebook is to show you how to generate a table with data that looks like comments on a blogpost (or posts to a forum if you're old like me). Although we could have generated purely random strings, we wanted the data to look as real as possible, so we make use of the data published by the [United States Census Bureau](http://www.census.gov/) to simulated the entries based on their real probability of occurrence." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The generated data will the form of a Pandas DataFrame with the following columns:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "| Name | Description |\n", "|------------|-------------------------------------------------------|\n", "| id | An autogenerated sequence number |\n", "| timestamp | Timestamp of the date and time when the post was made |\n", "| email | The e-mail of the user |\n", "| first_name | First name of the user |\n", "| last_name | Last name of the user |\n", "| place | Place of residence of the user |\n", "| text | The actual text of the post |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to generate meaningful text, we'll make use of the NLTK library by choosing random sentences from any of the available corpora." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Requirements" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following modules are required to run this notebook:\n", " \n", "* pandas (0.14.0 or greater)\n", "* nltk (2.0.4 or greater)" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Census Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we mentioned in the introduction, we'll use data from the US Census Bureau. For the place of residence of the user we'll take random entries from the [2013 Gazeteer files](http://www.census.gov/geo/maps-data/data/gazetteer2013.html). Although this is not statistically appropriate as the population density of Lost Springs, Wyoming is quite different from New York City, we wanted to keep the post as simple as possible. On the next section we'll use a proper method for generating names that can be adapted for places (you need to get the population estimates data sets for that to work)." 
, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Places" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "with ZipFile(BytesIO(urlopen('http://www2.census.gov/geo/gazetteer/2013_Gazetteer/2013_Gaz_place_national.zip').read())) as zip_file:\n", "    gaz_place_national_2013_df = pd.read_csv(zip_file.open('2013_Gaz_place_national.txt'), sep='\\t')\n", "gaz_place_national_2013_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "  USPS   GEOID  ANSICODE             NAME  LSAD FUNCSTAT     ALAND  AWATER  \\\n", "0   AL  100100   2582661       Abanda CDP    57        S   7764034   34284   \n", "1   AL  100124   2403054   Abbeville city    25        A  40255362  107642   \n", "2   AL  100460   2403063  Adamsville city    25        A  65064187   29719   \n", "3   AL  100484   2405123     Addison town    43        A   9753292   83417   \n", "4   AL  100676   2405125       Akron town    43        A   1776164   13849   \n", "\n", "   ALAND_SQMI  AWATER_SQMI   INTPTLAT  \\\n", "0       2.998        0.013  33.091627   \n", "1      15.543        0.042  31.564689   \n", "2      25.121        0.011  33.605750   \n", "3       3.766        0.032  34.202681   \n", "4       0.686        0.005  32.879495   \n", "\n", "    INTPTLONG  \n", "0  -85.527029  \n", "1  -85.259124  \n", "2  -86.974650  \n", "3  -87.178004  \n", "4  -87.741679  " ] } ], "prompt_number": 2 }
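, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll reuse this download-a-zip-into-memory pattern for the surnames file later on. Factored out, it would look like the sketch below (`read_zipped_csv` is our name for a hypothetical helper, not an existing function; the cells in this notebook keep the explicit version):" ] },
{ "cell_type": "code", "collapsed": false, "input": [ "def read_zipped_csv(url, member, **kwargs):\n", "    # Download a zip archive into memory and parse one member with pandas.\n", "    with ZipFile(BytesIO(urlopen(url).read())) as zip_file:\n", "        return pd.read_csv(zip_file.open(member), **kwargs)" ], "language": "python", "metadata": {}, "outputs": [] }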
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
USPSGEOIDANSICODENAMELSADFUNCSTATALANDAWATERALAND_SQMIAWATER_SQMIINTPTLATINTPTLONG
0 AL 100100 2582661 Abanda CDP 57 S 7764034 34284 2.998 0.013 33.091627-85.527029
1 AL 100124 2403054 Abbeville city 25 A 40255362 107642 15.543 0.042 31.564689-85.259124
2 AL 100460 2403063 Adamsville city 25 A 65064187 29719 25.121 0.011 33.605750-86.974650
3 AL 100484 2405123 Addison town 43 A 9753292 83417 3.766 0.032 34.202681-87.178004
4 AL 100676 2405125 Akron town 43 A 1776164 13849 0.686 0.005 32.879495-87.741679
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ " USPS GEOID ANSICODE NAME LSAD FUNCSTAT ALAND AWATER \\\n", "0 AL 100100 2582661 Abanda CDP 57 S 7764034 34284 \n", "1 AL 100124 2403054 Abbeville city 25 A 40255362 107642 \n", "2 AL 100460 2403063 Adamsville city 25 A 65064187 29719 \n", "3 AL 100484 2405123 Addison town 43 A 9753292 83417 \n", "4 AL 100676 2405125 Akron town 43 A 1776164 13849 \n", "\n", " ALAND_SQMI AWATER_SQMI INTPTLAT \\\n", "0 2.998 0.013 33.091627 \n", "1 15.543 0.042 31.564689 \n", "2 25.121 0.011 33.605750 \n", "3 3.766 0.032 34.202681 \n", "4 0.686 0.005 32.879495 \n", "\n", " INTPTLONG \n", "0 -85.527029 \n", "1 -85.259124 \n", "2 -86.974650 \n", "3 -87.178004 \n", "4 -87.741679 " ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that the state of each place appears abbreviated, if we wanted the full name, we can make use of the [ANSI State Codes](http://www.census.gov/geo/reference/ansi_statetables.html) provided by the US Census Bureau." ] }, { "cell_type": "code", "collapsed": false, "input": [ "state_df = pd.read_csv(urlopen('http://www.census.gov/geo/reference/docs/state.txt'), sep='|', dtype={'STATE': 'str'})\n", "state_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
STATESTUSABSTATE_NAMESTATENS
0 01 AL Alabama 1779775
1 02 AK Alaska 1785533
2 04 AZ Arizona 1779777
3 05 AR Arkansas 68085
4 06 CA California 1779778
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 3, "text": [ " STATE STUSAB STATE_NAME STATENS\n", "0 01 AL Alabama 1779775\n", "1 02 AK Alaska 1785533\n", "2 04 AZ Arizona 1779777\n", "3 05 AR Arkansas 68085\n", "4 06 CA California 1779778" ] } ], "prompt_number": 3 }, { "cell_type": "code", "collapsed": false, "input": [ "places_df = pd.merge(gaz_place_national_2013_df, state_df[['STATE_NAME', 'STUSAB']], left_on='USPS', right_on='STUSAB')[['USPS', 'NAME', 'STATE_NAME']]\n", "places_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
USPSNAMESTATE_NAME
0 AL Abanda CDP Alabama
1 AL Abbeville city Alabama
2 AL Adamsville city Alabama
3 AL Addison town Alabama
4 AL Akron town Alabama
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 4, "text": [ " USPS NAME STATE_NAME\n", "0 AL Abanda CDP Alabama\n", "1 AL Abbeville city Alabama\n", "2 AL Adamsville city Alabama\n", "3 AL Addison town Alabama\n", "4 AL Akron town Alabama" ] } ], "prompt_number": 4 }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "People" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the names of the people we'll use the [1990](http://www.census.gov/genealogy/www/data/1990surnames/names_files.html) (for given names) and [2000](https://www.census.gov/genealogy/www/data/2000surnames/index.html) (for last names) census data. This time we won't be choosing entries willy-nilly. We want the names and last names frequency to mimic what happens in real life. In order to do that, we'll build a [Cumulative Frequency Distribution](http://www.statistics.com/glossary&term_id=222) for each data set. In order to save some memory, we'll only take the top 50 names and last names." ] }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Last names" ] }, { "cell_type": "code", "collapsed": false, "input": [ "with ZipFile(BytesIO(urlopen('https://www.census.gov/genealogy/www/data/2000surnames/names.zip').read())) as zip_file:\n", " app_c_df = pd.read_csv(zip_file.open('app_c.csv'))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 5 }, { "cell_type": "code", "collapsed": false, "input": [ "app_c_df_50 = app_c_df[:50][['name', 'count']]\n", "app_c_df_50['prop'] = app_c_df_50['count'].apply(lambda x : x.astype(float) / app_c_df_50['count'].sum())\n", "app_c_df_50['cfd'] = app_c_df_50['prop'].cumsum()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 6 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Female names" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dist_female_first_df = pd.read_fwf(\n", " urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.female.first'),\n", " col_specs=((0, 15),(15, 20),(21, 27),(28, 35)), header=None,\n", " names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')\n", ")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 7 }, { "cell_type": "code", "collapsed": false, "input": [ "dist_female_first_df_50 = dist_female_first_df[:50][['name', 'freq_in_percent']]\n", "dist_female_first_df_50['prop'] = dist_female_first_df_50['freq_in_percent'].apply(lambda x : x.astype(float) / dist_female_first_df_50['freq_in_percent'].sum())\n", "dist_female_first_df_50['cfd'] = dist_female_first_df_50['prop'].cumsum()\n", "dist_female_first_df_50.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefreq_in_percentpropcfd
0 MARY 2.629 0.088032 0.088032
1 PATRICIA 1.073 0.035930 0.123962
2 LINDA 1.035 0.034657 0.158619
3 BARBARA 0.980 0.032815 0.191435
4 ELIZABETH 0.937 0.031376 0.222810
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 8, "text": [ " name freq_in_percent prop cfd\n", "0 MARY 2.629 0.088032 0.088032\n", "1 PATRICIA 1.073 0.035930 0.123962\n", "2 LINDA 1.035 0.034657 0.158619\n", "3 BARBARA 0.980 0.032815 0.191435\n", "4 ELIZABETH 0.937 0.031376 0.222810" ] } ], "prompt_number": 8 }, { "cell_type": "heading", "level": 4, "metadata": {}, "source": [ "Male names" ] }, { "cell_type": "code", "collapsed": false, "input": [ "dist_male_first_df = pd.read_fwf(\n", " urlopen('http://www.census.gov/genealogy/www/data/1990surnames/dist.male.first'),\n", " col_specs=((0, 15),(15, 20),(21, 27),(28, 35)), header=None,\n", " names=('name', 'freq_in_percent', 'cumulative_freq_in_percent', 'rank')\n", ")" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 9 }, { "cell_type": "code", "collapsed": false, "input": [ "dist_male_first_df_50 = dist_male_first_df[:50][['name', 'freq_in_percent']]\n", "dist_male_first_df_50['prop'] = dist_male_first_df_50['freq_in_percent'].apply(lambda x : x.astype(float) / dist_male_first_df_50['freq_in_percent'].sum())\n", "dist_male_first_df_50['cfd'] = dist_male_first_df_50['prop'].cumsum()\n", "dist_male_first_df_50.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
namefreq_in_percentpropcfd
0 JAMES 3.318 0.070376 0.070376
1 JOHN 3.271 0.069379 0.139754
2 ROBERT 3.143 0.066664 0.206418
3 MICHAEL 2.629 0.055762 0.262180
4 WILLIAM 2.451 0.051986 0.314166
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 10, "text": [ " name freq_in_percent prop cfd\n", "0 JAMES 3.318 0.070376 0.070376\n", "1 JOHN 3.271 0.069379 0.139754\n", "2 ROBERT 3.143 0.066664 0.206418\n", "3 MICHAEL 2.629 0.055762 0.262180\n", "4 WILLIAM 2.451 0.051986 0.314166" ] } ], "prompt_number": 10 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Generating the User Table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to mimic what a real blogs or sites usually have, we'll generate a user table and then we'll choose one of them as the author of the simulated post. First we'll generate a list of first names (with a equal distribution of males and females) and last names (yes, I know that there's a correlation of first and last names if we take into account the ethnic origin, but we'll ignore that fact)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "users = []\n", "emails = set()\n", "email_domains = ('@gmail.com', '@yahoo.com', '@hotmail.com', '@outlook.com', '@mail.com', '@inbox.com', '@yandex.com')\n", "for i in range(500):\n", " user = dict()\n", " \n", " # Name\n", "\n", " random_lastname = random()\n", " \n", " user['last_name'] = app_c_df_50[random_lastname < app_c_df_50.cfd].iloc[0]['name'].capitalize()\n", "\n", " random_gender = random()\n", " random_name = random()\n", " \n", " if random_gender < 0.5:\n", " user['first_name'] = dist_female_first_df_50[random_name < dist_female_first_df_50.cfd].iloc[0]['name'].capitalize()\n", " else:\n", " user['first_name'] = dist_male_first_df_50[random_name < dist_male_first_df_50.cfd].iloc[0]['name'].capitalize()\n", " \n", " # E-mail\n", " \n", " email_domain = choice(email_domains)\n", " email = '{0}.{1}{2}'.format(user['first_name'].lower(), user['last_name'].lower(), email_domain)\n", " \n", " if not email in emails:\n", " user['email'] = email\n", " else:\n", " user['email'] = '{0}.{1}_{2:4x}{3}'.format(\n", " user['first_name'].lower(), user['last_name'].lower(), randrange(16**4), email_domain\n", " )\n", " \n", " emails.add(user['email'])\n", " \n", " # Place\n", " \n", " place = places_df.ix[np.random.choice(places_df.index.values)]\n", " \n", " user['place'] = '{0}, {1}'.format(place['NAME'], place['STATE_NAME']).title()\n", " \n", " users.append(user)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 11 }, { "cell_type": "code", "collapsed": false, "input": [ "users_df = pd.DataFrame(users)\n", "users_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailfirst_namelast_nameplace
0 anna.davis@yahoo.com Anna Davis Climax Springs Village, Missouri
1 richard.davis@inbox.com Richard Davis Moapa Valley Cdp, Nevada
2 amanda.nelson@mail.com Amanda Nelson County Center Cdp, Virginia
3 david.jackson@gmail.com David Jackson Decatur City, Texas
4 carol.miller@yandex.com Carol Miller Marshall City, Michigan
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 12, "text": [ " email first_name last_name \\\n", "0 anna.davis@yahoo.com Anna Davis \n", "1 richard.davis@inbox.com Richard Davis \n", "2 amanda.nelson@mail.com Amanda Nelson \n", "3 david.jackson@gmail.com David Jackson \n", "4 carol.miller@yandex.com Carol Miller \n", "\n", " place \n", "0 Climax Springs Village, Missouri \n", "1 Moapa Valley Cdp, Nevada \n", "2 County Center Cdp, Virginia \n", "3 Decatur City, Texas \n", "4 Marshall City, Michigan " ] } ], "prompt_number": 12 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Generating comments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to generate the text of the comments, we'll take random sentences from [Mary Shelley's Frankenstein; Or, The Modern Prometheu](http://www.gutenberg.org/ebooks/84). In order NLTK's [sent_tokenize](http://www.nltk.org/api/nltk.tokenize.html) function. For the timestamps of the messages, we'll just pick a random point in time between September 4th 1994 and January 4th 1995 (right around the time [http://www.imdb.com/title/tt0109836/](that other Frankenstein was released)." ] }, { "cell_type": "code", "collapsed": false, "input": [ "frankenstein_sentences = nltk.sent_tokenize(urlopen('http://www.gutenberg.org/ebooks/84.txt.utf-8').read().replace('\\r\\n', ' '))" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 13 }, { "cell_type": "code", "collapsed": false, "input": [ "start_datetime = datetime(year=1994,month=9,day=4).toordinal()\n", "end_datetime = datetime(year=1995,month=1,day=4).toordinal()" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 14 }, { "cell_type": "code", "collapsed": false, "input": [ "comments = []\n", "for i in range(1000):\n", " comment = dict()\n", " \n", " comment['timestamp'] = randrange(start_datetime, end_datetime)\n", " comment['text'] = ' '.join(sample(frankenstein_sentences, randint(1, 5)))\n", " \n", " user = users_df.ix[np.random.choice(users_df.index.values)]\n", " \n", " comment['email'] = user['email']\n", " comment['first_name'] = user['first_name']\n", " comment['last_name'] = user['last_name']\n", " comment['place'] = user['place']\n", " \n", " comments.append(comment)" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 15 }, { "cell_type": "code", "collapsed": false, "input": [ "comments_df = pd.DataFrame(sorted(comments, key=lambda p: p['timestamp']))\n", "comments_df.index.name = 'id'\n", "comments_df.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
emailfirst_namelast_nameplacetexttimestamp
id
0 sharon.anderson@yandex.com Sharon Anderson Springville Village, New York \"Felix had accidentally been present at the tr... 728175
1 barbara.nelson@inbox.com Barbara Nelson Prince Cdp, West Virginia She welcomed me with the greatest affection. T... 728175
2 timothy.wright@yahoo.com Timothy Wright East Burke Cdp, Vermont When it became noon, and the sun rose higher, ... 728175
3 carol.jackson@mail.com Carol Jackson Montrose City, South Dakota Nay, these are virtuous and immaculate beings!... 728175
4 laura.jackson@hotmail.com Laura Jackson Smithfield Borough, Pennsylvania \"But my toils now drew near a close, and in tw... 728175
\n", "
" ], "metadata": {}, "output_type": "pyout", "prompt_number": 16, "text": [ " email first_name last_name \\\n", "id \n", "0 sharon.anderson@yandex.com Sharon Anderson \n", "1 barbara.nelson@inbox.com Barbara Nelson \n", "2 timothy.wright@yahoo.com Timothy Wright \n", "3 carol.jackson@mail.com Carol Jackson \n", "4 laura.jackson@hotmail.com Laura Jackson \n", "\n", " place \\\n", "id \n", "0 Springville Village, New York \n", "1 Prince Cdp, West Virginia \n", "2 East Burke Cdp, Vermont \n", "3 Montrose City, South Dakota \n", "4 Smithfield Borough, Pennsylvania \n", "\n", " text timestamp \n", "id \n", "0 \"Felix had accidentally been present at the tr... 728175 \n", "1 She welcomed me with the greatest affection. T... 728175 \n", "2 When it became noon, and the sun rose higher, ... 728175 \n", "3 Nay, these are virtuous and immaculate beings!... 728175 \n", "4 \"But my toils now drew near a close, and in tw... 728175 " ] } ], "prompt_number": 16 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Conclusions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've generated a Pandas DataFrame with data that looks like comments to a blogpost. There are tons of ways to improve the quality of the data. For instance, we could have used bigger name and last name tables, generated the text using Markov chains (ideally trained from real comments), or distribute the posts unevenly across users. The last thing we need to do is save our work in CSV format:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "comments_df.to_csv('comments_df.csv', quoting=QUOTE_ALL)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }