{ "metadata": { "name": "", "signature": "sha256:fcc4271853bd8d34b562760802935fec3ce16846ad3fe2db20a2a050637693f7" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Relating Age to Name\n", "\n", "This notebook is inspired by [Nate Silver's](https://twitter.com/FiveThirtyEight) recent article on [How to Tell Someone's Age When All you Know is Her Name](http://fivethirtyeight.com/features/how-to-tell-someones-age-when-all-you-know-is-her-name/). It allows one to (almost) replicate the analysis done in the article, and provides more extensive features. I have done similar work using R, and you can find it [here](https://github.com/ramnathv/agebyname)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "We will use four primary datasets.\n", "\n", "1. [Babynames](http://www.ssa.gov/oact/babynames/names.zip)\n", "2. [Babynames by State](http://www.ssa.gov/oact/babynames/state/namesbystate.zip)\n", "3. [Cohort Life Tables](http://www.ssa.gov/oact/NOTES/as120/LifeTables_Tbl_7.html)\n", "4. [Census Live Births Data](http://www.census.gov/statab/hist/02HS0013.xls)\n", "\n", "We will download each of these datasets and process them to get them analysis-ready." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline\n", "import re\n", "import urllib\n", "from zipfile import ZipFile\n", "from path import path\n", "import numpy as np\n", "import pandas as pd\n", "from scipy.interpolate import interp1d\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "from ggplot import *" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "/Users/ramnathv/anaconda/lib/python2.7/site-packages/pytz/__init__.py:29: UserWarning: Module argparse was already imported from /Users/ramnathv/anaconda/lib/python2.7/argparse.pyc, but /Users/ramnathv/anaconda/lib/python2.7/site-packages is being added to sys.path\n", " from pkg_resources import resource_stream\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Baby Names\n", "\n", "The first dataset we will download is the [bnames](http://www.ssa.gov/oact/babynames/names.zip) dataset provided by the SSA." ] }, { "cell_type": "code", "collapsed": false, "input": [ "urllib.urlretrieve(\"http://www.ssa.gov/oact/babynames/names.zip\", \"names.zip\")\n", "zf = ZipFile(\"names.zip\")\n", "def read_names(f):\n", "    data = pd.read_csv(zf.open(f), header = None, names = ['name', 'sex', 'n'])\n", "    data['year'] = int(re.findall(r'\\d+', f)[0])\n", "    return data\n", "\n", "bnames = pd.concat([read_names(f) for f in zf.namelist() if f.endswith('.txt')])\n", "bnames.head()" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
|   | name | sex | n | year |
|---|---|---|---|---|
| 0 | Mary | F | 9217 | 1884 |
| 1 | Anna | F | 3860 | 1884 |
| 2 | Emma | F | 2587 | 1884 |
| 3 | Elizabeth | F | 2549 | 1884 |
| 4 | Minnie | F | 2243 | 1884 |

5 rows × 4 columns
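
The `read_names` helper takes the year from the name of each file inside the archive. A minimal sketch of that idea on an in-memory sample, using only the standard library. The member name `yob1884.txt` is an assumption about how the archive is laid out, `parse_names` is a hypothetical helper, and the sample rows are the first three rows of the table above.

```python
import csv
import io
import re

def parse_names(filename, text):
    """Parse 'name,sex,n' rows and tag each with the year found in filename."""
    # Assumption: the first run of digits in the file name is the birth year.
    year = int(re.findall(r"\d+", filename)[0])
    rows = []
    for name, sex, n in csv.reader(io.StringIO(text)):
        rows.append({"name": name, "sex": sex, "n": int(n), "year": year})
    return rows

# Hypothetical file name; rows copied from the table above.
sample = "Mary,F,9217\nAnna,F,3860\nEmma,F,2587\n"
records = parse_names("yob1884.txt", sample)
print(records[0])
```

Keeping the parser separate from the download step makes it possible to check the column layout without touching the network.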
|   | state | sex | year | name | n |
|---|---|---|---|---|---|
| 0 | MA | F | 1910 | Mary | 988 |
| 1 | MA | F | 1910 | Helen | 473 |
| 2 | MA | F | 1910 | Margaret | 374 |
| 3 | MA | F | 1910 | Dorothy | 331 |
| 4 | MA | F | 1910 | Alice | 313 |

5 rows × 5 columns
|   | x | qx | lx | dx | Lx | Tx | ex | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.14596 | 100000 | 14596 | 90026 | 5151511 | 51.52 | M | 1900 |
| 1 | 1 | 0.03282 | 85404 | 2803 | 84003 | 5061484 | 59.26 | M | 1900 |
| 2 | 2 | 0.01634 | 82601 | 1350 | 81926 | 4977482 | 60.26 | M | 1900 |
| 3 | 3 | 0.01052 | 81251 | 855 | 80824 | 4895556 | 60.25 | M | 1900 |
| 4 | 4 | 0.00875 | 80397 | 703 | 80045 | 4814732 | 59.89 | M | 1900 |

5 rows × 9 columns
|   | x | qx | lx | dx | Lx | Tx | ex | sex | year |
|---|---|---|---|---|---|---|---|---|---|
| 114 | 114 | 0.76365 | 0 | 0 | 0 | 0 | 0.79 | M | 1900 |
| 234 | 114 | 0.75783 | 0 | 0 | 0 | 0 | 0.80 | F | 1900 |
| 344 | 104 | 0.46882 | 25 | 12 | 19 | 38 | 1.54 | M | 1910 |
| 464 | 104 | 0.42317 | 191 | 81 | 150 | 328 | 1.72 | F | 1910 |
| 574 | 94 | 0.26858 | 3122 | 838 | 2703 | 8656 | 2.77 | M | 1920 |

5 rows × 9 columns
|   | lx | sex | year |
|---|---|---|---|
| 0 | 0.0 | F | 1900 |
| 1 | 19.1 | F | 1901 |
| 2 | 38.2 | F | 1902 |
| 3 | 57.3 | F | 1903 |
| 4 | 76.4 | F | 1904 |

5 rows × 3 columns
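
The yearly `lx` values above look like a straight-line fill between decennial cohorts: the filtered life table lists `lx = 0` for women born in 1900 and `lx = 191` for women born in 1910, and the values in between rise by 19.1 per year. A minimal sketch of that interpolation in plain Python, assuming it is linear (the notebook imports `scipy.interpolate.interp1d`, which is linear by default). `interpolate_lx` is a hypothetical helper, not a function from the notebook.

```python
def interpolate_lx(year, year0, lx0, year1, lx1):
    """Linearly interpolate survivors per 100,000 between two cohort years."""
    fraction = (year - year0) / float(year1 - year0)
    return lx0 + (lx1 - lx0) * fraction

# Decennial values for women, taken from the filtered life table above.
filled = [round(interpolate_lx(y, 1900, 0, 1910, 191), 1)
          for y in range(1900, 1905)]
print(filled)
```

The rounded results match the first five rows of the table, which supports the linear reading without proving it is what the notebook does.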
|   | year | births |
|---|---|---|
| 0 | 1909 | 2718 |
| 1 | 1910 | 2777 |
| 2 | 1911 | 2809 |
| 3 | 1912 | 2840 |
| 4 | 1913 | 2869 |

5 rows × 2 columns
|   | year | cor |
|---|---|---|
| 0 | 1909 | 5.316662 |
| 1 | 1910 | 4.701138 |
| 2 | 1911 | 4.360075 |
| 3 | 1912 | 2.874415 |
| 4 | 1913 | 2.523176 |

5 rows × 2 columns
|   | name | sex | n | year | cor | lx | n_cor | n_alive |
|---|---|---|---|---|---|---|---|---|
| 0 | Violet | F | 945 | 1909 | 5.316662 | 171.9 | 5024.245779 | 8.636678 |
| 1 | Violet | F | 1037 | 1910 | 4.701138 | 191.0 | 4875.080412 | 9.311404 |
| 2 | Violet | F | 1183 | 1911 | 4.360075 | 1084.2 | 5157.968506 | 55.922695 |
| 3 | Violet | F | 1535 | 1912 | 2.874415 | 1977.4 | 4412.227601 | 87.247389 |
| 4 | Violet | F | 1751 | 1913 | 2.523176 | 2870.6 | 4418.081208 | 126.825439 |

5 rows × 8 columns
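
The derived columns in this table appear to follow two simple relationships: `n_cor = n * cor` and `n_alive = n_cor * lx / 100000`, where `lx` counts survivors per 100,000 births. This is inferred from the values shown, not taken from the notebook's code, and `estimate_alive` is a hypothetical helper. A short sketch that checks the first Violet row:

```python
def estimate_alive(n, cor, lx):
    """Return (n_cor, n_alive) for one name/year row.

    Assumes n_cor = n * cor and n_alive = n_cor * lx / 100000, a relationship
    inferred from the table values rather than from the notebook's code.
    """
    n_cor = n * cor
    n_alive = n_cor * lx / 100000.0
    return n_cor, n_alive

# First row of the table: Violet, F, 1909.
n_cor, n_alive = estimate_alive(945, 5.316662, 171.9)
print(n_cor, n_alive)
```

Small differences in the last digits are expected, because the displayed `cor` is rounded to six decimal places.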
|   | index | name | p_alive | q0 | q100 | q25 | q50 | q75 | sex |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 24 | Emily | 90.62 | 4 | 105 | 12 | 19 | 28 | F |
| 1 | 18 | Ashley | 98.5 | 4 | 97 | 18 | 24 | 28 | F |
| 2 | 10 | Jessica | 97.9 | 4 | 105 | 22 | 27 | 32 | F |
| 3 | 9 | Sarah | 83.63 | 4 | 105 | 19 | 28 | 37 | F |
| 4 | 17 | Anna | 51.81 | 4 | 105 | 17 | 34 | 65 | F |
| 5 | 3 | Jennifer | 96.63 | 4 | 98 | 29 | 36 | 42 | F |
| 6 | 1 | Elizabeth | 70.05 | 4 | 105 | 23 | 39 | 58 | F |
| 7 | 23 | Michelle | 95.92 | 4 | 99 | 29 | 40 | 47 | F |
| 8 | 20 | Kimberly | 95.87 | 4 | 81 | 30 | 41 | 48 | F |
| 9 | 15 | Lisa | 94.48 | 4 | 104 | 40 | 47 | 51 | F |
| 10 | 14 | Karen | 89.12 | 4 | 105 | 48 | 55 | 62 | F |
| 11 | 7 | Susan | 86.52 | 4 | 105 | 50 | 57 | 63 | F |
| 12 | 19 | Donna | 81.85 | 4 | 105 | 52 | 58 | 66 | F |
| 13 | 16 | Sandra | 85.46 | 4 | 105 | 50 | 59 | 67 | F |
| 14 | 12 | Nancy | 76.02 | 4 | 105 | 53 | 62 | 70 | F |
| 15 | 2 | Patricia | 77.8 | 4 | 105 | 53 | 62 | 70 | F |
| 16 | 4 | Linda | 84.87 | 4 | 105 | 55 | 62 | 67 | F |
| 17 | 22 | Carol | 77.49 | 4 | 105 | 56 | 64 | 71 | F |
| 18 | 0 | Mary | 51.35 | 4 | 105 | 53 | 64 | 75 | F |
| 19 | 5 | Barbara | 71.56 | 4 | 105 | 57 | 65 | 73 | F |
| 20 | 6 | Margaret | 45.63 | 4 | 105 | 51 | 65 | 76 | F |
| 21 | 21 | Ruth | 34.23 | 4 | 105 | 57 | 70 | 83 | F |
| 22 | 13 | Betty | 50.7 | 4 | 105 | 65 | 74 | 82 | F |
| 23 | 11 | Helen | 29.38 | 4 | 105 | 61 | 74 | 86 | F |
| 24 | 8 | Dorothy | 34.24 | 4 | 105 | 64 | 76 | 85 | F |

25 rows × 9 columns
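
The `q25`, `q50` and `q75` columns read as age quantiles among living holders of each name. One way such quantiles could be computed is to weight each age by `n_alive`. The sketch below is an assumption about the method, not the notebook's code, and `weighted_quantile` is a hypothetical helper. Its inputs are the five Violet rows shown earlier, with age taken as 2014 minus birth year (2014 is inferred from the filtered life table, where 1900 + 114 = 2014), so the result describes only that small slice of the data.

```python
def weighted_quantile(ages, weights, q):
    """Smallest age whose cumulative weight reaches fraction q of the total."""
    pairs = sorted(zip(ages, weights))
    threshold = q * sum(w for _, w in pairs)
    running = 0.0
    for age, weight in pairs:
        running += weight
        if running >= threshold:
            return age
    return pairs[-1][0]  # fallback in case rounding leaves running below threshold

# Birth years and n_alive values copied from the Violet table.
years = [1909, 1910, 1911, 1912, 1913]
n_alive = [8.636678, 9.311404, 55.922695, 87.247389, 126.825439]
ages = [2014 - y for y in years]
median_age = weighted_quantile(ages, n_alive, 0.5)
print(median_age)
```

Other quantile definitions (for example, interpolating between adjacent ages) would give slightly different answers, so the table's values need not match this rule exactly.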