{ "metadata": { "name": "", "signature": "sha256:07610a9c8acfbe28487e2a38842bfed459512c32d5752a9994690f4417c3477e" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Goals" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook focuses on baby names that have been given to both males and females." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%matplotlib inline" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "from pylab import figure, show\n", "\n", "from pandas import DataFrame, Series\n", "import pandas as pd" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "try:\n", "    import mpld3\n", "    from mpld3 import enable_notebook\n", "    from mpld3 import plugins\n", "    enable_notebook()\n", "except Exception as e:\n", "    print \"Attempt to import and enable mpld3 failed\", e" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# what would seaborn do?\n", "try:\n", "    import seaborn as sns\n", "except Exception as e:\n", "    print \"Attempt to import and enable seaborn failed\", e" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Preliminaries: Assumed location of pydata-book files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make it more practical for me to look at your homework, I'm again going to assume a relative placement of files. 
I placed the files from \n", "\n", "https://github.com/pydata/pydata-book\n", "\n", "in a local directory, which in my case is \"/Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/\" \n", "\n", "and then symbolically linked (`ln -s`) to the pydata-book directory from the root directory of the working-open-data folder, i.e., on OS X:\n", "\n", "    cd /Users/raymondyee/D/Document/Working_with_Open_Data/working-open-data\n", "    ln -s /Users/raymondyee/D/Document/Working_with_Open_Data/pydata-book/ pydata-book\n", "\n", "That way the files from the pydata-book repository look like they sit in the working-open-data directory -- without having to actually copy the files.\n", "\n", "With this arrangement, I should then be able to drop your notebook into my own notebooks directory and run it without having to mess around with paths." ] }, { "cell_type": "code", "collapsed": false, "input": [ "import os\n", "\n", "NAMES_DIR = os.path.join(os.pardir, \"pydata-book\", \"ch02\", \"names\")\n", "\n", "assert os.path.exists(NAMES_DIR)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Please make sure the above assertion works.**" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Baby names dataset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Discussed on p. 35 of the `PfDA` book." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To download all the data, including that for 2011 and 2012: [Popular Baby Names](http://www.ssa.gov/OACT/babynames/limits.html) --> includes state-by-state data." 
] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Loading all data into Pandas" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# show the first five files in NAMES_DIR\n", "\n", "import glob\n", "glob.glob(NAMES_DIR + \"/*\")[:5]" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# 2010 is the last available year in the pydata-book repo\n", "import os\n", "\n", "years = range(1880, 2011)\n", "\n", "pieces = []\n", "columns = ['name', 'sex', 'births']\n", "\n", "for year in years:\n", "    path = os.path.join(NAMES_DIR, 'yob%d.txt' % year)\n", "    frame = pd.read_csv(path, names=columns)\n", "\n", "    frame['year'] = year\n", "    pieces.append(frame)\n", "\n", "# Concatenate everything into a single DataFrame\n", "names = pd.concat(pieces, ignore_index=True)\n", "\n", "# why floats? I'm not sure.\n", "names.describe()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# how many people, names, males and females are represented in names?\n", "\n", "names.births.sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# F vs M\n", "\n", "names.groupby('sex')['births'].sum()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# total number of names\n", "\n", "len(names.groupby('name'))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# use pivot_table to collect records by year (rows) and sex (columns)\n", "\n", "total_births = names.pivot_table('births', rows='year', cols='sex', aggfunc=sum)\n", "total_births.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# You can use groupby to get the equivalent pivot_table calculation\n", "\n", 
"names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births']" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# how to calculate the total births / year\n", "\n", "names.groupby('year').sum().plot(title=\"total births by year\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "names.groupby('year').apply(lambda s: s.groupby('sex').agg('sum')).unstack()['births'].plot(title=\"births (M/F) by year\")" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# from book: add prop to names\n", "\n", "def add_prop(group):\n", " # Integer division floors\n", " births = group.births.astype(float)\n", "\n", " group['prop'] = births / births.sum()\n", " return group\n", "\n", "names = names.groupby(['year', 'sex']).apply(add_prop)\n" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# verify prop --> all adds up to 1\n", "\n", "np.allclose(names.groupby(['year', 'sex']).prop.sum(), 1)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# number of records in full names dataframe\n", "\n", "len(names)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "How to do top1000 calculation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section on the top1000 calculation is kept in here to provide some inspiration on how to work with baby names" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# from book: useful to work with top 1000 for each year/sex combo\n", "# can use groupby/apply\n", "\n", "names.groupby(['year', 'sex']).apply(lambda g: g.sort_index(by='births', ascending=False)[:1000])" ], "language": "python", "metadata": {}, "outputs": [] 
}, { "cell_type": "code", "collapsed": false, "input": [ "def get_top1000(group):\n", "    return group.sort_index(by='births', ascending=False)[:1000]\n", "\n", "grouped = names.groupby(['year', 'sex'])\n", "top1000 = grouped.apply(get_top1000)\n", "top1000.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# Do pivot table: rows = year and cols = names for top 1000\n", "\n", "top_births = top1000.pivot_table('births', rows='year', cols='name', aggfunc=np.sum)\n", "top_births.tail()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# is your name in the top_births list?\n", "\n", "top_births['Raymond'].plot(title='plot for Raymond')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# for Aaden, which shows up at the end\n", "\n", "top_births.Aaden.plot(xlim=[1880,2010])" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# number of names represented in top_births\n", "\n", "len(top_births.columns)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# how to get the most popular name of all time in top_births?\n", "\n", "most_common_names = top_births.sum()\n", "most_common_names.sort(ascending=False)\n", "\n", "most_common_names.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# as of mpl v 0.1 (2014.03.04), the name labeling doesn't work -- so disable mpld3 for this figure\n", "\n", "mpld3.disable_notebook()\n", "plt.figure()\n", "most_common_names[:50][::-1].plot(kind='barh', figsize=(10,10))" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# turn mpld3 back on\n", "\n", "mpld3.enable_notebook()" ], "language": 
"python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "all_births pivot table" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# instead of top_birth -- get all_births\n", "\n", "all_births = names.pivot_table('births', rows='year', cols='name', aggfunc=sum)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_births = all_births.fillna(0)\n", "all_births.tail()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# set up to do start/end calculation\n", "\n", "all_births_cumsum = all_births.apply(lambda s: s.cumsum(), axis=0)" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "all_births_cumsum.tail()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Names that are both M and F" ] }, { "cell_type": "code", "collapsed": false, "input": [ "# remind ourselves of what's in names\n", "\n", "names.head()" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "code", "collapsed": false, "input": [ "# columns in names\n", "\n", "names.columns" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Approach to exploring ambigendered names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some things to think about:\n", "\n", "* calculate a set of ambi_names -- names that are both M and F in the database: `names_ambi`\n", "* calculate a pivot table `ambi_names_pt` that use a hierarchical index name/sex vs years\n", "* for a specific name, make a plot of male vs female population to validate your approach\n", "* think of using cumulative vs year-by-year instantaneous populations\n", "* think about metrics for measuring the sex shift of names\n", "* 
think about how to calculate how ambigendered a name is" ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Exercise" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Submit a notebook that describes what you've learned about the nature of ambigendered names in the baby names database. (Due date: Monday, March 10 at 11:5pm --> bCourses assignment to come.) I'm interested in seeing what you do with the data set in this regard. At the minimum, show that you are able to run Day_13_C_Baby_Names_MF_Completed. Be creative and have fun. " ] }, { "cell_type": "code", "collapsed": false, "input": [], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }