{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Zipf's Law and US Metro Population Growth\n", "*Jeremy A. Seibert*\n", "\n", "
\n", "\n", "George Zipf (pictured) was a linguist at Harvard in the early 20th century who postulated, and found, that within a language certain words are used with very high frequency, while the rest are hardly ever used. Though initially intended only for analyzing word frequencies, the generalized form, known as the Zipf-Mandelbrot law, and related distributions have been found throughout many unrelated disciplines.\n", "\n", "As it turns out, Zipf's law helps answer an interesting question in urban economics: city growth. In this notebook, we show the rank-size distribution of United States metropolitan area populations and how Zipf's law explains the population distribution of metro areas.\n", "\n", "If Zipf's law holds, the correlation coefficient between the log of Metro Rank and the log of Metro Population should be approximately -1.0.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Gather the tools\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn as sns\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Metro Rank-Size Distribution\n", "\n", "Using population data collected by the US Census Bureau, we can begin to construct the rank-size distribution. In this notebook we use the 2017 population estimates as our base year. The methodology included in the repo explains how the Census Bureau derives its estimates for metro area populations.\n", "\n", "As a quick overview, they use the most recent census year (2000, 2010, 2020, etc.) 
as their base year and, in conjunction with other population-based information, derive the estimate.\n", "\n", "Population Base + Births - Deaths + Migration = Population Estimate\n", "\n", "I would be remiss if I did not point out that there is room for error in this calculation. However, for our use case in this notebook the error is a non-issue. Also, wherever this notebook mentions a \"city\", it can be read as synonymous with an agglomeration as characterized by the Census Metropolitan Statistical Areas (Metros)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " | POPESTIMATE2017 | \n", "Rank | \n", "LogRank | \n", "LogPop | \n", "
---|---|---|---|---|
NAME | \n", "\n", " | \n", " | \n", " | \n", " |
New York-Newark-Jersey City, NY-NJ-PA | \n", "20320876 | \n", "1 | \n", "1.000000 | \n", "16.827159 | \n", "
Los Angeles-Long Beach-Anaheim, CA | \n", "13353907 | \n", "2 | \n", "1.693147 | \n", "16.407320 | \n", "
Chicago-Naperville-Elgin, IL-IN-WI | \n", "9533040 | \n", "3 | \n", "2.098612 | \n", "16.070274 | \n", "
Dallas-Fort Worth-Arlington, TX | \n", "7399662 | \n", "4 | \n", "2.386294 | \n", "15.816945 | \n", "
Houston-The Woodlands-Sugar Land, TX | \n", "6892427 | \n", "5 | \n", "2.609438 | \n", "15.745934 | \n", "
Washington-Arlington-Alexandria, DC-VA-MD-WV | \n", "6216589 | \n", "6 | \n", "2.791759 | \n", "15.642732 | \n", "
Miami-Fort Lauderdale-West Palm Beach, FL | \n", "6158824 | \n", "7 | \n", "2.945910 | \n", "15.633396 | \n", "
Philadelphia-Camden-Wilmington, PA-NJ-DE-MD | \n", "6096120 | \n", "8 | \n", "3.079442 | \n", "15.623163 | \n", "
Atlanta-Sandy Springs-Roswell, GA | \n", "5884736 | \n", "9 | \n", "3.197225 | \n", "15.587872 | \n", "
Boston-Cambridge-Newton, MA-NH | \n", "4836531 | \n", "10 | \n", "3.302585 | \n", "15.391708 | \n", "