{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Zipf's Law and US Metro Population Growth\n", "*Jeremy A. Seibert*\n", "\n", "

\n", "\n", "George Zipf (pictured) was a Harvard linguist in the early 20th century who postulated, and then found, that within a language a few words are used very frequently while most are hardly ever used. Though initially intended only for analyzing word frequencies, the generalized form, known as the Zipf-Mandelbrot law, has been found in many unrelated disciplines.\n", "\n", "As it turns out, Zipf's law also helps answer an interesting question in urban economics: city growth. In this notebook, we show the rank-size distribution of United States metropolitan area populations and how Zipf's law explains it.\n", "\n", "If Zipf's law holds, then regressing the log of metro rank on the log of metro population should give a slope coefficient of approximately -1.0.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Gather the tools\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import seaborn as sns\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Metro Rank-Size Distribution\n", "\n", "Using population data collected by the US Census Bureau, we can begin to construct the rank-size distribution. In this notebook we use the 2017 population estimates as our base year. The methodology document included in the repo explains how the Census Bureau derives its estimates for metro area populations.\n", "\n", "As a quick overview, they use the most recent census year (2000, 2010, 2020, etc.) 
as their base year and, in conjunction with other population-based information, derive the estimate.\n", "\n", "Population Base + Births - Deaths + Migration = Population Estimate\n", "\n", "I would be remiss if I did not point out that there is obviously room for error in this calculation. However, for our use case in this notebook the error is a non-issue. Also, wherever this notebook mentions a \"city\", read it as an agglomeration as characterized by the Census Bureau's Metropolitan Statistical Areas (metros)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th>NAME</th><th>POPESTIMATE2017</th><th>Rank</th><th>LogRank</th><th>LogPop</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>New York-Newark-Jersey City, NY-NJ-PA</th><td>20320876</td><td>1</td><td>1.000000</td><td>16.827159</td></tr>\n",
"    <tr><th>Los Angeles-Long Beach-Anaheim, CA</th><td>13353907</td><td>2</td><td>1.693147</td><td>16.407320</td></tr>\n",
"    <tr><th>Chicago-Naperville-Elgin, IL-IN-WI</th><td>9533040</td><td>3</td><td>2.098612</td><td>16.070274</td></tr>\n",
"    <tr><th>Dallas-Fort Worth-Arlington, TX</th><td>7399662</td><td>4</td><td>2.386294</td><td>15.816945</td></tr>\n",
"    <tr><th>Houston-The Woodlands-Sugar Land, TX</th><td>6892427</td><td>5</td><td>2.609438</td><td>15.745934</td></tr>\n",
"    <tr><th>Washington-Arlington-Alexandria, DC-VA-MD-WV</th><td>6216589</td><td>6</td><td>2.791759</td><td>15.642732</td></tr>\n",
"    <tr><th>Miami-Fort Lauderdale-West Palm Beach, FL</th><td>6158824</td><td>7</td><td>2.945910</td><td>15.633396</td></tr>\n",
"    <tr><th>Philadelphia-Camden-Wilmington, PA-NJ-DE-MD</th><td>6096120</td><td>8</td><td>3.079442</td><td>15.623163</td></tr>\n",
"    <tr><th>Atlanta-Sandy Springs-Roswell, GA</th><td>5884736</td><td>9</td><td>3.197225</td><td>15.587872</td></tr>\n",
"    <tr><th>Boston-Cambridge-Newton, MA-NH</th><td>4836531</td><td>10</td><td>3.302585</td><td>15.391708</td></tr>\n",
"  </tbody>\n",
"</table>
" ], "text/plain": [ " POPESTIMATE2017 Rank LogRank \\\n", "NAME \n", "New York-Newark-Jersey City, NY-NJ-PA 20320876 1 1.000000 \n", "Los Angeles-Long Beach-Anaheim, CA 13353907 2 1.693147 \n", "Chicago-Naperville-Elgin, IL-IN-WI 9533040 3 2.098612 \n", "Dallas-Fort Worth-Arlington, TX 7399662 4 2.386294 \n", "Houston-The Woodlands-Sugar Land, TX 6892427 5 2.609438 \n", "Washington-Arlington-Alexandria, DC-VA-MD-WV 6216589 6 2.791759 \n", "Miami-Fort Lauderdale-West Palm Beach, FL 6158824 7 2.945910 \n", "Philadelphia-Camden-Wilmington, PA-NJ-DE-MD 6096120 8 3.079442 \n", "Atlanta-Sandy Springs-Roswell, GA 5884736 9 3.197225 \n", "Boston-Cambridge-Newton, MA-NH 4836531 10 3.302585 \n", "\n", " LogPop \n", "NAME \n", "New York-Newark-Jersey City, NY-NJ-PA 16.827159 \n", "Los Angeles-Long Beach-Anaheim, CA 16.407320 \n", "Chicago-Naperville-Elgin, IL-IN-WI 16.070274 \n", "Dallas-Fort Worth-Arlington, TX 15.816945 \n", "Houston-The Woodlands-Sugar Land, TX 15.745934 \n", "Washington-Arlington-Alexandria, DC-VA-MD-WV 15.642732 \n", "Miami-Fort Lauderdale-West Palm Beach, FL 15.633396 \n", "Philadelphia-Camden-Wilmington, PA-NJ-DE-MD 15.623163 \n", "Atlanta-Sandy Springs-Roswell, GA 15.587872 \n", "Boston-Cambridge-Newton, MA-NH 15.391708 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Read in the Population Estimates\n", "file = './Dataset/US_Metro_Pop_Est.csv'\n", "pop = pd.read_csv(file,usecols=list(range(3,15)),encoding='latin-1')\n", "\n", "#Filter Metro-Only\n", "pop = pop.loc[pop['LSAD'] == 'Metropolitan Statistical Area'].set_index('NAME')\n", "\n", "#Filter for 2017\n", "pop_17 = pop[['POPESTIMATE2017']].sort_values(by='POPESTIMATE2017', ascending=False)\n", "\n", "#Rank the Metros\n", "pop_17['Rank'] = range(1,len(pop_17.index)+1)\n", "\n", "# Define Functions to Convert Values\n", "log_pop = lambda x: np.log(x)\n", "log_rank = lambda x: np.log(x) + 1\n", "\n", "# Convert to Log values\n", "pop_17['LogRank'] = 
pop_17['Rank'].apply(log_rank)\n", "pop_17['LogPop'] = pop_17['POPESTIMATE2017'].apply(log_pop)\n", "\n", "# Show the DataFrame\n", "pop_17.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Find the Regression Coefficient\n", "\n", "In work published by Paul Krugman in 1996 and Gabaix in 1999, regressing metro rank against metro size (both in logs) yielded a coefficient very close to -1.0. Let's see in the next steps if we can do the same.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "R^2: 0.9741162743633364\n", "Coeffecient: [[-0.88246969]]\n" ] } ], "source": [ "# Import LinearRegression\n", "from sklearn.linear_model import LinearRegression\n", "\n", "# Create the regressor: reg\n", "reg = LinearRegression()\n", "\n", "# Create arrays for features and target variable\n", "y = pop_17['LogRank']\n", "X = pop_17['LogPop']\n", "\n", "# Reshape X and y\n", "y = y.reshape(-1,1)\n", "X_LogPop = X.reshape(-1,1)\n", "\n", "# Create the prediction space\n", "prediction_space = np.linspace(min(X_LogPop), max(X_LogPop)).reshape(-1,1)\n", "\n", "# Fit the model to the data\n", "reg.fit(X_LogPop, y)\n", "\n", "# Compute predictions over the prediction space: y_pred\n", "y_pred = reg.predict(prediction_space)\n", "\n", "#Visualize the Distribution\n", "sns.lmplot(x='LogPop',y='LogRank', data=pop_17)\n", "plt.show()\n", "\n", "# Print R^2 \n", "print('R^2: {}'.format(reg.score(X_LogPop, y)))\n", "print('Coeffecient: {}'.format(reg.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Not Quite -1.0 , What is different?\n", "\n", "In the regression above we obtained a coeffecient value close to -0.9, which is not quite the -1.0 that was reported by Krugman and Gabaix. We see that within the middle section of the distribution the fit is near perfect, this section includes metro's such as Ashville, NC and Lexington, KY, however towards the tails we find that the fit is not so nice. Gabaix in his 1999 paper provides an explaiantion for the smaller metros, in that they are more succeptible to industry-level shocks due to a lack of indusrty diversity (Gabaix, 1999).\n", "\n", "Back to the regression, What did we miss? The key here is that both Krugman and Gabaix both ran their regressions on the top-135 largest metros, rather than all 382 metros (Gabiax,1999) (Krugman,1996). 
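
The effect of the cutoff can be sketched in a few lines. This is a minimal illustration rather than the notebook's own method: `rank_size_slope` and `top_n` are hypothetical names, the fit is a hand-rolled least-squares slope instead of scikit-learn's `LinearRegression`, and the `ideal` list is synthetic data that obeys Zipf's law exactly. The notebook's `LogRank` adds 1 to the log of the rank; that constant only shifts the intercept, so it is omitted here without affecting the slope.

```python
import math


def rank_size_slope(populations, top_n=None):
    # Hypothetical helper: slope of log(rank) regressed on log(population).
    # Sort largest first so that rank 1 is the biggest metro.
    sizes = sorted(populations, reverse=True)
    if top_n is not None:
        sizes = sizes[:top_n]  # [:top_n] keeps exactly top_n metros
    if len(sizes) < 2:
        raise ValueError('need at least two metros to fit a slope')

    # Natural logs of the 1-based rank and of the population.
    log_rank = [math.log(rank) for rank in range(1, len(sizes) + 1)]
    log_pop = [math.log(size) for size in sizes]

    # Ordinary least-squares slope: cov(x, y) / var(x).
    mean_x = sum(log_pop) / len(log_pop)
    mean_y = sum(log_rank) / len(log_rank)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_pop, log_rank))
    sxx = sum((x - mean_x) ** 2 for x in log_pop)
    return sxy / sxx


# Synthetic, perfectly Zipfian metros: population = 20,000,000 / rank.
ideal = [20_000_000 / rank for rank in range(1, 383)]
print(round(rank_size_slope(ideal), 6))             # -1.0
print(round(rank_size_slope(ideal, top_n=135), 6))  # -1.0

# The ten largest 2017 metro populations from the table in section 1.
top_ten = [20320876, 13353907, 9533040, 7399662, 6892427,
           6216589, 6158824, 6096120, 5884736, 4836531]
print(rank_size_slope(top_ten))
```

On the synthetic list the slope stays at -1.0 whatever the cutoff, which suggests that any drift away from -1.0 as `top_n` changes on real data comes from the tails of the actual distribution rather than from the fitting procedure.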
\n", "\n", "Let's re-run our regression and check out the result." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "R^2: 0.9762999583583303\n", "Coeffecient: [[-1.07343662]]\n" ] } ], "source": [ "# Import LinearRegression\n", "from sklearn.linear_model import LinearRegression\n", "\n", "# Create the regressor: reg\n", "reg = LinearRegression()\n", "\n", "# Create arrays for features and target variable\n", "y = pop_17['LogRank'][:134]\n", "X = pop_17['LogPop'][:134]\n", "\n", "# Reshape X and y\n", "y = y.reshape(-1,1)\n", "X_LogPop = X.reshape(-1,1)\n", "\n", "# Create the prediction space\n", "prediction_space = np.linspace(min(X_LogPop), max(X_LogPop)).reshape(-1,1)\n", "\n", "# Fit the model to the data\n", "reg.fit(X_LogPop, y)\n", "\n", "# Compute predictions over the prediction space: y_pred\n", "y_pred = reg.predict(prediction_space)\n", "\n", "#Visualize the Distribution\n", "sns.lmplot(x='LogPop',y='LogRank', data=pop_17[:134])\n", "plt.show()\n", "\n", "# Print R^2 \n", "print('R^2: {}'.format(reg.score(X_LogPop, y)))\n", "print('Coeffecient: {}'.format(reg.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. One more try!\n", "\n", "We didn't exactly get the -1.0 coeffecient with the Top 135 either, but it we are pretty close. In this run we got a correlation coeffcient of approximately 1.1%. This means that we can confirm that somewhere between using all of the metro areas at -0.9 and the top-135 at -1.1, there exists a line of best fit that closely appoximates -1.0. \n", "\n", "If we look at the Top-203 largest metros we find the arbitrialy best fit to rank-size regression coeffecient of -1.0 for the data from 2017. " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "R^2: 0.977705640850799\n", "Coeffecient: [[-1.00146052]]\n" ] } ], "source": [ "# Import LinearRegression\n", "from sklearn.linear_model import LinearRegression\n", "\n", "# Create the regressor: reg\n", "reg = LinearRegression()\n", "\n", "# Create arrays for features and target variable\n", "y = pop_17['LogRank'][:202]\n", "X = pop_17['LogPop'][:202]\n", "\n", "# Reshape X and y\n", "y = y.reshape(-1,1)\n", "X_LogPop = X.reshape(-1,1)\n", "\n", "# Create the prediction space\n", "prediction_space = np.linspace(min(X_LogPop), max(X_LogPop)).reshape(-1,1)\n", "\n", "# Fit the model to the data\n", "reg.fit(X_LogPop, y)\n", "\n", "# Compute predictions over the prediction space: y_pred\n", "y_pred = reg.predict(prediction_space)\n", "\n", "#Visualize the Distribution\n", "sns.lmplot(x='LogPop',y='LogRank', data=pop_17[:202])\n", "plt.show()\n", "\n", "# Print R^2 \n", "print('R^2: {}'.format(reg.score(X_LogPop, y)))\n", "print('Coeffecient: {}'.format(reg.coef_))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Eureka! \n", "\n", "There is our -1.0 regression coeffecient! \n", "\n", "Through all of this fun, and several beautiful R^2 and coeffecient values, we found that US Metro Areas do closely follow the distribution that George Zipfs found almost 90 years ago.\n", "\n", "This notebook could be re-run using (2016,2015,2014,etc.) which are contained in the csv file, just change the pandas population column and run!\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Citations\n", "\n", "* Gabiax, X., ZIPF’S LAW FOR CITIES: AN EXPLANATION (Cambridge, MA, 1999). 
[Actual Paper](http://www.casa.ucl.ac.uk/mike-michigan-april1/mike%27s%20stuff/attach/Gabaix.pdf)\n", "\n", "* Krugman, P., The Self-Organizing Economy (Cambridge, MA: Blackwell, 1996a); “Confronting the Urban Mystery.”\n", "\n", "* [Census Population Estimation Methodology](./Metro_Est_Methodology.pdf)\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }