{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Processing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook reads in and processes the following U.S. Census Bureau county-level datasets on population estimates by age, sex and race/ethnicity:\n", "\n", "**For the 2010-18 period (all counties downloaded)**\n", "- [**Vintage 2018 county population estimates by demographic characteristics**](https://www.census.gov/data/tables/time-series/demo/popest/2010s-counties-detail.html): estimates released in 2011 for the 2000-2010 period, with no knowledge of the Census count at the end of the decade\n", "\n", "**For the 2000-10 period (only Maryland counties downloaded)**\n", "- [**County Intercensal Datasets: 2000-2010**](https://www.census.gov/data/datasets/time-series/demo/popest/intercensal-2000-2010-counties.html): revised estimates released after the 2010 Census count. The preferred estimates for the period. [Per the Bureau](https://www.census.gov/programs-surveys/popest/guidance.html): \"They differ from the postcensal estimates that are released annually because they rely on a formula that redistributes the difference between the April 1 postcensal estimate and April 1 census count for the end of the decade across the estimates for that decade.\"\n", "\n", "**For the 1990-99 period (all counties downloaded)**\n", "- [**State and County Intercensal Datasets: 1990-2000**](https://www.census.gov/data/datasets/time-series/demo/popest/intercensal-1990-2000-state-and-county-characteristics.html): revised estimates released after the 2000 Census count. The preferred estimates for the period. [Per the Bureau](https://www.census.gov/programs-surveys/popest/guidance.html): \"They differ from the postcensal estimates that are released annually because they rely on a formula that redistributes the difference between the April 1 postcensal estimate and April 1 census count for the end of the decade across the estimates for that decade.\"\n", "\n", "Notes:\n", "\n", "- The race/ethnicity categories for the 1990-2000 period are not directly comparable to data from post-2000 period in that the individuals were not presented with the option to self-identify with more than one race until the 2000 Census. Also during the 1990-99 period, \"Asian or Pacific Islander\" was a race category, while 2000 onwards breaks out \"Asian\" and \"Native Hawaiian or Other Pacific Islander\" separately.\n", "\n", "- In the 1990-90 and 2000-09 periods, the youngest age group is <1 years and the second-youngest age group is 1-4 years. In the 2010-18 period, the youngest age group is 0-4 years. This notebook standardizes the age groups such that the youngest category is 0-4 years.\n", "\n", "The raw datasets are saved in the `input/` folder. The processed datasets are saved in the `output/` folder." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "suppressMessages(library('tidyverse'))\n", "suppressMessages(library('reshape2'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions to recode year, age and race variables" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "recodeyear.18 <- function (df) {\n", " df <- df %>% mutate(year_name = case_when(YEAR == 3 ~ 2010, \n", " YEAR == 4 ~ 2011, \n", " YEAR == 5 ~ 2012,\n", " YEAR == 6 ~ 2013,\n", " YEAR == 7 ~ 2014,\n", " YEAR == 8 ~ 2015,\n", " YEAR == 9 ~ 2016, \n", " YEAR == 10 ~ 2017,\n", " YEAR == 11 ~ 2018))\n", " return(df)\n", "}\n", "\n", "recodeyear.00 <- function (df) {\n", " df <- df %>% mutate(year_name = case_when(YEAR == 2 ~ 2000,\n", " YEAR == 3 ~ 2001, \n", " YEAR == 4 ~ 2002, \n", " YEAR == 5 ~ 2003,\n", " YEAR == 6 ~ 2004,\n", " YEAR == 7 ~ 2005,\n", " YEAR == 8 ~ 2006,\n", " YEAR == 9 ~ 2007, \n", " YEAR == 10 ~ 2008,\n", " YEAR == 11 ~ 2009))\n", " return(df)\n", "}\n", "\n", "\n", "recodeage.18 <- function (df) {\n", " df <- df %>% mutate(age_name = case_when(AGEGRP == 1 ~ 'age_00-04',\n", " AGEGRP == 2 ~ 'age_05-09',\n", " AGEGRP == 3 ~ 'age_10-14',\n", " AGEGRP == 4 ~ 'age_15-19',\n", " AGEGRP == 5 ~ 'age_20-24',\n", " AGEGRP == 6 ~ 'age_25-29',\n", " AGEGRP == 7 ~ 'age_30-34',\n", " AGEGRP == 8 ~ 'age_35-39',\n", " AGEGRP == 9 ~ 'age_40-44',\n", " AGEGRP == 10 ~ 'age_45-49',\n", " AGEGRP == 11 ~ 'age_50-54',\n", " AGEGRP == 12 ~ 'age_55-59',\n", " AGEGRP == 13 ~ 'age_60-64',\n", " AGEGRP == 14 ~ 'age_65-69',\n", " AGEGRP == 15 ~ 'age_70-74',\n", " AGEGRP == 16 ~ 'age_75-79',\n", " AGEGRP == 17 ~ 'age_80-84',\n", " AGEGRP == 18 ~ 'age_85+',\n", " AGEGRP == 0 ~ 'total'))\n", " return(df)\n", "}\n", "\n", "recodeage.90.00 <- function (df) {\n", " df <- df %>% mutate(age_name = case_when(AGEGRP == 0 ~ 'age_00-04',\n", " AGEGRP == 1 ~ 'age_00-04',\n", " AGEGRP == 2 ~ 'age_05-09',\n", " AGEGRP == 3 ~ 'age_10-14',\n", " AGEGRP == 4 ~ 'age_15-19',\n", " AGEGRP == 5 ~ 'age_20-24',\n", " AGEGRP == 6 ~ 'age_25-29',\n", " AGEGRP == 7 ~ 'age_30-34',\n", " AGEGRP == 8 ~ 'age_35-39',\n", " AGEGRP == 9 ~ 'age_40-44',\n", " AGEGRP == 10 ~ 'age_45-49',\n", " AGEGRP == 11 ~ 'age_50-54',\n", " AGEGRP == 12 ~ 'age_55-59',\n", " AGEGRP == 13 ~ 'age_60-64',\n", " AGEGRP == 14 ~ 'age_65-69',\n", " AGEGRP == 15 ~ 'age_70-74',\n", " AGEGRP == 16 ~ 'age_75-79',\n", " AGEGRP == 17 ~ 'age_80-84',\n", " AGEGRP == 18 ~ 'age_85+',\n", " AGEGRP == 99 ~ 'total'))\n", " return(df)\n", "}\n", "\n", "recoderace <- function (df) {\n", " df <- df %>% mutate(race_name = case_when(race == 'H' ~ 'Hispanic', \n", " race == 'NHAA' ~ 'Asian', \n", " race == 'NHBA' ~ 'Black',\n", " race == 'NHIA' ~ 'American Indian and Alaska Native',\n", " race == 'NHNA' ~ 'Native Hawaiian and Other Pacific Islander',\n", " race == 'NHTOM' ~ 'Multiple',\n", " race == 'NHWA' ~ 'White', \n", " race == 'TOT' ~ 'Total'))\n", " return(df)\n", "}\n", "\n", "recoderace.90 <- function (df) {\n", " df <- df %>% mutate(race_sex = case_when(age_sex == 1 ~ 'White_male',\n", " age_sex == 2 ~ 'White_female',\n", " age_sex == 3 ~ 'Black_male',\n", " age_sex == 4 ~ 'Black_female',\n", " age_sex == 5 ~ 'AmericanIndianAlaskanNative_male',\n", " age_sex == 6 ~ 'AmericanIndianAlaskanNative_female',\n", " age_sex == 7 ~ 'AsianPac_male',\n", " age_sex == 8 ~ 'AsianPac_female'),\n", " hisp = case_when(ethnic == 1 ~ 'not_Hispanic',\n", " ethnic == 2 ~ 'Hispanic'))\n", " return(df)\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*2010-18 period*" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "data.18 <- suppressMessages(read_csv('input/CC-EST2018-ALLDATA.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a subset of the data and include just the July population estimates (YEAR > 2, per the file layout [documentation](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2018/cc-est2018-alldata.pdf)). Race/ethnicity categories selected such that the sum of the categories will be the total population." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "data.18.subset <- data.18 %>% select(STATE, \n", " STNAME, \n", " COUNTY, \n", " CTYNAME,\n", " AGEGRP,\n", " YEAR,\n", " TOT_POP,\n", " TOT_MALE,\n", " TOT_FEMALE,\n", " NHWA_MALE,\n", " NHWA_FEMALE,\n", " NHBA_MALE,\n", " NHBA_FEMALE,\n", " NHIA_MALE,\n", " NHIA_FEMALE,\n", " NHAA_MALE,\n", " NHAA_FEMALE,\n", " NHNA_MALE,\n", " NHNA_FEMALE,\n", " NHTOM_MALE,\n", " NHTOM_FEMALE,\n", " H_MALE,\n", " H_FEMALE) %>% filter(YEAR > 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recode year and age" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "data.18.subset <- recodeyear.18(data.18.subset)\n", "data.18.subset <- recodeage(data.18.subset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reshape data to long and create a separate column for sex" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "data.18.subset.long <- melt(data.18.subset, id.vars = c('STATE', \n", " 'STNAME', \n", " 'COUNTY', \n", " 'CTYNAME',\n", " 'age_name',\n", " 'AGEGRP',\n", " 'YEAR',\n", " 'year_name'))\n", "\n", "data.18.subset.long <- data.18.subset.long %>% separate(variable, into = c('race',\n", " 'sex')) %>% \n", "mutate(sex = ifelse(sex == 'POP', 'both', sex),\n", " race = ifelse(race == 'TOT', 'all', race))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "Recode race" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [], "source": [ "data.18.subset.long <- recoderace(data.18.subset.long)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save to CSV" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "write_csv(data.18.subset.long, \n", " 'output/data_2018.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*2000-09 period*" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [], "source": [ "md.data.00 <- suppressMessages(read_csv('co-est00int-alldata-24.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Select a subset of the data and include just the July population estimates over the period (YEAR > 1 and YEAR < 12, per the file layout [documentation](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2000-2010/intercensal/county/co-est00int-alldata.pdf)). Race/ethnicity categories selected such that the sum of the categories will be the total population." ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "md.data.00.subset <- md.data.00 %>% select(COUNTY, \n", " CTYNAME,\n", " AGEGRP,\n", " YEAR,\n", " TOT_POP,\n", " TOT_MALE,\n", " TOT_FEMALE,\n", " NHWA_MALE,\n", " NHWA_FEMALE,\n", " NHBA_MALE,\n", " NHBA_FEMALE,\n", " NHIA_MALE,\n", " NHIA_FEMALE,\n", " NHAA_MALE,\n", " NHAA_FEMALE,\n", " NHNA_MALE,\n", " NHNA_FEMALE,\n", " NHTOM_MALE,\n", " NHTOM_FEMALE,\n", " H_MALE,\n", " H_FEMALE) %>% \n", " filter(YEAR > 1 & YEAR < 12)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recode year and age" ] }, { "cell_type": "code", "execution_count": 174, "metadata": {}, "outputs": [], "source": [ "md.data.00.subset <- recodeyear.00(md.data.00.subset)\n", "md.data.00.subset <- recodeage.90.00(md.data.00.subset)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Reshape data to long and create a separate column for sex" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [], "source": [ "md.data.00.subset.long <- melt(md.data.00.subset, id.vars = c('COUNTY',\n", " 'CTYNAME',\n", " 'age_name',\n", " 'AGEGRP',\n", " 'YEAR',\n", " 'year_name'))\n", "\n", "md.data.00.subset.long <- md.data.00.subset.long %>% separate(variable, into = c('race',\n", " 'sex')) %>% \n", " mutate(sex = ifelse(sex == 'POP', 'both', sex),\n", " race = ifelse(race == 'TOT', 'all', race))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Group by age categories and sum to get the totals for the 0-4 age category" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [], "source": [ "md.data.00.subset.long.grouped <- md.data.00.subset.long %>% \n", "group_by(year_name, \n", " COUNTY,\n", " CTYNAME, \n", " age_name, \n", " race, \n", " sex,\n", " YEAR) %>%\n", " summarise(value = sum(value))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recode race" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [], "source": [ "md.data.00.subset.long.grouped <- recoderace(md.data.00.subset.long.grouped)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save to CSV" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "write_csv(md.data.00.subset.long.grouped, \n", " 'output/data_md_2000.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*1990-99 period*" ] }, { "cell_type": "code", "execution_count": 186, "metadata": {}, "outputs": [], "source": [ "data.90 <- suppressMessages(read_csv('input/stch-icen1990.csv'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recode race and age" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "data.90 <- recodeage.90.00(data.90)\n", "data.90 <- recoderace.90(data.90)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a separate column for sex" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [], "source": [ "data.90 <- data.90 %>% separate(race_sex, into = c('race', 'sex'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Group by age categories and sum to get the totals for the 0-4 age category" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [], "source": [ "data.1990.grouped <- data.90 %>% \n", " group_by(year, fips, age_name, race, sex, hisp) %>%\n", " summarise(pop = sum(pop)) %>% ungroup() %>% \n", " mutate(year = paste0('19',year))" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [], "source": [ "write_csv(data.1990.grouped, 'output/data_1990.csv')" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
yearfipsage_nameracesexhisppop
1990 1001 age_00-04 AmericanIndianAlaskanNativefemale Hispanic 0
1990 1001 age_00-04 AmericanIndianAlaskanNativefemale not_Hispanic 4
1990 1001 age_00-04 AmericanIndianAlaskanNativemale Hispanic 0
1990 1001 age_00-04 AmericanIndianAlaskanNativemale not_Hispanic 0
1990 1001 age_00-04 AsianPac female Hispanic 0
1990 1001 age_00-04 AsianPac female not_Hispanic 4
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllll}\n", " year & fips & age\\_name & race & sex & hisp & pop\\\\\n", "\\hline\n", "\t 1990 & 1001 & age\\_00-04 & AmericanIndianAlaskanNative & female & Hispanic & 0 \\\\\n", "\t 1990 & 1001 & age\\_00-04 & AmericanIndianAlaskanNative & female & not\\_Hispanic & 4 \\\\\n", "\t 1990 & 1001 & age\\_00-04 & AmericanIndianAlaskanNative & male & Hispanic & 0 \\\\\n", "\t 1990 & 1001 & age\\_00-04 & AmericanIndianAlaskanNative & male & not\\_Hispanic & 0 \\\\\n", "\t 1990 & 1001 & age\\_00-04 & AsianPac & female & Hispanic & 0 \\\\\n", "\t 1990 & 1001 & age\\_00-04 & AsianPac & female & not\\_Hispanic & 4 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| year | fips | age_name | race | sex | hisp | pop |\n", "|---|---|---|---|---|---|---|\n", "| 1990 | 1001 | age_00-04 | AmericanIndianAlaskanNative | female | Hispanic | 0 |\n", "| 1990 | 1001 | age_00-04 | AmericanIndianAlaskanNative | female | not_Hispanic | 4 |\n", "| 1990 | 1001 | age_00-04 | AmericanIndianAlaskanNative | male | Hispanic | 0 |\n", "| 1990 | 1001 | age_00-04 | AmericanIndianAlaskanNative | male | not_Hispanic | 0 |\n", "| 1990 | 1001 | age_00-04 | AsianPac | female | Hispanic | 0 |\n", "| 1990 | 1001 | age_00-04 | AsianPac | female | not_Hispanic | 4 |\n", "\n" ], "text/plain": [ " year fips age_name race sex hisp pop\n", "1 1990 1001 age_00-04 AmericanIndianAlaskanNative female Hispanic 0 \n", "2 1990 1001 age_00-04 AmericanIndianAlaskanNative female not_Hispanic 4 \n", "3 1990 1001 age_00-04 AmericanIndianAlaskanNative male Hispanic 0 \n", "4 1990 1001 age_00-04 AmericanIndianAlaskanNative male not_Hispanic 0 \n", "5 1990 1001 age_00-04 AsianPac female Hispanic 0 \n", "6 1990 1001 age_00-04 AsianPac female not_Hispanic 4 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(data.1990.grouped)" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.5.3" } }, "nbformat": 4, "nbformat_minor": 2 }