{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# More with R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**By [Ryan Menezes](https://twitter.com/ryanvmenezes) (Los Angeles Times) & [Christine Zhang](https://twitter.com/christinezhang) (Knight-Mozilla / Los Angeles Times)**\n", "\n", "*IRE conference -- New Orleans, LA* \n", " \n", "June 18, 2016 \n", "\n", "This workshop is a continuation of our previous session, [Getting Started with R](Getting%20started%20with%20R.ipynb).\n", "\n", "To recap, in Getting Started with R, we cleaned and merged two census datasets with some demographic information on Louisiana for the years 2000 and 2010, with a view toward understanding population changes pre- and post-Hurricane Katrina, which took place in August 2005.\n", "\n", "As in our last session, we will use [The Times-Picayune](http://www.nola.com/politics/index.ssf/2011/02/new_orleans_officials_2010_pop.html) article as the inspiration for our analysis.\n", "\n", "**In this session, we will:**\n", "\n", "* Use scatterplots and histograms to better see trends in our data\n", "* Query our data for insights we could write about\n", "* Group our data to perform aggregate calculations\n", "* Use R's built-in regression tools to visualize trendlines\n", "\n", "**Here are some questions we will set out to answer:**\n", "\n", "* How many census tracts in New Orleans had fewer people in 2010 than in 2000?\n", "* Which parishes (Louisiana lingo for counties) saw the most dramatic population changes in that time period?\n", "* How did the occupancy of Louisiana homes change in that time period?\n", "* Which parishes saw the most dramatic occupancy changes?\n", "* Is there a relationship between population and occupancy rate?\n", "\n", "The following code and annotations were written in a Jupyter notebook. 
The code is best run in RStudio version 0.99.902 using R version 3.3.0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's read in the census data file we created in our last workshop and visually inspect the first six rows using `head`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
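<table><thead><tr><th></th><th>fips.code</th><th>tract</th><th>parish</th><th>state</th><th>population.00</th><th>total.housing.units.00</th><th>occupied.housing.units.00</th><th>vacant.housing.units.00</th><th>population.10</th><th>total.housing.units.10</th><th>occupied.housing.units.10</th><th>vacant.housing.units.10</th></tr></thead><tbody>
<tr><td>1</td><td>22001960100</td><td>Census Tract 9601</td><td>Acadia Parish</td><td>Louisiana</td><td>6188</td><td>2410</td><td>2236</td><td>174</td><td>6213</td><td>2574</td><td>2345</td><td>229</td></tr>
<tr><td>2</td><td>22001960200</td><td>Census Tract 9602</td><td>Acadia Parish</td><td>Louisiana</td><td>5056</td><td>1909</td><td>1764</td><td>145</td><td>5988</td><td>2362</td><td>2144</td><td>218</td></tr>
<tr><td>3</td><td>22001960300</td><td>Census Tract 9603</td><td>Acadia Parish</td><td>Louisiana</td><td>3149</td><td>1246</td><td>1145</td><td>101</td><td>3582</td><td>1427</td><td>1286</td><td>141</td></tr>
<tr><td>4</td><td>22001960400</td><td>Census Tract 9604</td><td>Acadia Parish</td><td>Louisiana</td><td>5617</td><td>2176</td><td>1991</td><td>185</td><td>6584</td><td>2604</td><td>2362</td><td>242</td></tr>
<tr><td>5</td><td>22001960500</td><td>Census Tract 9605</td><td>Acadia Parish</td><td>Louisiana</td><td>4927</td><td>1796</td><td>1692</td><td>104</td><td>6093</td><td>2349</td><td>2178</td><td>171</td></tr>
<tr><td>6</td><td>22001960600</td><td>Census Tract 9606</td><td>Acadia Parish</td><td>Louisiana</td><td>5654</td><td>2292</td><td>2073</td><td>219</td><td>5972</td><td>2504</td><td>2306</td><td>198</td></tr></tbody></table>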
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllll}\n", " & fips.code & tract & parish & state & population.00 & total.housing.units.00 & occupied.housing.units.00 & vacant.housing.units.00 & population.10 & total.housing.units.10 & occupied.housing.units.10 & vacant.housing.units.10\\\\\n", "\\hline\n", "\t1 & 22001960100 & Census Tract 9601 & Acadia Parish & Louisiana & 6188 & 2410 & 2236 & 174 & 6213 & 2574 & 2345 & 229\\\\\n", "\t2 & 22001960200 & Census Tract 9602 & Acadia Parish & Louisiana & 5056 & 1909 & 1764 & 145 & 5988 & 2362 & 2144 & 218\\\\\n", "\t3 & 22001960300 & Census Tract 9603 & Acadia Parish & Louisiana & 3149 & 1246 & 1145 & 101 & 3582 & 1427 & 1286 & 141\\\\\n", "\t4 & 22001960400 & Census Tract 9604 & Acadia Parish & Louisiana & 5617 & 2176 & 1991 & 185 & 6584 & 2604 & 2362 & 242\\\\\n", "\t5 & 22001960500 & Census Tract 9605 & Acadia Parish & Louisiana & 4927 & 1796 & 1692 & 104 & 6093 & 2349 & 2178 & 171\\\\\n", "\t6 & 22001960600 & Census Tract 9606 & Acadia Parish & Louisiana & 5654 & 2292 & 2073 & 219 & 5972 & 2504 & 2306 & 198\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " fips.code tract parish state population.00\n", "1 22001960100 Census Tract 9601 Acadia Parish Louisiana 6188\n", "2 22001960200 Census Tract 9602 Acadia Parish Louisiana 5056\n", "3 22001960300 Census Tract 9603 Acadia Parish Louisiana 3149\n", "4 22001960400 Census Tract 9604 Acadia Parish Louisiana 5617\n", "5 22001960500 Census Tract 9605 Acadia Parish Louisiana 4927\n", "6 22001960600 Census Tract 9606 Acadia Parish Louisiana 5654\n", " total.housing.units.00 occupied.housing.units.00 vacant.housing.units.00\n", "1 2410 2236 174\n", "2 1909 1764 145\n", "3 1246 1145 101\n", "4 2176 1991 185\n", "5 1796 1692 104\n", "6 2292 2073 219\n", " population.10 total.housing.units.10 occupied.housing.units.10\n", "1 6213 2574 2345\n", "2 5988 2362 2144\n", "3 3582 1427 1286\n", "4 6584 2604 2362\n", "5 6093 2349 2178\n", "6 5972 2504 2306\n", " 
vacant.housing.units.10\n", "1 229\n", "2 218\n", "3 141\n", "4 242\n", "5 171\n", "6 198" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census <- read.csv('census_comparison.csv')\n", "head(census)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Grouping and summing populations by parish\n", "\n", "Recall that our information is organized by census tract. We'll use dplyr's `group_by` to group the data by parish. We'll use dplyr's `summarise_each` to add up the columns. \n", "\n", "Then we'll calculate the percent change between the two years:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning message:\n", ": package ‘dplyr’ was built under R version 3.2.4\n", "Attaching package: ‘dplyr’\n", "\n", "The following objects are masked from ‘package:stats’:\n", "\n", " filter, lag\n", "\n", "The following objects are masked from ‘package:base’:\n", "\n", " intersect, setdiff, setequal, union\n", "\n" ] } ], "source": [ "## if dplyr was not installed we would have to run this\n", "# install.packages('dplyr')\n", "\n", "## to import the package and all of its functions \n", "library('dplyr')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [], "source": [ "parishes <- census %>% # notice the use of \"%>%\"\n", "group_by(parish) %>% # this tells R to group our data by parishes\n", "summarise_each( \n", " # sum all the columns \n", " funs(sum(., na.rm = TRUE)),\n", " # except the non-numerical ones\n", " -fips.code, -tract, -state\n", ") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You'll notice that we used a strange symbol, `%>%`, to accomplish this. This is what R calls the \"pipe operator.\" It has gained popularity in recent years as an alternative to nesting functions. 
\n", "\n", "If we were to accomplish the same task without piping, we would need to write the following unwieldy line of code:\n", "\n", "```\n", "summarise_each(group_by(census, parish), funs(sum(., na.rm = TRUE)), -fips.code, -tract, -state)\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
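<table><thead><tr><th></th><th>parish</th><th>population.00</th><th>total.housing.units.00</th><th>occupied.housing.units.00</th><th>vacant.housing.units.00</th><th>population.10</th><th>total.housing.units.10</th><th>occupied.housing.units.10</th><th>vacant.housing.units.10</th><th>perc.pop.diff</th></tr></thead><tbody>
<tr><td>1</td><td>Acadia Parish</td><td>58861</td><td>23209</td><td>21142</td><td>2067</td><td>61773</td><td>25387</td><td>22841</td><td>2546</td><td>4.947249</td></tr>
<tr><td>2</td><td>Allen Parish</td><td>25440</td><td>9157</td><td>8102</td><td>1055</td><td>25764</td><td>9733</td><td>8516</td><td>1217</td><td>1.273585</td></tr>
<tr><td>3</td><td>Ascension Parish</td><td>76627</td><td>29172</td><td>26691</td><td>2481</td><td>107215</td><td>40784</td><td>37790</td><td>2994</td><td>39.91804</td></tr>
<tr><td>4</td><td>Assumption Parish</td><td>23388</td><td>9635</td><td>8239</td><td>1396</td><td>23421</td><td>10351</td><td>8736</td><td>1615</td><td>0.141098</td></tr>
<tr><td>5</td><td>Avoyelles Parish</td><td>41481</td><td>16576</td><td>14736</td><td>1840</td><td>42073</td><td>18042</td><td>15432</td><td>2610</td><td>1.427159</td></tr>
<tr><td>6</td><td>Beauregard Parish</td><td>32986</td><td>14501</td><td>12104</td><td>2397</td><td>35654</td><td>15040</td><td>13159</td><td>1881</td><td>8.08828</td></tr></tbody></table>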
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllll}\n", " & parish & population.00 & total.housing.units.00 & occupied.housing.units.00 & vacant.housing.units.00 & population.10 & total.housing.units.10 & occupied.housing.units.10 & vacant.housing.units.10 & perc.pop.diff\\\\\n", "\\hline\n", "\t1 & Acadia Parish & 58861 & 23209 & 21142 & 2067 & 61773 & 25387 & 22841 & 2546 & 4.947249\\\\\n", "\t2 & Allen Parish & 25440 & 9157 & 8102 & 1055 & 25764 & 9733 & 8516 & 1217 & 1.273585\\\\\n", "\t3 & Ascension Parish & 76627 & 29172 & 26691 & 2481 & 107215 & 40784 & 37790 & 2994 & 39.91804\\\\\n", "\t4 & Assumption Parish & 23388 & 9635 & 8239 & 1396 & 23421 & 10351 & 8736 & 1615 & 0.141098\\\\\n", "\t5 & Avoyelles Parish & 41481 & 16576 & 14736 & 1840 & 42073 & 18042 & 15432 & 2610 & 1.427159\\\\\n", "\t6 & Beauregard Parish & 32986 & 14501 & 12104 & 2397 & 35654 & 15040 & 13159 & 1881 & 8.08828\\\\\n", "\\end{tabular}\n" ], "text/plain": [ "Source: local data frame [6 x 10]\n", "\n", " parish population.00 total.housing.units.00\n", " (fctr) (int) (int)\n", "1 Acadia Parish 58861 23209\n", "2 Allen Parish 25440 9157\n", "3 Ascension Parish 76627 29172\n", "4 Assumption Parish 23388 9635\n", "5 Avoyelles Parish 41481 16576\n", "6 Beauregard Parish 32986 14501\n", "Variables not shown: occupied.housing.units.00 (int), vacant.housing.units.00\n", " (int), population.10 (int), total.housing.units.10 (int),\n", " occupied.housing.units.10 (int), vacant.housing.units.10 (int), perc.pop.diff\n", " (dbl)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parishes$perc.pop.diff <- (parishes$population.10 - parishes$population.00) / parishes$population.00 * 100\n", "\n", "head(parishes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can answer the question: **Which parishes (Louisiana lingo for counties) saw the most dramatic population changes?**\n", "\n", "We'll do this by arranging the rows by the percent 
change in population, using dplyr's `arrange`. To make the output easier to see, we'll only select a few of the columns." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
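<table><thead><tr><th></th><th>parish</th><th>population.00</th><th>population.10</th><th>perc.pop.diff</th></tr></thead><tbody>
<tr><td>1</td><td>St. Bernard Parish</td><td>67229</td><td>35897</td><td>-46.60489</td></tr>
<tr><td>2</td><td>Cameron Parish</td><td>9991</td><td>6839</td><td>-31.54839</td></tr>
<tr><td>3</td><td>Orleans Parish</td><td>484674</td><td>343829</td><td>-29.05974</td></tr>
<tr><td>4</td><td>Tensas Parish</td><td>6618</td><td>5252</td><td>-20.64068</td></tr>
<tr><td>5</td><td>East Carroll Parish</td><td>9421</td><td>7759</td><td>-17.64144</td></tr>
<tr><td>6</td><td>Plaquemines Parish</td><td>26757</td><td>23042</td><td>-13.88422</td></tr></tbody></table>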
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " & parish & population.00 & population.10 & perc.pop.diff\\\\\n", "\\hline\n", "\t1 & St. Bernard Parish & 67229 & 35897 & -46.60489\\\\\n", "\t2 & Cameron Parish & 9991 & 6839 & -31.54839\\\\\n", "\t3 & Orleans Parish & 484674 & 343829 & -29.05974\\\\\n", "\t4 & Tensas Parish & 6618 & 5252 & -20.64068\\\\\n", "\t5 & East Carroll Parish & 9421 & 7759 & -17.64144\\\\\n", "\t6 & Plaquemines Parish & 26757 & 23042 & -13.88422\\\\\n", "\\end{tabular}\n" ], "text/plain": [ "Source: local data frame [6 x 4]\n", "\n", " parish population.00 population.10 perc.pop.diff\n", " (fctr) (int) (int) (dbl)\n", "1 St. Bernard Parish 67229 35897 -46.60489\n", "2 Cameron Parish 9991 6839 -31.54839\n", "3 Orleans Parish 484674 343829 -29.05974\n", "4 Tensas Parish 6618 5252 -20.64068\n", "5 East Carroll Parish 9421 7759 -17.64144\n", "6 Plaquemines Parish 26757 23042 -13.88422" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "parishes %>% \n", "select(parish, population.00, population.10, perc.pop.diff) %>% # select the columns of interest\n", "arrange(perc.pop.diff) %>%\n", "head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of a query like this one could turn into a paragraph in a story. The Times-Picayune did just this with the following sentence:\n", "\n", "> Hard-hit St. Bernard saw the most dramatic population decline, losing 47 percent of its population compared with 2000. Plaquemines Parish's population also fell, though only by 14 percent.\n", "\n", "\n", "For now, we'll focus on just New Orleans, or the Orleans Parish. 
To filter the data set so that it includes only the census tracts in this parish, we'll use dplyr's `filter` function (dplyr is already loaded from the previous step).\n", "\n", "We'll filter the data and run the `str` command, which gives you the structure of the variable as defined by R:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "'data.frame':\t211 obs. of 12 variables:\n", " $ fips.code : num 2.21e+10 2.21e+10 2.21e+10 2.21e+10 2.21e+10 ...\n", " $ tract : Factor w/ 793 levels \"Census Tract 1\",..: 1 159 368 453 596 597 598 599 600 601 ...\n", " $ parish : Factor w/ 64 levels \"Acadia Parish\",..: 36 36 36 36 36 36 36 36 36 36 ...\n", " $ state : Factor w/ 1 level \"Louisiana\": 1 1 1 1 1 1 1 1 1 1 ...\n", " $ population.00 : int 2381 1347 1468 2564 2034 2957 2342 5131 2902 4400 ...\n", " $ total.housing.units.00 : int 1408 691 719 1034 704 1106 978 2100 992 1641 ...\n", " $ occupied.housing.units.00: int 1145 496 559 873 506 1011 671 1886 893 1593 ...\n", " $ vacant.housing.units.00 : int 263 195 160 161 198 95 307 214 99 48 ...\n", " $ population.10 : int 2455 1197 1231 2328 849 2534 1605 3925 2205 4346 ...\n", " $ total.housing.units.10 : int 1513 738 641 1137 328 1108 922 1795 994 1644 ...\n", " $ occupied.housing.units.10: int 1229 496 467 911 269 923 498 1456 804 1544 ...\n", " $ vacant.housing.units.10 : int 284 242 174 226 59 185 424 339 190 100 ...\n" ] } ], "source": [ "orleans <- census %>% filter(parish == 'Orleans Parish')\n", "orleans %>% str() # this is equivalent to str(orleans)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Scatterplots\n", "\n", "Plotting our data can reveal some interesting trends about the New Orleans population in 2000 and its population in 2010. 
R has a `plot` function where we can specify the x and y variables we want to plot:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [], "text/plain": [ "plot without title" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "plot(orleans$population.00, orleans$population.10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This doesn't tell us very much, other than the (intuitive) fact that census tracts with higher populations in 2000 also tended to have higher populations in 2010. \n", "\n", "We can better examine the relationship if we draw a 45-degree line with `abline(0, 1)`. If a census tract's population in 2010 was exactly the same as its population in 2000, it would fall on this line. If its 2010 population was lower than its 2000 population, it would fall below this line.\n", "\n", "This gives us a quick way to visually inspect the changes in population.\n", "\n", "While we're at it, we'll add some x and y limits, labels and a title." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [], "text/plain": [ "Plot with title “Census Tracts in Orleans Parish”" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "plot(\n", " orleans$population.00,\n", " orleans$population.10, \n", " xlim = c(0, 8000), \n", " ylim = c(0, 8000),\n", " xlab = '2000 population',\n", " ylab = '2010 population',\n", " main = 'Census Tracts in Orleans Parish'\n", ")\n", "\n", "abline(0, 1) # Draw a line with y-intercept of 0 and slope of 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most of the points fall below the line, indicating most census tracts saw population drops. 
This is consistent with our previous finding that New Orleans overall saw a 29% drop in population between 2000 and 2010.\n", "\n", "We can further quantify this by answering the question: **How many census tracts in New Orleans had fewer people in 2010 than in 2000?**" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "orleans$pop.diff <- orleans$population.10 - orleans$population.00" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After calculating the difference, we need to run a query that asks the programming equivalent of the question above: \"How many numbers in the pop.diff column are less than 0?\"" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "pop.drops.orleans <- sum(orleans$pop.diff < 0, na.rm = TRUE)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] \"In New Orleans, 135 tracts had fewer people in 2010 than in 2000.\"\n" ] } ], "source": [ "print(paste('In New Orleans,', pop.drops.orleans, 'tracts had fewer people in 2010 than in 2000.'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Changes in Louisiana housing occupancy\n", "\n", "We can calculate the occupancy rate (the percentage of housing units that are occupied) for both years and the percentage point change, and write all of these variables to new columns in our data frame:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "parishes$perc.occupied.00 <- parishes$occupied.housing.units.00 / parishes$total.housing.units.00 * 100\n", "parishes$perc.occupied.10 <- parishes$occupied.housing.units.10 / parishes$total.housing.units.10 * 100\n", "parishes$perc.occupied.diff <- parishes$perc.occupied.10 - parishes$perc.occupied.00" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "We can now answer the question, **Which parishes saw the most dramatic occupancy changes?** \n", "\n", "All we have to do is arrange the data by `perc.occupied.diff`, our new variable for the percentage point change in occupancy rates.\n", "\n", "Note that this time, we'll arrange the data using `arrange(desc(perc.occupied.diff))` to first see the most dramatic increases in occupancy rates - parishes where occupancy rates went up the most. `arrange`'s default is to arrange in ascending order." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "parish.occupancy.rates <- parishes %>% \n", "select(parish, perc.occupied.00, perc.occupied.10, perc.occupied.diff) %>% \n", "arrange(desc(perc.occupied.diff)) # arrange(desc()) arranges in descending order" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `head` to show the first few rows of the data. We specify n = 3 to show the first 3 rows." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
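<table><thead><tr><th></th><th>parish</th><th>perc.occupied.00</th><th>perc.occupied.10</th><th>perc.occupied.diff</th></tr></thead><tbody>
<tr><td>1</td><td>St. Helena Parish</td><td>76.93683</td><td>84.13592</td><td>7.199093</td></tr>
<tr><td>2</td><td>Cameron Parish</td><td>67.31634</td><td>71.66713</td><td>4.350789</td></tr>
<tr><td>3</td><td>Beauregard Parish</td><td>83.47011</td><td>87.49335</td><td>4.023246</td></tr></tbody></table>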
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " & parish & perc.occupied.00 & perc.occupied.10 & perc.occupied.diff\\\\\n", "\\hline\n", "\t1 & St. Helena Parish & 76.93683 & 84.13592 & 7.199093\\\\\n", "\t2 & Cameron Parish & 67.31634 & 71.66713 & 4.350789\\\\\n", "\t3 & Beauregard Parish & 83.47011 & 87.49335 & 4.023246\\\\\n", "\\end{tabular}\n" ], "text/plain": [ "Source: local data frame [3 x 4]\n", "\n", " parish perc.occupied.00 perc.occupied.10 perc.occupied.diff\n", " (fctr) (dbl) (dbl) (dbl)\n", "1 St. Helena Parish 76.93683 84.13592 7.199093\n", "2 Cameron Parish 67.31634 71.66713 4.350789\n", "3 Beauregard Parish 83.47011 87.49335 4.023246" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "head(parish.occupancy.rates, n = 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It might be more interesting for us to see parishes where the occupancy rates went *down* the most. \n", "\n", "We can see this using the `tail` command, which shows us the last few rows of the data. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
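<table><thead><tr><th></th><th>parish</th><th>perc.occupied.00</th><th>perc.occupied.10</th><th>perc.occupied.diff</th></tr></thead><tbody>
<tr><td>1</td><td>Tensas Parish</td><td>71.92617</td><td>64.70063</td><td>-7.225543</td></tr>
<tr><td>2</td><td>Orleans Parish</td><td>87.52156</td><td>74.86098</td><td>-12.66058</td></tr>
<tr><td>3</td><td>St. Bernard Parish</td><td>93.77753</td><td>78.72454</td><td>-15.05298</td></tr></tbody></table>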
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " & parish & perc.occupied.00 & perc.occupied.10 & perc.occupied.diff\\\\\n", "\\hline\n", "\t1 & Tensas Parish & 71.92617 & 64.70063 & -7.225543\\\\\n", "\t2 & Orleans Parish & 87.52156 & 74.86098 & -12.66058\\\\\n", "\t3 & St. Bernard Parish & 93.77753 & 78.72454 & -15.05298\\\\\n", "\\end{tabular}\n" ], "text/plain": [ "Source: local data frame [3 x 4]\n", "\n", " parish perc.occupied.00 perc.occupied.10 perc.occupied.diff\n", " (fctr) (dbl) (dbl) (dbl)\n", "1 Tensas Parish 71.92617 64.70063 -7.225543\n", "2 Orleans Parish 87.52156 74.86098 -12.660584\n", "3 St. Bernard Parish 93.77753 78.72454 -15.052984" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tail(parish.occupancy.rates, n = 3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Histograms\n", "\n", "Maybe we want to know more about the overall trends in parishes' occupancy rates. For example, did most Louisiana parishes see decreases or increases in occupancy? \n", "\n", "We can explore this question using a histogram of perc.occupied.diff. A histogram is a good plot for understanding the distribution of a variable. On the x-axis, it plots the range of values of perc.occupied.diff (the change in occupancy over the two years) and on the y-axis, it plots how frequently these values occurred in our data." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [], "text/plain": [ "Plot with title “Histogram of parish.occupancy.rates$perc.occupied.diff”" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "hist(parish.occupancy.rates$perc.occupied.diff)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we did above, let's improve on this default plot that R spit out.\n", "\n", "The histogram grouped `perc.occupied.diff` by bins, which are equal-sized intervals of the variable's values. Here we have six bins, but we would understand the distribution in greater detail if we had more bins. To do this, we need to add an argument called `breaks`. \n", "\n", "Let's also draw a vertical line at 0 to quickly see how many parishes saw drops in occupancy rates. We'll also add some labels." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [], "text/plain": [ "Plot with title “Difference in occupancy rates, by parish”" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "hist(\n", " parish.occupancy.rates$perc.occupied.diff, \n", " breaks = 20, \n", " main = 'Difference in occupancy rates, by parish',\n", " xlab = 'Percentage point change in occupancy between 2000 and 2010', \n", " ylab = 'Number of parishes'\n", ")\n", "abline(v = 0, lwd = 5) # draw a vertical line at x = 0 with a line width of 5 (relative to the default of 1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see that most of the parishes' occupancy rates went down.\n", "\n", "### Relationships and trend lines\n", "\n", "Let's examine the relationship between two of the variables in our data frame. We can run the `names` command to find the names of our columns. " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<ol class=list-inline>\n",
"\t<li>'parish'</li>\n",
"\t<li>'population.00'</li>\n",
"\t<li>'total.housing.units.00'</li>\n",
"\t<li>'occupied.housing.units.00'</li>\n",
"\t<li>'vacant.housing.units.00'</li>\n",
"\t<li>'population.10'</li>\n",
"\t<li>'total.housing.units.10'</li>\n",
"\t<li>'occupied.housing.units.10'</li>\n",
"\t<li>'vacant.housing.units.10'</li>\n",
"\t<li>'perc.pop.diff'</li>\n",
"\t<li>'perc.occupied.00'</li>\n",
"\t<li>'perc.occupied.10'</li>\n",
"\t<li>'perc.occupied.diff'</li>\n",
"</ol>
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 'parish'\n", "\\item 'population.00'\n", "\\item 'total.housing.units.00'\n", "\\item 'occupied.housing.units.00'\n", "\\item 'vacant.housing.units.00'\n", "\\item 'population.10'\n", "\\item 'total.housing.units.10'\n", "\\item 'occupied.housing.units.10'\n", "\\item 'vacant.housing.units.10'\n", "\\item 'perc.pop.diff'\n", "\\item 'perc.occupied.00'\n", "\\item 'perc.occupied.10'\n", "\\item 'perc.occupied.diff'\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 'parish'\n", "2. 'population.00'\n", "3. 'total.housing.units.00'\n", "4. 'occupied.housing.units.00'\n", "5. 'vacant.housing.units.00'\n", "6. 'population.10'\n", "7. 'total.housing.units.10'\n", "8. 'occupied.housing.units.10'\n", "9. 'vacant.housing.units.10'\n", "10. 'perc.pop.diff'\n", "11. 'perc.occupied.00'\n", "12. 'perc.occupied.10'\n", "13. 'perc.occupied.diff'\n", "\n", "\n" ], "text/plain": [ " [1] \"parish\" \"population.00\" \n", " [3] \"total.housing.units.00\" \"occupied.housing.units.00\"\n", " [5] \"vacant.housing.units.00\" \"population.10\" \n", " [7] \"total.housing.units.10\" \"occupied.housing.units.10\"\n", " [9] \"vacant.housing.units.10\" \"perc.pop.diff\" \n", "[11] \"perc.occupied.00\" \"perc.occupied.10\" \n", "[13] \"perc.occupied.diff\" " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names(parishes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll use `population.10` (the 2010 population) and `perc.occupied.10` (the 2010 occupancy rate). \n", "\n", "We can draw the straight line that best fits the relationship between these two variables. To do this, we again use `abline`, but this time we hand it the result of `lm`, a function whose name stands for \"linear model.\" We have to tell `lm` that `parishes$perc.occupied.10` is our y-variable and `parishes$population.10` is our x-variable. Note the format: `lm` takes a formula written as `y ~ x`, whereas `plot` takes `x, y`." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [ "\n" ], "text/plain": [ "plot without title" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "plot(parishes$population.10, parishes$perc.occupied.10)\n", "abline(lm(parishes$perc.occupied.10 ~ parishes$population.10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see from the plot that a straight line does not fit the data well. For one, the straight line would eventually cross 100 percent, which is an impossible occupancy rate. A better fit for this relationship would be a curved line.\n", "\n", "To fit a curve, we take the logarithm of the 2010 population and use that as the x-variable in our model. A logarithmic transformation is a statistical concept that you can read more about [here](https://infoactive.co/data-design/ch11.html)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "curved.model <- lm(parishes$perc.occupied.10 ~ log(parishes$population.10))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the curve is slightly more complicated. For each x-value, we want to plot the predicted y-value. The `lines` function connects points in the order it receives them, so we first have to put the points of the curve in order from the lowest population to the highest."
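, "\n", "\n", "Why does the order matter? If the x-values arrive unsorted, `lines` zigzags back and forth across the plot. A minimal sketch with made-up numbers (`toy.x` and `toy.y` are hypothetical) shows how `order` returns the positions that sort one vector, which we can then use to rearrange both vectors together so each y-value stays paired with its x-value.\n", "\n",
"```r\n",
"toy.x <- c(30, 10, 20)    # made-up x-values, deliberately out of order\n",
"toy.y <- c(3.4, 2.3, 3.0) # made-up y-values paired with toy.x\n",
"idx <- order(toy.x)       # positions that sort toy.x from low to high\n",
"toy.x[idx]                # x-values in increasing order\n",
"toy.y[idx]                # y-values rearranged to stay paired with their x-values\n",
"```\n", "\n",
"Sorting both vectors separately with `sort` gives the same picture only when the curve always rises from left to right; indexing both by `order` works for any shape."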
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [ "\n" ], "text/plain": [ "plot without title" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "predicted.values <- predict(curved.model)\n", "pop.order <- order(parishes$population.10) # row positions from lowest to highest population\n", "\n", "plot(parishes$population.10, parishes$perc.occupied.10)\n", "# index both vectors by the same order so each prediction stays paired with its parish\n", "lines(parishes$population.10[pop.order], predicted.values[pop.order])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The curve appears to fit the data better than the line. For example, the straight line did a poor job of estimating occupancy rates for parishes with very low populations.\n", "\n", "What's most interesting about this plot is the presence of a large outlier. Let's finalize a better version of this plot by adding text labels for each parish so we can identify which parish has a high population but a very low occupancy rate."
] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "image/svg+xml": [ "\n" ], "text/plain": [ "Plot with title “Louisiana parishes in 2010”" ] }, "metadata": { "image/svg+xml": { "isolated": true } }, "output_type": "display_data" } ], "source": [ "options(scipen=5) # to prevent axes from appearing in scientific notation\n", "plot(\n", "    parishes$population.10, parishes$perc.occupied.10, \n", "    type = \"n\", # type = \"n\" tells R not to plot the points\n", "    xlab = \"Population\",\n", "    ylab = \"Occupancy rate (%)\",\n", "    main = \"Louisiana parishes in 2010\"\n", ")\n", "text(\n", "    parishes$population.10, \n", "    parishes$perc.occupied.10, \n", "    labels = parishes$parish, cex = 0.6\n", ") # adding text where the points should be\n", "pop.order <- order(parishes$population.10) # row positions from lowest to highest population\n", "lines(parishes$population.10[pop.order], predicted.values[pop.order], col = \"red\", lwd = 2)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "This concludes our workshop, More with R. We hope you found it useful!\n", "\n", "\n", "Any questions?\n", "\n", "* christine.zhang@latimes.com or [@christinezhang](https://twitter.com/christinezhang) on Twitter\n", "* ryan.menezes@latimes.com or [@ryanvmenezes](https://twitter.com/ryanvmenezes) on Twitter" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.2.2" } }, "nbformat": 4, "nbformat_minor": 0 }