{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# ENGLISH R1A: Chinatown and the Culture of Exclusion\n",
    "**Instructor: Amy Lee**\n",
    "\n",
    "**Developers: Michaela Palmer, Maya Shen, Cynthia Leu, Chris Cheung**\n",
    "\n",
    "**FPF 2017**\n",
    "\n",
    "Welcome to lab! Please read this lab in its entirety, as the analysis will make a lot more sense with the background context provided.\n",
    "This lab is intended to be a hands-on introduction to data science as it can be applied to Chinatown demographics and analyzing primary texts.\n",
    "\n",
    "We will be reading and analyzing representations of Chinatown in the form of data and maps. In addition, we will learn how data tools can be used to read and analyze large volumes of text.\n",
    "\n",
    "## What this lab will cover\n",
    "* Running Jupyter Notebooks\n",
    "* Data Analysis of Chinatowns' demographics\n",
    "* Visualization & Interpretation\n",
    "* Using Data Tools to Analyze Primary Texts\n",
    "\n",
    "## What you need to do\n",
    "* Read the content, complete the questions\n",
    "* Analyze the data\n",
    "* Submit the assignment\n",
    "\n",
    "\n",
    "# 1. Running Jupyter Notebooks\n",
    "\n",
    "You are currently working in a Jupyter Notebook. A Notebook allows text and code to be combined into one document. Each rectangular section of a notebook is called a \"cell.\" There are two types of cells in this notebook: text cells and code cells. \n",
    "\n",
    "Jupyter allows you to run simulations and regressions in real time. To do this, select a code cell, and click the \"run cell\" button at the top that looks like ▶| to confirm any changes. Alternatively, you can hold down the `shift` key and then press `return` or `enter`.\n",
    "\n",
    "In the following simulations, anytime you see `In [ ]` you should click the \"run cell\" button to see output. **If you get an error message after running a cell, go back to the beginning of the lab and make sure that every previous code cell has been run.**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Part 0: Introduction to Python and Jupyter Notebooks: <a id='jupyter'></a>\n",
    "\n",
    "## 1. Cells, Arithmetic, and Code\n",
    "In a notebook, each rectangle containing text or code is called a *cell*.\n",
    "\n",
    "Cells (like this one) can be edited by double-clicking on them. This cell is a text cell, written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings.  You don't need to worry about Markdown today, but it's a pretty fun+easy tool to learn.\n",
    "\n",
    "After you edit a cell, click the \"run cell\" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions.) You can also press `SHIFT-ENTER` to run any cell or progress from one cell to the next.\n",
    "\n",
    "Other cells contain code in the Python programming language.  Running a code cell will execute all of the code it contains.\n",
    "\n",
    "Try running this cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Hello, World!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will now quickly go through some very basic functionality of Python, which we'll be using throughout the rest of this notebook.\n",
    "\n",
    "### 1.1 Arithmetic\n",
    "Quantitative information arises everywhere in data science. In addition to representing commands to `print` out lines, expressions can represent numbers and methods of combining numbers. \n",
    "\n",
    "The expression `3.2500` evaluates to the number 3.25. (Run the cell and see.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "3.2500"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We don't necessarily always need to say \"`print`\", because Jupyter always prints the last line in a code cell. If you want to print more than one line, though, do specify \"`print`\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(3)\n",
    "4\n",
    "5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many basic arithmetic operations are built in to Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html). Use parentheses to specify the order of operations, which act according to PEMDAS, just as you may have learned in school. Use parentheses for a happy new year!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "2 + (6 * 5 - (6 * 3)) ** 2 * (( 2 ** 3 ) / 4 * 7)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 Variables\n",
    "\n",
    "We sometimes want to work with the result of some computation more than once. To be able to do that without repeating code everywhere we want to use it, we can store it in a variable with *assignment statements*, which have the variable name on the left, an equals sign, and the expression to be evaluated and stored on the right. In the cell below, `(3 * 11 + 5) / 2 - 9` evaluates to 10, and gets stored in the variable `result`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = (3 * 11 + 5) / 2 - 9"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Functions\n",
    "\n",
    "One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are:\n",
    "\n",
    "| Function | Description                                                   |\n",
    "|----------|---------------------------------------------------------------|\n",
    "| `abs`      | Returns the absolute value of its argument                    |\n",
    "| `max`      | Returns the maximum of all its arguments                      |\n",
    "| `min`      | Returns the minimum of all its arguments                      |\n",
    "| `round`    | Round its argument to the nearest integer                     |\n",
    "\n",
    "Here are two call expressions that both evaluate to 3\n",
    "\n",
    "```python\n",
    "abs(2 - 5)\n",
    "max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))\n",
    "```\n",
    "\n",
    "These function calls first evaluate the expressions in the arguments (inside the parentheses), then evaluate the function on the results. `abs(2-5)` evaluates first to `abs(3)`, then returns `3`.\n",
    "\n",
    "A **statement** is a whole line of code.  Some statements are just expressions, like the examples above, that can be broken down into its subexpressions which get evaluated individually before evaluating the statement as a whole.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 Calling functions\n",
    "\n",
    "The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.\n",
    "\n",
    "For example, the `abs` function takes a single number as its argument and returns the absolute value of that number.  The absolute value of a number is its distance from 0 on the number line, so `abs(5)` is 5 and `abs(-5)` is also 5."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs(-5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Functions can be called as above, putting the argument in parentheses at the end, or by using \"dot notation\", and calling the function after finding the arguments, as in the cell immediately below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datascience import make_array\n",
    "nums = make_array(1, 2, 3)  # makes a list of items, in this case, numbers"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nums.mean()  # finds the average of the array"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1: Exploring Demographic Data: <a id='jupyter'></a>\n",
    "\n",
    "## 1.1 Importing Modules\n",
    "\n",
    "First, we need to import libraries so that we are able to call the functions from within. We are going to use these functions to manipulate data tables and conduct a statistical analysis. Run the code cell below to import these modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%capture\n",
    "!python -m spacy download en\n",
    "!pip install --no-cache-dir wordcloud\n",
    "!pip3 install --no-cache-dir -U folium\n",
    "!pip install --no-cache-dir textblob\n",
    "from datascience import *\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from ipywidgets import *\n",
    "%matplotlib inline\n",
    "import folium\n",
    "import pandas as pd\n",
    "from IPython.display import HTML, display, IFrame\n",
    "import folium\n",
    "import spacy\n",
    "from wordcloud import WordCloud\n",
    "from textblob import TextBlob\n",
    "import geojson"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Official map of Chinatown in San Francisco - 1855\n",
    "#### Prepared under the supervision of the special committee of the Board of Supervisors. July 1885.\n",
    "\n",
    "![image](data/PJM_1093_01.jpg)\n",
    "\n",
    "\n",
    "This map reflects the pervasive bias against the Chinese in California and in turn further fostered the hysteria. It was published as part of an official report of a Special Committee established by the San Francisco Board of Supervisors \"on the Condition of the Chinese Quarter.\" The Report resulted from a dramatic increase in hostility to the Chinese, particularly because many Chinese laborers had been driven out of other Western states by vigilantes and sought safety in San Francisco (Shah 2001, 37).\n",
    "<p>The substance and tone of the Report is best illustrated by a few excerpts: \"The general aspect of the streets and habitations was filthy in the extreme, . . . a slumbering pest, likely at any time to generate and spread disease, . . . a constant source of danger . . . , the filthiest spot inhabited by men, women and children on the American continent.\" (Report 4-5). \"The Chinese brought here with them and have successfully maintained and perpetuated the grossest habits of bestiality practiced by the human race.\" (Ibid. 38).\n",
    "<p>The map highlights the Committee's points, particularly the pervasiveness of gambling, prostitution and opium use. It shows the occupancy of the street floor of every building in Chinatown, color coded to show: General Chinese Occupancy|Chinese Gambling Houses|Chinese Prostitution|Chinese Opium Resorts|Chinese Joss Houses|and White Prostitution.\n",
    "The Report concludes with a recommendation that the Chinese be driven out of the City by stern enforcement of the law: \"compulsory obedience to our laws [is] necessarily obnoxious and revolting to the Chinese|and the more rigidly this enforcement is insisted upon and carried out the less endurable will existence be to them here, the less attractive will life be to them in California. Fewer will come and fewer will remain. . . . Scatter them by such a policy as this to other States . . . .\" (Ibid. 67-68)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Analyzing Demographics\n",
    "In this section, we will examine some of the factors that influence population growth and how they are changing the landscape of Chinatowns across the U.S."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 Reading Data, 2010-2015\n",
    "\n",
    "Now it's time to work with tables and explore some real data. A `Table` is just like how we made a list above with `make_array`, but for all the rows in a table.\n",
    "\n",
    "We're going to first look at the most recent demographic data from 2010-2015:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data = Table.read_table('data/2010-2015.csv')  # read in data from file\n",
    "historical_data['FIPS'] = ['0' + str(x) for x in historical_data['FIPS']]  # fix FIPS columns\n",
    "historical_data.show(10)  # show first ten rows"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can get some quick summary statistics by calling the `.stats()` function on our `Table` variable:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.stats()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So which census tract has the highest Asian population?\n",
    "\n",
    "First we can find the highest population by using the `max` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max(historical_data['Asian'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plug that into a table that uses the `where` and `are.equal_to` functions:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.where('Asian', are.equal_to(max(historical_data['Asian'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This FIPS code 06075035300 is tract [353](https://censusreporter.org/profiles/14000US06075035300-census-tract-353-san-francisco-ca/). Does this make sense to you?\n",
    "\n",
    "---\n",
    "\n",
    "It might be better to look at which census tracts has Asian as the highest proportion of the population:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data['Asian_percentage'] = historical_data['Asian'] / historical_data['Population']\n",
    "historical_data.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can use the same method to get the `max` and subset our table:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "max(historical_data['Asian_percentage'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.where('Asian_percentage', are.equal_to(max(historical_data['Asian_percentage'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "FIPS code 06075011800 is census tract [118](https://censusreporter.org/profiles/14000US06075011800-census-tract-118-san-francisco-ca/). Does this make sense?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Question: Write one sentence describing the Asian population in Chinatown.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "### Tables Essentials!\n",
    "\n",
    "For your reference, here's a table of useful `Table` functions:\n",
    "\n",
    "|Name|Example|Purpose|\n",
    "|-|-|-|\n",
    "|`Table`|`Table()`|Create an empty table, usually to extend with data|\n",
    "|`Table.read_table`|`Table.read_table(\"my_data.csv\")`|Create a table from a data file|\n",
    "|`with_columns`|`tbl = Table().with_columns(\"N\", np.arange(5), \"2*N\", np.arange(0, 10, 2))`|Create a copy of a table with more columns|\n",
    "|`column`|`tbl.column(\"N\")`|Create an array containing the elements of a column|\n",
    "|`sort`|`tbl.sort(\"N\")`|Create a copy of a table sorted by the values in a column|\n",
    "|`where`|`tbl.where(\"N\", are.above(2))`|Create a copy of a table with only the rows that match some *predicate*|\n",
    "|`num_rows`|`tbl.num_rows`|Compute the number of rows in a table|\n",
    "|`num_columns`|`tbl.num_columns`|Compute the number of columns in a table|\n",
    "|`select`|`tbl.select(\"N\")`|Create a copy of a table with only some of the columns|\n",
    "|`drop`|`tbl.drop(\"2*N\")`|Create a copy of a table without some of the columns|\n",
    "|`take`|`tbl.take(np.arange(0, 6, 2))`|Create a copy of the table with only the rows whose indices are in the given array|\n",
    "|`join`|`tbl1.join(\"shared_column_name\", tbl2)`|Join together two tables with a common column name\n",
    "|`are.equal_to()`|`tbl.where(\"SEX\", are.equal_to(0))`|find values equal to that indicated|\n",
    "|`are.not_equal_to()`|`tbl.where(\"SEX\", are.not_equal_to(0))` | find values not including the one indicated|\n",
    "|`are.above()`| `tbl.where(\"AGE\", are.above(30))` | find values greater to that indicated|\n",
    "|`are.below()`| `tbl.where(\"AGE\", are.below(40))` | find values less than that indicated |\n",
    "|`are.between()`| `tbl.where(\"SEX\", are.between(18, 60))` | find values between the two indicated |\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2.2 The correlation coefficient - *r*\n",
    "\n",
    "If we were interested in the relationship between two variables in our dataset, we'd want to look at correlation.\n",
    "\n",
    "\n",
    "> The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables. ~Wikipedia\n",
    "\n",
    "*r* = 1: the scatter diagram is a perfect straight line sloping upwards\n",
    "\n",
    "*r* = -1: the scatter diagram is a perfect straight line sloping downwards.\n",
    "\n",
    "Let's calculate the correlation coefficient between each of the continuous variables in our dataset.. We can use the `.to_df().corr()` function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.to_df().corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We often visualize correlations with a `scatter` plot:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.scatter('Population', 'Asian')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.scatter('One_race', 'Asian')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.scatter('Two_or_more_races', 'Asian')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To look at a 1-1 relationship over time we might prefer a simple line graph. We can first group the data by `Year`, then take the `mean` for the `Population`, and `plot` that against `Year`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.to_df().groupby('Year')['Population'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.to_df().groupby('Year')['Population'].mean().plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_data.to_df().groupby('Year')['Asian_percentage'].mean().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3 2015\n",
    "\n",
    "Let's look at only the year 2015:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical_2015 = historical_data.where('Year', are.equal_to(2015))\n",
    "historical_2015.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can make a [choropleth map](https://en.wikipedia.org/wiki/Choropleth_map) with a little function, don't worry about the code below!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def choro_column(tab, column):\n",
    "    sf_2010 = geojson.load(open(\"data/2010-sf.geojson\"))\n",
    "    threshold_scale = np.linspace(min(tab[column]), max(tab[column]), 6, dtype=float).tolist()\n",
    "\n",
    "    mapa = folium.Map(location=(37.7793784, -122.4063879), zoom_start=11)\n",
    "    mapa.choropleth(geo_data=sf_2010,\n",
    "                    data=tab.to_df(),\n",
    "                    columns=['FIPS', column],\n",
    "                    fill_color='YlOrRd',\n",
    "                    key_on='feature.properties.GEOID10',\n",
    "                    threshold_scale=threshold_scale)\n",
    "    \n",
    "    mapa.save(\"map.html\")\n",
    "    return mapa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here's a choropleth of all the population:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "choro_column(historical_2015, 'Population')\n",
    "IFrame('map.html', width=700, height=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at only Asian:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "choro_column(historical_2015, 'Asian')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try making one more choropleth below with only `Asian_percentage`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Question: Where is the largest concentration of Asian residents?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Challenge\n",
    "<div class=\"alert alert-info\">\n",
    "Create a choropleth for 2010 with the same `Asian_percentage` column. Do you see any differences from 2010 to 2015?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "## 2.4 1940-2010\n",
    "\n",
    "Now let's take a look at the historical data showing how the Asian population has changed over time, as compared to the black population.\n",
    "\n",
    "First, let's load in all our of decennial San Francisco Chinatown census data acquired from an online domain called Social Explorer. Let's first examine this dataset to get a sense of what's in it.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Question: Can you explain how you would derive the Asian population from the given census data?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical = Table.read_table('data/process.csv')\n",
    "historical.show(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical['Other'] = historical['Total Population'] - historical['White'] - historical['Black']\n",
    "historical.show(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use the mean function to find the average total population in Chinatown.  Do you notice any significant changes between 1940 and 2010?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['Total Population'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot the results on a graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['Total Population'].mean().plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['White'].mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can plot the average population of different racial groups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['White'].mean().plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['Black'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['Black'].mean().plot()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['Other'].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "historical.to_df().groupby('Year')['Other'].mean().plot()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Question: Describe the population trends you observed from the above graphs. How would you compare the changes in Asian vs Black vs White populations?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.5 Manhattan\n",
    "One of the goals of this module is to compare different Chinatowns from across the US. We will now compare the SF Chinatown data to the census data from Manhattan's Chinatown.  Let's load the Manhattan data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "manhattan = Table.read_table('data/manhattan_cleaned.csv')\n",
    "manhattan.show(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "manhattan.to_df().corr()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "manhattan.scatter('Chinese Population', 'White Population', color=['b','r'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "manhattan_2010 = manhattan.where('Year', are.equal_to(2010))\n",
    "manhattan_2010.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def choro_column(tab, column):\n",
    "    tab = tab.to_df()\n",
    "    tab['Census Tract'] = tab['Census Tract'].astype(str).str.strip('0').str.strip('.')\n",
    "    nyc_2010 = geojson.load(open(\"data/nyc-census-2010.geojson\"))\n",
    "    tracts = folium.features.GeoJson(nyc_2010)\n",
    "    threshold_scale = np.linspace(min(tab[column]), max(tab[column]), 6, dtype=float).tolist()\n",
    "\n",
    "    mapa = folium.Map(location=(40.7128, -74.00609), zoom_start=11)\n",
    "    mapa.choropleth(geo_data=nyc_2010,\n",
    "                    data=tab,\n",
    "                    columns=['Census Tract', column],\n",
    "                    fill_color='YlOrRd',\n",
    "                    key_on='feature.properties.CTLabel',\n",
    "                    threshold_scale=threshold_scale)\n",
    "    mapa.save(\"map.html\")\n",
    "    return mapa"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "choro_column(manhattan_2010, 'Chinese Population')\n",
    "IFrame('map.html', width=700, height=400)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "manhattan_2010['Asian_percentage'] = manhattan_2010['Asian/Other Population'] / manhattan_2010['Total Population']\n",
    "manhattan_2010.show(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "choro_column(manhattan_2010, 'Asian_percentage')\n",
    "IFrame('map.html', width=700, height=400)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Reading Primary Texts\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this class, we have been learning how to 'close-read' primary texts. Close-reading generally involves picking select passages and reading for the latent meanings embedded in word choice, syntax, the use of metaphors and symbols, etc. Here, we are introducing another way of analyzing primary texts using computational methods. Computational text analysis generally involves 'counting' words. Let's see how this works by analyzing some of the poems written by Chinese immigrants on Angel Island.\n",
    "\n",
    "<p>Run the following cell to import the poems from a .txt file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('data/islandpoetry1_22.txt', \"r\") as f:\n",
    "    raw = f.read()\n",
    "print(raw)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're interested in which words appear the most often in our set of poems. It's pretty hard to read or see much in this form. We'll coming back to the topic of what words are the most common with actual numbers a bit later but for now, run the following cell to generate two interesting visualizations of the most common words (minus those such as \"the\", \"a\", etc.). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wordcloud = WordCloud().generate(raw)\n",
    "\n",
    "plt.imshow(wordcloud, interpolation='bilinear')\n",
    "plt.axis(\"off\")\n",
    "\n",
    "# lower max_font_size\n",
    "wordcloud = WordCloud(max_font_size=40).generate(raw)\n",
    "plt.figure()\n",
    "plt.imshow(wordcloud, interpolation=\"bilinear\")\n",
    "plt.axis(\"off\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Question: What are the most common words you notice? Judging from these words, what do you think these poems are about?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Oops, it seems we've forgotten just how many poems we have in our set. Luckily we have a quick way of finding out! Each \"\\n\" in the display of the poem text indicates a line break. It turns out that each poem is separated by an empty line, a.k.a. two line breaks or \"\\n\"'s."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_poems = len(raw.split(\"\\n\\n\"))\n",
    "num_poems"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "We can also use this idea to calculate the number of characters in each poem."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "num_char_per_poem = [len(p) for p in raw.split(\"\\n\\n\")]\n",
    "print(num_char_per_poem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is interesting but seems like just a long list of numbers. What about the average number of characters per poem?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.mean(num_char_per_poem)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at it in histogram form to get a better idea of our data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Table().with_columns(\"Character Count\", np.asarray(num_char_per_poem)).hist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each bar of this histogram tells us what proportion of the poems (the height of the bar) have that many characters (the position of the bar on the x-axis).\n",
    "\n",
    "We can also use \"\\n\" to look at [enjambment](https://en.wikipedia.org/wiki/Enjambment) too. Let's calculate the proportion of lines that are enjambed out of the total number of lines per poem. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from string import punctuation\n",
    "\n",
    "poems = raw.split(\"\\n\\n\")\n",
    "\n",
    "all_poems_enjambment = []\n",
    "for p in poems:\n",
    "    lines = p.split(\"\\n\")\n",
    "    poems = raw.split(\"\\n\\n\")\n",
    "    enjambment = 0\n",
    "    for l in lines:\n",
    "        try:\n",
    "            if l[-1] in punctuation:\n",
    "                pass\n",
    "            else:\n",
    "                enjambment += 1\n",
    "        except:\n",
    "            pass\n",
    "    enj = enjambment/len(lines)\n",
    "    all_poems_enjambment.append(enj)\n",
    "\n",
    "print(all_poems_enjambment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Once again, what about the average?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.mean(all_poems_enjambment)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now return to the question of the words that appear the most frequently in these 49 poems. First we have to use spaCy, an open-source software library for Natural Language Processing (NLP), to parse through the text and replace all the \"\\n\"'s with spaces."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "nlp = spacy.load('en', parser=False)\n",
    "parsed_text = nlp(raw.replace(\"\\n\", \" \"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can separate all the words/symbols and put them in a table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "toks_tab = Table()\n",
    "toks_tab.append_column(label=\"Word\", values=[word.text for word in parsed_text])\n",
    "toks_tab.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "toks_tab.append_column(label=\"POS\", values=[word.pos_ for word in parsed_text])\n",
    "toks_tab.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's create a new table with even more columns using the \"tablefy\" function below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def tablefy(parsed_text):\n",
    "    toks_tab = Table()\n",
    "    toks_tab.append_column(label=\"Word\", values=[word.text for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"POS\", values=[word.pos_ for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"Lemma\", values=[word.lemma_ for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"Stop Word\", values=[word.is_stop for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"Punctuation\", values=[word.is_punct for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"Space\", values=[word.is_space for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"Number\", values=[word.like_num for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"OOV\", values=[word.is_oov for word in parsed_text])\n",
    "    toks_tab.append_column(label=\"Dependency\", values=[word.dep_ for word in parsed_text])\n",
    "    return toks_tab"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tablefy(parsed_text).show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, let's look at the frequency of words. However, we want to get rid of words such as \"the\" and \"and\" (stop words), punctuation, and spaces. We can do this by selecting rows that are not stop words, punctuation, or spaces and then sorting by word!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "word_counts = tablefy(parsed_text).where(\"Stop Word\", are.equal_to(False)).where(\n",
    "    \"Punctuation\", are.equal_to(False)).where(\n",
    "    \"Space\", are.equal_to(False)).group(\"Word\").sort(\"count\",descending=True)\n",
    "word_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this table, we have both the words \"sad\" and \"sadness\" - it seems strange to separate them. It turns out that these words are part of the same \"lexeme\", or a unit of meaning. For example, \"run\", \"runs\", \"ran\", and \"running\" are all part of the same lexeme with the lemma 'run'. Lemmas are another column in our table from above! Nice!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemma_counts = tablefy(parsed_text).where(\"Stop Word\", are.equal_to(False)).where(\n",
    "    \"Punctuation\", are.equal_to(False)).where(\n",
    "    \"Space\", are.equal_to(False)).group(\"Lemma\").sort(\"count\",descending=True)\n",
    "lemma_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's look at how many words there are of each part of speech."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pos_counts = tablefy(parsed_text).where(\"Stop Word\", are.equal_to(False)).where(\n",
    "    \"Punctuation\", are.equal_to(False)).where(\n",
    "    \"Space\", are.equal_to(False)).group(\"POS\").sort(\"count\",descending=True)\n",
    "pos_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also look at the proportions of each POS out of all the words!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for i in np.arange(pos_counts.num_rows):\n",
    "    pos = pos_counts.column(\"POS\").item(i)\n",
    "    count = pos_counts.column(\"count\").item(i)\n",
    "    total = np.sum(pos_counts.column(\"count\"))\n",
    "    proportion = str(count / total)\n",
    "    print(pos + \" proportion: \" + proportion)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If we're interested in words' relations with each other, we can look at words that are next to each other. The function below returns the word following the first instance of the word you search for in the specified source."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def nextword(word, source):\n",
    "    for i, w in enumerate(source):\n",
    "        if w == word:\n",
    "            return source[i+1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Mess around a bit with this function! Change the \"word\" argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "split_txt = raw.split()\n",
    "\n",
    "# Change the target or \"home\" to other words!\n",
    "nextword(\"home\", split_txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We are specifically interested in the word \"I\" and the words that poets use in succession. Let's make an array of all the words that come after it in these poems. For easier viewing, the phrases have been printed out. What do you notice?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "one_after_i = make_array()\n",
    "for i, w in enumerate(split_txt):\n",
    "    if w == \"I\":\n",
    "        one_after_i = np.append(one_after_i, split_txt[i+1])\n",
    "for i in one_after_i:\n",
    "    print(\"I \" + i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Above we have only shown the next word, what about the next two words? Does this give you any new insight?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "two_after_i = make_array()\n",
    "for i, w in enumerate(split_txt):\n",
    "    if w == \"I\":\n",
    "        two_after_i = np.append(two_after_i, split_txt[i+1] + \" \" + split_txt[i+2])\n",
    "for i in two_after_i:\n",
    "    print(\"I \" + i)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Try doing some exploring of your own! If you're feeling stuck, feel free to copy and edit code from above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Write your own code here!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Sentiment Analysis\n",
    "\n",
    "We can do some analysis of the overall sentiments, or emotions conveyed, in each of the poems using the code below. Here, we analyze the overall sentiment of each poem individually. Once you run the next cell, you'll see the sentiment values for each poem. A value below 0 denotes a negative sentiment, and a value above 0 is positive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sentiments = make_array()\n",
    "for p in poems:\n",
    "    poem = TextBlob(p)\n",
    "    sentiments = np.append(sentiments, poem.sentiment.polarity)\n",
    "sentiments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, what does this mean? It appears that the number of poems with negative sentiment is about the same as the number of poems with positive or neutral (0) sentiment. We can look at the proportion of negative poems in the next cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "neg_proportion = np.count_nonzero(sentiments < 0)/len(sentiments)\n",
    "neg_proportion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, so just under half of the poems have negative sentiment. So, on average the poems have slightly positive sentiment, right?\n",
    "\n",
    "We can also perform sentiment analysis across the text of all of the poems at once and see what happens:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "poems_all = TextBlob(raw.replace('\\n', ' '))\n",
    "poems_all.sentiment.polarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This way of analyzing the text tells us that the language in all of the poems has slightly negative sentiment.\n",
    "\n",
    "One more analysis we can perform is computing the average sentiment of the poems, given the list of each individual poem's sentiments that we computed earlier:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.mean(sentiments)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This method also tells us that our poems have slightly negative sentiment, on average.\n",
    "\n",
    "Here, let's look at one of the poems with it's sentiment value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "poem_3 = poems[3].replace('\\n', ' ')\n",
    "print(poem_3)\n",
    "print(TextBlob(poem_3).sentiment.polarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's look at one more poem:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "poem_47 = poems[47].replace('\\n', ' ')\n",
    "print(poem_47)\n",
    "print(TextBlob(poem_47).sentiment.polarity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "Question: Do you think the sentiment analyzer did a good job assigning the sentiment to these poems? What might that mean for the trends we see in our average sentiment across the poems?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    "# Replace this text with your response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "---\n",
    "\n",
    "***Please fill out our [feedback form](https://goo.gl/forms/Ir0Ulg5WDQogHhK72)!***"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}