{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# ENGLISH R1A: Chinatown and the Culture of Exclusion\n", "**Instructor: Amy Lee**\n", "\n", "**Developers: Michaela Palmer, Maya Shen, Cynthia Leu, Chris Cheung**\n", "\n", "**FPF 2017**\n", "\n", "Welcome to lab! Please read this lab in its entirety, as the analysis will make a lot more sense with the background context provided.\n", "This lab is intended to be a hands-on introduction to data science as it can be applied to Chinatown demographics and analyzing primary texts.\n", "\n", "We will be reading and analyzing representations of Chinatown in the form of data and maps. In addition, we will learn how data tools can be used to read and analyze large volumes of text.\n", "\n", "## What this lab will cover\n", "* Running Jupyter Notebooks\n", "* Data Analysis of Chinatowns' demographics\n", "* Visualization & Interpretation\n", "* Using Data Tools to Analyze Primary Texts\n", "\n", "## What you need to do\n", "* Read the content, complete the questions\n", "* Analyze the data\n", "* Submit the assignment\n", "\n", "\n", "# 1. Running Jupyter Notebooks\n", "\n", "You are currently working in a Jupyter Notebook. A Notebook allows text and code to be combined into one document. Each rectangular section of a notebook is called a \"cell.\" There are two types of cells in this notebook: text cells and code cells. \n", "\n", "Jupyter allows you to run simulations and regressions in real time. To do this, select a code cell, and click the \"run cell\" button at the top that looks like ▶| to confirm any changes. Alternatively, you can hold down the `shift` key and then press `return` or `enter`.\n", "\n", "In the following simulations, anytime you see `In [ ]` you should click the \"run cell\" button to see output. 
**If you get an error message after running a cell, go back to the beginning of the lab and make sure that every previous code cell has been run.**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Part 0: Introduction to Python and Jupyter Notebooks: \n", "\n", "## 1. Cells, Arithmetic, and Code\n", "In a notebook, each rectangle containing text or code is called a *cell*.\n", "\n", "Cells (like this one) can be edited by double-clicking on them. This cell is a text cell, written in a simple format called [Markdown](http://daringfireball.net/projects/markdown/syntax) to add formatting and section headings. You don't need to worry about Markdown today, but it's a pretty fun+easy tool to learn.\n", "\n", "After you edit a cell, click the \"run cell\" button at the top that looks like ▶| to confirm any changes. (Try not to delete the instructions.) You can also press `SHIFT-ENTER` to run any cell or progress from one cell to the next.\n", "\n", "Other cells contain code in the Python programming language. Running a code cell will execute all of the code it contains.\n", "\n", "Try running this cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Hello, World!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now quickly go through some very basic functionality of Python, which we'll be using throughout the rest of this notebook.\n", "\n", "### 1.1 Arithmetic\n", "Quantitative information arises everywhere in data science. In addition to representing commands to `print` out lines, expressions can represent numbers and methods of combining numbers. \n", "\n", "The expression `3.2500` evaluates to the number 3.25. 
(Run the cell and see.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "3.2500" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We don't necessarily always need to say \"`print`\", because Jupyter always prints the last line in a code cell. If you want to print more than one line, though, do specify \"`print`\"." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(3)\n", "4\n", "5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Many basic arithmetic operations are built in to Python, like `*` (multiplication), `+` (addition), `-` (subtraction), and `/` (division). There are many others, which you can find information about [here](http://www.inferentialthinking.com/chapters/03/1/expressions.html). Use parentheses to specify the order of operations, which act according to PEMDAS, just as you may have learned in school. Use parentheses for a happy new year!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "2 + (6 * 5 - (6 * 3)) ** 2 * (( 2 ** 3 ) / 4 * 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Variables\n", "\n", "We sometimes want to work with the result of some computation more than once. To be able to do that without repeating code everywhere we want to use it, we can store it in a variable with *assignment statements*, which have the variable name on the left, an equals sign, and the expression to be evaluated and stored on the right. In the cell below, `(3 * 11 + 5) / 2 - 9` evaluates to 10, and gets stored in the variable `result`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result = (3 * 11 + 5) / 2 - 9" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. 
Functions\n", "\n", "One important form of an expression is the call expression, which first names a function and then describes its arguments. The function returns some value, based on its arguments. Some important mathematical functions are:\n", "\n", "| Function | Description |\n", "|----------|---------------------------------------------------------------|\n", "| `abs` | Returns the absolute value of its argument |\n", "| `max` | Returns the maximum of all its arguments |\n", "| `min` | Returns the minimum of all its arguments |\n", "| `round` | Rounds its argument to the nearest integer |\n", "\n", "Here are two call expressions that both evaluate to 3:\n", "\n", "```python\n", "abs(2 - 5)\n", "max(round(2.8), min(pow(2, 10), -1 * pow(2, 10)))\n", "```\n", "\n", "These function calls first evaluate the expressions in the arguments (inside the parentheses), then evaluate the function on the results. `abs(2-5)` evaluates first to `abs(-3)`, then returns `3`.\n", "\n", "A **statement** is a whole line of code. Some statements are just expressions, like the examples above, which can be broken down into subexpressions that are evaluated individually before the statement as a whole is evaluated.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Calling functions\n", "\n", "The most common way to combine or manipulate values in Python is by calling functions. Python comes with many built-in functions that perform common operations.\n", "\n", "For example, the `abs` function takes a single number as its argument and returns the absolute value of that number. The absolute value of a number is its distance from 0 on the number line, so `abs(5)` is 5 and `abs(-5)` is also 5." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abs(5)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "abs(-5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Functions can be called as above, putting the argument in parentheses at the end, or by using \"dot notation\", and calling the function after finding the arguments, as in the cell immediately below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datascience import make_array\n", "nums = make_array(1, 2, 3) # makes a list of items, in this case, numbers" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nums.mean() # finds the average of the array" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1: Exploring Demographic Data: \n", "\n", "## 1.1 Importing Modules\n", "\n", "First, we need to import libraries so that we are able to call the functions from within. We are going to use these functions to manipulate data tables and conduct a statistical analysis. Run the code cell below to import these modules." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!python -m spacy download en\n", "!pip install --no-cache-dir wordcloud\n", "!pip3 install --no-cache-dir -U folium\n", "!pip install --no-cache-dir textblob\n", "from datascience import *\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from ipywidgets import *\n", "%matplotlib inline\n", "import folium\n", "import pandas as pd\n", "from IPython.display import HTML, display, IFrame\n", "import spacy\n", "from wordcloud import WordCloud\n", "from textblob import TextBlob\n", "import geojson" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Official map of Chinatown in San Francisco - 1885\n", "#### Prepared under the supervision of the special committee of the Board of Supervisors. July 1885.\n", "\n", "\n", "\n", "\n", "This map reflects the pervasive bias against the Chinese in California and in turn further fostered the hysteria. It was published as part of an official report of a Special Committee established by the San Francisco Board of Supervisors \"on the Condition of the Chinese Quarter.\" The Report resulted from a dramatic increase in hostility to the Chinese, particularly because many Chinese laborers had been driven out of other Western states by vigilantes and sought safety in San Francisco (Shah 2001, 37).\n", "
The substance and tone of the Report are best illustrated by a few excerpts: \"The general aspect of the streets and habitations was filthy in the extreme, . . . a slumbering pest, likely at any time to generate and spread disease, . . . a constant source of danger . . . , the filthiest spot inhabited by men, women and children on the American continent.\" (Report 4-5). \"The Chinese brought here with them and have successfully maintained and perpetuated the grossest habits of bestiality practiced by the human race.\" (Ibid. 38).\n", "
The map highlights the Committee's points, particularly the pervasiveness of gambling, prostitution and opium use. It shows the occupancy of the street floor of every building in Chinatown, color coded to show: General Chinese Occupancy; Chinese Gambling Houses; Chinese Prostitution; Chinese Opium Resorts; Chinese Joss Houses; and White Prostitution.\n", "The Report concludes with a recommendation that the Chinese be driven out of the City by stern enforcement of the law: \"compulsory obedience to our laws [is] necessarily obnoxious and revolting to the Chinese; and the more rigidly this enforcement is insisted upon and carried out the less endurable will existence be to them here, the less attractive will life be to them in California. Fewer will come and fewer will remain. . . . Scatter them by such a policy as this to other States . . . .\" (Ibid. 67-68)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Analyzing Demographics\n", "In this section, we will examine some of the factors that influence population growth and how they are changing the landscape of Chinatowns across the U.S." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2.1 Reading Data, 2010-2015\n", "\n", "Now it's time to work with tables and explore some real data. 
A `Table` is like the array we made above with `make_array`, but instead of a single sequence of values it holds many rows organized into named columns.\n", "\n", "We're going to first look at the most recent demographic data from 2010-2015:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "historical_data = Table.read_table('data/2010-2015.csv') # read in data from file\n", "historical_data['FIPS'] = ['0' + str(x) for x in historical_data['FIPS']] # restore the leading zero in each FIPS code\n", "historical_data.show(10) # show first ten rows" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can get some quick summary statistics by calling the `.stats()` function on our `Table` variable:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "historical_data.stats()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So which census tract has the highest Asian population?\n", "\n", "First we can find the highest population by using the `max` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max(historical_data['Asian'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's plug that into a table that uses the `where` and `are.equal_to` functions:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "historical_data.where('Asian', are.equal_to(max(historical_data['Asian'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This FIPS code 06075035300 is tract [353](https://censusreporter.org/profiles/14000US06075035300-census-tract-353-san-francisco-ca/). 
Does this make sense to you?\n", "\n", "---\n", "\n", "It might be better to look at which census tract has the highest proportion of Asian residents:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "historical_data['Asian_percentage'] = historical_data['Asian'] / historical_data['Population']\n", "historical_data.show(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can use the same method to get the `max` and subset our table:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "max(historical_data['Asian_percentage'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "historical_data.where('Asian_percentage', are.equal_to(max(historical_data['Asian_percentage'])))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FIPS code 06075011800 is census tract [118](https://censusreporter.org/profiles/14000US06075011800-census-tract-118-san-francisco-ca/). Does this make sense?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
Run the following cell to import the poems from a .txt file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "with open('data/islandpoetry1_22.txt', \"r\") as f:\n", "    raw = f.read()\n", "print(raw)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're interested in which words appear the most often in our set of poems. It's pretty hard to read or see much in this form. We'll come back to which words are most common, with actual counts, a bit later. For now, run the following cell to generate two visualizations of the most common words (minus common words such as \"the\", \"a\", etc.). " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "wordcloud = WordCloud().generate(raw)\n", "\n", "plt.imshow(wordcloud, interpolation='bilinear')\n", "plt.axis(\"off\")\n", "\n", "# lower max_font_size\n", "wordcloud = WordCloud(max_font_size=40).generate(raw)\n", "plt.figure()\n", "plt.imshow(wordcloud, interpolation=\"bilinear\")\n", "plt.axis(\"off\")\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "