--- title: "Harry Potter and regular expressions" author: 'STOR 390' output: html_document --- Assignment 2 is due 2/28/17. You can find the raw .Rmd file at: [https://raw.githubusercontent.com/idc9/stor390/master/assignments/harry_potter/harry_potter.Rmd](https://raw.githubusercontent.com/idc9/stor390/master/assignments/harry_potter/harry_potter.Rmd). The text of all 7 Harry Potter books is available online: [http://www.readfreeonline.net/Author/J._K._Rowling/Index.html](http://www.readfreeonline.net/Author/J._K._Rowling/Index.html) (very possible the website is now defunct). In this assignment you will use dplyr, ggplot and regular expressions to do an exploratory analysis of Harry Potter and the Philosopher's Stone. Here are a couple examples of similar text analysis projects (that you will be able to do in a couple weeks!) - [The Life-Changing Magic of Tidying Text](http://juliasilge.com/blog/Life-Changing-Magic/) by Julia Silge (yes [janeaustenr](https://github.com/juliasilge/janeaustenr) is an entire R package devoted to Jane Austen) - [Harry Potter agression](https://github.com/andrewheiss/Harry-Potter-aggression) by Andrew Heiss # Question 0 Set `eval=FALSE` for the chunk above and `eval=TRUE`for the chunk below and all test chunks. The text file comes with the Sakai announcement. ```{r, eval=T, message=F, warning=F} # set up library(tidyverse) library(stringr) text <- read_file('philosophers_stone.txt') ``` # Question 1 How many words are in the book? ```{r} # ``` # Question 2 How many times are each of the following characters mentioned? Display the answer using an appropriate visualization. - Harry, Hermione, Ron, Neville, Dumbledore, Draco, Snape, Hagrid, McGonagall ```{r} people <- c('Harry', 'Hermione', 'Ron', 'Neville', 'Dumbledore', 'Draco', 'Snape', 'Hagrid', 'McGonagall') ``` # Question 3 Break the text into paragraphs; create a vector called `paragraphs` where each entry is a paragraph in the book. ```{r} # assume paragraphs end with \\\r\\\n ``` # Question 4 Write a function that can break the text up into paragraphs, sentences, or words. This is a preview of [what you'll be doing](http://tidytextmining.com/tidytext.html#the-unnest_tokens-function) in a couple weeks. This function does not need to be perfect. For sentences, give one example where the function you wrote fails. *Hint*: the function should probably have a if statement ```{r} # Assume a sentence ends with one of: .!? # there are multiple valid definitions of word, just do something reasonable unnest_tokens <- function(text, token='words'){ # splits a string into tokens # input # text is a string # token can be one of: words, paragraphs, sentences # output: a character vector } ``` ```{r test4, eval=F} # TODO add more # Test code for the grader -- you don't have to modify these sum(paragraphs == unnest_tokens(text, 'paragraphs')) ``` # Question 5 Put the data into tidy format with one row per paragraph. - create a tibble called `paragraph_df` with one column `text` with the text of each paragraph (*hint*: you might need to use `as.character(paragraphs)`) - add a new column `index` that gives the index of each paragraph - **without** using dplyr add a column called `Harry` that counts the number of times Harry is referenced in each paragraph ```{r} # ``` *Hint*: you can use question 2 to check your answer # Question 6 Write a function called `reference_counter` that generalizes question 5 for any tidy text data frame and any list of words. *Hint*: do this **without** dplyr, use base R subsetting (i.e.`[]`). If `df` is a data frame and `vec` is a vector then `df['blah'] <- vec` will create a new column for `df` call `blah`. ```{r} reference_counter <- function(text_df, word_list){ # inputs # text_df is a tibble with a column called text # word_list is a vector of strings # for each word in word_list add a column to text_df counting # the number of times that word appears in each row of text df # does not modify the original text_df # do this WITHOUT using dplyr } ``` ```{r test6, eval=F} # test code for grader test_words <- c('Harry', 'Hagrid', 'wand') test_df <- reference_counter(paragraph_df, test_words) test_df %>% select(Harry, Hagrid, wand) %>% summarise_all(sum) ``` # Question 7 Using the `reference_counter` function update `paragraph_df` to include columns counting the number of references to each characters from Q2 in each paragraph ```{r} # ``` ```{r test7, eval=F} # test code for grader paragraph_df[,people] %>% summarise_all(sum) ``` # Question 8 Make a new data frame called `person_refs` with three columns: person, num_refs, index. num_refs is the number of references each person gets in paragraph and index is the index of the paragraph. Limit this data frame to the following 5 characters: Harry, Hermione, Ron, Draco, Neville. *Hint*: use `gather`. ```{r} # ``` Make a bar plot showing the number of paragraphs that references each of the 5 characters ```{r} # ``` Now we want to examine how characters evolve over "time." Plot the number of references vs. the paragraph index. ```{r} # ``` In this question we are using paragraphs for "time windows." What are other "time windows" we could have used? What are some trade offs for these different choices. # Question 9 How often are Harry and Herminone referenced together? Plot the number of references per paragraph for Harry vs. Herminone. - one plot using `geom_point` - one plot using `geom_jitter` (use the width/height arguments of jitter to make the jitter plot look better) ```{r} # ``` ```{r} # ``` Why is the jitter plot better than a simple point plot? # Question 10 Do Harry and Hermione tend to co-occur? Fit a linear regression of Harry vs. Hermione references per paragraph. Use the `lm()` function and print out the `summary` of the model. ```{r} # ``` Now use `geom_smooth` to plot the linear regression line on top of the jitter plot. ```{r} # ``` # Question 11 Is there are relationship between the length of the paragraph and the number of times Harry is mentioned? Add a column called `num_words` to `paragraph_df` counting the number of words in each paragraph. Then use a linear regression to answer for the question. Provide both a statistical summary and a visualization. ```{r} # ``` # Question 12 Create an indicator variable `harry_mentioned` that indicates whether or not Harry is mentioned in each paragraph. This indicator variable should be a factor. ```{r} # ``` Now do a linear regression with `harry_mentioned` as the x variable instead of the number of times he is mentioned. ```{r} # ``` # Free response Ask and answer a question with this data set. You should make at least 2 figures (e.g. plot, printout of a regression, etc). Provide a written explanation of the question and the evidence for your answer.