Overview

Due before class Monday November 27th.

Fork the hw09 repository

Go here to fork the repo for homework 09.

Your mission

Perform text analysis.

Okay, I need more information.

Perform sentiment analysis, classification, or topic modeling using text analysis methods as demonstrated in class and in the readings.

Okay, I need even more information.

Do the above. Can’t think of a data source? Here are some options:

  • gutenbergr
  • AssociatedPress from the topicmodels package
  • NYTimes or USCongress from the RTextTools package
  • Reuters-21578 - a standard set of text documents (articles published by Reuters in 1987). To access the document-term matrix for this data set, run the following code:

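    # Install the corpus package from the datacube.wu.ac.at repository
    install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
    library(tm.corpus.Reuters21578)
    # Load the pre-built document-term matrix into the workspace
    data("Reuters21578_DTM")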
  • State of the Union speeches
  • Scrape tweets using twitteR (you know how to use the API now, right?)
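If the document-term matrix mentioned for the Reuters data is unfamiliar, the following is a minimal sketch of the structure in base R. The three documents and the tiny stop word list are made up for illustration; a real analysis would use one of the sources above and a fuller stop word list.

```r
# Toy corpus: three hypothetical documents (not from any real data set)
docs <- c(
  doc1 = "The markets rose 3% on Monday.",
  doc2 = "Oil markets fell; oil prices fell too.",
  doc3 = "The senate passed the bill on Monday."
)

# A deliberately tiny stop word list, for illustration only
stop_words <- c("the", "on", "too")

# Lowercase, drop digits and punctuation, split on spaces, remove stop words
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z ]", " ", text)
  tokens <- strsplit(text, " +")[[1]]
  tokens <- tokens[tokens != ""]
  tokens[!tokens %in% stop_words]
}

tokens <- lapply(docs, tokenize)
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary term per document, then put documents in rows
counts <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
dtm <- t(counts)
colnames(dtm) <- vocab
dtm
```

Each row is a document, each column a term, and each cell a count. Note that this sketch discards numbers and stop words; whether that is appropriate depends on your research question.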

Analyze the text for sentiment or topic, or build a statistical learning model that uses text features to predict some outcome of interest. You don’t have to do all of these things; just pick one. The lecture notes and Tidy Text Mining with R are good starting points for templates to perform this type of analysis, but feel free to expand beyond these examples.
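To make the sentiment option concrete, here is a rough sketch of dictionary-based scoring in base R. The lexicon, the `score_sentiment()` helper, and the example sentences are all hypothetical; in practice you would likely swap in a published lexicon, for example via `get_sentiments()` in the tidytext package.

```r
# Hypothetical mini-lexicon: +1 for positive words, -1 for negative words
lexicon <- c(good = 1, great = 1, happy = 1, bad = -1, terrible = -1, sad = -1)

# Sum the lexicon values of every token that appears in the lexicon
score_sentiment <- function(text) {
  cleaned <- tolower(gsub("[^A-Za-z ]", " ", text))
  tokens <- strsplit(cleaned, " +")[[1]]
  matched <- lexicon[tokens[tokens %in% names(lexicon)]]
  sum(matched)
}

# Made-up example texts
reviews <- c(
  "A great book with a happy ending.",
  "Terrible pacing and a sad, bad plot.",
  "It has chapters."
)

scores <- vapply(reviews, score_sentiment, numeric(1), USE.NAMES = FALSE)
scores
```

A positive score suggests positive wording and a negative score the opposite. This kind of word counting ignores negation and context ("not good" still scores +1), which is the sort of caveat worth stating when you interpret your results.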

Submit the assignment

Your assignment should be submitted as a set of R scripts, R Markdown documents, data files, etc.: whatever is necessary to show your code and present your results. Follow the instructions on homework workflow. As part of the pull request, you’re encouraged to reflect on what was hard/easy, problems you solved, helpful tutorials you read, etc.

Rubric

Check minus: Code does not run or is poorly documented. Severe misinterpretations of the results. No effort is made to pre-process the text for analysis. [1]

Check: Solid effort. Hits all the elements. No clear mistakes. Easy to follow (both the code and the output). Nothing spectacular, either bad or good.

Check plus: Interpretation is clear and in-depth. Accurately interprets the results, with appropriate caveats for what the technique can and cannot do. Code is reproducible (e.g. if analyzing tweets, you have stored a copy in a local file so I can exactly reproduce your results as well as run it on a new sample of tweets). Uses a sentiment analysis or topic model approach not directly covered in class.
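One possible way to meet the reproducibility point is to cache the downloaded data in a local file and only call the API when asked. The sketch below is an assumption about how you might structure this, not a required layout: `fetch_tweets()` is a hypothetical stand-in for your real API call, and `load_tweets()` and the file name are illustrative.

```r
# Hypothetical stand-in for a real API call (e.g. a twitteR search)
fetch_tweets <- function() {
  data.frame(
    id = 1:3,
    text = c("first tweet", "second tweet", "third tweet"),
    stringsAsFactors = FALSE
  )
}

# Read the stored copy if it exists; otherwise fetch and store it first.
# Set refresh = TRUE to draw a new sample and overwrite the stored copy.
load_tweets <- function(path, refresh = FALSE) {
  if (refresh || !file.exists(path)) {
    tweets <- fetch_tweets()
    saveRDS(tweets, path)
  }
  readRDS(path)
}

# tempdir() keeps this example self-contained; in your repo you would
# more likely use a committed path such as "data/tweets.rds"
cache_path <- file.path(tempdir(), "tweets.rds")
tweets <- load_tweets(cache_path)
```

Running the analysis against the stored file reproduces the original results exactly, while `load_tweets(cache_path, refresh = TRUE)` reruns it on a new sample.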


  1. Or you provide no justification for keeping content such as numbers, stop words, etc.

This work is licensed under the CC BY-NC 4.0 Creative Commons License.