{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Preprocessing\n", "\n", "Like other kinds of data, text rarely comes clean. Moreover, most of our downstream methods only accept data structured in a particular way. Because of this, before we apply any computational text analysis technique, we will always need to perform some level of preprocessing. Text data calls for its own kinds of preprocessing. In this notebook, we will cover the core preprocessing methods in preparation for our next two weeks:\n", "\n", "- Reading in files\n", "- Character encoding\n", "- Tokenization\n", "- Sentence segmentation\n", "- Removing punctuation\n", "- Stripping whitespace\n", "- Text normalization\n", "- Stop words\n", "- Stemming/Lemmatizing\n", "- POS tagging\n", "- DTM/TF-IDF\n", "\n", "### Time\n", "- Teaching: 50 minutes\n", "- Exercises: 60 minutes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading in files\n", "\n", "The first step is to read in the files containing the data. As we discussed last week, the most common file types for text data are `.txt`, `.csv`, `.json`, `.html` and `.xml`.\n", "\n", "#### Reading in `.txt` files\n", "\n", "Python has built-in support for reading in `.txt` files.\n", "\n", "- What type of object is `raw`?\n", "- How many characters are in `raw`?\n", "- Get the first 1000 characters of `raw`."
] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "DATA_DIR = 'data'\n", "fname = 'pride-and-prejudice.txt'\n", "fname = os.path.join(DATA_DIR, fname)\n", "# State the encoding explicitly rather than relying on the platform default\n", "with open(fname, encoding='utf-8') as f:\n", "    raw = f.read()  # the whole file as a single string" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reading in `.csv`\n", "\n", "Python has a built-in module called `csv` for reading in `.csv` files.\n", "\n", "- What type is `tweets`?\n", "- How many entries are in `tweets`?\n", "- Which entry is the header row?\n", "- How can we get the text of the first tweet?\n", "- How can we get a list of the texts of all tweets?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import csv\n", "\n", "fname = 'trump-tweets.csv'\n", "fname = os.path.join(DATA_DIR, fname)\n", "# newline='' lets the csv module handle line endings inside quoted fields\n", "with open(fname, newline='') as f:\n", "    reader = csv.reader(f)\n", "    tweets = list(reader)  # one list of strings per row" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Reading in `.csv` with `pandas`\n", "\n", "`pandas` is a third-party library that makes working with tabular data much easier. It is the recommended way to read in a `.csv` file.\n", "\n", "- How many tweets are there?\n", "- What happened to the header row?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "fname = 'trump-tweets.csv'\n", "fname = os.path.join(DATA_DIR, fname)\n", "tweets = pd.read_csv(fname)  # returns a DataFrame" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Date | Time | Tweet_Text | Type | Media_Type | Hashtags | Tweet_Id | Tweet_Url | twt_favourites_IS_THIS_LIKE_QUESTION_MARK | Retweets | Unnamed: 10 | Unnamed: 11 |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 0 | 16-11-11 | 15:26:37 | Today we express our deepest gratitude to all ... | text | photo | ThankAVet | 7.970000e+17 | https://twitter.com/realDonaldTrump/status/797... | 127213 | 41112 | NaN | NaN |\n",
"| 1 | 16-11-11 | 13:33:35 | Busy day planned in New York. Will soon be mak... | text | NaN | NaN | 7.970000e+17 | https://twitter.com/realDonaldTrump/status/797... | 141527 | 28654 | NaN | NaN |\n",
"| 2 | 16-11-11 | 11:14:20 | Love the fact that the small groups of protest... | text | NaN | NaN | 7.970000e+17 | https://twitter.com/realDonaldTrump/status/797... | 183729 | 50039 | NaN | NaN |\n",
"