{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regular Expressions\n", "\n", "Regular expressions are powerful tools to extract *structured information* from *unstructured text.* For example, suppose that we are scraping Twitter data, and we'd like to extract a list of all the mentions and hashtags in a tweet. Our raw data might look something like this: \n", "\n", "> Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science\n", "> \n", "> — Statistician Trump (@StatisticianTr2) July 11, 2020
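Before formalizing anything, here is a quick preview of the kind of tool we are about to build. This is a minimal sketch, under the assumption that a hashtag is `#` followed by one or more "word characters" (we'll make this precise shortly); `re.findall` is a function from Python's built-in `re` module that collects every match of a pattern.

```python
import re

# a preview sketch: find each "#" followed by one or more word
# characters, and capture just the text after the "#"
text = "Loving #math and #DataScience today"   # hypothetical example text
re.findall(r"#(\w+)", text)                    # ['math', 'DataScience']
```

Don't worry about the notation yet -- the rest of these notes explains how patterns like `#(\w+)` work.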
\n", "\n", "We'd like to extract the hashtags from this tweet. For example, we'd like to write a function `collect_hashtags()` with the following output: \n", "\n", "```python\n", "collect_hashtags(tw)\n", "['statstwitter', 'epitwitter', 'rstats', 'math', 'AI', 'DataScience', 'python', 'Science']\n", "```\n", "\n", "We could then use this function on many tweets in order to conduct an analysis of what people are talking about on Twitter. How can we recognize the hashtags? \n", "\n", "If you're familiar with Twitter, you know that a hashtag consists of the symbol \#, followed by one or more letters or numbers, which may or may not be capitalized. A space `\" \"` terminates the hashtag. \n", "\n", "This is an informal description of a *pattern* -- a rule for detecting hashtags in text. In this case, the rule is: \n", "\n", "> Find a `#`. Then, continue through letters and numbers until a space `\" \"` is reached.\n", "\n", "Regular expressions allow us to formally construct and use patterns to obtain structured data like hashtags from unstructured text. They are an extremely powerful tool in any application in which we need to work with text data. \n", "\n", "To work with regular expressions, we need a few functions from the `re` module. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a plaintext representation of our tweet. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! 
#statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw = \"Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science\" \n", "tw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we need to do is construct a *pattern* that matches the pieces of text that we want to find. Patterns are represented as *raw strings*: strings prefixed with `r`, placed immediately before the opening quote. Raw strings don't process escape sequences. For example, the string `\"\\n\"` has just one character (the special newline character), but the raw string `r\"\\n\"` has two (`\"\\\"` and `\"n\"`). " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 2)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(\"\\n\"), len(r\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's start pattern matching. Our main tool is the function `re.search()`. This function scans through a string and returns a *match object* for the first match of the specified pattern, or `None` if the pattern doesn't match anywhere. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "