{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regular Expressions\n", "\n", "Regular expressions are powerful tools to extract *structured information* from *unstructured text.* For example, suppose that we are scraping Twitter data, and we'd like to extract a list of all the mentions and hashtags in a tweet. Our raw data might look something like this: \n", "\n", "

> Our Great American Model was built on tough (very strong!!) parametric assumptions!\n", "> \n", "> But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science\n", "> \n", "> — Statistician Trump (@StatisticianTr2) July 11, 2020
\n", "\n", "We'd like to extract the hashtags from this tweet. For example, we'd like to write a function `collect_hashtags()` with the following output: \n", "\n", "```python\n", "collect_hashtags(tw)\n", "['statstwitter', 'epitwitter', 'rstats', 'math', 'AI', 'DataScience', 'python', 'Science']\n", "```\n", "\n", "We could then use this function on many tweets in order to conduct an analysis of what people are talking about on Twitter. How can we recognize the hashtags? \n", "\n", "If you're familiar with Twitter, you know that a hashtag consists of the symbol \#, followed by one or more letters or numbers, which may or may not be capitalized. A space `\" \"` (or the end of the tweet) terminates the hashtag. \n", "\n", "This is an informal description of a *pattern* -- a rule for detecting hashtags in text. In this case, the rule is: \n", "\n", "> Find a `#`. Then, continue through letters and numbers until a space `\" \"` (or the end of the text) is reached.\n", "\n", "Regular expressions allow us to formally construct and use patterns to obtain structured data like hashtags from unstructured text. They are an extremely powerful tool in any application in which we need to work with text data. \n", "\n", "To work with regular expressions, we need a few functions from the `re` module. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's a plaintext representation of our tweet. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw = \"Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science\" \n", "tw" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first thing we need to do is construct a *pattern* that matches the pieces of text that we want to find. Patterns are represented as *raw strings*, that is, they are preceded by `r` outside the quotes. Raw strings don't process backslash escape sequences. For example, the string `\"\\n\"` has just one character (the special newline character), but the raw string `r\"\\n\"` has two (a backslash and an `n`). " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 2)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(\"\\n\"), len(r\"\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's start pattern matching. Our main tool is the function `re.search()`. This function finds the very first match of the specified pattern. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(198, 199), match='#'>" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern = r\"#\"\n", "result = re.search(pattern, tw)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This says that the first match of the pattern `#` occurred at index 198. We can extract either the location of the match or the substring that produced the match. For the former, we use the `span()` method; for the latter, the `group()` method -- we'll explain this name in a future lecture. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(198, 199)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.span()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'#'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.group()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, the matching substring is just a single character. Let's make things a bit more interesting -- we'll look for the first hashtag that begins with `\"#epi\"`: " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(212, 216), match='#epi'>" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern = r\"#epi\"\n", "result = re.search(pattern, tw)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check that the span corresponds to the location in the original string: " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'#epi'" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sp = result.span()\n", "tw[sp[0]:sp[1]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We would have gotten the same result by checking `result.group()`: " ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'#epi'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result.group()" ] },
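{ "cell_type": "markdown", "metadata": {}, "source": [ "One more note before we move on (an aside that is not part of the original analysis): every search so far has succeeded. If the pattern does not occur anywhere in the string, `re.search()` returns `None` rather than a match object, so code that goes on to call `span()` or `group()` should check for that case first. A minimal sketch, using a hashtag that does not appear in the tweet: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Aside (sketch, not from the original analysis): re.search() returns None\n", "# when there is no match, so guard before calling .span() or .group().\n", "no_match = re.search(r\"#cats\", tw)  # this hashtag is not in the tweet\n", "if no_match is None:\n", "    print(\"no match found\")" ] },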
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Various Syntax \n", "\n", "The regular expression engine has a lot of syntax options that can help you easily express very complicated patterns. Here are a few of the most important ones. " ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(2, 4), match='rk'>" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.search(r\"rk\", \"kirk\")" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(2, 4), match='rk'>" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# search always takes the FIRST match\n", "re.search(r\"rk\", \"kirk kirk\")" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(8, 16), match='kooooooo'>" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Repeated characters\n", "# o* matches any run of o's, which may include no o's at all\n", "# o+ matches any run with at least one o\n", "\n", "s = \"Sisk Siskooooooo\"\n", "\n", "re.search(r\"ko+\", s)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(14, 17), match='DS9'>" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Bracket expressions\n", "# [A-Z] matches any uppercase letter; [a-z] matches any lowercase letter\n", "# ([A-z] is sometimes used for \"any letter, upper or lower case\", but it also\n", "#  matches a few punctuation characters between Z and a; [A-Za-z] is safer)\n", "# Add + to match continuous strings of letters (i.e. words)\n", "\n", "s = \"Siskoooooo in DS9\"\n", "\n", "re.search(r\"[A-Z]+[0-9]\", s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Back to Twitter\n", "\n", "Now we're ready to try finding hashtags. Each one is a `#` character followed by a string of letters and numbers, regardless of case, with no spaces. " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<re.Match object; span=(198, 211), match='#statstwitter'>" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern = r\"#[A-z0-9]+\"\n", "result = re.search(pattern, tw)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can alternatively retrieve all the matches, while throwing away the positional information, using `re.findall()`: " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['#statstwitter',\n", " '#epitwitter',\n", " '#rstats',\n", " '#math',\n", " '#AI',\n", " '#DataScience',\n", " '#python',\n", " '#Science']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "re.findall(pattern, tw)" ] },
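{ "cell_type": "markdown", "metadata": {}, "source": [ "To close the loop on the introduction, here is a minimal sketch of the `collect_hashtags()` function we set out to write (the name comes from the introduction; the body below is one possible implementation, assuming we simply drop the leading `#` from each match): " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of collect_hashtags() from the introduction.\n", "# Assumption: we reuse the hashtag pattern idea from above and drop the\n", "# leading \"#\" so the output matches the example at the top of the notebook.\n", "def collect_hashtags(text):\n", "    return [tag[1:] for tag in re.findall(r\"#[A-Za-z0-9]+\", text)]\n", "\n", "collect_hashtags(tw)" ] },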
{ "cell_type": "markdown", "metadata": {}, "source": [ "We achieved our goal! In the next lecture, we'll look at how to extract even more complex expressions. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }