{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![](https://github.com/MartinSchweinberger/SLAT7829/blob/master/images/bannerSLAT7829.jpeg?raw=true)\n", "\n", "# Regular Expressions in R\n", "\n", "This tutorial introduces regular expressions and how they can be used when working with language data.\n", "\n", "## What are regular expressions?\n", "\n", "Regular expressions, also known as regex, are patterns used to match and manipulate text. They are a powerful tool used in computer programming, data processing, and text editing.\n", "\n", "A regular expression is a sequence of characters that defines a search pattern. Regular expressions can be used to search for specific patterns in text, extract information from text, and replace text. For example, you could use a regular expression to search for all email addresses in a text file, extract all phone numbers from a web page, or replace all instances of a word in a document.\n", "\n", "Regular expressions are supported in many programming languages (such as R, Python, or Perl), text editors, and command-line tools. Although there may be minor differences between implementations, regular expressions are largely consistent across programming languages.\n", "\n", "**Preparation and session set up**\n", "\n", "In this tutorial, we will explore regular expressions in R. To start, we need to load the necessary packages.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(dplyr)\n", "library(stringr)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once you have installed RStudio and have initiated the session by executing the code shown above, you are good to go.\n", "\n", "# Explaining the code\n", "\n", "This section uses several commands. 
Below is a list with explanations of what these commands mean and what they do to help you understand the code shown below.\n", "\n", "* `readLines()`: reads in a text line-by-line \n", "* `%>%`: called the *pipe* and can be read as *and then* \n", "* `unlist()`: flattens a list to a simple vector (from a hierarchical/nested to a flat data structure)\n", "* `str_detect()`: checks if a pattern occurs in a string (text element) \n", "* `str_remove()` and `str_remove_all()`: remove the first match or all matches of a pattern from a string (text element) \n", "* `str_replace()` and `str_replace_all()`: replace the first match or all matches of a pattern in a string (text element) with another string \n", "* `str_view()` and `str_view_all()`: highlights a pattern in a string (text element)\n", "\n", "# Getting started with Regular Expressions\n", "\n", "To put regular expressions into practice, we need some text that we will perform our searches on. In this tutorial, we will use texts from Wikipedia about grammar.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# read in first text\n", "text1 <- readLines(\"https://slcladal.github.io/data/testcorpus/linguistics02.txt\")\n", "et <- paste(text1, sep = \" \", collapse = \" \")\n", "# inspect example text\n", "et\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition, we will split the example text into words to have another resource we can use to understand regular expressions.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# split example text\n", "set <- str_split(et, \" \") %>%\n", " unlist()\n", "# inspect\n", "head(set)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we delve into using regular expressions, we will have a look at the regular expressions that can be used in R and also check what they stand for.\n", "\n", "There are three basic types of regular expressions:\n", "\n", "* regular expressions that 
stand for individual symbols and determine frequencies\n", "\n", "* regular expressions that stand for classes of symbols\n", "\n", "* regular expressions that stand for structural properties\n", "\n", "The regular expressions below show the first type of regular expressions, i.e., regular expressions that stand for individual symbols and determine frequencies.\n", "\n", "![](https://slcladal.github.io/images/regex01.JPG)\n", "\n", "\n", "\n", "The regular expressions below show the second type of regular expressions, i.e., regular expressions that stand for classes of symbols.\n", "\n", "![](https://slcladal.github.io/images/regex02.JPG)\n", "\n", "\n", "\n", "The regular expressions that denote classes of symbols are enclosed in `[]` and `:`. The last type of regular expressions, i.e., regular expressions that stand for structural properties, is shown below.\n", "\n", "![](https://slcladal.github.io/images/regex03.JPG)\n", "\n", "\n", "\n", "\n", "# Demonstration\n", "\n", "In this section, we will explore how to use regular expressions. 
At the end, we will go through some exercises to help you understand how you can best utilize regular expressions.\n", "\n", "Show all words in the split example text that contain `a` or `n`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"[an]\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show all words in the split example text that begin with a lower case `a`.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"^a\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show all words in the split example text that end in a lower case `s`.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"s$\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show all words in the split example text in which there is an `e`, then any other character, and then an `n`.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"e.n\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show all words in the split example text in which there is an `e`, then two other characters, and then an `n`.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"e.{2,2}n\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show all words that consist of exactly three alphabetical characters in the split example text.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"^[:alpha:]{3,3}$\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Show all words that consist of six or more alphabetical characters in the split example 
text.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"^[:alpha:]{6,}$\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Replace all lower case `a`s with upper case `E`s in the example text.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_replace_all(et, \"a\", \"E\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove all non-word characters (everything except letters, digits, and underscores) in the split example text.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_remove_all(set, \"\\\\W\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Remove all white spaces in the example text.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_remove_all(et, \" \")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Escaping regular expressions**\n", "\n", "Replace `.`, `?`, and `,` with the string ` PUNCTUATION `. Because `.` and `?` have special meanings in regular expressions, they need to be escaped:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_replace_all(text1, \"\\\\.|\\\\?|,\", \" PUNCTUATION \")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Removing superfluous white spaces**\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "whitespaces <- \" This is a test sentence with superfluous white spaces \"\n", "# inspect\n", "whitespaces\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# remove superfluous white spaces\n", "perfectwhitespaces <- str_squish(whitespaces)\n", "# inspect\n", "perfectwhitespaces\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Finding elements at the beginning and end of strings**\n", "\n", "Find strings that begin with a *w*:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_detect(set, \"^w\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"^w\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find all words that end with *s* or *es*:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "set[str_detect(set, \"e{0,1}s$\")]\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Limiting scope of regular expressions**\n", "\n", "We want to limit how far a regular expression extends:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "limiting <- \"This is an <\/of> example sentence. 
\"\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": "\n" }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_remove_all(limiting, \"<.*>\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `.*` is greedy: it matches as much as possible, so the match extends to the last instance of `>`. Adding a `?` makes the quantifier lazy, so the match stops at the first `>`:\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_remove_all(limiting, \"<.*?>\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Highlighting patterns**\n", "\n", "We use the `str_view` and `str_view_all` functions to show the occurrences of regular expressions in the example text.\n", "\n", "To begin with, we match an exactly defined pattern (`ang`).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_view_all(et, \"ang\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we include `.`, which stands for any symbol (except a new line symbol).\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "str_view_all(et, \".n.\")\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Practice\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext <- c(\"HTML, which stands for HyperText Markup Language, is the standard language used to create web pages. It's comprised of different tags, such as , , ,

, , and , that are used to define the structure and content of a web page.\n", "\n", "Phone numbers can be written in various formats, but a common one in the US is (555) 555-5555. Another common format is 555-555-5555. Regular expressions can be used to match and extract phone numbers from text.\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 1\n", "\n", "Extract all phone numbers.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 2\n", "\n", "Split the text into words.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 3\n", "\n", "Extract all words containing the sequence `ex`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 4\n", "\n", "Find all words that begin with an `n`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 5\n", "\n", "Find all words that end with an `h`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 6\n", "\n", "Remove all HTML tags.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 7\n", "\n", "Find all words with 2 or 3 characters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "## Task 8\n", "\n", "Remove all words that contain 4 characters.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 9\n", "\n", "Replace all words with 3 characters with the sequence `LOL`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Task 10\n", "\n", "Extract all words that begin with a capital letter.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "practicetext\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sessionInfo()\n", "\n" ] } ], "metadata": { "anaconda-cloud": "", "kernelspec": { "display_name": "R", "langauge": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1" } }, "nbformat": 4, "nbformat_minor": 1 }