{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lookarounds\n", "\n", "\"Lookarounds\" let you filter your matches based on what is \"around\" the match. For example, let's go back to our Statistician Trump tweet: \n", "\n", "

Our Great American Model was built on tough (very strong!!) parametric assumptions!

But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science

— Statistician Trump (@StatisticianTr2) July 11, 2020
\n", "\n", "Suppose we'd like to extract every word that is immediately followed by one or more exclamation points (`!`, `!!`, and so on). One way to do this is with ordinary capture groups. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science'" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tw = \"Our Great American Model was built on tough (very strong!!) parametric assumptions! But FAR LEFT elitists living in coastal TANGENT SPACES (out of touch!) want to throw these out. Not on my watch!! #statstwitter #epitwitter #rstats #math #AI #DataScience #python #Science\" \n", "tw" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('strong', '!!'), ('assumptions', '!'), ('touch', '!'), ('watch', '!!')]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern = r\"([A-Za-z]+)(!+)\"\n", "result = re.findall(pattern, tw)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, we don't necessarily want to extract the punctuation itself, just the immediately preceding word. To do this we can use a *lookahead*, the syntax for which is `(?=...)`. Adding a lookahead returns a match only if the text immediately following it matches the lookahead expression. 
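
As a small illustration, here is a minimal sketch on a made-up string (the names `text`, `with_group`, and `with_lookahead` are hypothetical and not part of the notebook's data). It suggests that a lookahead only checks what comes next without consuming it, so the exclamation points never appear in the result:

```python
import re

# Hypothetical sample string, not taken from the tweet above.
text = 'wow!! ok fine!'

# Group version: the exclamation points are consumed and returned.
with_group = re.findall(r'([A-Za-z]+)(!+)', text)

# Lookahead version: the exclamation point is only checked, never consumed.
with_lookahead = re.findall(r'[A-Za-z]+(?=!)', text)

print(with_group)      # [('wow', '!!'), ('fine', '!')]
print(with_lookahead)  # ['wow', 'fine']

# The match ends before the punctuation, so its span excludes it.
m = re.search(r'[A-Za-z]+(?=!)', text)
print(m.span())        # (0, 3)
```

Because the lookahead only needs to see a single `!` to succeed, `(?=!)` and `(?=!+)` should select the same words here.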
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['strong', 'assumptions', 'touch', 'watch']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern = r\"([A-Za-z]+)(?=!+)\"\n", "result = re.findall(pattern, tw)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use *lookbehinds*, written `(?<=...)`, which work the same way but check the text immediately before the candidate match. For example, suppose we'd like to extract the text inside each pair of parentheses in a string. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Sometimes (but not always), the most important (or at least, most intriguing) information is between parentheses. '" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = \"Sometimes (but not always), the most important (or at least, most intriguing) information is between parentheses. \"\n", "s" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['but not always', 'or at least, most intriguing']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pattern = r\"(?<=[(])([A-Za-z\\s,]+)(?=[)])\"\n", "result = re.findall(pattern, s)\n", "result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It's usually not necessary to use lookarounds (one can instead use groups and discard parts of the matches), but using them often makes your downstream code much tidier. " ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }