This notebook summarizes key uses of regular expressions in Python.

References:
- [Regular Expressions: Regexes in Python (Part 1)](https://realpython.com/regex-python/)

In [1]:
import re

## re.search

`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches.

- If a match is found, then `re.search()` returns a **match object**. Otherwise, it returns `None`.
- A match object is truthy, so you can use it in a Boolean context like a conditional statement.
- In a regex, a set of characters specified in square brackets ([]) makes up a **character class**. This metacharacter sequence matches any single character that is in the class.
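
As a minimal sketch of the points above (the sample string and variable names are illustrative, not part of the original notes), a match object exposes the matched text and its position, while a failed search yields `None`:

```python
import re

text = 'foo123bar'  # illustrative sample string

m = re.search(r'[0-9][0-9][0-9]', text)  # three consecutive digit characters
if m:                                    # a match object is truthy
    matched = m.group()                  # the matched substring
    start, end = m.span()                # start (inclusive) and end (exclusive) indices

miss = re.search(r'xyz', text)           # no match, so the result is None
```

Checking `if m:` before calling `m.group()` avoids an `AttributeError` when the search returns `None`.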

In [5]:
re.search



In [13]:
s = 'foo123bar'
print(re.search('123', s))
print(s[3:6])
print(re.search('[1-9][1-9][1-9]', s))

<re.Match object; span=(3, 6), match='123'>
123
<re.Match object; span=(3, 6), match='123'>


In [8]:
if re.search('123', s):
    print('Found a match.')
else:
    print('No match.')

Found a match.


### [ ]

- `[0-9a-fA-F]` matches any hexadecimal digit character.
- `[^0-9]` matches any character that isn’t a digit.
- To match a literal `^`, place it anywhere except the first position, or escape it with a backslash.
- To match a literal hyphen `-`, place it first or last, or escape it with a backslash.
- To match a literal `]`, place it first, or escape it with a backslash.
- Most other regex metacharacters, such as `.`, `*`, and `+`, lose their special meaning inside a character class.
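
A short sketch of these character-class rules, using made-up sample strings and variable names:

```python
import re

hex_digit = re.search(r'[0-9a-fA-F]', '--- c0ffee')  # first hexadecimal digit
non_digit = re.search(r'[^0-9]', '12345foo')         # first character that is not a digit
literal_dot = re.search(r'[.*]', 'a.b')              # '.' and '*' are literal inside a class
```

Note that `[.*]` skips over `a`: inside the brackets the dot no longer means "any character".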

In [14]:
print(re.search('[#:^]', 'foo^bar:baz#qux'))



In [15]:
print(re.search('[-abc]', '123-456'))
print(re.search('[abc-]', '123-456'))
print(re.search(r'[ab\-c]', '123-456'))  # raw string passes the backslash through to the regex engine






In [19]:
print(re.search('[]abc]', '12[3]456'))
print(re.search(r'[a\]bc]', '12[3]456'))  # raw string passes the backslash through to the regex engine





### \w \W
- `\w` matches any word character: uppercase and lowercase letters, digits, and the underscore (`_`). For ASCII text, `\w` is shorthand for `[a-zA-Z0-9_]`.
- `\W` is the opposite. It matches any non-word character and, for ASCII text, is equivalent to `[^a-zA-Z0-9_]`.

### \d \D
`\d` matches any decimal digit character. `\D` is the opposite. It matches any character that isn’t a decimal digit. `\d` is essentially equivalent to `[0-9]`, and `\D` is equivalent to `[^0-9]`.

### \s \S
`\s` matches any whitespace character, including the newline character `\n`. `\S` is the opposite of `\s`. It matches any character that isn’t whitespace.

`[\d\w\s]` matches any digit, word, or whitespace character.
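
To see the shorthand classes side by side, here is a small sketch on an invented sample string:

```python
import re

sample = 'id_42: ok'  # illustrative sample string

first_word_char = re.search(r'\w', sample).group()    # a letter, digit, or underscore
first_non_word = re.search(r'\W', sample).group()     # anything that \w does not match
first_digit = re.search(r'\d', sample).group()        # a decimal digit
first_space_index = re.search(r'\s', sample).start()  # position of the first whitespace
```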

In [22]:
print(re.search(r'\s', 'foo\nbar baz'))
print(re.search(r'\S', ' \n foo \n baz'))





## Escaping Metacharacters

### backslash `\`

- `\\` represents a literal backslash.
- `r'...'`: a raw string, which stops the Python interpreter from processing backslash escapes in the string literal. Prefer raw strings whenever a regex contains backslashes.
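
The difference is easy to check directly. The sketch below (variable names are illustrative) compares an ordinary and a raw string literal, then shows how the choice changes what the regex engine receives:

```python
import re

plain = '\b'   # one character: backspace (ASCII 8)
raw = r'\b'    # two characters: a backslash followed by 'b'

# With a raw string, the regex engine receives \b (word boundary) intact.
with_raw = re.search(r'\bfoo', 'a foo')
# Without it, the engine receives a backspace character, which is absent from the text.
without_raw = re.search('\bfoo', 'a foo')
```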

In [24]:
s = 'foo\bar'
print(s)
s = r'foo\bar'
print(s)

fooar
foo\bar


In [27]:
print(re.search('\\\\', s))  # the interpreter reduces '\\\\' to two backslashes, which the regex engine reads as one literal backslash
print(re.search(r'\\', s))





### Anchors


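- `^`, `\A`: start of the string.
- `$`, `\Z`: end of the string.
- `\b`: word boundary, the position between a word character (`\w`) and a non-word character or an end of the string. Use a raw string here, because `'\b'` in an ordinary string literal is a backspace character.
- `\B`: any position that is not a word boundary.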
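
A brief sketch of the anchors on a made-up two-line string (no flags are passed, so `^` and `$` refer to the whole string):

```python
import re

text = 'foo bar\nbaz'  # illustrative sample with an embedded newline

at_start = re.search(r'\Afoo', text)     # 'foo' begins the string
at_end = re.search(r'baz\Z', text)       # 'baz' ends the string
not_at_end = re.search(r'bar$', text)    # 'bar' is followed by an embedded newline, not the end
inside_word = re.search(r'\Ba\B', text)  # an 'a' with word characters on both sides
```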

In [95]:
print(re.search('^foo', 'foobar'))
print(re.search('^foo', 'barfoo'))
print(re.search('foo$', 'barfoo'))

print(re.search(r'\bfoo\b', '#foo.bar')) # do remember to use raw string
print(re.search(r'foo\b', 'foo.bar'))

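<re.Match object; span=(0, 3), match='foo'>
None
<re.Match object; span=(3, 6), match='foo'>
<re.Match object; span=(1, 4), match='foo'>
<re.Match object; span=(0, 3), match='foo'>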


### Quantifiers
A quantifier metacharacter immediately follows a portion of a `<regex>` and indicates how many times that portion must occur for the match to succeed.

Greedy: produce the longest possible match.
- `*`: zero or more
- `+`: one or more
- `?`: zero or one

Non-greedy (lazy) counterparts of the above, in the same order, produce the shortest possible match.
- `*?`
- `+?`
- `??`

Range quantifiers (do not put a space inside the `{}`):
- `{m}`: exactly m repetitions
- `{m,n}`: from m to n repetitions, greedy version.
- `{m,}`: m or more
- `{,n}`: from 0 to n
- `{,}`: 0 or more, the same as `*`
- `{}`: literal `{}`
- `{m,n}?`: non-greedy version of `{m,n}`.
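
The range quantifiers can be sketched as follows; the patterns and sample strings are illustrative:

```python
import re

exact = re.search(r'x-{3}x', 'x---x')  # exactly three hyphens between the x's
too_few = re.search(r'x-{3}x', 'x--x') # only two hyphens, so no match
greedy = re.search(r'a{2,4}', 'aaaaa') # takes as many repetitions as allowed
lazy = re.search(r'a{2,4}?', 'aaaaa')  # takes as few repetitions as allowed
literal = re.search(r'x{}y', 'x{}y')   # empty braces are matched literally
```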

In [39]:
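print(re.search('<.*>', '%<foo> <bar>%'))
print(re.search('<.*?>', '%<foo> <bar>%'))
print(re.search('<[^>]*>', '%<foo> <bar>%'))

print(re.search('<.+>', '%<foo> <bar>%'))
print(re.search('<.+?>', '%<foo> <bar>%'))

print(re.search('ba?', 'baaaa'))
print(re.search('ba??', 'baaaa'))

<re.Match object; span=(1, 12), match='<foo> <bar>'>
<re.Match object; span=(1, 6), match='<foo>'>
<re.Match object; span=(1, 6), match='<foo>'>
<re.Match object; span=(1, 12), match='<foo> <bar>'>
<re.Match object; span=(1, 6), match='<foo>'>
<re.Match object; span=(0, 2), match='ba'>
<re.Match object; span=(0, 1), match='b'>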


In [55]:
print(re.search('b[ac]{2,7}', 'baacaaac'))
print(re.search('b[ac]{2,7}?', 'baacaaac'))





### Grouping Constructs and Backreferences

- `()`: defines a group
- capture groups
- backreferences `\<n>`: treat the captured groups as variables and reuse them later in the `<regex>`. **Use a raw string.**
- named groups: `(?P<name><regex>)`. Refer to one with `(?P=name)` and extract it with `m.group('name')`.
- non-capturing group: `(?:<regex>)`. Use it when you need grouping but do not need to retrieve the group later.
- conditional match:
 - `(?(<n>)<yes-regex>|<no-regex>)`: uses a numbered reference
 - `(?(<name>)<yes-regex>|<no-regex>)`: uses a named reference
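
As a small sketch of named and non-capturing groups (the group names `year` and `month` and the sample strings are made up for illustration):

```python
import re

m = re.search(r'(?P<year>\d{4})-(?P<month>\d{2})', 'released 2021-07')
by_name = m.group('year')  # access a group by name
by_number = m.group(2)     # named groups are still numbered
as_dict = m.groupdict()    # all named groups as a dictionary

# A non-capturing group still groups, but contributes nothing to groups().
n = re.search(r'(?:ab)+(c)', 'ababc')
captured = n.groups()
```

`groupdict()` is handy when a pattern has several named groups and the result feeds straight into other code.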

In [6]:
m = re.search(r'(foo(bar)?)+(\d\d\d)?', 'foofoobar')
print(m)
print(m.groups())
print(m.group(1))
print(m.group(2))
print(m.group(3))

<re.Match object; span=(0, 9), match='foofoobar'>
('foobar', 'bar', None)
foobar
bar
None


In [59]:
m = re.search(r'(\w+),(\w+),(\w+)', 'foo,quux,baz')
print(m)
print(m.groups())
print(m.group(1))
print(m.group(2))
print(m.group(3))
print(m.group(0)) # the matched string

<re.Match object; span=(0, 12), match='foo,quux,baz'>
('foo', 'quux', 'baz')
foo
quux
baz
foo,quux,baz


In [8]:
regex = r'(\w+), \1'
m = re.search(regex, 'foo, foo')
print(m)
print(m.group(1))

m = re.search(regex, 'foo, bar')
print(m)

<re.Match object; span=(0, 8), match='foo, foo'>
foo
None


In [63]:
m = re.search(r'(?P<w1>\w+), (?:\w+), (?P<w2>\w+), (?P=w1), (?P=w2)', 'foo, test, bar, foo, bar, remaining')
print(m)
print(m.groups())
print(m.group('w2'))

<re.Match object; span=(0, 24), match='foo, test, bar, foo, bar'>
('foo', 'bar')
bar


In [66]:
regex = r'^(###)?foo(?(1)bar|baz)'
print(re.search(regex, '###foobar'))
print(re.search(regex, 'foobaz'))
print(re.search(regex, '#foobaz'))
print(re.search(regex, '#foobar'))

<re.Match object; span=(0, 9), match='###foobar'>
<re.Match object; span=(0, 6), match='foobaz'>
None
None


In [67]:
regex = r'^(?P<ch>\W+)?foo(?(ch)(?P=ch)|)$'
print(re.search(regex, '##foo##'))
print(re.search(regex, '#foo#'))
print(re.search(regex, 'foo'))
print(re.search(regex, 'foo#'))
print(re.search(regex, '##foo%'))

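<re.Match object; span=(0, 7), match='##foo##'>
<re.Match object; span=(0, 5), match='#foo#'>
<re.Match object; span=(0, 3), match='foo'>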
None
None


### Lookahead and lookbehind assertions

Similar to anchors, these assertions are of zero width.

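- `(?=<regex>)`: positive lookahead. Asserts that what follows the current parser position matches `<regex>`.
- `(?!<regex>)`: negative lookahead. Asserts that what follows the current parser position does not match `<regex>`.
- `(?<=<regex>)`: positive lookbehind. Asserts that what precedes the current parser position matches `<regex>`, which must be of fixed length.
- `(?<!<regex>)`: negative lookbehind. Asserts that what precedes the current parser position does not match `<regex>`, which must also be of fixed length.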
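
A minimal sketch of all four assertions, using invented sample strings. None of the assertions consumes characters, so the `$` and `bar` below are checked but never become part of the match:

```python
import re

price = re.search(r'(?<=\$)\d+', 'cost: $42')          # digits preceded by '$'
no_dollar = re.search(r'(?<!\$)\b\d+', 'qty 7, $42')   # digits not preceded by '$'
followed = re.search(r'foo(?=bar)', 'foobar')          # 'foo' only if 'bar' follows
not_followed = re.search(r'foo(?!bar)', 'foobar')      # fails because 'bar' does follow
```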

In [93]:
print(re.search(r'foo(?=\w)', 'foob1z'))
print(re.search(r'foo(?!\w)', 'foo@23'))
print(re.search(r'(?<=\W)foo', '#foob1z'))
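print(re.search(r'(?<!\W)foo', 'foob1z'))  # negative lookbehind; the input string here is an assumed example

### Comments and alternation

- `(?#<comment>)`: comment, ignored by the regex engine
- `<regex1>|<regex2>|...`: alternation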







In [9]:
print(re.search('foo(?#this is a comment)bar', 'foobar123'))
print(re.search('[0-9]+|(foo|bar|baz)*', '9032'))
print(re.search('[0-9]+|(foo|bar|baz)*', 'foobarfoo'))






### Flags

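- `re.I`: `re.IGNORECASE`, case-insensitive matching.
- `re.M`: `re.MULTILINE`, makes `^` and `$` also match at the start and end of each line, not only of the whole string.
- `re.S`: `re.DOTALL`, lets `.` match a newline.
- `re.X`: `re.VERBOSE`, ignores whitespace and `#` comments in the pattern, to make the regex more human-friendly. Use `r''' '''`.
- `re.DEBUG`: shows debug information about the compiled regex.
- character set used by `\w`, `\d`, `\s`, `\b` and by case-insensitive matching
 - `re.A`: `re.ASCII`, ASCII-only matching
 - `re.U`: `re.UNICODE`, Unicode matching (the default for `str` patterns in Python 3)
 - `re.L`: `re.LOCALE`, matching according to your current locale
- `|`: combines multiple flags.
- `(?<flags>)`, where `<flags>` is one or more of `imsxauL`: sets flags for the whole regex; it must appear at the beginning.
- `(?<set_flags>-<remove_flags>:<regex>)`: sets and removes flags for `<regex>` only.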
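
The three ways of applying flags can be compared in a short sketch. The sample text is made up, and the scoped form `(?i:...)` assumes Python 3.6 or later:

```python
import re

text = 'First\nsecond LINE'  # illustrative two-line sample

combined = re.search(r'^second line$', text, re.I | re.M)  # flags passed as an argument
inline = re.search(r'(?im)^second line$', text)            # the same flags set inside the pattern
scoped = re.search(r'(?i:second) LINE', text)              # case-insensitive for 'second' only
scoped_miss = re.search(r'(?i:second) line', text)         # 'line' stays case-sensitive
```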

In [86]:
print(re.search('^foo', 'FoObar', re.I|re.DEBUG))
print(re.search(r'''
 ^ # start of the regex
 (\(\d{3}\))? # optional three-digit area code
 (\s)* # optional whitespace
 \d{3} # three-digit prefix
 [-.] # separator
 \d{4} # four-digit line number
 $ # end of the regex
''', '(123) 234-3427', re.X))

AT AT_BEGINNING
LITERAL 102
LITERAL 111
LITERAL 111

 0. INFO 4 0b0 3 3 (to 5)
 5: AT BEGINNING
 7. LITERAL_UNI_IGNORE 0x66 ('f')
 9. LITERAL_UNI_IGNORE 0x6f ('o')
11. LITERAL_UNI_IGNORE 0x6f ('o')
13. SUCCESS




In [85]:
print(re.search('^bar.baz', 'FoO\nbAr\nbaZ', re.I|re.M|re.S))
print(re.search('(?ims)^bar.baz', 'FoO\nbAr\nbaZ'))
print(re.search('(?ims)^bar.(?-i:baz)', 'FoO\nbAr\nbaZ'))

<re.Match object; span=(4, 11), match='bAr\nbaZ'>
<re.Match object; span=(4, 11), match='bAr\nbaZ'>
None


### Summary of `?`

- outside `()`
 - following `*`, `+`, `?`, `{m,n}`: non-greedy version
 - following `<regex>`: zero or one repetition
- inside `()`: serves as a magic prefix
 - `(?P<name><regex>)`: named group; `(?P=name)` references it
 - `(?:<regex>)`: non-capturing group
 - `(?#<comment>)`: comment
 - `(?(<n>)<yes-regex>|<no-regex>)`, `(?(<name>)<yes-regex>|<no-regex>)`: conditional match on a numbered or a named group
 - `(?=<regex>)`, `(?!<regex>)`, `(?<=<regex>)`, `(?<!<regex>)`: lookahead and lookbehind assertions
 - `(?<flags>)`: `<flags>` can be any of `imsxauL`; sets flags for the entire regex
 - `(?<set_flags>-<remove_flags>:<regex>)`: sets and removes flags for that portion of the regex