{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "Table of Contents" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path so we can change back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', '..', 'notebook_format'))\n", "\n", "from formats import load_style\n", "load_style(plot_style=False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ethen 2017-10-31 17:04:05 \n", "\n", "CPython 3.5.2\n", "IPython 6.2.1\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "# magic to print version\n", "%load_ext watermark\n", "%watermark -a 'Ethen' -d -t -v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Strings and Text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the material here is a condensed reimplementation of the Python3 Cookbook, Chapter 2: Strings and Text, which was originally freely available online." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Splitting Strings on Any of Multiple Delimiters Using re.split\n", "\n", "The pattern below splits on a semicolon (;), a comma (,), or a whitespace character, each followed by any amount of additional whitespace." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "line = 'asdf fjdk; afed, fjek,asdf, foo'\n", "\n", "re.split(r'[;,\\s]\\s*', line)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching Text at the Start or End of a String\n", "\n", "Use the `str.startswith()` or `str.endswith()` method." 
] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['foo.c', 'spam.c', 'spam.h']\n", "True\n" ] } ], "source": [ "filenames = ['Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h']\n", "\n", "# pass a tuple to test several suffixes at once; it must be a tuple, a list won't work\n", "print([name for name in filenames if name.endswith(('.c', '.h'))])\n", "\n", "print(any(name.endswith('.py') for name in filenames))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching Strings with Wildcard Patterns Using fnmatchcase\n", "\n", "`fnmatchcase()` compares case-sensitively, so the pattern `'* ST'` does not match an address ending in a lowercase `st`." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['5412 N CLARK ST', '1060 W ADDISON ST']" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from fnmatch import fnmatchcase\n", "\n", "addresses = [\n", "    '5412 N CLARK ST',\n", "    '1060 W ADDISON ST',\n", "    '1039 W GRANVILLE AVE',\n", "    '2122 N CLARK st',\n", "    '4802 N BROADWAY']\n", "\n", "[addr for addr in addresses if fnmatchcase(addr, '* ST')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Matching and Searching for Text Patterns\n", "\n", "**Example1:** Find the position of the first occurrence of a simple substring using `str.find()`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'yeah, but no, but yeah, but no, but yeah'\n", "text.find('no')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example2:** When matching the same complex pattern many times, it's better to precompile the regular expression first using `re.compile()`." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "yes\n", "no\n", "yes\n", "no\n" ] } ], "source": [ "import re\n", "\n", "text1 = '11/27/2012'\n", "text2 = 'Nov 27, 2012'\n", "\n", "# Simple matching: \\d+ means match one or more digits\n", "# the 'r' prefix marks a raw string, which leaves the backslash (\\)\n", "# uninterpreted; otherwise you'd have to write \\\\ for every backslash in the pattern\n", "if re.match(r'\\d+/\\d+/\\d+', text1):\n", "    print('yes')\n", "else:\n", "    print('no')\n", "\n", "if re.match(r'\\d+/\\d+/\\d+', text2):\n", "    print('yes')\n", "else:\n", "    print('no')\n", "\n", "\n", "# the re.compile version\n", "datepat = re.compile(r'\\d+/\\d+/\\d+')\n", "if datepat.match(text1):\n", "    print('yes')\n", "else:\n", "    print('no')\n", "\n", "if datepat.match(text2):\n", "    print('yes')\n", "else:\n", "    print('no')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example3:** Find all occurrences in the text, instead of just the first one, with `findall()`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['11/27/2012', '3/13/2013']" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'\n", "datepat.findall(text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example4:** Capture groups by enclosing parts of the pattern in parentheses." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('11', '27', '2012')\n", "11\n" ] } ], "source": [ "# single match\n", "datepat = re.compile(r'(\\d+)/(\\d+)/(\\d+)')\n", "m = datepat.match('11/27/2012')\n", "print(m.groups())\n", "print(m.group(1))" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('11', '27', '2012'), ('3', '13', '2013')]\n", "[('11', '27', '2012'), ('3', '13', '2013')]\n", "2012-11-27\n", "2013-3-13\n" ] } ], "source": [ "# multiple matches\n", "text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'\n", "\n", "print(datepat.findall(text))\n", "print(re.findall(r'(\\d+)/(\\d+)/(\\d+)', text)) # module-level call, fine for a one-off match\n", "\n", "for month, day, year in datepat.findall(text):\n", "    print('{}-{}-{}'.format(year, month, day))" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('11', '27', '2012')\n", "('3', '13', '2013')\n" ] } ], "source": [ "# return an iterator of match objects instead of a list\n", "for m in datepat.finditer(text):\n", "    print(m.groups())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Searching and Replacing Text\n", "\n", "**Example1:** Replace a simple literal substring using `str.replace()`." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'yep, but no, but yep, but no, but yep'" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'yeah, but no, but yeah, but no, but yeah'\n", "text.replace('yeah', 'yep')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example2:** More complex replacement using `re.sub()`. The backslashed digits, such as `\\3`, refer to the captured groups." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Today is 2012-11-27. PyCon starts 2013-3-13.'" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "\n", "# rewrite dates from m/d/Y to Y-m-d\n", "text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'\n", "re.sub(r'(\\d+)/(\\d+)/(\\d+)', r'\\3-\\1-\\2', text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example3:** Define a function for the substitution." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import re\n", "from calendar import month_abbr\n", "\n", "\n", "def change_date(m):\n", "    # receives the match object and returns the replacement text\n", "    mon_name = month_abbr[int(m.group(1))]\n", "    return '{} {} {}'.format(m.group(2), mon_name, m.group(3))\n", "\n", "\n", "datepat = re.compile(r'(\\d+)/(\\d+)/(\\d+)')\n", "datepat.sub(change_date, text)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example4:** Use `.subn()` to replace and also return the number of substitutions made." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Today is 2012-11-27. PyCon starts 2013-3-13.\n", "2\n" ] } ], "source": [ "newtext, n = datepat.subn(r'\\3-\\1-\\2', text)\n", "print(newtext)\n", "print(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example5:** Supply the `re.IGNORECASE` flag if you want to ignore case." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['PYTHON', 'python', 'Python']" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text = 'UPPER PYTHON, lower python, Mixed Python'\n", "re.findall('python', text, flags=re.IGNORECASE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stripping Unwanted Characters from Strings Using strip\n", "\n", "To remove unwanted characters from the beginning and end of a string, use `str.strip()`. Use `str.lstrip()` or `str.rstrip()` to strip only the left or only the right side." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hello world\n", "hello world\n" ] } ], "source": [ "# white space stripping\n", "s = ' hello world \\n'\n", "print(s.strip())\n", "\n", "# character stripping\n", "t = '-----hello world====='\n", "print(t.strip('-='))" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generator Expression can be useful when you want to perform other operations after stripping\n" ] } ], "source": [ "\"\"\"\n", "with open(filename) as f:\n", "    lines = (line.strip() for line in f)\n", "    for line in lines:\n", "\"\"\"\n", "print('Generator Expression can be useful when you want to perform other operations after stripping')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Character to Character Mapping Using translate\n", "\n", "The method `str.translate()` returns a copy of the string in which each character has been mapped through a translation table built beforehand with `str.maketrans()`." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "th3s 3s str3ng 2x1mpl2....w4w!!!\n" ] } ], "source": [ "intab = 'aeiou'\n", "outtab = '12345'\n", "\n", "# maps each character of intab to the one at the same position in outtab: a -> 1, e -> 2, ...\n", "trantab = str.maketrans(intab, outtab)\n", "\n", "# avoid naming the variable `str`, which would shadow the built-in type\n", "text = 'this is string example....wow!!!'\n", "print(text.translate(trantab))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Combining and Concatenating Strings\n", "\n", "**Example1:** Use `.join()` when the strings you wish to combine are in a sequence." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Chicago Not Chicago?\n" ] } ], "source": [ "parts = ['Is', 'Chicago', 'Not', 'Chicago?']\n", "print(' '.join(parts))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Example2:** Don't use the `+` operator when it's unnecessary; when printing, `print()` can join its arguments with `sep`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Is Chicago Not Chicago?\n", "Is Chicago Not Chicago?\n" ] } ], "source": [ "a = 'Is Chicago'\n", "b = 'Not Chicago?'\n", "print(a + ' ' + b)\n", "print(a, b, sep=' ')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## String Formatting" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'Guido has 37 messages.'" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "s = '{name} has {n} messages.'\n", "s.format(name='Guido', n=37)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reformatting Text to a Fixed Number of Columns Using textwrap\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Look into my eyes, look into my eyes,\n", "the eyes, the eyes, the eyes, not around\n", "the eyes, don't look around the eyes,\n", "look into my eyes, you're under.\n", "90\n" ] } ], "source": [ "import os\n", "import textwrap\n", "\n", "s = \"Look into my eyes, look into my eyes, the eyes, the eyes, \\\n", "the eyes, not around the eyes, don't look around the eyes, \\\n", "look into my eyes, you're under.\"\n", "\n", "print(textwrap.fill(s, 40))\n", "\n", "# to wrap to the terminal width instead, look up the number of columns\n", "print(os.get_terminal_size().columns)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" }, "toc": { "nav_menu": { "height": "314px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }