{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# from IPython.display import Image\n", "from IPython.display import clear_output\n", "from IPython.display import FileLink, FileLinks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction to\n", "\n", "![title](img/python-logo-master-flat.png)\n", "\n", "### with Application to Bioinformatics\n", "\n", "#### - Day 5" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Review\n", "\n", "- Dictionaries\n", "    - Create a dictionary containing the keys `a` and `b`. Both should have the value 1.\n", "    - Change the value of `b` to 5.\n", "- Lists\n", "    - Create a list containing the elements `'a'`, `'b'`, `'c'`.\n", "    - Reverse it.\n", "- Set the variable `title` to `\"A movie\"` and `rating` to `10`.\n", "    - Use formatting to produce the following string:\n", "\n", "      `\"The movie A movie got rating 10!\"`" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Create a dictionary containing the keys a and b. Both should have the value 1" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "# Change the value of b to 5" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Create a list containing the elements `'a'`, `'b'`, `'c'`" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Reverse it" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "# Set the variable `title` to `\"A movie\"` and `rating` to 10." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Use formatting to produce: \"The movie A movie got rating 10!\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### TODAY\n", "\n", "- review\n", "- regex\n", "- sum up" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Review Day 4\n", "\n", "- More control!\n", "\n", "    - `None`\n", "    - `break`, `continue`\n", "    - keyword arguments\n", "    - documentation, comments...\n", "\n", "- Pandas\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Control loops\n", "\n", "- `break` a loop => stop it\n", "\n", "
\n", "\"break\"\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Control loops\n", "\n", "- `continue` => go on to the next iteration\n", "\n", "
\n", "\"continue\"\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Control statements\n", "\n", "- `pass` => do nothing\n", "\n", "```py\n", "for line in file:\n", "    if len(line) > 40:\n", "        # TODO find out what to do here\n", "        pass\n", "    do_something(line)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### None\n", "\n", "`None` means \"nothing\". It is not the same as `False`, but it counts as false in a condition, so test for it explicitly.\n", "\n", "```py\n", "if value is not None:\n", "    print('value is not None')\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Keyword arguments\n", "\n", "```py\n", "open(filename, encoding=\"utf-8\")\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```py\n", "open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)\n", "```\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Keyword arguments\n", "\n", "- programmer: set default values\n", "- user: ignore parameters\n", "- better overview" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Using code\n", "\n", "- Python standard modules\n", "- Your colleague's code\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "- `import pandas`\n", "- `import matplotlib.pyplot as plt`\n", "- `from matplotlib.pyplot import show`\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Documentation and getting help\n", "\n", "- `help(sys)`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- write comments `# why do I do this?`\n", "- write documentation `\"\"\"what is this? 
how do you use it?\"\"\"`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Writing readable code" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```py\n", "def f(a, b):\n", "    for c in open(a):\n", "        if c.startswith(b):\n", "            print(c)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "==>" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "```py\n", "def print_lines(filename, start):\n", "    \"\"\"Print all lines in the file that start with the given string.\"\"\"\n", "    for line in open(filename):\n", "        if line.startswith(start):\n", "            print(line)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "**
Care about the names of your variables and functions
**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Pandas\n", "\n", "- Read tables\n", "```py\n", "dataframe = pandas.read_table('mydata.txt', sep='|', index_col=0)\n", "dataframe = pandas.read_csv('mydata.csv')\n", "```\n", "\n", "- Select rows and columns\n", "```py\n", "dataframe.columnname\n", "dataframe.loc[index]\n", "dataframe.loc[dataframe.age == 20]\n", "```\n", "\n", "- Plot it\n", "```py\n", "dataframe.plot(kind='line', x='column1', y='column2')\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## TODAY\n", "\n", "- Regular expressions\n", "- Sum up of the course" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expressions\n", "\n", "


\n", "- **A smarter way of searching text**\n", "\n", "- **search & replace**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- A formal language for defining search patterns" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Lets you search not only for exact strings but also for controlled variations of a string." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Why?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Examples:\n", "    - Find variations in a protein or DNA sequence\n", "        - `\"MVR???A\"`\n", "        - `\"ATG???TAG\"`\n", "    - American/British spelling, endings and other variants:\n", "        - salpeter, salpetre, saltpeter, nitre, niter or KNO3\n", "        - hemaglobin, heamoglobin, hemaglobins, heamoglobin's\n", "        - catalyze, catalyse, catalyzed...\n", "    - A pattern in a vcf file\n", "        - a digit appearing after a tab" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expressions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- When?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "- To find information\n", "    - in your `vcf` or `fasta` files\n", "    - in your code\n", "    - in your next essay\n", "    - in a database\n", "    - online\n", "    - in a bunch of articles\n", "    - ..."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ " - Search/replace\n", " - becuase → because\n", " - color → colour\n", " - `\\t` (tab) → `\" \"` (four spaces)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Supported by most programming languages, text editors, search engines..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Defining a search pattern" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\"regex\"\n", "\"regex\"\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Common operations\n", "- `.` matches any character (once)\n", "- `?` repeat previous pattern 0 or 1 times\n", "- `*` repeat previous pattern 0 or more times\n", "- `+` repeat previous pattern 1 or more times" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`colour.*`\n", "\n", "`salt?peter`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`.*` matches everything (including the empty string)!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "
\"salt?pet..\"
\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
saltpeter
\n", "\n", "\n", "
\"saltpet88\"
\n", "
\"salpetin\"
\n", "
\"saltpet \"
\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### More common operations - classes of characters\n", "\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### More common operations - classes of characters\n", "\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`\\w+`\n", "\n", "![result](img/regex_w.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### More common operations - classes of characters\n", "\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`\\d+`\n", "\n", "![result](img/regex_d.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### More common operations - classes of characters\n", "\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`\\s+`\n", "\n", "![result](img/regex_s.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### More common operations - classes 
of characters\n", "\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace\n", "- `[abc]` matches a single character defined in this set {a, b, c}\n", "- `[^abc]` matches a single character that is **not** a, b or c" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### `[a-z]` matches any single lowercase letter between `a` and `z` (the English alphabet).\n", "\n", "#### `[a-z]+` matches any lowercase English word." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
salt?pet[er]+\n", "
\n", "\n", "
saltpeter
\n", "
salpetre
\n", "\n", "
\"saltpet88\"
\n", "
\"salpetin\"
\n", "
\"saltpet \"
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Example - finding patterns in vcf**\n", "\n", "\n", "1\t920760\trs80259304\tT\tC\t.\tPASS\tAA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131\tGT:DP:CB\t0/1:1:SM 0/0:4/SM...\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Find a sample:\n", "\n", "`0/0` `0/1` `1/1` ..." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`\"[01]/[01]\"` (or `\"\\d/\\d\"`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```\\s[01]/[01]:```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Example - finding patterns in vcf**\n", "\n", "\n", "1\t920760\trs80259304\tT\tC\t.\tPASS\tAA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131\tGT:DP:CB\t0/1:1:SM 0/0:4/SM...\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "- Find all lines containing more than one homozygous sample." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`... 1/1:... ... 1/1:...
...`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```.*1/1.*1/1.*```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```.*\\s1/1:.*\\s1/1:.*```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercise 1\n", "\n", "\n", "- `.` matches any character (once)\n", "- `?` repeat previous pattern 0 or 1 times\n", "- `*` repeat previous pattern 0 or more times\n", "- `+` repeat previous pattern 1 or more times\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace\n", "- `[abc]` matches a single character defined in this set {a, b, c}\n", "- `[^abc]` matches a single character that is **not** a, b or c\n", "- `[a-z]` matches any (lowercased) letter from the english alphabet\n", "- `.*` matches anything\n", "\n", "→ Notebook Day_5_Exercise_1 (~30 minutes) " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Regular expressions in Python" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import re" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "p = re.compile('ab*')\n", "p" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Searching" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "p = re.compile('ab*')\n", "\n", "p.search('abc')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "print(p.search('cb'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "p = re.compile('HELLO')\n", "m = p.search('gsdfgsdfgs HELLO __!@£§≈[|ÅÄÖ‚…’fi]')\n", "\n", "print(m)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Case insensitiveness" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "p = re.compile('[a-z]+')\n", "result = p.search('ATGAAA')\n", "print(result)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "p = re.compile('[a-z]+', re.IGNORECASE)\n", "\n", "result = p.search('ATGAAA')\n", "result" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The match object" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "result = p.search('123 ATGAAA 456')\n", "result" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`result.group()`: Return the string matched by the expression\n", "\n", "`result.start()`: Return the starting position of the match\n", "\n", "`result.end()`: Return the ending position of the match\n", "\n", "`result.span()`: Return both (start, end)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "result.group()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "result.start()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ 
"result.end()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "result.span()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Zero or more...?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "p = re.compile('.*HELLO.*')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "m = p.search('lots of text HELLO more text and characters!!! ^^')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "m.group()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `*` is **greedy**." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Finding all the matching patterns" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "p = re.compile('HELLO')\n", "objects = p.finditer('lots of text HELLO more text HELLO ... and characters!!! ^^')\n", "print(objects)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "for m in objects:\n", " print(f'Found {m.group()} at position {m.start()}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "objects = p.finditer('lots of text HELLO more text HELLO ... and characters!!! ^^')\n", "for m in objects:\n", " print('Found {} at position {}'.format(m.group(), m.start()))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### How to find a full stop?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "txt = \"The first full stop is here: .\"\n", "p = re.compile('.')\n", "\n", "m = p.search(txt)\n", "print('\"{}\" at position {}'.format(m.group(), m.start()))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "p = re.compile(r'\\.')  # raw string, so the backslash reaches the regex engine\n", "\n", "m = p.search(txt)\n", "print('\"{}\" at position {}'.format(m.group(), m.start()))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### More operations\n", "- `\\` escapes the next character\n", "- `^` beginning of the string\n", "- `$` end of the string\n", "- `|` boolean `or`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`^hello$`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "
\n", " salt?pet(er|re) | nit(er|re) | KNO3\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Substitution" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "#### Finally, we can fix our spelling mistakes!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "txt = \"Do it becuase I say so, not becuase you want!\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import re\n", "p = re.compile('becuase')\n", "txt = p.sub('because', txt)\n", "print(txt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "p = re.compile(r'\\s+')  # one or more whitespace characters\n", "p.sub(' ', txt)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Overview" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ " - Construct regular expressions\n", " \n", " ```py\n", " p = re.compile(pattern)\n", " ```\n", " \n", " - Searching\n", " \n", " ```py\n", " p.search(text)\n", " ```\n", " \n", " - Substitution\n", " \n", " ```py\n", " p.sub(replacement, text)\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Typical code structure:**\n", "\n", "```python\n", "p = re.compile( ... )\n", "m = p.search('string goes here')\n", "if m:\n", "    print('Match found: ', m.group())\n", "else:\n", "    print('No match')\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Regular expressions\n", "\n", "\n", "- A powerful tool to search and modify text\n", "\n", "- There is much more to read in the [docs](https://docs.python.org/3/library/re.html)\n", "\n", "- Note: regex comes in different flavours.
If you use it outside Python, there might be small variations in the syntax." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercise 2\n", " \n", "- `.` matches any character (once)\n", "- `?` repeat previous pattern 0 or 1 times\n", "- `*` repeat previous pattern 0 or more times\n", "- `+` repeat previous pattern 1 or more times\n", "- `\\w` matches any letter or number, and the underscore\n", "- `\\d` matches any digit\n", "- `\\D` matches any non-digit\n", "- `\\s` matches any whitespace (spaces, tabs, ...)\n", "- `\\S` matches any non-whitespace\n", "- `[abc]` matches a single character defined in this set {a, b, c}\n", "- `[^abc]` matches a single character that is **not** a, b or c\n", "- `[a-z]` matches any (lowercased) letter from the english alphabet\n", "- `.*` matches anything\n", "- `\\` escaping a character\n", "- `^` beginning of the string\n", "- `$` end of string\n", "- `|` boolean `or`\n", " \n", "Read more: full documentation https://docs.python.org/3.6/library/re.html\n", "
\n", "→ Notebook Day_5_Exercise_2 (~30 minutes) \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Sum up!

" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Processing files - looping through the lines\n", "\n", "```py\n", "\n", "for line in open('myfile.txt', 'r'):\n", " do_stuff(line)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Store values\n", "\n", "```py\n", "iterations = 0\n", "information = []\n", "\n", "for line in open('myfile.txt', 'r'):\n", " iterations += 1\n", " information += do_stuff(line)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Values\n", "\n", "- Base types:\n", "\n", " - ```py\n", " str \"hello\"```\n", " - ```py\n", " int 5```\n", " - ```py\n", " float 5.2```\n", " - ```py\n", " bool True```\n", " \n", "- Collections:\n", "\n", " - ```py\n", " list [\"a\", \"b\", \"c\"]```\n", " - ```py\n", " dict {\"a\": \"alligator\", \"b\": \"bear\", \"c\": \"cat\"}```\n", " - ```py\n", " tuple (\"this\", \"that\")```\n", " - ```py\n", " set {\"drama\", \"sci-fi\"}```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Assign values**\n", "```py\n", "iterations = 0\n", "score = 5.2\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "#### Modify values and compare\n", "\n", "- ```py\n", "+, -, *,... 
# mathematical```\n", "- ```py\n", "and, or, not # logical ```\n", "- ```py\n", "==, != # comparisons```\n", "- ```py\n", "<, >, <=, >= # comparisons```\n", "- ```py\n", "in # membership\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "value = 4\n", "nextvalue = 1\n", "nextvalue += value\n", "print('nextvalue: ', nextvalue, 'value: ', value)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "x = 5\n", "y = 7\n", "z = 2\n", "x > 6 and y == 7 or z > 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "(x > 6 and y == 7) or z > 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Strings\n", "\n", "Raw text\n", "\n", "- Common manipulations:\n", " \n", " - ```py\n", " s.strip() # remove leading and trailing whitespace```\n", " - ```py\n", " s.split() # split line into columns```\n", " - ```py\n", " s.upper(), s.lower() # change the case```\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\n", "- Regular expressions help you find and replace strings.\n", "\n", " - ```py\n", " p = re.compile('A.A.A')\n", " p.search(dnastring)\n", " ```\n", " - ```py\n", " p = re.compile('T')\n", " p.sub('U', dnastring)\n", " ```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "import re\n", "\n", "p = re.compile(r'p.*\\sp') # the greedy star!\n", "\n", "p.search('a python programmer writes python code').group()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Collections\n", "\n", "Can contain strings, integers, booleans...\n", "- **Mutable**: you can *add*, *remove*,
*change* values\n", "\n", "    - Lists:\n", "      ```py\n", "      mylist.append('value')\n", "      ```\n", "\n", "    - Dicts:\n", "      ```py\n", "      mydict['key'] = 'value'\n", "      ```\n", "\n", "    - Sets:\n", "      ```py\n", "      myset.add('value')\n", "      ```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Collections\n", "\n", "- Test for membership:\n", "```py\n", "value in myobj\n", "```\n", "\n", "- Check size:\n", "```py\n", "len(myobj)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Lists\n", "\n", "- Ordered!\n", "\n", "```py\n", "todolist = [\"work\", \"sleep\", \"eat\", \"work\"]\n", "\n", "todolist.sort()\n", "todolist.reverse()\n", "todolist[2]\n", "todolist[-1]\n", "todolist[2:6]\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "todolist = [\"work\", \"sleep\", \"eat\", \"work\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "todolist.sort()\n", "print(todolist)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "todolist.reverse()\n", "print(todolist)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "todolist[2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "todolist[-1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "todolist[2:]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Dictionaries\n", "\n", "- Keys have values\n", "\n", "```py\n", "mydict = {\"a\": \"alligator\", \"b\": \"bear\", \"c\": \"cat\"}\n", "counter = {\"cats\": 55, \"dogs\": 8}\n", "\n", "mydict[\"a\"]\n", "mydict.keys()\n", "mydict.values()\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": null,
"metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "counter = {'cats': 0, 'others': 0}\n", "\n", "for animal in ['zebra', 'cat', 'dog', 'cat']:\n", "    if animal == 'cat':\n", "        counter['cats'] += 1\n", "    else:\n", "        counter['others'] += 1\n", "\n", "counter" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Sets\n", "\n", "- Bag of values\n", "\n", "    - No order\n", "\n", "    - No duplicates\n", "\n", "    - Fast membership checks\n", "\n", "    - Logical set operations (union, difference, intersection...)\n", "\n", "\n", "```py\n", "myset = {\"drama\", \"sci-fi\"}\n", "\n", "myset.add(\"comedy\")\n", "\n", "myset.remove(\"drama\")\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "todolist = [\"work\", \"sleep\", \"eat\", \"work\"]\n", "\n", "todo_items = set(todolist)\n", "todo_items" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "todo_items.add(\"study\")\n", "todo_items" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "todo_items.add(\"eat\")\n", "todo_items" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Strings\n", "\n", "- Work like a list of characters\n", " - ```py\n", " s += \"more words\" # add content```\n", " - ```py\n", " s[4] # get character at index 4```\n", " - ```py\n", " 'e' in s # check for membership```\n", " \n", " - ```py\n", " len(s) # check size```\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source":
[ "- But they are immutable\n", "\n", " - ```py\n", " > s[2] = 'i'```\n", " ---\n", " ```py\n", " Traceback (most recent call last):\n", " File \"<stdin>\", line 1, in <module>\n", " TypeError: 'str' object does not support item assignment\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Tuples\n", "\n", "- A group (usually two) of values that belong together\n", " \n", " - ```py\n", "tup = (max_length, sequence)\n", "```\n", " - An ordered sequence (like lists)\n", "\n", " - ```py\n", "length = tup[0] # get content at index 0\n", "```\n", " - Immutable" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "tup = (2, 'xy')\n", "tup[0]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tup[0] = 2  # raises TypeError: tuples are immutable" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "```py\n", "def find_longest_seq(file):\n", "    # some code here...\n", "    return length, sequence\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```py\n", "answer = find_longest_seq(filepath)\n", "print('length', answer[0])\n", "print('sequence', answer[1])\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "```py\n", "answer = find_longest_seq(filepath)\n", "length, sequence = find_longest_seq(filepath)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Deciding what to do\n", "\n", "```py\n", "if count > 10:\n", "    print('big')\n", "elif count > 5:\n", "    print('medium')\n", "else:\n", "    print('small')\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "shopping_list = ['bread', 'egg', 'butter', 'milk']\n",
"tired = True\n", "\n", "if len(shopping_list) > 4:\n", " print('Really need to go shopping!')\n", "elif not tired:\n", " print('Not tired? Then go shopping!')\n", "else:\n", " print('Better to stay at home') " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Deciding what to do - if statement\n", "\n", "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Program flow - for loops\n", "\n", "```py\n", "information = []\n", "\n", "for line in open('myfile.txt', 'r'):\n", " if is_comment(line):\n", " use_comment(line)\n", " else:\n", " information = read_data(line)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Program flow - while loops\n", "\n", "```py\n", "keep_going = True\n", "information = []\n", "index = 0\n", "\n", "while keep_going:\n", " current_line = lines[index]\n", " information += read_line(current_line)\n", " index += 1\n", " if check_something(current_line):\n", " keep_going = False\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Different types of loops\n", "\n", "__`For` loop__\n", "\n", "is a control flow statement that performs operations over a known amount of steps.\n", "\n", "__`While` loop__\n", "\n", "is a control flow statement that allows code to be executed repeatedly based on a given Boolean condition.\n", "\n", "

\n", "\n", "__Which one to use?__\n", "\n", "`For` loops - standard for iterations over lists and other iterable objects\n", "\n", "`While` loops - more flexible and can iterate an unspecified number of times\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "user_input = \"thank god it's friday\"\n", "for c in user_input:\n", " print(c.upper())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "i = 0\n", "while i < len(user_input):\n", " c = user_input[i]\n", " print(c.upper())\n", " i += 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "i = 0\n", "go_on = True\n", "while go_on:\n", " c = user_input[i]\n", " print(c.upper())\n", " i += 1\n", " if c == 'd':\n", " go_on = False" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Controlling loops\n", "\n", "- `break` - stop the loop\n", "- `continue` - go on to the next iteration" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [], "source": [ "user_input = \"thank god it's friday\"\n", "for c in user_input:\n", " print(c.upper())\n", " if c == 'd':\n", " break" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "i = 0\n", "while True: \n", " c = user_input[i]\n", " i += 1\n", " if c in 'aoueiy':\n", " continue\n", " print(c.upper())\n", " if c == 'd':\n", " break" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "**Watch out!**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "i = 0\n", "while i > 10: \n", " print(user_input[i])" ] }, { "cell_type": 
"markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "While loops may be infinite!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Input/Output\n", "\n", "- In:\n", "\n", " - Read files: `fh = open(filename, 'r')`\n", "\n", " - `for line in fh:`\n", " - `fh.read()`\n", " - `fh.readlines()`\n", " - Read information from command line: `sys.argv[1:]`\n", "\n", "- Out:\n", "\n", " - Write files: `fh = open(filename, 'w')`\n", " - `fh.write(text)`\n", " - Printing: `print('my_information')`\n", " \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Input/Output\n", "\n", "- Open files should be closed:\n", " - `fh.close()`\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Formatting" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "
\n",
    "{[field_name] [: format_spec]}\n",
    "
\n", "\n", "*Format_spec:*\n", "
\n",
    "filling alignment width precision type\n",
    "   -       >        10    .2       f\n",
    "
\n", "\n", "```py\n", " print('|{:30}|{:^10}|{:^10.2f}|'.format(movie, votes, total/votes))\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Code structure\n", "\n", "- Functions\n", "- Modules" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Functions\n", "\n", "- A named piece of code that performs a certain task.\n", "\n", "\"Drawing\" \n", "\n", "\n", "- Is given a number of input arguments\n", " - to be used (are in scope) within the function body\n", "- Returns a result (maybe `None`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Functions - keyword arguments\n", "\n", "```py\n", "def prettyprinter(name, value, delim=\":\", end=None):\n", " out = \"The \" + name + \" is \" + delim + \" \" + value\n", " if end:\n", " out += end\n", " return out\n", "```\n", "\n", "\n", "- used to set default values (often `None`)\n", "- can be skipped in function calls\n", "- improve readability" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Files (modules)\n", "\n", "- A (larger) piece of code containing functions, classes...\n", "- Can be imported\n", "\n", "```py\n", "import mymodule\n", "```\n", "- Can be run as a script\n", "\n", "```> python3 mymodule.py```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Using your code\n", "\n", "Any longer pieces of code that have been used and will be re-used should be saved\n", "\n", "- Save it as a file ` .py`\n", "\n", "- To run it:\n", "`python3 mycode.py`\n", "\n", "- Import it:\n", "`import mycode`\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Documentation and comments\n", "\n", "- ```py\n", "\"\"\" This is a doc-string explaining what the purpose of this 
function/module is.\"\"\"```\n", "- ```py\n", "# This is a comment that helps the reader understand the code\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Comments *will* help you" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Undocumented code rarely gets used" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Try to keep your code readable: use informative variable and function names" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
\n", "\"Module\"/\n", "
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Why programming?\n", "\n", "Endless possibilities! \n", "- reverse complement DNA\n", "- custom filtering of VCF files\n", "- plotting of results\n", "- all excel stuff!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Why programming?\n", "\n", "- Computers are fast\n", "- Computers don't get bored\n", "- Computers don't get sloppy" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Create reproducable results\n", "- Extract large amount of information" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Final advice\n", "\n", "- Stop to think before you start coding\n", " - use pseudocode\n", " - use top-down programming\n", " - use paper and pen\n", " - take breaks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- You know the basics - don't be afraid to try\n", "- You will get faster" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### Final advice\n", "\n", "- Getting help\n", " - ask colleauges\n", " - talk about your problem (get a rubber duck)\n", " - search the web\n", " - take breaks!\n", " - NBIS drop-ins" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "
Now you know Python!
\n", "
\n", "
🎉

\n", "
Well done!
\n" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "rise": { "height": 900, "width": "90%" } }, "nbformat": 4, "nbformat_minor": 2 }