{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "# from IPython.display import Image\n",
    "from IPython.display import clear_output\n",
    "from IPython.display import FileLink, FileLinks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Introduction to\n",
    "\n",
    "![title](img/python-logo-master-flat.png)\n",
    "\n",
    "### with Application to Bioinformatics\n",
    "\n",
    "#### - Day 5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Review\n",
    "\n",
    "- Dictionaries\n",
    "  - Create a dictionary containing the keys `a` and `b`. Both should have the value 1.\n",
    "  - Change the value of `b` to 5.\n",
    "- Lists\n",
    "  - Create a list containing the elements `'a'`, `'b'`, `'c'`.\n",
    "  - Reverse it\n",
    "- Set the variable `title` to `\"A movie\"` and `rating` to `10`.\n",
    " - Use formatting to produce the following string:\n",
    " \n",
    "     `\"The movie the movie got rating 10!\"`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Create a dictionary containing the keys a and b. Both should have the value 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "# Change the value of b to 5"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Create a list containing the elements `'a'`, `'b'`, `'c'`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reverse it"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "# Set the variable `title` to `\"A movie\"` and `rating` to 10."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use formatting to produce: \"The movie the movie got rating 10!\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### TODAY\n",
    "\n",
    "- review\n",
    "- regex\n",
    "- sumup"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "## Review Day 4\n",
    "\n",
    "- More control!\n",
    "\n",
    "  - `None`\n",
    "  - `break`, `continue`\n",
    "  - keyword arguments\n",
    "  - documentation, comments...\n",
    "  \n",
    "- Pandas\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Control loops\n",
    "\n",
    "- `break` a loop => stop it\n",
    "\n",
    "<center>\n",
    "<img src=\"img/break.png\" alt=\"break\" width=\"30%\"/>\n",
    "</center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Control loops\n",
    "\n",
    "- `continue` => go on to the next iteration\n",
    "\n",
    "<center>\n",
    "<img src=\"img/continue.png\" alt=\"break\" width=\"30%\"/>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### Control statements\n",
    "\n",
    "- `pass`  => do nothing\n",
    "\n",
    "```py\n",
    "for line in file:\n",
    "    if len(line) > 40:\n",
    "        # TODO find out what to do here\n",
    "        pass\n",
    "    do_something(line)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### None\n",
    "\n",
    "`None` means \"nothing\". Is neither true nor false.\n",
    "\n",
    "```py\n",
    "if value:\n",
    "   print('value is not None')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Keyword arguments\n",
    "\n",
    "```py\n",
    "open(filename, encoding=\"utf-8\")\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```py\n",
    "open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### Keyword arguments\n",
    "\n",
    "- programmer: set default values\n",
    "- user: ignore parameters\n",
    "- better overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### Using code\n",
    "\n",
    "- Python standard modules\n",
    "- Your colleague's code\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "- `import pandas`\n",
    "- `import matplotlib.pyplot as plt`\n",
    "- `from matplotlib.pyplot import show`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Documentition and getting help\n",
    "\n",
    "- `help(sys)`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- write comments       `# why do I do this?`\n",
    "- write documentation  `\"\"\"what is this? how do you use it?\"\"\"`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Writing readable code"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "```py\n",
    "def f(a, b):\n",
    "    for c in open(a):\n",
    "        if c.startswith(b):\n",
    "            print(c)\n",
    "            ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "==>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "```py\n",
    "def print_lines(filename, start):\n",
    "    \"\"\"Print all lines in the file that starts with the given string.\"\"\"\n",
    "    for line in open(filename):\n",
    "        if line.startswith(start):\n",
    "            print(line)\n",
    "            ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "**<center>Care about the names of your variables and functions</center>**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Pandas\n",
    "\n",
    "- Read tables\n",
    "```py\n",
    "dataframe = pandas.read_table('mydata.txt', sep='|', index_col=0)\n",
    "dataframe = pandas.read_csv('mydata.csv')\n",
    "```\n",
    "\n",
    "- Select rows and colums\n",
    "```py\n",
    "dataframe.columname\n",
    "dataframe.loc[index]\n",
    "dataframe.loc[dataframe.age == 20 ]\n",
    "```\n",
    "\n",
    "- Plot it\n",
    "```py\n",
    "datafram.plot(kind='line', x='column1', y='column2')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## TODAY\n",
    "\n",
    "- Regular expressions\n",
    "- Sum up of the course"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Regular Expressions\n",
    "\n",
    "<br><br><br>\n",
    "- **A smarter way of searching text**\n",
    "\n",
    "- **search&replace**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Regular Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- A formal language for defining search patterns"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Let's you search not only for exact strings but controlled variations of that string."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Why?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Examples:\n",
    "   - Find variations in a protein or DNA sequence\n",
    "     - `\"MVR???A\"`\n",
    "     - `\"ATG???TAG`\n",
    "   - American/British spelling, endings and other variants:\n",
    "     - salpeter, salpetre, saltpeter, nitre, niter or KNO3\n",
    "     - hemaglobin, heamoglobin, hemaglobins, heamoglobin's\n",
    "     - catalyze, catalyse, catalyzed...\n",
    "   - A pattern in a vcf file\n",
    "     - a digit appearing after a tab"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Regular Expressions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- When?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "\n",
    "- To find information\n",
    "    - in your `vcf` or `fasta` files\n",
    "    - in your code\n",
    "    - in your next essay\n",
    "    - in a database\n",
    "    - online\n",
    "    - in a bunch of articles\n",
    "    - ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    " - Search/replace\n",
    "   - becuase &rarr; because\n",
    "   - color &rarr; colour\n",
    "   - `\\t` (tab) &rarr; `\"    \"` (four spaces)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Supported by most programming languages, text editors, search engines..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Defining a search pattern"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"img/color.png\" alt=\"regex\" width=\"50%\"/>\n",
    "<img src=\"img/salpeter.png\" alt=\"regex\" width=\"50%\"/>\n",
    "\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Common operations\n",
    "- `.` matches any character (once)\n",
    "- `?` repeat previous pattern 0 or 1 times\n",
    "- `*` repeat previous pattern 0 or more times\n",
    "- `+` repeat previous pattern 1 or more times"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`colour.*`\n",
    "\n",
    "`salt?peter`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`.*` matches everything (including the empty string)!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "\n",
    "<center><code>\"salt?pet..\"</code></center>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<center><font color=\"green\">saltpeter</font></center>\n",
    "\n",
    "\n",
    "<center><font color=\"red\">\"saltpet88\"</font></center>\n",
    "<center><font color=\"red\">\"salpetin\"</font></center>\n",
    "<center><font color=\"red\">\"saltpet  \"</font></center>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### More common operations - classes of characters\n",
    "\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### More common operations - classes of characters\n",
    "\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`\\w+`\n",
    "\n",
    "![result](img/regex_w.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### More common operations - classes of characters\n",
    "\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`\\d+`\n",
    "\n",
    "![result](img/regex_d.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### More common operations - classes of characters\n",
    "\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`\\s+`\n",
    "\n",
    "![result](img/regex_s.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### More common operations - classes of characters\n",
    "\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace\n",
    "- `[abc]` matches a single character defined in this set {a, b, c}\n",
    "- `[^abc]` matches a single character that is **not** a, b or c"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "#### `[a-z]` matches all letters between `a` and `z` (the english alphabet).\n",
    "\n",
    "#### `[a-z]+` matches any (lowercased) english word."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<center><code>salt<font color=\"red\">?</font>pet<font color=\"red\">[</font>er<font color=\"red\">]+</font>\n",
    "</code></center>\n",
    "\n",
    "   <font color=\"green\"><center>saltpeter</center>\n",
    "    <font color=\"green\"><center>salpetre</center>\n",
    "\n",
    "<center><strike><font color=\"red\">\"saltpet88\"</font></strike></center>\n",
    "<center><strike><font color=\"red\">\"salpetin\"</font></strike></center>\n",
    "<center><strike><font color=\"red\">\"saltpet  \"</font></strike></center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**Example - finding patterns in vcf**\n",
    "\n",
    "<font size=\"5\"><code>\n",
    "1\t920760\trs80259304\tT\tC\t.\tPASS\tAA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131\tGT:DP:CB\t0/1:1:SM 0/0:4/SM...\n",
    "</code></font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Find a sample:\n",
    "\n",
    "`0/0`  `0/1`  `1/1`  ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`\"[01]/[01]\"`      (or   `\"\\d/\\d\")`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "```\\s[01]/[01]:```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**Example - finding patterns in vcf**\n",
    "\n",
    "<font size=\"5\"><code>\n",
    "1\t920760\trs80259304\tT\tC\t.\tPASS\tAA=T;AC=18;AN=120;DP=190;GP=1:930897;BN=131\tGT:DP:CB\t0/1:1:SM 0/0:4/SM...\n",
    "</code></font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "- Find all lines containing more than one homozygous sample."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`... 1/1:...  ... 1/1:...  ...`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "```.*1/1.*1/1.*```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "```.*\\s1/1:.*\\s1/1:.*```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Exercise 1\n",
    "\n",
    "\n",
    "- `.` matches any character (once)\n",
    "- `?` repeat previous pattern 0 or 1 times\n",
    "- `*` repeat previous pattern 0 or more times\n",
    "- `+` repeat previous pattern 1 or more times\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace\n",
    "- `[abc]` matches a single character defined in this set {a, b, c}\n",
    "- `[^abc]` matches a single character that is **not** a, b or c\n",
    "- `[a-z]` matches any (lowercased) letter from the english alphabet\n",
    "- `.*` matches anything\n",
    "\n",
    "<b>&rarr; Notebook Day_5_Exercise_1  (~30 minutes) </b>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Regular expressions in Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import re"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('ab*')\n",
    "p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Searching"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('ab*')\n",
    "\n",
    "p.search('abc')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "print(p.search('cb'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('HELLO')\n",
    "m = p.search('gsdfgsdfgs  HELLO  __!@£§≈[|ÅÄÖ‚…’ﬁ]')\n",
    "\n",
    "print(m)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Case insensitiveness"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('[a-z]+')\n",
    "result = p.search('ATGAAA')\n",
    "print(result)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('[a-z]+', re.IGNORECASE)\n",
    "\n",
    "result = p.search('ATGAAA')\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### The match object"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "result = p.search('123 ATGAAA 456')\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`result.group()`: Return the string matched by the expression\n",
    "\n",
    "`result.start()`: Return the starting position of the match\n",
    "\n",
    "`result.end()`: Return the ending position of the match\n",
    "\n",
    "`result.span()`: Return both (start, end)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "result.group()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "result.start()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result.end()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "result.span()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Zero or more...?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('.*HELLO.*')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "m = p.search('lots of text  HELLO  more text and characters!!! ^^')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "m.group()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The `*` is **greedy**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Finding all the matching patterns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('HELLO')\n",
    "objects = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')\n",
    "print(objects)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "for m in objects:\n",
    "    print(f'Found {m.group()} at position {m.start()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "objects = p.finditer('lots of text  HELLO  more text  HELLO ... and characters!!! ^^')\n",
    "for m in objects:\n",
    "    print('Found {} at position {}'.format(m.group(), m.start()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### How to find a full stop?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "txt = \"The first full stop is here: .\"\n",
    "p = re.compile('.')\n",
    "\n",
    "m = p.search(txt)\n",
    "print('\"{}\" at position {}'.format(m.group(), m.start()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('\\.')\n",
    "\n",
    "m = p.search(txt)\n",
    "print('\"{}\" at position {}'.format(m.group(), m.start()))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### More operations\n",
    "- `\\` escaping a character\n",
    "- `^` beginning of the string\n",
    "- `$` end of string\n",
    "- `|` boolean `or`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "`^hello$`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "<center>\n",
    "    <code>salt<font color=\"red\">?</font>pet<font color=\"red\">(</font>er<font color=\"red\">|</font>re<font color=\"red\">)</font> <font color=\"red\">|</font> nit<font color=\"red\">(</font>er<font color=\"red\">|</font>re<font color=\"red\">)</font> <font color=\"red\">|</font> KNO3</code>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Substitution"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "#### Finally, we can fix our spelling mistakes!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "txt = \"Do it   becuase   I say so,     not becuase you want!\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "p = re.compile('becuase')\n",
    "txt = p.sub('because', txt)\n",
    "print(txt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "p = re.compile('\\s+')\n",
    "p.sub(' ', txt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Overview"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    " - Construct regular expressions\n",
    " \n",
    "     ```py\n",
    "     p = re.compile()```\n",
    "     \n",
    " - Searching\n",
    " \n",
    "     ```py\n",
    "     p.search(text)\n",
    "     ```\n",
    "     \n",
    " - Substitution\n",
    " \n",
    "     ```py\n",
    "     p.sub(replacement, text)\n",
    "     ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**Typical code structure:**\n",
    "\n",
    "```python\n",
    "p = re.compile( ... )\n",
    "m = p.search('string goes here')\n",
    "if m:\n",
    "    print('Match found: ', m.group())\n",
    "else:\n",
    "    print('No match')\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Regular expressions\n",
    "\n",
    "\n",
    "- A powerful tool to search and modify text\n",
    "\n",
    "- There is much more to read in the [docs](https://docs.python.org/3/library/re.html)\n",
    "\n",
    "- Note: regex comes in different flavours. If you use it outside Python, there might be small variations in the syntax."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Exercise 2\n",
    "         \n",
    "- `.` matches any character (once)\n",
    "- `?` repeat previous pattern 0 or 1 times\n",
    "- `*` repeat previous pattern 0 or more times\n",
    "- `+` repeat previous pattern 1 or more times\n",
    "- `\\w` matches any letter or number, and the underscore\n",
    "- `\\d` matches any digit\n",
    "- `\\D` matches any non-digit\n",
    "- `\\s` matches any whitespace (spaces, tabs, ...)\n",
    "- `\\S` matches any non-whitespace\n",
    "- `[abc]` matches a single character defined in this set {a, b, c}\n",
    "- `[^abc]` matches a single character that is **not** a, b or c\n",
    "- `[a-z]` matches any (lowercased) letter from the english alphabet\n",
    "- `.*` matches anything\n",
    "- `\\` escaping a character\n",
    "- `^` beginning of the string\n",
    "- `$` end of string\n",
    "- `|` boolean `or`\n",
    "  \n",
    "Read more: full documentation https://docs.python.org/3.6/library/re.html\n",
    "<br/>\n",
    "<b>&rarr; Notebook Day_5_Exercise_2  (~30 minutes) </b>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<h3> <center>Sum up!</center></h3>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Processing files - looping through the lines\n",
    "\n",
    "```py\n",
    "\n",
    "for line in open('myfile.txt', 'r'):\n",
    "    do_stuff(line)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Store values\n",
    "\n",
    "```py\n",
    "iterations = 0\n",
    "information = []\n",
    "\n",
    "for line in open('myfile.txt', 'r'):\n",
    "    iterations += 1\n",
    "    information += do_stuff(line)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Values\n",
    "\n",
    "- Base types:\n",
    "\n",
    "    - ```py\n",
    "    str     \"hello\"```\n",
    "    - ```py\n",
    "    int     5```\n",
    "    - ```py\n",
    "    float   5.2```\n",
    "    - ```py\n",
    "    bool    True```\n",
    "    \n",
    "- Collections:\n",
    "\n",
    "    - ```py\n",
    "    list  [\"a\", \"b\", \"c\"]```\n",
    "    - ```py\n",
    "    dict  {\"a\": \"alligator\", \"b\": \"bear\", \"c\": \"cat\"}```\n",
    "    - ```py\n",
    "    tuple (\"this\", \"that\")```\n",
    "    - ```py\n",
    "    set   {\"drama\", \"sci-fi\"}```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**Assign values**\n",
    "```py\n",
    "iterations = 0\n",
    "score = 5.2\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "#### Modify values and compare\n",
    "\n",
    "- ```py\n",
    "+, -, *,...   # mathematical```\n",
    "- ```py\n",
    "and, or, not  # logical ```\n",
    "- ```py\n",
    "==, !=        # comparisons```\n",
    "- ```py\n",
    "<, >, <=, >=  # comparisons```\n",
    "- ```py\n",
    "in            # membership\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "value = 4\n",
    "nextvalue = 1\n",
    "nextvalue += value\n",
    "print('nextvalue: ', nextvalue, 'value: ', value)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "x = 5\n",
    "y = 7\n",
    "z = 2\n",
    "x > 6 and y == 7 or z > 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "(x > 6 and y == 7) or z > 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Strings\n",
    "\n",
    "Raw text\n",
    "\n",
    "- Common manipulations:\n",
    " \n",
    "  - ```py\n",
    "  s.strip()  # remove unwanted spacing```\n",
    "  - ```py\n",
    "  s.split()  # split line into columns```\n",
    "  - ```py\n",
    "  s.upper(), s.lower()  # change the case```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "\n",
    "- Regular expressions help you find and replace strings.\n",
    "\n",
    "  - ```py\n",
    "  p = re.compile('A.A.A')\n",
    "  p.search(dnastring)\n",
    "  ```\n",
    "  - ```py\n",
    "  p = re.compile('T')\n",
    "  p.sub('U', dnastring)\n",
    "  ```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "p = re.compile('p.*\\sp')  # the greedy star!\n",
    "\n",
    "p.search('a python programmer writes python code').group()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Collections\n",
    "\n",
    "Can contain strings, integer, booleans...\n",
    "- **Mutable**: you can *add*, *remove*, *change* values\n",
    "\n",
    "  - Lists:\n",
    "  ```py\n",
    "    mylist.append('value')\n",
    "```\n",
    "\n",
    "  - Dicts:\n",
    "  ```py\n",
    "    mydict['key'] = 'value'```\n",
    "\n",
    "  - Sets:\n",
    "```py\n",
    "    myset.add('value')```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Collections\n",
    "\n",
    "- Test for membership:\n",
    "```py\n",
    "value in myobj\n",
    "```\n",
    "\n",
    "- Check size:\n",
    "```py\n",
    "len(myobj)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Lists\n",
    "\n",
    "- Ordered!\n",
    "\n",
    "```py\n",
    "todolist = [\"work\", \"sleep\", \"eat\", \"work\"]\n",
    "\n",
    "todolist.sort()\n",
    "todolist.reverse()\n",
    "todolist[2]\n",
    "todolist[-1]\n",
    "todolist[2:6]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "todolist = [\"work\", \"sleep\", \"eat\", \"work\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "todolist.sort()\n",
    "print(todolist)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "todolist.reverse()\n",
    "print(todolist)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "todolist[2]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "todolist[-1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "todolist[2:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Dictionaries\n",
    "\n",
    "- Keys have values\n",
    "\n",
    "```py\n",
    "mydict = {\"a\": \"alligator\", \"b\": \"bear\", \"c\": \"cat\"}\n",
    "counter = {\"cats\": 55, \"dogs\": 8}\n",
    "\n",
    "mydict[\"a\"]\n",
    "mydict.keys()\n",
    "mydict.values()\n",
    "```\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "counter = {'cats': 0, 'others': 0}\n",
    "\n",
    "for animal in ['zebra', 'cat', 'dog', 'cat']:\n",
    "    if animal == 'cat':\n",
    "        counter['cats'] += 1\n",
    "    else:\n",
    "        counter['others'] += 1\n",
    "        \n",
    "counter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Sets\n",
    "\n",
    "- Bag of values\n",
    " \n",
    "  - No order\n",
    "  \n",
    "  - No duplicates \n",
    "\n",
    "  - Fast membership checks\n",
    "  \n",
    "  - Logical set operations (union, difference, intersection...)\n",
    "\n",
    "\n",
    "```py\n",
    "myset = {\"drama\", \"sci-fi\"}\n",
    "|\n",
    "myset.add(\"comedy\")\n",
    "\n",
    "myset.remove(\"drama\")\n",
    "```\n"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "for m in objects:\n",
    "    print(f'Found {m.group()} at position {m.start()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "todolist = [\"work\", \"sleep\", \"eat\", \"work\"]\n",
    "\n",
    "todo_items = set(todolist)\n",
    "todo_items"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "todo_items.add(\"study\")\n",
    "todo_items"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "todo_items.add(\"eat\")\n",
    "todo_items"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Strings\n",
    "\n",
    "- Works like a list of characters\n",
    "  - ```py\n",
    "  s += \"more words\"  # add content```\n",
    "  - ```py\n",
    "  s[4]               # get character at index 4```\n",
    "  - ```py\n",
    "  'e' in s           # check for membership```\n",
    "  \n",
    "  - ```py\n",
    "  len(s)             # check size```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- But are immutable\n",
    "\n",
    "  - ```py\n",
    "  > s[2] = 'i'```\n",
    "  ---\n",
    "  ```py\n",
    "  Traceback (most recent call last):\n",
    "  File \"<stdin>\", line 1, in <module>\n",
    "  TypeError: 'str' object does not support item assignment\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Tuples\n",
    "\n",
    "- A group (usually two) of values that belong together\n",
    "  \n",
    "  - ```py\n",
    "tup = (max_lenght, sequence)\n",
    "```\n",
    "  - An ordered sequence (like lists)\n",
    "\n",
    "  - ```py\n",
    "length = tup[0]  # get content at index 0\n",
    "```\n",
    "  - Immutable"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "tup = (2, 'xy')\n",
    "tup[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tup[0] = 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "```py\n",
    "def find_longest_seq(file):\n",
    "    # some code here...\n",
    "    return length, sequence\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "```py\n",
    "answer = find_longest_seq(filepath)\n",
    "print('lenght', answer[0])\n",
    "print('sequence', answer[1])\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "```py\n",
    "answer = find_longest_seq(filepath)\n",
    "length, sequence = find_longest_seq(filepath)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Deciding what to do\n",
    "\n",
    "```py\n",
    "if count > 10:\n",
    "   print('big')\n",
    "elif count > 5:\n",
    "   print('medium')\n",
    "else:\n",
    "   print('small')\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "shopping_list = ['bread', 'egg', ' butter', 'milk']\n",
    "tired         = True\n",
    "\n",
    "if len(shopping_list) > 4:\n",
    "    print('Really need to go shopping!')\n",
    "elif not tired:\n",
    "    print('Not tired? Then go shopping!')\n",
    "else:\n",
    "    print('Better to stay at home')   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Deciding what to do - if statement\n",
    "\n",
    "<img src=\"img/if_else_statement.png\" alt=\"Drawing\" style=\"width: 600px;\"/> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Program flow - for loops\n",
    "\n",
    "```py\n",
    "information = []\n",
    "\n",
    "for line in open('myfile.txt', 'r'):\n",
    "    if is_comment(line):\n",
    "       use_comment(line)\n",
    "    else:\n",
    "       information = read_data(line)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"img/forloop.png\" alt=\"Drawing\" style=\"width: 600px;\"/> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Program flow - while loops\n",
    "\n",
    "```py\n",
    "keep_going = True\n",
    "information = []\n",
    "index = 0\n",
    "\n",
    "while keep_going:\n",
    "    current_line = lines[index]\n",
    "    information += read_line(current_line)\n",
    "    index += 1\n",
    "    if check_something(current_line):\n",
    "        keep_going = False\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img src=\"img/whileloop.png\" alt=\"Drawing\" style=\"width: 600px;\"/> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Different types of loops\n",
    "\n",
    "__`For` loop__\n",
    "\n",
    "is a control flow statement that performs operations over a known amount of steps.\n",
    "\n",
    "__`While` loop__\n",
    "\n",
    "is a control flow statement that allows code to be executed repeatedly based on a given Boolean condition.\n",
    "\n",
    "<br></br>\n",
    "\n",
    "__Which one to use?__\n",
    "\n",
    "`For` loops - standard for iterations over lists and other iterable objects\n",
    "\n",
    "`While` loops - more flexible and can iterate an unspecified number of times\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "user_input = \"thank god it's friday\"\n",
    "for c in user_input:\n",
    "    print(c.upper())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "i = 0\n",
    "while i < len(user_input):\n",
    "    c = user_input[i]\n",
    "    print(c.upper())\n",
    "    i += 1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "i = 0\n",
    "go_on = True\n",
    "while go_on:\n",
    "    c = user_input[i]\n",
    "    print(c.upper())\n",
    "    i += 1\n",
    "    if c == 'd':\n",
    "        go_on = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Controlling loops\n",
    "\n",
    "- `break` - stop the loop\n",
    "- `continue` - go on to the next iteration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "user_input = \"thank god it's friday\"\n",
    "for c in user_input:\n",
    "    print(c.upper())\n",
    "    if c == 'd':\n",
    "        break"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [],
   "source": [
    "i = 0\n",
    "while True:    \n",
    "    c = user_input[i]\n",
    "    i += 1\n",
    "    if c in 'aoueiy':\n",
    "        continue\n",
    "    print(c.upper())\n",
    "    if c == 'd':\n",
    "        break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "**Watch out!**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "i = 0\n",
    "while i > 10:    \n",
    "    print(user_input[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "While loops may be infinite!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Input/Output\n",
    "\n",
    "- In:\n",
    "\n",
    "  - Read files: `fh = open(filename, 'r')`\n",
    "\n",
    "     - `for line in fh:`\n",
    "     - `fh.read()`\n",
    "     - `fh.readlines()`\n",
    "  - Read information from command line: `sys.argv[1:]`\n",
    "\n",
    "- Out:\n",
    "\n",
    "  - Write files: `fh = open(filename, 'w')`\n",
    "       - `fh.write(text)`\n",
    "  - Printing: `print('my_information')`\n",
    "  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Input/Output\n",
    "\n",
    "- Open files should be closed:\n",
    "    - `fh.close()`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### Formatting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "<pre>\n",
    "{[field_name] [: <font color=\"green\">format_spec</font>]}\n",
    "</pre>\n",
    "\n",
    "*Format_spec:*\n",
    "<pre>\n",
    "<font color=\"green\">filling alignment width precision type</font>\n",
    "   -       &gt;        10    .2       f\n",
    "</pre>\n",
    "\n",
    "```py\n",
    " print('|{:30}|{:^10}|{:^10.2f}|'.format(movie, votes, total/votes))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Code structure\n",
    "\n",
    "- Functions\n",
    "- Modules"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Functions\n",
    "\n",
    "- A named piece of code that performs a certain task.\n",
    "\n",
    "<img src=\"img/function_structure_explained.png\" alt=\"Drawing\" style=\"width: 600px;\"/>  \n",
    "\n",
    "\n",
    "- Is given a number of input arguments\n",
    "  - to be used (are in scope) within the function body\n",
    "- Returns a result (maybe `None`)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Functions - keyword arguments\n",
    "\n",
    "```py\n",
    "def prettyprinter(name, value, delim=\":\", end=None):\n",
    "    out = \"The \" + name + \" is \" + delim + \" \" + value\n",
    "    if end:\n",
    "        out += end\n",
    "    return out\n",
    "```\n",
    "\n",
    "\n",
    "- used to set default values (often `None`)\n",
    "- can be skipped in function calls\n",
    "- improve readability"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "source": [
    "#### Files (modules)\n",
    "\n",
    "- A (larger) piece of code containing functions, classes...\n",
    "- Can be imported\n",
    "\n",
    "```py\n",
    "import mymodule\n",
    "```\n",
    "- Can be run as a script\n",
    "\n",
    "```> python3 mymodule.py```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Using your code\n",
    "\n",
    "Any longer pieces of code that have been used and will be re-used should be saved\n",
    "\n",
    "- Save it as a file `    .py`\n",
    "\n",
    "- To run it:\n",
    "`python3 mycode.py`\n",
    "\n",
    "- Import it:\n",
    "`import mycode`\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Documentation and comments\n",
    "\n",
    "- ```py\n",
    "\"\"\" This is a doc-string explaining what the purpose of this function/module is.\"\"\"```\n",
    "- ```py\n",
    "# This is a comment that helps understanding the code\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Comments *will* help you"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Undocumented code rarely gets used"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Try to keep your code readable: use informative variable and function names"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<div style=\"overflow-y:scroll; max-height:900px\">\n",
    "<img src=\"img/example_code.png\" alt=\"Module\"/>\n",
    "    </div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Why programming?\n",
    "\n",
    "Endless possibilities!  \n",
    "- reverse complement DNA\n",
    "- custom filtering of VCF files\n",
    "- plotting of results\n",
    "- all excel stuff!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Why programming?\n",
    "\n",
    "- Computers are fast\n",
    "- Computers don't get bored\n",
    "- Computers don't get sloppy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- Create reproducable results\n",
    "- Extract large amount of information"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Final advice\n",
    "\n",
    "- Stop to think before you start coding\n",
    "    - use pseudocode\n",
    "    - use top-down programming\n",
    "    - use paper and pen\n",
    "    - take breaks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "- You know the basics - don't be afraid to try\n",
    "- You will get faster"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "#### Final advice\n",
    "\n",
    "- Getting help\n",
    "  - ask colleauges\n",
    "  - talk about your problem (get a rubber duck)\n",
    "  - search the web\n",
    "  - take breaks!\n",
    "  - NBIS drop-ins"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<center>Now you know Python!</center>\n",
    "<br>\n",
    "<center><font size=20> 🎉 </font></center><br>\n",
    "<center><font size=20>    Well done!</font></center>\n"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  },
  "rise": {
   "height": 900,
   "width": "90%"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}