{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "# Table of Contents\n",
    " <p><div class=\"lev1 toc-item\"><a href=\"#PEB-Belgrade---Bash-workshop\" data-toc-modified-id=\"PEB-Belgrade---Bash-workshop-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>PEB Belgrade - Bash workshop</a></div><div class=\"lev1 toc-item\"><a href=\"#Definitions:-The-Unix-Philosophy\" data-toc-modified-id=\"Definitions:-The-Unix-Philosophy-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>Definitions: The Unix Philosophy</a></div><div class=\"lev2 toc-item\"><a href=\"#More-definitions\" data-toc-modified-id=\"More-definitions-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>More definitions</a></div><div class=\"lev1 toc-item\"><a href=\"#How-to-follow-this-workshop\" data-toc-modified-id=\"How-to-follow-this-workshop-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>How to follow this workshop</a></div><div class=\"lev2 toc-item\"><a href=\"#Getting-a-Terminal-application\" data-toc-modified-id=\"Getting-a-Terminal-application-3.1\"><span class=\"toc-item-num\">3.1&nbsp;&nbsp;</span>Getting a Terminal application</a></div><div class=\"lev3 toc-item\"><a href=\"#Windows\" data-toc-modified-id=\"Windows-3.1.1\"><span class=\"toc-item-num\">3.1.1&nbsp;&nbsp;</span>Windows</a></div><div class=\"lev3 toc-item\"><a href=\"#Mac\" data-toc-modified-id=\"Mac-3.1.2\"><span class=\"toc-item-num\">3.1.2&nbsp;&nbsp;</span>Mac</a></div><div class=\"lev3 toc-item\"><a href=\"#Linux\" data-toc-modified-id=\"Linux-3.1.3\"><span class=\"toc-item-num\">3.1.3&nbsp;&nbsp;</span>Linux</a></div><div class=\"lev2 toc-item\"><a href=\"#Getting-the-workshop-materials\" data-toc-modified-id=\"Getting-the-workshop-materials-3.2\"><span class=\"toc-item-num\">3.2&nbsp;&nbsp;</span>Getting the workshop materials</a></div><div class=\"lev2 toc-item\"><a href=\"#If-you-get-&quot;command-not-found&quot;\" data-toc-modified-id=\"If-you-get-&quot;command-not-found&quot;-3.3\"><span class=\"toc-item-num\">3.3&nbsp;&nbsp;</span>If you get \"command not found\"</a></div><div 
class=\"lev3 toc-item\"><a href=\"#Advanced:-using-git-to-get-the-materials\" data-toc-modified-id=\"Advanced:-using-git-to-get-the-materials-3.3.1\"><span class=\"toc-item-num\">3.3.1&nbsp;&nbsp;</span>Advanced: using git to get the materials</a></div><div class=\"lev1 toc-item\"><a href=\"#Basic-Unix-Commands:-ls,-cd\" data-toc-modified-id=\"Basic-Unix-Commands:-ls,-cd-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>Basic Unix Commands: ls, cd</a></div><div class=\"lev2 toc-item\"><a href=\"#ls--l\" data-toc-modified-id=\"ls--l-4.1\"><span class=\"toc-item-num\">4.1&nbsp;&nbsp;</span>ls -l</a></div><div class=\"lev1 toc-item\"><a href=\"#Accessing-the-contents-of-a-file:-head,-cat,-less\" data-toc-modified-id=\"Accessing-the-contents-of-a-file:-head,-cat,-less-5\"><span class=\"toc-item-num\">5&nbsp;&nbsp;</span>Accessing the contents of a file: head, cat, less</a></div><div class=\"lev1 toc-item\"><a href=\"#Searching-patterns-into-file:-grep\" data-toc-modified-id=\"Searching-patterns-into-file:-grep-6\"><span class=\"toc-item-num\">6&nbsp;&nbsp;</span>Searching patterns into file: grep</a></div><div class=\"lev2 toc-item\"><a href=\"#Accessing-grep-documentation\" data-toc-modified-id=\"Accessing-grep-documentation-6.1\"><span class=\"toc-item-num\">6.1&nbsp;&nbsp;</span>Accessing grep documentation</a></div><div class=\"lev2 toc-item\"><a href=\"#Searching-multiple-files\" data-toc-modified-id=\"Searching-multiple-files-6.2\"><span class=\"toc-item-num\">6.2&nbsp;&nbsp;</span>Searching multiple files</a></div><div class=\"lev2 toc-item\"><a href=\"#Searching-multiple-patterns-and-the-Unix-piping-system\" data-toc-modified-id=\"Searching-multiple-patterns-and-the-Unix-piping-system-6.3\"><span class=\"toc-item-num\">6.3&nbsp;&nbsp;</span>Searching multiple patterns and the Unix piping system</a></div><div class=\"lev2 toc-item\"><a href=\"#Regular-Expressions\" data-toc-modified-id=\"Regular-Expressions-6.4\"><span 
class=\"toc-item-num\">6.4&nbsp;&nbsp;</span>Regular Expressions</a></div><div class=\"lev2 toc-item\"><a href=\"#Regular-Expression-exercise\" data-toc-modified-id=\"Regular-Expression-exercise-6.5\"><span class=\"toc-item-num\">6.5&nbsp;&nbsp;</span>Regular Expression exercise</a></div><div class=\"lev1 toc-item\"><a href=\"#Working-with-tabular-files:-Awk\" data-toc-modified-id=\"Working-with-tabular-files:-Awk-7\"><span class=\"toc-item-num\">7&nbsp;&nbsp;</span>Working with tabular files: Awk</a></div><div class=\"lev2 toc-item\"><a href=\"#Example-of-tabular-file:-the-GFF3-format\" data-toc-modified-id=\"Example-of-tabular-file:-the-GFF3-format-7.1\"><span class=\"toc-item-num\">7.1&nbsp;&nbsp;</span>Example of tabular file: the GFF3 format</a></div><div class=\"lev2 toc-item\"><a href=\"#Basic-AWK-syntax:-filters\" data-toc-modified-id=\"Basic-AWK-syntax:-filters-7.2\"><span class=\"toc-item-num\">7.2&nbsp;&nbsp;</span>Basic AWK syntax: filters</a></div><div class=\"lev4 toc-item\"><a href=\"#Exercise\" data-toc-modified-id=\"Exercise-7.2.0.1\"><span class=\"toc-item-num\">7.2.0.1&nbsp;&nbsp;</span>Exercise</a></div><div class=\"lev2 toc-item\"><a href=\"#Awk:-printing-columns-and-doing-operations\" data-toc-modified-id=\"Awk:-printing-columns-and-doing-operations-7.3\"><span class=\"toc-item-num\">7.3&nbsp;&nbsp;</span>Awk: printing columns and doing operations</a></div><div class=\"lev3 toc-item\"><a href=\"#Exercise-(difficult)\" data-toc-modified-id=\"Exercise-(difficult)-7.3.1\"><span class=\"toc-item-num\">7.3.1&nbsp;&nbsp;</span>Exercise (difficult)</a></div><div class=\"lev2 toc-item\"><a href=\"#AWK:-searching-by-regular-expressions\" data-toc-modified-id=\"AWK:-searching-by-regular-expressions-7.4\"><span class=\"toc-item-num\">7.4&nbsp;&nbsp;</span>AWK: searching by regular expressions</a></div><div class=\"lev3 toc-item\"><a href=\"#Last-exercise!\" data-toc-modified-id=\"Last-exercise!-7.4.1\"><span 
class=\"toc-item-num\">7.4.1&nbsp;&nbsp;</span>Last exercise!</a></div><div class=\"lev1 toc-item\"><a href=\"#Bonus:-Makefiles\" data-toc-modified-id=\"Bonus:-Makefiles-8\"><span class=\"toc-item-num\">8&nbsp;&nbsp;</span>Bonus: Makefiles</a></div><div class=\"lev2 toc-item\"><a href=\"#Defining-pipelines-with-Makefiles\" data-toc-modified-id=\"Defining-pipelines-with-Makefiles-8.1\"><span class=\"toc-item-num\">8.1&nbsp;&nbsp;</span>Defining pipelines with Makefiles</a></div><div class=\"lev2 toc-item\"><a href=\"#How-to-run-Makefile-rules\" data-toc-modified-id=\"How-to-run-Makefile-rules-8.2\"><span class=\"toc-item-num\">8.2&nbsp;&nbsp;</span>How to run Makefile rules</a></div><div class=\"lev1 toc-item\"><a href=\"#The-last-slide\" data-toc-modified-id=\"The-last-slide-9\"><span class=\"toc-item-num\">9&nbsp;&nbsp;</span>The last slide</a></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# PEB Belgrade - Bash workshop\n",
    "\n",
    "Giovanni M. Dall'Olio, GlaxoSmithKline, 12/09/2016. All materials available here: https://dalloliogm.github.io/ \n",
    "\n",
    "```\n",
    " _______________\n",
    " / Welcome to    \\\n",
    " \\ PEB Belgrade! /\n",
    "  ---------------\n",
    "         \\   ^__^\n",
    "          \\  (oo)\\_______\n",
    "             (__)\\       )\\/\\\n",
    "                 ||----w |\n",
    "                 ||     ||\n",
    "```\n",
    "\n",
    "Welcome to Belgrade!\n",
    "\n",
    "In this workshop we will review some basic Unix command, as well as bash usage.\n",
    "\n",
    "If you attended the [Programming for Evolutionary Biology course in Leipzig](http://evop.bioinf.uni-leipzig.de/), this will be a refreshener. I've hidden some **secrets** in the exercises, so you will not get bored :-)\n",
    "\n",
    "If you are new to bash, this will be a short introduction. Press Space or Down do continue."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "skip"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": []
    }
   ],
   "source": [
    "# Configuration - this will not appear in the slideshow\n",
    "alias grep='grep --color'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Definitions: The Unix Philosophy\n",
    "\n",
    "**Unix** is the name of an operating system created in the '80s, which became popular for introducing a novel approach to computing."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "The Unix philosophy can be summarised as: \n",
    "\n",
    "- Make each program do one thing well.\n",
    "- Expect the output of every program to become the input to another, as yet unknown, program. \n",
    "\n",
    "Press Space to continue."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "You will see how each Unix tool is specialized on a single task, and how the piping system allows to combine these tool together.\n",
    "\n",
    "These principles can be useful to any person wishing to learn programming. You may use the same approach when learning programming, starting writing small programs and functions, and combining them together in bigger pipelines. "
   ]
  },
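  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "As a small preview of this idea, here is a minimal sketch that chains four single-purpose tools to find the most frequent word in a list. The fruit names are made up for illustration, and the `|` symbol (the pipe) is explained in detail later in the workshop.\n",
    "\n",
    "```shell\n",
    "# printf writes one word per line; each | hands the output to the next tool.\n",
    "# sort groups identical words, uniq -c counts each group,\n",
    "# sort -rn puts the highest count first, head -n 1 keeps only the top line.\n",
    "printf '%s\\n' banana apple banana cherry | sort | uniq -c | sort -rn | head -n 1\n",
    "```\n",
    "\n",
    "None of these tools knows anything about fruit or word counts; the result comes only from combining them."
   ]
  },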
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## More definitions\n",
    "\n",
    "**Linux**: \n",
    "\n",
    "    A \"descendant\" of Unix, e.g. an operating system based on Unix that can run on modern computers\n",
    "\n",
    "**Terminal**:\n",
    "\n",
    "    A software that allows to input commands to the computer, by typing them rather than point-and-click\n",
    "        \n",
    "**Bash**: \n",
    "\n",
    "    A command-line interpreter, e.g. a software that interprets the commands given from the terminal, and execute them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# How to follow this workshop\n",
    "\n",
    "\n",
    "## Getting a Terminal application\n",
    "\n",
    "All the exercises will be done in a Terminal. \n",
    "\n",
    "During the conference we may have also time for a \"Linux Install Party\", to get Linux into some of your laptops. However there are ways to access a bash terminal without installing Linux first.\n",
    "\n",
    "Press space or the down key to see what to install or launch."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Windows\n",
    "\n",
    "For Windows users, we will use a terminal emulator software called MobaXTerm: http://mobaxterm.mobatek.net/\n",
    "\n",
    "The Home Edition is free and contains all the features we will need for the workshop:\n",
    "\n",
    "<a href=\"http://mobaxterm.mobatek.net/\" target=\"_blank\"><img src=\"http://mobaxterm.mobatek.net/img/moba/features/feature-terminal.png\" width=\"400\"></a>\n",
    "\n",
    "To install new software, use (e.g. make):\n",
    "\n",
    "```\n",
    "apt-cyg install make\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Mac\n",
    "\n",
    "You should be able to use the Console App in Mac.\n",
    "\n",
    "<img src=\"http://blog.teamtreehouse.com/wp-content/uploads/2012/09/Screen-Shot-2012-09-25-at-1.01.45-PM.png\" width=\"400\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Linux\n",
    "\n",
    "Congratulations on having Linux installed! You can use your favorite terminal app (e.g. gnome-terminal or konsole)\n",
    "\n",
    "<img src=\"https://camo.githubusercontent.com/358e5e2206148280d127128499ad92e8621e5517/68747470733a2f2f7261772e6769746875622e636f6d2f63656d6d616e6f75696c696469732f6d6f6e6f6b61692d676e6f6d652d7465726d696e616c2f6d61737465722f73637265656e73686f742d30312e706e67\" width=\"400\">\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Getting the workshop materials\n",
    "\n",
    "Now that we have a terminal application ready, let's download all the course materials.\n",
    "\n",
    "Open the terminal, and type the following commands (omitting the \"$:\"):\n",
    "\n",
    "```\n",
    "$: wget  https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n",
    "$: unzip master.zip\n",
    "\n",
    "```\n",
    "\n",
    "\n",
    "Explanation:\n",
    "\n",
    " - the **wget** command downloads a .zip file containing all the materials\n",
    " - the **unzip** command uncompresses the .zip file, creating a new folder in your home area."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## If you get \"command not found\"\n",
    "\n",
    "Download https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n",
    "\n",
    "Open Cygwin\n",
    "\n",
    "cd /cygdrive/c/Documents\\ and\\ Settings/ (your name)"
   ]
  },
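  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "A \"command not found\" message usually means that a tool is not installed, or is not in the terminal's search path. The sketch below checks which of the tools used in this section are available; the list of tool names is only an example and can be edited.\n",
    "\n",
    "```shell\n",
    "# command -v prints the location of a tool and succeeds only if the tool exists.\n",
    "# > /dev/null hides that location, because we only care about found / not found.\n",
    "for tool in wget curl unzip git; do\n",
    "    if command -v \"$tool\" > /dev/null; then\n",
    "        echo \"$tool: found\"\n",
    "    else\n",
    "        echo \"$tool: not found\"\n",
    "    fi\n",
    "done\n",
    "```\n",
    "\n",
    "If wget is missing but curl is found, `curl -L -O https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip` should download the same file: `-L` follows redirects and `-O` keeps the remote file name."
   ]
  },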
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "\\# Expected output\n",
    "\n",
    "```\n",
    "$: wget  https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n",
    "\n",
    "--2016-08-26 09:55:53--  https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n",
    "Resolving github.com... 192.30.253.113\n",
    "Connecting to github.com|192.30.253.113|:443... connected.\n",
    "HTTP request sent, awaiting response... 302 Found\n",
    "Location: https://codeload.github.com/dalloliogm/belgrade_unix_intro/zip/master [following]\n",
    "--2016-08-26 09:55:54--  https://codeload.github.com/dalloliogm/belgrade_unix_intro/zip/master\n",
    "Resolving codeload.github.com... 192.30.253.120\n",
    "Connecting to codeload.github.com|192.30.253.120|:443... connected.\n",
    "HTTP request sent, awaiting response... 200 OK\n",
    "Length: 129112 (126K) [application/zip]\n",
    "Saving to: `master.zip.1'\n",
    "\n",
    "100%[========================================================================================================================================================================>] 129,112      617K/s   in 0.2s\n",
    "\n",
    "2016-08-26 09:55:54 (617 KB/s) - `master.zip' saved [129112/129112]\n",
    "$: unzip master.zip\n",
    "\n",
    "Archive:  master.zip\n",
    "   creating: belgrade_unix_intro-master/\n",
    "  inflating: belgrade_unix_intro-master/PEB Bash Workshop.ipynb\n",
    "  inflating: belgrade_unix_intro-master/README.md\n",
    "   creating: belgrade_unix_intro-master/data/\n",
    "   creating: belgrade_unix_intro-master/data/part1_grep/\n",
    "  inflating: belgrade_unix_intro-master/data/part1_grep/file1.txt\n",
    "  inflating: belgrade_unix_intro-master/data/part1_grep/file2.txt\n",
    "   creating: belgrade_unix_intro-master/src/\n",
    "   creating: belgrade_unix_intro-master/src/data/\n",
    "  inflating: belgrade_unix_intro-master/src/data/README.rst\n",
    "  inflating: belgrade_unix_intro-master/src/generate_grep_exercise.py\n",
    "  ```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Advanced: using git to get the materials\n",
    "\n",
    "If the software git is installed, you can get the materials by the following:\n",
    "\n",
    "```\n",
    "git clone git@github.com:dalloliogm/belgrade_unix_intro.git\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Basic Unix Commands: ls, cd\n",
    "\n",
    "Let's have a look at the files we just downloaded.\n",
    "\n",
    "We will use two basic Unix commands:\n",
    "\n",
    " - **ls** list the number of files in the current directory\n",
    " - **cd** allows to move to a different directory.\n",
    " \n",
    " \n",
    "Typing **ls** will show all the files in the current directory. Among these you should see a folder called **belgrade_unix_intro-master**, created by the wget and unzip commands. \n",
    " \n",
    "Let's move to this new directory, and list the files in it:\n",
    " \n",
    "```\n",
    "$: cd belgrade_unix_intro-master/\n",
    "$: ls\n",
    "```\n",
    "\n",
    "This will show a list of files, including a file called **start_here.txt**, a README, a few folders (data/, src/), and some other files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[0m\u001b[01;34mdata\u001b[0m                     PEB Bash Workshop.slides.html    \u001b[01;34msrc\u001b[0m\r\n",
      "Makefile                 PEB Bioconductor workshop.ipynb  start_here.txt\r\n",
      "PEB Bash Workshop.ipynb  README.md\r\n"
     ]
    }
   ],
   "source": [
    "ls"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Press space or the down key to continue."
   ]
  },
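  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "To practise moving around without touching the workshop files, here is a minimal sketch that builds a scratch directory first. The names `project`, `data` and `README.md` are made up for this example, and `mktemp -d` (which creates an empty temporary directory) and `pwd` are not covered elsewhere in the workshop.\n",
    "\n",
    "```shell\n",
    "workdir=$(mktemp -d)                  # empty temporary directory\n",
    "mkdir -p \"$workdir/project/data\"      # a folder inside a folder\n",
    "touch \"$workdir/project/README.md\"    # an empty file\n",
    "\n",
    "cd \"$workdir/project\"   # move into the project directory\n",
    "ls                      # shows README.md and data\n",
    "cd data                 # move one level down\n",
    "pwd                     # print the directory we are in now\n",
    "cd ..                   # two dots mean the parent directory\n",
    "pwd                     # we are back in project\n",
    "```\n",
    "\n",
    "Typing `cd` with no argument takes you back to your home directory, which is handy if you get lost."
   ]
  },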
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## ls -l\n",
    "\n",
    "You can use the -l option of ls to visualize more details:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "total 160\r\n",
      "drwxrwxr-x 4 gioby gioby  4096 Sep  8 19:02 \u001b[0m\u001b[01;34mdata\u001b[0m\r\n",
      "-rw-rw-r-- 1 gioby gioby   929 Sep  8 21:57 Makefile\r\n",
      "-rw-rw-r-- 1 gioby gioby 83026 Sep  8 22:04 PEB Bash Workshop.ipynb\r\n",
      "-rw-rw-r-- 1 gioby gioby 56603 Sep  8 19:20 PEB Bioconductor workshop.ipynb\r\n",
      "-rw-rw-r-- 1 gioby gioby   260 Sep  5 18:23 README.md\r\n",
      "drwxrwxr-x 3 gioby gioby  4096 Sep  8 21:38 \u001b[01;34msrc\u001b[0m\r\n",
      "-rw-rw-r-- 1 gioby gioby  1877 Sep  5 18:23 start_here.txt\r\n"
     ]
    }
   ],
   "source": [
    "# Contents of the PEB workshop directory\n",
    "ls -l"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Accessing the contents of a file: head, cat, less\n",
    "\n",
    "The new folder contains a file named start_here.txt, containing the first instructions for the workshop.\n",
    "\n",
    "To access the contents of a file, we can use several Unix commands:\n",
    "\n",
    "| command       | description                                   | example             |\n",
    "| :------------:|:-------------------------------------         |:------------------- |\n",
    "| **head**      | print the first lines of the file             | head start_here.txt |\n",
    "| **tail**      | print the last lines of the file              | tail start_here.txt |\n",
    "| **cat**       | print the contents of the file to the screen  | cat start_here.txt  |\n",
    "| **less**      | allows to navigate contents of the file       | less start_here.txt |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the first exercise, type \"head start_here.txt\" and follow the instructions:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " _________________________________________________\r\n",
      "/ To navigate the contents of this file, type:     \\\r\n",
      "|                                                  |\r\n",
      "|        less start_here.txt                       |\r\n",
      "\\                                                  /\r\n",
      " --------------------------------------------------\r\n",
      "        \\   ^__^\r\n",
      "         \\  (oo)\\_______\r\n",
      "            (__)\\       )\\/\\\r\n",
      "                ||----w |\r\n"
     ]
    }
   ],
   "source": [
    "head start_here.txt"
   ]
  },
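  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "By default, head and tail print 10 lines. Here is a minimal sketch of how the `-n` option changes that, using a throwaway file of numbers so that the result is easy to check. The file is created with `seq` and `mktemp`, which are not part of the workshop materials.\n",
    "\n",
    "```shell\n",
    "sample=$(mktemp)        # temporary file with a random name\n",
    "seq 1 20 > \"$sample\"    # write the numbers 1 to 20, one per line\n",
    "\n",
    "head -n 3 \"$sample\"     # first three lines: 1, 2, 3\n",
    "tail -n 2 \"$sample\"     # last two lines: 19, 20\n",
    "```\n",
    "\n",
    "The same option works on start_here.txt, e.g. `head -n 3 start_here.txt`."
   ]
  },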
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Searching patterns into file: grep\n",
    "\n",
    "The instructions for the next exercises are stored in the **data/exercise1_grep.txt** file. \n",
    "\n",
    "However, if you look at this file with head or less, you will see that its contents have no meaning!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "PWapg3ZDqzNWF6v VocznrWXXLTi gIY7Tj0bVx     pmslXrBubMQeoEJXrF0OfHpcpxwTlktHSCm\r\n",
      "spf5840ZpMkpg4tZvgd3z4dxLVLiXnfmtrNaGL9d    BV04Lu18iLMugTwPHRRkLCADC8PKO8jXutZ\r\n",
      "zBTK9i8ya oe4IoxbCZhST4XvDe mrccT7cwYGAD    1SmSareQB5q8wNsAvaA79aqXlIpmBZgmUVR\r\n",
      "4gr mwZcxIg6pQwgddsJa4giM7hzjp8lit49D7kH    upYIZQr8MbyEk4CX Y7k0uMmW9kk1fNJDea\r\n",
      "DMj0BJp wJ8BF3xyd61euAWb4IjOv6paBlKGse3a    buZnpSOJv9PWhQpQnuxmZosVdYFw6TZF3RG\r\n",
      "yxArpyFCKt5637qiASyfadyheMBAp4bccq5furIx    EOgGCEnWGuJwLSmvoehnXBdlbqDS5YN f7k\r\n",
      "T016mr v0mzedsHTFReC3ZjqVuYXpPuTulu8F0Z3    pmr9l96nOUEVXckfdiidZUP6UvFNh4Doaqz\r\n",
      "B0zFnnWEFttxrUjyuHgU9U09wEt7HfHBP1MAstQb    WgxYhtDn3swa5fsmYgtxQKjjbIZzuVszEdl\r\n",
      "qByK4hFg7JQowOAXW60EBXQYSDHFgUWHlJAGYnjO    CoB6YKtvZPaS8H8BRdsuBwdqU3KRz O9oXk\r\n",
      "3Ntf9b6jv7hZsjtfEcaIzMuakpsEjl6i7Mra4M3U    MgWDXcpafKACEA0rUAro9DjHo4VgbBJ6tdj\r\n"
     ]
    }
   ],
   "source": [
    "head data/exercise1_grep.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "This is because I've hidden all the instructions in the file, just to make things more interesting!\n",
    "\n",
    "Press space or the down key to see how to continue."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "I've hidden the instructions for the first exercise in the lines containing the word \"start\" somewhere.\n",
    "\n",
    "To find them, type:\n",
    "\n",
    "```\n",
    "$: grep start data/exercise1_grep\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "NjIM \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KesMsuqZNWhowFxuSFX4IaLymKGYdef    \r\n",
      "McQYo6 umUY816rvtSGjAl \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K DdBaWfylxrZ     ______________________________\r\n",
      "Ewu1xLvv7OrXNWu4otWYoF gdV4U3i\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KdzYlJ    / Congrats!                    \\  \r\n",
      "kDDgWqtLBgY85 PQm8p1ajcAEzbQdb \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K rMv    |  You've used grep correctly, |\r\n",
      "kCVqk6sGesHvBp6 pNLzStgdhKu \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9YQQNI    \\  and found a cow.            /\r\n",
      "tLMfr \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KUY36ToJEfE4uqIAQ3JboyoBOFyL8s     ------------------------------\r\n",
      " bWKJdeuL \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KI4xvVOZxwyC7oMKHaoG5ePF4k            \\   ^__^\r\n",
      "fThKk5wk \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KAo6IzNddHcxuj93oFRam0mneoF             \\  (oo)\\_______\r\n",
      "aw\u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kyw5tH3FzetzVxhw8c VrV7Uyis 5q8Yvj                (__)\\       )\\/\\ \r\n",
      "QgF4gHcEbAz \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9ZbzG90fUafm64BIlTEIIr                    ||----w |\r\n",
      "XKWA \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K6DpuNhYTXTcmo0UtCGa4SUo4JvnwvD                    ||     ||\r\n",
      "ZA0BrMOyH99y7VY97lkomNXHUJUv8MWg \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K R    \r\n",
      "YgOF6ahX0hEhMf \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KTZd 1 wDtbgoa86I8Atk    \r\n",
      "vln0CxxjrcgeeQ5EtPdG0Spx7\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KIAT35hzj 6    \r\n",
      "xglkzByTDeiIKyoZbCQbO4br\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K rb39 ExT9F    The command grep allows to search\r\n",
      "5\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K7zPVWugW3vb9 mYBIxsuIVxhHUdIxiTgFZ    for a pattern in a text file.\r\n",
      "CLcSSkWNF0tHLOluZr43qptA \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KxojHnAwbmJ    \r\n",
      "Fp \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K ZwtvOeMtDld9oahg9rdmBvKtIjPqXFQ    It will print all the matching \r\n",
      "VOsyrwG4UOEsdfYLOfGFGKZWEvtJse \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9Ra    lines to the screen.\r\n",
      "Sv81TKcZ8Fx1lb7xPZVMxW4ODNoKg8p7IHZ\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K    \r\n",
      "GlbV\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KpQQ5eQDweIn0VAGC8bQLbQ0Dzw4Ggvt    =================\r\n",
      "kziTL5jTi\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K pijnXXmWRApPCC 19SUNHN8n7    Next Exercise\r\n",
      "v768DQ0dRCix6 \u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kc0me0SF qIsYfeC704lam    =================\r\n",
      "vebdjvHTd\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K RxBxhJayFkmRXqyOvqg5khG4O    \r\n",
      "QorxdcpNP1utzB\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K6WpDOX4YzyIFpkZEalKW4    In the next exercise we will see \r\n",
      "GipBzz4Ul5sj3hVmVkQvPg \u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kz9v6AF91EirG    how to access grep's documentation.\r\n",
      "CC09wNO65rwuCqUgi8Skg1NZ0SGR7WDUoVT\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K    \r\n",
      "fjT7Ag59 RuhusLFzU \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KGHFvKsYSp bnNsLG    Grep the following word to continue:\r\n",
      "Zx6RINR3hk \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K667gnhTiLYLiB30MxX7irwVP      _            _        \r\n",
      "T0aoAQpfbNkO8LkSzSLJkLVEaXNxzQ \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KVoL3     | |          | |       \r\n",
      "Nv0hZYvh0pHN0AlT BN\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K C8pMzkIs7usQUWd     | |__    ___ | | _ __  \r\n",
      "5  \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K  9Rq5tBOFDxiExQrRlPgCXoWt43a3US     | '_ \\  / _ \\| || '_ \\ \r\n",
      "46unsRj3c4ClXQvcoFPyE9cnRHDQOHFNNZ\u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kc     | | | ||  __/| || |_) |\r\n",
      "40H3 \u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kj6glBCFXqOhMH3BEdgBsPPQuBbOt6D     |_| |_| \\___||_|| .__/ \r\n",
      "Qam1yoNK3BCpwSyhRX8Wb3rA1U\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K djDiAHuT                     | |    \r\n",
      "PFW \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KKxEzUChDZGSPQNj4gsTS5k1JMBvWuY                      |_|  \r\n",
      "1bS5w1uaq65\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KnVRWYkojLFMSkMjui8YYz1A5    \r\n",
      "g0 g8iyP \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KQqkz7F05ST C S73TpreeesnFm    \r\n"
     ]
    }
   ],
   "source": [
    "grep start data/exercise1_grep.txt"
   ]
  },
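  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The same kind of search can be tried on a tiny file, where it is easy to see which lines match. This is only a sketch with made-up contents; the `grep PATTERN FILE` form is the one used in the exercise above.\n",
    "\n",
    "```shell\n",
    "notes=$(mktemp)    # temporary file with a random name\n",
    "printf '%s\\n' 'xx start xx first line' 'nothing to see here' 'START in capitals' 'restarted' > \"$notes\"\n",
    "\n",
    "# Prints the first and the last line: both contain the letters 'start'.\n",
    "# 'START in capitals' is skipped, because grep is case-sensitive by default.\n",
    "grep start \"$notes\"\n",
    "```\n",
    "\n",
    "Note that grep matches the pattern anywhere in a line, even inside a longer word such as \"restarted\"."
   ]
  },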
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Accessing grep documentation\n",
    "\n",
    "To access the documentation of a command, we can use the **man** command.\n",
    "\n",
    "Let's type the following:\n",
    "\n",
    "```\n",
    "$: man grep\n",
    "```\n",
    "\n",
    "This will allow to navigate the documentation for grep, in the same modality as with the **less** command. Use the arrows to scroll, and q to exit.\n",
    "\n",
    "For the next exercise, you will have to identify two options in the man page, and use them to do a case-insensitive search for \"ignorecase\", and count the number of lines."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "td4kwN6cV0kqU3qMkwYOHl9MqjTQ \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K MP6MrF    \r\n",
      "GY \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K yEvNL3RuVQYqiumisBftk8irLzXwt61y    The documentation for grep can \r\n",
      "MTEwA\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KhQapljO9yUtucAiNpvZrKdbwc3KUcsu    be accessed through man:\r\n",
      "NVD5n5HKKKz6GgDmyOGMlKSMTd \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K Rii BCjC    \r\n",
      "ku1GNL6IpSHBvcGroqpHgbMUNCg3Yz3l\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KnOBy        $: man grep\r\n",
      "XkHkwMI\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K hidg7uJURR6loj5IAwv9oyeIUmqT    \r\n",
      "sGKar9AKY \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K VhNi3MlGzT3WjAQdpWbvtuWeb    Scroll down to see all the \r\n",
      "jHLbw4whFT1B\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KDfqZqhjXRYPjF0y7pkM8g3z3    parameters for grep and their description.\r\n",
      "hZ1OQgKcsgZo \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K m4s64C8nSR5zM gU4fYObu    \r\n",
      "9jlkynOW\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KLTaeswR5UnouUc3Ipsd4OjVI5PFO    Use / to search for text.\r\n",
      "k4pjhosSNRgJlr7kt\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KAvkWOHszFMoP yPEbgT    Press the q key to exit.\r\n",
      "DPdj3lg4P6 UtuibInP\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K ErdkiRKtYHTKDAJ     \r\n",
      "5Ru8b5\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K cCuwtAVpbxoqHtK70dT9vtw5NsZR8    \r\n",
      "O\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K hL9xy8U77RDjkZsRX6WDZf8ywnBY83LiL5    \r\n",
      "8nhxNAlz3yHtfZFEBjwvnKPFB \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K qUV YkeQX    ==============\r\n",
      "FU4nt\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K RStwArEw6UGFM4O7kKlxItNqVfD8Jl    Next exercise\r\n",
      "IcOj \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K P56z3xD7QRkE1admG5sNTulg7B38om    ==============\r\n",
      "FSnpSHWkiELodvyTu Tx\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K uAWnw9UTW0ZPITU    \r\n",
      "iCu7cLxdU0vLMBo\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K46htatY8jYJC6XXVNDHTF    For the next exercise, you will need to open \r\n",
      "qLbjcY\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K cp8USrp6u51ainbnXsp DForAbOq3    grep's documentation and identify two options:\r\n",
      "LPTakIcUOWmROON8GPJ4szSpKqZn3c \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K k5jK    \r\n",
      "graSN0 cI4H6Zl \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K hCxcK0ynPImVu0Mogdcw    - the option for case-insensitive searches\r\n",
      "MSewEHXyuatdRzy9GSokR DaLKp\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KfDDJd7u p    - the option for counting \r\n",
      "r7k6b1c9XDZcWnxH syn9peY uNq\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KjKyOyg0T      the number of matching lines, \r\n",
      "zi4Rycq58rmxjH zW1AhCWAO1s\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KSyViqAbyAC      instead of printing them to the screen.\r\n",
      "CNx6GsFSs\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K iRQE6pdA0jJiStNjknOaoQPSD     \r\n",
      "ial36NIIePB7P5 \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K tpJ6bnVvVv7gESXp1Apc    Once you have identified these options, \r\n",
      "A9HSI nKdCcuDp8WGEFkbWE8gJsUAZatatIO\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K    do a case-insensitive search on this file for the word \r\n",
      "erU3 7ppIkaPoqBFCFkFFYMo\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K HxVST S9fFj    \"ignorecase\", then count the number of lines.\r\n",
      "lwjWEVzMBJSZiRSXvJzQmePQPFKeL4OQdO\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K R    \r\n",
      "8P5kONdSaqg0tolHUGq8nN9brT7k \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K 6duGCw        \r\n"
     ]
    }
   ],
   "source": [
    "grep help data/exercise1_grep.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p14PGGX\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KDoCCJ9sYiegaozfL6LXxDmf    \r\n",
      "o6m1cg7C\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KUJbpjD laYkpG6gdBHbJIM            \r\n",
      "aNqS0Tg4kVIeLlyDeYoBlalps0\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[Kw5dd    Remember that, to continue with the exercise, \r\n",
      "bRe7rR0sM8mcf8W1woMoReyj\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KLtPrHA    you need to do a case-insensitive search for the word\r\n",
      "erU3 7ppIkaPoqBFCFkFFYMohelp HxVST S9fFj    \"\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[K\", then count the number of lines.\r\n",
      "w\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KTt0lDGMb5KCFWEm4t8RmBNXtLvURX                    ||----w |\r\n"
     ]
    }
   ],
   "source": [
    "# If we do a search for \"ignorecase\" without any option, we only get some of the lines.\n",
    "# You can notice that the cow is not properly displayed :-)\n",
    "grep ignorecase data/exercise1_grep.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p14PGGX\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KDoCCJ9sYiegaozfL6LXxDmf    \r\n",
      "o6m1cg7C\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KUJbpjD laYkpG6gdBHbJIM            \r\n",
      "aNqS0Tg4kVIeLlyDeYoBlalps0\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[Kw5dd    Remember that, to continue with the exercise, \r\n",
      "bRe7rR0sM8mcf8W1woMoReyj\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KLtPrHA    you need to do a case-insensitive search for the word\r\n",
      "erU3 7ppIkaPoqBFCFkFFYMohelp HxVST S9fFj    \"\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[K\", then count the number of lines.\r\n",
      "1ofqHyPgr74Vx 0vUkETWFA\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[Ku8SJQ5C    \r\n",
      "1 vfC7\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KMUtRWYq3KGKJpR8koi7FhtzX     _____________\r\n",
      "OTMODZfX1gD9l38Tu9PEQZrshVzL\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KbI    / Good Job!   \\ \r\n",
      "u7YtPPNnVLSzB8HCBvtOcIHey0X8Wt\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[K    | You did a   |\r\n",
      "QfX1XYVyUHpwU\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KpT fi6GkHvOkG LDb    | case-insens |\r\n",
      "Vw4ePnDoZ4KxNs58pWlGMoFVc\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KpQj 6    | itive       |\r\n",
      "fN4SOVBxl6\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KeJ5Ldyb0y4PLVSL1ZCv7    \\ search      /\r\n",
      "mmNqW04FRacds3eYb\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KRk5rFhFpKahDt     -------------\r\n",
      "ZgQZAYDnIE7Jk4PLhZ10gx\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KpxQqxB4t            \\   ^__^\r\n",
      "50FY1806\u001b[01;31m\u001b[KignOrecase\u001b[m\u001b[K6DzXRGwihWPeO3J gjHsDG             \\  (oo)\\_______\r\n",
      "QxAIpmflI jFcJ\u001b[01;31m\u001b[KignOrecase\u001b[m\u001b[KQM06LNCSX lftJUX                (__)\\       )\\/\\ \r\n",
      "w\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KTt0lDGMb5KCFWEm4t8RmBNXtLvURX                    ||----w |\r\n",
      "w1EeylvQJWMF\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KWavz4 ICR89dkvr6sf                    ||     ||\r\n",
      "wayAmo30uEjxkMyJvis\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KkwshDQX DGB    \r\n",
      "45s7W ggf\u001b[01;31m\u001b[KignOrecase\u001b[m\u001b[KUYiHjY0F6BWSqqDfZ6c F    \r\n",
      "zmqy\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[Kqo5w9DIs0DGFlDayGlVaheoIlO        \r\n"
     ]
    }
   ],
   "source": [
    "# The -i option allows to do a case-insensitive search.\n",
    "# As you can see, some lines contain upper case characters:\n",
    "grep -i ignorecase data/exercise1_grep.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "21\r\n"
     ]
    }
   ],
   "source": [
    "# To solve the exercise, we also have to count the number of output lines.\n",
    "# This can be done with the \"-c\" option:\n",
    "grep -i -c ignorecase data/exercise1_grep.txt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "tidjh\u001b[01;31m\u001b[K21\u001b[m\u001b[KyvuMNPDEma8t6PksdgTVkimf6F8LHegXf    \r\n",
      "OllivZL3QFq8OiobDOQjdrPT1KeqT\u001b[01;31m\u001b[K21\u001b[m\u001b[K bRG WMRc    \r\n",
      "eCmkBM\u001b[01;31m\u001b[K21\u001b[m\u001b[KOATsb57fD9ao6czsMB1f7gtWvJCFAW3z     ____________________\r\n",
      "YCOQlk1yUmr8EjN3NBxEB0SSToh\u001b[01;31m\u001b[K21\u001b[m\u001b[KXfpm BiVHS7    / Congrats! Yes      \\ \r\n",
      "JCsq1gs3drLCHAerYroSp331AJMHr\u001b[01;31m\u001b[K21\u001b[m\u001b[Km9Atm4UMR    | the answer to the  |\r\n",
      "z3nfFTpzSKGHfdDwtIadMjgiYx\u001b[01;31m\u001b[K21\u001b[m\u001b[Kiiat3S9VVT8R    | case-insensitive   |\r\n",
      "0qBEpfp1dcTibKVwObda341CTH9zoYJpBFe8\u001b[01;31m\u001b[K21\u001b[m\u001b[Kyy    | and count question |\r\n",
      "KJIsvaofywLv6uz1\u001b[01;31m\u001b[K21\u001b[m\u001b[K6aZlUBQ3XBJd1jVC5bdHAE    | is \u001b[01;31m\u001b[K21\u001b[m\u001b[K.             /\r\n",
      "jy0FgakHM4Tq7ncjhUN\u001b[01;31m\u001b[K21\u001b[m\u001b[KggkNyZhNhJC4eyz ESN     --------------------\r\n",
      "xwwOmWdp5pJ8IsvtNMx9EnWOnjmuUEdt4o8d\u001b[01;31m\u001b[K21\u001b[m\u001b[Kzc            \\   ^__^\r\n",
      "k azZdXgjRGFYTHuMIp0SFkwjp4vHRG1lnlmSj\u001b[01;31m\u001b[K21\u001b[m\u001b[K             \\  (oo)\\_______\r\n",
      "jYe19iH7NaYtPGDC7mXoy5G7\u001b[01;31m\u001b[K21\u001b[m\u001b[Ks8EGrD8wFCZSlJ                (__)\\       )\\/\\  \r\n",
      "CXUYNxwnP8jr3NR5T9SCl5TQAwJI5ZjNCm zw\u001b[01;31m\u001b[K21\u001b[m\u001b[KY                    ||----w |\r\n",
      "l\u001b[01;31m\u001b[K21\u001b[m\u001b[KFpLp HaLHc1MaoMXflHI4wr981PUNefC0cKDC                    ||     ||\r\n",
      "fC1BsEyvpDm cCnceoQCj3\u001b[01;31m\u001b[K21\u001b[m\u001b[Kv36bmPx5u9Ht6qxs    \r\n",
      "VAAh4PTzYzWSbMxmtDE8XtwYqSu8KFq50\u001b[01;31m\u001b[K21\u001b[m\u001b[KycKLY    \r\n",
      "WmHhzfH8XzJ4Dd3PvgMoIXAnoJJG3G9HlGUtD\u001b[01;31m\u001b[K21\u001b[m\u001b[Kd    =============\r\n",
      "yCrjC\u001b[01;31m\u001b[K21\u001b[m\u001b[KuBDHKBR1P0XVXQp9XE6T7Nqa6C p8ZQ4H    Next exercise\r\n",
      "zfa7If6rzhvuv O6HFHU\u001b[01;31m\u001b[K21\u001b[m\u001b[KcbLnpW0Yipf3xSKJSS    =============\r\n",
      "FPgwt6n3mfTJtartXVwrMAtmn3ISF\u001b[01;31m\u001b[K21\u001b[m\u001b[KyiK0U9NH4    \r\n",
      "PnV\u001b[01;31m\u001b[K21\u001b[m\u001b[KlkRoTqqVP 9Hs4v4RlJLFdOx6LkhICM WW1    Searching in multiple files\r\n",
      "uaRh\u001b[01;31m\u001b[K21\u001b[m\u001b[K9wTTl0wCVin63cfrywW06LwQOb vx1k5Uu    \r\n",
      "FTOHCTMDFlKj cNVu\u001b[01;31m\u001b[K21\u001b[m\u001b[KDgKqN1EZxhU1iPyGRrko1    Grep can search the same pattern\r\n",
      "EfzALglVAh8cPso5WmyYi8v1QG0c\u001b[01;31m\u001b[K21\u001b[m\u001b[KLUTKPqw66N    in more than one file at the same time.\r\n",
      "rGWMTbnXJnehtyAY3vxTJWUdaUXH MxFnyA\u001b[01;31m\u001b[K21\u001b[m\u001b[KfUN    \r\n",
      "l0UcWWd0LG0GeFwNKlGEyj07pbUOPTee1\u001b[01;31m\u001b[K21\u001b[m\u001b[Kt0MsN    The folder data/multiplefiles/ contains hundreds of different files.\r\n",
      "Ow7gE6ZNvIGLP775npX6j5menzWz4\u001b[01;31m\u001b[K21\u001b[m\u001b[KHg00qDP3w    \r\n",
      "uWhIZ4kk6cI7d9503RXAniriZjemCOZ\u001b[01;31m\u001b[K21\u001b[m\u001b[KJ7BTCBt    Can you identify the file containing the word \"regex\"?\r\n",
      "Wxd46\u001b[01;31m\u001b[K21\u001b[m\u001b[KC JxW68aYMWbeCMY0eVMtTqF8iAfhqazV    \r\n",
      "iekhxfE5LpZ\u001b[01;31m\u001b[K21\u001b[m\u001b[KqUxwIjXpYtMchz489rzXtZ0 VOU        \r\n"
     ]
    }
   ],
   "source": [
    "# solution: how to find the instructions for the next exercise\n",
    "grep 21 data/exercise1_grep.txt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Searching multiple files\n",
    "\n",
    "Grep is useful to search over multiple files in a single command.\n",
    "\n",
    "The folder data/multiplefiles/ contains 50 randomly generated files. You can see their contents with head data/multiplefiles/\\* or with less.\n",
    "\n",
    "One of these files contains the word \"regex\" in it. Are you able to find it?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K5gsumFTKbKEJv9dD8W94FhoEQU8qf8RMUc\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KR    \r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KYgDiqkA C1o\u001b[01;31m\u001b[Kregex\u001b[m\u001b[K9giqI66c3sOwfLirOsgPpSuq    \r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KIsXSnp 8U8pKR0LsVuK\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KO5GFegOtV4GW4fNQ    Good! You've found the \r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[Kl 4px8KhPRmfEJgi5uTuVO1XahG3H1sY\u001b[01;31m\u001b[Kregex\u001b[m\u001b[K4wt    file containing the word \"\u001b[01;31m\u001b[Kregex\u001b[m\u001b[K\"\r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[Kyz8P5 HC6N5D XRHPncZjTAeM\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KT9bQUoZdsh    \r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KeWUd18s0MVx5YYrEK KCKeF5hvO\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KIiZbIGUX    To continue, \r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KMLXiKZJ8KyHMou9lYsz4ZjFYJSfB 14t\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KtpJ    grep file32.txt data/exercise1_grep.txt\r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KveFQU\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KfnQxwQw6POJRNvvAeYwToX6ptvN39m    \r\n",
      "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KcHoNv\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KiGjHkmptPVTjOzvWVGbrGoHoywV4Vy    \r\n"
     ]
    }
   ],
   "source": [
    "# solution: you can use the \"*\" character to specify multiple files:\n",
    "grep 'regex' data/multiplefiles/*"
   ]
  },
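  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you only need the name of the file, and not the matching lines, grep's -l option prints each matching file name once. Here is a minimal sketch on three throwaway files (their names and contents are made up for this illustration, they are not part of the workshop data):\n",
    "\n",
    "```shell\n",
    "# create a scratch folder with three small files (made-up contents)\n",
    "tmpdir=$(mktemp -d)\n",
    "echo 'random text' > \"$tmpdir/file1.txt\"\n",
    "echo 'abc regex def' > \"$tmpdir/file2.txt\"\n",
    "echo 'more random text' > \"$tmpdir/file3.txt\"\n",
    "\n",
    "# -l prints only the names of the files containing a match (here file2.txt)\n",
    "grep -l 'regex' \"$tmpdir\"/*\n",
    "```\n",
    "\n",
    "The same option should work on the workshop folder, e.g. grep -l 'regex' data/multiplefiles/\\*"
   ]
  },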
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Searching multiple patterns and the Unix piping system\n",
    "\n",
    "How can we search that contain two or more patterns?\n",
    "\n",
    "One solution is to use the Unix piping system, executing one grep command, and then another grep on the output.\n",
    "\n",
    "This can be done using the pipe \"|\" symbol, like the following:\n",
    "\n",
    "```\n",
    "$: grep (first pattern) myfile.txt | grep (second pattern)\n",
    "```\n",
    "\n",
    "Press space or the down key for some examples."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The file data/genes/mgat_genes.gb is a genbank file. Notice how this format is well suited for grep searches:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "LOCUS       HUMUDPCNA               4705 bp    DNA     linear   PRI 19-SEP-1995\r\n",
      "DEFINITION  Human alpha-1,3-mannosyl-glycoprotein beta-1,\r\n",
      "            2-N-acetylglucosaminyltransferase (MGAT) gene, complete cds.\r\n",
      "ACCESSION   M61829\r\n",
      "VERSION     M61829.1  GI:340075\r\n",
      "KEYWORDS    alpha-1,3-mannosyl-glycoprotein beta-1,2-N-acetylglucosaminyltrae.\r\n",
      "SOURCE      Homo sapiens (human)\r\n",
      "  ORGANISM  Homo sapiens\r\n",
      "            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\r\n",
      "            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\r\n"
     ]
    }
   ],
   "source": [
    "head data/genes/mgat_genes.gb"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Let's say we want to search all the lines where \"ORGANISM\" is \"Homo sapiens\". \n",
    "\n",
    "We can do it with two grep commands:\n",
    "\n",
    "```\n",
    "grep ORGANISM data/genes/mgat_genes.gb | grep 'Homo sapiens'\n",
    "```\n",
    "Notice that searching for \"Homo sapiens\" alone would not be enough, as there are other lines where the word \"Homo sapiens\" is present."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n",
      "  ORGANISM  \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n"
     ]
    }
   ],
   "source": [
    "grep ORGANISM data/genes/mgat_genes.gb | grep 'Homo sapiens'"
   ]
  },
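  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since every command in a pipe simply reads the output of the previous one, options can be added at any step. As a small sketch, the -c option seen earlier can be put on the last grep to count the lines matching both patterns. The lines below are a made-up imitation of the GenBank layout, written to a temporary file; they are not the real data:\n",
    "\n",
    "```shell\n",
    "# a few made-up lines imitating the GenBank layout\n",
    "sample=$(mktemp)\n",
    "cat > \"$sample\" <<'EOF'\n",
    "SOURCE      Homo sapiens (human)\n",
    "  ORGANISM  Homo sapiens\n",
    "SOURCE      Bos taurus (cattle)\n",
    "  ORGANISM  Bos taurus\n",
    "  ORGANISM  Homo sapiens\n",
    "EOF\n",
    "\n",
    "# the first grep keeps the ORGANISM lines, the second counts those with Homo sapiens\n",
    "grep ORGANISM \"$sample\" | grep -c 'Homo sapiens'   # prints 2\n",
    "\n",
    "# for comparison: a single grep also counts the SOURCE line\n",
    "grep -c 'Homo sapiens' \"$sample\"                   # prints 3\n",
    "```"
   ]
  },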
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The file contains sequences from two other organisms apart from Homo sapiens. Can you guess which one to search for the next exercise?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t   _______________\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t          < Well guessed! >\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t   ---------------\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t\\   ^__^\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t \\  (oo)\\_______\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t  \t\t    (__)\\       )\\/\\\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t\t||----w |\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t\t||     ||\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t===============\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\tNext Exercise\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t===============\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\tTo continue, grep\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\"regex\" in\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\tdata/exercise1_grep.txt\r\n",
      "  ORGANISM  Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n"
     ]
    }
   ],
   "source": [
    "# Solution: grep for \"bos taurus\":\n",
    "grep ORGANISM data/genes/mgat_genes.gb | grep taurus"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Regular Expressions\n",
    "\n",
    "Regular expressions allow to search for more complex patterns. \n",
    "\n",
    "Here are some simple regular expression examples:\n",
    "\n",
    "| regex    | description |\n",
    "| -------- | ----------- |\n",
    "| .        | matches any character |\n",
    "| [A-Za-z] | matches any of the characters within parenthesis|\n",
    "| .\\*      | matches any character, any number of times| "
   ]
  },
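  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To get a feeling for the three patterns in the table, here is a small sketch that runs them on a made-up word list (the words and the temporary file are only for illustration):\n",
    "\n",
    "```shell\n",
    "# write five made-up words to a temporary file, one per line\n",
    "words=$(mktemp)\n",
    "cat > \"$words\" <<'EOF'\n",
    "cat\n",
    "cut\n",
    "c4t\n",
    "coat\n",
    "ct\n",
    "EOF\n",
    "\n",
    "# . matches exactly one character: prints cat, cut, c4t\n",
    "grep 'c.t' \"$words\"\n",
    "\n",
    "# [A-Za-z] matches exactly one letter: prints cat, cut\n",
    "grep 'c[A-Za-z]t' \"$words\"\n",
    "\n",
    "# .* matches any run of characters, even an empty one: prints all five lines\n",
    "grep 'c.*t' \"$words\"\n",
    "```"
   ]
  },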
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Regular Expression exercise\n",
    "\n",
    "Let's have a look at the file data/genes/sequences.fasta:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">seq000 sequence description\r\n",
      "NANTCNNNGNATNNCNTACNTGNTCGNCCG\r\n",
      ">seq001 sequence description\r\n",
      "TGTCAATTNTCNGCGTCNNACNNACTCGCN\r\n",
      ">seq002 sequence description\r\n",
      "TGGGCGNTCATGNANAATGTTACGCTCNGG\r\n",
      ">seq003 sequence description\r\n",
      "GCCTTTNGGNNCTCACTANGCANGTTTGAN\r\n",
      ">seq004 sequence description   \r\n",
      "CATNANNAAAccTTTAGGCACTCNACACNG\r\n"
     ]
    }
   ],
   "source": [
    "head data/genes/sequences.fasta "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Can you use grep to identify all the sequences containing three As, followed by any two characters, followed by three Ts?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CATNANN\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KAGGCACTCNACACNG\r\n",
      "AGGCCGCNGNGGTA\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KACNAAGAC\r\n",
      "GTGTNNGTCAAGCNCGNCGTTN\u001b[01;31m\u001b[KAAAGGTTT\u001b[m\u001b[K\r\n",
      "ATNCGNAGNNCANTNGAC\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KTTGT\r\n",
      "NTACNTA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KCCACTNTTTANTCAA\r\n",
      "CNNGAGCG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KNGCAAGTCTGNNCN\r\n",
      "CATGGGC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KATANATGTNAANCNT\r\n",
      "GGTGGGNNCCCAGNCGNC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KNNCT\r\n",
      "GA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KTCNAACTTTNNAATAANNCN\r\n",
      "GAACAAGCNGCCCTTGGCC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KGNC\r\n",
      "NNTCGTNGNNNA\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KTAAGAGCACC\r\n",
      "NNNGNGTNG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KTTGAACGANNAAT\r\n",
      "CC\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KNTAGNAGCCTGTAGAGCCGC\r\n"
     ]
    }
   ],
   "source": [
    "grep 'AAA..TTT' data/genes/sequences.fasta"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Bonus: if we use the -B 1 grep option, we can retrieve the names of these sequences:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ">seq004 sequence description   \r\n",
      "CATNANN\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KAGGCACTCNACACNG\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq012 sequence description    ________________\r\n",
      "AGGCCGCNGNGGTA\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KACNAAGAC\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq015 sequence description\r\n",
      "GTGTNNGTCAAGCNCGNCGTTN\u001b[01;31m\u001b[KAAAGGTTT\u001b[m\u001b[K\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq022 sequence description   / Congrats! This \\ \r\n",
      "ATNCGNAGNNCANTNGAC\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KTTGT\r\n",
      ">seq023 sequence description   | was the last  |\r\n",
      "NTACNTA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KCCACTNTTTANTCAA\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq029 sequence description   \\ grep exercise / \r\n",
      "CNNGAGCG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KNGCAAGTCTGNNCN\r\n",
      ">seq030 sequence description    ----------------\r\n",
      "CATGGGC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KATANATGTNAANCNT\r\n",
      ">seq031 sequence description       \\  ^__^\r\n",
      "GGTGGGNNCCCAGNCGNC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KNNCT\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq033 sequence description        \\ (oo)\\_______\r\n",
      "GA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KTCNAACTTTNNAATAANNCN\r\n",
      ">seq034 sequence description          (__)\\    )\\/\\ \r\n",
      "GAACAAGCNGCCCTTGGCC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KGNC\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq038 sequence description           ||----w |\r\n",
      "NNTCGTNGNNNA\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KTAAGAGCACC\r\n",
      ">seq039 sequence description           ||    ||\r\n",
      "NNNGNGTNG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KTTGAACGANNAAT\r\n",
      "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n",
      ">seq041 sequence description   \r\n",
      "CC\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KNTAGNAGCCTGTAGAGCCGC\r\n"
     ]
    }
   ],
   "source": [
    "grep -B1 'AAA..TTT' data/genes/sequences.fasta "
   ]
  },
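  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The -B option has a counterpart, -A, which prints the lines after each match (both are available in GNU and BSD grep). Here is a minimal sketch on a made-up two-record FASTA file, assuming one sequence line per record as in sequences.fasta:\n",
    "\n",
    "```shell\n",
    "# two made-up FASTA records in a temporary file\n",
    "fasta=$(mktemp)\n",
    "cat > \"$fasta\" <<'EOF'\n",
    ">seqA made-up record\n",
    "AAAccTTTGG\n",
    ">seqB made-up record\n",
    "GGGCCCGGGC\n",
    "EOF\n",
    "\n",
    "# -B1: the matching sequence plus the line before it (its header)\n",
    "grep -B1 'AAA..TTT' \"$fasta\"\n",
    "\n",
    "# -A1: the matching header plus the line after it (its sequence)\n",
    "grep -A1 '>seqB' \"$fasta\"\n",
    "```"
   ]
  },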
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "run_control": {
     "frozen": false,
     "read_only": false
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq004 sequence description   \r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq012 sequence description    ________________\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq015 sequence description\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq022 sequence description   / Congrats! This \\ \r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq023 sequence description   | was the last  |\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq029 sequence description   \\ grep exercise / \r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq030 sequence description    ----------------\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq031 sequence description       \\  ^__^\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq033 sequence description        \\ (oo)\\_______\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq034 sequence description          (__)\\    )\\/\\ \r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq038 sequence description           ||----w |\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq039 sequence description           ||    ||\r\n",
      "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq041 sequence description   \r\n"
     ]
    }
   ],
   "source": [
    "# Bonus: pipe an additional grep '>' to see a cow:\n",
    "grep -B1 'AAA..TTT' data/genes/sequences.fasta  | grep '>'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Working with tabular files: Awk\n",
    "\n",
    "\n",
    "The **awk** command allows to search and manipulate tabular files from the command line.\n",
    "\n",
    "Imagine it as the equivalent of Excel/Calc for the command line. It allows to do search on specific columns of a file, to do numerical operations, or to change the order of the columns.\n",
    "\n",
    "The advantage of a command-line tool over graphical software is that the memory footprint is much lower. So you can access and modify large files in a fraction of the time that it would take with Excel."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Example of tabular file: the GFF3 format\n",
    "\n",
    "The file data/genes/chr8.gff contains an example of file in the GFF3 format:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##gff-version 3\r\n",
      "##source-version refgene 1.28.10\r\n",
      "##date 2016-09-08\r\n",
      "##genome-build .\thg19\r\n",
      "chr8\trefgene\tgene\t18248755\t18258723\t.\t+\t.\tgene_id=10;symbol=NAT2;;ID=10\r\n",
      "chr8\trefgene\tgene\t100549014\t100549089\t.\t-\t.\tgene_id=100126309;symbol=MIR875;;ID=100126309    \r\n",
      "chr8\trefgene\tgene\t144895127\t144895212\t.\t-\t.\tgene_id=100126338;symbol=MIR937;;ID=100126338\r\n",
      "chr8\trefgene\tgene\t145619364\t145619445\t.\t-\t.\tgene_id=100126351;symbol=MIR939;;ID=100126351\r\n",
      "chr8\trefgene\tgene\t91970706\t91997485\t.\t-\t.\tgene_id=100127983;symbol=C8orf88;;ID=100127983\r\n",
      "chr8\trefgene\tgene\t74332309\t74353753\t.\t+\t.\tgene_id=100128126;symbol=STAU2-AS1;;ID=100128126\r\n"
     ]
    }
   ],
   "source": [
    "head data/genes/chr8.gff"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see it is a tab-separated file, which we could easily read in Excel or Calc.\n",
    "\n",
    "The format specifications are defined [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format3), but in short:\n",
    "\n",
    "- the first, fourth and fifth columns contain the chromosome name and coordinates\n",
    "- the second column describes the tool or resource that generated the annotation\n",
    "- the third column describe the type of feature (e.g. gene, transcript, exon, TF binding site, Histone Acetylation mark, etc...\n",
    "- the ninth column contains several fields, separated by a semicolon\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Basic AWK syntax: filters\n",
    "\n",
    "The basic AWK syntax is the following:\n",
    "\n",
    "```\n",
    "awk 'filters {print statements}' filename\n",
    "```\n",
    "\n",
    "Awk is quite smart at recognizing the field separator, and by default assumes they are separated by tabs.\n",
    "\n",
    "Each column of the file can be referred to with the dollar sign followed by the number of column.\n",
    "\n",
    "For example $2 refers to the second column, and so on.\n",
    "\n",
    "The following code filters all the lines belonging to chromosome 8, between the coordinates 100000 and 200000:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chr8\trefgene\tgene\t182200\t197339\t.\t+\t.\tgene_id=169270;symbol=ZNF596;;ID=169270\r\n",
      "chr8\trefgene\tgene\t116086\t117024\t.\t-\t.\tgene_id=441308;symbol=OR4F21;;ID=441308\r\n",
      "chr8\trefgene\tgene\t158345\t182318\t.\t-\t.\tgene_id=644128;symbol=RPL23AP53;;ID=644128\r\n"
     ]
    }
   ],
   "source": [
    "awk '$1==\"chr8\" && $4>100000 && $5<200000 ' data/genes/chr8.gff"
   ]
  },
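  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filter above has no print statement, so awk prints the whole matching line. Adding a {print ...} block lets you choose which columns to show, reorder them, or compute new values. Here is a minimal sketch on three made-up tab-separated records (the 'demo' source and the coordinates are invented for illustration), with the separator set explicitly through -F:\n",
    "\n",
    "```shell\n",
    "# three made-up records: chromosome, source, type, start, end\n",
    "gff=$(mktemp)\n",
    "printf 'chr8\\tdemo\\tgene\\t1000\\t2000\\n' > \"$gff\"\n",
    "printf 'chr8\\tdemo\\tgene\\t5000\\t9000\\n' >> \"$gff\"\n",
    "printf 'chr1\\tdemo\\tgene\\t1500\\t1800\\n' >> \"$gff\"\n",
    "\n",
    "# keep chr8 features starting after 1500, then print\n",
    "# chromosome, start, end and the difference end - start\n",
    "awk -F '\\t' '$1 == \"chr8\" && $4 > 1500 {print $1, $4, $5, $5 - $4}' \"$gff\"\n",
    "# prints: chr8 5000 9000 4000\n",
    "```\n",
    "\n",
    "The values listed in print are separated by a comma, which makes awk put a single space between them in the output."
   ]
  },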
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "#### Exercise\n",
    "\n",
    "Can you print all the lines between 5000000 and 10000000 ?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chr8\trefgene\tgene\t7143733\t7212876\t.\t-\t.\tgene_id=100128890;symbol=FAM66B;ID=100128890\r\n",
      "chr8\trefgene\tgene\t7215498\t7220490\t.\t-\t.\tgene_id=100131980;symbol=ZNF705G;ID=100131980\r\n",
      "chr8\trefgene\tgene\t7812535\t7866277\t.\t+\t.\tgene_id=100132103;symbol=FAM66E;ID=100132103\r\n",
      "chr8\trefgene\tgene\t7783859\t7809935\t.\t+\t.\t             _________\r\n",
      "chr8\trefgene\tgene\t6261077\t6264069\t.\t-\t.\t            / Cows in \\\r\n",
      "chr8\trefgene\tgene\t7272385\t7274354\t.\t-\t.\t            | the     |\r\n",
      "chr8\trefgene\tgene\t7946463\t7946611\t.\t-\t.\t            \\ Genome! /\r\n",
      "chr8\trefgene\tgene\t6602685\t6602765\t.\t+\t.\t             ---------\r\n",
      "chr8\trefgene\tgene\t8905955\t8906028\t.\t+\t.\t                      \\   ^__^\r\n",
      "chr8\trefgene\tgene\t6602689\t6602761\t.\t-\t.\t                       \\  (oo)\\_______\r\n",
      "chr8\trefgene\tgene\t6693076\t6699975\t.\t+\t.\t                          (__)\\       )\\/\\\r\n",
      "chr8\trefgene\tgene\t8559666\t8561617\t.\t+\t.\t                              ||----w |\r\n",
      "chr8\trefgene\tgene\t9182561\t9192590\t.\t+\t.\t                              ||      |\r\n",
      "chr8\trefgene\tgene\t8175258\t8239257\t.\t-\t.\tgene_id=157285;symbol=SGK223;ID=157285\r\n",
      "chr8\trefgene\tgene\t9757574\t9760839\t.\t-\t.\tgene_id=157627;symbol=LINC00599;ID=157627\r\n",
      "chr8\trefgene\tgene\t6835171\t6856724\t.\t-\t.\tgene_id=1667;symbol=DEFA1;ID=1667\r\n",
      "chr8\trefgene\tgene\t6793345\t6795786\t.\t-\t.\tgene_id=1669;symbol=DEFA4;ID=1669\r\n",
      "chr8\trefgene\tgene\t6912829\t6914259\t.\t-\t.\tgene_id=1670;symbol=DEFA5;ID=1670\r\n",
      "chr8\trefgene\tgene\t6782216\t6783598\t.\t-\t.\tgene_id=1671;symbol=DEFA6;ID=1671\r\n",
      "chr8\trefgene\tgene\t6728097\t6735529\t.\t-\t.\tgene_id=1672;symbol=DEFB1;ID=1672\r\n",
      "chr8\trefgene\tgene\t7752199\t7754237\t.\t+\t.\tgene_id=1673;symbol=DEFB4A;ID=1673\r\n",
      "chr8\trefgene\tgene\t6844700\t6866346\t.\t-\t.\tgene_id=170949;symbol=DEFT1P;ID=170949\r\n",
      "chr8\trefgene\tgene\t7353368\t7366833\t.\t+\t.\tgene_id=245910;symbol=DEFB107A;ID=245910\r\n",
      "chr8\trefgene\tgene\t6357175\t6420784\t.\t-\t.\tgene_id=285;symbol=ANGPT2;ID=285\r\n",
      "chr8\trefgene\tgene\t8086092\t8102387\t.\t+\t.\tgene_id=286042;symbol=FAM86B3P;ID=286042\r\n",
      "chr8\trefgene\tgene\t6666041\t6693166\t.\t-\t.\tgene_id=389610;symbol=XKR5;ID=389610\r\n",
      "chr8\trefgene\tgene\t7829183\t7830775\t.\t-\t.\tgene_id=392188;symbol=USP17L8;ID=392188\r\n",
      "chr8\trefgene\tgene\t7189909\t7191501\t.\t+\t.\tgene_id=401447;symbol=USP17L1;ID=401447\r\n",
      "chr8\trefgene\tgene\t9760898\t9760982\t.\t-\t.\tgene_id=406907;symbol=MIR124-1;ID=406907\r\n",
      "chr8\trefgene\tgene\t7413660\t7431920\t.\t-\t.\tgene_id=441317;symbol=FAM90A7P;ID=441317\r\n",
      "chr8\trefgene\tgene\t7627106\t7628835\t.\t+\t.\tgene_id=441328;symbol=FAM90A10P;ID=441328\r\n",
      "chr8\trefgene\tgene\t6808248\t6809121\t.\t-\t.\tgene_id=449491;symbol=DEFA8P;ID=449491\r\n",
      "chr8\trefgene\tgene\t6816811\t6817683\t.\t-\t.\tgene_id=449492;symbol=DEFA9P;ID=449492\r\n",
      "chr8\trefgene\tgene\t6825663\t6826635\t.\t-\t.\tgene_id=449493;symbol=DEFA10P;ID=449493\r\n",
      "chr8\trefgene\tgene\t7669242\t7673238\t.\t-\t.\tgene_id=503614;symbol=DEFB107B;ID=503614\r\n",
      "chr8\trefgene\tgene\t6565878\t6619021\t.\t+\t.\tgene_id=55326;symbol=AGPAT5;ID=55326\r\n",
      "chr8\trefgene\tgene\t7194637\t7196229\t.\t+\t.\tgene_id=645402;symbol=USP17L4;ID=645402\r\n",
      "chr8\trefgene\tgene\t7833915\t7835507\t.\t-\t.\tgene_id=645836;symbol=USP17L3;ID=645836\r\n",
      "chr8\trefgene\tgene\t7705402\t7721319\t.\t+\t.\tgene_id=653423;symbol=SPAG11A;ID=653423\r\n",
      "chr8\trefgene\tgene\t9599182\t9599278\t.\t+\t.\tgene_id=693182;symbol=MIR597;ID=693182\r\n",
      "chr8\trefgene\tgene\t6886123\t6887011\t.\t-\t.\tgene_id=724068;symbol=DEFA11P;ID=724068\r\n",
      "chr8\trefgene\tgene\t6873391\t6875823\t.\t-\t.\tgene_id=728358;symbol=DEFA1B;ID=728358\r\n",
      "chr8\trefgene\tgene\t6264113\t6501140\t.\t+\t.\tgene_id=79648;symbol=MCPH1;ID=79648\r\n",
      "chr8\trefgene\tgene\t8993764\t9009152\t.\t-\t.\tgene_id=79660;symbol=PPP1R3B;ID=79660\r\n",
      "chr8\trefgene\tgene\t9413445\t9639856\t.\t+\t.\tgene_id=8658;symbol=TNKS;ID=8658\r\n",
      "chr8\trefgene\tgene\t8860314\t8890849\t.\t+\t.\tgene_id=90459;symbol=ERI1;ID=90459\r\n",
      "chr8\trefgene\tgene\t8641999\t8751131\t.\t-\t.\tgene_id=9258;symbol=MFHAS1;ID=9258\r\n"
     ]
    }
   ],
   "source": [
    "awk '$4 > 5000000 && $5 < 10000000 ' data/genes/chr8.gff\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Awk: printing columns and doing operations\n",
    "\n",
    "Awk can also print only specific columns and do arithmetic on them.\n",
    "\n",
    "Remember that each column can be referred to as \\$1, \\$2, \\$3, etc.\n",
    "\n",
    "For example, the following code prints the first column and the difference between the fifth and fourth columns (end minus start). We can pipe the output to head or less to make it easier to read:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##gff-version 0\r\n",
      "##source-version 0\r\n",
      "##date 0\r\n",
      "##genome-build 0\r\n",
      "chr8 9968\r\n",
      "chr8 75\r\n",
      "chr8 85\r\n",
      "chr8 81\r\n",
      "chr8 26779\r\n",
      "chr8 21444\r\n",
      "awk: write failure (Broken pipe)\r\n",
      "awk: close failed on file /dev/stdout (Broken pipe)\r\n"
     ]
    }
   ],
   "source": [
    "awk '{print $1, $5-$4}' data/genes/chr8.gff | head\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
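    "Notice how this also prints the header lines of the file, with a 0 where the subtraction has no numbers to work on. We can exclude them by piping the output through grep -v (the command below also prints the ninth column):"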
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chr8 9968 gene_id=10;symbol=NAT2;;ID=10\r\n",
      "chr8 75 gene_id=100126309;symbol=MIR875;;ID=100126309\r\n",
      "chr8 85 gene_id=100126338;symbol=MIR937;;ID=100126338\r\n",
      "chr8 81 gene_id=100126351;symbol=MIR939;;ID=100126351\r\n",
      "chr8 26779 gene_id=100127983;symbol=C8orf88;;ID=100127983\r\n",
      "chr8 21444 gene_id=100128126;symbol=STAU2-AS1;;ID=100128126\r\n",
      "chr8 12197 gene_id=100128338;symbol=FAM83H-AS1;;ID=100128338\r\n",
      "chr8 1835 gene_id=100128627;symbol=CDC42P3;;ID=100128627\r\n",
      "chr8 3282 gene_id=100128750;symbol=RBPMS-AS1;;ID=100128750\r\n",
      "chr8 69143 gene_id=100128890;symbol=FAM66B;ID=100128890\r\n"
     ]
    }
   ],
   "source": [
    "awk '{print $1, $5-$4, $9}' data/genes/chr8.gff | grep -v '^#' |  head"
   ]
  },
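  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The header lines can also be skipped inside awk itself, by putting a condition in front of the print block. This is a minimal sketch under assumptions: `make_sample` is a made-up helper that prints three invented lines in the layout of chr8.gff, so the command can be tried without the data file.\n",
    "\n",
    "```bash\n",
    "# Made-up sample in the same layout as chr8.gff: one header line, two genes\n",
    "make_sample() {\n",
    "    printf '##gff-version 3\\n'\n",
    "    printf 'chr8\\trefgene\\tgene\\t100\\t250\\t.\\t+\\t.\\tgene_id=1;symbol=AAA1;ID=1\\n'\n",
    "    printf 'chr8\\trefgene\\tgene\\t400\\t900\\t.\\t-\\t.\\tgene_id=2;symbol=MIR99;ID=2\\n'\n",
    "}\n",
    "\n",
    "# !/^#/ keeps only the lines that do not start with '#'\n",
    "make_sample | awk '!/^#/ {print $1, $5-$4, $9}'\n",
    "```\n",
    "\n",
    "On the real file, the same condition would go in front of the print block of the previous command, and the grep step would no longer be needed."
   ]
  },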
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Exercise (difficult)\n",
    "\n",
    "Starting from the previous command, can you extract the gene symbol into a separate column?\n",
    "\n",
    "Hints: pipe an additional awk statement after the first. Use the -F option to specify a different field separator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chr8 9968 gene_id=10 symbol=NAT2\r\n",
      "chr8 75 gene_id=100126309 symbol=MIR875\r\n",
      "chr8 85 gene_id=100126338 symbol=MIR937\r\n",
      "chr8 81 gene_id=100126351 symbol=MIR939\r\n",
      "chr8 26779 gene_id=100127983 symbol=C8orf88\r\n",
      "chr8 21444 gene_id=100128126 symbol=STAU2-AS1\r\n",
      "chr8 12197 gene_id=100128338 symbol=FAM83H-AS1\r\n",
      "chr8 1835 gene_id=100128627 symbol=CDC42P3\r\n",
      "chr8 3282 gene_id=100128750 symbol=RBPMS-AS1\r\n",
      "chr8 69143 gene_id=100128890 symbol=FAM66B\r\n",
      "awk: write failure (Broken pipe)\r\n",
      "awk: close failed on file /dev/stdout (Broken pipe)\r\n",
      "grep: write error\r\n"
     ]
    }
   ],
   "source": [
    "awk '{print $1, $5-$4, $9}' data/genes/chr8.gff | grep -v '^#' | awk -F';' '{print $1, $2}' | head"
   ]
  },
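  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The solution above still prints the `symbol=` prefix. One possible way to keep only the name is to split the ninth column inside a single awk call. This is a sketch under assumptions: `gene_symbols` is a made-up helper name, the input is one invented line in the layout of chr8.gff, and the symbol is assumed to be the second `;`-separated item, as in the output above.\n",
    "\n",
    "```bash\n",
    "# Hypothetical helper: reads GFF-like lines, prints chromosome, end minus start, symbol\n",
    "gene_symbols() {\n",
    "    awk '!/^#/ {\n",
    "        split($9, attr, \";\")          # attr[1] is gene_id=..., attr[2] is symbol=...\n",
    "        sub(/^symbol=/, \"\", attr[2])  # drop the symbol= prefix\n",
    "        print $1, $5-$4, attr[2]\n",
    "    }'\n",
    "}\n",
    "\n",
    "# One made-up line in the same layout as chr8.gff\n",
    "printf 'chr8\\trefgene\\tgene\\t100\\t250\\t.\\t+\\t.\\tgene_id=1;symbol=AAA1;ID=1\\n' | gene_symbols\n",
    "```\n",
    "\n",
    "Both `split` and `sub` are standard awk functions, so this avoids the second awk and the grep in the pipeline."
   ]
  },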
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## AWK: searching by regular expressions\n",
    "\n",
    "Awk can also be used to search by regular expression.\n",
    "\n",
    "For example, the following code will print all the lines in which the symbol starts with \"MIR\":"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chr8\trefgene\tgene\t100549014\t100549089\t.\t-\t.\tgene_id=100126309;symbol=MIR875;;ID=100126309    \r\n",
      "chr8\trefgene\tgene\t144895127\t144895212\t.\t-\t.\tgene_id=100126338;symbol=MIR937;;ID=100126338\r\n",
      "chr8\trefgene\tgene\t145619364\t145619445\t.\t-\t.\tgene_id=100126351;symbol=MIR939;;ID=100126351\r\n",
      "chr8\trefgene\tgene\t65285775\t65295842\t.\t+\t.\tgene_id=100130155;symbol=MIR124-2HG;;ID=100130155\r\n",
      "chr8\trefgene\tgene\t128972879\t128972941\t.\t+\t.\tgene_id=100302161;symbol=MIR1205;;ID=100302161\r\n",
      "chr8\trefgene\tgene\t10682883\t10682953\t.\t-\t.\tgene_id=100302166;symbol=MIR1322;;ID=100302166\r\n",
      "chr8\trefgene\tgene\t129021144\t129021202\t.\t+\t.\tgene_id=100302170;symbol=MIR1206;;ID=100302170\r\n",
      "chr8\trefgene\tgene\t129061398\t129061484\t.\t+\t.\tgene_id=100302175;symbol=MIR1207;;ID=100302175\r\n",
      "chr8\trefgene\tgene\t128808208\t128808274\t.\t+\t.\tgene_id=100302185;symbol=MIR1204;;ID=100302185\r\n",
      "chr8\trefgene\tgene\t145625476\t145625559\t.\t-\t.\tgene_id=100302196;symbol=MIR1234;;ID=100302196\r\n",
      "chr8\trefgene\tgene\t113655722\t113655812\t.\t+\t.\tgene_id=100302225;symbol=MIR2053;;ID=100302225\r\n",
      "chr8\trefgene\tgene\t27743556\t27743633\t.\t-\t.\tgene_id=100422828;symbol=MIR4287;;ID=100422828\r\n",
      "chr8\trefgene\tgene\t29814788\t29814864\t.\t-\t.\tgene_id=100422876;symbol=MIR3148;;ID=100422876\r\n",
      "chr8\trefgene\tgene\t28362633\t28362699\t.\t-\t.\tgene_id=100422903;symbol=MIR4288;;ID=100422903\r\n",
      "chr8\trefgene\tgene\t96085142\t96085221\t.\t+\t.\tgene_id=100422964;symbol=MIR3150A;;ID=100422964\r\n",
      "chr8\trefgene\tgene\t104166842\t104166917\t.\t+\t.\tgene_id=100422992;symbol=MIR3151;;ID=100422992\r\n",
      "chr8\trefgene\tgene\t12584746\t12584808\t.\t+\t.\tgene_id=100500838;symbol=MIR3926-2;;ID=100500838\r\n",
      "chr8\trefgene\tgene\t27559194\t27559276\t.\t+\t.\tgene_id=100500858;symbol=MIR3622A;;ID=100500858\r\n",
      "chr8\trefgene\tgene\t12584741\t12584813\t.\t-\t.\tgene_id=100500870;symbol=MIR3926-1;;ID=100500870\r\n",
      "chr8\trefgene\tgene\t27559190\t27559284\t.\t-\t.\tgene_id=100500871;symbol=MIR3622B;;ID=100500871\r\n",
      "chr8\trefgene\tgene\t96085139\t96085224\t.\t-\t.\tgene_id=100500907;symbol=MIR3150B;;ID=100500907\r\n",
      "chr8\trefgene\tgene\t117886967\t117887039\t.\t-\t.\tgene_id=100500914;symbol=MIR3610;;ID=100500914\r\n",
      "chr8\trefgene\tgene\t42751340\t42751418\t.\t-\t.\tgene_id=100616115;symbol=MIR4469;;ID=100616115\r\n",
      "chr8\trefgene\tgene\t94928250\t94928347\t.\t-\t.\tgene_id=100616169;symbol=MIR378D2;;ID=100616169\r\n",
      "chr8\trefgene\tgene\t29920258\t30108213\t.\t-\t.\tgene_id=100616190;symbol=MIR548O2;;ID=100616190\r\n",
      "chr8\trefgene\tgene\t92217713\t92217786\t.\t+\t.\tgene_id=100616245;symbol=MIR4661;;ID=100616245\r\n",
      "chr8\trefgene\tgene\t124228028\t124228103\t.\t-\t.\tgene_id=100616260;symbol=MIR4663;;ID=100616260\r\n",
      "chr8\trefgene\tgene\t143257700\t143257779\t.\t+\t.\tgene_id=100616268;symbol=MIR4472-1;;ID=100616268\r\n",
      "chr8\trefgene\tgene\t144815253\t144815323\t.\t-\t.\tgene_id=100616318;symbol=MIR4664;;ID=100616318\r\n",
      "chr8\trefgene\tgene\t101394991\t101395073\t.\t+\t.\tgene_id=100616451;symbol=MIR4471;;ID=100616451\r\n",
      "chr8\trefgene\tgene\t62627347\t62627418\t.\t+\t.\tgene_id=100616484;symbol=MIR4470;;ID=100616484\r\n",
      "chr8\trefgene\tgene\t103137660\t103137743\t.\t+\t.\tgene_id=100847001;symbol=MIR5680;;ID=100847001\r\n",
      "chr8\trefgene\tgene\t131020580\t131020699\t.\t-\t.\tgene_id=100847051;symbol=MIR5194;;ID=100847051\r\n",
      "chr8\trefgene\tgene\t81153624\t81153708\t.\t+\t.\tgene_id=100847056;symbol=MIR5708;;ID=100847056\r\n",
      "chr8\trefgene\tgene\t75460778\t75460852\t.\t+\t.\tgene_id=100847058;symbol=MIR5681A;;ID=100847058\r\n",
      "chr8\trefgene\tgene\t75460785\t75460844\t.\t-\t.\tgene_id=100847091;symbol=MIR5681B;;ID=100847091\r\n",
      "chr8\trefgene\tgene\t9760898\t9760982\t.\t-\t.\tgene_id=406907;symbol=MIR124-1;ID=406907\r\n",
      "chr8\trefgene\tgene\t65291706\t65291814\t.\t+\t.\tgene_id=406908;symbol=MIR124-2;;ID=406908\r\n",
      "chr8\trefgene\tgene\t135812763\t135812850\t.\t-\t.\tgene_id=407030;symbol=MIR30B;;ID=407030\r\n",
      "chr8\trefgene\tgene\t135817119\t135817188\t.\t-\t.\tgene_id=407033;symbol=MIR30D;;ID=407033\r\n",
      "chr8\trefgene\tgene\t22102475\t22102556\t.\t-\t.\tgene_id=407037;symbol=MIR320A;;ID=407037\r\n",
      "chr8\trefgene\tgene\t75512101\t75670587\t.\t+\t.\tgene_id=441355;symbol=MIR2052HG;;ID=441355\r\n",
      "chr8\trefgene\tgene\t14710947\t14711019\t.\t-\t.\tgene_id=494332;symbol=MIR383;;ID=494332\r\n",
      "chr8\trefgene\tgene\t41517959\t41518026\t.\t-\t.\tgene_id=619554;symbol=MIR486-1;;ID=619554\r\n",
      "chr8\trefgene\tgene\t1765397\t1765473\t.\t+\t.\tgene_id=693181;symbol=MIR596;;ID=693181\r\n",
      "chr8\trefgene\tgene\t9599182\t9599278\t.\t+\t.\tgene_id=693182;symbol=MIR597;ID=693182\r\n",
      "chr8\trefgene\tgene\t10892716\t10892812\t.\t-\t.\tgene_id=693183;symbol=MIR598;;ID=693183\r\n",
      "chr8\trefgene\tgene\t100548864\t100548958\t.\t-\t.\tgene_id=693184;symbol=MIR599;;ID=693184\r\n",
      "chr8\trefgene\tgene\t145019359\t145019447\t.\t-\t.\tgene_id=724031;symbol=MIR661;;ID=724031\r\n"
     ]
    }
   ],
   "source": [
    "awk '$9 ~ /symbol=MIR/ {print $0}' data/genes/chr8.gff "
   ]
  },
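  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The pattern above matches every symbol that starts with MIR, including names such as MIR124-2HG. A stricter regular expression can narrow this down. The sketch below is only an illustration: `only_numbered_mir` is a made-up helper, the two input lines are invented, and the pattern keeps only symbols made of MIR followed by digits (so it would also drop names such as MIR3150A).\n",
    "\n",
    "```bash\n",
    "# Hypothetical helper: keep lines whose symbol is MIR followed only by digits\n",
    "only_numbered_mir() {\n",
    "    awk '$9 ~ /symbol=MIR[0-9]+;/ {print $0}'\n",
    "}\n",
    "\n",
    "# Two made-up lines in the same layout as chr8.gff\n",
    "{\n",
    "    printf 'chr8\\trefgene\\tgene\\t100\\t250\\t.\\t+\\t.\\tgene_id=1;symbol=MIR99;ID=1\\n'\n",
    "    printf 'chr8\\trefgene\\tgene\\t400\\t900\\t.\\t-\\t.\\tgene_id=2;symbol=MIR124-2HG;ID=2\\n'\n",
    "} | only_numbered_mir\n",
    "```\n",
    "\n",
    "Only the MIR99 line is printed, because in the second symbol the digits are followed by a dash instead of the `;` that the pattern requires."
   ]
  },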
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "### Last exercise!\n",
    "\n",
    "Calculate the length of the gene POU5F1B.\n",
    "\n",
    "Find the gene whose gene_id is equal to that number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1584\r\n"
     ]
    }
   ],
   "source": [
    "awk '$9 ~ /POU5F1B/ {print $5-$4}' data/genes/chr8.gff \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "chr8\trefgene\tGood_Job!\t143953773\t143961236\t.\t-\t.\tgene_id=1584;symbol=CYP11B1;;ID=1584\r\n"
     ]
    }
   ],
   "source": [
    "awk '$9 ~ /gene_id=1584;/ {print $0}' data/genes/chr8.gff "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Bonus: Makefiles\n",
    "\n",
    "Let's have a look at the file called Makefile in the exercise directory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test_exercises: start help ignorecase multiplefiles\r\n",
      "generate_exercises: generate_grep generate_awk\r\n",
      "\r\n",
      "testrule:\r\n",
      "\techo this is a Makefile rule\r\n",
      "\techo You can associate it to as many commands you want\r\n",
      "\r\n",
      "notebook:\r\n",
      "\tjupyter nbconvert --to notebook --execute PEB\\ Bash\\ Workshop.ipynb\r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "head Makefile"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Press space or the down key to continue"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## Defining pipelines with Makefiles\n",
    "\n",
    "Makefiles are a basic way to define pipelines of shell commands.\n",
    "\n",
    "Nowadays there are more sophisticated tools available, but many of them build on the same ideas as Makefiles.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "A Makefile is a collection of \"rules\".\n",
    "\n",
    "Each rule follows this basic syntax:\n",
    "\n",
    "```\n",
    "target: prerequisites\n",
    "    commands to execute\n",
    "```\n",
    "\n",
    "As you can see in the included Makefile, most of the rules regenerate the exercise files or run some commands without having to type them every time.\n",
    "\n",
    "For example, the rule \"testrule\" is associated with two echo commands."
   ]
  },
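  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "To see the syntax at work without touching the workshop Makefile, a one-rule Makefile can be written to a temporary file. This is a minimal sketch: the rule name `hello`, the variable `demo` and the temporary file name are made up for the illustration. Note that make requires each command line to start with a tab character, not spaces.\n",
    "\n",
    "```bash\n",
    "# Create a temporary file to hold the demo Makefile\n",
    "demo=$(mktemp \"${TMPDIR:-/tmp}/demo.XXXXXX\")\n",
    "\n",
    "# \\t in the printf format becomes the tab that make requires before a command\n",
    "printf 'hello:\\n\\techo hello from make\\n' > \"$demo\"\n",
    "\n",
    "# -f tells make which file to read instead of the default Makefile\n",
    "make -f \"$demo\" hello\n",
    "```\n",
    "\n",
    "As with \"testrule\", make first prints the command and then its output."
   ]
  },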
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "## How to run Makefile rules\n",
    "\n",
    "To execute a rule in the Makefile, simply type:\n",
    "\n",
    "```\n",
    "make [name of the rule]\n",
    "```\n",
    "\n",
    "For example:\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "echo this is a Makefile rule\r\n",
      "this is a Makefile rule\r\n",
      "echo You can associate it to as many commands you want\r\n",
      "You can associate it to as many commands you want\r\n"
     ]
    }
   ],
   "source": [
    "make testrule"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The program \"make\" will automatically detect any file named \"Makefile\" in the current directory, and execute the rule with the name you specify."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rules can also depend on other rules. For example, the two rules \"test_exercises\" and \"generate_exercises\" at the beginning of the file list other rules as prerequisites, so running one of them runs all of those rules in turn."
   ]
  },
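  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "The same mechanism can be tried on a small scale. In this sketch the rule names `all`, `first` and `second`, the variable `deps_demo` and the temporary file are all made up for the illustration; the `@` in front of echo only stops make from printing the command itself.\n",
    "\n",
    "```bash\n",
    "# Create a temporary file to hold the demo Makefile\n",
    "deps_demo=$(mktemp \"${TMPDIR:-/tmp}/demo.XXXXXX\")\n",
    "\n",
    "# The rule all has no commands of its own, only two prerequisites\n",
    "printf 'all: first second\\n\\nfirst:\\n\\t@echo running first\\n\\nsecond:\\n\\t@echo running second\\n' > \"$deps_demo\"\n",
    "\n",
    "# Asking for all makes make run first and second\n",
    "make -f \"$deps_demo\" all\n",
    "```\n",
    "\n",
    "Running `make -f \"$deps_demo\" first` instead would execute only that single rule."
   ]
  },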
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# The last slide\n",
    "\n",
    "This is the last slide of the workshop. To finish, try to execute the rule \"cow\" in the Makefile."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false,
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " _____________\r\n",
      "/ I hope you  \\\r\n",
      "| have        |\r\n",
      "| enjoyed the |\r\n",
      "| workshop    |\r\n",
      "\\ :-)         /\r\n",
      " -------------\r\n",
      "        \\   ^__^\r\n",
      "         \\  (oo)\\_______\r\n",
      "            (__)\\       )\\/\\\r\n",
      "                ||----w |\r\n",
      "                ||     ||\r\n",
      " ___________\r\n",
      "( Now let's )\r\n",
      "( go to the )\r\n",
      "( beach     )\r\n",
      " -----------\r\n",
      "        o   ^__^\r\n",
      "         o  (oo)\\_______\r\n",
      "            (__)\\       )\\/\\\r\n",
      "                ||----w |\r\n",
      "                ||     ||\r\n"
     ]
    }
   ],
   "source": [
    "make cow"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "hide_input": false,
  "kernelspec": {
   "display_name": "Bash",
   "language": "bash",
   "name": "bash"
  },
  "language_info": {
   "codemirror_mode": "shell",
   "file_extension": ".sh",
   "mimetype": "text/x-sh",
   "name": "bash"
  },
  "nav_menu": {},
  "toc": {
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": 6,
   "toc_cell": true,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}