{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": "true" }, "source": [ "# Table of Contents\n", "

1  PEB Belgrade - Bash workshop
2  Definitions: The Unix Philosophy
2.1  More definitions
3  How to follow this workshop
3.1  Getting a Terminal application
3.1.1  Windows
3.1.2  Mac
3.1.3  Linux
3.2  Getting the workshop materials
3.3  If you get \"command not found\"
3.3.1  Advanced: using git to get the materials
4  Basic Unix Commands: ls, cd
4.1  ls -l
5  Accessing the contents of a file: head, cat, less
6  Searching patterns into file: grep
6.1  Accessing grep documentation
6.2  Searching multiple files
6.3  Searching multiple patterns and the Unix piping system
6.4  Regular Expressions
6.5  Regular Expression exercise
7  Working with tabular files: Awk
7.1  Example of tabular file: the GFF3 format
7.2  Basic AWK syntax: filters
7.2.0.1  Exercise
7.3  Awk: printing columns and doing operations
7.3.1  Exercise (difficult)
7.4  AWK: searching by regular expressions
7.4.1  Last exercise!
8  Bonus: Makefiles
8.1  Defining pipelines with Makefiles
8.2  How to run Makefile rules
9  The last slide
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# PEB Belgrade - Bash workshop\n", "\n", "Giovanni M. Dall'Olio, GlaxoSmithKline, 12/09/2016. All materials available here: https://dalloliogm.github.io/ \n", "\n", "```\n", " _______________\n", " / Welcome to \\\n", " \\ PEB Belgrade! /\n", " ---------------\n", " \\ ^__^\n", " \\ (oo)\\_______\n", " (__)\\ )\\/\\\n", " ||----w |\n", " || ||\n", "```\n", "\n", "Welcome to Belgrade!\n", "\n", "In this workshop we will review some basic Unix commands, as well as bash usage.\n", "\n", "If you attended the [Programming for Evolutionary Biology course in Leipzig](http://evop.bioinf.uni-leipzig.de/), this will be a refresher. I've hidden some **secrets** in the exercises, so you will not get bored :-)\n", "\n", "If you are new to bash, this will be a short introduction. Press Space or Down to continue." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [] } ], "source": [ "# Configuration - this will not appear in the slideshow\n", "alias grep='grep --color'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Definitions: The Unix Philosophy\n", "\n", "**Unix** is the name of an operating system created around 1970, which became popular for introducing a novel approach to computing." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The Unix philosophy can be summarised as: \n", "\n", "- Make each program do one thing well.\n", "- Expect the output of every program to become the input to another, as yet unknown, program. \n", "\n", "Press Space to continue." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "You will see how each Unix tool is specialized for a single task, and how the piping system lets you combine these tools.\n", "\n", "These principles are useful to anyone learning to program. You can take the same approach yourself: start by writing small programs and functions, then combine them into bigger pipelines. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## More definitions\n", "\n", "**Linux**: \n", "\n", " A \"descendant\" of Unix, i.e. a Unix-like operating system that can run on modern computers\n", "\n", "**Terminal**:\n", "\n", " A program that lets you give commands to the computer by typing them, rather than by pointing and clicking\n", " \n", "**Bash**: \n", "\n", " A command-line interpreter, i.e. a program that interprets the commands typed in the terminal and executes them." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# How to follow this workshop\n", "\n", "\n", "## Getting a Terminal application\n", "\n", "All the exercises will be done in a Terminal. \n", "\n", "During the conference we may also have time for a \"Linux Install Party\", to get Linux onto some of your laptops. However, there are ways to access a bash terminal without installing Linux first.\n", "\n", "Press space or the down key to see what to install or launch." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Windows\n", "\n", "For Windows users, we will use a terminal emulator called MobaXTerm: http://mobaxterm.mobatek.net/\n", "\n", "The Home Edition is free and contains all the features we will need for the workshop:\n", "\n", "\n", "\n", "To install new software, use (e.g. 
make):\n", "\n", "```\n", "apt-cyg install make\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Mac\n", "\n", "You should be able to use the Terminal app on Mac.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Linux\n", "\n", "Congratulations on having Linux installed! You can use your favorite terminal app (e.g. gnome-terminal or konsole)\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Getting the workshop materials\n", "\n", "Now that we have a terminal application ready, let's download all the course materials.\n", "\n", "Open the terminal, and type the following commands (omitting the \"$:\"):\n", "\n", "```\n", "$: wget https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n", "$: unzip master.zip\n", "\n", "```\n", "\n", "\n", "Explanation:\n", "\n", " - the **wget** command downloads a .zip file containing all the materials\n", " - the **unzip** command uncompresses the .zip file, creating a new folder in the current directory (your home area, if you have just opened the terminal)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## If you get \"command not found\"\n", "\n", "Download the file with your web browser instead: https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n", "\n", "Open Cygwin and move to the folder where the file was saved, for example:\n", "\n", "cd /cygdrive/c/Documents\ and\ Settings/ (your name)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\# Expected output\n", "\n", "```\n", "$: wget https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n", "\n", "--2016-08-26 09:55:53-- https://github.com/dalloliogm/belgrade_unix_intro/archive/master.zip\n", "Resolving github.com... 192.30.253.113\n", "Connecting to github.com|192.30.253.113|:443... connected.\n", "HTTP request sent, awaiting response... 
302 Found\n", "Location: https://codeload.github.com/dalloliogm/belgrade_unix_intro/zip/master [following]\n", "--2016-08-26 09:55:54-- https://codeload.github.com/dalloliogm/belgrade_unix_intro/zip/master\n", "Resolving codeload.github.com... 192.30.253.120\n", "Connecting to codeload.github.com|192.30.253.120|:443... connected.\n", "HTTP request sent, awaiting response... 200 OK\n", "Length: 129112 (126K) [application/zip]\n", "Saving to: `master.zip.1'\n", "\n", "100%[========================================================================================================================================================================>] 129,112 617K/s in 0.2s\n", "\n", "2016-08-26 09:55:54 (617 KB/s) - `master.zip' saved [129112/129112]\n", "$: unzip master.zip\n", "\n", "Archive: master.zip\n", " creating: belgrade_unix_intro-master/\n", " inflating: belgrade_unix_intro-master/PEB Bash Workshop.ipynb\n", " inflating: belgrade_unix_intro-master/README.md\n", " creating: belgrade_unix_intro-master/data/\n", " creating: belgrade_unix_intro-master/data/part1_grep/\n", " inflating: belgrade_unix_intro-master/data/part1_grep/file1.txt\n", " inflating: belgrade_unix_intro-master/data/part1_grep/file2.txt\n", " creating: belgrade_unix_intro-master/src/\n", " creating: belgrade_unix_intro-master/src/data/\n", " inflating: belgrade_unix_intro-master/src/data/README.rst\n", " inflating: belgrade_unix_intro-master/src/generate_grep_exercise.py\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Advanced: using git to get the materials\n", "\n", "If git is installed, you can get the materials with the following command:\n", "\n", "```\n", "git clone git@github.com:dalloliogm/belgrade_unix_intro.git\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Basic Unix Commands: ls, cd\n", "\n", "Let's have a look at the files we just downloaded.\n", 
"\n", "We will use two basic Unix commands:\n", "\n", " - **ls** lists the files in the current directory\n", " - **cd** moves to a different directory.\n", " \n", " \n", "Typing **ls** will show all the files in the current directory. Among these you should see a folder called **belgrade_unix_intro-master**, created by the wget and unzip commands. \n", " \n", "Let's move to this new directory, and list the files in it:\n", " \n", "```\n", "$: cd belgrade_unix_intro-master/\n", "$: ls\n", "```\n", "\n", "This will show a list of files, including a file called **start_here.txt**, a README, a few folders (data/, src/), and some other files." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[0m\u001b[01;34mdata\u001b[0m PEB Bash Workshop.slides.html \u001b[01;34msrc\u001b[0m\r\n", "Makefile PEB Bioconductor workshop.ipynb start_here.txt\r\n", "PEB Bash Workshop.ipynb README.md\r\n" ] } ], "source": [ "ls" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Press space or the down key to continue." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## ls -l\n", "\n", "You can use the -l option of ls to see more details (permissions, owner, size and modification date):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 160\r\n", "drwxrwxr-x 4 gioby gioby 4096 Sep 8 19:02 \u001b[0m\u001b[01;34mdata\u001b[0m\r\n", "-rw-rw-r-- 1 gioby gioby 929 Sep 8 21:57 Makefile\r\n", "-rw-rw-r-- 1 gioby gioby 83026 Sep 8 22:04 PEB Bash Workshop.ipynb\r\n", "-rw-rw-r-- 1 gioby gioby 56603 Sep 8 19:20 PEB Bioconductor workshop.ipynb\r\n", "-rw-rw-r-- 1 gioby gioby 260 Sep 5 18:23 README.md\r\n", "drwxrwxr-x 3 gioby gioby 4096 Sep 8 21:38 \u001b[01;34msrc\u001b[0m\r\n", "-rw-rw-r-- 1 gioby gioby 1877 Sep 5 18:23 start_here.txt\r\n" ] } ], "source": [ "# Contents of the PEB workshop directory\n", "ls -l" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Accessing the contents of a file: head, cat, less\n", "\n", "The new folder contains a file named start_here.txt, containing the first instructions for the workshop.\n", "\n", "To access the contents of a file, we can use several Unix commands:\n", "\n", "| command | description | example |\n", "| :------------:|:------------------------------------- |:------------------- |\n", "| **head** | print the first lines of the file (10 by default) | head start_here.txt |\n", "| **tail** | print the last lines of the file (10 by default) | tail start_here.txt |\n", "| **cat** | print the contents of the file to the screen | cat start_here.txt |\n", "| **less** | scroll through the contents of the file (press q to exit) | less start_here.txt |" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the first exercise, type \"head start_here.txt\" and follow the instructions:\n" ] }, { "cell_type": "code", 
"execution_count": 3, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " _________________________________________________\r\n", "/ To navigate the contents of this file, type: \\\r\n", "| |\r\n", "| less start_here.txt |\r\n", "\\ /\r\n", " --------------------------------------------------\r\n", " \\ ^__^\r\n", " \\ (oo)\\_______\r\n", " (__)\\ )\\/\\\r\n", " ||----w |\r\n" ] } ], "source": [ "head start_here.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Searching patterns into file: grep\n", "\n", "The instructions for the next exercises are stored in the **data/exercise1_grep.txt** file. \n", "\n", "However, if you look at this file with head or less, you will see that its contents have no meaning!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "PWapg3ZDqzNWF6v VocznrWXXLTi gIY7Tj0bVx pmslXrBubMQeoEJXrF0OfHpcpxwTlktHSCm\r\n", "spf5840ZpMkpg4tZvgd3z4dxLVLiXnfmtrNaGL9d BV04Lu18iLMugTwPHRRkLCADC8PKO8jXutZ\r\n", "zBTK9i8ya oe4IoxbCZhST4XvDe mrccT7cwYGAD 1SmSareQB5q8wNsAvaA79aqXlIpmBZgmUVR\r\n", "4gr mwZcxIg6pQwgddsJa4giM7hzjp8lit49D7kH upYIZQr8MbyEk4CX Y7k0uMmW9kk1fNJDea\r\n", "DMj0BJp wJ8BF3xyd61euAWb4IjOv6paBlKGse3a buZnpSOJv9PWhQpQnuxmZosVdYFw6TZF3RG\r\n", "yxArpyFCKt5637qiASyfadyheMBAp4bccq5furIx EOgGCEnWGuJwLSmvoehnXBdlbqDS5YN f7k\r\n", "T016mr v0mzedsHTFReC3ZjqVuYXpPuTulu8F0Z3 pmr9l96nOUEVXckfdiidZUP6UvFNh4Doaqz\r\n", "B0zFnnWEFttxrUjyuHgU9U09wEt7HfHBP1MAstQb WgxYhtDn3swa5fsmYgtxQKjjbIZzuVszEdl\r\n", "qByK4hFg7JQowOAXW60EBXQYSDHFgUWHlJAGYnjO CoB6YKtvZPaS8H8BRdsuBwdqU3KRz O9oXk\r\n", "3Ntf9b6jv7hZsjtfEcaIzMuakpsEjl6i7Mra4M3U MgWDXcpafKACEA0rUAro9DjHo4VgbBJ6tdj\r\n" ] } ], "source": [ "head 
data/exercise1_grep.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "This is because I've hidden all the instructions in the file, just to make things more interesting!\n", "\n", "Press space or the down key to see how to continue." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "I've hidden the instructions for the first exercise in the lines that contain the word \"start\" somewhere.\n", "\n", "To find them, type:\n", "\n", "```\n", "$: grep start data/exercise1_grep.txt\n", "```" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NjIM \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KesMsuqZNWhowFxuSFX4IaLymKGYdef \r\n", "McQYo6 umUY816rvtSGjAl \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K DdBaWfylxrZ ______________________________\r\n", "Ewu1xLvv7OrXNWu4otWYoF gdV4U3i\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KdzYlJ / Congrats! \\ \r\n", "kDDgWqtLBgY85 PQm8p1ajcAEzbQdb \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K rMv | You've used grep correctly, |\r\n", "kCVqk6sGesHvBp6 pNLzStgdhKu \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9YQQNI \\ and found a cow. 
/\r\n", "tLMfr \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KUY36ToJEfE4uqIAQ3JboyoBOFyL8s ------------------------------\r\n", " bWKJdeuL \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KI4xvVOZxwyC7oMKHaoG5ePF4k \\ ^__^\r\n", "fThKk5wk \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KAo6IzNddHcxuj93oFRam0mneoF \\ (oo)\\_______\r\n", "aw\u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kyw5tH3FzetzVxhw8c VrV7Uyis 5q8Yvj (__)\\ )\\/\\ \r\n", "QgF4gHcEbAz \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9ZbzG90fUafm64BIlTEIIr ||----w |\r\n", "XKWA \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K6DpuNhYTXTcmo0UtCGa4SUo4JvnwvD || ||\r\n", "ZA0BrMOyH99y7VY97lkomNXHUJUv8MWg \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K R \r\n", "YgOF6ahX0hEhMf \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KTZd 1 wDtbgoa86I8Atk \r\n", "vln0CxxjrcgeeQ5EtPdG0Spx7\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KIAT35hzj 6 \r\n", "xglkzByTDeiIKyoZbCQbO4br\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K rb39 ExT9F The command grep allows to search\r\n", "5\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K7zPVWugW3vb9 mYBIxsuIVxhHUdIxiTgFZ for a pattern in a text file.\r\n", "CLcSSkWNF0tHLOluZr43qptA \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KxojHnAwbmJ \r\n", "Fp \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K ZwtvOeMtDld9oahg9rdmBvKtIjPqXFQ It will print all the matching \r\n", "VOsyrwG4UOEsdfYLOfGFGKZWEvtJse \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9Ra lines to the screen.\r\n", "Sv81TKcZ8Fx1lb7xPZVMxW4ODNoKg8p7IHZ\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K \r\n", "GlbV\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KpQQ5eQDweIn0VAGC8bQLbQ0Dzw4Ggvt =================\r\n", "kziTL5jTi\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K pijnXXmWRApPCC 19SUNHN8n7 Next Exercise\r\n", "v768DQ0dRCix6 \u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kc0me0SF qIsYfeC704lam =================\r\n", "vebdjvHTd\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K RxBxhJayFkmRXqyOvqg5khG4O \r\n", "QorxdcpNP1utzB\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K6WpDOX4YzyIFpkZEalKW4 In the next exercise we will see 
\r\n", "GipBzz4Ul5sj3hVmVkQvPg \u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kz9v6AF91EirG how to access grep's documentation.\r\n", "CC09wNO65rwuCqUgi8Skg1NZ0SGR7WDUoVT\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K \r\n", "fjT7Ag59 RuhusLFzU \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KGHFvKsYSp bnNsLG Grep the following word to continue:\r\n", "Zx6RINR3hk \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K667gnhTiLYLiB30MxX7irwVP _ _ \r\n", "T0aoAQpfbNkO8LkSzSLJkLVEaXNxzQ \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KVoL3 | | | | \r\n", "Nv0hZYvh0pHN0AlT BN\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K C8pMzkIs7usQUWd | |__ ___ | | _ __ \r\n", "5 \u001b[01;31m\u001b[Kstart\u001b[m\u001b[K 9Rq5tBOFDxiExQrRlPgCXoWt43a3US | '_ \\ / _ \\| || '_ \\ \r\n", "46unsRj3c4ClXQvcoFPyE9cnRHDQOHFNNZ\u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kc | | | || __/| || |_) |\r\n", "40H3 \u001b[01;31m\u001b[Kstart\u001b[m\u001b[Kj6glBCFXqOhMH3BEdgBsPPQuBbOt6D |_| |_| \\___||_|| .__/ \r\n", "Qam1yoNK3BCpwSyhRX8Wb3rA1U\u001b[01;31m\u001b[Kstart\u001b[m\u001b[K djDiAHuT | | \r\n", "PFW \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KKxEzUChDZGSPQNj4gsTS5k1JMBvWuY |_| \r\n", "1bS5w1uaq65\u001b[01;31m\u001b[Kstart\u001b[m\u001b[KnVRWYkojLFMSkMjui8YYz1A5 \r\n", "g0 g8iyP \u001b[01;31m\u001b[Kstart\u001b[m\u001b[KQqkz7F05ST C S73TpreeesnFm \r\n" ] } ], "source": [ "grep start data/exercise1_grep.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Accessing grep documentation\n", "\n", "To access the documentation of a command, we can use the **man** command.\n", "\n", "Let's type the following:\n", "\n", "```\n", "$: man grep\n", "```\n", "\n", "This lets you browse the documentation for grep, in the same way as with the **less** command. Use the arrows to scroll, and q to exit.\n", "\n", "For the next exercise, you will have to identify two options in the man page, and use them to do a case-insensitive search for \"ignorecase\" and count the number of matching lines." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "td4kwN6cV0kqU3qMkwYOHl9MqjTQ \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K MP6MrF \r\n", "GY \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K yEvNL3RuVQYqiumisBftk8irLzXwt61y The documentation for grep can \r\n", "MTEwA\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KhQapljO9yUtucAiNpvZrKdbwc3KUcsu be accessed through man:\r\n", "NVD5n5HKKKz6GgDmyOGMlKSMTd \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K Rii BCjC \r\n", "ku1GNL6IpSHBvcGroqpHgbMUNCg3Yz3l\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KnOBy $: man grep\r\n", "XkHkwMI\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K hidg7uJURR6loj5IAwv9oyeIUmqT \r\n", "sGKar9AKY \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K VhNi3MlGzT3WjAQdpWbvtuWeb Scroll down to see all the \r\n", "jHLbw4whFT1B\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KDfqZqhjXRYPjF0y7pkM8g3z3 parameters for grep and their description.\r\n", "hZ1OQgKcsgZo \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K m4s64C8nSR5zM gU4fYObu \r\n", "9jlkynOW\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KLTaeswR5UnouUc3Ipsd4OjVI5PFO Use / to search for text.\r\n", "k4pjhosSNRgJlr7kt\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KAvkWOHszFMoP yPEbgT Press the q key to exit.\r\n", "DPdj3lg4P6 UtuibInP\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K ErdkiRKtYHTKDAJ \r\n", "5Ru8b5\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K cCuwtAVpbxoqHtK70dT9vtw5NsZR8 \r\n", "O\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K hL9xy8U77RDjkZsRX6WDZf8ywnBY83LiL5 \r\n", "8nhxNAlz3yHtfZFEBjwvnKPFB \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K qUV YkeQX ==============\r\n", "FU4nt\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K RStwArEw6UGFM4O7kKlxItNqVfD8Jl Next exercise\r\n", "IcOj \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K P56z3xD7QRkE1admG5sNTulg7B38om ==============\r\n", "FSnpSHWkiELodvyTu 
Tx\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K uAWnw9UTW0ZPITU \r\n", "iCu7cLxdU0vLMBo\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K46htatY8jYJC6XXVNDHTF For the next exercise, you will need to open \r\n", "qLbjcY\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K cp8USrp6u51ainbnXsp DForAbOq3 grep's documentation and identify two options:\r\n", "LPTakIcUOWmROON8GPJ4szSpKqZn3c \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K k5jK \r\n", "graSN0 cI4H6Zl \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K hCxcK0ynPImVu0Mogdcw - the option for case-insensitive searches\r\n", "MSewEHXyuatdRzy9GSokR DaLKp\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KfDDJd7u p - the option for counting \r\n", "r7k6b1c9XDZcWnxH syn9peY uNq\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KjKyOyg0T the number of matching lines, \r\n", "zi4Rycq58rmxjH zW1AhCWAO1s\u001b[01;31m\u001b[Khelp\u001b[m\u001b[KSyViqAbyAC instead of printing them to the screen.\r\n", "CNx6GsFSs\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K iRQE6pdA0jJiStNjknOaoQPSD \r\n", "ial36NIIePB7P5 \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K tpJ6bnVvVv7gESXp1Apc Once you have identified these options, \r\n", "A9HSI nKdCcuDp8WGEFkbWE8gJsUAZatatIO\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K do a case-insensitive search on this file for the word \r\n", "erU3 7ppIkaPoqBFCFkFFYMo\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K HxVST S9fFj \"ignorecase\", then count the number of lines.\r\n", "lwjWEVzMBJSZiRSXvJzQmePQPFKeL4OQdO\u001b[01;31m\u001b[Khelp\u001b[m\u001b[K R \r\n", "8P5kONdSaqg0tolHUGq8nN9brT7k \u001b[01;31m\u001b[Khelp\u001b[m\u001b[K 6duGCw \r\n" ] } ], "source": [ "grep help data/exercise1_grep.txt" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p14PGGX\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KDoCCJ9sYiegaozfL6LXxDmf \r\n", 
"o6m1cg7C\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KUJbpjD laYkpG6gdBHbJIM \r\n", "aNqS0Tg4kVIeLlyDeYoBlalps0\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[Kw5dd Remember that, to continue with the exercise, \r\n", "bRe7rR0sM8mcf8W1woMoReyj\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KLtPrHA you need to do a case-insensitive search for the word\r\n", "erU3 7ppIkaPoqBFCFkFFYMohelp HxVST S9fFj \"\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[K\", then count the number of lines.\r\n", "w\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KTt0lDGMb5KCFWEm4t8RmBNXtLvURX ||----w |\r\n" ] } ], "source": [ "# If we do a search for \"ignorecase\" without any option, we only get some of the lines.\n", "# You can notice that the cow is not properly displayed :-)\n", "grep ignorecase data/exercise1_grep.txt" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "p14PGGX\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KDoCCJ9sYiegaozfL6LXxDmf \r\n", "o6m1cg7C\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KUJbpjD laYkpG6gdBHbJIM \r\n", "aNqS0Tg4kVIeLlyDeYoBlalps0\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[Kw5dd Remember that, to continue with the exercise, \r\n", "bRe7rR0sM8mcf8W1woMoReyj\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KLtPrHA you need to do a case-insensitive search for the word\r\n", "erU3 7ppIkaPoqBFCFkFFYMohelp HxVST S9fFj \"\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[K\", then count the number of lines.\r\n", "1ofqHyPgr74Vx 0vUkETWFA\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[Ku8SJQ5C \r\n", "1 vfC7\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KMUtRWYq3KGKJpR8koi7FhtzX _____________\r\n", "OTMODZfX1gD9l38Tu9PEQZrshVzL\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KbI / Good Job! 
\\ \r\n", "u7YtPPNnVLSzB8HCBvtOcIHey0X8Wt\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[K | You did a |\r\n", "QfX1XYVyUHpwU\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KpT fi6GkHvOkG LDb | case-insens |\r\n", "Vw4ePnDoZ4KxNs58pWlGMoFVc\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KpQj 6 | itive |\r\n", "fN4SOVBxl6\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KeJ5Ldyb0y4PLVSL1ZCv7 \\ search /\r\n", "mmNqW04FRacds3eYb\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KRk5rFhFpKahDt -------------\r\n", "ZgQZAYDnIE7Jk4PLhZ10gx\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KpxQqxB4t \\ ^__^\r\n", "50FY1806\u001b[01;31m\u001b[KignOrecase\u001b[m\u001b[K6DzXRGwihWPeO3J gjHsDG \\ (oo)\\_______\r\n", "QxAIpmflI jFcJ\u001b[01;31m\u001b[KignOrecase\u001b[m\u001b[KQM06LNCSX lftJUX (__)\\ )\\/\\ \r\n", "w\u001b[01;31m\u001b[Kignorecase\u001b[m\u001b[KTt0lDGMb5KCFWEm4t8RmBNXtLvURX ||----w |\r\n", "w1EeylvQJWMF\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[KWavz4 ICR89dkvr6sf || ||\r\n", "wayAmo30uEjxkMyJvis\u001b[01;31m\u001b[KIGNORECASE\u001b[m\u001b[KkwshDQX DGB \r\n", "45s7W ggf\u001b[01;31m\u001b[KignOrecase\u001b[m\u001b[KUYiHjY0F6BWSqqDfZ6c F \r\n", "zmqy\u001b[01;31m\u001b[KIgnorEcase\u001b[m\u001b[Kqo5w9DIs0DGFlDayGlVaheoIlO \r\n" ] } ], "source": [ "# The -i option allows to do a case-insensitive search.\n", "# As you can see, some lines contain upper case characters:\n", "grep -i ignorecase data/exercise1_grep.txt" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "21\r\n" ] } ], "source": [ "# To solve the exercise, we also have to count the number of output lines.\n", "# This can be done with the \"-c\" option:\n", "grep -i -c ignorecase data/exercise1_grep.txt" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "run_control": { "frozen": 
false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tidjh\u001b[01;31m\u001b[K21\u001b[m\u001b[KyvuMNPDEma8t6PksdgTVkimf6F8LHegXf \r\n", "OllivZL3QFq8OiobDOQjdrPT1KeqT\u001b[01;31m\u001b[K21\u001b[m\u001b[K bRG WMRc \r\n", "eCmkBM\u001b[01;31m\u001b[K21\u001b[m\u001b[KOATsb57fD9ao6czsMB1f7gtWvJCFAW3z ____________________\r\n", "YCOQlk1yUmr8EjN3NBxEB0SSToh\u001b[01;31m\u001b[K21\u001b[m\u001b[KXfpm BiVHS7 / Congrats! Yes \\ \r\n", "JCsq1gs3drLCHAerYroSp331AJMHr\u001b[01;31m\u001b[K21\u001b[m\u001b[Km9Atm4UMR | the answer to the |\r\n", "z3nfFTpzSKGHfdDwtIadMjgiYx\u001b[01;31m\u001b[K21\u001b[m\u001b[Kiiat3S9VVT8R | case-insensitive |\r\n", "0qBEpfp1dcTibKVwObda341CTH9zoYJpBFe8\u001b[01;31m\u001b[K21\u001b[m\u001b[Kyy | and count question |\r\n", "KJIsvaofywLv6uz1\u001b[01;31m\u001b[K21\u001b[m\u001b[K6aZlUBQ3XBJd1jVC5bdHAE | is \u001b[01;31m\u001b[K21\u001b[m\u001b[K. /\r\n", "jy0FgakHM4Tq7ncjhUN\u001b[01;31m\u001b[K21\u001b[m\u001b[KggkNyZhNhJC4eyz ESN --------------------\r\n", "xwwOmWdp5pJ8IsvtNMx9EnWOnjmuUEdt4o8d\u001b[01;31m\u001b[K21\u001b[m\u001b[Kzc \\ ^__^\r\n", "k azZdXgjRGFYTHuMIp0SFkwjp4vHRG1lnlmSj\u001b[01;31m\u001b[K21\u001b[m\u001b[K \\ (oo)\\_______\r\n", "jYe19iH7NaYtPGDC7mXoy5G7\u001b[01;31m\u001b[K21\u001b[m\u001b[Ks8EGrD8wFCZSlJ (__)\\ )\\/\\ \r\n", "CXUYNxwnP8jr3NR5T9SCl5TQAwJI5ZjNCm zw\u001b[01;31m\u001b[K21\u001b[m\u001b[KY ||----w |\r\n", "l\u001b[01;31m\u001b[K21\u001b[m\u001b[KFpLp HaLHc1MaoMXflHI4wr981PUNefC0cKDC || ||\r\n", "fC1BsEyvpDm cCnceoQCj3\u001b[01;31m\u001b[K21\u001b[m\u001b[Kv36bmPx5u9Ht6qxs \r\n", "VAAh4PTzYzWSbMxmtDE8XtwYqSu8KFq50\u001b[01;31m\u001b[K21\u001b[m\u001b[KycKLY \r\n", "WmHhzfH8XzJ4Dd3PvgMoIXAnoJJG3G9HlGUtD\u001b[01;31m\u001b[K21\u001b[m\u001b[Kd =============\r\n", "yCrjC\u001b[01;31m\u001b[K21\u001b[m\u001b[KuBDHKBR1P0XVXQp9XE6T7Nqa6C p8ZQ4H Next exercise\r\n", "zfa7If6rzhvuv 
O6HFHU\u001b[01;31m\u001b[K21\u001b[m\u001b[KcbLnpW0Yipf3xSKJSS =============\r\n", "FPgwt6n3mfTJtartXVwrMAtmn3ISF\u001b[01;31m\u001b[K21\u001b[m\u001b[KyiK0U9NH4 \r\n", "PnV\u001b[01;31m\u001b[K21\u001b[m\u001b[KlkRoTqqVP 9Hs4v4RlJLFdOx6LkhICM WW1 Searching in multiple files\r\n", "uaRh\u001b[01;31m\u001b[K21\u001b[m\u001b[K9wTTl0wCVin63cfrywW06LwQOb vx1k5Uu \r\n", "FTOHCTMDFlKj cNVu\u001b[01;31m\u001b[K21\u001b[m\u001b[KDgKqN1EZxhU1iPyGRrko1 Grep can search the same pattern\r\n", "EfzALglVAh8cPso5WmyYi8v1QG0c\u001b[01;31m\u001b[K21\u001b[m\u001b[KLUTKPqw66N in more than one file at the same time.\r\n", "rGWMTbnXJnehtyAY3vxTJWUdaUXH MxFnyA\u001b[01;31m\u001b[K21\u001b[m\u001b[KfUN \r\n", "l0UcWWd0LG0GeFwNKlGEyj07pbUOPTee1\u001b[01;31m\u001b[K21\u001b[m\u001b[Kt0MsN The folder data/multiplefiles/ contains hundreds of different files.\r\n", "Ow7gE6ZNvIGLP775npX6j5menzWz4\u001b[01;31m\u001b[K21\u001b[m\u001b[KHg00qDP3w \r\n", "uWhIZ4kk6cI7d9503RXAniriZjemCOZ\u001b[01;31m\u001b[K21\u001b[m\u001b[KJ7BTCBt Can you identify the file containing the word \"regex\"?\r\n", "Wxd46\u001b[01;31m\u001b[K21\u001b[m\u001b[KC JxW68aYMWbeCMY0eVMtTqF8iAfhqazV \r\n", "iekhxfE5LpZ\u001b[01;31m\u001b[K21\u001b[m\u001b[KqUxwIjXpYtMchz489rzXtZ0 VOU \r\n" ] } ], "source": [ "# solution: how to find the instructions for the next exercise\n", "grep 21 data/exercise1_grep.txt" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Searching multiple files\n", "\n", "Grep is useful to search over multiple files in a single command.\n", "\n", "The folder data/multiplefiles/ contains 50 randomly generated files. You can see their contents with head data/multiplefiles/\\* or with less.\n", "\n", "One of these files contains the word \"regex\" in it. Are you able to find it?" 
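, "\n",
"\n",
"As a small warm-up on throwaway files (the grep_demo folder, its file names and the word needle are made up for this sketch and are not part of the workshop materials), note that grep accepts more than one file name and prefixes each matching line with the name of the file it came from:\n",
"\n",
"```\n",
"mkdir -p grep_demo\n",
"echo 'apple banana' > grep_demo/a.txt\n",
"echo 'cherry needle' > grep_demo/b.txt\n",
"# search two files at once: each match is prefixed with its file name\n",
"grep needle grep_demo/a.txt grep_demo/b.txt\n",
"# -l prints only the names of the files that contain a match\n",
"grep -l needle grep_demo/a.txt grep_demo/b.txt\n",
"```\n",
"\n",
"The -l form is handy when you only need to know which file to open next."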
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[K5gsumFTKbKEJv9dD8W94FhoEQU8qf8RMUc\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KR \r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KYgDiqkA C1o\u001b[01;31m\u001b[Kregex\u001b[m\u001b[K9giqI66c3sOwfLirOsgPpSuq \r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KIsXSnp 8U8pKR0LsVuK\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KO5GFegOtV4GW4fNQ Good! You've found the \r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[Kl 4px8KhPRmfEJgi5uTuVO1XahG3H1sY\u001b[01;31m\u001b[Kregex\u001b[m\u001b[K4wt file containing the word \"\u001b[01;31m\u001b[Kregex\u001b[m\u001b[K\"\r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[Kyz8P5 HC6N5D XRHPncZjTAeM\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KT9bQUoZdsh \r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KeWUd18s0MVx5YYrEK KCKeF5hvO\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KIiZbIGUX To continue, \r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KMLXiKZJ8KyHMou9lYsz4ZjFYJSfB 14t\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KtpJ grep file32.txt data/exercise1_grep.txt\r\n", "\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KveFQU\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KfnQxwQw6POJRNvvAeYwToX6ptvN39m \r\n", 
"\u001b[35m\u001b[Kdata/multiplefiles/file32.txt\u001b[m\u001b[K\u001b[36m\u001b[K:\u001b[m\u001b[KcHoNv\u001b[01;31m\u001b[Kregex\u001b[m\u001b[KiGjHkmptPVTjOzvWVGbrGoHoywV4Vy \r\n" ] } ], "source": [ "# solution: you can use the \"*\" character to specify multiple files:\n", "grep 'regex' data/multiplefiles/*" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Searching multiple patterns and the Unix piping system\n", "\n", "How can we search for lines that contain two or more patterns?\n", "\n", "One solution is to use the Unix piping system: run one grep command, and then run another grep on its output.\n", "\n", "This can be done using the pipe \"|\" symbol, as follows:\n", "\n", "```\n", "$: grep (first pattern) myfile.txt | grep (second pattern)\n", "```\n", "\n", "Press space or the down key for some examples." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The file data/genes/mgat_genes.gb is a GenBank file. 
Notice how this format is well suited for grep searches:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LOCUS HUMUDPCNA 4705 bp DNA linear PRI 19-SEP-1995\r\n", "DEFINITION Human alpha-1,3-mannosyl-glycoprotein beta-1,\r\n", " 2-N-acetylglucosaminyltransferase (MGAT) gene, complete cds.\r\n", "ACCESSION M61829\r\n", "VERSION M61829.1 GI:340075\r\n", "KEYWORDS alpha-1,3-mannosyl-glycoprotein beta-1,2-N-acetylglucosaminyltrae.\r\n", "SOURCE Homo sapiens (human)\r\n", " ORGANISM Homo sapiens\r\n", " Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;\r\n", " Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;\r\n" ] } ], "source": [ "head data/genes/mgat_genes.gb" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let's say we want to find all the lines where \"ORGANISM\" is \"Homo sapiens\". \n", "\n", "We can do it with two grep commands:\n", "\n", "```\n", "grep ORGANISM data/genes/mgat_genes.gb | grep 'Homo sapiens'\n", "```\n", "Notice that searching for \"Homo sapiens\" alone would not be enough, as other lines (for example the SOURCE lines) also contain the text \"Homo sapiens\"." 
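, "\n", "\n", "A minimal sketch of this difference, assuming the standard grep option -c (which prints the number of matching lines instead of the lines themselves):\n", "\n", "```\n", "# count every line that mentions Homo sapiens (SOURCE, ORGANISM, ...)\n", "grep -c 'Homo sapiens' data/genes/mgat_genes.gb\n", "# count only the ORGANISM lines that mention Homo sapiens\n", "grep ORGANISM data/genes/mgat_genes.gb | grep -c 'Homo sapiens'\n", "```\n", "\n", "The first count should be larger than the second, because it also includes lines such as SOURCE."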
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n", " ORGANISM \u001b[01;31m\u001b[KHomo sapiens\u001b[m\u001b[K\r\n" ] } ], "source": [ "grep ORGANISM data/genes/mgat_genes.gb | grep 'Homo sapiens'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The file contains sequences from two other organisms apart from Homo sapiens. Can you guess which one to search for the next exercise?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t _______________\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t < Well guessed! 
>\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t ---------------\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t\\ ^__^\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t \\ (oo)\\_______\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t \t\t (__)\\ )\\/\\\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t\t||----w |\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\t\t|| ||\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t===============\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\tNext Exercise\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t===============\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\tTo continue, grep\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\t\"regex\" in\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\t\t\t\tdata/exercise1_grep.txt\r\n", " ORGANISM Bos \u001b[01;31m\u001b[Ktaurus\u001b[m\u001b[K\r\n" ] } ], "source": [ "# Solution: grep for \"bos taurus\":\n", "grep ORGANISM data/genes/mgat_genes.gb | grep taurus" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Regular Expressions\n", "\n", "Regular expressions allow to search for more complex patterns. \n", "\n", "Here are some simple regular expression examples:\n", "\n", "| regex | description |\n", "| -------- | ----------- |\n", "| . 
| matches any single character |\n", "| [A-Za-z] | matches any one of the characters listed within the square brackets (here, any letter) |\n", "| .\\* | matches any character, repeated any number of times (including zero) | " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Regular Expression exercise\n", "\n", "Let's have a look at the file data/genes/sequences.fasta:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">seq000 sequence description\r\n", "NANTCNNNGNATNNCNTACNTGNTCGNCCG\r\n", ">seq001 sequence description\r\n", "TGTCAATTNTCNGCGTCNNACNNACTCGCN\r\n", ">seq002 sequence description\r\n", "TGGGCGNTCATGNANAATGTTACGCTCNGG\r\n", ">seq003 sequence description\r\n", "GCCTTTNGGNNCTCACTANGCANGTTTGAN\r\n", ">seq004 sequence description \r\n", "CATNANNAAAccTTTAGGCACTCNACACNG\r\n" ] } ], "source": [ "head data/genes/sequences.fasta " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Can you use grep to identify all the sequences containing three As, followed by any two characters, followed by three Ts?" 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CATNANN\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KAGGCACTCNACACNG\r\n", "AGGCCGCNGNGGTA\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KACNAAGAC\r\n", "GTGTNNGTCAAGCNCGNCGTTN\u001b[01;31m\u001b[KAAAGGTTT\u001b[m\u001b[K\r\n", "ATNCGNAGNNCANTNGAC\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KTTGT\r\n", "NTACNTA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KCCACTNTTTANTCAA\r\n", "CNNGAGCG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KNGCAAGTCTGNNCN\r\n", "CATGGGC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KATANATGTNAANCNT\r\n", "GGTGGGNNCCCAGNCGNC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KNNCT\r\n", "GA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KTCNAACTTTNNAATAANNCN\r\n", "GAACAAGCNGCCCTTGGCC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KGNC\r\n", "NNTCGTNGNNNA\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KTAAGAGCACC\r\n", "NNNGNGTNG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KTTGAACGANNAAT\r\n", "CC\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KNTAGNAGCCTGTAGAGCCGC\r\n" ] } ], "source": [ "grep 'AAA..TTT' data/genes/sequences.fasta" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Bonus: if we use the -B 1 grep option, we can retrieve the names of these sequences:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ">seq004 sequence description \r\n", "CATNANN\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KAGGCACTCNACACNG\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq012 sequence description ________________\r\n", "AGGCCGCNGNGGTA\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KACNAAGAC\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq015 sequence 
description\r\n", "GTGTNNGTCAAGCNCGNCGTTN\u001b[01;31m\u001b[KAAAGGTTT\u001b[m\u001b[K\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq022 sequence description / Congrats! This \\ \r\n", "ATNCGNAGNNCANTNGAC\u001b[01;31m\u001b[KAAAccTTT\u001b[m\u001b[KTTGT\r\n", ">seq023 sequence description | was the last |\r\n", "NTACNTA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KCCACTNTTTANTCAA\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq029 sequence description \\ grep exercise / \r\n", "CNNGAGCG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KNGCAAGTCTGNNCN\r\n", ">seq030 sequence description ----------------\r\n", "CATGGGC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KATANATGTNAANCNT\r\n", ">seq031 sequence description \\ ^__^\r\n", "GGTGGGNNCCCAGNCGNC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KNNCT\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq033 sequence description \\ (oo)\\_______\r\n", "GA\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KTCNAACTTTNNAATAANNCN\r\n", ">seq034 sequence description (__)\\ )\\/\\ \r\n", "GAACAAGCNGCCCTTGGCC\u001b[01;31m\u001b[KAAAgtTTT\u001b[m\u001b[KGNC\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq038 sequence description ||----w |\r\n", "NNTCGTNGNNNA\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KTAAGAGCACC\r\n", ">seq039 sequence description || ||\r\n", "NNNGNGTNG\u001b[01;31m\u001b[KAAActTTT\u001b[m\u001b[KTTGAACGANNAAT\r\n", "\u001b[36m\u001b[K--\u001b[m\u001b[K\r\n", ">seq041 sequence description \r\n", "CC\u001b[01;31m\u001b[KAAAaaTTT\u001b[m\u001b[KNTAGNAGCCTGTAGAGCCGC\r\n" ] } ], "source": [ "grep -B1 'AAA..TTT' data/genes/sequences.fasta " ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false, "run_control": { "frozen": false, "read_only": false }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq004 sequence description \r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq012 
sequence description ________________\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq015 sequence description\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq022 sequence description / Congrats! This \\ \r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq023 sequence description | was the last |\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq029 sequence description \\ grep exercise / \r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq030 sequence description ----------------\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq031 sequence description \\ ^__^\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq033 sequence description \\ (oo)\\_______\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq034 sequence description (__)\\ )\\/\\ \r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq038 sequence description ||----w |\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq039 sequence description || ||\r\n", "\u001b[01;31m\u001b[K>\u001b[m\u001b[Kseq041 sequence description \r\n" ] } ], "source": [ "# Bonus: pipe an additional grep '>' to see a cow:\n", "grep -B1 'AAA..TTT' data/genes/sequences.fasta | grep '>'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Working with tabular files: Awk\n", "\n", "\n", "The **awk** command lets you search and manipulate tabular files from the command line.\n", "\n", "Imagine it as the equivalent of Excel/Calc for the command line. It lets you search specific columns of a file, do numerical operations, or change the order of the columns.\n", "\n", "The advantage of a command-line tool like awk over graphical software is that it processes the file one line at a time, so its memory footprint stays low. This lets you access and modify large files in a fraction of the time that it would take with Excel." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Example of tabular file: the GFF3 format\n", "\n", "The file data/genes/chr8.gff contains an example of a file in the GFF3 format:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 3\r\n", "##source-version refgene 1.28.10\r\n", "##date 2016-09-08\r\n", "##genome-build .\thg19\r\n", "chr8\trefgene\tgene\t18248755\t18258723\t.\t+\t.\tgene_id=10;symbol=NAT2;;ID=10\r\n", "chr8\trefgene\tgene\t100549014\t100549089\t.\t-\t.\tgene_id=100126309;symbol=MIR875;;ID=100126309 \r\n", "chr8\trefgene\tgene\t144895127\t144895212\t.\t-\t.\tgene_id=100126338;symbol=MIR937;;ID=100126338\r\n", "chr8\trefgene\tgene\t145619364\t145619445\t.\t-\t.\tgene_id=100126351;symbol=MIR939;;ID=100126351\r\n", "chr8\trefgene\tgene\t91970706\t91997485\t.\t-\t.\tgene_id=100127983;symbol=C8orf88;;ID=100127983\r\n", "chr8\trefgene\tgene\t74332309\t74353753\t.\t+\t.\tgene_id=100128126;symbol=STAU2-AS1;;ID=100128126\r\n" ] } ], "source": [ "head data/genes/chr8.gff" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, it is a tab-separated file, which we could easily read in Excel or Calc.\n", "\n", "The format specifications are defined [here](https://genome.ucsc.edu/FAQ/FAQformat.html#format3), but in short:\n", "\n", "- the first, fourth and fifth columns contain the chromosome name and the start and end coordinates\n", "- the second column describes the tool or resource that generated the annotation\n", "- the third column describes the type of feature (e.g. 
gene, transcript, exon, TF binding site, histone acetylation mark, etc.)\n", "- the ninth column contains several fields, separated by semicolons\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Basic AWK syntax: filters\n", "\n", "The basic AWK syntax is the following:\n", "\n", "```\n", "awk 'filters {print statements}' filename\n", "```\n", "\n", "By default, awk splits each line into columns on whitespace (spaces or tabs), which works for this tab-separated file without further options.\n", "\n", "Each column of the file can be referred to with the dollar sign followed by the column number.\n", "\n", "For example $2 refers to the second column, and so on.\n", "\n", "The following code selects all the lines on chromosome 8 that lie between the coordinates 100000 and 200000:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr8\trefgene\tgene\t182200\t197339\t.\t+\t.\tgene_id=169270;symbol=ZNF596;;ID=169270\r\n", "chr8\trefgene\tgene\t116086\t117024\t.\t-\t.\tgene_id=441308;symbol=OR4F21;;ID=441308\r\n", "chr8\trefgene\tgene\t158345\t182318\t.\t-\t.\tgene_id=644128;symbol=RPL23AP53;;ID=644128\r\n" ] } ], "source": [ "awk '$1==\"chr8\" && $4>100000 && $5<200000 ' data/genes/chr8.gff" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Exercise\n", "\n", "Can you print all the lines between 5000000 and 10000000?" 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr8\trefgene\tgene\t7143733\t7212876\t.\t-\t.\tgene_id=100128890;symbol=FAM66B;ID=100128890\r\n", "chr8\trefgene\tgene\t7215498\t7220490\t.\t-\t.\tgene_id=100131980;symbol=ZNF705G;ID=100131980\r\n", "chr8\trefgene\tgene\t7812535\t7866277\t.\t+\t.\tgene_id=100132103;symbol=FAM66E;ID=100132103\r\n", "chr8\trefgene\tgene\t7783859\t7809935\t.\t+\t.\t _________\r\n", "chr8\trefgene\tgene\t6261077\t6264069\t.\t-\t.\t / Cows in \\\r\n", "chr8\trefgene\tgene\t7272385\t7274354\t.\t-\t.\t | the |\r\n", "chr8\trefgene\tgene\t7946463\t7946611\t.\t-\t.\t \\ Genome! /\r\n", "chr8\trefgene\tgene\t6602685\t6602765\t.\t+\t.\t ---------\r\n", "chr8\trefgene\tgene\t8905955\t8906028\t.\t+\t.\t \\ ^__^\r\n", "chr8\trefgene\tgene\t6602689\t6602761\t.\t-\t.\t \\ (oo)\\_______\r\n", "chr8\trefgene\tgene\t6693076\t6699975\t.\t+\t.\t (__)\\ )\\/\\\r\n", "chr8\trefgene\tgene\t8559666\t8561617\t.\t+\t.\t ||----w |\r\n", "chr8\trefgene\tgene\t9182561\t9192590\t.\t+\t.\t || |\r\n", "chr8\trefgene\tgene\t8175258\t8239257\t.\t-\t.\tgene_id=157285;symbol=SGK223;ID=157285\r\n", "chr8\trefgene\tgene\t9757574\t9760839\t.\t-\t.\tgene_id=157627;symbol=LINC00599;ID=157627\r\n", "chr8\trefgene\tgene\t6835171\t6856724\t.\t-\t.\tgene_id=1667;symbol=DEFA1;ID=1667\r\n", "chr8\trefgene\tgene\t6793345\t6795786\t.\t-\t.\tgene_id=1669;symbol=DEFA4;ID=1669\r\n", "chr8\trefgene\tgene\t6912829\t6914259\t.\t-\t.\tgene_id=1670;symbol=DEFA5;ID=1670\r\n", "chr8\trefgene\tgene\t6782216\t6783598\t.\t-\t.\tgene_id=1671;symbol=DEFA6;ID=1671\r\n", "chr8\trefgene\tgene\t6728097\t6735529\t.\t-\t.\tgene_id=1672;symbol=DEFB1;ID=1672\r\n", "chr8\trefgene\tgene\t7752199\t7754237\t.\t+\t.\tgene_id=1673;symbol=DEFB4A;ID=1673\r\n", "chr8\trefgene\tgene\t6844700\t6866346\t.\t-\t.\tgene_id=170949;symbol=DEFT1P;ID=170949\r\n", 
"chr8\trefgene\tgene\t7353368\t7366833\t.\t+\t.\tgene_id=245910;symbol=DEFB107A;ID=245910\r\n", "chr8\trefgene\tgene\t6357175\t6420784\t.\t-\t.\tgene_id=285;symbol=ANGPT2;ID=285\r\n", "chr8\trefgene\tgene\t8086092\t8102387\t.\t+\t.\tgene_id=286042;symbol=FAM86B3P;ID=286042\r\n", "chr8\trefgene\tgene\t6666041\t6693166\t.\t-\t.\tgene_id=389610;symbol=XKR5;ID=389610\r\n", "chr8\trefgene\tgene\t7829183\t7830775\t.\t-\t.\tgene_id=392188;symbol=USP17L8;ID=392188\r\n", "chr8\trefgene\tgene\t7189909\t7191501\t.\t+\t.\tgene_id=401447;symbol=USP17L1;ID=401447\r\n", "chr8\trefgene\tgene\t9760898\t9760982\t.\t-\t.\tgene_id=406907;symbol=MIR124-1;ID=406907\r\n", "chr8\trefgene\tgene\t7413660\t7431920\t.\t-\t.\tgene_id=441317;symbol=FAM90A7P;ID=441317\r\n", "chr8\trefgene\tgene\t7627106\t7628835\t.\t+\t.\tgene_id=441328;symbol=FAM90A10P;ID=441328\r\n", "chr8\trefgene\tgene\t6808248\t6809121\t.\t-\t.\tgene_id=449491;symbol=DEFA8P;ID=449491\r\n", "chr8\trefgene\tgene\t6816811\t6817683\t.\t-\t.\tgene_id=449492;symbol=DEFA9P;ID=449492\r\n", "chr8\trefgene\tgene\t6825663\t6826635\t.\t-\t.\tgene_id=449493;symbol=DEFA10P;ID=449493\r\n", "chr8\trefgene\tgene\t7669242\t7673238\t.\t-\t.\tgene_id=503614;symbol=DEFB107B;ID=503614\r\n", "chr8\trefgene\tgene\t6565878\t6619021\t.\t+\t.\tgene_id=55326;symbol=AGPAT5;ID=55326\r\n", "chr8\trefgene\tgene\t7194637\t7196229\t.\t+\t.\tgene_id=645402;symbol=USP17L4;ID=645402\r\n", "chr8\trefgene\tgene\t7833915\t7835507\t.\t-\t.\tgene_id=645836;symbol=USP17L3;ID=645836\r\n", "chr8\trefgene\tgene\t7705402\t7721319\t.\t+\t.\tgene_id=653423;symbol=SPAG11A;ID=653423\r\n", "chr8\trefgene\tgene\t9599182\t9599278\t.\t+\t.\tgene_id=693182;symbol=MIR597;ID=693182\r\n", "chr8\trefgene\tgene\t6886123\t6887011\t.\t-\t.\tgene_id=724068;symbol=DEFA11P;ID=724068\r\n", "chr8\trefgene\tgene\t6873391\t6875823\t.\t-\t.\tgene_id=728358;symbol=DEFA1B;ID=728358\r\n", "chr8\trefgene\tgene\t6264113\t6501140\t.\t+\t.\tgene_id=79648;symbol=MCPH1;ID=79648\r\n", 
"chr8\trefgene\tgene\t8993764\t9009152\t.\t-\t.\tgene_id=79660;symbol=PPP1R3B;ID=79660\r\n", "chr8\trefgene\tgene\t9413445\t9639856\t.\t+\t.\tgene_id=8658;symbol=TNKS;ID=8658\r\n", "chr8\trefgene\tgene\t8860314\t8890849\t.\t+\t.\tgene_id=90459;symbol=ERI1;ID=90459\r\n", "chr8\trefgene\tgene\t8641999\t8751131\t.\t-\t.\tgene_id=9258;symbol=MFHAS1;ID=9258\r\n" ] } ], "source": [ "awk '$4 > 5000000 && $5 < 10000000 ' data/genes/chr8.gff\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Awk: printing columns and doing operations\n", "\n", "Awk can also print only specific columns, and do arithmetic operations on them.\n", "\n", "Remember that each column can be referred to as \\$1, \\$2, \\$3, etc.\n", "\n", "For example the following code prints the first column, and the difference between the fifth and fourth columns (the length of each feature). We can pipe the output to head or less, to make it easier to visualize:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "##gff-version 0\r\n", "##source-version 0\r\n", "##date 0\r\n", "##genome-build 0\r\n", "chr8 9968\r\n", "chr8 75\r\n", "chr8 85\r\n", "chr8 81\r\n", "chr8 26779\r\n", "chr8 21444\r\n", "awk: write failure (Broken pipe)\r\n", "awk: close failed on file /dev/stdout (Broken pipe)\r\n" ] } ], "source": [ "awk '{print $1, $5-$4}' data/genes/chr8.gff | head\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice how this also prints the headers of the file. 
We can exclude these by adding a grep condition:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr8 9968 gene_id=10;symbol=NAT2;;ID=10\r\n", "chr8 75 gene_id=100126309;symbol=MIR875;;ID=100126309\r\n", "chr8 85 gene_id=100126338;symbol=MIR937;;ID=100126338\r\n", "chr8 81 gene_id=100126351;symbol=MIR939;;ID=100126351\r\n", "chr8 26779 gene_id=100127983;symbol=C8orf88;;ID=100127983\r\n", "chr8 21444 gene_id=100128126;symbol=STAU2-AS1;;ID=100128126\r\n", "chr8 12197 gene_id=100128338;symbol=FAM83H-AS1;;ID=100128338\r\n", "chr8 1835 gene_id=100128627;symbol=CDC42P3;;ID=100128627\r\n", "chr8 3282 gene_id=100128750;symbol=RBPMS-AS1;;ID=100128750\r\n", "chr8 69143 gene_id=100128890;symbol=FAM66B;ID=100128890\r\n" ] } ], "source": [ "awk '{print $1, $5-$4, $9}' data/genes/chr8.gff | grep -v '^#' | head" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise (difficult)\n", "\n", "Starting from the previous command, can you extract the gene symbol into a separate column?\n", "\n", "Hints: pipe an additional awk statement after the first. Use the -F option to specify a different field separator." 
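, "\n", "\n", "As a minimal sketch of the -F option, using a made-up line that is not taken from the workshop data, the following splits the text on semicolons and prints the second field:\n", "\n", "```\n", "# -F';' tells awk to use the semicolon as the field separator; this prints b\n", "echo 'a;b;c' | awk -F';' '{print $2}'\n", "```"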
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr8 9968 gene_id=10 symbol=NAT2\r\n", "chr8 75 gene_id=100126309 symbol=MIR875\r\n", "chr8 85 gene_id=100126338 symbol=MIR937\r\n", "chr8 81 gene_id=100126351 symbol=MIR939\r\n", "chr8 26779 gene_id=100127983 symbol=C8orf88\r\n", "chr8 21444 gene_id=100128126 symbol=STAU2-AS1\r\n", "chr8 12197 gene_id=100128338 symbol=FAM83H-AS1\r\n", "chr8 1835 gene_id=100128627 symbol=CDC42P3\r\n", "chr8 3282 gene_id=100128750 symbol=RBPMS-AS1\r\n", "chr8 69143 gene_id=100128890 symbol=FAM66B\r\n", "awk: write failure (Broken pipe)\r\n", "awk: close failed on file /dev/stdout (Broken pipe)\r\n", "grep: write error\r\n" ] } ], "source": [ "awk '{print $1, $5-$4, $9}' data/genes/chr8.gff | grep -v '^#' | awk -F';' '{print $1, $2}' | head" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## AWK: searching by regular expressions\n", "\n", "Awk can also be used to search by regular expression.\n", "\n", "For example, the following code will print all the lines in which the symbol starts with \"MIR\":" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr8\trefgene\tgene\t100549014\t100549089\t.\t-\t.\tgene_id=100126309;symbol=MIR875;;ID=100126309 \r\n", "chr8\trefgene\tgene\t144895127\t144895212\t.\t-\t.\tgene_id=100126338;symbol=MIR937;;ID=100126338\r\n", "chr8\trefgene\tgene\t145619364\t145619445\t.\t-\t.\tgene_id=100126351;symbol=MIR939;;ID=100126351\r\n", "chr8\trefgene\tgene\t65285775\t65295842\t.\t+\t.\tgene_id=100130155;symbol=MIR124-2HG;;ID=100130155\r\n", "chr8\trefgene\tgene\t128972879\t128972941\t.\t+\t.\tgene_id=100302161;symbol=MIR1205;;ID=100302161\r\n", "chr8\trefgene\tgene\t10682883\t10682953\t.\t-\t.\tgene_id=100302166;symbol=MIR1322;;ID=100302166\r\n", 
"chr8\trefgene\tgene\t129021144\t129021202\t.\t+\t.\tgene_id=100302170;symbol=MIR1206;;ID=100302170\r\n", "chr8\trefgene\tgene\t129061398\t129061484\t.\t+\t.\tgene_id=100302175;symbol=MIR1207;;ID=100302175\r\n", "chr8\trefgene\tgene\t128808208\t128808274\t.\t+\t.\tgene_id=100302185;symbol=MIR1204;;ID=100302185\r\n", "chr8\trefgene\tgene\t145625476\t145625559\t.\t-\t.\tgene_id=100302196;symbol=MIR1234;;ID=100302196\r\n", "chr8\trefgene\tgene\t113655722\t113655812\t.\t+\t.\tgene_id=100302225;symbol=MIR2053;;ID=100302225\r\n", "chr8\trefgene\tgene\t27743556\t27743633\t.\t-\t.\tgene_id=100422828;symbol=MIR4287;;ID=100422828\r\n", "chr8\trefgene\tgene\t29814788\t29814864\t.\t-\t.\tgene_id=100422876;symbol=MIR3148;;ID=100422876\r\n", "chr8\trefgene\tgene\t28362633\t28362699\t.\t-\t.\tgene_id=100422903;symbol=MIR4288;;ID=100422903\r\n", "chr8\trefgene\tgene\t96085142\t96085221\t.\t+\t.\tgene_id=100422964;symbol=MIR3150A;;ID=100422964\r\n", "chr8\trefgene\tgene\t104166842\t104166917\t.\t+\t.\tgene_id=100422992;symbol=MIR3151;;ID=100422992\r\n", "chr8\trefgene\tgene\t12584746\t12584808\t.\t+\t.\tgene_id=100500838;symbol=MIR3926-2;;ID=100500838\r\n", "chr8\trefgene\tgene\t27559194\t27559276\t.\t+\t.\tgene_id=100500858;symbol=MIR3622A;;ID=100500858\r\n", "chr8\trefgene\tgene\t12584741\t12584813\t.\t-\t.\tgene_id=100500870;symbol=MIR3926-1;;ID=100500870\r\n", "chr8\trefgene\tgene\t27559190\t27559284\t.\t-\t.\tgene_id=100500871;symbol=MIR3622B;;ID=100500871\r\n", "chr8\trefgene\tgene\t96085139\t96085224\t.\t-\t.\tgene_id=100500907;symbol=MIR3150B;;ID=100500907\r\n", "chr8\trefgene\tgene\t117886967\t117887039\t.\t-\t.\tgene_id=100500914;symbol=MIR3610;;ID=100500914\r\n", "chr8\trefgene\tgene\t42751340\t42751418\t.\t-\t.\tgene_id=100616115;symbol=MIR4469;;ID=100616115\r\n", "chr8\trefgene\tgene\t94928250\t94928347\t.\t-\t.\tgene_id=100616169;symbol=MIR378D2;;ID=100616169\r\n", "chr8\trefgene\tgene\t29920258\t30108213\t.\t-\t.\tgene_id=100616190;symbol=MIR548O2;;ID=100616190\r\n", 
"chr8\trefgene\tgene\t92217713\t92217786\t.\t+\t.\tgene_id=100616245;symbol=MIR4661;;ID=100616245\r\n", "chr8\trefgene\tgene\t124228028\t124228103\t.\t-\t.\tgene_id=100616260;symbol=MIR4663;;ID=100616260\r\n", "chr8\trefgene\tgene\t143257700\t143257779\t.\t+\t.\tgene_id=100616268;symbol=MIR4472-1;;ID=100616268\r\n", "chr8\trefgene\tgene\t144815253\t144815323\t.\t-\t.\tgene_id=100616318;symbol=MIR4664;;ID=100616318\r\n", "chr8\trefgene\tgene\t101394991\t101395073\t.\t+\t.\tgene_id=100616451;symbol=MIR4471;;ID=100616451\r\n", "chr8\trefgene\tgene\t62627347\t62627418\t.\t+\t.\tgene_id=100616484;symbol=MIR4470;;ID=100616484\r\n", "chr8\trefgene\tgene\t103137660\t103137743\t.\t+\t.\tgene_id=100847001;symbol=MIR5680;;ID=100847001\r\n", "chr8\trefgene\tgene\t131020580\t131020699\t.\t-\t.\tgene_id=100847051;symbol=MIR5194;;ID=100847051\r\n", "chr8\trefgene\tgene\t81153624\t81153708\t.\t+\t.\tgene_id=100847056;symbol=MIR5708;;ID=100847056\r\n", "chr8\trefgene\tgene\t75460778\t75460852\t.\t+\t.\tgene_id=100847058;symbol=MIR5681A;;ID=100847058\r\n", "chr8\trefgene\tgene\t75460785\t75460844\t.\t-\t.\tgene_id=100847091;symbol=MIR5681B;;ID=100847091\r\n", "chr8\trefgene\tgene\t9760898\t9760982\t.\t-\t.\tgene_id=406907;symbol=MIR124-1;ID=406907\r\n", "chr8\trefgene\tgene\t65291706\t65291814\t.\t+\t.\tgene_id=406908;symbol=MIR124-2;;ID=406908\r\n", "chr8\trefgene\tgene\t135812763\t135812850\t.\t-\t.\tgene_id=407030;symbol=MIR30B;;ID=407030\r\n", "chr8\trefgene\tgene\t135817119\t135817188\t.\t-\t.\tgene_id=407033;symbol=MIR30D;;ID=407033\r\n", "chr8\trefgene\tgene\t22102475\t22102556\t.\t-\t.\tgene_id=407037;symbol=MIR320A;;ID=407037\r\n", "chr8\trefgene\tgene\t75512101\t75670587\t.\t+\t.\tgene_id=441355;symbol=MIR2052HG;;ID=441355\r\n", "chr8\trefgene\tgene\t14710947\t14711019\t.\t-\t.\tgene_id=494332;symbol=MIR383;;ID=494332\r\n", "chr8\trefgene\tgene\t41517959\t41518026\t.\t-\t.\tgene_id=619554;symbol=MIR486-1;;ID=619554\r\n", 
"chr8\trefgene\tgene\t1765397\t1765473\t.\t+\t.\tgene_id=693181;symbol=MIR596;;ID=693181\r\n", "chr8\trefgene\tgene\t9599182\t9599278\t.\t+\t.\tgene_id=693182;symbol=MIR597;ID=693182\r\n", "chr8\trefgene\tgene\t10892716\t10892812\t.\t-\t.\tgene_id=693183;symbol=MIR598;;ID=693183\r\n", "chr8\trefgene\tgene\t100548864\t100548958\t.\t-\t.\tgene_id=693184;symbol=MIR599;;ID=693184\r\n", "chr8\trefgene\tgene\t145019359\t145019447\t.\t-\t.\tgene_id=724031;symbol=MIR661;;ID=724031\r\n" ] } ], "source": [ "awk '$9 ~ /symbol=MIR/ {print $0}' data/genes/chr8.gff " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Last exercise!\n", "\n", "Calculate the length of the gene POU5F1B.\n", "\n", "Find the gene whose gene_id is equal to that number." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1584\r\n" ] } ], "source": [ "awk '$9 ~ /POU5F1B/ {print $5-$4}' data/genes/chr8.gff \n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chr8\trefgene\tGood_Job!\t143953773\t143961236\t.\t-\t.\tgene_id=1584;symbol=CYP11B1;;ID=1584\r\n" ] } ], "source": [ "awk '$9 ~ /gene_id=1584/ {print $0}' data/genes/chr8.gff " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Bonus: Makefiles\n", "\n", "Let's have a look at the file called Makefile in the exercise directory:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test_exercises: start help ignorecase multiplefiles\r\n", "generate_exercises: generate_grep generate_awk\r\n", "\r\n", "testrule:\r\n", "\techo this is a Makefile 
rule\r\n", "\techo You can associate it to as many commands you want\r\n", "\r\n", "notebook:\r\n", "\tjupyter nbconvert --to notebook --execute PEB\\ Bash\\ Workshop.ipynb\r\n", "\r\n" ] } ], "source": [ "head Makefile" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Press space or the down key to continue" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Defining pipelines with Makefiles\n", "\n", "Makefiles are a basic way to define pipelines of shell commands.\n", "\n", "Nowadays there are more sophisticated tools available, but many of them build on the same ideas as Makefiles.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "A Makefile is a collection of \"rules\".\n", "\n", "Each of these rules follows this basic syntax:\n", "\n", "```\n", "target: prerequisites\n", " commands to execute\n", "```\n", "\n", "As you can see in the Makefile included, most of the rules regenerate the exercise files, or run some commands without having to type them every time.\n", "\n", "For example, the rule \"testrule\" is associated with two echo commands." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## How to run Makefile rules\n", "\n", "To execute a rule in the Makefile, simply type:\n", "\n", "```\n", "make [name of the rule]\n", "```\n", "\n", "For example:\n", "\n" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "echo this is a Makefile rule\r\n", "this is a Makefile rule\r\n", "echo You can associate it to as many commands you want\r\n", "You can associate it to as many commands you want\r\n" ] } ], "source": [ "make testrule" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The program \"make\" automatically looks for a file named \"Makefile\" in the current directory, and executes the rule with the given name. As the output shows, it prints each command before running it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Rules can also depend on other rules. For example, the two rules \"test_exercises\" and \"generate_exercises\" at the beginning of the file list other rules as prerequisites, so calling one of them runs all of those rules together." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# The last slide\n", "\n", "This is the last slide of the workshop. To finish, try to execute the rule \"cow\" in the Makefile." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " _____________\r\n", "/ I hope you \\\r\n", "| have |\r\n", "| enjoyed the |\r\n", "| workshop |\r\n", "\\ :-) /\r\n", " -------------\r\n", " \\ ^__^\r\n", " \\ (oo)\\_______\r\n", " (__)\\ )\\/\\\r\n", " ||----w |\r\n", " || ||\r\n", " ___________\r\n", "( Now let's )\r\n", "( go to the )\r\n", "( beach )\r\n", " -----------\r\n", " o ^__^\r\n", " o (oo)\\_______\r\n", " (__)\\ )\\/\\\r\n", " ||----w |\r\n", " || ||\r\n" ] } ], "source": [ "make cow" ] } ], "metadata": { "celltoolbar": "Slideshow", "hide_input": false, "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" }, "nav_menu": {}, "toc": { "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 6, "toc_cell": true, "toc_section_display": "block", "toc_window_display": false } }, "nbformat": 4, "nbformat_minor": 1 }