{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import Image\n", "from IPython.display import clear_output\n", "from IPython.display import FileLink, FileLinks" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction to\n", "\n", "![title](img/python-logo-master-flat.png)\n", "\n", "### with Application to Bioinformatics\n", "\n", "#### - Day 2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Review Day 1\n", "\n", "Give an example of the following:\n", "\n", "- A number of type `float` \n", "- A variable containing an `integer` \n", "- A `Boolean` / A `list` / A `string` \n", "- What character represents a comment? \n", "- What happens if I take a `list` plus a `list`? \n", "- How do I find out if x is present in a `list`? \n", "- How do I find out if 5 is larger than 3 and the integer 4 is the same as the float 4? \n", "- How do I find the second item in a `list`? \n", "- An example of a `mutable sequence` \n", "- An example of an `immutable sequence` \n", "- Something `iterable` (apart from a list) \n", "- How do I print ‘Yes’ if x is bigger than y? \n", "- How do I open a file handle to read a file called ‘somerandomfile.txt’? \n", "- The file contains several lines, how do I print each line? 
" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Variables and Types\n", "\n", "__A number of type `float`:__ \n", "3.14 \n", "\n", "__A variable containing an `integer`:__ \n", "a = 5 \n", "x = 349852 \n", "\n", "__A `boolean`:__ \n", "True \n", "\n", "__A `list`:__ \n", "[2,6,4,8,9] \n", "\n", "__A `string`:__ \n", "'this is a string'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Literals\n", "\n", "All literals have a type:\n", "\n", "- Strings (str)       ‘Hello’ “Hi”\n", "- Integers (int)\t    5\n", "- Floats (float)\t    3.14\n", "- Boolean (bool)     True or False" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "float" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(3.14)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Variables\n", "\n", "Used to store values and to assign them a name." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "3.14" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = 3.14\n", "a" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Lists\n", "\n", "A collection of values." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "list" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = [1,5,3,7,8]\n", "y = ['a','b','c']\n", "type(x)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Operations\n", "\n", "__What character represents a `comment`?__ \n", "\\# \n", "\n", "__What happens if I take a `list` plus a `list`?__ \n", "The lists will be concatenated \n", "\n", "__How do I find out if x is present in a `list`?__ \n", "`x in [1,2,3,4]` \n", "\n", "__How do I find out if 5 is larger than 3 and the integer 4 is the same as the float 4?__ \n", "`5 > 3 and 4 == 4.0` " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Basic operations\n", "\n", "__Type         Operations__\n", "\n", "int           + - * / ** % // ... \n", "float           + - * / ** % // ... 
\n", "string           + *" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "3" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = 2\n", "b = 5.46\n", "c = [1,2,3,4]\n", "d = [5,6,7,8]\n", "\n", "7//2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ " ### Comparison/Logical/Membership operators\n", " \n", " \"Drawing\" " ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = [1,2,3,4,5,6,7,8]\n", "b = 5\n", "c = 10\n", "b not in a" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Sequences\n", "\n", "__How do I find the second item in a `list`?__ \n", "`list_a[1]` \n", "\n", "__An example of a `mutable sequence`:__ \n", "`[1,2,3,4,5,6]`\n", "\n", "__An example of an `immutable sequence`:__ \n", "`'a string is immutable'` \n", "\n", "__Something `iterable` (apart from a list):__ \n", "`'a string is also iterable'` " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Indexing\n", "\n", "Lists (and strings) are an ORDERED collection of elements where every element can be accessed through an index.\n", "\n", "`a[0]` : first item in list a\n", "\n", "REMEMBER! 
Indexing starts at 0 in Python" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[1, 3, 5]" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = [1,2,3,4,5]\n", "b = ['a','b','c']\n", "c = 'a random string'\n", "\n", "a[::2]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Mutable / Immutable sequences and iterables\n", "\n", "Lists are mutable objects, meaning you can use an index to change the list, while strings are immutable and therefore cannot be changed.\n", "\n", "An iterable is anything you can loop over, e.g. lists and strings." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "ename": "TypeError", "evalue": "'str' object does not support item assignment", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[0mc\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'a random string'\u001b[0m \u001b[1;31m# immutable\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 5\u001b[1;33m \u001b[0mc\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34m'A'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 6\u001b[0m \u001b[0mc\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: 'str' object does not support item assignment" ] } ], "source": [ "a = [1,2,3,4,5] # mutable\n", "b = ['a','b','c'] # mutable\n", "c = 'a random string' # immutable\n", "\n", "c[0] = 'A'\n", "c" ] }, { "cell_type": "markdown", "metadata": { 
"slideshow": { "slide_type": "slide" } }, "source": [ "## New data type: `tuples`\n", "\n", "- A tuple is an immutable sequence of objects\n", "- Unlike a list, nothing can be changed in a tuple\n", "- Still iterable" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "ename": "TypeError", "evalue": "'tuple' object does not support item assignment", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mmyTuple\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;33m(\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m4\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'a'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'b'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;34m'c'\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m42\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m43\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m44\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[0mmyTuple\u001b[0m\u001b[1;33m[\u001b[0m\u001b[1;36m0\u001b[0m\u001b[1;33m]\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m42\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 3\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmyTuple\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0mprint\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mlen\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mmyTuple\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[1;32mfor\u001b[0m \u001b[0mi\u001b[0m \u001b[1;32min\u001b[0m 
\u001b[0mmyTuple\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n", "\u001b[1;31mTypeError\u001b[0m: 'tuple' object does not support item assignment" ] } ], "source": [ "myTuple = (1,2,3,4,'a','b','c',[42,43,44])\n", "myTuple[0] = 42\n", "print(myTuple)\n", "print(len(myTuple))\n", "for i in myTuple:\n", "    print(i)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## If/ Else statements\n", "\n", "How do I print ‘Yes’ if x is bigger than y? \n", "`if x > y:` \n", " `print('Yes')`" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 is found in the list b\n" ] } ], "source": [ "a = 2\n", "b = [1,2,3,4]\n", "if a in b:\n", "    print(str(a)+' is found in the list b')\n", "else:\n", "    print(str(a)+' is not in the list')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Files and loops\n", "\n", "__How do I open a file handle to read a file called ‘somerandomfile.txt’?__ \n", "`fh = open('somerandomfile.txt', 'r', encoding = 'utf-8')` \n", "`fh.close()`\n", "\n", "__The file contains several lines, how do I print each line?__ \n", "`for line in fh:` \n", " `print(line.strip())`" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "just a strange\n", "file with\n", "some\n", "nonsense lines\n" ] } ], "source": [ "fh = open('../files/somerandomfile.txt','r', encoding = 'utf-8')\n", "for line in fh:\n", "    print(line.strip())\n", "fh.close()" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n", "6\n", "7\n", "8\n" ] } ], "source": [ "numbers = [5,6,7,8]\n", "i = 0\n", "while i < 
len(numbers):\n", "    print(numbers[i])\n", "    i += 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Questions?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "__→ Any unfinished exercises from Day 1__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How to approach a coding task\n", "\n", "Problem: \n", "You have a VCF file with a large number of samples. You are interested in only one of the samples (sample1) and one region (chr5, 1.000.000-1.005.000). What you want to know is whether this sample has any variants in this region, and if so, which variants.\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Always write pseudocode!\n", "\n", "

\n", "\n", "Pseudocode is a description of what you want to do without actually using proper syntax" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What is your input?\n", "\n", "A VCF file that is iterable\n", "\n", "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "#### Basic Pseudocode:\n", "- Open file and loop over lines (ignore lines with #)\n", "- Identify lines where chromosome is 5 and position is between 1.000.000 and 1.005.000\n", "- Isolate the column that contains the genotype for sample1\n", "- Extract the genotypes only from the column\n", "- Check if the genotype contains any alternate alleles\n", "- Print any variants containing alternate alleles for this sample within the specified region" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "__- Open file and loop over lines (ignore lines starting with #)__" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "scrolled": true, "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\t10492\t.\tC\tT\t550.31\tLOW_VQSLOD\tAN=26;AC=2\tGT:AD:DP:GQ:PGT:PID:PL\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t0/1:12,7:19:99:0|1:10403_ACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC_A:196,0,340\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\t0/1:18,4:22:48:.:.:48,0,504\t./.:0,0:0:.:.:.:.\t./.:0,0:0:.:.:.:.\n" ] } ], "source": [ "fh = open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        print(line.strip())\n", "        break\n", "fh.close()\n", "# Next, find chromosome 5" ] }, { "cell_type": 
"markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__- Identify lines where chromosome is 5 and position is between 1.000.000 and 1.005.000__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\n" ] } ], "source": [ "fh = open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        if cols[0] == '5':\n", "            print(cols[0])\n", "            break\n", "fh.close()\n", "\n", "# Next, find the correct region" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\t1000080\t.\tA\tT\t2557.1\tPASS\tAN=26;AC=2\tGT:AD:DP:GQ:PL\t0/1:15,18:33:99:489,0,357\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t0/1:21,19:40:99:481,0,542\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\t./.:0,0:0:.:.\n", "\n" ] } ], "source": [ "fh = open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        if cols[0] == '5' and \\\n", "           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:\n", "            print(line)\n", "            break\n", "fh.close()\n", "# Next, find the genotypes for sample1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__- Isolate the column that contains the genotype for sample1__" ] }, { "cell_type": "markdown", "metadata": { 
"slideshow": { "slide_type": "-" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0/1:15,18:33:99:489,0,357\n" ] } ], "source": [ "fh = open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        if cols[0] == '5' and \\\n", "           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:\n", "            geno = cols[9]\n", "            print(geno)\n", "            break\n", "fh.close()\n", "# Next, extract the genotypes only" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__- Extract the genotypes only from the column__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0/1\n" ] } ], "source": [ "fh = open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        if cols[0] == '5' and \\\n", "           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:\n", "            geno = cols[9].split(':')[0]\n", "            print(geno)\n", "            break\n", "fh.close()\n", "# Next, find in which positions sample1 has alternate alleles" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__- Check if the genotype contains any alternate alleles__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", 
"output_type": "stream", "text": [ "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n", "0/1\n" ] } ], "source": [ "fh = open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        if cols[0] == '5' and \\\n", "           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:\n", "            geno = cols[9].split(':')[0]\n", "            if geno in ['0/1', '1/1']:\n", "                print(geno)\n", "fh.close()\n", "#Next, print nicely" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__- Print any variants containing alternate alleles for this sample within the specified region__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5:1000080_A-T has genotype: 0/1\n", "5:1000156_G-A has genotype: 0/1\n", "5:1001097_C-A has genotype: 0/1\n", "5:1001193_C-T has genotype: 0/1\n", "5:1001245_T-C has genotype: 0/1\n", "5:1001339_C-T has genotype: 0/1\n", "5:1001344_G-C has genotype: 0/1\n", "5:1001683_G-T has genotype: 0/1\n", "5:1001755_G-A has genotype: 0/1\n", "5:1002374_G-A has genotype: 0/1\n", "5:1002382_G-C has genotype: 0/1\n", "5:1002620_T-C has genotype: 0/1\n", "5:1002722_G-A has genotype: 0/1\n", "5:1002819_C-A has genotype: 0/1\n", "5:1003043_G-T has genotype: 0/1\n", "5:1003099_C-T has genotype: 0/1\n", "5:1003135_G-A has genotype: 0/1\n", "5:1004648_A-G has genotype: 0/1\n", "5:1004650_A-C has genotype: 0/1\n", "5:1004665_A-G has genotype: 0/1\n", "5:1004702_G-T has genotype: 0/1\n", "5:1004879_T-C has genotype: 0/1\n" ] } ], "source": [ "fh = 
open('C:/Users/Nina/Documents/courses/Python_Beginner_Course/genotypes.vcf', 'r', encoding = 'utf-8')\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        if cols[0] == '5' and \\\n", "           int(cols[1]) >= 1000000 and int(cols[1]) <= 1005000:\n", "            geno = cols[9].split(':')[0]\n", "            if geno in ['0/1', '1/1']:\n", "                var = cols[0]+':'+cols[1]+'_'+cols[3]+'-'+cols[4]\n", "                print(var+' has genotype: '+geno)\n", "fh.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__→ Notebook Day_2_Exercise_1 (~50 minutes)__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Comments for Exercise 1" ] }, { "cell_type": "code", "execution_count": 76, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The frequency of the rs4988235 SNP is: 0.7833333333333333\n" ] } ], "source": [ "fh = open('../downloads/genotypes_small.vcf', 'r', encoding = 'utf-8')\n", "\n", "wt = 0\n", "het = 0\n", "hom = 0\n", "\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('\\t')\n", "        chrom = cols[0]\n", "        pos = cols[1]\n", "        if chrom == '2' and pos == '136608646':\n", "            for geno in cols[9:]:\n", "                alleles = geno[0:3]\n", "                if alleles == '0/0':\n", "                    wt += 1\n", "                elif alleles == '0/1':\n", "                    het += 1\n", "                elif alleles == '1/1':\n", "                    hom += 1\n", "\n", "freq = (2*hom + het)/((wt+hom+het)*2)\n", "print('The frequency of the rs4988235 SNP is: '+str(freq))\n", "\n", "fh.close()" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The frequency of the rs4988235 SNP is: 0.7833333333333333\n" ] } ], "source": [ "with open('../downloads/genotypes_small.vcf', 'r', encoding = 'utf-8') as fh:\n", "    for line in fh:\n", "        if line.startswith('2\\t136608646'):\n", "            alleles = [int(item) for sub in [geno[0:3].split('/') \\\n", "                       for geno in line.strip().split('\\t')[9:]] \\\n", "                       for item in sub]\n", "            print('The frequency of the rs4988235 SNP is: '\\\n", "                  +str(sum(alleles)/len(alleles)))\n", "            break" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Much shorter, but maybe not as intuitive..." ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The frequency of the rs4988235 SNP is: 0.7833333333333333\n" ] } ], "source": [ "with open('../downloads/genotypes_small.vcf', 'r', encoding = 'utf-8') as fh:\n", "    for line in fh:\n", "        if line.startswith('2\\t136608646'):\n", "            genoInfo = [geno for geno in line.strip().split('\\t')[9:]] # extract complete geno info to list\n", "            genotypes = [g[0:3].split('/') for g in genoInfo] # split into alleles to nested list\n", "            alleles = [int(item) for sub in genotypes for item in sub] # flatten the nested list to normal list\n", "            print('The frequency of the rs4988235 SNP is: '+str(sum(alleles)/len(alleles))) # use sum and len to calculate freq\n", "            break" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Shorter than the first version, but easier to follow than the second version" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## More useful functions and methods\n", "\n", "__What is the difference between a `function` and a `method`?__\n", "\n", "A `method` always belongs to an object of a specific class, a `function` does not have to. For example:\n", "\n", "`print('a string')` and `print(42)` both work, even though one is a string and one is an integer\n", "\n", "`'a string '.strip()` works, but `[1,2,3,4].strip()` does not work. 
`strip()` is a method that only works on strings\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "__What does it matter to me?__\n", "\n", "For now, you mostly need to be aware of the difference, and know the different syntaxes:\n", "\n", "__A function:__ \n", "`functionName()`\n", "\n", "__A method:__ \n", "```.methodName()```\n" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "ename": "AttributeError", "evalue": "'list' object has no attribute 'strip'", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[1;34m'a string '\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstrip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 5\u001b[1;33m \u001b[1;33m[\u001b[0m\u001b[1;36m1\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m2\u001b[0m\u001b[1;33m,\u001b[0m\u001b[1;36m3\u001b[0m\u001b[1;33m]\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mstrip\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;31mAttributeError\u001b[0m: 'list' object has no attribute 'strip'" ] } ], "source": [ "len([1,2,3])\n", "len('a string')\n", "\n", "'a string '.strip()\n", "[1,2,3].strip()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Functions" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "-" } }, "source": [ "[Python Built-in 
functions](https://docs.python.org/3/library/functions.html#)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "abs(-5)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[1, 2, 4, 23, 35, 88]" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted([1,2,35,23,88,4])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### From Python documentation\n", "\n", "

\n", "\n", "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on built-in function sum in module builtins:\n", "\n", "sum(iterable, start=0, /)\n", " Return the sum of a 'start' value (default: 0) plus an iterable of numbers\n", " \n", " When the iterable is empty, return the start value.\n", " This function is intended specifically for use with numeric values and may\n", " reject non-numeric types.\n", "\n" ] } ], "source": [ "sum([1,2,3,4],5)\n", "help(sum)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "3.23" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "round(3.234556, 2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Methods\n", "\n", "### Useful operations on strings\n", "\n", "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" \n", "\n", "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "' spaciousWith5678.com'" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "' spaciousWith5678.com '.rstrip()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "['split', 'a', 
'string', 'into', 'a', 'list']" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = ' split a string into a list '\n", "a.split()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'a s t r i n g a l r e a d y'" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "' '.join('a string already')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" \n", "\n", "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'long string'.startswith('ng',2)\n", "'long string'.endswith('string')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "\"Drawing\" \n", "\n", "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "'LONGRANDOMSTRING'" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'LongRandomString'.lower()\n", "'LongRandomString'.upper()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Useful operations on Mutable sequences\n", "\n", "

\n", "\n", "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "[5, 5, 5, 5, 4, 3, 2, 1]" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "a = [1,2,3,4,5,5,5,5]\n", "a.append(6)\n", "a.pop()\n", "a.reverse()\n", "a" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Summary\n", "\n", "- Tuples are immutable sequences of objects\n", "- Always plan your approach before you start coding\n", "- A method always belongs to an object of a specific class, a function does not have to\n", "- The official Python documentation describes the syntax for all built-in functions and methods" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "__→ Notebook Day_2_Exercise_2 (~30 minutes)__" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## IMDb\n", "\n", "Download the 250.imdb file from the course website\n", "\n", "This format of this file is: \n", "- Line by line \n", "- Columns separated by the \\| character\n", "- Header starting with #\n", "\n", "\"Drawing\" \n", "\n", "\\# Votes | Rating | Year | Runtime | URL | Genres | Title" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Find the movie with the highest rating\n", "\n", "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "\"Drawing\" " ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[9.3, 'The Shawshank Redemption']\n" ] } ], "source": [ "fh = open('../downloads/250.imdb', 'r', encoding = 'utf-8')\n", "best = [0,''] # here we save the rating and which movie\n", "for line in 
fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('|')\n", "        rating = float(cols[1].strip())\n", "        if rating > best[0]: # if the rating is higher than previous highest, update best\n", "            best = [rating,cols[6]]\n", "fh.close()\n", "print(best)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### For the genre Adventure\n", "\n", "Find the top movie by rating\n", "\n", "\n", "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Answer\n", "\n", "Top movie: \n", "The LOTR: The Return of the King with 8.9 " ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[8.9, 'The Lord of the Rings: The Return of the King']\n" ] } ], "source": [ "fh = open('../downloads/250.imdb', 'r', encoding = 'utf-8')\n", "top = [0,'']\n", "\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('|')\n", "        genre = cols[5].strip()\n", "        glist = genre.split(',') # one movie can be in several genres\n", "        if 'Adventure' in glist: # check if movie belongs to genre Adventure\n", "            rating = float(cols[1].strip())\n", "            if rating > top[0]:\n", "                top = [rating,cols[6]]\n", "fh.close()\n", "print(top)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Find the number of genres\n", "\n", "\"Drawing\" " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Answer\n", "\n", "Watch out for upper and lower case!\n", "\n", "The correct answer is 22" ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['drama', 'war', 'adventure', 'comedy', 'family', 'animation', 'biography', 
'history', 'action', 'crime', 'mystery', 'thriller', 'fantasy', 'romance', 'sci-fi', 'western', 'musical', 'music', 'historical', 'sport', 'film-noir', 'horror']\n", "22\n" ] } ], "source": [ "fh = open('../downloads/250.imdb', 'r', encoding = 'utf-8')\n", "genres = []\n", "\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('|')\n", "        genre = cols[5].strip()\n", "        glist = genre.split(',')\n", "        for entry in glist:\n", "            if entry.lower() not in genres: # only add genre if not already in list\n", "                genres.append(entry.lower())\n", "fh.close()\n", "print(genres)\n", "print(len(genres))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## New data type: `set`\n", "\n", "- A set contains an unordered collection of unique and immutable objects\n", "\n", "Syntax: \n", "_For empty set:_ \n", "`setName = set()` \n", "\n", "_For populated sets:_ \n", "`setName = {1,2,3,4,5}`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Common operations on `sets`\n", "\n", "`set.add(a)` \n", "`len(set)` \n", "`a in set` " ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{1, 2, 3, 4, 5}\n" ] } ], "source": [ "x = set()\n", "x.add(100)\n", "x.add(25)\n", "x.add(3)\n", "#for i in x:\n", "#    print(i)\n", "\n", "mySet = {1,2,3,4}\n", "mySet.add(5)\n", "mySet.add(4)\n", "print(mySet)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Find the number of genres\n", "\n", "\"Drawing\" \n", "\n", "Modify your code to use sets" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'crime', 'film-noir', 'drama', 'historical', 'biography', 'family', 
'action', 'mystery', 'comedy', 'thriller', 'musical', 'romance', 'war', 'sport', 'animation', 'fantasy', 'sci-fi', 'adventure', 'history', 'music', 'horror', 'western'}\n", "22\n" ] } ], "source": [ "fh = open('../downloads/250.imdb', 'r', encoding = 'utf-8')\n", "genres = set()\n", "\n", "for line in fh:\n", "    if not line.startswith('#'):\n", "        cols = line.strip().split('|')\n", "        genre = cols[5].strip()\n", "        glist = genre.split(',')\n", "        for entry in glist:\n", "            genres.add(entry.lower()) # set only adds entry if not already in\n", "fh.close()\n", "print(genres)\n", "print(len(genres))" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" }, "livereveal": { "height": 768, "scroll": true, "width": 1024 } }, "nbformat": 4, "nbformat_minor": 2 }