{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from IPython.display import Image\n", "from IPython.display import clear_output\n", "from IPython.display import FileLink, FileLinks" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "## Introduction to\n", "\n", "![title](img/python-logo-master-flat.png)\n", "\n", "### with Application to Bioinformatics\n", "\n", "#### - Day 4" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Start by doing today's quiz\n", "\n", "Go to Canvas, `Modules -> Day 4 -> Review Day 3`\n", " \n", "~20 minutes\n" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### In what ways does the type of an object matter?\n", "- Questions 1, 2 and 3" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The price is a string!\n" ] } ], "source": [ "row = 'sofa|2000|buy|Uppsala'\n", "fields = row.split('|')\n", "price = fields[1]\n", "if price == 2000:\n", " print('The price is a number!')\n", "if price == '2000':\n", " print('The price is a string!')" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[30, 100, 2000]\n" ] } ], "source": [ "print(sorted([ 2000, 30, 100 ]))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['100', '2000', '30']\n" ] } ], "source": [ "print(sorted(['2000', '30', '100']))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "ord('3')" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", 
"slideshow": { "slide_type": "slide" } }, "source": [ "#### In what ways does the type of an object matter?\n", "\n", "- Each type store a specific type of information\n", " - `int` for integers,\n", " - `float` for floating point values (decimals),\n", " - `str` for strings,\n", " - `list` for lists,\n", " - `dict` for dictionaries.\n", "\n", "- Each type supports different operations, functions and methods." ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "- Each type supports different **operations**" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "30 > 2000" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'30' > '2000'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "30 > int('2000')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "'12345'[2]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "12345[2]" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "- Each type supports different **functions**" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "text/plain": [ "'2'" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "max('2000')" ] }, { 
"cell_type": "code", "execution_count": 7, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "ename": "TypeError", "evalue": "'int' object is not iterable", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[7], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;43mmax\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m2000\u001b[39;49m\u001b[43m)\u001b[49m\n", "\u001b[0;31mTypeError\u001b[0m: 'int' object is not iterable" ] } ], "source": [ "max(2000)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import math\n", "math.cos(3.14)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "math.cos('3.14')" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 0, "slideshow": { "slide_type": "slide" } }, "source": [ "- Each type supports different **methods**" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "text/plain": [ "'actg'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'ACTG'.lower()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "ename": "AttributeError", "evalue": "'list' object has no attribute 'lower'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[9], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[38;5;241;43m2\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m3\u001b[39;49m\u001b[43m]\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mlower\u001b[49m()\n", "\u001b[0;31mAttributeError\u001b[0m: 'list' object has no attribute 'lower'" ] } ], "source": [ "[1, 2, 3].lower()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "set([]).add('tiger')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "[].add('tiger')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- How to find what methods are available: Python documentation, or `dir()`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "dir('ACTG') # list all attributes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "dir(str) # list all attributes" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### Convert string to number\n", "- Questions 4, 5 and 6\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2000.0" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float('2000')" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float('0.5')" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "1000000000.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "float('1e9')" ] }, { "cell_type": 
"code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "float('1e-2')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "int('2000')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "int('1.5')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "int('1e9')" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### Convert to boolean: `1`, `0`, `'1'`, `'0'`, `''`, `{}`\n", "- Question 7\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "bool(1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "bool(0)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "bool('0')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "bool('')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "bool([])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "bool({})" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "- Python and the truth: true and false values" 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "lines_to_next_cell": 2 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 is true!\n", "0 is false!\n", "'' is false!\n", "'0' is true!\n", "'1' is true!\n", "[] is false!\n", "[0] is true!\n" ] } ], "source": [ "values = [1, 0, '', '0', '1', [], [0]]\n", "for x in values:\n", " if x:\n", " print(repr(x), 'is true!')\n", " else:\n", " print(repr(x), 'is false!')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- `if x` is equivalent to `if bool(x)`" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "- Is `1` equivalent to `True`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "1 == True" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "x = 1\n", "if x is True:\n", " print(repr(x), 'is true!')\n", "else:\n", " print(repr(x), 'is false!')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "x = 1\n", "if bool(x) is True:\n", " print(repr(x), 'is true!')\n", "else:\n", " print(repr(x), 'is false!')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "- Be careful: `if x is True` is **not** equivalent to `if bool(x) is True`" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### Container types, when should you use which? (Question 8)\n", "\n", "- **lists**: when order is important\n", "- **dictionaries**: to keep track of the relation between keys and values\n", "- **sets**: to check for membership. No order, no duplicates." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "text/plain": [ "['comedy', 'drama', 'drama', 'sci-fi']" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genre_list = [\"comedy\", \"drama\", \"drama\", \"sci-fi\"]\n", "genre_list" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "text/plain": [ "{'comedy', 'drama', 'sci-fi'}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genres = set(genre_list)\n", "genres" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'drama' in genre_list\n", "'drama' in genres\n", "# which operation is faster?" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "lines_to_next_cell": 0, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "{'comedy': 1, 'drama': 2, 'sci-fi': 1}" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "genre_counts = {\"comedy\": 1, \"drama\": 2, \"sci-fi\": 1}\n", "genre_counts" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'rating': 10.0, 'title': 'Toy Story'}" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movie = {\"rating\": 10.0, \"title\": \"Toy Story\"}\n", "movie" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### Python syntax (Question 9)\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def echo(message): # starts a new function definition\n", " # this function echos 
the message \n", " print(message) # print state of the variable\n", " return message # return the value to end the function\n" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### Converting between strings and lists\n", "- Question 10" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['h', 'e', 'l', 'l', 'o']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(\"hello\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "str(['h', 'e', 'l', 'l', 'o'])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'h_e_l_l_o'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "'_'.join('hello')" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### TODAY\n", "\n", "- More on functions:\n", " - scope of variables\n", " - positional arguments and keyword arguments\n", " - `return` statement\n", "- Reusing code:\n", " - comments and documentation\n", " - importing modules: using libraries\n", "- Pandas - explore your data!\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### More on functions: scope - global vs local variables\n", "- Global variables can be accessed inside the function" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HOST inside the function = global\n", "HOST outside the function = global\n" ] } ], "source": [ "HOST = 'global'\n", "\n", "def show_host():\n", " print(f'HOST inside the function = {HOST}')\n", "\n", "show_host()\n", "print(f'HOST outside the function = {HOST}')\n" ] }, { 
"cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- Change in the function will not change the global variable" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HOST outside the function before change = global\n", "HOST inside the function = local\n", "HOST outside the function after change = global\n", "global\n" ] } ], "source": [ "HOST = 'global'\n", "\n", "def change_host():\n", " HOST = 'local'\n", " print(f'HOST inside the function = {HOST}')\n", "def app2():\n", " print(HOST)\n", "print(f'HOST outside the function before change = {HOST}')\n", "change_host()\n", "print(f'HOST outside the function after change = {HOST}')\n", "app2()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "- Pass global variable as argument" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "HOST = 'global'\n", "\n", "def change_host(HOST):\n", " HOST = 'local'\n", " print(f'HOST inside the function = {HOST}')\n", "\n", "print(f'HOST outside the function before change = {HOST}')\n", "change_host(HOST)\n", "print(f'HOST outside the function after change = {HOST}')\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### More on functions: scope - global vs local variables cont.\n", "List as global variables\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "MOVIES = ['Toy story', 'Home alone']\n", "\n", "def change_movie():\n", " MOVIES = ['Fargo', 'The Usual Suspects']\n", " print(f'MOVIES inside the function = {MOVIES}')\n", "\n", "print(f'MOVIES outside the function before change = {MOVIES}')\n", "change_movie()\n", "print(f'MOVIES outside the function after change = {MOVIES}')\n" ] }, 
{ "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Will the global variable never be changed by a function?" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MOVIES outside the function before change = ['Toy story', 'Home alone']\n", "MOVIES inside the function = ['Toy story', 'Home alone', 'Fargo', 'The Usual Suspects']\n", "MOVIES outside the function after change = ['Toy story', 'Home alone', 'Fargo', 'The Usual Suspects']\n" ] } ], "source": [ "MOVIES = ['Toy story', 'Home alone']\n", "\n", "def change_movie():\n", " MOVIES.extend(['Fargo', 'The Usual Suspects'])\n", " print(f'MOVIES inside the function = {MOVIES}')\n", "\n", "print(f'MOVIES outside the function before change = {MOVIES}')\n", "change_movie()\n", "print(f'MOVIES outside the function after change = {MOVIES}')\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Takeaway: be careful when using global variables. Do not use them unless you know what you are doing." ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 1, "slideshow": { "slide_type": "slide" } }, "source": [ "### More on functions: `return` statement\n", "A function that counts the number of occurrences of `'C'` or `'c'` in the argument string." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "lines_to_next_cell": 1 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2 \n", " 0\n" ] } ], "source": [ "def cytosine_count(nucleotides):\n", " count = 0\n", " for x in nucleotides:\n", " if x == 'c' or x == 'C':\n", " count += 1\n", " return count\n", "\n", "count1 = cytosine_count('CATATTAC')\n", "count2 = cytosine_count('tagtag')\n", "print(count1, \"\\n\", count2)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "Functions that `return` are easier to repurpose than those that `print` their result" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "lines_to_next_cell": 1 }, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cytosine_count('catattac') + cytosine_count('tactactac')" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "0\n" ] } ], "source": [ "def print_cytosine_count(nucleotides):\n", " count = 0\n", " for x in nucleotides:\n", " if x == 'c' or x == 'C':\n", " count += 1\n", " print(count)\n", "\n", "print_cytosine_count('CATATTAC')\n", "print_cytosine_count('tagtag')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\n", "3\n" ] }, { "ename": "TypeError", "evalue": "unsupported operand type(s) for +: 'NoneType' and 'NoneType'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[27], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m 
\u001b[43mprint_cytosine_count\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mcatattac\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m+\u001b[39;49m\u001b[43m \u001b[49m\u001b[43mprint_cytosine_count\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mtactactac\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m\n", "\u001b[0;31mTypeError\u001b[0m: unsupported operand type(s) for +: 'NoneType' and 'NoneType'" ] } ], "source": [ "print_cytosine_count('catattac') + print_cytosine_count('tactactac')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "- A function without a `return` statement returns `None`\n" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Return value of foo() = None\n" ] } ], "source": [ "def foo():\n", " do_nothing = 1\n", "\n", "result = foo()\n", "print(f'Return value of foo() = {result}')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Use `return` for all values that you might want to use later in your program" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Small detour: Python's value for missing values: `None`\n", "\n", "- Default value for optional arguments\n", "- Implicit return value of functions without a `return` statement\n", "- `None` is `None`, not anything else" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "None == 0" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, 
"execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "None == False" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "None == ''" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bool(None)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "NoneType" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(None)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Keyword arguments\n" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [], "source": [ "fh = open('../files/fruits.txt', mode='w', encoding='utf-8'); fh.close()" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[100, 6, 5, 4, 1]" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted([1, 4, 100, 5, 6], reverse=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Why do we use keyword arguments?" 
] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['gene_id', 'INSR', '\"insulin receptor\"']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "record = 'gene_id INSR \"insulin receptor\"'\n", "\n", "record.split(' ', 2)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['gene_id', 'INSR', '\"insulin receptor\"']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "record.split(sep=' ', maxsplit=2)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* It increases the clarity and readability" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### The order of keyword arguments does not matter" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "fh = open('../files/fruits.txt', mode='w', encoding='utf-8'); fh.close()" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [], "source": [ "fh = open('../files/fruits.txt', encoding='utf-8', mode='w'); fh.close()" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Can be used in both ways, with or without keyword \n", "- if there is no ambiguity" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "fh = open('../files/fruits.txt', 'w', encoding='utf-8'); fh.close()" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [], "source": [ "fh = open('../files/fruits.txt', mode='w', encoding='utf-8'); fh.close()" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### But there are some exceptions " ] }, 
{ "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "ename": "SyntaxError", "evalue": "positional argument follows keyword argument (986095044.py, line 1)", "output_type": "error", "traceback": [ "\u001b[0;36m Cell \u001b[0;32mIn[42], line 1\u001b[0;36m\u001b[0m\n\u001b[0;31m fh = open('files/recipes.txt', encoding='utf-8', 'w'); fh.close()\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m positional argument follows keyword argument\n" ] } ], "source": [ "fh = open('files/recipes.txt', encoding='utf-8', 'w'); fh.close()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "* Positional arguments must be in front of keyword arguments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Restrictions by purpose" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[100, 6, 5, 4, 1]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted([1, 4, 100, 5, 6], reverse=True)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "sorted expected 1 argument, got 2", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[44], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[38;5;28;43msorted\u001b[39;49m\u001b[43m(\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;241;43m1\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m4\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m100\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m5\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m 
\u001b[49m\u001b[38;5;241;43m6\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mTrue\u001b[39;49;00m\u001b[43m)\u001b[49m\n", "\u001b[0;31mTypeError\u001b[0m: sorted expected 1 argument, got 2" ] } ], "source": [ "sorted([1, 4, 100, 5, 6], True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "sorted(iterable, /, *, key=None, reverse=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- arguments before `/` must be specified with position\n", "- arguments after `*` must be specified with keyword" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 1, "slideshow": { "slide_type": "slide" } }, "source": [ "### How to define functions taking keyword arguments\n", "\n", "- Just define them as usual:" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The lecture is ongoing.\n" ] }, { "ename": "TypeError", "evalue": "format_sentence() got multiple values for argument 'value'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "Cell \u001b[0;32mIn[45], line 6\u001b[0m\n\u001b[1;32m 2\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124mThe \u001b[39m\u001b[38;5;124m'\u001b[39m \u001b[38;5;241m+\u001b[39m subject \u001b[38;5;241m+\u001b[39m \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m is \u001b[39m\u001b[38;5;124m'\u001b[39m \u001b[38;5;241m+\u001b[39m value \u001b[38;5;241m+\u001b[39m end\n\u001b[1;32m 4\u001b[0m \u001b[38;5;28mprint\u001b[39m(format_sentence(\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlecture\u001b[39m\u001b[38;5;124m'\u001b[39m, 
\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mongoing\u001b[39m\u001b[38;5;124m'\u001b[39m, \u001b[38;5;124m'\u001b[39m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m'\u001b[39m))\n\u001b[0;32m----> 6\u001b[0m \u001b[38;5;28mprint\u001b[39m(\u001b[43mformat_sentence\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mlecture\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43m!\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mvalue\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mongoing\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m)\u001b[49m)\n\u001b[1;32m 8\u001b[0m \u001b[38;5;28mprint\u001b[39m(format_sentence(subject\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mlecture\u001b[39m\u001b[38;5;124m'\u001b[39m, value\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mongoing\u001b[39m\u001b[38;5;124m'\u001b[39m, end\u001b[38;5;241m=\u001b[39m\u001b[38;5;124m'\u001b[39m\u001b[38;5;124m...\u001b[39m\u001b[38;5;124m'\u001b[39m))\n", "\u001b[0;31mTypeError\u001b[0m: format_sentence() got multiple values for argument 'value'" ] } ], "source": [ "def format_sentence(subject, value = 13, end = \"....\"):\n", " return 'The ' + subject + ' is ' + value + end\n", "\n", "print(format_sentence('lecture', 'ongoing', '.'))\n", "\n", "print(format_sentence('lecture', '!', value='ongoing'))\n", "\n", "print(format_sentence(subject='lecture', value='ongoing', end='...'))" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 1, "slideshow": { "slide_type": "slide" } }, "source": [ "### Defining functions with default arguments" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The lecture is ongoing...\n" ] } 
], "source": [ "def format_sentence(subject, value, end='.'):\n", "    return 'The ' + subject + ' is ' + value + end\n", "\n", "#print(format_sentence('lecture', 'ongoing'))\n", "\n", "print(format_sentence('lecture', 'ongoing', '...'))" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 1, "slideshow": { "slide_type": "slide" } }, "source": [ "### Defining functions with optional arguments\n", "\n", "- Convention: use the object `None` as the default value of an optional argument" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The lecture is ongoing.\n", "The lecture is ongoing and self-referential!\n" ] } ], "source": [ "def format_sentence(subject, value, end='.', second_value=None):\n", "    if second_value is None:\n", "        return 'The ' + subject + ' is ' + value + end\n", "    else:\n", "        return 'The ' + subject + ' is ' + value + ' and ' + second_value + end\n", "\n", "print(format_sentence('lecture', 'ongoing'))\n", "\n", "print(format_sentence('lecture', 'ongoing', second_value='self-referential', end='!'))" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "#### Comparing `None`\n", "\n", "- To distinguish `None` from other false values such as `0`, `False` and `''`, use `is None`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "counts = {'drama': 2, 'romance': 0}\n", "\n", "counts.get('romance'), counts.get('thriller')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "counts.get('romance') is None" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "counts.get('thriller') is None" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", 
"slideshow": { "slide_type": "skip" } }, "source": [ "- Python and the truth, take two" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "lines_to_next_cell": 2, "scrolled": true, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "values = [None, 1, 0, '', '0', '1', [], [0]]\n", "for x in values:\n", "    if x is None:\n", "        print(repr(x), 'is None')\n", "    if not x:\n", "        print(repr(x), 'is false')\n", "    if x:\n", "        print(repr(x), 'is true')" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Exercise 1\n", "\n", "- Notebook Day_4_Exercise_1 (~30 minutes)\n", "- Go to Canvas, `Modules -> Day 4 -> Exercise 1 - day 4`\n", "\n", "\n", "- Extra reading:\n", "    - https://realpython.com/python-kwargs-and-args/\n", "    - https://able.bio/rhett/python-functions-and-best-practices--78aclaa\n" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### A short note on code structure\n", "\n", "- Functions\n", "    - e.g. `sum()`, `print()`, `open()`\n", "- Modules\n", "    - files containing a collection of functions and methods, e.g. 
string.py \n", "- Documentation\n", " - docstring, comments" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Why functions?\n", "- Cleaner code\n", "- Better defined tasks in code\n", "- Re-usability\n", "- Better structure" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Why modules?\n", "\n", "- Cleaner code\n", "- Better defined tasks in code\n", "- Re-usability\n", "- Better structure" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "fragment" } }, "source": [ "- Collect all related functions in one file\n", "- Import a module to use its functions\n", "- Only need to understand what the functions do, not how" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Example of modules" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'-f'" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sys\n", "\n", "sys.argv[1]" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2024-11-14 14:43:47.521944\n" ] } ], "source": [ "from datetime import datetime\n", "print(datetime.now())" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import os\n", "\n", "os.system(\"ls\")" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### How to find the right module and instructions?\n", "\n", "- Look at the [module index](https://docs.python.org/3/py-modindex.html) for Python standard modules\n", "- Search [PyPI](http://pypi.org)\n", "- Search https://www.w3schools.com/python/\n", "- 
Ask your colleagues\n", "- Search the web\n", "- Use ChatGPT" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- Standard modules: no installation needed\n", "- Other libraries: install with `pip install` or `conda install`" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### How to understand it?\n", "- E.g. I want to know how to split a string by the separator `,`" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [], "source": [ "text = 'Programming,is,cool'" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on built-in function split:\n", "\n", "split(sep=None, maxsplit=-1) method of builtins.str instance\n", " Return a list of the words in the string, using sep as the delimiter string.\n", " \n", " sep\n", " The delimiter according which to split the string.\n", " None (the default value) means split according to any whitespace,\n", " and discard empty strings from the result.\n", " maxsplit\n", " Maximum number of splits to do.\n", " -1 (the default value) means no limit.\n", "\n" ] } ], "source": [ "help(text.split)" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['Programming', 'is', 'cool']" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text.split(sep=',')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "#### For slightly more complicated problems\n", "- e.g. 
how to download the Python logo from the internet with `urllib`, given the URL https://www.python.org/static/img/python-logo@2x.png" ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on package urllib:\n", "\n", "NAME\n", "    urllib\n", "\n", "MODULE REFERENCE\n", "    https://docs.python.org/3.9/library/urllib\n", "    \n", "    The following documentation is automatically generated from the Python\n", "    source files.  It may be incomplete, incorrect or include features that\n", "    are considered implementation detail and may vary between Python\n", "    implementations.  When in doubt, consult the module reference at the\n", "    location listed above.\n", "\n", "PACKAGE CONTENTS\n", "    error\n", "    parse\n", "    request\n", "    response\n", "    robotparser\n", "\n", "FILE\n", "    /Users/kostas/opt/miniconda3/envs/python-workshop-teacher/lib/python3.9/urllib/__init__.py\n", "\n", "\n" ] } ], "source": [ "import urllib\n", "\n", "help(urllib)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "- It is probably easier to find the answer by searching the web or using ChatGPT" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### One minute exercise \n", "- get help from ChatGPT (https://chat.openai.com/)\n", "\n", "Use Python to download the Python logo from the internet with urllib,\n", "given the URL https://www.python.org/static/img/python-logo@2x.png" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "import urllib.request\n", "\n", "url = \"https://www.python.org/static/img/python-logo@2x.png\"\n", "filename = \"python-logo.png\" # The name you want to give to the downloaded file\n", "\n", "urllib.request.urlretrieve(url, filename)\n", "\n", "print(\"Download completed.\")" ] }, { 
"cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import math\n", "\n", "help(math.sqrt)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "math.sqrt(3)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "### Various ways of importing" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import math\n", "\n", "math.sqrt(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import math as m\n", "m.sqrt(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from math import sqrt\n", "sqrt(3)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from pprint import pprint" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Documentation and commenting your code\n" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "def process_file(filename, chrom, pos):\n", "    \"\"\"\n", "    Read a very large vcf file, search for lines matching\n", "    chromosome chrom and position pos.\n", "\n", "    Print the genotypes of the matching lines.\n", "    \"\"\"\n", "    # Read line by line so the whole file never has to fit in memory\n", "    with open(filename) as vcf:\n", "        for line in vcf:\n", "            if not line.startswith('#'):\n", "                col = line.split('\\t')\n", "                if col[0] == chrom and int(col[1]) == pos:\n", "                    print(col[9:])" ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help 
on function process_file in module __main__:\n", "\n", "process_file(filename, chrom, pos)\n", " Read a very large vcf file, search for lines matching\n", " chromosome chrom and position pos.\n", " \n", " Print the genotypes of the matching lines.\n", "\n" ] } ], "source": [ "help(process_file)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "fragment" } }, "source": [ "- This works because somebody has documented the code!" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Your code may have two types of users:\n", "\n", "- library users\n", "- maintainers (maybe yourself!)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "fragment" } }, "source": [ "### Write documentation for both of them!\n", "\n", "- library users (docstrings):\n", " ```python\n", " \"\"\"\n", " What does this function do?\n", " \"\"\"\n", " ```\n", "- maintainers (comments):\n", " ```python\n", " # implementation details\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Places for documentation" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''" }, "source": [ "- At the beginning of the file\n", "\n", " ```python\n", " \"\"\"\n", " This module provides functions for ...\n", " \"\"\"\n", " ```" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''" }, "source": [ "- At every function definition" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import random\n", "def make_list(x):\n", " \"\"\"Returns a random list of length x.\"\"\"\n", " li = list(range(x))\n", " random.shuffle(li)\n", " return li" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### 
Comments" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''" }, "source": [ " - Wherever the code is hard to understand\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_list[5] += other_list[3] # explain why you do this!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "### Demo: write a Python script with documentation and use it" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from files import mywork\n", "mywork.pipeline([\"accctt\", \"gaccct\"])" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Read more:\n", "\n", "https://realpython.com/documenting-python-code/\n", "\n", "https://www.python.org/dev/peps/pep-0008/?#comments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Quiz time\n", "\n", "Go to Canvas, `Modules -> Day 4 -> PyQuiz 4.1`\n", "\n", "~10 min" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lunch" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "title = 'Toy Story'\n", "rating = 10\n", "print('The result is: ' + title + ' with rating: ' + str(rating))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# f-strings (since python 3.6)\n", "print(f'The result is: {title} with rating: {rating}')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# format method\n", "print('The result is: {} with rating: {}'.format(title, rating))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
"slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# the ancient way (python 2)\n", "print('The result is: %s with rating: %s' % (title, rating))" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "Learn more from the Python docs: https://docs.python.org/3.9/library/string.html#format-string-syntax" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "#### Exercise 2 (??) wrong example?\n", "\n", "\n", "```py\n", "pick_movie(year=1996, rating_min=8.5)\n", "The Bandit\n", "pick_movie(rating_max=8.0, genre=\"Mystery\")\n", "Twelve Monkeys\n", "```\n", "\n", "- Notebook Day_4_Exercise_2" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "#### Exercise 2\n", "\n", "\n", "Documentation\n", "\n", "- Notebook Day_4_Exercise_2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Pandas!!!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Pandas\n", "- Library for working with tabular data\n", "- Data analysis: \n", " - filter\n", " - transform\n", " - aggregate\n", " - plot\n", "- Main hero: the `DataFrame` type" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### DataFrame " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "![01_table_dataframe1](img/01_table_dataframe1.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Creating a small DataFrame" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "   age  circumference  height\n", "0    1              2      30\n", "1    2              3      35\n", "2    3              5      40\n", "3    4             10      50" ] }, "execution_count": 60, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "data = {\n", "    'age': [1, 2, 3, 4],\n", "    'circumference': [2, 3, 5, 10],\n", "    'height': [30, 35, 40, 50]\n", "}\n", "df = pd.DataFrame(data)\n", "df" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "# add row index\n", "row_index = [\"tree1\", \"tree2\", \"tree3\", \"tree4\"]\n", "df = df.set_index(pd.Index(row_index))\n", "help(pd.Index)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Pandas can import data from many formats\n", "\n", "- `pd.read_table`: tab separated values `.tsv`\n", "- `pd.read_csv`: comma separated values `.csv`\n", "- `pd.read_excel`: Excel spreadsheets `.xlsx`\n", "\n", "- To write a data frame `df` back to a file: `df.to_csv()` (pass `sep='\\t'` for tab separated output; there is no `df.to_table()`), `df.to_excel()`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![test](img/02_io_readwrite.png)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "#### Orange tree data" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "lines_to_next_cell": 2, "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ "   age  circumference  height\n", "0    1              2      30\n", "1    2              3      35\n", "2    3              5      40\n", "3    4             10      50" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_table('../downloads/Orange_1.tsv')\n", "df" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "fragment" } }, "source": [ "- One implicit index (0, 1, 2, 3)\n", "- Columns: `age`, `circumference`, `height`\n", "- Rows: one per data point, identified by their index" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Read data from Excel file" ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.read_excel('../downloads/Orange_1.xlsx')\n", "df2" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Overview of your data, basic statistics" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "(4, 3)" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>count</th><td>4.000000</td><td>4.000000</td><td>4.000000</td></tr>\n", "<tr><th>mean</th><td>2.500000</td><td>5.000000</td><td>38.750000</td></tr>\n", "<tr><th>std</th><td>1.290994</td><td>3.559026</td><td>8.539126</td></tr>\n", "<tr><th>min</th><td>1.000000</td><td>2.000000</td><td>30.000000</td></tr>\n", "<tr><th>25%</th><td>1.750000</td><td>2.750000</td><td>33.750000</td></tr>\n", "<tr><th>50%</th><td>2.500000</td><td>4.000000</td><td>37.500000</td></tr>\n", "<tr><th>75%</th><td>3.250000</td><td>6.250000</td><td>42.500000</td></tr>\n", "<tr><th>max</th><td>4.000000</td><td>10.000000</td><td>50.000000</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "count 4.000000 4.000000 4.000000\n", "mean 2.500000 5.000000 38.750000\n", "std 1.290994 3.559026 8.539126\n", "min 1.000000 2.000000 30.000000\n", "25% 1.750000 2.750000 33.750000\n", "50% 2.500000 4.000000 37.500000\n", "75% 3.250000 6.250000 42.500000\n", "max 4.000000 10.000000 50.000000" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 4\n", "circumference 10\n", "height 50\n", "dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.max()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting columns from a dataframe\n", "```py\n", "dataframe.columnname\n", "dataframe['columnname']\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![03_subset_columns](img/03_subset_columns.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting one column" ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 4\n", "Name: age, dtype: int64" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_new = df.age\n", "df_new " ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "0 1\n", "1 2\n", "2 3\n", "3 4\n", "Name: age, dtype: int64" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting multiple columns" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age height\n", "0 1 30\n", "1 2 35\n", "2 3 40\n", "3 4 50" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['age', 'height']]" ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>height</th><th>age</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>30</td><td>1</td></tr>\n", "<tr><th>1</th><td>35</td><td>2</td></tr>\n", "<tr><th>2</th><td>40</td><td>3</td></tr>\n", "<tr><th>3</th><td>50</td><td>4</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " height age\n", "0 30 1\n", "1 35 2\n", "2 40 3\n", "3 50 4" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['height', 'age']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting rows from a dataframe" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 1\n", "circumference 2\n", "height 30\n", "Name: 0, dtype: int64" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[0] # select the first row" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[1:3] # select from row 2 to 4" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "1 2 3 35\n", "3 4 10 50\n", "0 1 2 30" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[[1, 3, 0]] # select row 2, 4 and 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting cells from a dataframe " ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th><th>height</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td><td>2</td><td>30</td></tr>\n", "<tr><th>1</th><td>2</td><td>3</td><td>35</td></tr>\n", "<tr><th>2</th><td>3</td><td>5</td><td>40</td></tr>\n", "<tr><th>3</th><td>4</td><td>10</td><td>50</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 77, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th></tr></thead>\n", "<tbody>\n", "<tr><th>0</th><td>1</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age\n", "0 1" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.loc[[0], ['age']]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Run statistics on specific rows, columns, cells" ] }, { "cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead><tr><th></th><th>age</th><th>circumference</th></tr></thead>\n", "<tbody>\n", "<tr><th>count</th><td>4.000000</td><td>4.000000</td></tr>\n", "<tr><th>mean</th><td>2.500000</td><td>5.000000</td></tr>\n", "<tr><th>std</th><td>1.290994</td><td>3.559026</td></tr>\n", "<tr><th>min</th><td>1.000000</td><td>2.000000</td></tr>\n", "<tr><th>25%</th><td>1.750000</td><td>2.750000</td></tr>\n", "<tr><th>50%</th><td>2.500000</td><td>4.000000</td></tr>\n", "<tr><th>75%</th><td>3.250000</td><td>6.250000</td></tr>\n", "<tr><th>max</th><td>4.000000</td><td>10.000000</td></tr>\n", "</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " age circumference\n", "count 4.000000 4.000000\n", "mean 2.500000 5.000000\n", "std 1.290994 3.559026\n", "min 1.000000 2.000000\n", "25% 1.750000 2.750000\n", "50% 2.500000 4.000000\n", "75% 3.250000 6.250000\n", "max 4.000000 10.000000" ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[['age', 'circumference']].describe()" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "1.2909944487358056" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['age'].std()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df.loc[1:10]" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting data from a dataframe by index\n", "```py\n", "dataframe.iloc[index]\n", "dataframe.iloc[start:stop]\n", "```\n", "Further reading from pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html" ] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2" ] }, "execution_count": 83, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df\n", "#df.iloc[:,0] # Show the first column\n", "#df.iloc[1] # Show the second row\n", "df.iloc[1,0] # Show the cell of the second row and the first column (you get number without index)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Creating new column derived from existing column" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![05_newcolumn_1](img/05_newcolumn_1.png)" ] }, { "cell_type": "code", "execution_count": 84, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " age circumference height radius\n", "0 1 2 30 0.318310\n", "1 2 3 35 0.477465\n", "2 3 5 40 0.795775\n", "3 4 10 50 1.591549" ] }, "execution_count": 84, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import math\n", "df['radius'] = df['circumference'] / (2.0 * math.pi)\n", "df" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "![03_subset_rows](img/03_subset_rows.png)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Expand dataframe by concatenating " ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1 = pd.DataFrame({\n", " 'age': [1,2,3,4],\n", " 'circumference': [2,3,5,10],\n", " 'height': [30, 35, 40, 50]\n", "})\n", "\n", "df1" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " name price\n", "0 palm 1423\n", "1 ada 2000\n", "2 ek 102\n", "3 olive 30" ] }, "execution_count": 86, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df2 = pd.DataFrame({\n", " 'name': ['palm', 'ada', 'ek', 'olive'],\n", " 'price': [1423, 2000, 102, 30]\n", "})\n", "\n", "df2" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " name price age circumference height\n", "0 palm 1423.0 NaN NaN NaN\n", "1 ada 2000.0 NaN NaN NaN\n", "2 ek 102.0 NaN NaN NaN\n", "3 olive 30.0 NaN NaN NaN\n", "4 NaN NaN 1.0 2.0 30.0\n", "5 NaN NaN 2.0 3.0 35.0\n", "6 NaN NaN 3.0 5.0 40.0\n", "7 NaN NaN 4.0 10.0 50.0" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.concat([df2, df1], axis=0).reset_index(drop=True)\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting/filtering the dataframe by condition\n", "e.g. \n", "* Only trees with age larger than 100 \n", "* Only tree with circumference shorter than 20" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Slightly bigger data frame of orange trees " ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Tree age circumference\n", "0 1 118 30\n", "1 1 484 58\n", "2 1 664 87" ] }, "execution_count": 90, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_table('../downloads/Orange.tsv')\n", "df.head(3) # can also use .head()" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3])" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Tree.unique()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Selecting with condition" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Tree age circumference\n", "0 1 118 30\n", "1 1 484 58\n", "2 1 664 87\n", "3 1 1004 115\n", "4 1 1231 120\n", "5 1 1372 142\n", "6 1 1582 145" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['Tree'] == 1]" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Tree age circumference\n", "2 1 664 87\n", "3 1 1004 115\n", "4 1 1231 120\n", "5 1 1372 142\n", "6 1 1582 145\n", "9 2 664 111\n", "10 2 1004 156\n", "11 2 1231 172\n", "12 2 1372 203\n", "13 2 1582 203\n", "16 3 664 75\n", "17 3 1004 108\n", "18 3 1231 115\n", "19 3 1372 139\n", "20 3 1582 140" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.age > 500]" ] }, { "cell_type": "code", "execution_count": 94, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Tree age circumference\n", "2 1 664 87\n", "16 3 664 75" ] }, "execution_count": 94, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[(df.age > 500) & (df.circumference < 100) ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "type(pd.DataFrame({\"genre\": ['Thriller', 'Drama'], \"rating\": [10, 9]}).rating.iloc[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "#young = df[df.age < 200]\n", "#young\n", "df[df.age < 1000]" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Small exercise 1\n", "* Find the maximal circumference and then filter the data frame by it" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " Tree age circumference\n", "12 2 1372 203\n", "13 2 1582 203" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df\n", "max_c=df.circumference.max()\n", "max_c\n", "df[df.circumference==max_c]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "max_c = df.circumference.max()\n", "print(max_c)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df[df.circumference == max_c]" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "### Filter with multiple conditions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df[(df.age > 100) & (df.age <= 250)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Small exercise 2\n", "\n", "Here's a dictionary of students and their grades:\n", "```\n", "students = {'student': ['bob', 'sam', 'joe'], 'grade': [1, 3, 4]}\n", "```\n", "Use Pandas to:\n", "- create a dataframe with this information\n", "- get the mean value of the grades" ] }, { "cell_type": "code", "execution_count": 96, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2.6666666666666665" ] }, "execution_count": 96, "metadata": {}, "output_type": "execute_result" } ], "source": [ "students = {'student': ['bob', 'sam', 'joe'], 'grade': [1, 3, 4]}\n", "\n", "ds=pd.DataFrame(students)\n", "\n", "ds.grade.mean()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "students = {'student': ['bob', 'sam', 'joe'], 'grade': [1, 3, 4]}\n", "\n", "df = pd.DataFrame(students)\n", "\n", "df.grade.mean()\n", "# df['grade'].mean()" ] }, { "cell_type": 
"markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Plotting\n", "```py\n", "df.columnname.plot()\n", "```" ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " age circumference height\n", "0 1 2 30\n", "1 2 3 35\n", "2 3 5 40\n", "3 4 10 50" ] }, "execution_count": 97, "metadata": {}, "output_type": "execute_result" } ], "source": [ "small_df = pd.read_table('../downloads/Orange_1.tsv')\n", "small_df" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "slideshow": { "slide_type": "slide" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 98, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "small_df.plot(x='age', y='circumference', kind='line') # plot the relationship of age and height\n", "# try with other types of plots, e.g. scatter" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Tips: what if no plots shows up?" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Plotting - bars" ] }, { "cell_type": "code", "execution_count": 101, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "small_df[['age']].plot(kind='bar')" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "slide" } }, "source": [ "### Plotting multiple columns" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "slideshow": { "slide_type": "-" } }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "small_df[['circumference', 'age']].plot(kind='bar')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df[['circumference', 'age']].plot(kind='bar', figsize=(12, 8), fontsize=16)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Plotting histogram" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "small_df.plot(kind='hist', y = 'age', fontsize=18)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Plotting box" ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 104, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "small_df.plot(kind='box', y = 'age')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "Further reading: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "#### Scatterplot\n", "\n", "```py\n", " df.plot(kind=\"scatter\", x=\"column_name\", y=\"other_column_name\")\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df.plot(kind=\"scatter\", x='age', y='circumference',\n", " figsize=(12, 8), fontsize=14)" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "slideshow": { "slide_type": "skip" } }, "source": [ "#### Line plot\n", "```py\n", "dataframe.plot(kind=\"line\", x=..., y=...)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "tree1 = df[df['Tree'] == 1]\n", "tree1.plot(kind=\"line\", x='age', y='circumference',\n", " fontsize=14, figsize=(12,8))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "#### Multiple graphs - grouping" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df.groupby('Tree')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df.groupby('Tree').plot(kind=\"line\", x='age', y='circumference')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "df.groupby('Tree').groups" ] }, { "cell_type": "markdown", "metadata": { "cell_marker": "'''", "lines_to_next_cell": 2, "slideshow": { 
"slide_type": "slide" } }, "source": [ "### Exercise 2 (~30 minutes)\n", "- Go to Canvas, `Modules -> Day 4 -> Exercise 2 - day 4` \n", "- **Easy**:\n", " - Explore the `Orange_1.tsv`\n", "- **Medium/hard**:\n", " - Use Pandas to read IMDB\n", " - Explore it by making graphs\n", "- **Extra exercises**:\n", " - Read the pandas documentation :)\n", " - Start exploring your own data\n", "- After exercise, do Quiz 4.2 and then take a break\n", "- After break, working on the project" ] } ], "metadata": { "celltoolbar": "Slideshow", "jupytext": { "cell_markers": "'''" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 4 }