{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# From nothing to something\n", "\n", "## Topics\n", "\n", " * From nothing to something:\n", " * Pairwise correlation between rows in a pandas dataframe\n", " * Sketch of the process\n", " * In class exercise:\n", " * Write the code!\n", " * Rejoining, sharing ideas, problems, thoughts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Notes left over from last class. Maybe be useful. May not.\n", "\n", "Notes from last class:\n", "\n", "* You can read a csv that is zipped directly into pandas...\n", "\n", "```\n", "import pandas as pd\n", "\n", "pd.read_csv('http://faculty.washington.edu/dacb/HCEPDB_moldata.zip')\n", "```\n", "\n", "* You can also do this in your own Python... Let's see how... The `os` package has tools for checking if a file exists: ``os.path.exists``\n", "```\n", "import os\n", "filename = 'HCEPDB_moldata.zip'\n", "if os.path.exists(filename):\n", " print(\"wahoo!\")\n", "```\n", "* Use the `requests` package to get the file given a url (got this from the requests docs)\n", "```\n", "import requests\n", "url = 'http://faculty.washington.edu/dacb/HCEPDB_moldata.zip'\n", "req = requests.get(url)\n", "assert req.status_code == 200 # if the download failed, this line will generate an error\n", "with open(filename, 'wb') as f:\n", " f.write(req.content)\n", "```\n", "* Use the `zipfile` package to decompress the file while reading it into `pandas`\n", "```\n", "import pandas as pd\n", "import zipfile\n", "csv_filename = 'HCEPDB_moldata.csv'\n", "zf = zipfile.ZipFile(filename)\n", "data = pd.read_csv(zf.open(csv_filename))\n", "```\n", "\n", "Here was my solution\n", "```\n", "import os\n", "import requests\n", "import pandas as pd\n", "import zipfile\n", "\n", "filename = 'HCEPDB_moldata.zip'\n", "url = 'http://faculty.washington.edu/dacb/HCEPDB_moldata.zip'\n", "csv_filename = 'HCEPDB_moldata.csv'\n", "\n", "if os.path.exists(filename):\n", " pass\n", "else:\n", " req = requests.get(url)\n", " assert req.status_code == 200 # if the download failed, this line will generate an error\n", " with open(filename, 'wb') as f:\n", " f.write(req.content)\n", "\n", "zf = zipfile.ZipFile(filename)\n", "data = pd.read_csv(zf.open(csv_filename))\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def download_if_not_exists(filename):\n", " if os.path.exists(filename):\n", " pass\n", " else:\n", " req = requests.get(url)\n", " assert req.status_code == 200 # if the download failed, this line will generate an error\n", " with open(filename, 'wb') as f:\n", " f.write(req.content)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### global and local variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Parameters (or arguments) in Python are all passed by reference. This means that if you modify the parameters in the function, they are modified outside of the function.\n", "\n", "See the following example:\n", "\n", "```\n", "def change_list(my_list):\n", " \"\"\"This changes a passed list into this function\"\"\"\n", " my_list.append('four');\n", " print('list inside the function: ', my_list)\n", " return\n", "\n", "my_list = [1, 2, 3];\n", "print('list before the function: ', my_list)\n", "change_list(my_list);\n", "print('list after the function: ', my_list)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def change_list(my_list):\n", " \"\"\"This changes a passed list into this function\"\"\"\n", " my_list.append('four');\n", " print('list inside the function: ', my_list)\n", " return\n", "\n", "my_list = [1, 2, 3];\n", "print('list before the function: ', my_list)\n", "change_list(my_list);\n", "print('list after the function: ', my_list)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Variables have scope: `global` and `local`\n", "\n", "In a function, new variables that you create are not saved when the function returns - these are `local` variables. Variables defined outside of the function can be accessed but not changed - these are `global` variables, _Note_ there is a way to do this with the `global` keyword. Generally, the use of `global` variables is not encouraged, instead use parameters.\n", "\n", "```\n", "my_global_1 = 'bad idea'\n", "my_global_2 = 'another bad one'\n", "my_global_3 = 'better idea'\n", "\n", "def my_function():\n", " print(my_global)\n", " my_global_2 = 'broke your global, man!'\n", " global my_global_3\n", " my_global_3 = 'still a better idea'\n", " return\n", " \n", "my_function()\n", "print(my_global_2)\n", "print(my_global_3)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "my_global_1 = 'bad idea'\n", "my_global_2 = 'another bad one'\n", "my_global_3 = 'better idea'\n", "\n", "def my_function():\n", " print(my_global)\n", " my_global_2 = 'broke your global, man!'\n", " global my_global_3\n", " my_global_3 = 'still a better idea'\n", " return\n", " \n", "my_function()\n", "print(my_global_2)\n", "print(my_global_3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Review the stack trace!\n", "\n", "```\n", "def hello(who_string):\n", " print(\"Hello \" + who_str + \"!\")\n", "\n", "hello(\"World\")\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def hello(who_string):\n", " print(\"Hello \" + who_str + \"!\")\n", "\n", "hello(\"World\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How to read this stack trace? First, let's identify the Exception and read the error text. Next start at the bottom and work your way up the trace to the top. Let's see the example from above, but annotated.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In cyan, we see the exception type. An exception is an error that has occurred. We often talk about **raising** exceptions. Thus, we might talk about _\"raising an exception when an error ocurred.\"_\n", "\n", "In magenta is the exception text. This is a message written by the programmer (or Python developer) and it is intended to provide human readable guidance as to the nature of the failure.\n", "\n", "In purple, we see the **stack** of function calls that led us to the error. In a notebook, the top level of the **stack** is the notebook cell. The lowest item in the **stack** is the function where the error occurred. We read this stack trace from the bottom to the top.\n", "\n", "In complicated stack traces, the traceback can be 100s of lines long and will show the full stack trace. I.e. function `A` called function `B` which called function `C` which called function `D` which called me and asked me to stop writing this call stack in the notebok because you get the picture." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Functions can return multiple values\n", "\n", "\n", "In general, you want to use parameters to provide data to a function and return a result with the `return`. E.g.\n", "\n", "```\n", "def sum(x, y):\n", " my_sum = x + y\n", " return my_sum\n", "```\n", "\n", "If you are going to return multiple objects, what data structure that we talked about can be used? Give and example below by modifying the above function to return `x`, `y`, **and** `my_sum`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "## Importing code from .py files\n", " \n", "When you put python code in a .py file in the same directory as your notebook, you can import that code into your notebook and use it.\n", " \n", "Example..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Steps to putting your own code in a `.py` file:\n", "\n", "1. Open your editor. \n", "2. Write (or paste in) your code. \n", "3. Save the code with a filename like `lowercase_underscores_separating_words.py`.\n", "4. Return to your notebook.\n", "5. In a cell, do: `import lowercase_underscores_separating_words`. Notice you don't need the `.py` and if your filename has other characters than letters, numbers and `_` the `import` will fail.\n", "\n", "Let's do that now!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", "## From nothing to something\n", "\n", "### Task: Compute the pairwise Pearson correlation between rows in a dataframe.\n", "\n", "Let's say we have three molecules ($X$, $Y$, $Z$) with three measurements each, ine the case of $X$: $x_1$, $x_2$, $x_3$. Often, in machine learning settings, these \"measurements\" are often called **features**. So for each molecule we have a vector of measurements or **features**:\n", "\n", "$$X=\\begin{bmatrix}\n", " x_1 \\\\\n", " x_2 \\\\\n", " x_3 \\\\\n", " \\end{bmatrix} $$\n", " \n", "Where X is a molecule and the components are the values for each of the measurements. These make up the **rows** in our matrix.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "----" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, if we had measurements from three molecules, arranged into a matrix, ready for use in a Pandas `data frame`: \n", "\n", "$$df=\\begin{bmatrix}\n", " x_1 \\; x_2 \\; x_3 \\\\\n", " y_1 \\; y_2 \\; y_3 \\\\\n", " z_1 \\; z_2 \\; z_3\n", " \\end{bmatrix} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What would this look like in `.csv` file?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Why are the molecules listed on the **rows** vs. the **columns**?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What data have we looked at already in class that follows this form, but with more than three features or measurements?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Often, we want to compare molecules to determine how similar or different they are. One measure is the Pearson correlation.\n", "\n", "Pearson correlation: \n", "\n", "Expressed graphically, when you plot the paired measurements for two samples (in this case molecules) against each other you can see positively correlated, no correlation, and negatively correlated. Eg.\n", "\n", "\n", "To understand the quantity on a number line, we can use the following guide:\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Simple input dataframe (_note_ when you are writing code it is always a good idea to have a simple test case where you can readily compute by hand or know the output):\n", "\n", "| index | v1 | v2 | v3 |\n", "|-------|----|----|----|\n", "| A | -1 | 0 | 1 |\n", "| B | 1 | 0 | -1 |\n", "| C | .5 | 0 | .5 |\n", "\n", "### Take 3 minutes and answer the following questions in the notebook." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* If the above is a dataframe that will be used as input what shape and size is the output?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Whare are some unique features of the output?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For our test case, what will the output be?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Let's sketch the idea in pseudocode.\n", "\n", "**pseudocode** is a plain language description of the steps in an algorithm or another system. There is no \"formal\" definition of **pseudocode** - or else it would be a programming language. It is you writing out the steps in a mix of English and code (e.g. Python). Writing a **pseudocode** version of an algorithm will help you identify all the components ahead of time." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## In class exercise\n", "### 20-30 minutes\n", "#### Objectives: \n", "1. Write code using functions to compute the pairwise Pearson correlation between rows in a pandas dataframe. You will have to use ``for`` and possibly ``if``.\n", "2. Use a cell to test each function with an input that yields an expected output. Think about the shape and values of the outputs.\n", "3. Put the code in a ``.py`` file in the directory with the Jupyter notebook, import and run!\n", "\n", "\n", "#### To help you get started...\n", "To create the sample dataframe:\n", "```\n", "df = pd.DataFrame([[-1, 0, 1], [1, 0, -1], [.5, 0, .5]])\n", "```\n", "\n", "To loop over rows in a dataframe, check out (Google is your friend):\n", "```\n", "DataFrame.iterrows\n", "```\n", "\n", "For a row, to compute correlation to another list, series, vector, use:\n", "```\n", "my_row.corr(other_row)\n", "```\n", "\n", "You may want to use a ``numpy`` matrix, e.g.\n", "```\n", "import numpy as np\n", "pair_corr_mat = np.zeros((2,2))\n", "pair_corr_mat[1,1] = 42\n", "pair_corr_mat\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "\n", "## How do we know it is working?\n", "\n", "\n", "#### Use the test case!\n", "Our three row example is a useful tool for checking that our code is working. We can write some tests that compare the output of our functions to our expectations.\n", "\n", "E.g. The diagonals should be 1, and corr(A, B) = -1, ...\n", "\n", "#### But first, let's talk ``assert`` and ``raise``\n", "\n", "We've already briefly been exposed to assert in this code:\n", "```\n", "if os.path.exists(filename):\n", " pass\n", "else:\n", " req = requests.get(url)\n", " # if the download failed, next line will raise an error\n", " assert req.status_code == 200\n", " with open(filename, 'wb') as f:\n", " f.write(req.content)\n", "```\n", "\n", "What is the assert doing there?\n", "\n", "Let's play with ``assert``. What should the following asserts do?\n", "```\n", "assert True == False, \"You assert wrongly, sir!\"\n", "assert 'Dave' in instructors\n", "assert function_that_returns_True_or_False(parameters)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So when an assert statement is true, the code keeps executing and when it is false, it ``raises`` an exception (also known as an error).\n", "\n", "We've all probably seen lots of exception. E.g.\n", "\n", "```\n", "def some_function(parameter):\n", " return\n", "\n", "some_function()\n", "```\n", "\n", "```\n", "some_dict = { }\n", "print(some_dict['invalid key'])\n", "```\n", "\n", "```\n", "'fourty' + 2\n", "```\n", "\n", "Like C++ and other languages, Python let's you ``raise`` your own exception. You can do it with ``raise`` (surprise!). Exceptions are special objects and you can create your own type of exceptions. For now, we are going to look at the simplest ``Exception``.\n", "\n", "We create an ``Exception`` object by calling the generator:\n", "```\n", "Exception()\n", "```\n", "\n", "This isn't very helpful. We really want to supply a description. The Exception object takes any number of strings. One good form if you are using the generic exception object is:\n", "```\n", "Exception('Short description', 'Long description')\n", "```\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Creating an exception object isn't useful alone, however. We need to send it down the software stack to the Python interpreter so that it can handle the exception condition. We do this with ``raise``.\n", "\n", "```\n", "raise Exception(\"An error has occurred.\")\n", "```\n", "\n", "Now you can create your own error messages like a pro!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### DETOUR!\n", "\n", "There are lots of types of exceptions beyond the generic class ``Exception``. You can use them in your own code if they make sense. E.g. \n", "```\n", "import math\n", "my_variable = math.inf\n", "if my_variable == math.inf:\n", " raise ValueError('my_variable cannot be infinity')\n", "```\n", "\n", "

List of Standard Exceptions −

\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
EXCEPTION NAMEDESCRIPTION
ExceptionBase class for all exceptions
StopIterationRaised when the next() method of an iterator does not point to any object.
SystemExitRaised by the sys.exit() function.
StandardErrorBase class for all built-in exceptions except StopIteration and SystemExit.
ArithmeticErrorBase class for all errors that occur for numeric calculation.
OverflowErrorRaised when a calculation exceeds maximum limit for a numeric type.
FloatingPointErrorRaised when a floating point calculation fails.
ZeroDivisonErrorRaised when division or modulo by zero takes place for all numeric types.
AssertionErrorRaised in case of failure of the Assert statement.
AttributeErrorRaised in case of failure of attribute reference or assignment.
EOFErrorRaised when there is no input from either the raw_input() or input() function and the end of file is reached.
ImportErrorRaised when an import statement fails.
KeyboardInterruptRaised when the user interrupts program execution, usually by pressing Ctrl+c.
LookupErrorBase class for all lookup errors.

IndexError

KeyError

Raised when an index is not found in a sequence.

Raised when the specified key is not found in the dictionary.

NameErrorRaised when an identifier is not found in the local or global namespace.

UnboundLocalError

EnvironmentError

Raised when trying to access a local variable in a function or method but no value has been assigned to it.

Base class for all exceptions that occur outside the Python environment.

IOError

IOError

Raised when an input/ output operation fails, such as the print statement or the open() function when trying to open a file that does not exist.

Raised for operating system-related errors.

SyntaxError

IndentationError

Raised when there is an error in Python syntax.

Raised when indentation is not specified properly.

SystemErrorRaised when the interpreter finds an internal problem, but when this error is encountered the Python interpreter does not exit.
SystemExitRaised when Python interpreter is quit by using the sys.exit() function. If not handled in the code, causes the interpreter to exit.
Raised when Python interpreter is quit by using the sys.exit() function. If not handled in the code, causes the interpreter to exit.Raised when an operation or function is attempted that is invalid for the specified data type.
ValueErrorRaised when the built-in function for a data type has the valid type of arguments, but the arguments have invalid values specified.
RuntimeErrorRaised when a generated error does not fall into any category.
NotImplementedErrorRaised when an abstract method that needs to be implemented in an inherited class is not actually implemented.
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Put it all together... ``assert`` and ``raise``\n", "\n", "Breaking assert down, it is really just an if test followed by a raise. So the code below:\n", "```\n", "assert , \n", "```\n", "is equivalent to a short hand for:\n", "```\n", "if not :\n", " raise AssertionError() \n", "```\n", "\n", "Prove it? OK.\n", "\n", "```\n", "instructors = ['Meep the Clown', 'Guido van Rossum']\n", "assert 'Dave' in instructors, \"Dave isn't in the instructor list!\"\n", "```\n", "\n", "```\n", "instructors = ['Meep the Clown', 'Guido van Rossum']\n", "assert 'Dave' in instructors, \"Dave isn't in the instructor list!\"\n", "if not 'Dave' in instructors:\n", " raise AssertionError(\"Dave isn't in the instructor list!\")\n", "```\n", "\n", "#### Questions?\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### All of this was in preparation for some testing...\n", "\n", "Can we write some quick tests that make sure our code is doing what we think it is? Something of the form:\n", "\n", "```\n", "corr_matrix = pairwise_row_correlations(my_sample_dataframe)\n", "assert corr_matrix looks like what we expect, \"The function is broken!\"\n", "```\n", "\n", "What are the smallest units of code that we can test?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What asserts can we make for these pieces of code?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Remember, in computers, 1.0 does not necessarily = 1\n", "\n", "Put the following in an empty cell:\n", "```\n", ".99999999999999999999\n", "```\n", "\n", "How can we test for two floating point numbers being (almost) equal? Pro tip: [Google!](http://lmgtfy.com/?q=python+assert+almost+equal)\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## From nothing to something wrap up\n", "\n", "Here we created some functions from just a short description of our needs. \n", "* Before we wrote any code, we walked through the flow control and decided on the parts that were necessary.\n", "* Before we wrote any code, we created a simple test example with simple predictable output.\n", "* We wrote some code according to our specifications.\n", "* We wrote tests using ``assert`` to verify our code against the simple test example.\n", "\n", "\n", "### QUESTIONS?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" } }, "nbformat": 4, "nbformat_minor": 1 }