{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses). \n", ">\n", ">Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 8: Overview \n", "\n", "In the prior [module 7](https://yint.org/pybasic07) you had an introduction to main Pandas objects: `Series` and `DataFrame`. You were also introduced to dictionaries. In this worksheet, we only see a bit more of dictionaries, and get to apply Pandas to solving practical problems you have seen in prior modules.\n", "\n", " Check out this repo using Git. Use your favourite Git user-interface, or at the command line:\n", ">```\n", ">git clone git@github.com:kgdunn/python-basic-notebooks.git\n", ">\n", "># If you already have the repo cloned:\n", ">git pull\n", ">```\n", "\n", "to update it to the later version.\n", "\n", "\n", "\n", "### Preparing for this module###\n", "\n", "You should have completed [worksheet 7](https://yint.org/pybasic07).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More about ``Dictionary`` objects\n", "\n", "It was [said earlier](https://yint.org/pybasic07) that a dictionary is a Python ***object*** which is a flexible data container for other objects. It contains objects using what are called ***key*** - ***value*** pairs. 
You create a dictionary like this:\n", "\n", "```python\n", "random_objects = {'key1': 45,\n", "                  2: 'Yes, keys can even be integers!',\n", "                  3.0: 'Or floating point objects',\n", "                  (4, 5): 'Or tuples!',\n", "                 }\n", "print(random_objects)\n", "```\n", "\n", "### Iterating over the keys-values of a dictionary\n", "\n", "Once you have a dictionary, it is common to operate on the keys, or values, or both - in an iterative loop:\n", "\n", "```python\n", "for key, value in random_objects.items():\n", "    print('The key is \"{}\" and the value is: {}'.format(key, value))\n", "    random_objects[key] = value * 2\n", "```\n", "\n", "If you need only the values, and not the keys:\n", "```python\n", "for value in random_objects.values():\n", "    # Do something here with\n", "    value\n", "```\n", "\n", "or, if you need only the keys, and not the values:\n", "```python\n", "for key in random_objects.keys():\n", "    # Do something here with\n", "    key\n", "```\n", "\n", "### Setting and getting key-values\n", "\n", "We already saw how to set a new key or overwrite an existing key:\n", "```python\n", "random_objects['key1'] = 'will now be replaced'\n", "random_objects['key2'] = 'is newly added'\n", "```\n", "\n", "You can get a value, from a given key, using the square bracket notation, and then immediately use it for further calculation or processing:\n", "```python\n", "uppercase_value = random_objects['key2'].upper()\n", "\n", "# but this will fail:\n", "random_objects['key3']\n", "```\n", "\n", "with a ``KeyError``, because you are trying to access a non-existent key. 
Here are two ways to handle the case where you are not sure whether the key exists, but you need your code to continue running without failing:\n", "\n", "```python\n", "# Option 1: try-except\n", "try:\n", "    value = random_objects['key3']\n", "except KeyError:\n", "    # Key not present: use a missing value as fallback\n", "    value = float('nan')\n", "\n", "# Now \"value\" is guaranteed to exist after the try-except block.\n", "# Or, option 2, in a single line of code:\n", "value = random_objects.get('key3', float('nan'))\n", "```\n", "You will probably prefer the second option, since it is compact and provides the same functionality as the first.\n", "\n", "### Ordered vs Unordered dictionaries (advanced)\n", "Dictionaries were historically an ***unordered*** container; from Python 3.7 onwards they are guaranteed to keep the order in which you add key-value pairs. \n", "\n", "That means that in older versions of Python the dictionary above is stored in a certain order (not necessarily as shown in the code!), while from Python 3.7 onwards the key-value pairs you add retain the order in which they were added. This means that if you create an empty dictionary, and add pairs ...\n", "\n", "```python\n", "testing_order = {}\n", "testing_order['key1'] = 45\n", "testing_order[2] = 'Yes, keys can even be integers!'\n", "testing_order[3.0] = 'Or floating point objects'\n", "testing_order.keys()\n", "```\n", "\n", "... they will retain the order in which you added them. 
Because this guarantee only arrived with Python 3.7, and people do not quickly upgrade their Python version, you should not count on it if your code may run on an older version.\n", "\n", "If you need to test the Python version in the code, use the ``sys.version_info`` attribute:\n", "```python\n", "import sys\n", "\n", "# Compare as a tuple: testing \"major\" and \"minor\" separately\n", "# would wrongly reject a version such as 4.0.\n", "if sys.version_info >= (3, 7):\n", "    print('I can rely on ordered dictionaries!')\n", "    testing_order = dict()\n", "else:\n", "    print('Use the OrderedDict class from \"import collections\".')\n", "    from collections import OrderedDict\n", "    testing_order = OrderedDict()\n", "\n", "testing_order['key1'] = 45\n", "testing_order[2] = 'Yes, keys can even be integers!'\n", "testing_order[3.0] = 'Or floating point objects'\n", "\n", "# Guaranteed to be in order, no matter which version of Python you use!\n", "testing_order.keys()\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ➜ Challenge yourself: working with dictionaries\n", "\n", "Create a dictionary containing the molar mass of pure species. Let the key be the chemical element (as a string), and the value be a floating point molar mass:\n", "\n", "* ``C``: carbon = 12.0107\n", "* ``O``: oxygen = 15.999\n", "* ``N``: nitrogen = 14.0067\n", "* ``H``: hydrogen = 1.00784\n", "* ``S``: sulfur = 32.065\n", "* ``P``: phosphorus = 30.973762\n", "\n", "Now write a function ``calculate_molar_mass`` which accepts 1 input, a chemical formula as a string, and returns the calculated molar mass.\n", "\n", "Water, $\\text{H}_2\\text{O}$, has 2 hydrogens and 1 oxygen. 
It could be represented as `H2O1`, and therefore has a molar mass of $(2 \\times 1.00784) + (1 \\times 15.999)$ = 18.01468 g/mol.\n", "\n", "Now try it yourself for an amino acid, Methionine, which is $\\text{C}_5\\text{H}_{11}\\text{N}\\text{O}_2 \\text{S}$:\n", "```python\n", "# make life easier: explicitly add the '1' for single atoms\n", "methionine = 'C5H11N1O2S1'\n", "met_mm = calculate_molar_mass(methionine)\n", "```\n", "\n", "The molar mass of Methionine is 149.21 g/mol. Try your function on some other amino acids, such as Lysine, $\\text{C}_6\\text{H}_{14}\\text{N}_2\\text{O}_2$, which has a molar mass of 146.190 g/mol.\n", "\n", "*Suggested solution approach:*\n", "\n", "Work backwards: start with the dictionary written below (`formula = {'C': 5, 'H': 11, 'N': 1, 'O': 2, 'S': 1}`), and implement the last 2 bullet points here. Then write the code to create that dictionary:\n", "\n", "* The input string will always start with an alphabetical letter, not a number. \n", "* Start by iterating over every character in the string, until you encounter a number (use `.isnumeric()` on each character).\n", "* Keep the preceding character(s): in this example, it will be `C`.\n", "* Keep iterating until the numeric value switches back to an alphabetic one (use `.isalpha()` on each character).\n", "* Then you have the value(s). In this example, `5`.\n", "* Store in a dictionary the letter `C` as the ***key***, and the numeric part `5` as the ***value***.\n", "* Keep going, until you have built up a dictionary that should appear as:\n", "```python\n", "formula = {'C': 5, 'H': 11, 'N': 1, 'O': 2, 'S': 1}\n", "```\n", "* Now iterate over the dictionary, looking up the molar mass in a second dictionary, and add up the molecular weight.\n", "\n", "Challenge yourself even more: adjust the code so that it can work with *natural* formulas, where the `'1'` parts are not given. E.g. your function should be able to handle `methionine = 'C5H11NO2S'` instead of `'C5H11N1O2S1'`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ➜ Challenge yourself: reading data from many files\n", "\n", "A common problem in automated data analysis is reading data from many files in a directory, or sub-directories. Try this:\n", "* Create about 4 to 8 Excel files for yourself in the same directory. \n", "* Put different values in the cells, but always use the same cell location in the files. Here's an example:\n", " \n", "* Save each of the files in the directory.\n", "* Create two or three sub-directories, and spread the files into some of those.\n", "* Now read the files, modifying the template code below:\n", "\n", ">```python\n", ">import os\n", ">import fnmatch\n", ">import pandas as pd\n", ">\n", ">pattern = '*.xlsx'\n", ">\n", "># Dataframe for the result:\n", ">result = pd.DataFrame(___)\n", ">\n", "># Walk through the directory and all of its sub-directories\n", ">for root, dirs, files in os.walk(r'C:\\location\\to\\your\\files'):\n", ">    for name in fnmatch.filter(files, pattern):\n", ">        full_filename = os.path.join(root, name)\n", ">\n", ">        # Use Pandas to read the Excel file\n", ">        excel_values = pd.____\n", ">\n", ">        # Add the result as a new row or column\n", ">        # in your Pandas DataFrame, result:\n", ">        result.____\n", ">\n", ">\n", "># Finally, write the dataframe to CSV or Excel\n", ">result.to_excel(\"output.xlsx\", sheet_name='All file results')\n", ">```\n", "\n", "You can also use a dictionary instead of a Pandas DataFrame. The keys of the dictionary could be ``full_filename``, while the values of each key could be a list of the number(s) you extracted from the Excel file." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ➜ Challenge yourself: moving average\n", "\n", "Back in [module 3](https://yint.org/pybasic03#Challenge-2) you had a challenge problem of calculating the moving average from a long vector of data.\n", "\n", "You downloaded and used the ``Ammonia`` series of data: http://openmv.net/info/ammonia and calculated the moving average over $n=5$ values, called a window of 5 values.\n", "* Accumulate the first 5 entries in the window and calculate the average.\n", "* Then throw away the first entry, and add the 6th entry to update your window. \n", "* Calculate the average based on the 2nd to the 6th values. \n", "* Keep going until you run out of values.\n", "\n", "If you look back at your original code, it was probably many lines. Now you can make it much shorter: **reduce it down to 3 lines**!\n", "\n", "```python\n", "import pandas as pd\n", "\n", "# Read the ammonia.csv file as a Pandas data frame:\n", "ammonia = pd.read_csv(___)\n", "\n", "# Calculate the moving average:\n", "ammonia.___\n", "```\n", "\n", "The last line is obviously the key to solving this. Look at the documentation for ``df.rolling``: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html\n", "\n", "Compare the solution in [module 3](https://yint.org/pybasic03#Challenge-2) with the solution from Pandas." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further tips\n", "\n", "1. Read about different user interfaces for writing and editing your Python code: https://www.datacamp.com/community/tutorials/data-science-python-ide\n", " * You have already seen and used Spyder and PyCharm, which are the top two listed. But have you tried Atom, or Jupyter Notebooks?\n", "2. 
Get even more comfortable with Pandas DataFrames: https://www.datacamp.com/courses/manipulating-dataframes-with-pandas\n", " * Follow the first chapter of that online course for free.\n", " * See how to slice, filter and transform your dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">***Feedback and comments about this worksheet?***\n", "> Please provide any anonymous [comments, feedback and tips](https://docs.google.com/forms/d/1Fpo0q7uGLcM6xcLRyp4qw1mZ0_igSUEnJV6ZGbpG4C4/edit)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# IGNORE this. Execute this cell to load the notebook's style sheet.\n", "from IPython.core.display import HTML\n", "css_file = './images/style.css'\n", "HTML(open(css_file, \"r\").read())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "221.984px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ 
"module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }