{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses). \n", ">\n", ">Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Module 7: Overview \n", "\n", "In [module 5](https://yint.org/pybasic05) and [module 6](https://yint.org/pybasic06) you used NumPy to create arrays, and perform mathematical calculations on them. Even though module 6 was about Python functions in general, the applications were all with NumPy.\n", "\n", "Now we take a look at Pandas. This is currently the best library for data manipulation. Along the way we will also learn about Jupyter notebooks, and Python's other important built-in data type, called ***dictionaries***.\n", "\n", "

Once again, don't forget to use your version control system. Commit your work regularly, wherever you see this icon, and in fact even more frequently. We will show this icon less often, as making regular commits (***several per hour!***) should now be almost automatic.\n", "\n", "
\n", "\n", "### Preparing for this module###\n", "\n", "You should have \n", "1. completed [worksheet 6](https://yint.org/pybasic06)\n", "2. finished the short [project on DataCamp](https://projects.datacamp.com/projects/33) about Jupyter notebooks.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ``Dictionary`` data type\n", "\n", "A dictionary is a Python ***object*** that is a flexible data container for other objects. It contains these objects using what are called ***key*** - ***value*** pairs. You create a dictionary like this:\n", "\n", "```python\n", "random_objects = {'an int': 45,\n", " 'a float': 12.34,\n", " 'short_list': [1, 4, 7],\n", " 'longer list': [2, 4, 6, 9, 12, 16, 20, 25, 30, 36, 42],\n", " 'website': \"https://learnche.org\",\n", " 'nested_tuple': (1, 2.0, [3, 4], '5', (6, 7.0))\n", " }\n", "print(random_objects)\n", "```\n", "\n", "Notice that it is unlikely when you print the ``random_objects`` variable, that you see the dictionary written back in the order shown above. Dictionaries are an ***unordered*** container. \n", "\n", "The ***keys*** and ***values*** of this dictionary are:\n", "```python\n", "\n", "# These both return a list:\n", "random_objects.keys()\n", "random_objects.values()\n", "\n", "# What is the \"type\" of this dictionary?\n", "print('The object is of: {}'.format(type(random_objects)))\n", "\n", "# You can access individual elements:\n", "random_objects['short_list']\n", "```\n", "\n", "In the above example, the keys were all ***string*** objects. But that is not required. You can use integers, floating point values, strings, tuples, or a mixture of them. There are other options too, but these are comprehensive enough.\n", "\n", "Dictionary values may be any ***objects***, even other dictionaries. Yes, so a dictionary within a dictionary is possible. We will use this below. That is why we partially why we waited to introduce dictionaries until now.\n", "\n", "Dictionary objects are excellent ***containers***. If you need to return several objects from a function, collect them in a dictionary, and return them in that single object. It is not required, but it can make your code neater, and more logical.\n", "\n", "### Try it\n", "\n", "Create a dictionary for yourself with 5 `key`-`value` pairs, which summarizes a regression model. The `key` is the first item below, followed by a description of what you should create as the `value`:\n", "1. `intercept`: containing a floating-point value which is the intercept of your linear model\n", "2. `slope`: a floating-point slope value\n", "3. `R2`: the $R^2$ value of the regression model\n", "4. `standard_error`: a value which is the model's standard error\n", "5. `residuals`: a NumPy vector of residuals\n", "\n", "You can create the above dictionary in a single line of code. But what if you want to add something new to an existing dictionary later?\n", "\n", "```python\n", "d = { ... } # create your dictionary\n", "d['new key'] = 'additional value'\n", "```\n", "\n", "And you can overwrite an existing key-value pair in the same way:\n", "```python\n", "random_objects['an int'] = 'replaced with a string!'\n", "```\n", "This implies you can never have 2 keys which are the same. 
Dictionary objects are excellent ***containers***. If you need to return several objects from a function, collect them in a dictionary and return that single object. It is not required, but it can make your code neater and more logical.\n", "\n", "### Try it\n", "\n", "Create a dictionary for yourself with 5 `key`-`value` pairs, which summarizes a regression model. The `key` is the first item below, followed by a description of what you should create as the `value`:\n", "1. `intercept`: a floating-point value which is the intercept of your linear model\n", "2. `slope`: a floating-point slope value\n", "3. `R2`: the $R^2$ value of the regression model\n", "4. `standard_error`: a value which is the model's standard error\n", "5. `residuals`: a NumPy vector of residuals\n", "\n", "You can create the above dictionary in a single line of code. But what if you want to add something new to an existing dictionary later?\n", "\n", "```python\n", "d = { ... } # create your dictionary\n", "d['new key'] = 'additional value'\n", "```\n", "\n", "And you can overwrite an existing key-value pair in the same way:\n", "```python\n", "random_objects['an int'] = 'replaced with a string!'\n", "```\n", "This implies you can never have 2 keys which are the same. If you try to create a second key which already exists, it will overwrite the value associated with the existing key.\n", "\n", "### Extend yourself (later)\n", "\n", "At a later time, check out your code from the [linear regression function](https://yint.org/pybasic06#%E2%9E%9C-Challenge-yourself-3:-linear-regression) you wrote earlier. Modify the code to return a single dictionary, instead of 4 items in a tuple." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using the Pandas library\n", "\n", "Why use ``pandas`` if you can already use NumPy?\n", "\n", "* In NumPy you have arrays of data. Pandas adds column headings and row labels (indexes) and calls the result a ``DataFrame``. Think of a spreadsheet.\n", "* But much better than a spreadsheet, Pandas can merge two tables together, to align data from different sources.\n", "* If the index is time-based, Pandas can take advantage of that: e.g. you can average over a week, or a month. In other languages you have to program that averaging manually, including taking into account that months have 28, 29, 30 or 31 days.\n", "* Data which are not time-based are handled equally well.\n", "* If you do something to a data frame, such as calculating an average over all rows, then the result keeps its labels (the column headings, in this case).\n", "* Pandas takes care of handling missing data.\n", "* It encourages database-style thinking, so in later modules, when we handle databases, that will not be an unfamiliar topic." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can load the Pandas library, similar to how you load the NumPy library, with this command:\n", "\n", "```python\n", "import pandas as pd\n", "pd.__version__ # ensure you have a version >= 0.20\n", "```\n", "\n", "Before we start with DataFrames, there is a simpler object in Pandas, called a ``Series``; roughly the equivalent of a vector in NumPy.\n", "\n", "Let's see some characteristics of a ``Series``:\n", "```python\n", "# Create a Series from a list. Put your own numbers here:\n", "s = pd.Series([ ... ]) \n", "print(s)\n", "```\n", "Notice the index (the column to the left of your numbers)? Let's look at another example:\n", "```python\n", ">>> s = pd.Series([ 5, 9, 1, -4, float('nan'), 5 ])\n", ">>> print(s) \n", "0    5.0\n", "1    9.0\n", "2    1.0\n", "3   -4.0\n", "4    NaN\n", "5    5.0\n", "dtype: float64\n", "```\n", "If you do not provide any labels for the rows, these will be generated automatically for you, starting from 0.\n", "\n", "What if you have your own labels already?\n", "```python\n", "# You call the function with two inputs. One input is \n", "# mandatory (the first one), the other is optional.\n", "s = pd.Series(data = [5, 9, 1, -4, float('nan'), 5 ], \n", "              index = ['a', 'b', 'c', 'd', 'e', 'f'])\n", "s.values\n", "type(s.values)\n", "```\n", "Aha! See what you get there in the output from ``s.values``? Pandas is built on top of the NumPy library. The underlying data are still stored as NumPy arrays, and you can access them with the `.values` attribute. This is partly why understanding NumPy first is helpful before using Pandas.\n", "\n", "
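As a quick check, here is a minimal sketch (re-using the series ``s`` from above) of the round trip between Pandas and NumPy:\n", "\n", "```python\n", "import numpy as np\n", "\n", "arr = s.values          # a plain NumPy array\n", "print(type(arr))        # <class 'numpy.ndarray'>\n", "print(np.nanmean(arr))  # NumPy functions work on it directly: 3.2\n", "```\n", "\n", "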
Lastly, give your series a nice name:\n", "```python\n", "s.name = 'Random values'\n", "print(s)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using a Pandas ``Series``\n", "\n", "### Mathematical calculations\n", "\n", "The series you created above can be used in calculations. Notice how missing data are handled seamlessly.\n", "\n", "```python\n", "import pandas as pd\n", "s = pd.Series(data = [5, 9, 1, -4, float('nan'), 5 ], \n", "              index = ['a', 'b', 'c', 'd', 'e', 'f'],\n", "              name = 'Calculations')\n", "s * 5 + 2\n", "\n", "import numpy as np\n", "np.sqrt(s)\n", "```\n", "The last line shows that Pandas and NumPy are compatible with each other. You can call NumPy operations on a Pandas object, and the result is returned to you as a Pandas object, with the row labels (indexes) intact.\n", "\n", "Also notice that taking the square root of a negative number is not defined for real values, so the square root of $-4$ in row `d` returns a `NaN`.\n", "\n", "Logical operations are possible too:\n", "```python\n", "s > 4\n", "s.isna()\n", "np.sqrt(s).isna()\n", "s.notna()\n", "```\n", "\n", "### Accessing entries\n", "\n", "Like in NumPy, you can access the data entries using the square bracket notation. In Pandas:\n", "```python\n", "s[2]\n", "s['e']\n", "```\n", "\n", "Selected subsets from the series can be accessed too, again using square brackets:\n", "```python\n", "s[[2, 4, 0]]\n", "s[['f', 'd', 'b']]\n", "\n", "# Selection based on logic: I want only values greater than 4\n", "s[s > 4]\n", "```\n", "\n", "You can also access a ``range`` of entries:\n", "```python\n", "s[0:2]\n", "s['a':'c']\n", "```\n", "Take a careful look at that output. You might have expected them to be the same length, but they are not! When accessing with the index **names**, you get the range inclusive of the last entry. When accessing by index **number**, the behaviour is consistent with Python and NumPy: the end point is excluded.\n", "\n", "That makes sense. The names of the rows (the index) do not necessarily have to be sequential, like ``['a', 'b', ... 'f']`` in this example. Often the index is unordered. E.g. if you have a series related to different Canadian cities:\n", "\n", "`['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax']`\n", "\n", "then with `s['Vancouver':'Montréal']` you expect to see the middle 3 entries, inclusive of `Montréal`.\n", "\n", "
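If you want to be explicit about which kind of access you mean, Pandas also provides the ``.loc`` (label-based) and ``.iloc`` (position-based) accessors. A short sketch, using the same series:\n", "\n", "```python\n", "s.iloc[0:2]     # positions 0 and 1: the end point is excluded, like NumPy\n", "s.loc['a':'c']  # labels 'a' through 'c': the end point is included\n", "```\n", "\n", "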
### Creating a Series from a dictionary\n", "\n", "Now we can combine two new concepts you have just learned: dictionaries and Pandas.\n", "\n", "```python\n", "raw_data = {'Germany': 27, 'Belgium': 13, 'Netherlands': 52, 'Sweden': 54, 'Ireland': 5}\n", "tons_herring_eaten = pd.Series(raw_data)\n", "print(tons_herring_eaten)\n", "```\n", "\n", "The row names (index) are taken from the dictionary keys, each associated with its value. In modern versions of Python and Pandas the rows appear in the order written above; in older versions, where dictionaries were unordered, the rows could appear in a different order.\n", "\n", "1. Write the Pandas command to determine which country eats the most herring. It is **not** with the ``tons_herring_eaten.max()`` command!\n", "2. And the least herring?\n", "3. What does this do? ``tons_herring_eaten.sort_values()``. Print the variable afterwards. \n", "    * If this command fails, you might have an older version of Pandas. Try ``tons_herring_eaten.sort()`` instead.\n", "4. And what does this do then? ``tons_herring_eaten.sort_index()``" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas ``DataFrame``\n", "\n", "Back in [module 5](https://yint.org/pybasic05) you created a NumPy array of 5 daily temperatures (rows) for 4 cities (columns).\n", ">```python\n", ">import numpy as np\n", ">temp_np = np.array([[7, 9, 12, 10], \n", ">                    [1, 4, 5, 2], \n", ">                    [-3, 1, -2, -3], \n", ">                    [-2, -1, -2, -2], \n", ">                    [-3, -1, -2, -4]])\n", ">print(('The temperatures are given one column per '\n", ">       'city, each row is a daily average '\n", ">       'temperature:\\n{}').format(temp_np))\n", ">\n", ">max_value_0 = np.amax(temp_np, axis=0)\n", ">```\n", "\n", "Now let's try the same in Pandas, creating a ``DataFrame`` from a list-of-lists:\n", "```python\n", "import pandas as pd\n", "rawdata = [[7, 9, 12, 10], \n", "           [1, 4, 5, 2], \n", "           [-3, 1, -2, -3], \n", "           [-2, -1, -2, -2], \n", "           [-3, -1, -2, -4]]\n", "temp_df = pd.DataFrame(data=rawdata, columns = ['Toronto', 'Vancouver', 'Ottawa', 'Montreal'])\n", "```\n", "\n", "We saw then that you could calculate things on each ***axis*** of the NumPy array. You can also do this in Pandas:\n", "\n", "```python\n", "# NumPy and Pandas do different things here!\n", "temp_np.max() \n", "temp_df.max() \n", "\n", "# Here they are consistent\n", "temp_np.max(axis=0) \n", "temp_df.max(axis=0) \n", "\n", "# Calculate across the rows, over all cities\n", "temp_np.max(axis=1) \n", "temp_df.max(axis=1) \n", "```\n", "\n", "In general, the same things you can calculate in NumPy, you can repeat in Pandas:\n", "\n", "
| NumPy | Pandas | Description |\n", "|-------|--------|-------------|\n", "| `np.sum` | `df.sum` | Sum of entries |\n", "| `np.prod` | `df.prod` | Product (multiplication) of all entries |\n", "| `np.mean` | `df.mean` | Arithmetic average |\n", "| `np.median` | `df.median` | Median value |\n", "| `np.std` | `df.std` | Standard deviation |\n", "| `np.var` | `df.var` | Variance |\n", "| `np.min` | `df.min` | Minimum value |\n", "| `np.max` | `df.max` | Maximum value |\n", "| `np.argmin` | `df.idxmin` | Index of the minimum |\n", "| `np.argmax` | `df.idxmax` | Index of the maximum |\n", "| `np.diff` | `df.diff` | Difference between entries |\n", "\n", "
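For example, a small sketch using the ``temp_df`` frame from above:\n", "\n", "```python\n", "temp_df.mean()    # the average temperature per city, labelled by city name\n", "temp_df.idxmin()  # the row index of each city's minimum temperature\n", "```\n", "\n", "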
\n", "\n", "### Try it:\n", "\n", "* Calculations on certain columns. The beauty of Pandas is how easy it is to write equations, based on the columns:\n", "```python\n", "temp_df['Toronto'] * 4 - temp_df['Montreal']\n", "```\n", "The above does exactly what you think it should.\n", "\n", "* What does this do? \n", "\n", "```python\n", ">>> temp_df.diff().abs().max()\n", "\n", "# and this? \n", ">>> temp_df.diff().abs().max().argmax()\n", "```\n", "\n", "* What is the interpretation of that long command?\n", "\n", "You can stack up your sequential operations quite compactly in Pandas. It works because the output from one function is the input for the next one to the right. Refer back to [the section on functions](https://yint.org/pybasic06), if necessary." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Create a DataFrame from a dictionary\n", "\n", "You used a list-of-lists approach above to create your DataFrame. You can also use dictionaries. Each key in the dictionary can contain a list of equal length:\n", "\n", "```python\n", "data = {'Herring': [27, 13, 52, 54, 5, 19], \n", " 'Coffee': [90, 94, 96, 97, 30, 73],\n", " 'Tea': [88, 48, 98, 93, 99, 88]}\n", "countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']\n", "food_consumed = pd.DataFrame(data, index=countries)\n", "```\n", "\n", "Try to interpret what the following lines return, based on the values you see when you run each command:\n", "```python\n", "food_consumed.T\n", "food_consumed.values\n", "food_consumed.columns\n", "food_consumed.index\n", "food_consumed.head()\n", "food_consumed.tail()\n", "food_consumed.describe()\n", "```\n", "\n", "The first one is particularly helpful, if you need to switch rows and columns around.\n", "\n", "***Hint***: whenever you create a data frame (by hand, or by loading a file, which we will see next), **always** use ``.head()`` and ``.describe()`` to check you have the data correctly loaded. It will save you a lot of time from making errors, only to discover them later." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### DataFrame operations\n", "\n", "We will show code for these commonly-used Pandas operations: shape of an array, unique entries, adding and merging columns, adding rows, deleting rows, and removing missing values.\n", "\n", ">```python\n", ">import pandas as pd\n", ">data = {'Herring': [27, 13, 52, 54, 5, 19], \n", "> 'Coffee': [90, 94, 96, 97, 30, 73],\n", "> 'Tea': [88, 48, 98, 93, 99, 88]}\n", ">countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']\n", ">food_consumed = pd.DataFrame(data, index=countries)\n", ">```\n", "\n", "#### 1. Shape of a data frame\n", "\n", "```python\n", "# There were 6 countries, and 3 food types. Verify:\n", "food_consumed.shape\n", "\n", "# Transposed and then shape:\n", "food_consumed.T.shape\n", "\n", "# Interesting: what shapes do summary vectors have?\n", "food_consumed.mean().shape\n", "```\n", "\n", "#### 2. Unique entries\n", "```python\n", "# Access the column names directly. \n", "# Does not work if there is a space in the name though :(\n", "food_consumed.Tea.unique()\n", "\n", "# So this is clearer, in my opinion. 
#### 2. Unique entries\n", "```python\n", "# Access the column names directly. \n", "# Does not work if there is a space in the name though :(\n", "food_consumed.Tea.unique()\n", "\n", "# So this is clearer, in my opinion. It is also more programmatic.\n", "# In other words, you can replace 'Tea' with a string variable, and\n", "# the code will still work.\n", "food_consumed['Tea'].unique()\n", "\n", "# Names (indexes) of the unique rows:\n", "food_consumed.index.unique()\n", "\n", "# In newer versions of Pandas, you can get counts (n) of\n", "# the unique entries:\n", "food_consumed.nunique()       # in each column \n", "food_consumed.nunique(axis=1) # in each row\n", "```\n", "\n", "#### 3. Add a new column\n", "```python\n", "# Works just like a dictionary!\n", "# The data must be in the same row order as the frame:\n", "food_consumed['Yoghurt'] = [30, 20, 53, 2, 3, 48]\n", "```\n", "\n", "#### 4. Merging data frames \n", "```python\n", "# Note the row order is different this time; the join\n", "# aligns the rows on the index:\n", "more_foods = pd.DataFrame(index=['Belgium', 'Germany', 'Ireland', 'Netherlands', 'Sweden', 'Switzerland'],\n", "                          data={'Garlic': [29, 22, 5, 15, 9, 64]})\n", "\n", "# Merge 'more_foods' into the 'food_consumed' data frame\n", "food_consumed = food_consumed.join(more_foods)\n", "```\n", "\n", "#### 5. Adding a new row\n", "```python\n", "# Collect the new data in a Series. Note that 'Tea' is (intentionally) missing!\n", "portugal = pd.Series({'Coffee': 72, 'Herring': 20, 'Yoghurt': 6, 'Garlic': 89},\n", "                     name = 'Portugal')\n", "\n", "food_consumed = food_consumed.append(portugal)\n", "# See the missing value created?\n", "# Note: in Pandas 2.0 and later, `.append` was removed; use\n", "# pd.concat([food_consumed, portugal.to_frame().T]) instead.\n", "```\n", "\n", "#### 6. Delete or drop a row/column\n", "```python\n", "# Drops a column, and returns its values to you\n", "coffee_column = food_consumed.pop('Coffee')\n", "print(coffee_column)\n", "print(food_consumed)\n", "\n", "# Leaves the original data untouched; returns only \n", "# a copy, with those columns removed\n", "food_consumed.drop(['Garlic', 'Yoghurt'], axis=1)\n", "\n", "# Leaves the original data untouched; returns only \n", "# a copy, with those rows removed. \n", "non_EU_consumption = food_consumed.drop(['Switzerland'], axis=0)\n", "```\n", "\n", "#### 7. Remove rows with missing values\n", "```python\n", "# Returns a COPY of the data frame, with no missing values:\n", "cleaned_data = food_consumed.dropna() \n", "\n", "# Makes the deletion in place; more efficient for large data sets\n", "food_consumed.dropna(inplace=True) \n", "\n", "# Remove only rows where all values are missing:\n", "food_consumed.dropna(how='all')\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dealing with CSV files in Pandas\n", "\n", "> *You can skip this theory section for now, and go straight to the application below. Read this later, if it is interesting.*\n", "\n", "CSV stands for Comma-Separated Values. It is a table of data, in a plain text file, where the values are separated by commas.\n", "\n", "Download the file https://openmv.net/file/batch-yield-and-purity.csv and open it in a text editor to follow along; it will look similar to the sketch below.\n", "\n", "
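A small, hypothetical CSV file with a header row and three data rows might look like this in a text editor (the values here are made up for illustration):\n", "\n", "```\n", "Yield,Purity\n", "75.3,71.2\n", "73.1,72.9\n", "77.8,70.4\n", "```\n", "\n", "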
**How is a CSV file processed by Python?**\n", "\n", "* When the file is imported, it is read from top-to-bottom, line-by-line.\n", "* Within each line, reading goes from left-to-right. Every time a comma is encountered, the file import process assumes it is the start of a new column, and places the following number into that column.\n", "* It continues until it reaches the end of the line of text.\n", "\n", "Terminology: the comma symbol is used to separate one number from the next. In this situation, the comma is also known as a ***delimiter***. You can imagine devising a file format which uses some other symbol as the delimiter: for example the 'tab' character, the semi-colon, or even a space. You will see the word ***delimiter*** used below in the code; now you know what it means.\n", "\n", "***Problems*** with CSV files:\n", "* The comma used in many European countries as the decimal indicator, as in \"72,6\" for the number 72.6, is problematic. Since importing reads from left to right, \"72,6\" creates an integer of 72 in one column, and a value of 6 in the next column. CSV files must therefore be created using a period (or full-stop) as the decimal separator for floating-point numbers (i.e. non-integer numbers).\n", "* The CSV format is not economical in terms of storage space for large datasets.\n", "* It should be clear from the above description that it is not known how many rows or columns will be encountered before loading a CSV file, so computer memory cannot be allocated up-front. This can slow down the processing of large files.\n", "* Editing data in a CSV file is not simple: you cannot easily delete an entire column, for example. \n", "\n", "***Advantages***\n", "\n", "* The CSV format is future-proof: since everything inside such a file is just plain-old text, it can always be opened, no matter which software you use in the future. You can still open text files from 1970 on your computer today.\n", "* You can quickly change the CSV file to update/add/remove parts of it using your text editor (e.g. a single row). No special software is required.\n", "* You can handle missing data, or create missing values: put a `NaN` between two commas. For example, `71,NaN,73` indicates a value is missing between the `71` and the `73`.\n", "* The CSV file is a good data-interchange format. By that we mean: almost all data manipulation software can export to CSV, and once you have that as an intermediate file, you can re-import it into other software, including Python. Use it, for example, to export a database to CSV. Or to **share data** between R, MATLAB, Python or other software tools.\n", "* Because of its simplicity, it is widely supported in most data manipulation software. But its simplicity is also its downfall for more complex data (e.g. data with more than 2 dimensions, such as digital images, or data from batch processes)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reading a CSV file with Pandas\n", "\n", "The Pandas function ``pd.read_csv`` has a lot of flexibility and [smart processing](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) built in to make reading CSV files easy, with headers, missing values, and other settings. It is a swiss-army-knife function: very versatile, but you need to know how to use it.\n", "\n", "#### Read the CSV file from your computer/network drive:\n", "Download this CSV file: http://openmv.net/file/batch-yield-and-purity.csv and adjust the code below, where necessary:\n", "```python\n", "import os\n", "import pandas as pd\n", "\n", "directory = r'C:\\location\\of\\file'\n", "filename = 'batch-yield-and-purity.csv'\n", "full_filename = os.path.join(directory, filename)\n", "yield_purity_pd = pd.read_csv(full_filename)\n", "```\n", "\n", "#### Read the CSV file directly from a web server:\n", "```python\n", "import pandas as pd\n", "yield_purity_pd = pd.read_csv('http://openmv.net/file/batch-yield-and-purity.csv')\n", "\n", "# If you are on a work computer behind a proxy server, you\n", "# have to take a few more steps. Add these 6 lines of code.\n", "import io\n", "import requests\n", "proxyDict = {\"http\" : \"http://replace.with.proxy.address:port\"}\n", "url = \"http://openmv.net/file/batch-yield-and-purity.csv\"\n", "s = requests.get(url, proxies=proxyDict).content\n", "web_dataset = io.StringIO(s.decode('utf-8'))\n", "\n", "# Convert the file fetched from the web to a Pandas dataframe\n", "yield_purity_pd = pd.read_csv(web_dataset)\n", "```\n", "\n", "### Write a DataFrame (or Series) to CSV \n", "\n", "This is as simple as can be:\n", "```python\n", "yield_purity_pd.to_csv('output_filename_here.csv')\n", "```\n", "\n", "
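Two optional inputs that are often useful (a small sketch; see the ``to_csv`` documentation for the full list):\n", "\n", "```python\n", "yield_purity_pd.to_csv('output_filename_here.csv',\n", "                       index=False,         # do not write the row index\n", "                       float_format='%.3f') # control the decimals written\n", "```"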
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ➜ Challenge yourself: working with CSV files in Pandas\n", "\n", "Read in the batch yield and purity dataset from above. The dataset is described here: http://openmv.net/info/batch-yield-and-purity\n", "1. The moment you have created a data frame with Pandas, it is advisable to call ``df.head()`` on it; in this case: ``yield_purity_pd.head()``. Does the data frame match the website's description?\n", "\n", "2. What is the lowest ``yield`` recorded? Make sure your command returns only a floating-point value.\n", "3. And the highest?\n", "4. What is the average ``purity`` of the raw material?\n", "5. Check the output of ``yield_purity_pd.describe()``: can you interpret each row of output?\n", "6. In the description [given on the website](http://openmv.net/info/batch-yield-and-purity), there is the idea that batch yield is affected by purity. For a cause-and-effect relationship to exist, there should be a correlation. Do you see that? Use ``yield_purity_pd.corr()`` to calculate the correlation between the two columns.\n", "7. We use relative standard deviation (RSD) as a way to judge how noisy a variable is. For a single variable, this is defined as: $$\\text{RSD} = \\dfrac{\\text{standard deviation}}{\\text{average}}$$ \n", "\n", "   Calculate the numerator (``yield_purity_pd.std()``) and the denominator separately. Do the values look reasonable? Now divide them. Does ``yield`` or ``purity`` have the higher RSD?\n", "8. Create a new calculation `hypothesis` = $ 4\\times(\\text{purity} - 50)$ and add that as a new column to the existing data frame.\n", "9. Write the data frame, now with 3 columns, to a CSV file.\n", "10. Open the CSV file in a text editor to ensure the data are properly stored, with the expected accuracy."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Reading and writing Excel files with Pandas\n", "\n", "There is not too much to say here, other than to show the basic commands:\n", "\n", "```python\n", "colour_data = pd.read_excel(excel_filename, \n", " sheet_name='Colours', \n", " skiprows=5, \n", " index_col=0)\n", "colour_data.head()\n", "```\n", "You can call the function with various inputs, depending on your situation. Read the full documentation for reading Excel files: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html\n", "\n", "Similarly, for writing Excel files, it is often enough to just use:\n", "```python\n", "df = pd.DataFrame(...)\n", "df.to_excel(\"output.xlsx\", sheet_name='Summary')\n", "```\n", "and it is worth checking the documentation for further function options: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## ➜ Challenge yourself: Fridge simulation\n", "\n", "Back in prior [worksheet 6](https://yint.org/pybasic06) you simulated the cooling process taking place in a fridge.\n", "\n", "\n", "By the end of the exercise you had assembled 3 columns of equal length in 3 separate variables. Here is how your code might have looked:\n", "\n", "```python\n", "import numpy as np\n", "def simulate_cooling(time_final=30, initial_temp=25, delta_t=2.5):\n", " \"\"\"Models Newton's Law of Cooling: \\dfrac{dT}{dt} = -k (T-F)\n", "\n", " The fridge has a constant temperature, F=5 [°C]\n", " Heat transfer coefficient = k = 0.08 [1/minutes]\n", " \n", " Discretizing: T_{i+1} = T_i - k (\\delta t)(T_i - F)\n", " \n", " Temperature at time point $i+1$ (one step in the future, T_{i+1}) is related\n", " to the temperature now, at time $i$, T_{i}.\n", " \n", " The total simulation time is `time_final` minutes, starting at\n", " t=0. The initial temperature of the object in the fridge is \n", " ``initial_temp` °C, and the simulation time steps are `delta_t`\n", " minutes apart.\n", " \"\"\"\n", "\n", " F_temperature = 5 # °C\n", " heat_transfer_coeff = 0.08 # 1/minutes\n", "\n", " # Create the two outputs of interest\n", " time = np.arange(start=0.0, stop=time_final, step=delta_t)\n", " temp = np.zeros(time.shape)\n", " temp[0] = initial_temp\n", "\n", " for idx, t_value in enumerate(time[1:]):\n", " temp[idx + 1] = temp[idx] - heat_transfer_coeff * delta_t * (temp[idx] - F_temperature)\n", "\n", " # After the loop:\n", " time[idx + 1] = t_value + delta_t\n", "\n", " # Exact value\n", " exact = F_temperature + (initial_temp - F_temperature) * np.exp(-heat_transfer_coeff * time)\n", "\n", " return (time, temp, exact)\n", "\n", "\n", "time, temperature, true_value = simulate_cooling(time_final=30, initial_temp=25)\n", "```\n", "\n", "The moment it happens that you collect several variables as output which logically belong together, then you should combine them in a single variable, a Pandas data frame. Put each variable as a new column so your function output is simplified:\n", "\n", "```python\n", "simulation = simulate_cooling(time_final=30, initial_temp=25)\n", "```\n", "\n", "Modify the above code. 
 ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## ➜ Challenge yourself: Fridge simulation\n", "\n", "Back in [worksheet 6](https://yint.org/pybasic06) you simulated the cooling process taking place in a fridge.\n", "\n", "By the end of the exercise you had assembled 3 columns of equal length in 3 separate variables. Here is how your code might have looked:\n", "\n", "```python\n", "import numpy as np\n", "\n", "def simulate_cooling(time_final=30, initial_temp=25, delta_t=2.5):\n", "    \"\"\"Models Newton's Law of Cooling: \\dfrac{dT}{dt} = -k (T-F)\n", "\n", "    The fridge has a constant temperature, F=5 [°C].\n", "    Heat transfer coefficient: k = 0.08 [1/minutes].\n", "\n", "    Discretizing: T_{i+1} = T_i - k (\\delta t)(T_i - F)\n", "\n", "    The temperature at time point $i+1$ (one step in the future, T_{i+1})\n", "    is related to the temperature now, at time $i$, T_i.\n", "\n", "    The total simulation time is `time_final` minutes, starting at t=0\n", "    (the end point is excluded). The initial temperature of the object in\n", "    the fridge is `initial_temp` °C, and the simulation time steps are\n", "    `delta_t` minutes apart.\n", "    \"\"\"\n", "\n", "    F_temperature = 5           # °C\n", "    heat_transfer_coeff = 0.08  # 1/minutes\n", "\n", "    # Create the two outputs of interest\n", "    time = np.arange(start=0.0, stop=time_final, step=delta_t)\n", "    temp = np.zeros(time.shape)\n", "    temp[0] = initial_temp\n", "\n", "    # One discretized (Euler) step per time point:\n", "    for idx in range(1, len(time)):\n", "        temp[idx] = temp[idx - 1] - heat_transfer_coeff * delta_t * (temp[idx - 1] - F_temperature)\n", "\n", "    # Exact value\n", "    exact = F_temperature + (initial_temp - F_temperature) * np.exp(-heat_transfer_coeff * time)\n", "\n", "    return (time, temp, exact)\n", "\n", "\n", "time, temperature, true_value = simulate_cooling(time_final=30, initial_temp=25)\n", "```\n", "\n", "The moment you find yourself collecting several output variables that logically belong together, you should combine them into a single variable: a Pandas data frame. Put each variable in a new column, so your function output is simplified:\n", "\n", "```python\n", "simulation = simulate_cooling(time_final=30, initial_temp=25)\n", "```\n", "\n", "Modify the above code. Remove the last 2 lines inside the function, and replace them with code that:\n", "* Creates a data frame with 2 columns: one for ``time`` and one for ``temperature``.\n", "* Appends a 3rd column, ``exact``, which contains the exact solution, calculated from the values in the first two columns.\n", "* Appends a 4th column, ``error``, which contains the difference between the true value and the simulated value.\n", "* Makes the first column, ``time``, your index! That actually removes the column. If your data frame were called ``results`` inside the function, you do this by writing ``results = results.set_index('time')`` (the method returns a new data frame). Check ``results.shape`` before and after doing this. \n", "* Returns a single output: the data frame.\n", "* Modify the code outside the function too, so that only the single output is used, and not 3 outputs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## ➜ Challenge yourself: CSV file processing\n", "\n", "Load the CSV file from https://openmv.net/info/raw-material-height. These are actual data from a process where the height is a critical parameter to monitor, to ensure it does not go too low.\n", "\n", "1. What is the minimum value seen in the history of the process?\n", "2. When did that minimum occur?\n", "3. Try this: ``data['Level'].plot()``\n", "4. You loaded the data from CSV. Now export the data set to Excel.\n", "5. Open the Excel file, and make a plot of the data. Verify that the Python plot matches the Excel plot.\n", "\n", "We will return to plotting in a later worksheet." ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "## ➜ Challenge yourself: KNMI data loading\n", "\n", "Back in [worksheet 4](https://yint.org/pybasic04) you used a data set from the Dutch meteorological service (KNMI): temperature readings from a location in The Netherlands, since 1901. [Download the file](http://projects.knmi.nl/klimatologie/onderzoeksgegevens/homogeen_260/tg_hom_mnd260.txt), or use the direct web address: http://projects.knmi.nl/klimatologie/onderzoeksgegevens/homogeen_260/tg_hom_mnd260.txt\n", "\n", "The first column is the station number, the next is the date, and the third column is the temperature, measured in units of °C.\n", "\n", "Unfortunately, the KNMI service used commas as the delimiter on line 27 (the line with the column headings), and then spaces as the delimiter in the rest of the file. That makes it hard to find the right settings to import the file. Nevertheless, try, using [the documentation for ``pd.read_csv``](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).\n", "\n", "***Hint:***\n", "```python\n", "knmi = pd.read_csv('URL here, or the file name on your computer', delimiter=..., skiprows=..., header=0)\n", "knmi.head()\n", "```\n", "Warning: it can be a frustrating exercise dealing with other people's badly formatted data. ***But that's reality.***" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Further tips\n", "\n", "1. Download a *cheatsheet* of Pandas tips: https://www.kdnuggets.com/2017/01/pandas-cheat-sheet.html \n", "2. Learn more about indexing Pandas data frames: https://www.kdnuggets.com/2019/04/pandas-dataframe-indexing.html (to be covered in the Advanced course)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Wrap up this section by committing all your work. Have you used a good commit message? Push your work, to refer to later, but also as a backup." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ ">***Feedback and comments about this worksheet?***\n", "> Please provide any anonymous [comments, feedback and tips](https://docs.google.com/forms/d/1Fpo0q7uGLcM6xcLRyp4qw1mZ0_igSUEnJV6ZGbpG4C4/edit)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# IGNORE this. Execute this cell to load the notebook's style sheet.\n", "from IPython.core.display import HTML\n", "css_file = './images/style.css'\n", "HTML(open(css_file, \"r\").read())" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "221.984px" }, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 2 }