<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction-to-Pandas" data-toc-modified-id="Introduction-to-Pandas-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction to Pandas</a></span></li><li><span><a href="#Getting-started-with-the-Pandas-library" data-toc-modified-id="Getting-started-with-the-Pandas-library-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Getting started with the Pandas library</a></span></li><li><span><a href="#Working-with-Pandas-Series-objects" data-toc-modified-id="Working-with-Pandas-Series-objects-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Working with Pandas <code>Series</code> objects</a></span><ul class="toc-item"><li><span><a href="#What-is-a-Series-object?" data-toc-modified-id="What-is-a-Series-object?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What is a <code>Series</code> object?</a></span></li><li><span><a href="#Mathematical-calculations" data-toc-modified-id="Mathematical-calculations-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Mathematical calculations</a></span></li><li><span><a href="#Accessing-entries" data-toc-modified-id="Accessing-entries-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Accessing entries</a></span></li></ul></li><li><span><a href="#Working-with-Pandas-DataFrame-objects" data-toc-modified-id="Working-with-Pandas-DataFrame-objects-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Working with Pandas <code>DataFrame</code> objects</a></span></li><li><span><a href="#Time-for-a-diversion:-Dictionaries!" 
data-toc-modified-id="Time-for-a-diversion:-Dictionaries!-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Time for a diversion: Dictionaries!</a></span><ul class="toc-item"><li><span><a href="#Try-it" data-toc-modified-id="Try-it-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Try it</a></span></li><li><span><a href="#Add-a-new-item-to-the-dictionary" data-toc-modified-id="Add-a-new-item-to-the-dictionary-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Add a new item to the dictionary</a></span></li><li><span><a href="#Creating-a-Series-from-a-dictionary" data-toc-modified-id="Creating-a-Series-from-a-dictionary-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Creating a Series from a dictionary</a></span></li></ul></li><li><span><a href="#DataFrame-operations" data-toc-modified-id="DataFrame-operations-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>DataFrame operations</a></span></li><li><span><a href="#Reading-and-writing-Excel-files-with-Pandas" data-toc-modified-id="Reading-and-writing-Excel-files-with-Pandas-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Reading and writing Excel files with Pandas</a></span></li><li><span><a href="#Exercise:-mass-balance" data-toc-modified-id="Exercise:-mass-balance-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Exercise: mass balance</a></span></li><li><span><a href="#Exporting-to-Excel" data-toc-modified-id="Exporting-to-Excel-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Exporting to Excel</a></span></li></ul></div>

> All content here is under a Creative Commons Attribution [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/) and all source code is released under a [BSD-2 clause license](https://en.wikipedia.org/wiki/BSD_licenses).
>
>Please reuse, remix, revise, and [reshare this content](https://github.com/kgdunn/python-basic-notebooks) in any way, keeping this notice.

# Course overview

This is the second in a series of six modules (11 to 16) that refocus the course material from the [prior 10 modules](https://github.com/kgdunn/python-basic-notebooks) in a slightly different way. The series places more emphasis on

* dealing with data: importing, merging, filtering;
* calculations from the data;
* visualization of it.

In short: ***how to extract value from your data***.


# Module 12 Overview

This is the second of 6 modules. In this module we will cover

* Using and understanding the Pandas library
* Creating a Pandas data frame
* Reading in Excel files as an alternative way to create a data frame
* Basic calculations with data frames

**Requirements before starting**

* Have your Python installation working as for module 11, and also the Pandas library installed.

## Introduction to Pandas


Why use ``pandas`` if you already can use tools like MATLAB and Excel?

* In MATLAB you have arrays (matrices) of data. Pandas adds column headings and row labels (indexes) and calls the result a ``DataFrame``. Think of a spreadsheet.
* In Pandas we often use the variable name ``df`` to refer to the data frame.
* The advantage of using column headings is that you can then write code like this:

    ``df["TemperatureC"] = (df["TemperatureF"] - 32) * 5 / 9``

  to convert Fahrenheit to Celsius for the entire column. 
  
  You do *not* need to  know the column number, like in MATLAB, where you have to write ``X(:, 5) = (X(:, 2) - 32) * 5 / 9``, for example. In a spreadsheet you typically also write your formulas using the column labels, like ``= (B:B - 32) * 5 / 9``, if column B is the temperature in Fahrenheit.
* Apart from referring to columns (or rows) by name, you can also merge two data frames together: for example to merge data from the lab, and data from the process. You have to specify which column is the common column. In Excel you can use `VLOOKUP` to do this, but it is messy, and with MATLAB you have to write code yourself to merge two data sets.
* With Pandas, if your row names are time-based, then you can take advantage of that: e.g. you can, with 1 line of code, calculate the average over a week, or a month. In other languages you have to manually program that averaging, including taking into account that months sometimes have 28, 29, 30 or 31 days.
* Data which are not time-based are equally well handled by Pandas.
* Pandas also has multi-level indexing, or hierarchical indexing. More on that later.
* If you do something on a data frame, like calculate an average over all rows, then the output result also gets those labels, the column headings in this case, kept in place.
* Pandas takes care of missing data handling. If you calculate the average, it will, by default, ignore missing values, unlike MATLAB, where you get `nan` as a result.
* With Pandas you can quickly visualize your data, often with  1 line of code: 

    * ``df["TemperatureC"].plot()``
    * ``df.boxplot(column='activity', by='reactor')`` will create a boxplot of the values in the `activity` column, for every `reactor` 
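To make a few of the points above concrete, here is a minimal sketch. The data are made up, and the variable names `days`, `weekly` and `readings` are only for illustration. It converts a whole column by name, averages per week using a time-based index, and shows that the mean skips missing values by default.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings for two full weeks (made-up values).
# 2024-01-01 is a Monday, so the 14 days fill exactly two Monday-to-Sunday weeks.
days = pd.date_range("2024-01-01", periods=14, freq="D")
df = pd.DataFrame({"TemperatureF": np.arange(50.0, 64.0)}, index=days)

# Convert the entire column, referring to it by name
df["TemperatureC"] = (df["TemperatureF"] - 32) * 5 / 9

# Weekly average in one line; "W" creates weekly bins that end on Sunday
weekly = df.resample("W").mean()
print(weekly)

# Missing values are ignored by default when averaging
readings = pd.Series([1.0, np.nan, 3.0])
print(readings.mean())
```

Changing `"W"` to another frequency string supported by Pandas should give averages over a different period, without you having to count the days yourself.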
    

## Getting started with the Pandas library

A library is a collection of someone else's Python code. It saves time to use existing, good-quality libraries, so you can focus on your work. E.g. focus on interpreting the data, and less on how to manipulate/process your data.

You can load the Pandas library with this command:

```python
import pandas as pd
pd.__version__ # ensure you have a version >= 1.3
```

Try it below:

There are 2 types of objects in Pandas we will use: a ``Series`` and a ``DataFrame``. 

* A ``Series`` is roughly the equivalent of a vector, or a column/row in a spreadsheet.
* A ``DataFrame`` is a collection of ``Series`` objects, next to each other, to create a matrix of data.
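As a small sketch of that relationship (the measurement names and values are made up), two ``Series`` objects can be placed next to each other to form a ``DataFrame``, and each column you pull back out is again a ``Series``:

```python
import pandas as pd

# Two made-up Series, each with a name
temperature = pd.Series([20.5, 21.0, 19.8], name="Temperature")
pressure = pd.Series([101.2, 101.5, 100.9], name="Pressure")

# Place them side by side (axis=1) to form a DataFrame
df = pd.concat([temperature, pressure], axis=1)
print(df)

# Each column of the DataFrame is itself a Series
print(type(df["Pressure"]))
```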

## Working with Pandas ``Series`` objects

### What is a ``Series`` object?
Let's see some characteristics of a ``Series``:
```python
# Create a Series from a list. 
s = pd.Series([ ... ]) 
print(s)
```

Put your own numbers inside the list in the space below. You learned [about lists in the prior module](https://yint.org/pybasic11).

Notice the index (the column to the left of your numbers)? Let's look at another example:
```python
>>> s = pd.Series([ 5, 9, 1, -4, float('nan'), 5 ])
>>> print(s)  
0    5.0
1    9.0
2    1.0
3   -4.0
4    NaN
5    5.0
dtype: float64
```
If you do not provide any labels for the rows, then these will be automatically generated for you, starting from 0.

What if you have your own labels already?
```python
# You call the function with two inputs. One input is 
# mandatory (the first one), the other is optional.
s = pd.Series(
    data  = [5,   9,   1,   -4,  float('nan'), 5 ], 
    index = ['a', 'b', 'c', 'd', 'e',         'f']
)
print(s)
print(s.values)
type(s.values)
```

Ah ha! See what you get there in the output from ``s.values``? Pandas is built on top of another library, called NumPy. The underlying data are NumPy arrays, and Pandas adds extra functionality on top of that. We will refer back to NumPy later, or you will see it commonly referenced in Python websites that deal with data processing. So it is good to know about it.

Lastly, give your series a nice name:
```python
s.name = 'Random values'
print(s)
```

### Mathematical calculations

The series you created above can be used in calculations. Notice how missing data are handled seamlessly.

```python
s = pd.Series(
    data  = [5,   9,   1,   -4,  float('nan'), 5 ], 
    index = ['a', 'b', 'c', 'd', 'e',         'f'],
    name = 'Calculations'
)
print(s * 5 + 2)
```

What type is a series object?  Hint, use the ``type(...)`` function.

Calculate the square root of this column `s`. Remember in the [prior module](https://yint.org/pybasic11) how we calculated the square root by raising the number to the power of 0.5? 

Since the square root is not defined for negative numbers, such as the $-4$ in row `d`, what do you expect as an answer?  Check it out in the space below.

Logical operations are possible too. Try some of these out:
```python
s > 4
s.isna()
s.notna()
```
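The result of a logical operation is itself a ``Series`` of `True`/`False` values. A short sketch, using the same series as above, of how such a result can be used for counting, since `True` counts as 1 when summed:

```python
import pandas as pd

s = pd.Series(
    data  = [5,   9,   1,   -4,  float('nan'), 5 ],
    index = ['a', 'b', 'c', 'd', 'e',         'f'],
)

larger_than_4 = s > 4        # a missing value compares as False
print(larger_than_4)

print(larger_than_4.sum())   # how many entries are greater than 4
print(s.isna().sum())        # how many entries are missing
```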

### Accessing entries

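Like with lists, you can access the data entries using the square bracket notation. In Pandas:
```python
s.iloc[2]  # by position. Plain s[2] is deprecated in recent Pandas versions when the index contains labels
s['e']     # by label
```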

Notice the second example above: you can access entries in the Series by their name!

Selected subsets from the series can be accessed too, again using square brackets:
```python
s.iloc[[2, 4, 0]]  # by position
s[['f', 'd', 'b']]

# Selection based on logic: I want only values greater than 4. This is called filtering.
s[s > 4]
```

You can also access a ``range`` of entries:
```python
s[0:2]
s['a':'c']
```
Take a careful look at that output. You might have expected them to be the same length, but they are not! When accessing with the index **names**, you get the range _inclusive_ of the last entry. When accessing by index **number**, it behaves consistently with Python lists: the last entry is excluded.

That makes sense. Names of the rows, the index, do not necessarily have to be sequential, like ``['a', 'b', ... 'f']`` as in this example. Often the index is unordered. 

For example, if you had a series related to different Canadian cities: 

`['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax']`

then with `['Vancouver':'Montréal']` you expect to see the middle 3 entries, inclusive of `Montréal`.
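Here is a minimal sketch of the city example. The values are made up, and the explicit `.loc` (label-based) and `.iloc` (position-based) accessors are used to make the difference visible:

```python
import pandas as pd

cities = pd.Series(
    data=[2.9, 0.7, 1.0, 1.8, 0.4],   # made-up values
    index=['Toronto', 'Vancouver', 'Ottawa', 'Montréal', 'Halifax'],
)

# By label: the end label is INCLUDED, so 3 entries
by_label = cities.loc['Vancouver':'Montréal']
print(by_label)

# By position: the end position is EXCLUDED, so 2 entries
by_position = cities.iloc[1:3]
print(by_position)
```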

## Working with Pandas ``DataFrame`` objects


Imagine you have 5 temperature measurements (rows) for 4 cities (columns). In actual process data, each column would be the temperature measured at a different part of the process. For this example, each column is a city.

We can create a ``DataFrame`` using a ***list-of-lists***:
```python
import pandas as pd
rawdata = [[17, 19, 22, 20], 
           [11, 14, 15, 12], 
           [ 7, 11,  8,  7], 
           [ 8,  9,  8,  8], 
           [ 7,  9,  8,  6]]
df = pd.DataFrame(
    data=rawdata, 
    columns = ['Johannesburg', 'Cape Town', 'Pretoria', 'Durban']
)
print(df)
```

Tip: Pandas can handle column names with a space in them. 😊 This is why, when you want to see one column, you can refer to it as follows:

* ``df["Cape Town"]``
* ``df['Johannesburg']``
* What type is each column inside ``df``? Try finding out: ``type(df["Cape Town"])``

Try some calculations now:
```python
df.max() 
df.max(axis=0) 
df.max(axis=1) 
```

Now try some other types of calculations on all the columns: 

* ``df.sum()``
* ``df.mean()``
* ``df.median()``
* ``df.std()`` (standard deviation)
* ``df.min()``
* ``df.idxmin()``
* ``df.diff()``

Notice that these calculations take place on the columns, by default. What if you wanted to do them on the rows?
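One way to do this, shown here as a minimal sketch with the same temperature data, is the `axis` input that you already saw with ``df.max``:

```python
import pandas as pd

rawdata = [[17, 19, 22, 20],
           [11, 14, 15, 12],
           [ 7, 11,  8,  7],
           [ 8,  9,  8,  8],
           [ 7,  9,  8,  6]]
df = pd.DataFrame(
    data=rawdata,
    columns=['Johannesburg', 'Cape Town', 'Pretoria', 'Durban']
)

# Default (axis=0): one result per column, so one value per city
print(df.mean())

# axis=1: one result per row, so one value per measurement
print(df.mean(axis=1))
```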


Try the following to expand your knowledge.


* Calculations on certain columns. The beauty of Pandas is how easy it is to write equations, based on the columns:
```python
df['Johannesburg'] * 4 - df['Durban']
```
The above does exactly what you think it should.

* What does this do? 

```python
>>> df.diff().abs().max()

# and this? 
>>> df.diff().abs().max().idxmax()
```

* What is the interpretation of that long command?

You can stack up your sequential operations quite compactly in Pandas. It works because the output from one function is the input for the next one to the right.
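If a long chain is hard to read, one option is to split it into named intermediate steps. The sketch below uses the same temperature data. The names `step1` to `step4` are only for illustration.

```python
import pandas as pd

rawdata = [[17, 19, 22, 20],
           [11, 14, 15, 12],
           [ 7, 11,  8,  7],
           [ 8,  9,  8,  8],
           [ 7,  9,  8,  6]]
df = pd.DataFrame(
    data=rawdata,
    columns=['Johannesburg', 'Cape Town', 'Pretoria', 'Durban']
)

step1 = df.diff()       # change from the previous row (first row is NaN)
step2 = step1.abs()     # size of the change, ignoring direction
step3 = step2.max()     # largest change, per column
step4 = step3.idxmax()  # name of the column with the largest change

print(step3)
print(step4)

# The chained version gives the same result
print(df.diff().abs().max().idxmax())
```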

**A tip on style**

You can also use ``df.Johannesburg`` to access a column, but this is not good Pandas style, so don't do this. It cannot handle column names with spaces, and if you have a column name that is also a built-in operation, like ``max``, for example, it is confusing.
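A short sketch of why the dot notation can be confusing, using a made-up data frame that happens to have a column called ``max``:

```python
import pandas as pd

df = pd.DataFrame({'max': [3, 1, 2], 'Cape Town': [19, 14, 11]})

# Square brackets always give you the column
print(df['max'])
print(df['Cape Town'])

# The dot notation gives you the built-in function, not the column
print(callable(df.max))

# And this is not even valid Python, because of the space:
# df.Cape Town
```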

## Time for a diversion: Dictionaries!

A dictionary is a Python ***object*** that is a flexible data container for other objects. It contains these objects using what are called ***key*** - ***value*** pairs. You create a dictionary like this:

```python
random_objects = {
    'my integer': 45,
    'a float': 12.34,
    'short_list': [1, 4, 7],
    'longer list': [2, 4, 6, 9, 12, 16, 20, 25, 30, 36, 42],
    'website': "https://learnche.org",
    'a tuple': (1, 2.0, 33, 444, '5555', 'etc'),
}
print(random_objects)
```

In Python versions before 3.7, the order of the printed dictionary was not guaranteed. From Python 3.7 onwards, dictionaries ***maintain the order*** in which the items were added.

The dictionary has what are called ***keys*** and ***values***:
```python

# These return "view" objects (dict_keys and dict_values); wrap them in list(...) to get a list:
random_objects.keys()
random_objects.values()

# What is the "type" of this dictionary?
print(f'The object is of: {type(random_objects)}')

# You can access individual values from the dictionary by using the key:
random_objects['short_list']

# What happens when you use a non-existent key?
random_objects['mystery']
```

In the above example, the keys were all ***string*** objects. But that is not required. You can use integers, floating point values, strings, tuples, or a mixture of them. There are other options too, but these are comprehensive enough.

Dictionary values may be any ***objects***, even other dictionaries. Yes, so a dictionary within a dictionary is possible. 

Dictionary objects are excellent ***containers***. If you need to return several objects from a function, collect them in a dictionary, and return them in that single object. It is not required, but it can make your code neater, and more logical.
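As a minimal sketch of that idea, the function below returns several results in one dictionary. The function name `summarize` and its keys are made up for this illustration.

```python
def summarize(values):
    """Return several summary results in a single dictionary."""
    return {
        'count': len(values),
        'minimum': min(values),
        'maximum': max(values),
        'average': sum(values) / len(values),
    }

summary = summarize([2, 4, 6, 9])
print(summary)

# Pick out only the result you need, using its key
print(summary['average'])
```

The caller does not need to remember the order in which results are returned, only the key names.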

### Try it

Create a dictionary for yourself with 4 `key`-`value` pairs, which summarizes a regression model. The `key` is the first item below, followed by a description of what you should create as the `value`:
1. `intercept`: make up a floating-point value which is the intercept of your linear model
2. `slope`: pick any floating-point value as the slope
3. `R2`: the $R^2$ value of the regression model
4. `residuals`: a list (vector) of residuals. You can use a Pandas Series here also!

You can create the above dictionary in a single line of code. 

regression_model = ___


### Add a new item to the dictionary

```python
regression_model = { ... } # create your dictionary
regression_model['new key'] = 'additional value'
```

And you can overwrite/update an existing key-value pair in the same way:
```python
random_objects['my integer'] = 42
```
This implies you can never have 2 keys which are the same. If you try to create a second key which already exists, it will overwrite the object associated with the existing key.
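A quick sketch to convince yourself (the dictionary `settings` is made up for this illustration):

```python
# The same key appears twice when creating the dictionary: the last one wins
settings = {'units': 'Celsius', 'units': 'Kelvin'}
print(settings)

# Assigning to an existing key overwrites; it does not add a second entry
settings['units'] = 'Fahrenheit'
print(settings)
print(len(settings))
```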


### Creating a Series from a dictionary

Now we can combine two new concepts you have just learned: Dictionaries and Pandas.

```python
raw_data = {
    'Germany': 27, 
    'Belgium': 13, 
    'Netherlands': 52, 
    'Sweden': 54, 
    'Ireland': 5
}
tons_herring_eaten = pd.Series(raw_data)
print(tons_herring_eaten)
```

The row names (index) are taken from the dictionary keys, associated with each value. 

1. Write the Pandas command to determine which country eats the most herring. It is **not** with the ``tons_herring_eaten.max()`` command!
2. And the least herring?
3. What does this do? ``tons_herring_eaten.sort_values()``. Print the variable afterwards. 
4. And what does this do then? ``tons_herring_eaten.sort_index()``

## DataFrame operations

Now you will use your knowledge of dictionaries you just developed above.

We will show code for some commonly-used Pandas operations:
* shape of an array, 
* what are the unique entries, 
* adding and merging columns, 
* adding rows, 
* deleting rows,
* removing missing values.

We will use this made-up data set, showing how much food is used by each country. You can replace these data with numbers and columns and rows which make sense to your application.

>```python
>import pandas as pd
>data = {
>    'Herring':  [27, 13, 52, 54,  5, 19], 
>    'Coffee':   [90, 94, 96, 97, 30, 73],
>    'Tea':      [88, 48, 98, 93, 99, 88]
>}
>countries = ['Germany', 'Belgium', 'Netherlands', 'Sweden', 'Ireland', 'Switzerland']
>food_consumed = pd.DataFrame(data, index=countries)
>
>print(data)
>print(countries)
>print(type(data))
>print(type(countries))
>print(type(food_consumed))
>food_consumed
>```

#### 0. Getting an idea about your data first

```python
# The first rows:
food_consumed.head()

# The last rows:
food_consumed.tail()

# Some basic statistics
food_consumed.describe()

# Some information about the data structure: missing values, memory usage, etc
food_consumed.info()
```

#### 1. Shape of a data frame

```python
# There were 6 countries, and 3 food types. Verify:
food_consumed.shape

# Transposed and then shape:
food_consumed.T.shape

# Interesting: what shapes do summary vectors have?
food_consumed.mean().shape
```

#### 2. Unique entries
```python
food_consumed['Tea'].unique()

# Unique names of the rows: (not so useful in this example, because they are already unique)
food_consumed.index.unique()

# Get counts (n) of the unique entries:
food_consumed.nunique()       # in each column 
food_consumed.nunique(axis=1) # in each row
```

#### 3. Add a new column
```python
# Works just like a dictionary!
# If the data are in the same row order
food_consumed['Yoghurt'] = [30, 20, 53, 2, 3, 48]
print(food_consumed)
```

#### 4. Merging dataframes 
```python
# Note the row order is different this time:
more_foods = pd.DataFrame(
    index=['Belgium', 'Germany', 'Ireland', 'Netherlands', 'Sweden', 'Switzerland'],
    data={'Garlic': [29, 22, 5, 15, 9, 64]},
)
print(food_consumed)
print(more_foods)
# Merge 'more_foods' into the 'food_consumed' data frame. Merging works, even if row order is not the same!
food_consumed = food_consumed.join(more_foods)
food_consumed
```
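The ``join`` above matches rows using the row names (the index). If the common information is in an ordinary column instead, ``pd.merge`` with the `on` input is one option. A minimal sketch with made-up lab and process data, where `batch` is the assumed common column:

```python
import pandas as pd

# Made-up data: the lab and the process both record the batch number
lab = pd.DataFrame({
    'batch': ['B1', 'B2', 'B3'],
    'purity': [98.1, 97.4, 99.0],
})
process = pd.DataFrame({
    'batch': ['B3', 'B1', 'B2'],        # note the different row order
    'temperature': [351, 348, 355],
})

# Match up the rows using the common column
combined = pd.merge(lab, process, on='batch')
print(combined)
```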

#### 5. Adding a new row
```python
# Collect the new data in a Series. Note that 'Tea' is (intentionally) missing!
portugal = pd.Series(
    data = {
        'Coffee': 72,  
        'Herring': 20, 
        'Yoghurt': 6, 
        'Garlic': 89,
    },
    name = 'Portugal'
)

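# Note: DataFrame.append() was removed in Pandas 2.0; pd.concat works in older and newer versions.
food_consumed = pd.concat([food_consumed, portugal.to_frame().T])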
# See the missing value created?
print(food_consumed)

# What happens if you run the above commands more than once?
```

#### 6. Delete or drop a row/column
```python
# Drop a column, and return its values to you
coffee_column = food_consumed.pop('Coffee')
print(coffee_column)
print(food_consumed)

# Leaves the original data untouched; returns only 
# a copy, with those columns removed
food_consumed.drop(['Garlic', 'Yoghurt'], axis=1)
print(food_consumed)

# Leaves the original data untouched; returns only 
# a copy, with those rows removed. 
non_EU_consumption = food_consumed.drop(['Switzerland', ], axis=0)
```

#### 7. Remove rows with missing values
```python
# Returns a COPY of the array, with no missing values:
cleaned_data = food_consumed.dropna() 

# Makes the deletion inplace; you do not have to assign the output to a new variable.
# Inplace is not always faster!
food_consumed.dropna(inplace=True) 

# Remove only rows where all values are missing:
food_consumed.dropna(how='all')
```
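The differences between these options are easier to see on a tiny example. A sketch with made-up data, where one row has a single missing value and another row is entirely missing. The `subset` input, not used above, limits the check to the columns you name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Coffee': [90, np.nan, np.nan], 'Tea': [88, 48, np.nan]},
    index=['Germany', 'Belgium', 'Portugal'],
)

# Default: drop a row if ANY value in it is missing
any_missing = df.dropna()
print(any_missing)

# Drop a row only if ALL its values are missing
all_missing = df.dropna(how='all')
print(all_missing)

# Only look at the 'Tea' column when deciding
by_column = df.dropna(subset=['Tea'])
print(by_column)
```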

#### 8. Sort the data

```python
food_consumed.sort_values(by="Garlic")
food_consumed.sort_values(by="Garlic", inplace=True)
food_consumed.sort_values(by="Garlic", inplace=True, ascending=False)
```

## Reading and writing Excel files with Pandas

The basic command to read an Excel file is straightforward:

```python
filename = r"C:\temp\colour-reference.xlsx"  # use the 'r' at the start with Windows directory names

# or, you can even specify the web address for the file
filename = "https://yint.org/static/colour-reference.xlsx"
colour_data = pd.read_excel(
    filename, 
    sheet_name='Sheet1', 
    skiprows=0, 
    index_col=0,
)
print(colour_data)
```

Try it: 
* Download an Excel file (or use your own); here is the one used in the demo code above: https://yint.org/static/colour-reference.xlsx
* Save the file somewhere on your hard drive.
* Open it up to see the file structure, and to see what data you expect to see next.
* Change the `filename` line in the code above.
* Run the code and verify you got what you expected.
* Adjust the `skiprows` and `index_col` function inputs to see what happens.


Excel files can be complex, with different layouts, so read the documentation about Pandas and Excel files: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

## Exercise: mass balance

1. Create an Excel file of a tank reactor: each column is a measurement from the reactor, and each row is a measurement that is taken at some point in time.
2. Simulate (create) some data. Save that file to your hard drive.
3. Use the knowledge you learned above to read in that Excel file. Use the ``df.head()`` function to make sure you have the correct values.
4. Use the mass balance principle:  $$ \text{Accumulation}  = \text{Input} - \text{Output} + \text{Generation} - \text{Consumption}     $$
5. Collect all the columns that are needed for the right hand side of the equation. For example, consider a carbon balance (then the $\text{Generation}$ and $\text{Consumption}$ columns are zero). Therefore calculate the input and the output carbon, and check if there is an accumulation in the tank over time.
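To give an idea of the final calculation, here is a rough sketch. Instead of reading an Excel file it creates the data frame directly. The column names `carbon_in` and `carbon_out` and all the numbers are made up. It also assumes one measurement per hour, so that each flow rate applies for 1 hour.

```python
import pandas as pd

# Made-up carbon flows, in kg/h, measured once per hour
tank = pd.DataFrame({
    'carbon_in':  [10.0, 10.5,  9.8, 10.2],
    'carbon_out': [ 9.5, 10.0, 10.0, 10.4],
})

# Accumulation = Input - Output (Generation and Consumption are zero for carbon)
tank['accumulation'] = tank['carbon_in'] - tank['carbon_out']

# Running total of carbon accumulated in the tank, in kg
tank['total_accumulated'] = tank['accumulation'].cumsum()
print(tank)
```

With your own file you would replace the ``pd.DataFrame(...)`` call with ``pd.read_excel(...)``, and use your own column names.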

## Exporting to Excel


Similarly, for writing Excel files, it is often enough to just use:
```python
df = ... # code goes here to create/update your data frame, df
df.to_excel("output.xlsx", sheet_name='Summary')
```
and it is worth checking the documentation for further function options: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html