# Data provenance


We've now successfully created a command line program - `plot_precipitation_climatology.py` - that calculates and plots the precipitation climatology for a given month. The last step is to capture the provenance of that plot. In other words, we need a record of all the data processing steps that were taken from the initial download of the data files to the end result (i.e. the `.png` image).

The simplest way to do this is to follow the lead of the [NCO](http://nco.sourceforge.net/) and [CDO](https://code.mpimet.mpg.de/projects/cdo) command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file.

In [3]:
import xarray as xr

csiro_pr_file = '../data/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc'
dset = xr.open_dataset(csiro_pr_file)

print(dset.attrs['history'])

Fri Dec 8 10:05:56 2017: ncatted -O -a history,pr,d,, pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc
Fri Dec 01 08:01:43 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/CSIRO-Mk3-6-0/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_185001-200512.nc pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc
2011-07-27T02:26:04Z CMOR rewrote data to comply with CF standards and CMIP5 requirements.


Fortunately, there is a Python package called [cmdline-provenance](http://cmdline-provenance.readthedocs.io/en/latest/)
that creates NCO/CDO-style records of what was executed at the command line.
We can use it to generate a new command line record:

In [6]:
# Example: this is the command that was run to launch the Jupyter notebook we're using.
import cmdline_provenance as cmdprov
new_record = cmdprov.new_log()
print(new_record)

Tue Nov 03 07:23:59 2020: C:\Users\yorksea\anaconda3\python.exe C:\Users\yorksea\anaconda3\lib\site-packages\ipykernel_launcher.py -f C:\Users\yorksea\AppData\Roaming\jupyter\runtime\kernel-4a6f7305-ff2a-46f4-b58a-c2e3ed534d8f.json



If we want to create our own entry for the history attribute, 
we'll need to be able to create a: 

* Time stamp
* Record of what was entered at the command line in order to execute `plot_precipitation_climatology.py`
* Method of indicating which version of the script was run (possible because the script is in our git repository)

### Time stamp

The `datetime` library, part of the Python standard library, can be used to find the current date and time:

In [25]:
import datetime
 
time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
print(time_stamp)

Fri Dec 08 14:05:17 2017


The `strftime` method can be used to customise the appearance of a datetime object;
in this case we've made it look just like the other time stamps in our data file.
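
To see how the format codes map onto the result, here is a minimal sketch that uses a fixed date (chosen to match the output above) instead of `now()`, so the result is repeatable. It also shows `strptime`, which goes the other way and parses a time stamp string back into a datetime object. The variable names are illustrative only.

```python
import datetime

# A fixed moment in time (rather than now()) so the result is repeatable
fixed_time = datetime.datetime(2017, 12, 8, 14, 5, 17)

# %a = abbreviated weekday, %b = abbreviated month, %d = zero-padded day,
# %H:%M:%S = 24-hour time, %Y = four-digit year
history_format = "%a %b %d %H:%M:%S %Y"
time_stamp = fixed_time.strftime(history_format)
print(time_stamp)

# strptime parses the string back into a datetime object
parsed_time = datetime.datetime.strptime(time_stamp, history_format)
print(parsed_time == fixed_time)
```

This prints `Fri Dec 08 14:05:17 2017` followed by `True` (the day and month names assume the default English locale).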

### Command line record

The `sys.argv` list, which is what the `argparse` library is built on top of, contains all the arguments entered by the user at the command line:

In [26]:
import sys
print(sys.argv)

['/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py', '-f', '/Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json']


You can see that launching this IPython notebook ran a command line program called `ipykernel_launcher.py`.
To join all these list elements into a single string,
we can use the `join` method that belongs to Python strings:

In [27]:
args = " ".join(sys.argv)
print(args)

/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json


While this list of arguments is very useful, 
it doesn't tell us which Python installation was used to execute those arguments. 
The `sys` library can help us out here too:

In [28]:
exe = sys.executable
print(exe) 

/Applications/anaconda/envs/pyaos-lesson/bin/python

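One caveat with joining on a plain space is that any argument containing a space loses its quoting, so the record can no longer be pasted straight back into a shell. A possible alternative, assuming Python 3.8 or later and a POSIX-style shell, is `shlex.join`. The argument list below is made up for illustration and stands in for `sys.argv`.

```python
import shlex

# Hypothetical argument list standing in for sys.argv
example_argv = ['plot_precipitation_climatology.py', 'my data.nc', 'May', 'output.png']

# A plain join: the file name with a space now looks like two arguments
plain_record = " ".join(example_argv)

# shlex.join quotes any argument that needs it
quoted_record = shlex.join(example_argv)

print(plain_record)
print(quoted_record)

# The quoted record can be split back into the original list
print(shlex.split(quoted_record) == example_argv)
```

For the file names used in this lesson, which contain no spaces, both approaches give the same record.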

### Git hash

In the lesson on version control using git
we learned that each commit is associated with a unique 40-character identifier known as a hash.
We can use the GitPython library (imported as `git`) to get the hash associated with the script:

In [29]:
import git
import os
 
repo_dir = '/Users/irv033/Documents/volunteer/teaching' 
#repo_dir = os.getcwd()
git_hash = git.Repo(repo_dir).head.commit  # the commit that is currently checked out
print(git_hash)

588f96dcab5c78d10b4c994eb3ca67955c882697

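If GitPython is not available, a rough alternative is to ask the `git` executable directly for the commit that `HEAD` points to. This is only a sketch, assuming `git` is installed and on the system path; the function name `get_git_hash` is made up for illustration and is not part of the lesson's code.

```python
import subprocess


def get_git_hash(repo_dir):
    """Return the hash of the checked-out commit, or 'unknown' on failure."""
    try:
        result = subprocess.run(
            ['git', 'rev-parse', 'HEAD'],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git repository, no commits yet, or git is not installed
        return 'unknown'
    return result.stdout.strip()


print(get_git_hash('.'))
```

Returning `'unknown'` rather than raising an error means a missing repository does not stop the rest of the history record from being written.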

### Putting it all together

We can now put all this together into a function that generates our history record:

In [30]:
def get_history_record(repo_dir):
    """Create a new history record."""

    time_stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")
    exe = sys.executable
    args = " ".join(sys.argv)
    git_hash = git.Repo(repo_dir).head.commit  # the commit that is currently checked out

    # Keep only the first 7 characters of the hash
    entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7])

    return entry

In [31]:
new_history = get_history_record('/Users/irv033/Documents/volunteer/teaching')

print(new_history)

2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d)


This new record can be combined with the previous history to compile a record that goes all the way back to when we obtained the original data file:

In [32]:
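# previous_history holds the history attribute of the input file,
# e.g. previous_history = dset.attrs['history']
complete_history = '%s \n %s' %(new_history, previous_history)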

print(complete_history)

2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d) 
 Fri Dec 8 10:05:47 2017: ncatted -O -a history,pr,d,, pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc
Fri Dec 01 07:59:16 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/ACCESS1-3/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc
CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 2012-02-08T06:45:54Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 09:55:30 2012: forcing attribute modified to correct value Fri Apr 13 12:13:10 2012: updated version number to v20120413. Fri Apr 13 12:29:34 2012: corrected model_id from ACC

(Note that in a real example of this process, the new history would refer to what was entered at the command line to run `plot_precipitation_climatology.py`, rather than to the `ipykernel_launcher.py` command that launches a notebook.)
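
The existing history shown earlier has the newest record on the first line and older records on the lines below. A minimal sketch of a helper that follows the same ordering, and that also copes with an input file that has no history yet, might look like this. The function name `prepend_history` and the shortened example records are made up for illustration.

```python
def prepend_history(new_record, previous_history=None):
    """Put new_record on the first line, above any previous history."""
    if not previous_history:
        # No earlier history (None or an empty string): the new record stands alone
        return new_record
    return new_record + '\n' + previous_history


# Hypothetical records, shortened for readability
old_history = 'Fri Dec 01 08:01:43 2017: cdo seldate,2001-01-01,2005-12-31 in.nc out.nc'
new_record = 'Fri Dec 08 14:05:34 2017: python plot_precipitation_climatology.py (Git hash: 588f96d)'

print(prepend_history(new_record, old_history))
```

When the input is an xarray dataset, the previous history could be fetched with `dset.attrs.get('history')`, which returns `None` if the attribute is missing.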

## Writing your own modules

We could place this new `get_history_record()` function directly into the `plot_precipitation_climatology.py` script, but there's a good chance we'll want to use it in many of the scripts we write in the future. In the functions lesson we discussed all the reasons why code duplication is a bad thing, and the same principle applies here. The solution is to place the `get_history_record()` function in a separate file of functions (called a module) that we use regularly across many scripts.

(The `get_history_record()` function has been modified slightly so that `repo_dir` isn't hard-coded. Instead, it is set to the current working directory, which is assumed to be the top of the directory tree in the git repository, as that's the input information required by `git.Repo`.)

In [33]:
!cat provenance.py

"""
A collection of commonly used functions for data provenance

"""

import sys
import datetime
import git
import os


def get_history_record():
    """Create a new history record."""

    time_stamp = datetime.datetime.now().strftime("%a %b %d %H:%M:%S %Y")
    exe = sys.executable
    args = " ".join(sys.argv)

    repo_dir = os.getcwd()
    try:
        git_hash = git.Repo(repo_dir).head.commit
    except git.exc.InvalidGitRepositoryError:
        print('To record the git hash, must run script from top of directory tree in git repo')
        git_hash = 'unknown'

    entry = """%s: %s %s (Git hash: %s)""" %(time_stamp, exe, args, str(git_hash)[0:7])

    return entry

We can then import that module and use it in all of our scripts.

In [36]:
import provenance

The string at the top of a module file works like the docstring at the top of a function: the help generator picks it up.

In [34]:
help(provenance)

Help on module provenance:

NAME
    provenance - A collection of commonly used functions for data provenance

FUNCTIONS
    get_history_record()
        Create a new history record.

FILE
    /Users/irv033/Documents/volunteer/teaching/amos-icshmo/provenance.py




In [35]:
help(provenance.get_history_record)

Help on function get_history_record in module provenance:

get_history_record()
    Create a new history record.

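The text that `help` displays is stored in the `__doc__` attribute of the module or function. The sketch below shows this for a stand-in function with the same name and docstring, so that it runs without `provenance.py` being present; `inspect.getdoc` from the standard library returns the same text with any indentation cleaned up.

```python
import inspect


def get_history_record():
    """Create a new history record."""
    return 'placeholder record'  # stand-in body, for illustration only


# The raw docstring, exactly as written in the source
print(get_history_record.__doc__)

# The same docstring with indentation tidied
print(inspect.getdoc(get_history_record))
```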


## Challenge

Import the new `provenance` module into your `plot_precipitation_climatology.py` script and use it to record the complete history of the output figure.

Things to consider:
- For command line programs that output a netCDF file, the history record is typically added to the global history attribute. In this case the output is a `.png` file, so `plot_precipitation_climatology.py` will need to output a `.txt` file that contains the history information. It's usually easiest for this metadata file to have exactly the same name as the figure file, just with a `.txt` instead of a `.png` file extension.
- Do you need to record the history of the land surface fraction file, or just the precipitation file?

Don't forget to commit your changes to git and push to GitHub.
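
For the first consideration above, one possible way to derive the name of the metadata file from the name of the figure file is `pathlib.Path.with_suffix`. This is only a sketch of that one step, not a full solution: the function name `write_metadata`, the file name and the history text are all made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def write_metadata(figure_file, history):
    """Write history to a .txt file named after the figure file."""
    # Swap the .png extension for .txt, keeping the rest of the path
    metadata_file = Path(figure_file).with_suffix('.txt')
    with open(metadata_file, 'w') as outfile:
        outfile.write(history)
    return metadata_file


# Hypothetical figure name, placed in a temporary directory
figure_file = os.path.join(tempfile.mkdtemp(), 'pr_climatology_May.png')
metadata_file = write_metadata(figure_file, 'Fri Dec 08 14:05:34 2017: example record')
print(metadata_file.name)
```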