{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data provenance\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We've now successfully created a command line program - `plot_precipitation_climatology.py` - that calculates and plots the precipitation climatology for a given month. The last step is to capture the provenace of that plot. In other words, we need a record of all the data processing steps that were taken from the intial download of the data files to the end result (i.e. the .png image).\n", "\n", "The simplest way to do this is to follow the lead of the [NCO](http://nco.sourceforge.net/) and [CDO](https://code.mpimet.mpg.de/projects/cdo) command line tools, which insert a record of what was executed at the command line into the history attribute of the output netCDF file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fri Dec 8 10:05:56 2017: ncatted -O -a history,pr,d,, pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc\n", "Fri Dec 01 08:01:43 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/CSIRO-Mk3-6-0/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_185001-200512.nc pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc\n", "2011-07-27T02:26:04Z CMOR rewrote data to comply with CF standards and CMIP5 requirements.\n" ] } ], "source": [ "import xarray as xr\n", "\n", "csiro_pr_file = '../data/pr_Amon_CSIRO-Mk3-6-0_historical_r1i1p1_200101-200512.nc'\n", "dset = xr.open_dataset(csiro_pr_file)\n", "\n", "print(dset.attrs['history'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fortunately, there is a Python package called [cmdline-provenance](http://cmdline-provenance.readthedocs.io/en/latest/)\n", "that creates NCO/CDO-style records of what was executed at the command line.\n", "We can use it to generate a new command line record:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Tue Nov 03 07:23:59 2020: C:\\Users\\yorksea\\anaconda3\\python.exe C:\\Users\\yorksea\\anaconda3\\lib\\site-packages\\ipykernel_launcher.py -f C:\\Users\\yorksea\\AppData\\Roaming\\jupyter\\runtime\\kernel-4a6f7305-ff2a-46f4-b58a-c2e3ed534d8f.json\n", "\n" ] } ], "source": [ "#Example: This is the command that was run to launch the jupyter notebook we’re using.\n", "import cmdline_provenance as cmdprov\n", "new_record = cmdprov.new_log()\n", "print(new_record)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to create our own entry for the history attribute, \n", "we'll need to be able to create a: \n", "\n", "* Time stamp\n", "* Record of what was entered at the command line in order to execute `plot_precipitation_climatology.py`\n", "* Method of indicating which verion of the script was run (i.e. 
because the script is in our git repository)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Time stamp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A library called `datetime` can be used to find out the time and date right now:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fri Dec 08 14:05:17 2017\n" ] } ], "source": [ "import datetime\n", " \n", "time_stamp = datetime.datetime.now().strftime(\"%a %b %d %H:%M:%S %Y\")\n", "print(time_stamp)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `strftime` method can be used to customise the appearance of a datetime object;\n", "in this case we've made it look just like the other time stamps in our data file." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Command line record" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `sys.argv` list, which is what the `argparse` library is built on top of, contains all the arguments entered by the user at the command line:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py', '-f', '/Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json']\n" ] } ], "source": [ "import sys\n", "print(sys.argv)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In launching this IPython notebook,\n", "you can see that a command line program called `ipykernel_launcher.py` was run. \n", "To join all these list elements up, \n", "we can use the `join` method that belongs to Python strings:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json\n" ] } ], "source": [ "args = \" \".join(sys.argv)\n", "print(args)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While this list of arguments is very useful, \n", "it doesn't tell us which Python installation was used to execute those arguments. \n", "The `sys` library can help us out here too:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/Applications/anaconda/envs/pyaos-lesson/bin/python\n" ] } ], "source": [ "exe = sys.executable\n", "print(exe) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Git hash" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the lesson on version control using git\n", "we learned that each commit is associated with a unique 40-character identifier known as a hash. 
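" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a quick check (a sketch only, assuming this notebook is being run from inside that git repository and that the `git` command is available), we can view the hash of the most recent commit at the command line:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show the full 40-character hash of the most recent commit\n", "# (assumes this notebook's working directory is inside the git repository)\n", "!git log -1 --format=%H" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "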
\n", "We can use the git Python library to get the hash associated with the script:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "588f96dcab5c78d10b4c994eb3ca67955c882697\n" ] } ], "source": [ "import git\n", "import os\n", " \n", "repo_dir = '/Users/irv033/Documents/volunteer/teaching' \n", "#repo_dir = os.getcwd()\n", "git_hash = git.Repo(repo_dir).heads[0].commit\n", "print(git_hash)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Putting it all together\n", "\n", "We can now put all this together into a function that generates our history record," ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def get_history_record(repo_dir):\n", " \"\"\"Create a new history record.\"\"\"\n", "\n", " time_stamp = datetime.datetime.now().strftime(\"%Y-%m-%dT%H:%M:%S\")\n", " exe = sys.executable\n", " args = \" \".join(sys.argv)\n", " git_hash = git.Repo(repo_dir).heads[0].commit\n", "\n", " entry = \"\"\"%s: %s %s (Git hash: %s)\"\"\" %(time_stamp, exe, args, str(git_hash)[0:7])\n", " \n", " return entry" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d)\n" ] } ], "source": [ "new_history = get_history_record('/Users/irv033/Documents/volunteer/teaching')\n", "\n", "print(new_history)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "which can be combined with the previous history to compile a record that goes all the way back to when we obtained the original data file:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2017-12-08T14:05:34: /Applications/anaconda/envs/pyaos-lesson/bin/python /Applications/anaconda/envs/pyaos-lesson/lib/python3.6/site-packages/ipykernel_launcher.py -f /Users/irv033/Library/Jupyter/runtime/kernel-7183ce41-9fd9-4d30-9e46-a0d16bc9bd5e.json (Git hash: 588f96d) \n", " Fri Dec 8 10:05:47 2017: ncatted -O -a history,pr,d,, pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc\n", "Fri Dec 01 07:59:16 2017: cdo seldate,2001-01-01,2005-12-31 /g/data/ua6/DRSv2/CMIP5/ACCESS1-3/historical/mon/atmos/r1i1p1/pr/latest/pr_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc pr_Amon_ACCESS1-3_historical_r1i1p1_200101-200512.nc\n", "CMIP5 compliant file produced from raw ACCESS model output using the ACCESS Post-Processor and CMOR2. 2012-02-08T06:45:54Z CMOR rewrote data to comply with CF standards and CMIP5 requirements. Fri Apr 13 09:55:30 2012: forcing attribute modified to correct value Fri Apr 13 12:13:10 2012: updated version number to v20120413. 
Fri Apr 13 12:29:34 2012: corrected model_id from ACCESS1-3 to ACCESS1.3\n" ] } ], "source": [ "# previous_history contains the history attribute read from the input data file\n", "complete_history = '%s \n %s' %(new_history, previous_history)\n", "\n", "print(complete_history)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(Note that in a real example of this process in action, the new history would refer to what was entered at the command line to run `plot_precipitation_climatology.py`, as opposed to the `ipykernel_launcher.py` command used to launch this notebook.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Writing your own modules\n", "\n", "We could place this new `get_history_record()` function directly into the `plot_precipitation_climatology.py` script, but there's a good chance we'll want to use it in many scripts that we write in the future. In the functions lesson we discussed all the reasons why code duplication is a bad thing, and it's the same principle here. The solution is to place the `get_history_record()` function in a separate script full of functions (which is called a module) that we use regularly across many scripts. \n", "\n", "(A slight modification has been added to `get_history_record()` so that the `repo_dir` isn't hard-wired into the code. Instead, the script defines `repo_dir` as the current working directory, which is assumed to be the top of the directory tree in the git repository, as that's the input information required by `git.Repo`.)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"\"\"\r\n", "A collection of commonly used functions for data provenance\r\n", "\r\n", "\"\"\"\r\n", "\r\n", "import sys\r\n", "import datetime\r\n", "import git\r\n", "import os\r\n", "\r\n", "\r\n", "def get_history_record():\r\n", " \"\"\"Create a new history record.\"\"\"\r\n", "\r\n", " time_stamp = datetime.datetime.now().strftime(\"%a %b %d %H:%M:%S %Y\")\r\n", " exe = sys.executable\r\n", " args = \" \".join(sys.argv)\r\n", " \r\n", " repo_dir = os.getcwd()\r\n", " try:\r\n", " git_hash = git.Repo(repo_dir).heads[0].commit\r\n", " except git.exc.InvalidGitRepositoryError:\r\n", " print('To record the git hash, must run script from top of directory tree in git repo')\r\n", " git_hash = 'unknown'\r\n", " \r\n", " entry = \"\"\"%s: %s %s (Git hash: %s)\"\"\" %(time_stamp, exe, args, str(git_hash)[0:7])\r\n", " \r\n", " return entry" ] } ], "source": [ "!cat provenance.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then import that module and use it in all of our scripts." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "import provenance" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first line of a module file is similar to the first line of a function - if you enter a string, it will be picked up by the help generator."
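 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before looking at that help information, we can check that the imported module behaves as expected. This is just a sketch - the exact record printed will depend on how this notebook was launched, and the git hash will be reported as unknown unless the notebook's working directory is the top of a git repository:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Generate a new history record using the imported module\n", "# (the git hash will be 'unknown' if we aren't at the top of a git repository)\n", "new_history = provenance.get_history_record()\n", "print(new_history)"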
] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on module provenance:\n", "\n", "NAME\n", " provenance - A collection of commonly used functions for data provenance\n", "\n", "FUNCTIONS\n", " get_history_record(repo_dir)\n", " Create a new history record.\n", "\n", "FILE\n", " /Users/irv033/Documents/volunteer/teaching/amos-icshmo/provenance.py\n", "\n", "\n" ] } ], "source": [ "help(provenance)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function get_history_record in module provenance:\n", "\n", "get_history_record(repo_dir)\n", " Create a new history record.\n", "\n" ] } ], "source": [ "help(provenance.get_history_record)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Challenge\n", "\n", "Import the new `provenance` module into your `plot_precipitation_climatology.py` script and use it to record the complete history of the output figure.\n", "\n", "Things to consider:\n", "- For command line programs that output a netCDF file, the history record is typically added to the global history attribute. In this case the output is a `.png` file, so it will be necessary to have `plot_precipitation_climatology.py` output a `.txt` file that contains the history information (it's usually easiest for this metadata file to have exactly the same name as the figure file, just with a `.txt` instead of `.png` file extension)\n", "- Do you need to record the history of the land surface fraction file, or just the precipitation file?\n", "\n", "Don't forget to commit your changes to git and push to GitHub." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }