{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Create a Graph with Pandas and Matplotlib\n", "\n", "[Pandas](https://pandas.pydata.org/) is a \"library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.\" I use it instead of dumping data out to Excel. For me, having data manipulations well documented and repeatable is much easier to accomplish through Python code than a set of instructions writen in Word and completed in Excel. I know Excel is great, and I've used it for years, but it has become uneeded for most of my work now thanks to Pandas. Recognizing Excel is still in heavy use, Pandas does provide easy utilities to read and write to Excel files.\n", "\n", "We won't get deep into the data analysis part of Pandas and what it can really do in this notebook. We are going to create a simple graph representing aged WIP for a single timekeeper. Such a graph can be published to a static report, sent via email, or embedded in a PDF as part of a timekeeper's factsheet for the compensation committee.\n", "\n", "As before, we can easily install the latest Pandas and Matplotlib libraries if they are not already installed." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!{sys.executable} -m pip install --upgrade pandas\n", "!{sys.executable} -m pip install --upgrade matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From our [Interacting with SQL](Interacting%20with%20SQL.ipynb) notebook, a reminder:\n", "\n", "> As we're publishing this to the web, I've placed my server name, database name, user and password in environment variables so they don't appear here in the Notebook. I use Linux and Systemd to start my Jupyter Hub and set the environment variables in my /etc/systemd/system/jupyter.service file. In my server's jupyterhub_config.py file I add any of the variables I want exposed to the individual users in the c.Spawner.env_keep option. Any users that I'm opening up Jupyter Hub to likely know these read only accounts anyway as they've had some SQL need in the past. Generally this would be other developers, report writers, and analysts." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import pyodbc \n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "elite_server = os.environ['ELITE_PROD_HOST']\n", "elite_db = os.environ['ELITE_PROD_DB']\n", "elite_user = os.environ['ELITE_RO_NAME']\n", "elite_pass = os.environ['ELITE_RO_PASS']\n", "\n", "conn_str = (f'DRIVER={{FreeTDS}};SERVER={elite_server};'\n", " f'PORT=1433;DATABASE={elite_db};UID={elite_user};PWD={elite_pass};'\n", " f'TDS_Version=8.0;ClientCharset=UTF-8')\n", "e3_conn = pyodbc.connect(conn_str)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The SQL statement below will create some typical aging buckets. You could link out to `AgingScheme` and `AgingSchemeBuckets` and build this dynamically, however, for easier reading and potentially better performance, I'll present it like this and leave the specifics for your firm up to to you. \n", "\n", "There is one parameter in the statement, and that's to hold the timekeeper index (not number)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "aged_wip_sql = \"\"\"\n", "declare @tkpr as integer = ? -- The timekeeper index [this can be optional]\n", "declare @today as datetime = convert(datetime, floor(convert(float, getdate())))\n", "\n", "select \n", "timekeeper.Number, \n", "timekeeper.DisplayName, \n", "sum(case when aging_grouped.bucket = 1 then sum_WIPAmt else 0 end) '0-30',\n", "sum(case when aging_grouped.bucket = 2 then sum_WIPAmt else 0 end) '31-60',\n", "sum(case when aging_grouped.bucket = 3 then sum_WIPAmt else 0 end) '61-90',\n", "sum(case when aging_grouped.bucket = 4 then sum_WIPAmt else 0 end) '91-120',\n", "sum(case when aging_grouped.bucket = 5 then sum_WIPAmt else 0 end) 'Over 120',\n", "sum(sum_WIPAmt) 'Total'\n", "from (\n", " select aging_detail.Timekeeper, aging_detail.bucket, sum(aging_detail.WIPAmt) sum_WIPAmt \n", " from (\n", " select timecard.Timekeeper, timecard.WIPAmt,\n", " case \n", " when DATEDIFF(day,timecard.workdate,@today) < 31 then 1\n", " when DATEDIFF(day,timecard.workdate,@today) between 31 and 60 then 2\n", " when DATEDIFF(day,timecard.workdate,@today) between 61 and 90 then 3\n", " when DATEDIFF(day,timecard.workdate,@today) between 91 and 120 then 4\n", " else 5\n", " end 'bucket'\n", " from timecard\n", " where timecard.IsActive = 1 and timecard.WIPRemoveDate is null and timecard.IsNB = 0 and timecard.WorkAmt <> 0\n", " and timecard.Timekeeper = @tkpr -- comment this line if we want all timekeepers\n", " ) aging_detail\n", " group by aging_detail.Timekeeper, aging_detail.bucket\n", ") aging_grouped\n", "join timekeeper on timekeeper.TkprIndex = aging_grouped.Timekeeper\n", "group by \n", "timekeeper.Number, \n", "timekeeper.DisplayName\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# change this to an index for a timekeeper in your firm\n", "timekeeper_index = 1234" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is where Pandas starts to make things easier. We create a `dataframe` to house the results from the query above and simultaneously query the database with the specified Timekeeper using `read_sql`. We set the index of the dataframe to the Timekeeper's Number." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_agedwip = pd.read_sql(aged_wip_sql, e3_conn, params=(timekeeper_index,), index_col='Number')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to see what that `dataframe` has in it we can view it like this. The data may get truncated if you have many columns or many rows. This technique of displaying is mostly to ensure you have the type of data you were expecting." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df_agedwip" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So there's our data. \n", "\n", "But I promised a chart. We will use `matplotlib` for this. Pandas' `plot` method is a wrapper around matplotlib's `pyplot.plot()` so when looking at the documentation, the two have many interchangeable things, but some are not available in the wrapper and you need to use calls like `plt.show()` or `plt.savefig()`. There are lots of options and settings, but by example here is how a bar chart could be created. The [main function `plot`](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) manages the visualization. \n", "\n", "Like much of the time, the data isn't exactly as we need it. Let's massage it a little bit. We don't want the `Total` amount or the `DisplayName` so we `drop` those first. The `plot` method of the dataframe expects the data with the categories as rows, so do need a call to `transpose` which is much like Excel's [Paste Special -> Transpose](https://support.office.com/en-us/article/transpose-rotate-data-from-rows-to-columns-or-vice-versa-3419f2e3-beab-4318-aae5-d0f862209744) to flip flop or rotate the data. Then we `plot` it.\n", "\n", "It may appear complicated on the first look, but we could have structured the SQL statement to have exactly the format we needed, but I didn't, and thought it made for an opportunity to show how to chain some methods together to quicly alter the dataset being plotted. The beauty of Pandas is that it efficiently handles large datasets in the same way." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# good habit to close any existing figures\n", "plt.close('all') \n", "\n", "# pick some colors for the bars\n", "colors = ['#8BC34A', '#7CB342', '#FFB300', '#E53935', '#B71C1C'] \n", "\n", "# build the plot\n", "ax = df_agedwip.drop(['Total','DisplayName'], axis=1).transpose().plot(kind=\"bar\", \n", " figsize=(5,2),\n", " color=[colors], \n", " legend=False, rot=0, grid=True )\n", "\n", "# label the axis\n", "ax.set_xlabel(\"Days\")\n", "ax.set_ylabel(\"Dollars\")\n", "\n", "# put the grid lines behind the colour bars\n", "ax.set_axisbelow(True) \n", "\n", "# personal preference to remove the top and right axix splines\n", "ax.spines['right'].set_color('none')\n", "ax.spines['top'].set_color('none')\n", "\n", "# every chart deserves a title\n", "ax.set_title(f\"Aged WIP (Days)\", fontdict={'fontsize': 14, 'fontweight': 'heavy'})\n", "\n", "# format the dollar amounts\n", "ax.set_yticklabels([f\"${label:,.0f}\" for label in ax.get_yticks() ])\n", "\n", "# add some text and kind of hide it over to the right giving the run time\n", "ax.text(1,.5, f'As at {pd.datetime.now():%Y-%b-%d %I:%M %p}', \n", " verticalalignment='center', \n", " rotation='vertical',\n", " fontdict={'fontstyle': 'italic', 'fontsize':7, 'color': '#cccccc'},\n", " transform=ax.transAxes\n", " )\n", "\n", "# let's take a look at our work!\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# for interest sake. Note that this doesn't change our original, only displays the output from the `transpose` method.\n", "df_agedwip.transpose()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n", "In this Notebook, we've set up a connection to our Elite database server, run a SQL query using Pandas to create a `dataframe` with the results, performed some simple 'massaging' of the data, and created a graph with the results.\n", "\n", "-30-" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 2 }