{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"
\n",
"\n",
"
\n",
"\n",
"
\n",
"\n",
"\n",
"\n",
"# Exploring NSERC Awards Data\n",
"\n",
"\n",
"Canada's [Open Government Portal](http://open.canada.ca/en) includes [NSERC Awards Data](http://open.canada.ca/data/en/dataset/c1b0f627-8c29-427c-ab73-33968ad9176e) from 1995 through 2016.\n",
"\n",
"The awards data (in .csv format) were copied to an [Amazon Web Services S3 bucket](http://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html). This open Jupyter notebook shows the large trends over the data and compares entire selections with each other.\n",
"\n",
"> **Acknowledgement:** I thank [Ian Allison](https://github.com/ianabc) and [James Colliander](http://colliand.com) of the [Pacific Institute for the Mathematical Sciences](http://www.pims.math.ca/) for building the [JupyterHub service](http://syzygy.ca) and for help with this notebook. -- I. Heisz"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1995\n",
"1996\n",
"1997\n",
"1998\n",
"1999\n",
"2000\n",
"2001\n",
"2002\n",
"2003\n",
"2004\n",
"2005\n",
"2006\n",
"2007\n",
"2008\n",
"2009\n",
"2010\n",
"2011\n",
"2012\n",
"2013\n",
"2014\n",
"2015\n",
"2016\n"
]
},
{
"data": {
"text/plain": [
"Index(['Name', 'Department', 'OrganizationID', 'Institution', 'ProvinceEN',\n",
" 'CountryEN', 'FiscalYear', 'AwardAmount', 'ProgramID', 'Committee'],\n",
" dtype='object')"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import sys\n",
"\n",
"## Bring in a selection of the NSERC awards data starting with 1995 and ending with 2016.\n",
"## Throw away as much as you can to keep the DataFrame small enough to manipulate using a laptop.\n",
"\n",
"df = pd.DataFrame()\n",
"\n",
"startYear = 1995\n",
"endYear = 2017 # The last year is not included, so if it was 2017 it means we include the 2016 collection but not 2017.\n",
"\n",
"for year in range(startYear, endYear):\n",
" file = 'https://s3.ca-central-1.amazonaws.com/open-data-ro/NSERC/NSERC_GRT_FYR' + str(year) + '_AWARD.csv.gz'\n",
" df = df.append(pd.read_csv(file, \n",
" compression='gzip', \n",
" usecols = [1, 2, 3, 4, 5, 7, 9, 11, 12, 17], \n",
" encoding='latin-1'\n",
" )\n",
" )\n",
" print(year)\n",
" \n",
"## Rename columns for better readability.\n",
"df.columns = ['Name', 'Department', 'OrganizationID',\n",
" 'Institution', 'ProvinceEN', 'CountryEN',\n",
" 'FiscalYear', 'AwardAmount', 'ProgramID',\n",
" 'Committee'] \n",
"\n",
"## Strip out any leading or trailing whitespace\n",
"df.columns.str.strip()"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"DataFrame: 207.20 Mb\n"
]
}
],
"source": [
"## Quantify data stored in memory.\n",
"print(\"DataFrame: {:4.2f} Mb\".format(sys.getsizeof(df) / (1024. * 1024)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Check the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Name 457894\n",
"Department 457677\n",
"OrganizationID 457896\n",
"Institution 457896\n",
"ProvinceEN 454115\n",
"CountryEN 457896\n",
"FiscalYear 457896\n",
"AwardAmount 457896\n",
"ProgramID 457896\n",
"Committee 457896\n",
"dtype: int64\n"
]
},
{
"data": {
"text/html": [
"
| \n", " | Name | \n", "Department | \n", "OrganizationID | \n", "Institution | \n", "ProvinceEN | \n", "CountryEN | \n", "FiscalYear | \n", "AwardAmount | \n", "ProgramID | \n", "Committee | \n", "
|---|---|---|---|---|---|---|---|---|---|---|
| 16700 | \n", "NaN | \n", "Head Office | \n", "11846 | \n", "Pacific Institute for the Mathematical Sciences | \n", "British Columbia | \n", "CANADA | \n", "1999 | \n", "40373 | \n", "MISPJ | \n", "103 | \n", "
| 16701 | \n", "NaN | \n", "Head Office | \n", "14687 | \n", "National Engineering Week | \n", "Ontario | \n", "CANADA | \n", "1999 | \n", "7500 | \n", "MISPJ | \n", "103 | \n", "