{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![alt text](http://library.columbia.edu/content/libraryweb/locations/business/_jcr_content/layout_featuredcontent/image.img.jpg/1369336790208.jpg)\n", "#What do we know about Density's data?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import datetime\n", "import seaborn as sns\n", "import pylab as plt\n", "import numpy as np\n", "import requests\n", "%matplotlib inline \n", "plt.rcParams.update({'font.size': 28})\n", "sns.set_context(\"talk\")\n", "sns.set_style('darkgrid')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##[CUIT](https://cuit.columbia.edu/) has supplied us with our data set\n", "\n", "###We receive device counts from specific routers aggregated into 15 minute intervals\n", "\n", "Devices connect to wifi routers across Columbia's campus. CUIT aggregates device collection data by building and by floor and supplies with a live count of the number of devices connected at each location every 15 minutes.\n", "\n", "Below is an example query from the API showing how data is shared publicly" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
client_countgroup_idgroup_nameparent_idparent_namepercent_full
dump_time
2014-10-10 21:15:003125John Jay Dining Hall75John Jay1.500000
2014-10-10 21:15:004155JJ's Place75John Jay3.100775
2014-10-10 21:00:001125John Jay Dining Hall75John Jay0.500000
2014-10-10 21:00:002155JJ's Place75John Jay1.550388
2014-10-10 20:45:008125John Jay Dining Hall75John Jay4.000000
\n", "
" ], "text/plain": [ " client_count group_id group_name parent_id \\\n", "dump_time \n", "2014-10-10 21:15:00 3 125 John Jay Dining Hall 75 \n", "2014-10-10 21:15:00 4 155 JJ's Place 75 \n", "2014-10-10 21:00:00 1 125 John Jay Dining Hall 75 \n", "2014-10-10 21:00:00 2 155 JJ's Place 75 \n", "2014-10-10 20:45:00 8 125 John Jay Dining Hall 75 \n", "\n", " parent_name percent_full \n", "dump_time \n", "2014-10-10 21:15:00 John Jay 1.500000 \n", "2014-10-10 21:15:00 John Jay 3.100775 \n", "2014-10-10 21:00:00 John Jay 0.500000 \n", "2014-10-10 21:00:00 John Jay 1.550388 \n", "2014-10-10 20:45:00 John Jay 4.000000 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "prms = {'auth_token': 'OCAPDPVGRZ9B7L0YJ4F44VJCYTN6QIKM'}\n", "url = 'http://density.adicu.com/window/2014-10-10T08:00/2014-10-10T21:30/building/75'\n", "r = requests.get(url, params=prms)\n", "data = pd.DataFrame().from_dict(r.json()['data'])\n", "jj = pd.DataFrame().append(data)\n", "nxt = requests.get(r.json()['next_page'])\n", "jj = jj.append(nxt.json()['data']).reset_index(drop=True)\n", "jj['dump_time'] = pd.to_datetime(jj['dump_time'])\n", "jj = jj.set_index('dump_time')\n", "jj.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll load our demo data in its original format to analyze long-term trends" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [], "source": [ "raw = pd.read_csv('../scripts/demo_data.csv')\n", "buildings = pd.read_csv('../scripts/buildings.csv', index_col = 0, header = None, squeeze = True)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "raw['dump_time'] = pd.to_datetime(raw['dump_time'])\n", "raw = raw.set_index('dump_time')" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "ename": "NameError", "evalue": "name 'raw' is not defined", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mraw\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mNameError\u001b[0m: name 'raw' is not defined" ] } ], "source": [ "raw.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "len(raw.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#What spaces do we have?\n", "\n", "##We have various Libraries, John Jay, and Lerner" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print buildings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#Just how full is a study space?\n", "\n", "\n", "\n", "![alt text](http://www.hercampus.com/sites/default/files/2013/12/02/4192655241_df21ba1706.jpg)\n", "\n", "\n", "##There's an obvious difference beteween a space that is unusually crowded like during orgo night, and a typical busy day\n", "\n", "![alt text](http://library.columbia.edu/content/libraryweb/locations/butler/_jcr_content/layout_featuredcontent/image.img.jpg/1368472856821.jpg)\n", "\n", "##Density assumes that capacity is relative to fullness on an average day, not on outlier days\n", "\n", "##How do we do that? By considering what the average day like at each space " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "raw['hrfrac'] = raw.index.hour + raw.index.minute/60." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def dayplot(plotfn, title):\n", " f, ax = plt.subplots(figsize=(9, 7))\n", " plotfn.plot(ax=ax)\n", " plt.title(title)\n", " plt.tight_layout()\n", " plt.xticks([0,3,6,9,12,15,18,21])\n", " plt.ylabel('Devices')\n", " plt.xlabel('Time of Day')\n", " return ax" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##On an average day at Columbia, device count grows until it stablizes for the day" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "totalday = pd.crosstab(raw.hrfrac, raw.index.date, values=raw.client_count, aggfunc=pd.Series.sum)\n", "total = raw.reset_index().groupby('dump_time').agg({'client_count':'sum', 'hrfrac':'mean'}).reset_index()\n", "\n", "ax = dayplot(totalday.mean(axis=1), 'Average Number of Devices at Columbia')\n", "palette = sns.color_palette()\n", "low = total.groupby(\"hrfrac\")['client_count'].apply(np.percentile, 25)\n", "high = total.groupby(\"hrfrac\")['client_count'].apply(np.percentile, 75)\n", "ax.fill_between(low.index, low, high, alpha=.2, color=palette.pop(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#Capacity is the level at which population growth stabilizes\n", "\n", "###For the capacity of whole campus of Columbia, it would be the following" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def find_capacity_overall(total_count):\n", " avg_day = total_count.mean(axis=1)\n", " return avg_day.max()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "print find_capacity_overall(totalday)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#Different spaces have different properties" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tabs = pd.crosstab(raw.hrfrac, raw.group_name, raw.client_count, aggfunc=pd.Series.mean)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def bldg_plot(bldg, ci=False):\n", " floors = list(buildings[buildings == bldg].index)\n", " title = 'Average Daily Device Count in %s' % bldg\n", " ax = dayplot(tabs[floors], title)\n", " palette = sns.color_palette()\n", " if ci:\n", " for f in floors:\n", " low = raw[raw['group_name'] == f].groupby('hrfrac')['client_count'].apply(np.percentile, 25)\n", " high = raw[raw['group_name'] == f].groupby('hrfrac')['client_count'].apply(np.percentile, 75)\n", " ax.fill_between(low.index, low, high, alpha=.2, color=palette.pop(0))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##For example, Butler library behaves very differently from John Jay:\n", "\n", "###This is Butler:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bldg_plot('Butler')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "People tend to come in after lunch, take a break for dinner, and work late into the night.\n", "\n", "###This is John Jay:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "bldg_plot('John Jay')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dining halls are most crowded around mealtimes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "#How does campus population vary over time?\n", "\n", "##We can find differences in population based on changes in semesters, exams, and holidays" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f, ax = plt.subplots(figsize=(10, 7))\n", "raw.groupby(raw.index.date).client_count.sum().plot(ax=ax)\n", "plt.tight_layout()\n", "plt.title('Total Number of Devices Across Columbia')\n", "plt.ylabel('Number of Devices')\n", "plt.xlabel('Date')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There's a clear weekly cycle in the number of devices connected across campus.\n", "\n", "##Let's remove the weekly cycle by grouping the data into weeks" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "raw['date'] = raw.index.date\n", "raw['weekofyear'] = raw.index.weekofyear\n", "weeks = raw['2014-06-30':'2015-4-19'].groupby('weekofyear').client_count.sum()\n", "weeks.index = list(weeks[weeks.index < 27].index + 52) + list(weeks.index[16:])\n", "weeks = weeks.sort_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "###Smoothing out the data by grouping by week allows us to see differences in semesters and holidays more clearly\n", "\n", "E.g. the big drop around Thanksgiving and the following spike for Finals" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [ "f, ax = plt.subplots(figsize=(9, 6))\n", "weeks.index = raw[raw.index.dayofweek == 0]['2014-06-30':'2015-04-19']['date'].drop_duplicates()\n", "weeks.plot(ax=ax)\n", "plt.tight_layout()\n", "plt.title('Weekly Device Count Across Comumbia')\n", "plt.ylabel('Total Number of Devices')\n", "plt.savefig('weekly_count.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Aggregating the data by week allows us to see that there's a jump in the number of devices when the semester starts. We also see dips for fall break, spring break, and Thanksgiving. Thanksgiving is followed by a spike for finals. Winter break is also visible following the dramatic drop in device count.\n", "\n", "---\n", "\n", "##How does day of the week affect study habits?\n", "\n", "###People don't study on weekends:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "f, ax = plt.subplots(figsize=(9, 7))\n", "dailycounts = raw.groupby(raw.date).client_count.sum()\n", "dailycounts.index = pd.to_datetime(dailycounts.index)\n", "dailycounts.groupby(dailycounts.index.dayofweek).mean().plot(ax=ax)\n", "plt.tight_layout()\n", "plt.title('Average Count Across Columbia')\n", "plt.ylabel('Average Daily Device Count')\n", "plt.xlabel('Day of Week')\n", "weekday = 'Mon Tue Wed Thu Fri Sat Sun'.split()\n", "plt.xticks(range(7), weekday)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "reshaped = pd.crosstab(raw.index, raw.group_name, raw.client_count, aggfunc=pd.Series.sum)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##We can also find days with unusually high number of devices" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "max_days = [d.strftime('%Y-%m-%d') for d in reshaped.apply(lambda x: x.argmax())]\n", "max_days = pd.Series(max_days, index = reshaped.columns)\n", "max_day_table = pd.DataFrame(columns = max_days.index, index = sorted(raw.hrfrac.drop_duplicates()))\n", "\n", "for d in max_days.index:\n", " data = reshaped[d].ix[max_days[d]]\n", " data = reshaped[d].ix[max_days[d]]\n", " data.index = data.index.hour + data.index.minute / 60.\n", " max_day_table[d] = data" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "str(max_days.ix['Roone Arledge Auditorium'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def maxplot(bldg):\n", " dayplot(max_day_table[bldg], 'Max Device Counts in %s' % (bldg))\n", " labels = str(max_days.ix[bldg])\n", " print labels \n", " plt.legend([labels], loc = 2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our plots show the days with the max. number of devices recorded at any time at that location. Using these graphs, we can identify special days during the school year.\n", "\n", "##Like NSOP\n", "\n", "We see multiple spikes in Roone Arledge Auditorium around the same time as orientation. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "maxplot('Roone Arledge Auditorium')\n", "plt.savefig('nsop.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##And Orgo Night!\n", "\n", "Notice the spike in Butler 2 around midnight." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "maxplot('Butler Library 2')\n", "plt.savefig('orgonight.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##For more tools with the Density API, check out Chris Mulligan's [Relative Density App!](https://chmullig.shinyapps.io/relative_density/)\n", "\n", "#Check out our [API](http://density.adicu.com/docs) to use our data!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }