{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## All graphs dealing with any billing data are broken!\n", "0. Be sure to run this notebook from the notebook/ directory!\n", "1. Session length - general stats on how long sessions are: means,\n", "percentiles, etc.\n", "2. User 'profile' - how many 'kinds' of users do we have? Some who\n", "just pop in once? Some who pop in a few times a week for a fixed\n", "amount of time? Some who are there all the time? What kinds of user\n", "clusters do we have?\n", "3. Who are those people using it 400 times in a semester?\n", "4. Are there user correlations? Can we spot 'groups' of users with\n", "similar behavior? What kind of behavior is it? etc." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from datetime import datetime\n", "import altair as alt\n", "from IPython import display" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ThemeRegistry.enable('my-chart')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set the altair theme: larger axis labels and titles\n", "def my_theme(*args, **kwargs):\n", "    return {'config': {'axis': {'labelFontSize': 20, 'titleFontSize': 20}}}\n", "alt.themes.register('my-chart', my_theme)\n", "alt.themes.enable('my-chart')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Params and functions" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "semester_start = pd.Timestamp('2023-01-10').tz_localize('US/Pacific')\n", "semester_end = pd.Timestamp('2023-05-12').tz_localize('US/Pacific')" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "def convert_tz(series):\n", "    # Treat the naive timestamps as UTC, then convert them to US/Pacific\n", "    series = series.dt.tz_localize('UTC')\n", "    return series.dt.tz_convert('US/Pacific')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 
Load data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## User session data" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Log data for user activity\n", "path_sessions = '../data/processed/spring-2023/user-sessions.jsonl'\n", "sessions = pd.read_json(path_sessions, convert_dates=['start', 'stop'])\n", "\n", "for col in ['start', 'stop']:\n", "    sessions[col] = convert_tz(sessions[col])" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "263833\n", "261231\n" ] } ], "source": [ "print(len(sessions))\n", "# Only between start and end of semester.\n", "# Combine both conditions into one mask; chaining two boolean indexers\n", "# triggers a reindexing UserWarning.\n", "sessions = sessions[(sessions['start'] > semester_start) & (sessions['start'] < semester_end)]\n", "print(len(sessions))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## BROKEN -- Cost per day" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cost = pd.read_json('../data/processed/fall-2018/cloud-costs.jsonl', lines=True)\n", "cost['start_time'] = convert_tz(cost['start_time'])\n", "cost = cost.drop(columns=['end_time']).set_index('start_time')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Only between start and end of semester (single combined mask)\n", "cost = cost[(cost.index > semester_start) & (cost.index < semester_end)]\n", "# We only index by timestamp to make the date filtering easier.\n", "# After that, we drop the index to make everything else easier\n", "cost = cost.reset_index()\n", "\n", "# Fill in any dates missing between the semester start and the first cost record\n", "missing_dates = pd.date_range(semester_start, 
cost.start_time.min(), name='start_time')\n", "# One row per missing date, with the cost left as NaN\n", "missing_dates_cost = pd.DataFrame({'start_time': missing_dates, 'cost': np.nan})\n", "# DataFrame.append was removed in pandas 2.0; use pd.concat instead\n", "cost = pd.concat([cost, missing_dates_cost])" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Viz and analysis" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## Daily active users\n", "\n", "We count someone as a 'daily active user' if they start or stop their notebook server\n", "at least once on a given day. Due to anonymization techniques applied earlier, this might\n", "slightly undercount users." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Unique daily users - we count anyone who started a session at least once that day\n", "# We want a dataframe with no index so we can use it easily with Altair\n", "daily_active_users = pd.DataFrame(sessions.set_index('start')['user'].resample('D').nunique()).reset_index()" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "
\n", "" ], "text/plain": [ "alt.Chart(...)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt.Chart(daily_active_users, width=900).mark_line().encode(\n", " x='start',\n", " y='user'\n", ")" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " | start | \n", "
---|---|
count | \n", "12257.000000 | \n", "
mean | \n", "21.312801 | \n", "
std | \n", "21.201713 | \n", "
min | \n", "1.000000 | \n", "
25% | \n", "3.000000 | \n", "
50% | \n", "15.000000 | \n", "
75% | \n", "34.000000 | \n", "
max | \n", "193.000000 | \n", "