{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayesian Modelling in Python\n", "Mark Regan\n", "\n", "---\n", "\n", "### Section 0: Introduction\n", "Welcome to \"Bayesian Modelling in Python\" - a tutorial for those interested in learning Bayesian statistics in Python. You can find a list of all tutorial sections on the project's [homepage](https://github.com/markdregan/Hangout-with-PyMC3).\n", "\n", "Statistics is a topic that never resonated with me throughout my years in university. The frequentist techniques that we were taught (p-values, etc.) felt contrived, and ultimately I turned my back on statistics as a topic that I wasn't interested in.\n", "\n", "That was until I stumbled upon Bayesian statistics - a branch of statistics quite different from the traditional frequentist statistics that most universities teach. I was inspired by a number of publications, blogs and videos that I would highly recommend as a starting point for anyone new to Bayesian stats. They include:\n", "- [Doing Bayesian Data Analysis](http://www.amazon.com/Doing-Bayesian-Analysis-Second-Edition/dp/0124058884/ref=dp_ob_title_bk) by John Kruschke\n", "- [Python port](https://github.com/aloctavodia/Doing_Bayesian_data_analysis) of John Kruschke's examples by Osvaldo Martin\n", "- [Bayesian Methods for Hackers](https://github.com/CamDavidsonPilon/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers) provided me with a great source of inspiration to learn Bayesian stats. 
In recognition of this influence, I've adopted the same visual style as BMH.\n", "- [While My MCMC Gently Samples](http://twiecki.github.io/) blog by Thomas Wiecki\n", "- [Healthy Algorithms](http://healthyalgorithms.com/tag/pymc/) blog by Abraham Flaxman\n", "- [SciPy Tutorial 2014](https://github.com/fonnesbeck/scipy2014_tutorial) by Chris Fonnesbeck\n", "\n", "I created this tutorial in the hope that others find it useful and that it helps them learn Bayesian techniques, just as the resources above helped me. I'd welcome any corrections/comments/contributions from the community.\n", "\n", "---\n", "\n", "### Loading your Google Hangout chat data\n", "Throughout this tutorial, we will use a dataset containing all of my Google Hangout chat messages. I've removed the message content and anonymized my friends' names; the rest of the dataset is unaltered.\n", "\n", "If you'd like to use your own Hangout chat data whilst working through this tutorial, you can download it from [Google Takeout](https://www.google.com/settings/takeout/custom/chat). The Hangout data is downloadable in JSON format. After downloading, you can replace the `hangouts.json` file in the data folder.\n", "\n", "The JSON file is heavily nested and contains a lot of redundant information. 
Some of the key fields are summarized below:\n", "\n", "| Field | Description | Example |\n", "|-----------------|----------------------------------------------------------------|----------------------------------------------|\n", "| `conversation_id` | Conversation id representing the chat thread | Ugw5Xrm3ZO5mzAfKB7V4AaABAQ |\n", "| `participants` | List of participants in the chat thread | [Mark, Peter, John] |\n", "| `event_id` | Id representing an event such as chat message or video hangout | 7-H0Z7-FkyB7-H0au2avdw |\n", "| `timestamp` | Timestamp | 2014-08-15 01:54:12 |\n", "| `message` | Content of the message sent | Went to the local wedding photographer today |\n", "| `sender` | Sender of the message | Mark Regan |" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import json\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import seaborn.apionly as sns\n", "\n", "from datetime import datetime\n", "\n", "%matplotlib inline\n", "plt.style.use('bmh')\n", "colors = ['#348ABD', '#A60628', '#7A68A6', '#467821', '#D55E00', \n", " '#CC79A7', '#56B4E9', '#009E73', '#F0E442', '#0072B2']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The below code loads the json data and parses each message into a single row in a pandas DataFrame.\n", "> Note: the data/ directory is missing the hangouts.json file. You must download and add your own JSON file as described above. Alternatively, you can skip to the next section where I import an anonymized dataset." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/mregan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:75: FutureWarning: sort(columns=....) 
is deprecated, use sort_values(by=.....)\n" ] } ], "source": [ "# Import json data\n", "with open('data/Hangouts.json') as json_file:\n", "    json_data = json.load(json_file)\n", "\n", "# Generate map from gaia_id to real name\n", "def user_name_mapping(data):\n", "    user_map = {'gaia_id': ''}\n", "    for state in data['conversation_state']:\n", "        participants = state['conversation_state']['conversation']['participant_data']\n", "        for participant in participants:\n", "            if 'fallback_name' in participant:\n", "                user_map[participant['id']['gaia_id']] = participant['fallback_name']\n", "\n", "    return user_map\n", "\n", "user_dict = user_name_mapping(json_data)\n", "\n", "# Parse data into flat list\n", "def fetch_messages(data):\n", "    messages = []\n", "    for state in data['conversation_state']:\n", "        conversation_state = state['conversation_state']\n", "        conversation = conversation_state['conversation']\n", "        conversation_id = conversation['id']['id']\n", "        participants = conversation['participant_data']\n", "\n", "        # Collect a display name (or the gaia_id as a fallback) for each participant\n", "        all_participants = []\n", "        for participant in participants:\n", "            if 'fallback_name' in participant:\n", "                user = participant['fallback_name']\n", "            else:\n", "                # Scope to call G+ API to get name\n", "                user = participant['id']['gaia_id']\n", "            all_participants.append(user)\n", "        num_participants = len(all_participants)\n", "\n", "        for event in conversation_state['event']:\n", "            # Fall back to the raw gaia_id when the sender has no known name\n", "            try:\n", "                sender = user_dict[event['sender_id']['gaia_id']]\n", "            except KeyError:\n", "                sender = event['sender_id']['gaia_id']\n", "\n", "            # The raw timestamp is divided by 10**6 to convert it to seconds\n", "            timestamp = datetime.fromtimestamp(float(event['timestamp']) / 1e6)\n", "            event_id = event['event_id']\n", "\n", "            if 'chat_message' in event:\n", "                content = event['chat_message']['message_content']\n", "                if 'segment' in content:\n", "                    segments = content['segment']\n", "                    for segment in segments:\n", "                        if 'text' in segment:\n", "                            message = segment['text']\n", "                            message_length = len(message)\n", "                            message_type = segment['type']\n",
"                            if len(message) > 0:\n", "                                messages.append((conversation_id,\n", "                                                 event_id,\n", "                                                 timestamp,\n", "                                                 sender,\n", "                                                 message,\n", "                                                 message_length,\n", "                                                 all_participants,\n", "                                                 ', '.join(all_participants),\n", "                                                 num_participants,\n", "                                                 message_type))\n", "\n", "    messages.sort(key=lambda x: x[0])\n", "    return messages\n", "\n", "# Parse data into data frame\n", "cols = ['conversation_id', 'event_id', 'timestamp', 'sender',\n", "        'message', 'message_length', 'participants', 'participants_str',\n", "        'num_participants', 'message_type']\n", "\n", "# sort_values replaces the deprecated DataFrame.sort\n", "messages = pd.DataFrame(fetch_messages(json_data), columns=cols).sort_values(['conversation_id', 'timestamp'])" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| | conversation_id | event_id | timestamp | sender | message | message_length | participants | participants_str | num_participants | message_type | prev_timestamp | prev_sender | time_delay_seconds | time_delay_mins | day_of_week | year_month | is_weekend |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 10 | Ugw5Xrm3ZO5mzAfKB7V4AaABAQ | 7-H0Z7-FkyB7-HDBYj4KKh | 2014-08-15 03:44:12.840015 | Mark Regan | Thanks guys!!! | 14 | [Keir Alexander, Louise Alexander Regan, Mark ... | Keir Alexander, Louise Alexander Regan, Mark R... | 3 | TEXT | 2014-08-15 03:44:00.781653 | Keir Alexander | 12.0 | 1.0 | 4 | 2014-08 | 0 |\n", "