{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "# Anniversary Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## TOC:\n", "* [Description](#description)\n", "* [Parsing the info](#parsing_info)\n", " * [Facebook](#facebook)\n", " * [WhatsApp](#whatsapp)\n", "* [Data Preparation](#preparation)\n", "* [Data Exploration](#exploration)\n", " * [How many messages?](#how_many)\n", " * [Message exchange per day](#messages_day)\n", " * [Message exchange per week](#messages_week)\n", " * [Why FB gap in April?](#gap_april)\n", " * [Some FB landmarks](#fb_landmarks)\n", " * [Who said it first?](#first)\n", " * [Who said it more?](#more)\n", " * [LOVE distribution per week](#love_weeks)\n", " * [Any preferred days to communicate?](#preferred_days)\n", " * [What about the hours?](#preferred_hours)\n", "* [Further exploration: playing with words (NLP)](#nlp)\n", " * [Text normalization](#normalization)\n", " * [Word Frequency](#frequency)\n", " * [Bigrams](#bigrams)\n", " * [Trigrams](#trigrams)\n", " * [Sentiment analysis](#sentiment)\n", " * [Topic detection](#topic)\n", "* [Annex](#annex)\n", " * [Normalize](#normalize)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Description " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "My partner and I met under very random circumstances. Some call it fate! ❤️\n", "\n", "Anyway, after spending __only two__ days together we decided that our story couldn't end there. So, despite the fact that she went back to California, we tried to shorten the ~__6,000 mi__ (~9,500 km) that separated us through online conversations.\n", "\n", "We started using [Facebook Messenger](https://www.messenger.com/) for most of our communications. Later, we also started using [WhatsApp](https://www.whatsapp.com/), so there is a lot of data stored about us! 
\n", "\n", "As our first anniversary approaches, and because I'm a super data nerd, I decided to explore our conversations a little. \n", "\n", "This is not intended to be a sickeningly sweet Notebook. It's mostly a way to share how quickly one can obtain insights from data, but given its nature you might find yourself wading through some rainbow stuff 🌈 🦄. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "General imports:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from bs4 import BeautifulSoup\n", "import csv\n", "import datetime as dt\n", "import pandas as pd\n", "import numpy as np\n", "import re\n", "import string\n", "from collections import defaultdict\n", "from nltk.tokenize import word_tokenize\n", "from nltk.tokenize import WordPunctTokenizer\n", "from nltk.tokenize import sent_tokenize\n", "from nltk.corpus import stopwords\n", "from nltk.corpus import wordnet\n", "from nltk.sentiment.vader import SentimentIntensityAnalyzer\n", "from nltk.stem.porter import PorterStemmer\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "from gensim import corpora, models\n", "import gensim\n", "import seaborn as sns\n", "from wordcloud import WordCloud, ImageColorGenerator\n", "from PIL import Image\n", "sns.set_style(\"darkgrid\")\n", "sns.set_context(\"notebook\", font_scale=1.5, rc={\"lines.linewidth\": 2.5})\n", "import matplotlib.pyplot as plt\n", "import matplotlib.dates as mdates\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Parsing the info " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Facebook " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Facebook includes an option that allows you to download __ALL__ the information you have ever shared on its platform.\n", "\n", "In a folder called `messages` it stores all your Messenger conversations and shared files. 
The download consists of a set of `.html` files, one for every conversation you have ever had. \n", "\n", "The initial idea is to parse the `.html` files and store their contents in a `.csv` file to get rid of all the unnecessary tags." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "content_path = './messages/487.html'\n", "# Use a context manager so the file is closed automatically\n", "with open(content_path, 'r') as f:\n", "    raw_content = f.read()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "soup = BeautifulSoup(raw_content, 'html.parser')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "pretty_lines = soup.prettify().splitlines()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exploring the document structure a little, we realize that there is a lot of `css` content at the beginning." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[u'',\n", " u' ',\n", " u' ',\n", " u' ',\n", " u'