{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# A Python Tour of Data Science: Data Acquisition & Exploration \n", "\n", "[Michaël Defferrard](http://deff.ch), *PhD student*, [EPFL](http://epfl.ch) [LTS2](http://lts2.epfl.ch)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise: problem definition\n", "\n", "Theme of the exercise: **understand the impact of your communication on social networks**. A real-life situation: the marketing team needs help identifying which of the posts they made on social platforms were the most engaging, in order to prepare their next [AdWords](https://www.google.com/adwords/) campaign.\n", "\n", "This notebook is the second part of the exercise. Given the data we collected from Facebook and Twitter in the last exercise, we will construct an ML model and evaluate how well it predicts the number of likes of a post / tweet given its content." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1 Data importation\n", "\n", "1. Use `pandas` to import the `facebook.sqlite` and `twitter.sqlite` databases.\n", "2. Print the first 5 rows of both tables.\n", "\n", "The `facebook.sqlite` and `twitter.sqlite` SQLite databases can be created by running the [data acquisition and exploration exercise](01_sol_acquisition_exploration.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "from IPython.display import display\n", "import os.path\n", "\n", "folder = os.path.join('..', 'data', 'social_media')\n", "\n", "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2 Vectorization\n", "\n", "First step: transform the data into a format understandable by the machine. What to do with text?
A common choice is the so-called [*bag-of-words*](https://en.wikipedia.org/wiki/Bag-of-words_model) model, where we represent each word as an integer and simply count the number of appearances of each word in a document.\n", "\n", "**Example**\n", "\n", "Let's say we have a vocabulary represented by the following correspondence table.\n", "\n", "| Integer | Word |\n", "|:-------:|---------|\n", "| 0 | unknown |\n", "| 1 | dog |\n", "| 2 | school |\n", "| 3 | cat |\n", "| 4 | house |\n", "| 5 | work |\n", "| 6 | animal |\n", "\n", "Then we can represent the following document\n", "> I have a cat. Cats are my preferred animals.\n", "\n", "by the vector $x = [6, 0, 0, 2, 0, 0, 1]^T$.\n", "\n", "**Tasks**\n", "\n", "1. Construct a vocabulary of the 100 most frequent words in your dataset.\n", "2. Build a vector $x \\in \\mathbb{R}^{100}$ for each document (post or tweet).\n", "\n", "Tip: the natural language modeling libraries [nltk](http://www.nltk.org/) and [gensim](https://radimrehurek.com/gensim/) are useful for advanced operations. You don't need them here.\n", "\n", "A first *data cleaning* question arises: some of the text may be in French and some in English. What do we do?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "\n", "nwords = 100\n", "\n", "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Exploration question: what are the 5 most used words? Exploring your data while playing with it is a useful sanity check." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 3 Pre-processing\n", "\n", "1. The independent variables $X$ are the bags of words.\n", "2. The target $y$ is the number of likes.\n", "3. Split the data in half into training and testing sets."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 4 Linear regression\n", "\n", "Using `numpy`, fit and evaluate the [linear model](https://en.wikipedia.org/wiki/Linear_regression) $$\\hat{w}, \\hat{b} = \\operatorname*{arg min}_{w,b} \\| Xw + b - y \\|_2^2.$$\n", "\n", "Please define a class `LinearRegression` with two methods:\n", "1. `fit` learns the parameters $w$ and $b$ of the model given the training examples.\n", "2. `predict` returns the estimated number of likes of a post / tweet. It will be used to evaluate the model on the testing set.\n", "\n", "To evaluate the model, create an `accuracy(y_pred, y_true)` function which computes the mean squared error $\\frac1n \\| \\hat{y} - y \\|_2^2$.\n", "\n", "Hint: you may want to use the function `scipy.sparse.linalg.spsolve()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import scipy.sparse\n", "import scipy.sparse.linalg\n", "\n", "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interpretation: what are the most important words a post / tweet should include?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 5 Interactivity\n", "\n", "1. Create a slider for the number of words, i.e. the dimensionality of the samples $x$.\n", "2. Print the accuracy for each change on the slider." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import ipywidgets\n", "from IPython.display import clear_output\n", "\n", "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 6 Scikit learn\n", "\n", "1. 
Fit and evaluate the linear regression model using `sklearn`.\n", "2. Evaluate the model with the mean squared error metric provided by `sklearn`.\n", "3. Compare with your implementation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn import linear_model, metrics\n", "\n", "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 7 Deep Learning\n", "\n", "Try a simple deep learning model!\n", "\n", "Another modeling choice would be to use a Recurrent Neural Network (RNN) and feed it the sentence word by word." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "os.environ['KERAS_BACKEND'] = 'theano' # tensorflow\n", "import keras\n", "\n", "# Your code here." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 8 Evaluation\n", "\n", "Use [matplotlib](http://matplotlib.org) to plot a performance visualization, e.g. the true number of likes versus the predicted number of likes for all posts / tweets.\n", "\n", "What do you observe? What are your suggestions to improve the performance?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from matplotlib import pyplot as plt\n", "plt.style.use('ggplot')\n", "%matplotlib inline\n", "\n", "# Your code here." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.2" } }, "nbformat": 4, "nbformat_minor": 1 }