{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Yellowbrick for Machine Learning Visualizations on Facebook Data\n", "\n", "Paul Witt\n", "\n", "The dataset below was provided to the UCI Machine Learning Repository by researchers who used neural networks and decision trees to predict how many comments a given Facebook post would generate. \n", "\n", "There are five variants of the dataset. This notebook uses only the first. \n", "\n", "The full paper can be found here: \n", "\n", "http://uksim.info/uksim2015/data/8713a015.pdf\n", "\n", "\n", "### The primary purpose of this notebook is to test Yellowbrick. \n", "\n", "# Attribute Information:\n", "\n", "\n", "All features are integer or float values. \n", "\n", "\n", "1 \n", "Page Popularity/likes \n", "Decimal Encoding \n", "Page feature \n", "Defines the popularity of, or support for, the source of the document. \n", "\n", "\n", "2 \n", "Page Checkins’s \n", "Decimal Encoding \n", "Page feature \n", "Describes how many individuals have visited this place so far. This feature applies only to places, e.g. an institution, place, theater, etc. \n", "\n", "\n", "3 \n", "Page talking about \n", "Decimal Encoding \n", "Page feature \n", "Defines the daily interest of individuals in the source of the document/post: the people who actually come back to the page after liking it. This includes activities such as comments, likes on a post, shares, etc. by visitors to the page.\n", "\n", "\n", "4 \n", "Page Category \n", "Value Encoding \n", "Page feature \n", "Defines the category of the source of the document, e.g. place, institution, brand, etc. \n", "\n", "\n", "5 - 29 \n", "Derived \n", "Decimal Encoding \n", "Derived feature \n", "These features are aggregated by page, by calculating the min, max, average, median and standard deviation of the essential features. 
\n", "\n", "\n", "30 \n", "CC1 \n", "Decimal Encoding \n", "Essential feature \n", "The total number of comments before the selected base date/time. \n", "\n", "\n", "31 \n", "CC2 \n", "Decimal Encoding \n", "Essential feature \n", "The number of comments in the last 24 hours, relative to the base date/time. \n", "\n", "\n", "32 \n", "CC3 \n", "Decimal Encoding \n", "Essential feature \n", "The number of comments between 48 and 24 hours before the base date/time. \n", "\n", "\n", "33 \n", "CC4 \n", "Decimal Encoding \n", "Essential feature \n", "The number of comments in the first 24 hours after the publication of the post but before the base date/time. \n", "\n", "\n", "34 \n", "CC5 \n", "Decimal Encoding \n", "Essential feature \n", "The difference between CC2 and CC3. \n", "\n", "\n", "35 \n", "Base time \n", "Decimal(0-71) Encoding \n", "Other feature \n", "Selected time in order to simulate the scenario. \n", "\n", "\n", "36 \n", "Post length \n", "Decimal Encoding \n", "Other feature \n", "Character count in the post. \n", "\n", "\n", "37 \n", "Post Share Count \n", "Decimal Encoding \n", "Other feature \n", "The number of shares of the post, i.e. how many people have shared this post to their timeline. \n", "\n", "\n", "38 \n", "Post Promotion Status \n", "Binary Encoding \n", "Other feature \n", "Individuals can promote a post to reach more people in News Feed; this feature indicates whether the post is promoted (1) or not (0). \n", "\n", "\n", "39 \n", "H Local \n", "Decimal(0-23) Encoding \n", "Other feature \n", "The number of hours H for which the target variable (comments received) is recorded. \n", "\n", "\n", "40-46 \n", "Post published weekday \n", "Binary Encoding \n", "Weekdays feature \n", "This represents the day (Sunday...Saturday) on which the post was published. 
\n", "\n", "\n", "47-53 \n", "Base DateTime weekday \n", "Binary Encoding \n", "Weekdays feature \n", "This represents the day (Sunday...Saturday) of the selected base date/time. \n", "\n", "54 \n", "Target Variable \n", "Decimal \n", "Target \n", "The number of comments in the next H hours (H is given in feature 39).\n", "\n", "\n", "\n", "\n", "\n", "\n", "## Data Exploration \n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import os\n", "import json\n", "import time\n", "import pickle\n", "import requests\n", "\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import yellowbrick as yb \n", "import matplotlib.pyplot as plt\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data Shape (40948, 54)\n", "634995 int64\n", "0 int64\n", "463 int64\n", "1 int64\n", "0.0 float64\n", "806.0 float64\n", "11.291044776119403 float64\n", "1.0 float64\n", "70.49513846124168 float64\n", "0.0.1 float64\n", "806.0.1 float64\n", "7.574626865671642 float64\n", "0.0.2 float64\n", "69.435826365571 float64\n", "0.0.3 float64\n", "76.0 float64\n", "2.6044776119402986 float64\n", "0.0.4 float64\n", "8.50550186882253 float64\n", "0.0.5 float64\n", "806.0.2 float64\n", "10.649253731343284 float64\n", "1.0.1 float64\n", "70.25478763764251 float64\n", "-69.0 float64\n", "806.0.3 float64\n", "4.970149253731344 float64\n", "0.0.6 float64\n", "69.85058043098057 float64\n", "0.1 int64\n", "0.2 int64\n", "0.3 int64\n", "0.4 int64\n", "0.5 int64\n", "65 int64\n", "166 int64\n", "2 int64\n", "0.6 int64\n", "24 int64\n", "0.7 int64\n", "0.8 int64\n", "0.9 int64\n", "1.1 int64\n", "0.10 int64\n", "0.11 int64\n", "0.12 int64\n", "0.13 int64\n", "0.14 int64\n", "0.15 int64\n", "0.16 int64\n", "0.17 int64\n", "0.18 int64\n", "1.2 int64\n", "0.19 int64\n", "dtype: object\n" ] } ], "source": [ 
"df=pd.read_csv(\"/Users/pwitt/Documents/machine-learning/examples/pbwitt/Dataset/Training/Features_Variant_1.csv\")\n", "\n", "# Fetch the data if required\n", "DATA = df\n", "print('Data Shape ' + str(df.shape))\n", "\n", "print(df.dtypes)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | Page Popularity/likes | Page Checkins’s | Page talking about | Page Category | Derived5 | Derived6 | Derived7 | Derived8 | Derived9 | Derived10 | ... | Post published weekday-Fri | Post published weekday-Sat | Base DateTime weekday-Sun | Base DateTime weekday-Mon | Base DateTime weekday-Tues | Base DateTime weekday-Wed | Base DateTime weekday-Thurs | Base DateTime weekday-Fri | Base DateTime weekday-Sat | Target_Variable |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 634995 | 0 | 463 | 1 | 0.0 | 806.0 | 11.291045 | 1.0 | 70.495138 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 634995 | 0 | 463 | 1 | 0.0 | 806.0 | 11.291045 | 1.0 | 70.495138 | 0.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 634995 | 0 | 463 | 1 | 0.0 | 806.0 | 11.291045 | 1.0 | 70.495138 | 0.0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 634995 | 0 | 463 | 1 | 0.0 | 806.0 | 11.291045 | 1.0 | 70.495138 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 634995 | 0 | 463 | 1 | 0.0 | 806.0 | 11.291045 | 1.0 | 70.495138 | 0.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 54 columns
\n", "