{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Analysis in Ruby\n", "\n", "## Introduction\n", "\n", "This is an example of a data analysis in Ruby using [`daru`](https://github.com/v0dro/daru) to organize and manipulate data, [`mixed_models`](https://github.com/agisga/mixed_models) to fit a statistical model, and [`gnuplotrb`](https://github.com/dilcom/gnuplotrb) for visualization.\n", "\n", "We will analyze [these data](http://archive.ics.uci.edu/ml/datasets/BlogFeedback) [1] from the UCI machine learning repository [2], which originate from blog posts from various sources in 2010-2012.\n", "\n", "Here, we use the data to investigate the following question:\n", "\n", "> *If a blog post has received a comment in the first 24 hours after publication, how many more comments will it receive before 24 hours after its publication have passed?*\n", "\n", "After cleaning and preprocessing the data with `daru`, we use `mixed_models` to fit a linear mixed model. The fitted model can be used to make predictions for future observations, and to make inferences about the relationships between different variables in the data.\n", "\n", "------------------------------------------------------------------------\n", "\n", "[1] For more information on the data set we refer to: Buza, K. (2014). *Feedback Prediction for Blogs*. In Data Analysis, Machine Learning and Knowledge Discovery (pp. 145-152). Springer International Publishing.\n", "\n", "[2] Lichman, M. (2013). *UCI Machine Learning Repository*. Irvine, CA: University of California, School of Information and Computer Science." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data preprocessing with `daru`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `daru` requires CSV files to have a header line, we add a header to the data file and save the result as a new file." 
] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "without_header = '../examples/data/blogData_train.csv'\n", "with_header = '../examples/data/blogData_train_with_header.csv'\n", "colnames = (1..281).to_a.map { |x| \"v#{x}\" }\n", "header = colnames.join(',')\n", "File.open(with_header, 'w') do |fo|\n", " fo.puts header\n", " File.foreach(without_header) do |li|\n", " fo.puts li\n", " end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can load the data with `daru`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows: 52397\n", "Number of columns: 281\n" ] } ], "source": [ "require 'daru'\n", "df = Daru::DataFrame.from_csv '../examples/data/blogData_train_with_header.csv'\n", "puts \"Number of rows: #{df.nrows}\"\n", "puts \"Number of columns: #{df.ncols}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Selecting and renaming data columns\n", "\n", "We don't want to keep most of the variables for further analysis, as the great majority of the columns represent bag-of-words features and summary statistics for other attributes.\n", "So, we select the data columns which we want to keep, and assign them meaningful names." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
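<table><tr><th colspan=14>Daru::DataFrame:47139036762220 rows: 10 cols: 13</th></tr>
<tr><th></th><th>host_comments_avg</th><th>host_trackbacks_avg</th><th>comments</th><th>length</th><th>mo</th><th>tu</th><th>we</th><th>th</th><th>fr</th><th>sa</th><th>su</th><th>parents</th><th>parents_comments</th></tr>
<tr><td>0</td><td>34.567566</td><td>0.972973</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>1</td><td>34.567566</td><td>0.972973</td><td>5.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>2</td><td>34.567566</td><td>0.972973</td><td>5.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>3</td><td>34.567566</td><td>0.972973</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>4</td><td>34.567566</td><td>0.972973</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>5</td><td>34.567566</td><td>0.972973</td><td>5.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>6</td><td>34.567566</td><td>0.972973</td><td>5.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>7</td><td>34.567566</td><td>0.972973</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>8</td><td>34.567566</td><td>0.972973</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr>
<tr><td>9</td><td>34.567566</td><td>0.972973</td><td>2.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>1.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td><td>0.0</td></tr></table>
" ], "text/plain": [ "\n", "#\n", " host_comme host_track comments length mo tu we th fr sa su parents parents_co \n", " 0 34.567566 0.972973 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", " 1 34.567566 0.972973 5.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 2 34.567566 0.972973 5.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 3 34.567566 0.972973 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", " 4 34.567566 0.972973 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", " 5 34.567566 0.972973 5.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 6 34.567566 0.972973 5.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 \n", " 7 34.567566 0.972973 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", " 8 34.567566 0.972973 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n", " 9 34.567566 0.972973 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 \n" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# select a subset of columns of the data frame\n", "keep = ['v16', 'v41', 'v54', 'v62', 'v270', 'v271', 'v272', \n", " 'v273', 'v274', 'v275', 'v276', 'v277', 'v280']\n", "blog_data = df[*keep]\n", "df = nil\n", "\n", "# assign meaningful names for the selected columns\n", "meaningful_names = [:host_comments_avg, :host_trackbacks_avg, \n", " :comments, :length, :mo, :tu, :we, :th, \n", " :fr, :sa, :su, :parents, :parents_comments]\n", "blog_data.vectors = Daru::Index.new(meaningful_names)\n", "\n", "# the resulting data set\n", "blog_data.head" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Selecting a subset of the data rows\n", "\n", "As can be seen in the above output, the length of the text in a blog post is often given as zero. Those are probably missing values, and we get rid of the corresponding observations. \n", "\n", "We also delete observations that have zero comments in the first 24 hours after publication, to comply with our research objective stated at the beginning." 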
] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Remaining number of rows: 22435\n" ] } ], "source": [ "nonzero_ind = blog_data[:length].each_index.select do |i| \n", " blog_data[:length][i] > 0 && blog_data[:comments][i] > 0\n", "end\n", "blog_data = blog_data.row[*nonzero_ind]\n", "puts \"Remaining number of rows: #{blog_data.nrows}\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Replacing and transforming variables\n", "\n", "##### Creating a categorical \"day\" variable\n", "\n", "For a clearer representation of the data, and in order to use the day of the week as a grouping variable for the observations, we replace the respective seven 0-1-valued columns with one column of categorical data (with values 'mo', 'tu', 'we', 'th', 'fr', 'sa', 'su')." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
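<table><tr><th colspan=8>Daru::DataFrame:47139045191780 rows: 3 cols: 7</th></tr>
<tr><th></th><th>host_comments_avg</th><th>host_trackbacks_avg</th><th>comments</th><th>length</th><th>parents</th><th>parents_comments</th><th>day</th></tr>
<tr><td>1221</td><td>110.30087</td><td>0.0</td><td>74.0</td><td>3501.0</td><td>0.0</td><td>0.0</td><td>we</td></tr>
<tr><td>1222</td><td>110.30087</td><td>0.0</td><td>74.0</td><td>3501.0</td><td>0.0</td><td>0.0</td><td>we</td></tr>
<tr><td>1223</td><td>110.30087</td><td>0.0</td><td>218.0</td><td>4324.0</td><td>0.0</td><td>0.0</td><td>th</td></tr></table>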
" ], "text/plain": [ "\n", "#\n", " host_comme host_track comments length parents parents_co day \n", " 1221 110.30087 0.0 74.0 3501.0 0.0 0.0 we \n", " 1222 110.30087 0.0 74.0 3501.0 0.0 0.0 we \n", " 1223 110.30087 0.0 218.0 4324.0 0.0 0.0 th \n" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "days = Array.new(blog_data.nrows) { :unknown }\n", "[:mo, :tu, :we, :th, :fr, :sa, :su].each do |d|\n", " ind = blog_data[d].to_a.each_index.select { |i| blog_data[d][i]==1 }\n", " ind.each { |i| days[i] = d.to_s }\n", " blog_data.delete_vector(d)\n", "end\n", "blog_data[:day] = days\n", "blog_data.head 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Replacing two highly correlated variables\n", "\n", "The variable `parents` denotes the number of parent blog posts, where we consider a blog post P as a\n", "parent of blog post B if B is a reply (trackback) to blog post P. \n", "Related to it, the variable `parents_comments` denotes the number of comments that the parents of a blog post received on average.\n", "Clearly, the two variables are highly correlated (as zero `parents` implies zero `parents_comments`). Therefore, we shouldn't include both these variables in the linear mixed model.\n", "\n", "We combine the variables `parents` and `parents_comments` into one variable called `has_parent_with_comments`, which is a categorical variable designating whether a blog post has at least one parent post with at least one comment." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
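<table><tr><th colspan=7>Daru::DataFrame:47139031507220 rows: 3 cols: 6</th></tr>
<tr><th></th><th>host_comments_avg</th><th>host_trackbacks_avg</th><th>comments</th><th>length</th><th>day</th><th>has_parent_with_comments</th></tr>
<tr><td>1221</td><td>110.30087</td><td>0.0</td><td>74.0</td><td>3501.0</td><td>we</td><td>no</td></tr>
<tr><td>1222</td><td>110.30087</td><td>0.0</td><td>74.0</td><td>3501.0</td><td>we</td><td>no</td></tr>
<tr><td>1223</td><td>110.30087</td><td>0.0</td><td>218.0</td><td>4324.0</td><td>th</td><td>no</td></tr></table>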
" ], "text/plain": [ "\n", "#\n", " host_comme host_track comments length day has_parent \n", " 1221 110.30087 0.0 74.0 3501.0 we no \n", " 1222 110.30087 0.0 74.0 3501.0 we no \n", " 1223 110.30087 0.0 218.0 4324.0 th no \n" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# create a binary indicator vector specifying if a blog post has at least \n", "# one parent post which has comments\n", "hpwc = (blog_data[:parents] * blog_data[:parents_comments]).to_a\n", "blog_data[:has_parent_with_comments] = hpwc.map { |t| t == 0 ? 'no' : 'yes'} \n", "blog_data.delete_vector(:parents)\n", "blog_data.delete_vector(:parents_comments)\n", "blog_data.head 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Log transforms\n", "\n", "Some prior experimentation with the data suggests that it is necessary to take $log$-transforms of the response variable `comments` and the predictor variable `host_comments_avg`.\n", "This results in a much better agreement with the normality assumption on the model residuals." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table><tr><th colspan=9>Daru::DataFrame:47139031510440 rows: 3 cols: 8</th></tr>
<tr><th></th><th>host_comments_avg</th><th>host_trackbacks_avg</th><th>comments</th><th>length</th><th>day</th><th>has_parent_with_comments</th><th>log_comments</th><th>log_host_comments_avg</th></tr>
<tr><td>1221</td><td>110.30087</td><td>0.0</td><td>74.0</td><td>3501.0</td><td>we</td><td>no</td><td>4.30406509320417</td><td>4.7032118138076795</td></tr>
<tr><td>1222</td><td>110.30087</td><td>0.0</td><td>74.0</td><td>3501.0</td><td>we</td><td>no</td><td>4.30406509320417</td><td>4.7032118138076795</td></tr>
<tr><td>1223</td><td>110.30087</td><td>0.0</td><td>218.0</td><td>4324.0</td><td>th</td><td>no</td><td>5.384495062789089</td><td>4.7032118138076795</td></tr></table>
" ], "text/plain": [ "\n", "#\n", " host_comme host_track comments length day has_parent log_commen log_host_c \n", " 1221 110.30087 0.0 74.0 3501.0 we no 4.30406509 4.70321181 \n", " 1222 110.30087 0.0 74.0 3501.0 we no 4.30406509 4.70321181 \n", " 1223 110.30087 0.0 218.0 4324.0 th no 5.38449506 4.70321181 \n" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "log_comments = blog_data[:comments].to_a.map { |c| Math::log(c) }\n", "log_host_comments_avg = blog_data[:host_comments_avg].to_a.map { |c| Math::log(c) }\n", "blog_data[:log_comments] = log_comments\n", "blog_data[:log_host_comments_avg] = log_host_comments_avg\n", "blog_data.head 3" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Linear mixed model\n", "\n", "The logarithm of the number of comments of a given blog post is modeled as a function of the blog post text length, the average number of comments and trackbacks per blog article at the hosting website of the blog, and the existence of commented parent blog posts. Additionally, we model random fluctuations of the number of comments due to the day of the week when the blog post was released.\n", "\n", "That is, the logarithm of the number of comments of the $i$th blog post, which is published on weekday $d$, is estimated as\n", "\n", "$$log(comments_i) \\approx \\beta_0 + length_i \\cdot \\beta_1 + log(host\\_comments_i) \\cdot \\beta_2 + host\\_trackbacks_i \\cdot \\beta_3 + parent\\_with\\_comments\\_yes \\cdot \\beta_4 + b_d.$$\n", "\n", "*Note:* It is questionable whether a linear mixed model is a good choice here, because the response variable represents count data, which cannot follow a normal distribution. However, we can still get a reasonable model fit, and certainly gain some insight into the data from the linear mixed model, even if it won't match the data very well. In fact, by taking the $log$-transform, we partially account for the fact that the response variable represents count data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We fit the model with `mixed_models` and display the estimated fixed effects coefficients (the $\\beta$'s from the above equation)." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:intercept=>0.18944500002662124, :log_host_comments_avg=>0.7861228519075351, :host_trackbacks_avg=>0.05579925647682859, :length=>2.580329980620255e-05, :has_parent_with_comments_lvl_yes=>-0.48841082096511457}" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'mixed_models'\n", "model_fit = LMM.from_formula(formula: \"log_comments ~ log_host_comments_avg + host_trackbacks_avg + length + has_parent_with_comments + (1 | day)\", data: blog_data)\n", "model_fit.fix_ef" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table><tr><th colspan=5>Daru::DataFrame:47139041993940 rows: 5 cols: 4</th></tr>
<tr><th></th><th>coef</th><th>sd</th><th>z_score</th><th>WaldZ_p_value</th></tr>
<tr><td>intercept</td><td>0.18944500002662124</td><td>0.01930036695103986</td><td>9.815616485800254</td><td>0.0</td></tr>
<tr><td>log_host_comments_avg</td><td>0.7861228519075351</td><td>0.0053227324152766145</td><td>147.69159720509478</td><td>0.0</td></tr>
<tr><td>host_trackbacks_avg</td><td>0.05579925647682859</td><td>0.008102849965630654</td><td>6.886374141630261</td><td>5.723421736547607e-12</td></tr>
<tr><td>length</td><td>2.580329980620255e-05</td><td>2.0833561682273268e-06</td><td>12.385448153187317</td><td>0.0</td></tr>
<tr><td>has_parent_with_comments_lvl_yes</td><td>-0.48841082096511457</td><td>0.08884797074497962</td><td>-5.497152235102819</td><td>3.859734754030342e-08</td></tr></table>
" ], "text/plain": [ "\n", "#\n", " coef sd z_score WaldZ_p_va \n", " intercept 0.18944500 0.01930036 9.81561648 0.0 \n", "log_host_c 0.78612285 0.00532273 147.691597 0.0 \n", "host_track 0.05579925 0.00810284 6.88637414 5.72342173 \n", " length 2.58032998 2.08335616 12.3854481 0.0 \n", "has_parent -0.4884108 0.08884797 -5.4971522 3.85973475 \n" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.fix_ef_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Assess the quality of the fit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Fitted vs. residuals plot\n", "\n", "We can assess the goodness of the model fit (to some extent) by plotting the residuals against the fitted values. We definitely see a pattern here -- the model seems to make somewhat better predictions for observations with a small number of comments. Additionally, a subset of the observations displays what appears to be a linear relationship between the fitted values and the residuals, which is a reason for concern. However, this is due to the fact that the response variable has a discrete range ($log$ transform of count data), with possible values at the low end being especially far apart." 
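, "\n", "\n", "To make the last point concrete, here is a minimal sketch in plain Ruby (no gems required). The fitted values in it are hypothetical numbers chosen only for illustration; they are not taken from the model above. It lists the possible values of the log-transformed response for small counts, and checks that posts sharing the same observed count fall on a line with slope -1 in the fitted vs. residuals plot, since residual = log(count) - fitted.

```ruby
# Possible values of log(comments) for small counts.
counts = (1..6).to_a
log_values = counts.map { |c| Math.log(c) }

# Gaps between neighbouring possible values; they shrink as the count grows.
gaps = log_values.each_cons(2).map { |a, b| b - a }

counts.zip(log_values).each do |c, l|
  puts format('count = %d, log(count) = %.3f', c, l)
end
puts 'gaps: ' + gaps.map { |g| format('%.3f', g) }.join(', ')

# For posts with the same observed count, residual = log(count) - fitted.
# The fitted values below are hypothetical, for illustration only.
observed = Math.log(2)
hypothetical_fitted = [0.5, 1.0, 1.5]
residuals = hypothetical_fitted.map { |f| observed - f }

# Slope between consecutive (fitted, residual) points.
slopes = (0...residuals.length - 1).map do |i|
  (residuals[i + 1] - residuals[i]) / (hypothetical_fitted[i + 1] - hypothetical_fitted[i])
end
puts 'slopes: ' + slopes.map { |s| format('%.1f', s) }.join(', ')
```

The shrinking gaps may be why the linear bands are easiest to see for posts with very few comments."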
] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "# \"Fitted\", :ylabel => \"Residuals\", :term => [\"png\"]], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:pointtype => 6, :notitle => true, :with => \"points\"]>], @cmd=\"plot \">" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'gnuplotrb'\n", "include GnuplotRB\n", "\n", "x, y = model_fit.fitted, model_fit.residuals\n", "fitted_vs_residuals = Plot.new([[x,y], with: 'points', pointtype: 6, notitle: true],\n", " xlabel: 'Fitted', ylabel: 'Residuals')\n", "fitted_vs_residuals.term('png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Histogram of the residuals\n", "\n", "We can further analyze the validity of the linear mixed model somewhat, by looking at a histogram and checking if the residuals appear to be approximately normally distributed.\n", "The distribution looks in general not too different from a bell-shaped normal curve." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAMAAAACDyzWAAABNVBMVEX///8AAACgoKD/AAAAwAAAgP/AAP8A7u7AQADIyABBaeH/wCAAgEDAgP8wYICLAABAgAD/gP9//9SlKir//wBA4NAAAAAaGhozMzNNTU1mZmZ/f3+ZmZmzs7PAwMDMzMzl5eX////wMjKQ7pCt2ObwVfDg///u3YL/tsGv7u7/1wAA/wAAZAAA/38iiyIui1cAAP8AAIsZGXAAAIAAAM2HzusA////AP8AztH/FJP/f1DwgID/RQD6gHLplnrw5oy9t2u4hgv19dyggCD/pQDugu6UANPdoN2QUEBVay+AFACAFBSAQBSAQICAYMCAYP+AgAD/gED/oED/oGD/oHD/wMD//4D//8DNt57w//Cgts3B/8HNwLB8/0Cg/yC+vr6fn58fHx+/v79fX1/f398/Pz/Jf+lXIGH2AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAQ4ElEQVR4nO2dDXbiTBIE0Tm4z5xDP83e/wgLje2xwNs72apEqBzxnlnre0kn7gobIePZ0wkAAAAAAAAAAAAAAAAAAAAA4PcxTsM0fh6cvx8A+JmnuX5UlnM5lcu46wOC38UyXm/G5X5QPSzTbg8Gfh9TOT04h4DwQoZvt5WyjLs8EPidPAo4DMteDwV+I//wE3AAeCZIwH84B+yq6nt8Kave/fG9suqZZbzenD+edm+fI+Bb3Clp1TOlXgcs9xV/vg6YcysQcI+qH5in4X4d+rbi8nmwuerttwIB96jq4oVVcBgQEHYFAWFXEBB2BQFhVxAQdgUBYVcQEHYFAWFXEBB2BQFhVxAQdgUBYVcQEHYFAWFXEBB2BQFhVxAQdgUBYVcQEHYFAWFXEBB2BQH9/BHZ+/G+FAT08+c/Egh4/Kr3AgEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsAEC+kHABgjoBwEbIKAfBGyAgH4QsMFWK8ZpmMbPg/P1YL7+7zLcmIKrDgsCNthoxXwVbq7SXVnO5VQu4+l0KYaq44KADTZasYzXm3G5H9w+P5UJAR9AwAYbrZhuqpXvT7a3gwkBv4OADTZaMTwtcj5fBZwv384Mg6qOCwI2iBZwvFxvLteTwfnJQAREwGeCBTwvX5/OT6+Cb2yrOyYI+D8IMGJ9DliW8/flH+u2VR0XBGwQ8Sr468fe5eN6TH0NMl9iq44LAjbYaEWp1wHLfaFl/PivnAOuQMAGW62YP375cVtouFOuT8UTr4L/goAN+F2wHwRsgIB+ELABAvpBwAYI6AcBGyCgHwRsgIB+ELABAvpBwAYI6AcBGyCgHwRsgIB+ELABAvpBwAYI6AcBGyCgHwRsgIB+ELABAvpBwAYI6AcBGyCgHwRsgIB+ELABAvpBwAYI6AcBGyCgHwRsgIB+VAE19v7qNoKAflQBtfTeX91GENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD8I2AAB/SBgAwT0g4ANENAPAjZAQD
8I2AAB/SBgAwT0g4ANwqwYp2EaPw/O14PZVnU0ELBBlBXzVbj5U7rlXE7lMpqqDgcCNoiyYhmvN+NyP7h9fiqTqepwIGCDKCumcnpwDgE/QcAGUVYMT6udz6aqw4GADWwCjhdX1Rsg/r9pIeD/xiXgeXmO3Aiq2xnRESn9iwSMNGJ9DliWx+ffU66fgEajfpGAN0JfBX/92Ls8XQQMrHoDREekNAJ2Uep1wHJfcRmdVW+A6IiURsA+5s9ffgwfz+3DUExV+yM6IqURMEGVG9ERKY2ACarciI5IaQRMUOVGdERKI2CCKjeiI1IaARNUuREdkdIImKDKjeiIlEbABFVuREekNAImqHIjOiKlETBBlRvRESmNgAmq3IiOSGkETFDlRnRESiNggio3oiNSGgETVLkRHZHSCJigyo3oiJRGwARVbkRHpDQCJqhyIzoipREwQZUb0REpjYAJqtyIjkhpBExQ5UZ0REojYIIqN6IjUhoBE1S5ER2R0giYoMqN6IiURsAEVW5ER6Q0AiaociM6IqURMEGVG9ERKY2ACarciI5IaQRMUOVGdERKI2CCKjeiI1IaARNUuREdkdIImKDKjeiIlEbABFVuREekNAImqHIjOiKlETBBlRvRESmNgAmq3IiOSGkETFDlRnRESiNggio3oiNSGgETVLkRHZHSCJigyo3oiJRGwARVbkRHpDQCJqhyIzoipREwQZUb0REpjYAJqtyIjkhpBExQ5UZ0REojYIIqN6IjUhoBE1S5ER2R0giYoMqN6IiURsAEVW5ER6Q0AiaociM6IqURMEGVG9ERKY2ACarciI5IaQRMUOVGdERKI2CCKjeiI1IaARNUuREdkdIImKDKjeiIlEbABFVuREekNAImqHIjOiKlETBBlRvRESmNgAmq3IiOSGkETFDlRnRESiNggio3oiNSGgElxmmYxs+DstyXW4YbU3DVGyE6IqURUGGe5vpRGafxvtylGKreCdERKY2ACst4vRmXj4NyQsDNRiGgwnRTrfx9sh3+/tfoqndCdERKI6B+9+HheJov384Mg6reCdERKY2A+t0fBbycy/XMcIyteidER6Q0Aup3fxSwMj+9Cr6xre5dEB2R0r9IwAAjfj4HfP78p+MDIzoipX+RgDciXgWfl4fl6muQ+RJb9U6IjkhpBFQo9TpgOa2fjDkH3GAUAkrM03C/Dj18PKXfntTLMvEquNcoBExQ5UZ0REojYIIqN6IjUhoBE1S5ER2R0giYoMqN6IiURsAEVW5ER6Q0AiaociM6IqURMEGVG9ERKY2ACarciI5IaQRMUOVGdERKI2CCKjeiI1IaARNUuREdkdIImKDKjeiIlEbABFVuREekNAImqHIjOiKlETBBlRvRESmNgAmq3IiOSGkETFDlRnRESiNggio3oiNSGgETVLkRHZHSCJigyo3oiJRGwARVbkRHpDQCJqhyIzoipREwQZUb0REpjYAJqtyIjkhpBExQ5UZ0REojYIIqN6IjUhoBE1S5ER2R0giYoMqN6IiURsAEVW5ER6Q0AiaociM6IqURMEGVG9ERKY2ACarciI5IaQRMUOVGdERKI2CCKjeiI1IaARNUuREdkdIImKDKjeiIlEbABFVuREekNAImqHIjOiKlETBBlRvRESmNgAmq3IiOSGkETFDlRnRESiNggio3oiNSGgETVLkRHZHSCJigyo3oiJRGwARVbkRHpDQCJqhyIzoipREwQZUb0REpjYAJqtyIjkhpBExQ5UZ0REojYIIqN6IjUhoBE1S5ER2R0giYoMqN6IiURsAEVW5ER6Q0AiaociM6IqURMEGVG9ERKY2ACarciI5IaQRMUOVGdERKI2CCKjeiI1IaARNUuREdkdII2Mc4DdP4eVCWH9ZFQEccAe/M01w/KuM0ImCvUQjYxTJeb8bl46D8tC4COuIIeGcq15sytdZFQEccAb+vMzwcW6reANERKY2AG9ZBwO1GIeCGddoC3giq2x
nRESn9iwSMNIJzwCijfpGAN0JfBZ+X1rrvLOAfEaNRCNhFqdcBy+n5yTi8yoHXEevie2/dRsKsmKfhfh16+Hhuf3p2R0DL4ntv3Ub4XXDF64h18b23biMIWPE6Yl18763bCAJWvI5YF9976zaCgBWvI9bFNfbe6CcQsOJ15H0WR8A3RRsjAsaBgBVtjAgYBwJWtDEiYBwIWNHGiIBxIGBFGyMCxoGAFW2MCBgHAla0MSJgHAhY0caIgHEgYEUbIwLGgYAVbYwIGAcCVrQxImAcCFjRxoiAcSBgRRsjAsaBgBVtjAgYBwJWtDEiYBwIWNHGiIBxIGBFGyMCxoGAFW2MCBgHAla0MSJgHAhY0caIgHEgYEUbIwLGgYAVbYwIGAcCVrQxImAcCFjRxoiAcSBgRRsjAsaBgBVtjAgYBwJWtDEiYBwIWNHGiIBxIGBFGyMCxoGAFW2MCBgHAla0MSJgHAhY0caIgHEgYEUbIwLGgYAVbYwIGAcCVrQxImAcCFjRxoiAcSBgRRsjAsaBgBVtjAgYBwJWtDEiYBwIWNHGiIBxIGBFGyMCxoGAFW2MCBgHAla0MSJgHAhY0caIgHEgYEUbIwLGgYAVbYwIGAcCVrQxImAcCFjRxoiAcSBgRRsjAsaBgBVtjAgYBwJWtDEiYBwIWNHGiIBxIGBFGyMCxoGAFW2MCBjHVivGaZjGx4NluDEFVznRxoiAcWy0Yp7m+rE+uBRDlRVtjAgYx0YrlvF6My4PBwi4IY2ACtNNtTI9HEwI2J9GQP3uw8PBNF++nRkGVVnRxoiAcVgEvJzL9WRwjK2yoo0RAeOwCFiZn14F39hWZ0MbIwLGEGDEz+eAH8s/1m2rsqKNEQHjiHgVfF4eDuprkPkSW2VFGyMCxrHRilIv/ZX7Ql8HnANuSCOgxDwN9+vQw7eDsky8Cu5NI2CCKhltjAgYBwJWtDEiYBxpBfyjIY0RAePIK6BxjAgYBwJ2jBEB40DAjjEiYBwI2DFGBIwDATvGiIBxIGDHGBEwDgTsGCMCxoGAHWNEwDgQsGOMCBgHAnaMEQHjQMCOMSJgHAjYMUYEjAMBO8aIgHEgYMcYETAOBOwYIwLGgYAdY0TAOBCwY4wIGAcCdowRAeNAwI4xImAcCNgxRgSMAwE7xoiAcSBgxxgRMA4E7BgjAsaBgB1jRMA4ELBjjAgYBwJ2jBEB40DAjjEiYBwI2DFGBIwDATvGeGABRfxzQsCOMR5YQCn9ih+YCNg1mKMurqURcAPewRx1cS2NgBvwDuaoi2tpBFzh/DdP1cEcdXEtjYArnFutpY+7uJZGwBXOrdbSx11cSyPgCudWa+njLq6lEXCFc6u19HEX19IIuMK51Vr6uItraQRc4dxqLX3cxbU0Aq5wbrWWPu7iWhoBVzi3Wksfd3EtjYArnFutpY+7uJZGwBXOrdbSx11cSyPgCudWa+njLq6lEXCFc6u19HEX19IIuMK51Vr6uItraQRc4dxqLX3cxbU0Aq5wbrWWPu7iWhoBVzi3Wksfd3EtjYArnFutpY+7uJZGwBXOrdbSx11cSyPgCudWa+njLq6lEXCFc6u19HEX19IIuMK51Vr6uItraQRc4dxqLX3cxbX0gQUcp2Eag6ucW62lj7u4lj6ugPM014/QKudWa+njLq6ljyvgMl5vxiWg6tudnFutpY+7uJZW/zGtjQMOZCrXmzIFVCHgjml18Y0DDmT4ae2nKvH7S9wNY/q4i2vp9AJad8OYPu7iWjqbgADPWAT88RwQ4FUs4/XmvPyfFICJUq8Dlr0fBvxa5ml4vA4NAAAAAAAAAAAAAOBkqb957vj18Kj/uvr8wx8E/NOd9GvnZdEf3k9/r2Ap6vyiurbvhjyqbik6uHT+Yq5M8sYv53Iql/EFd7q6pH9//Pj3Co6il+7EqWdUvVL00Nt1mfUJ327U9+KMPXc6LUV/Q9tyq3r8ewVHUecX1bV9N/RRvVLAzrcmnMfOtyx2vRms50
7yw+t9r1rvO+detRMdo3rl+1Wm+dJxZjFfOje+LHLXlfNZv4/88H58t66j6IOeL6pj+3pG1SdFH5frmcWslpXbt0jPxg9Dz5sRx0tPVd8dXiZgzxfVsX1do+qSQufv+67nf//BXu+03L55la/qq0r5Fv68k/YO2s97vbmAfW8L1n8CyqP6QpBiO+Ij/PizgY7nEP0kpiw9NW9+Dtj7RemPb8OoXvPPw9TTzfklz3Fj7VN38NL5Btq+V8EdP5h65tTzRY23m74/6FEfYr8UOv1P9y+6Dijmv5AfXu/fK/RcBxz1+/RfB3zbc8BKWXqvr+sbv3Rc//94ChG96PrTwp6/V+j7G8a+L6pn+z76xHy/FAAAAAAAAAAAAAAAAAAAAAAAACDxX66uA7frdpvBAAAAAElFTkSuQmCC", "text/plain": [ "# \"fill solid 0.5\", :term => [\"png\"]], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:notitle => true, :with => \"boxes\"]>], @cmd=\"plot \">" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bin_width = (y.max - y.min)/10.0\n", "bins = (y.min..y.max).step(bin_width).to_a\n", "rel_freq = Array.new(bins.length-1){0.0}\n", "y.each do |r|\n", " 0.upto(bins.length-2) do |i|\n", " if r >= bins[i] && r < bins[i+1] then\n", " rel_freq[i] += 1.0/y.length\n", " end\n", " end\n", "end\n", "bins_center = bins[0...-1].map { |b| b + bin_width/2.0 }\n", " \n", "residuals_hist = Plot.new([[bins_center, rel_freq], with: 'boxes', notitle: true],\n", " style: 'fill solid 0.5')\n", "residuals_hist.term('png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Q-Q plot of the residuals\n", "\n", "The Q-Q scatter plot deviates a little at both ends from the diagonal. However, it isn't horrible considering that we are modeling count data with a normal distribution." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "# \"Q-Q plot of the residuals\", :xlabel => \"Normal theoretical quantiles\", :ylabel => \"Observed quantiles\", :term => [\"png\"]], @datasets=Hamster::Vector[#, @options=Hamster::Hash[:pointtype => 6, :notitle => true, :with => \"points\"]>, # true, :with => \"lines\"]>], @cmd=\"plot \">" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "require 'distribution'\n", "\n", "observed = model_fit.residuals.sort\n", "n = observed.length\n", "theoretical = (1..n).to_a.map { |t| Distribution::Normal.p_value(t.to_f/n.to_f) * model_fit.sigma}\n", "qq_plot = Plot.new([[theoretical, observed], with: 'points', pointtype: 6, notitle: true],\n", " ['x', with: 'lines', notitle: true],\n", " xlabel: 'Normal theoretical quantiles', ylabel: 'Observed quantiles',\n", " title: 'Q-Q plot of the residuals')\n", "qq_plot.term('png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In general, we see that the model residuals satisfy the normality assumption to a reasonable extent." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Estimation results\n", "\n", "Let's look at the estimated model parameters, and see what they reveal about the data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Fixed effects\n", "\n", "Looking at the fixed effects coefficients it is striking that the estimate corresponding to the effect of the blog post text length is almost zero. Possibly, the length of a blog post has practically no effect on how many comments the blog post will receive." 
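, "\n", "\n", "To put the size of these estimates in perspective, the following minimal sketch (plain Ruby, no gems) converts the fixed effects estimates reported by `model_fit.fix_ef` into multiplicative effects on the number of comments. The step of 1000 units of text length is a hypothetical value chosen for illustration, and the conversion assumes the usual reading of a model with a log-transformed response: an additive change on the log scale corresponds to a multiplicative change on the original scale.

```ruby
# Fixed effects estimates copied from the model output above.
coef = {
  log_host_comments_avg: 0.7861228519075351,
  host_trackbacks_avg: 0.05579925647682859,
  length: 2.580329980620255e-05,
  has_parent_with_comments_lvl_yes: -0.48841082096511457
}

# Hypothetical increase in text length, chosen only for illustration.
length_step = 1000.0

# exp(coefficient * change) is the multiplicative change in comments.
length_factor = Math.exp(coef[:length] * length_step)
trackback_factor = Math.exp(coef[:host_trackbacks_avg])
parent_factor = Math.exp(coef[:has_parent_with_comments_lvl_yes])

# This predictor is itself log-transformed, so doubling the host average
# multiplies the number of comments by 2 raised to the coefficient.
doubling_factor = 2.0**coef[:log_host_comments_avg]

puts format('length +%d: factor %.3f', length_step, length_factor)
puts format('one more trackback on average: factor %.3f', trackback_factor)
puts format('commented parent post: factor %.3f', parent_factor)
puts format('host comment average doubled: factor %.3f', doubling_factor)
```

Under these assumptions, even a substantial change in text length shifts the typical number of comments by only a few percent, which agrees with the observation that the length coefficient is almost zero."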
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Obtained fixed effects coefficient estimates:\n", "{:intercept=>0.18944500002662124, :log_host_comments_avg=>0.7861228519075351, :host_trackbacks_avg=>0.05579925647682859, :length=>2.580329980620255e-05, :has_parent_with_comments_lvl_yes=>-0.48841082096511457}\n" ] } ], "source": [ "puts \"Obtained fixed effects coefficient estimates:\"\n", "puts model_fit.fix_ef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The signs of the obtained estimates imply that blog posts hosted on websites with a high average number of comments per blog post also tend to have more comments. Moreover, blog posts that have parent blog posts with comments tend to have fewer comments. \n", "The effects of the average number of trackbacks per blog post on the hosting website and the blog post length seem rather small in magnitude." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Based on the [Wald Z test statistics](https://en.wikipedia.org/wiki/Wald_test#Test_on_a_single_parameter), we can carry out hypothesis tests for each fixed effects term $\\beta_{i}$, testing the null $H_{0} : \\beta_{i} = 0$ against the alternative $H_{a} : \\beta_{i} \\neq 0$.\n", "\n", "*Note:* The Wald methods for $p$-values and confidence intervals are not absolutely trustworthy and should be treated with caution, as pointed out in [this blog post](http://agisga.github.io/MixedModels_p_values_and_CI/).\n", "\n", "The corresponding (approximate) $p$-values are obtained with:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "{:intercept=>0.0, :log_host_comments_avg=>0.0, :host_trackbacks_avg=>5.723421736547607e-12, :length=>0.0, :has_parent_with_comments_lvl_yes=>3.859734754030342e-08}" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } 
], "source": [ "model_fit.fix_ef_p(method: :wald)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, all $p$-values are tiny, suggesting that each predictor has a statistically significant linear relationship with the response variable.\n", "\n", "However, we should be careful with our conclusions. With such a large sample size (>20000 observations), even effects that are small in magnitude can yield very small $p$-values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at Wald confidence intervals for the fixed effects coefficient estimates, which are in general more informative than $p$-values." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table><tr><th colspan=4>Daru::DataFrame:47139052686000 rows: 5 cols: 3</th></tr>
<tr><th></th><th>lower95</th><th>upper95</th><th>coef</th></tr>
<tr><td>intercept</td><td>0.15161697616422587</td><td>0.2272730238890166</td><td>0.18944500002662124</td></tr>
<tr><td>log_host_comments_avg</td><td>0.7756904881432087</td><td>0.7965552156718614</td><td>0.7861228519075351</td></tr>
<tr><td>host_trackbacks_avg</td><td>0.039917962477039035</td><td>0.07168055047661814</td><td>0.05579925647682859</td></tr>
<tr><td>length</td><td>2.1719996776498967e-05</td><td>2.988660283590613e-05</td><td>2.580329980620255e-05</td></tr>
<tr><td>has_parent_with_comments_lvl_yes</td><td>-0.6625496425736548</td><td>-0.31427199935657435</td><td>-0.48841082096511457</td></tr></table>
" ], "text/plain": [ "\n", "#\n", " lower95 upper95 coef \n", " intercept 0.15161697 0.22727302 0.18944500 \n", "log_host_c 0.77569048 0.79655521 0.78612285 \n", "host_track 0.03991796 0.07168055 0.05579925 \n", " length 2.17199967 2.98866028 2.58032998 \n", "has_parent -0.6625496 -0.3142719 -0.4884108 \n" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conf_int = model_fit.fix_ef_conf_int(level: 0.95, method: :wald)\n", "ci = Daru::DataFrame.rows(conf_int.values, order: [:lower95, :upper95], index: model_fit.fix_ef_names)\n", "ci[:coef] = model_fit.fix_ef.values\n", "ci" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that none of the 95% confidence intervals contains 0, which suggests high statistical significance of the linear predictors (in agreement with the $p$-values above). Also, all of the intervals seem rather narrow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Random effects\n", "\n", "We can look at the obtained random effects estimates (the values $b_d$ from the above equation, where $d\\in\\{mo, tu, we, th, fr, sa, su\\}$)." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Obtained random effects coefficient estimates:\n", "{:intercept_fr=>0.0, :intercept_mo=>0.0, :intercept_sa=>0.0, :intercept_su=>0.0, :intercept_th=>0.0, :intercept_tu=>0.0, :intercept_we=>0.0}\n" ] } ], "source": [ "puts \"Obtained random effects coefficient estimates:\"\n", "puts model_fit.ran_ef" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also look at the estimated correlation structure of the random effects:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table><tr><th colspan=2>Daru::DataFrame:47139054735200 rows: 1 cols: 1</th></tr>
<tr><th></th><th>day</th></tr>
<tr><td>day</td><td>0.0</td></tr></table>
" ], "text/plain": [ "\n", "#\n", " day \n", " day 0.0 \n" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "model_fit.ran_ef_summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Interestingly, the estimates of the random effects coefficients and standard deviation are all zero!\n", "\n", "That is, we have a singular fit. Thus, our results imply that the variability between different days of the week is not large enough to justify non-zero random effects in this model. Practically, we can conclude that the day of the week on which a blog post is published has no effect on the number of comments that the blog post will receive in the first 24 hours." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusions\n", "\n", "This turned out to be an example of a degenerate model fit (random effects variance estimated to be zero), and we saw that `mixed_models` can handle degenerate fits very well. \n", "\n", "We also learned a lot about the blog post data from this analysis. Here are some of the findings.\n", "\n", "* Blog posts hosted on websites with a high average of comments per blog post, also tend to have more comments (obviously). 
\n", "\n", "* Blog posts that have parent blog posts with comments seem to have fewer comments (wonder why that is...).\n", "\n", "* The average number of trackbacks per blog post on the hosting website seems to have a very small effect on the number of comments of a given blog post.\n", "\n", "* The blog post text length has an extremely small positive effect on the number of comments.\n", "\n", "* All considered fixed effects predictor variables seem to have a significant influence on the number of comments that a blog post receives in the first 24 hours after publication (according to Wald Z tests).\n", "\n", "* The day of the week on which a blog post is published has practically no influence on the number of comments that the blog post will receive in the first 24 hours after publication." ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Ruby 2.3.1", "language": "ruby", "name": "ruby" }, "language_info": { "file_extension": ".rb", "mimetype": "application/x-ruby", "name": "ruby", "version": "2.3.1" } }, "nbformat": 4, "nbformat_minor": 0 }