{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Estimating simple regression models in Julia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By Tyler Ransom, Duke University Social Science Research Institute\n", "\n", "tyler.ransom@duke.edu\n", "\n", "https://github.com/tyleransom" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With our knowledge of the `DataFrames` package, we can use Julia to estimate common regression models which are useful in summarizing and describing data.\n", "\n", "First, let's call the package we'll need for this demonstration. We'll be using Julia version 0.4.1 with `GLM` version 0.5.0 and `DataFrames` version 0.6.10." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "using GLM\n", "using DataFrames" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## OLS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start by loading in the same CSV file from our `DataFrames` example: (available at https://github.com/jmxpearson/duke-julia-ssri-2016/auto.csv)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "74x12 DataFrames.DataFrame\n", "| Col # | Name | Eltype | Missing |\n", "|-------|--------------|------------|---------|\n", "| 1 | make | UTF8String | 0 |\n", "| 2 | price | Int64 | 0 |\n", "| 3 | mpg | Int64 | 0 |\n", "| 4 | rep78 | Int64 | 5 |\n", "| 5 | headroom | Float64 | 0 |\n", "| 6 | trunk | Int64 | 0 |\n", "| 7 | weight | Int64 | 0 |\n", "| 8 | length | Int64 | 0 |\n", "| 9 | turn | Int64 | 0 |\n", "| 10 | displacement | Int64 | 0 |\n", "| 11 | gear_ratio | Float64 | 0 |\n", "| 12 | foreign | Int64 | 0 |" ] } ], "source": [ "auto = readtable(\"auto.csv\");\n", "showcols(auto)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Now let's use the `GLM` package to estimate a simple ordinary least squares regression. Let's use `:price` as the dependent variable and `:mpg` and `:weight` as the indepenent variables.\n", "\n", "The syntax for estimating an OLS model with the GLM package is:\n", "\n", "`object = glm(Y~X,data,Normal(),IdentityLink())`\n", "where\n", "- `object` is the Julia object that stores all of the results\n", "- `Y` is the dependent variable\n", "- `X` is a linear combination of independent variables\n", "- `data` is the name of the data frame that holds the data\n", "- And the last two options can be adjusted to estimate other models (which we'll get to later)\n", "\n", "Users with experience in `R` will recognize the tight correspondence between Julia's GLM estimation syntax and `R`'s." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let's estimate our model now:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:\n", "\n", "Coefficients:\n", " Estimate Std.Error z value Pr(>|z|)\n", "(Intercept) 1946.07 3597.05 0.541018 0.5885\n", "mpg -49.5122 86.156 -0.574681 0.5655\n", "weight 1.74656 0.641354 2.72324 0.0065\n" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olsest = glm(price~mpg+weight,auto,Normal(),IdentityLink())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We see the reported coefficients, standard errors, z-values, and corresponding p-values for each covariate included in the model. Note that the intercept was automatically included." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Additionally, we can extract these objects, and others, by the following syntax:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3-element Array{Float64,1}:\n", " 1946.07 \n", " -49.5122 \n", " 1.74656" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "coef(olsest)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3-element Array{Float64,1}:\n", " 3597.05 \n", " 86.156 \n", " 0.641354" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "stderr(olsest)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "3x2 Array{Float64,2}:\n", " 1946.07 3597.05 \n", " -49.5122 86.156 \n", " 1.74656 0.641354" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[coef(olsest) stderr(olsest)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Additional objects that we can extract from the stored estimates object include:\n", "- `deviance()`, which is the weighted sum of squares of the residuals\n", "- `vcov()`, which is the estimated variance-covariance matrix of the parameters\n", "- `predict()`, which obtains predicted values of the outcome for each observation" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "4.487441163821706e8" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "deviance(olsest)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "74-element Array{Float64,1}:\n", " 5974.22\n", " 6955.33\n", " 5467.72\n", " 6632.14\n", " 8329.35\n", " 7464.72\n", " 4553.58\n", " 6684.54\n", " 7930.52\n", " 6943.64\n", " 8815.5 \n", " 8064.48\n", " 8399.05\n", " ⋮ \n", " 3918.89\n", " 7226.13\n", " 3854.95\n", " 3793.59\n", " 5264.06\n", " 4253.62\n", " 5718.16\n", " 4579.86\n", " 3479.05\n", " 4079.12\n", " 4183.92\n", " 6640.95" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pricehat = predict(olsest)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "74x2 Array{Float64,2}:\n", " 4099.0 5974.22\n", " 4749.0 6955.33\n", " 3799.0 5467.72\n", " 4816.0 6632.14\n", " 7827.0 8329.35\n", " 5788.0 7464.72\n", " 4453.0 4553.58\n", " 5189.0 6684.54\n", " 10372.0 7930.52\n", " 4082.0 6943.64\n", " 11385.0 8815.5 \n", " 14500.0 8064.48\n", " 15906.0 8399.05\n", " ⋮ \n", " 3995.0 3918.89\n", " 12990.0 7226.13\n", " 3895.0 3854.95\n", " 3798.0 3793.59\n", " 5899.0 5264.06\n", " 3748.0 4253.62\n", " 5719.0 5718.16\n", " 7140.0 4579.86\n", " 5397.0 3479.05\n", " 4697.0 4079.12\n", " 6850.0 4183.92\n", " 11995.0 6640.95" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[convert(Array,auto[:,:price]) pricehat]" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "3x3 Array{Float64,2}:\n", " 1.29388e7 -2.9276e5 -2191.9 \n", " -2.9276e5 7422.86 44.6017 \n", " -2191.9 44.6017 0.411335" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vcov(olsest)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "3x2 Array{Float64,2}:\n", " 3597.05 3597.05 \n", " 86.156 86.156 \n", " 0.641354 0.641354" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[sqrt(diag(vcov(olsest))) stderr(olsest)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Binary response models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will estimate parameters associated with a binary dependent variable. The two canonical models that we will discuss here are the logistic regression model (logit) and the probit model. The two models usually give similar results, and discussing their differences is beyond the current scope." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Logit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The logit model is called using the `Binomial()` and `LogitLink()` options in the `glm()` function:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial,GLM.LogitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:\n", "\n", "Coefficients:\n", " Estimate Std.Error z value Pr(>|z|)\n", "(Intercept) 13.7084 4.51742 3.03456 0.0024\n", "mpg -0.168587 0.0919032 -1.8344 0.0666\n", "weight -0.0039067 0.00101117 -3.86356 0.0001\n" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "logitest = glm(foreign~mpg+weight,auto,Binomial(),LogitLink())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here, we can predict the probability that a given vehicle is from a foreign maker and compare that with what was observed." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "74x2 Array{Float64,2}:\n", " 0.0 0.190436 \n", " 0.0 0.0957767 \n", " 0.0 0.422082 \n", " 0.0 0.0862626 \n", " 0.0 0.00849477\n", " 0.0 0.0249945 \n", " 0.0 0.648662 \n", " 0.0 0.0774615 \n", " 0.0 0.0155653 \n", " 0.0 0.0585486 \n", " 0.0 0.00380411\n", " 0.0 0.0200754 \n", " 0.0 0.00136983\n", " ⋮ \n", " 1.0 0.714123 \n", " 1.0 0.117869 \n", " 1.0 0.898059 \n", " 1.0 0.449941 \n", " 1.0 0.778794 \n", " 1.0 0.471888 \n", " 1.0 0.560431 \n", " 1.0 0.800974 \n", " 1.0 0.236247 \n", " 1.0 0.875856 \n", " 1.0 0.848046 \n", " 1.0 0.176266 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[convert(Array,auto[:,:foreign]) predict(logitest)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Probit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similar to the logit case, the probit model is called using the `Binomial()` and `ProbitLink()` options in the `glm()` function:" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial,GLM.ProbitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:\n", "\n", "Coefficients:\n", " Estimate Std.Error z value Pr(>|z|)\n", "(Intercept) 8.27532 2.57855 3.20929 0.0013\n", "mpg -0.103948 0.0542063 -1.91763 0.0552\n", "weight -0.00233551 0.000556961 -4.19331 <1e-4\n" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "probitest = glm(foreign~mpg+weight,auto,Binomial(),ProbitLink())" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "74x2 Array{Float64,2}:\n", " 0.0 0.196393 \n", " 0.0 0.0941283 \n", " 0.0 0.429645 \n", " 0.0 0.0816519 \n", " 0.0 0.00245574 \n", " 0.0 0.0151149 \n", " 0.0 0.642254 \n", " 0.0 0.0715818 \n", " 0.0 0.0071502 \n", " 0.0 0.0504584 \n", " 0.0 0.000496129\n", " 0.0 0.0110559 \n", " 0.0 4.30191e-5 \n", " ⋮ \n", " 1.0 0.702837 \n", " 1.0 0.121525 \n", " 1.0 0.902976 \n", " 1.0 0.440127 \n", " 1.0 0.781031 \n", " 1.0 0.466058 \n", " 1.0 0.566884 \n", " 1.0 0.799495 \n", " 1.0 0.226333 \n", " 1.0 0.878817 \n", " 1.0 0.848251 \n", " 1.0 0.185297 " ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "[convert(Array,auto[:,:foreign]) predict(probitest)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Factor variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Suppose you have a categorical variable (e.g. ethnicity, or gender) that you want to include as an independent variable in your model.\n", "\n", "Several existing software packages have this capability. In Stata, use `i.gender`; in `R` use `gender <- as.factor(gender)`.\n", "\n", "In Julia, factor levels are created using what is called a \"pooled data array.\"\n", "\n", "Let's see how they work by estimating an OLS model on the `auto` data frame, using the `:rep78` variable, which takes on categorical values 1 through 5." ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:\n", "\n", "Coefficients:\n", " Estimate Std.Error z value Pr(>|z|)\n", "(Intercept) -1939.87 3622.35 -0.535528 0.5923\n", "mpg -52.2172 83.7396 -0.623567 0.5329\n", "weight 2.11149 0.619008 3.41108 0.0006\n", "rep78 820.812 320.899 2.55786 0.0105\n" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olsest2 = glm(price~mpg+weight+rep78,auto,Normal(),IdentityLink())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here, the `glm()` function thinks that `:rep78` is a continuous variable. In order to \"pool\" this variable, we use the following syntax:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "pool!(auto,:rep78);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once the pooling is done, we can look at the levels of the pooled vector using the `levels()` function:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "5-element Array{Int64,1}:\n", " 1\n", " 2\n", " 3\n", " 4\n", " 5" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "levels(auto[:rep78])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now when we re-run this regression, we should expect to obtain four different coefficient estimates representing the four uniquely identifiable categories of `:rep78`." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "collapsed": false, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Normal,GLM.IdentityLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Float64}:\n", "\n", "Coefficients:\n", " Estimate Std.Error z value Pr(>|z|)\n", "(Intercept) -598.967 3960.9 -0.15122 0.8798\n", "mpg -63.0971 87.4528 -0.721499 0.4706\n", "weight 2.09307 0.636901 3.28633 0.0010\n", "rep78 - 2 753.702 1919.76 0.392602 0.6946\n", "rep78 - 3 1349.36 1772.71 0.761187 0.4465\n", "rep78 - 4 2030.47 1810.09 1.12175 0.2620\n", "rep78 - 5 3376.91 1900.17 1.77716 0.0755\n" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "olsest3 = glm(price~mpg+weight+rep78,auto,Normal(),IdentityLink())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is indeed the case." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Pooling imported data automatically" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It is possible to automatically pool the data frame upon import using the `makefactors` option of `readtable`:" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "collapsed": true, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "auto2 = readtable(\"auto.csv\",makefactors=true);" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, this functionality only works with string variables, so any numerical categorical variables must be pooled manually." ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.4.3", "language": "julia", "name": "julia-0.4" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }