{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biostat M280 Homework 3\n", "\n", "**Due ~~Friday, May 10 @ 11:59PM~~ Wednesday, May 15 @ 11:59PM**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q1 - Big $n$ linear regression\n", "\n", "People often think linear regression on a dataset with millions of observations is a big data problem. Now we learnt various methods for solving linear regression and should realize that, with right choice of algorithm, it is a problem that can be handled by any moderate computer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1(1)\n", "\n", "Download the flight data from . For this exercise, we only need data from years 2003-2008. If you are using Mac or Linux, you can run the following Julia script to download and unzip files for all years. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "using CodecBzip2\n", "\n", "for year in 2003:2008\n", " fn = string(year) * \".csv.bz2\"\n", " @info \"download $fn\"\n", " download(\"http://stat-computing.org/dataexpo/2009/$fn\", \"./$fn\")\n", " @info \"unzip $fn\"\n", " funzip = open(string(year) * \".csv\", \"w\")\n", " open(Bzip2DecompressorStream, fn) do stream\n", " write(funzip, stream)\n", " end\n", " close(funzip)\n", "end\n" ] } ], "source": [ ";cat getflightdata.jl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Please do **not** put data files in Git. For grading purpose (reproducibility), we will assume that data files are in the same directory as the Jupyter notebook file.\n", "\n", "Find out how many data points are in each year." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1(2) \n", "\n", "We are interested in how the time gain of a flight, defined as `DepDelay - ArrDelay`, depends on the distance traveled (`Distance`), departure delay (`DepDelay`), and carrier (`UniqueCarrier`). \n", "\n", "We want to fit a linear regression `Gain ~ 1 + Distance + DepDelay + UniqueCarrier` using data from 2003-2008. Note `UniqueCarrier` is a factor with 23 levels: \"9E\", \"AA\", \"AQ\", \"AS\", \"B6\", \"CO\", \"DH\", \"DL\", \"EV\", \"F9\", \"FL\", \"HA\", \"HP\", \"MQ\", \"NW\", \"OH\", \"OO\", \"TZ\", \"UA\", \"US\", \"WN\", \"XE\", and \"YV\". We use the dummy coding with \"9E\" as base level.\n", "\n", "Will the design matrix $\\mathbf{X}$ (in double precision) fit into the memory of you computer? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1(3)\n", "\n", "Review the [Summary of Linear Regression](http://hua-zhou.github.io/teaching/biostatm280-2019spring/slides/13-linreg/linreg.html) and choose one method in the table to solve the linear regression.\n", "\n", "Report the estimated regression coefficients $\\widehat \\beta$, estimated variance $\\widehat \\sigma^2 = \\sum_i (y_i - \\widehat y_i)^2 / (n - p)$, and standard errors of $\\widehat \\beta$.\n", "\n", "Hint: It took my laptop less than 3 minutes to import data and fit linear regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1(4)\n", "\n", "Go to your resume/cv and claim you have experience performing analytics on data with millions of observations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample code\n", "\n", "Following code explores the data in 2003 and generates the design matrix and responses for that year. Feel free to use the code in your solution." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 1.1.0\n", "Commit 80516ca202 (2019-01-21 21:24 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin14.5.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-6.0.1 (ORCJIT, skylake)\n", "Environment:\n", " JULIA_EDITOR = code\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Print first 10 lines of 2003 data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay\n", "2003,1,29,3,1651,1655,1912,1913,UA,1017,N202UA,141,138,119,-1,-4,ORD,MSY,837,5,17,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,30,4,1654,1655,1910,1913,UA,1017,N311UA,136,138,108,-3,-1,ORD,MSY,837,2,26,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,31,5,1724,1655,1936,1913,UA,1017,N317UA,132,138,110,23,29,ORD,MSY,837,5,17,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,1,3,1033,1035,1625,1634,UA,1018,N409UA,232,239,215,-9,-2,OAK,ORD,1835,6,11,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,2,4,1053,1035,1726,1634,UA,1018,N496UA,273,239,214,52,18,OAK,ORD,1835,13,46,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,3,5,1031,1035,1640,1634,UA,1018,N412UA,249,239,223,6,-4,OAK,ORD,1835,13,13,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,4,6,1031,1035,1626,1634,UA,1018,N455UA,235,239,219,-8,-4,OAK,ORD,1835,5,11,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,5,7,1035,1035,1636,1634,UA,1018,N828UA,241,239,227,2,0,OAK,ORD,1835,5,9,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,6,1,1031,1035,1653,1634,UA,1018,N453UA,262,239,241,19,-4,OAK,ORD,1835,7,14,0,NA,0,NA,NA,NA,NA,NA\n" ] } ], "source": [ ";head 2003.csv" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6488541" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how many data points\n", "countlines(\"2003.csv\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 37.143025 seconds (434.77 M allocations: 17.350 GiB, 15.11% gc time)\n" ] }, { "data": { "text/plain": [ "Table with 6488540 rows, 4 columns:\n", "DepDelay ArrDelay UniqueCarrier Distance\n", "───────────────────────────────────────────\n", "-4 -1 \"UA\" 837\n", "-1 -3 \"UA\" 837\n", "29 23 \"UA\" 837\n", "-2 -9 \"UA\" 1835\n", "18 52 \"UA\" 1835\n", "-4 6 \"UA\" 1835\n", "-4 -8 \"UA\" 1835\n", "0 2 \"UA\" 1835\n", "-4 19 \"UA\" 1835\n", "3 4 \"UA\" 413\n", "-4 -23 \"UA\" 413\n", "-3 -19 \"UA\" 413\n", "⋮\n", "missing missing \"DL\" 1891\n", "29 62 \"DL\" 581\n", "39 66 \"DL\" 1891\n", "26 27 \"DL\" 1678\n", "114 134 \"DL\" 946\n", "44 53 \"DL\" 813\n", "16 47 \"DL\" 432\n", "50 54 \"DL\" 432\n", "-3 -5 \"DL\" 453\n", "3 3 \"DL\" 689\n", "-1 -1 \"DL\" 1929" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# import data from csv\n", "using JuliaDB\n", "@time yrtable = loadtable(\"2003.csv\", \n", " datacols = [\"DepDelay\", \"ArrDelay\", \"UniqueCarrier\", \"Distance\"])" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Table with 6375689 rows, 4 columns:\n", "DepDelay ArrDelay UniqueCarrier Distance\n", "───────────────────────────────────────────\n", "-4 -1 \"UA\" 837\n", "-1 -3 \"UA\" 837\n", "29 23 \"UA\" 837\n", "-2 -9 \"UA\" 1835\n", "18 52 \"UA\" 1835\n", "-4 6 \"UA\" 1835\n", "-4 -8 \"UA\" 1835\n", "0 2 \"UA\" 1835\n", "-4 19 \"UA\" 1835\n", "3 4 \"UA\" 413\n", "-4 -23 \"UA\" 413\n", "-3 -19 \"UA\" 413\n", "⋮\n", "70 66 \"DL\" 1891\n", "29 62 \"DL\" 581\n", "39 66 \"DL\" 1891\n", "26 27 \"DL\" 1678\n", "114 134 \"DL\" 946\n", "44 53 \"DL\" 813\n", "16 47 \"DL\" 432\n", "50 54 \"DL\" 432\n", "-3 -5 \"DL\" 453\n", "3 3 \"DL\" 689\n", "-1 -1 \"DL\" 1929" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop rows with missing values\n", "yrtable = dropmissing(yrtable)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "generate_xy (generic function with 1 method)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# mapping from variable names to X columns\n", "# carrier \"9E\" is used as base level\n", "const var2col = Dict(\n", " \"Intercept\" => 1,\n", " \"Distance\" => 2,\n", " \"DepDelay\" => 3,\n", " \"AA\" => 4,\n", " \"AQ\" => 5,\n", " \"AS\" => 6,\n", " \"B6\" => 7,\n", " \"CO\" => 8,\n", " \"DH\" => 9,\n", " \"DL\" => 10,\n", " \"EV\" => 11,\n", " \"F9\" => 12,\n", " \"FL\" => 13,\n", " \"HA\" => 14,\n", " \"HP\" => 15,\n", " \"MQ\" => 16,\n", " \"NW\" => 17,\n", " \"OH\" => 18,\n", " \"OO\" => 19,\n", " \"TZ\" => 20,\n", " \"UA\" => 21,\n", " \"US\" => 22,\n", " \"WN\" => 23,\n", " \"XE\" => 24,\n", " \"YV\" => 25,\n", " \"Gain\" => 26)\n", "# mapping from column to variable names\n", "const col2var = Dict(value => key for (key, value) in var2col)\n", "\n", "# a custom function to generate [X y] from data table\n", "function generate_xy(tbl::IndexedTable)\n", " # X matrix\n", " XY = zeros(length(tbl), 26)\n", " # intercept term\n", " XY[:, 1] .= 1\n", " # Distance term\n", " XY[:, 2] = columns(tbl, :Distance)\n", " # DepDelay term\n", " XY[:, 3] = columns(tbl, :DepDelay)\n", " # Dummy coding for airline\n", " @inbounds for i in 1:length(tbl)\n", " tbl[i][:UniqueCarrier] == \"9E\" && continue # base level\n", " XY[i, var2col[tbl[i][:UniqueCarrier]]] = 1\n", " end\n", " # last column is response: gain = depdelay - arrdelay\n", " XY[:, 26] = select(tbl, \n", " (:DepDelay, :ArrDelay) => p -> Float64(p.DepDelay - p.ArrDelay))\n", " # return\n", " XY\n", "end" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 1.536713 seconds (1.45 M allocations: 1.355 GiB, 17.07% gc time)\n" ] }, { "data": { "text/plain": [ "6375689×26 Array{Float64,2}:\n", " 1.0 837.0 -4.0 0.0 0.0 0.0 … 0.0 1.0 0.0 0.0 0.0 0.0 -3.0\n", " 1.0 837.0 -1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 2.0\n", " 1.0 837.0 29.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 6.0\n", " 1.0 1835.0 -2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 7.0\n", " 1.0 1835.0 18.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 -34.0\n", " 1.0 1835.0 -4.0 0.0 0.0 0.0 … 0.0 1.0 0.0 0.0 0.0 0.0 -10.0\n", " 1.0 1835.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 4.0\n", " 1.0 1835.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 -2.0\n", " 1.0 1835.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 -23.0\n", " 1.0 413.0 3.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 -1.0\n", " 1.0 413.0 -4.0 0.0 0.0 0.0 … 0.0 1.0 0.0 0.0 0.0 0.0 19.0\n", " 1.0 413.0 -3.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 16.0\n", " 1.0 413.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 12.0\n", " ⋮ ⋮ ⋱ ⋮ ⋮ \n", " 1.0 406.0 104.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -4.0\n", " 1.0 1891.0 70.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 4.0\n", " 1.0 581.0 29.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -33.0\n", " 1.0 1891.0 39.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 -27.0\n", " 1.0 1678.0 26.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0\n", " 1.0 946.0 114.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -20.0\n", " 1.0 813.0 44.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -9.0\n", " 1.0 432.0 16.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -31.0\n", " 1.0 432.0 50.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 -4.0\n", " 1.0 453.0 -3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0\n", " 1.0 689.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0\n", " 1.0 1929.0 -1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@time xy = generate_xy(yrtable)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q2 - Google PageRank\n", "\n", "We are going to try different numerical methods learnt in class on the [Google PageRank problem](https://en.wikipedia.org/wiki/PageRank)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2(1)\n", "\n", "Let $\\mathbf{A} \\in \\{0,1\\}^{n \\times n}$ be the connectivity matrix of $n$ web pages with entries\n", "$$\n", "\\begin{eqnarray*}\n", "\ta_{ij}= \\begin{cases}\n", "\t1 & \\text{if page $i$ links to page $j$} \\\\\n", "\t0 & \\text{otherwise}\n", "\t\\end{cases}.\n", "\\end{eqnarray*}\n", "$$\n", "$r_i = \\sum_j a_{ij}$ is the out-degree of page $i$. That is $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules. \n", "\n", "- From a page $i$ with $r_i>0$\n", " * with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page \n", " * with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page \n", "- From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page \n", " \n", "The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\\mathbf{P}$ of the Markov chain as a sparse matrix plus rank 1 matrix." ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "### Q2(2)\n", "\n", "According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\\mathbf{x} = (x_1,\\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\\mathbf{P}$. That is $\\mathbf{P}^T \\mathbf{x} = \\mathbf{x}$. Therefore $\\mathbf{x}$ can be solved as an eigen-problem. Show that it can also be cast as solving a linear system. Since the row sums of $\\mathbf{P}$ are 1, $\\mathbf{P}$ is rank deficient. We can replace the first equation by the $\\sum_{i=1}^n x_i = 1$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2(3)\n", "\n", "Obtain the connectivity matrix `A` from the `SNAP/web-Google` data in the MatrixDepot package. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "include group.jl for user defined matrix generators\n", "verify download of index files...\n", "used remote site is https://sparse.tamu.edu/?per_page=All\n", "populating internal database...\n" ] } ], "source": [ "using MatrixDepot" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "916428×916428 SparseArrays.SparseMatrixCSC{Bool,Int64} with 5105039 stored entries:\n", " [11343 , 1] = true\n", " [11928 , 1] = true\n", " [15902 , 1] = true\n", " [29547 , 1] = true\n", " [30282 , 1] = true\n", " [31301 , 1] = true\n", " [38717 , 1] = true\n", " [43930 , 1] = true\n", " [46275 , 1] = true\n", " [48193 , 1] = true\n", " [50823 , 1] = true\n", " [56911 , 1] = true\n", " ⋮\n", " [618730, 916427] = true\n", " [622998, 916427] = true\n", " [673046, 916427] = true\n", " [716616, 916427] = true\n", " [720325, 916427] = true\n", " [772226, 916427] = true\n", " [785097, 916427] = true\n", " [788476, 916427] = true\n", " [822938, 916427] = true\n", " [833616, 916427] = true\n", " [417498, 916428] = true\n", " [843845, 916428] = true" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "md = mdopen(\"SNAP/web-Google\")\n", "md.A" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# I found I need to run following line at end of notebook to avoid a bug\n", "Base.Filesystem.rm(dirname(pathof(MatrixDepot)) * \"/../data/db.data\", force=true)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute summary statistics:\n", "* number of pages\n", "* number of edges\n", "* number of dangling nodes (pages with no out links)\n", "* histogram of in-degrees\n", "* which page has max in-degree?\n", "* histogram of out-degrees\n", "* which page has max out-degree?\n", "* visualize the sparsity pattern of $\\mathbf{A}$\n", "\n", "Hint: For plots, you can use the [UnicodePlots.jl](https://github.com/Evizero/UnicodePlots.jl) package." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2(4)\n", "\n", "Set the _teleportation_ parameter at $p = 0.85$. Try the following methods to solve the PageRank problem using the `SNAP/web-Google` data.\n", "\n", "1. A dense linear system solver such as LU decomposition (if possible). \n", "2. An iterative linear system solver such as [GMRES](https://juliamath.github.io/IterativeSolvers.jl/dev/linear_systems/gmres/). \n", "3. A dense eigen-solver (if possible). \n", "4. An iterative eigen-solver such as [Arnoldi method](https://julialinearalgebra.github.io/Arpack.jl/stable/). \n", "\n", "For iterative methods, you can use the [`IterativeSolvers.jl`](https://github.com/JuliaMath/IterativeSolvers.jl) and [`Arpack.j`](https://github.com/JuliaLinearAlgebra/Arpack.jl) packages. Make sure to utilize the special structure of $\\mathbf{P}$ (sparse + rank 1) to speed up the matrix-vector multiplication. (Hint: The [LinearMaps.jl](https://github.com/Jutho/LinearMaps.jl) packages exists exactly for this purpose.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2(5)\n", "\n", "List the top 20 nodes you found." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2(6)\n", "\n", "As of Monday Apr 29, 2019, there are at least 5.49 billion indexed webpages on internet according to . Explain whether each of these methods works for the PageRank problem at this scale." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2(7)\n", "\n", "Go to your resume/cv and claim you have experience performing analysis on a network of one million nodes." ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.1.0", "language": "julia", "name": "julia-1.1" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.1.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "87px", "width": "252px" }, "navigate_menu": true, "number_sections": false, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 2 }