{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biostat 257 Homework 3\n", "\n", "**Due May 14 @ 11:59PM**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Julia Version 1.6.0\n", "Commit f9720dc2eb (2021-03-24 12:55 UTC)\n", "Platform Info:\n", " OS: macOS (x86_64-apple-darwin19.6.0)\n", " CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz\n", " WORD_SIZE: 64\n", " LIBM: libopenlibm\n", " LLVM: libLLVM-11.0.1 (ORCJIT, skylake)\n", "Environment:\n", " JULIA_EDITOR = code\n", " JULIA_NUM_THREADS = 4\n" ] } ], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q1. Big $n$ linear regression\n", "\n", "People often think linear regression on a dataset with millions of observations is a big data problem. Now we learnt various methods for solving linear regression and should realize that, with right choice of algorithm, it is a problem that can be handled by any moderate computer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1.1 Download data (10pts)\n", "\n", "Download the flight data from . **Do not put data files in Git.** You will lose points if you do. For grading purpose (reproducibility), we will assume that data files are in a subfolder `flights`." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "total 3095984\n", "-rw-r--r--@ 1 huazhou wheel 12652442 Apr 29 13:24 1987.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 49499025 Apr 29 13:24 1988.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 49202298 Apr 29 13:24 1989.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 52041322 Apr 29 13:24 1990.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 49877448 Apr 29 13:25 1991.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 50040946 Apr 29 13:25 1992.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 50111774 Apr 29 13:25 1993.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 51123887 Apr 29 13:25 1994.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 74881752 Apr 29 13:26 1995.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 75887707 Apr 29 13:26 1996.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 76705687 Apr 29 13:26 1997.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 76683506 Apr 29 13:27 1998.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 79449438 Apr 29 13:27 1999.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 82537924 Apr 29 13:27 2000.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 83478700 Apr 29 13:28 2001.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 75907218 Apr 29 13:29 2002.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 95326801 Apr 29 13:20 2003.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 110825331 Apr 29 13:21 2004.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 112450321 Apr 29 13:21 2005.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 115019195 Apr 29 13:22 2006.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 121249243 Apr 29 13:23 2007.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 39277452 Apr 29 13:23 2008.csv.bz2\n", "-rw-r--r--@ 1 huazhou wheel 244438 Apr 29 13:24 airports.csv\n", "-rw-r--r--@ 1 huazhou wheel 43758 Apr 29 13:24 carriers.csv\n", "-rw-r--r--@ 1 huazhou wheel 428796 Apr 29 13:24 plane-data.csv\n", "-rw-r--r--@ 1 huazhou wheel 1091 Apr 29 13:24 variable-descriptions.csv\n" ] } ], "source": [ ";ls -l flights" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find out how many data points are in each year." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1.2 (10 pts) Problem size\n", "\n", "We are interested in how the arrival delay of a flight, `ArrDelay`, depends on the distance traveled (`Distance`), departure delay (`DepDelay`), weekday (`DayOfWeek`), and airline (`UniqueCarrier`). \n", "\n", "We want to fit a linear regression `ArrDelay ~ 1 + Distance + DepDelay + DayOfWeek + UniqueCarrier` using data from 1987-2008. Treat `DayOfWeek` as a factor with 6 levels. We use the dummy coding with `1` (Monday) as the base level. Treat `UniqueCarrier` as a factor with 8 levels: \"AA\", \"AS\", \"CO\", \"DL\", \"NW\", \"UA\", \"US\", and \"WN\". We use the dummy coding with \"AA\" as the base level.\n", "\n", "Will the design matrix $\\mathbf{X}$ (in double precision) fit into the memory of your computer? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1.3 (30 pts) Choose algorithm\n", "\n", "Assume your computer has limited memory, say only 1GB. Review the [Summary of Linear Regression](https://ucla-biostat-257-2021spring.github.io/slides/15-linreg/linreg.html) and choose one method in the table to solve the linear regression.\n", "\n", "Report the estimated regression coefficients $\\widehat \\beta$, estimated variance $\\widehat \\sigma^2 = \\sum_i (y_i - \\widehat y_i)^2 / (n - p)$, and standard errors of $\\widehat \\beta$.\n", "\n", "Hint: It took my laptop about 10-11 minutes to import data and fit linear regression." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q1.4 Be proud of yourself\n", "\n", "Go to your resume/cv and claim you have experience performing analytics on data with 100 million observations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample code\n", "\n", "Following code explores the data in 2003 and generates the design matrix and responses for that year. Feel free to use the code in your solution." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "200" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using CodecBzip2, CSV, DataFrames, Distributions, LinearAlgebra, \n", "Serialization, StatsModels, SweepOperator\n", "ENV[\"COLUMNS\"] = 200" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay\n", "2003,1,29,3,1651,1655,1912,1913,UA,1017,N202UA,141,138,119,-1,-4,ORD,MSY,837,5,17,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,30,4,1654,1655,1910,1913,UA,1017,N311UA,136,138,108,-3,-1,ORD,MSY,837,2,26,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,31,5,1724,1655,1936,1913,UA,1017,N317UA,132,138,110,23,29,ORD,MSY,837,5,17,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,1,3,1033,1035,1625,1634,UA,1018,N409UA,232,239,215,-9,-2,OAK,ORD,1835,6,11,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,2,4,1053,1035,1726,1634,UA,1018,N496UA,273,239,214,52,18,OAK,ORD,1835,13,46,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,3,5,1031,1035,1640,1634,UA,1018,N412UA,249,239,223,6,-4,OAK,ORD,1835,13,13,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,4,6,1031,1035,1626,1634,UA,1018,N455UA,235,239,219,-8,-4,OAK,ORD,1835,5,11,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,5,7,1035,1035,1636,1634,UA,1018,N828UA,241,239,227,2,0,OAK,ORD,1835,5,9,0,NA,0,NA,NA,NA,NA,NA\n", "2003,1,6,1,1031,1035,1653,1634,UA,1018,N453UA,262,239,241,19,-4,OAK,ORD,1835,7,14,0,NA,0,NA,NA,NA,NA,NA\n" ] }, { "data": { "text/plain": [ "Base.ProcessChain(Base.Process[Process(`\u001b[4mbunzip2\u001b[24m \u001b[4m-c\u001b[24m \u001b[4mflights/2003.csv.bz2\u001b[24m`, ProcessSignaled(13)), Process(`\u001b[4mhead\u001b[24m`, ProcessExited(0))], Base.DevNull(), Base.DevNull(), Base.DevNull())" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Print first 10 lines of 2003 data.\n", "run(pipeline(`bunzip2 -c flights/2003.csv.bz2`, `head`))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6488541" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how many data points in 2003?\n", "open(\"flights/2003.csv.bz2\", \"r\") do io\n", " countlines(Bzip2DecompressorStream(io))\n", "end" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# # figure out which airlines appear in each year of 1987-2008\n", "# airlines = Vector{Vector{String}}(undef, 22)\n", "# @time for year in 1987:2008\n", "# println(\"year $year\")\n", "# filename = \"flights/\" * string(year) * \".csv.bz2\"\n", "# df = open(filename, \"r\") do io\n", "# CSV.File(\n", "# Bzip2DecompressorStream(io),\n", "# select = [\"UniqueCarrier\"],\n", "# types = Dict(\"UniqueCarrier\" => String),\n", "# missingstring = \"NA\"\n", "# ) |> DataFrame\n", "# end\n", "# airlines[year - 1986] = unique(df[!, :UniqueCarrier])\n", "# end\n", "# intersect(airlines...) |> sort" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 32.675195 seconds (39.79 M allocations: 4.854 GiB, 2.26% gc time, 24.62% compilation time)\n" ] }, { "data": { "text/html": [ "
<p>6,488,540 rows × 5 columns: DayOfWeek (UInt8), UniqueCarrier (String), ArrDelay (Float64?), DepDelay (Float64?), Distance (UInt16); see the plain-text display below.</p>
" ], "text/latex": [ "\\begin{tabular}{r|ccccc}\n", "\t& DayOfWeek & UniqueCarrier & ArrDelay & DepDelay & Distance\\\\\n", "\t\\hline\n", "\t& UInt8 & String & Float64? & Float64? & UInt16\\\\\n", "\t\\hline\n", "\t1 & 3 & UA & -1.0 & -4.0 & 837 \\\\\n", "\t2 & 4 & UA & -3.0 & -1.0 & 837 \\\\\n", "\t3 & 5 & UA & 23.0 & 29.0 & 837 \\\\\n", "\t4 & 3 & UA & -9.0 & -2.0 & 1835 \\\\\n", "\t5 & 4 & UA & 52.0 & 18.0 & 1835 \\\\\n", "\t6 & 5 & UA & 6.0 & -4.0 & 1835 \\\\\n", "\t7 & 6 & UA & -8.0 & -4.0 & 1835 \\\\\n", "\t8 & 7 & UA & 2.0 & 0.0 & 1835 \\\\\n", "\t9 & 1 & UA & 19.0 & -4.0 & 1835 \\\\\n", "\t10 & 3 & UA & 4.0 & 3.0 & 413 \\\\\n", "\t11 & 4 & UA & -23.0 & -4.0 & 413 \\\\\n", "\t12 & 5 & UA & -19.0 & -3.0 & 413 \\\\\n", "\t13 & 6 & UA & -12.0 & 0.0 & 413 \\\\\n", "\t14 & 7 & UA & 64.0 & 82.0 & 413 \\\\\n", "\t15 & 1 & UA & -4.0 & 0.0 & 413 \\\\\n", "\t16 & 2 & UA & -8.0 & 2.0 & 413 \\\\\n", "\t17 & 3 & UA & -21.0 & -4.0 & 413 \\\\\n", "\t18 & 4 & UA & -27.0 & -4.0 & 413 \\\\\n", "\t19 & 5 & UA & -16.0 & -3.0 & 413 \\\\\n", "\t20 & 6 & UA & -16.0 & -2.0 & 413 \\\\\n", "\t21 & 7 & UA & -24.0 & -6.0 & 413 \\\\\n", "\t22 & 1 & UA & -12.0 & 4.0 & 413 \\\\\n", "\t23 & 2 & UA & -11.0 & -3.0 & 413 \\\\\n", "\t24 & 3 & UA & -9.0 & -1.0 & 413 \\\\\n", "\t25 & 4 & UA & -10.0 & 1.0 & 413 \\\\\n", "\t26 & 5 & UA & -10.0 & -5.0 & 413 \\\\\n", "\t27 & 6 & UA & -23.0 & -4.0 & 413 \\\\\n", "\t28 & 7 & UA & -13.0 & -8.0 & 413 \\\\\n", "\t29 & 1 & UA & -25.0 & -1.0 & 413 \\\\\n", "\t30 & 2 & UA & -20.0 & -5.0 & 413 \\\\\n", "\t$\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ & $\\dots$ \\\\\n", "\\end{tabular}\n" ], "text/plain": [ "\u001b[1m6488540×5 DataFrame\u001b[0m\n", "\u001b[1m Row \u001b[0m│\u001b[1m DayOfWeek \u001b[0m\u001b[1m UniqueCarrier \u001b[0m\u001b[1m ArrDelay \u001b[0m\u001b[1m DepDelay \u001b[0m\u001b[1m Distance \u001b[0m\n", "\u001b[1m \u001b[0m│\u001b[90m UInt8 \u001b[0m\u001b[90m String \u001b[0m\u001b[90m Float64? \u001b[0m\u001b[90m Float64? 
\u001b[0m\u001b[90m UInt16 \u001b[0m\n", "─────────┼──────────────────────────────────────────────────────────\n", " 1 │ 3 UA -1.0 -4.0 837\n", " 2 │ 4 UA -3.0 -1.0 837\n", " 3 │ 5 UA 23.0 29.0 837\n", " 4 │ 3 UA -9.0 -2.0 1835\n", " 5 │ 4 UA 52.0 18.0 1835\n", " 6 │ 5 UA 6.0 -4.0 1835\n", " 7 │ 6 UA -8.0 -4.0 1835\n", " 8 │ 7 UA 2.0 0.0 1835\n", " 9 │ 1 UA 19.0 -4.0 1835\n", " 10 │ 3 UA 4.0 3.0 413\n", " 11 │ 4 UA -23.0 -4.0 413\n", " ⋮ │ ⋮ ⋮ ⋮ ⋮ ⋮\n", " 6488531 │ 5 DL 62.0 29.0 581\n", " 6488532 │ 5 DL 66.0 39.0 1891\n", " 6488533 │ 5 DL 27.0 26.0 1678\n", " 6488534 │ 5 DL 134.0 114.0 946\n", " 6488535 │ 5 DL 53.0 44.0 813\n", " 6488536 │ 5 DL 47.0 16.0 432\n", " 6488537 │ 5 DL 54.0 50.0 432\n", " 6488538 │ 5 DL -5.0 -3.0 453\n", " 6488539 │ 5 DL 3.0 3.0 689\n", " 6488540 │ 5 DL -1.0 -1.0 1929\n", "\u001b[36m 6488519 rows omitted\u001b[0m" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# load 2003 data into DataFrame\n", "@time df = open(\"flights/2003.csv.bz2\", \"r\") do io\n", " CSV.File(\n", " Bzip2DecompressorStream(io), \n", " select = [\"DayOfWeek\", \"UniqueCarrier\", \"ArrDelay\", \n", " \"DepDelay\", \"Distance\"],\n", " types = Dict(\n", " \"DayOfWeek\" => UInt8,\n", " \"UniqueCarrier\" => String, \n", " \"ArrDelay\" => Float64, \n", " \"DepDelay\" => Float64, \n", " \"Distance\" => UInt16\n", " ),\n", " missingstring = \"NA\"\n", " ) |> DataFrame\n", "end" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6488540" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# how many rows?\n", "size(df, 1)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "6375689" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# drop rows with missing values\n", "dropmissing!(df)\n", "size(df, 1)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4230285" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# filter out rows not in the airline list\n", "airlines = [\"AA\", \"AS\", \"CO\", \"DL\", \"NW\", \"UA\", \"US\", \"WN\"]\n", "filter!(row -> row[:UniqueCarrier] ∈ airlines, df)\n", "size(df, 1)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ModelFrame{NamedTuple{(:ArrDelay, :DayOfWeek, :Distance, :DepDelay, :UniqueCarrier), Tuple{Vector{Float64}, Vector{UInt8}, Vector{UInt16}, Vector{Float64}, Vector{String}}}, StatisticalModel}(ArrDelay ~ 1 + DayOfWeek + Distance + DepDelay + UniqueCarrier, StatsModels.Schema with 5 entries:\n", " DayOfWeek => DayOfWeek\n", " Distance => Distance\n", " UniqueCarrier => UniqueCarrier\n", " ArrDelay => ArrDelay\n", " DepDelay => DepDelay\n", ", (ArrDelay = [-1.0, -3.0, 23.0, -9.0, 52.0, 6.0, -8.0, 2.0, 19.0, 4.0 … 62.0, 66.0, 27.0, 134.0, 53.0, 47.0, 54.0, -5.0, 3.0, -1.0], DayOfWeek = UInt8[0x03, 0x04, 0x05, 0x03, 0x04, 0x05, 0x06, 0x07, 0x01, 0x03 … 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05, 0x05], Distance = UInt16[0x0345, 0x0345, 0x0345, 0x072b, 0x072b, 0x072b, 0x072b, 0x072b, 0x072b, 0x019d … 0x0245, 0x0763, 0x068e, 0x03b2, 0x032d, 0x01b0, 0x01b0, 0x01c5, 0x02b1, 0x0789], DepDelay = [-4.0, -1.0, 29.0, -2.0, 18.0, -4.0, -4.0, 0.0, -4.0, 3.0 … 29.0, 39.0, 26.0, 114.0, 44.0, 16.0, 50.0, -3.0, 3.0, -1.0], UniqueCarrier = [\"UA\", \"UA\", \"UA\", 
\"UA\", \"UA\", \"UA\", \"UA\", \"UA\", \"UA\", \"UA\" … \"DL\", \"DL\", \"DL\", \"DL\", \"DL\", \"DL\", \"DL\", \"DL\", \"DL\", \"DL\"]), StatisticalModel)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# model frame for year 2003\n", "mf = ModelFrame(\n", " @formula(ArrDelay ~ 1 + DayOfWeek + Distance + DepDelay + UniqueCarrier), \n", " df,\n", " contrasts = Dict(\n", " :DayOfWeek => StatsModels.DummyCoding(base = 1, levels = UInt8.(1:7)),\n", " :UniqueCarrier => StatsModels.DummyCoding(\n", " base = \"AA\", \n", " levels = [\"AA\", \"AS\", \"CO\", \"DL\", \"NW\", \"UA\", \"US\", \"WN\"]\n", " )\n", " )\n", ")" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4230285×16 Matrix{Float64}:\n", " 1.0 0.0 1.0 0.0 0.0 0.0 0.0 837.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 1.0 0.0 0.0 0.0 837.0 -1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 837.0 29.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1835.0 -2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1835.0 18.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1835.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 0.0 1.0 0.0 1835.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1835.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1835.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 1.0 0.0 0.0 0.0 0.0 413.0 3.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 1.0 0.0 0.0 0.0 413.0 -4.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 413.0 -3.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 0.0 1.0 0.0 413.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0\n", " ⋮ ⋮ ⋮ ⋮\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 406.0 104.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1891.0 70.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 581.0 29.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1891.0 39.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1678.0 26.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 946.0 114.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 813.0 44.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 432.0 16.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 432.0 50.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 453.0 -3.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 689.0 3.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0\n", " 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1929.0 -1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# generate the covariate matrix X for year 2003\n", "X = modelmatrix(mf)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4230285-element Vector{Float64}:\n", " -1.0\n", " -3.0\n", " 23.0\n", " -9.0\n", " 52.0\n", " 6.0\n", " -8.0\n", " 2.0\n", " 19.0\n", " 4.0\n", " -23.0\n", " -19.0\n", " -12.0\n", " ⋮\n", " 108.0\n", " 66.0\n", " 62.0\n", " 66.0\n", " 27.0\n", " 134.0\n", " 53.0\n", " 47.0\n", " 54.0\n", " -5.0\n", " 3.0\n", " -1.0" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# generate the response vector Y for year 2003\n", "y = df[!, :ArrDelay]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q2. 
PageRank\n", "\n", "We are going to try different numerical methods learnt in class on the [Google PageRank problem](https://en.wikipedia.org/wiki/PageRank)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.1 (5 pts) Recognize structure\n", "\n", "Let $\\mathbf{A} \\in \\{0,1\\}^{n \\times n}$ be the connectivity matrix of $n$ web pages with entries\n", "$$\n", "\\begin{eqnarray*}\n", "\ta_{ij}= \\begin{cases}\n", "\t1 & \\text{if page $i$ links to page $j$} \\\\\n", "\t0 & \\text{otherwise}\n", "\t\\end{cases}.\n", "\\end{eqnarray*}\n", "$$\n", "$r_i = \\sum_j a_{ij}$ is the out-degree of page $i$. That is $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules. \n", "\n", "- From a page $i$ with $r_i>0$\n", " * with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page \n", " * with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page \n", "- From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page \n", " \n", "The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\\mathbf{P}$ of the Markov chain as a sparse matrix plus rank 1 matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.2 Relate to numerical linear algebra\n", "\n", "According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\\mathbf{x} = (x_1,\\ldots,x_n)^T$ of the Markov chain. $x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is more visited. It is well-known that $\\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\\mathbf{P}$. That is $\\mathbf{P}^T \\mathbf{x} = \\mathbf{x}$. Therefore $\\mathbf{x}$ can be solved as an **eigen-problem**. It can also be cast as **solving a linear system**. Since the row sums of $\\mathbf{P}$ are 1, $\\mathbf{P}$ is rank deficient. We can replace the first equation by the $\\sum_{i=1}^n x_i = 1$.\n", "\n", "Hint: For iterative solvers, we don't need to replace the 1st equation. We can use the matrix $\\mathbf{I} - \\mathbf{P}^T$ directly if we start with a vector with all positive entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.3 (10 pts) Explore data\n", "\n", "Obtain the connectivity matrix `A` from the `SNAP/web-Google` data in the MatrixDepot package. 
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "include group.jl for user defined matrix generators\n", "verify download of index files...\n", "reading database\n", "adding metadata...\n", "adding svd data...\n", "writing database\n", "used remote sites are sparse.tamu.edu with MAT index and math.nist.gov with HTML index\n" ] }, { "data": { "text/latex": [ "\\section{SNAP/web-Google}\n", "\\subsubparagraph{MatrixMarket matrix coordinate pattern general}\n", "\\rule{\\textwidth}{1pt}\n", "\\begin{itemize}\n", "\\item UF Sparse Matrix Collection, Tim Davis\n", "\n", "\n", "\\item http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google\n", "\n", "\n", "\\item name: SNAP/web-Google\n", "\n", "\n", "\\item [Web graph from Google]\n", "\n", "\n", "\\item id: 2301\n", "\n", "\n", "\\item date: 2002\n", "\n", "\n", "\\item author: Google\n", "\n", "\n", "\\item ed: J. Leskovec\n", "\n", "\n", "\\item fields: name title A id date author ed kind notes\n", "\n", "\n", "\\item kind: directed graph\n", "\n", "\\end{itemize}\n", "\\rule{\\textwidth}{1pt}\n", "\\begin{itemize}\n", "\\item notes:\n", "\n", "\n", "\\item Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets, \n", "\n", "\n", "\\item Jure Leskovec http://snap.stanford.edu/data/index.html \n", "\n", "\n", "\\item email jure at cs.stanford.edu \n", "\n", "\n", "\\item \n", "\\item Google web graph \n", "\n", "\n", "\\item \n", "\\item Dataset information \n", "\n", "\n", "\\item \n", "\\item Nodes represent web pages and directed edges represent hyperlinks between them.\n", "\n", "\n", "\\item The data was released in 2002 by Google as a part of Google Programming \n", "\n", "\n", "\\item Contest. \n", "\n", "\n", "\\item \n", "\\item Dataset statistics \n", "\n", "\n", "\\item Nodes 875713 \n", "\n", "\n", "\\item Edges 5105039 \n", "\n", "\n", "\\item Nodes in largest WCC 855802 (0.977) \n", "\n", "\n", "\\item Edges in largest WCC 5066842 (0.993) \n", "\n", "\n", "\\item Nodes in largest SCC 434818 (0.497) \n", "\n", "\n", "\\item Edges in largest SCC 3419124 (0.670) \n", "\n", "\n", "\\item Average clustering coefficient 0.6047 \n", "\n", "\n", "\\item Number of triangles 13391903 \n", "\n", "\n", "\\item Fraction of closed triangles 0.05523 \n", "\n", "\n", "\\item Diameter (longest shortest path) 22 \n", "\n", "\n", "\\item 90-percentile effective diameter 8.1 \n", "\n", "\n", "\\item \n", "\\item Source (citation) \n", "\n", "\n", "\\item \n", "\\item J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large \n", "\n", "\n", "\\item Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.\n", "\n", "\n", "\\item arXiv.org:0810.1355, 2008. 
\n", "\n", "\n", "\\item \n", "\\item Google programming contest, 2002 \n", "\n", "\n", "\\item http://www.google.com/programming-contest/ \n", "\n", "\n", "\\item \n", "\\item Files \n", "\n", "\n", "\\item File Description \n", "\n", "\n", "\\item web-Google.txt.gz Webgraph from the Google programming contest, 2002 \n", "\n", "\\end{itemize}\n", "\\rule{\\textwidth}{1pt}\n", "916428 916428 5105039\n", "\n" ], "text/markdown": [ "# SNAP/web-Google\n", "\n", "###### MatrixMarket matrix coordinate pattern general\n", "\n", "---\n", "\n", " * UF Sparse Matrix Collection, Tim Davis\n", " * http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google\n", " * name: SNAP/web-Google\n", " * [Web graph from Google]\n", " * id: 2301\n", " * date: 2002\n", " * author: Google\n", " * ed: J. Leskovec\n", " * fields: name title A id date author ed kind notes\n", " * kind: directed graph\n", "\n", "---\n", "\n", " * notes:\n", " * Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets,\n", " * Jure Leskovec http://snap.stanford.edu/data/index.html\n", " * email jure at cs.stanford.edu\n", " * \n", " * Google web graph\n", " * \n", " * Dataset information\n", " * \n", " * Nodes represent web pages and directed edges represent hyperlinks between them.\n", " * The data was released in 2002 by Google as a part of Google Programming\n", " * Contest.\n", " * \n", " * Dataset statistics\n", " * Nodes 875713\n", " * Edges 5105039\n", " * Nodes in largest WCC 855802 (0.977)\n", " * Edges in largest WCC 5066842 (0.993)\n", " * Nodes in largest SCC 434818 (0.497)\n", " * Edges in largest SCC 3419124 (0.670)\n", " * Average clustering coefficient 0.6047\n", " * Number of triangles 13391903\n", " * Fraction of closed triangles 0.05523\n", " * Diameter (longest shortest path) 22\n", " * 90-percentile effective diameter 8.1\n", " * \n", " * Source (citation)\n", " * \n", " * J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large\n", " * Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.\n", " * arXiv.org:0810.1355, 2008.\n", " * \n", " * Google programming contest, 2002\n", " * http://www.google.com/programming-contest/\n", " * \n", " * Files\n", " * File Description\n", " * web-Google.txt.gz Webgraph from the Google programming contest, 2002\n", "\n", "---\n", "\n", "916428 916428 5105039\n" ], "text/plain": [ "\u001b[1m SNAP/web-Google\u001b[22m\n", "\u001b[1m ≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡≡\u001b[22m\n", "\n", "\u001b[1m MatrixMarket matrix coordinate pattern general\u001b[22m\n", "\n", " ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "\n", " • UF Sparse Matrix Collection, Tim Davis\n", "\n", " • http://www.cise.ufl.edu/research/sparse/matrices/SNAP/web-Google\n", "\n", " • name: SNAP/web-Google\n", "\n", " • [Web graph from Google]\n", "\n", " • id: 2301\n", "\n", " • date: 2002\n", "\n", " • author: Google\n", "\n", " • ed: J. 
Leskovec\n", "\n", " • fields: name title A id date author ed kind notes\n", "\n", " • kind: directed graph\n", "\n", " ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "\n", " • notes:\n", "\n", " • Networks from SNAP (Stanford Network Analysis Platform) Network Data Sets,\n", "\n", " • Jure Leskovec http://snap.stanford.edu/data/index.html\n", "\n", " • email jure at cs.stanford.edu\n", "\n", " • \n", "\n", " • Google web graph\n", "\n", " • \n", "\n", " • Dataset information\n", "\n", " • \n", "\n", " • Nodes represent web pages and directed edges represent hyperlinks between them.\n", "\n", " • The data was released in 2002 by Google as a part of Google Programming\n", "\n", " • Contest.\n", "\n", " • \n", "\n", " • Dataset statistics\n", "\n", " • Nodes 875713\n", "\n", " • Edges 5105039\n", "\n", " • Nodes in largest WCC 855802 (0.977)\n", "\n", " • Edges in largest WCC 5066842 (0.993)\n", "\n", " • Nodes in largest SCC 434818 (0.497)\n", "\n", " • Edges in largest SCC 3419124 (0.670)\n", "\n", " • Average clustering coefficient 0.6047\n", "\n", " • Number of triangles 13391903\n", "\n", " • Fraction of closed triangles 0.05523\n", "\n", " • Diameter (longest shortest path) 22\n", "\n", " • 90-percentile effective diameter 8.1\n", "\n", " • \n", "\n", " • Source (citation)\n", "\n", " • \n", "\n", " • J. Leskovec, K. Lang, A. Dasgupta, M. Mahoney. Community Structure in Large\n", "\n", " • Networks: Natural Cluster Sizes and the Absence of Large Well-Defined Clusters.\n", "\n", " • arXiv.org:0810.1355, 2008.\n", "\n", " • \n", "\n", " • Google programming contest, 2002\n", "\n", " • http://www.google.com/programming-contest/\n", "\n", " • \n", "\n", " • Files\n", "\n", " • File Description\n", "\n", " • web-Google.txt.gz Webgraph from the Google programming contest, 2002\n", "\n", " ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "\n", " 916428 916428 5105039" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using MatrixDepot\n", "\n", "md = mdopen(\"SNAP/web-Google\")\n", "# display documentation for the SNAP/web-Google data\n", "mdinfo(md)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "916428×916428 SparseArrays.SparseMatrixCSC{Bool, Int64} with 5105039 stored entries:\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", 
"⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿\n", "⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿⣿" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# connectivity matrix\n", "A = md.A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute summary statistics:\n", "* How much memory does `A` take? If converted to a `Matrix{Float64}` (don't do it!), how much memory will it take? \n", "* number of web pages\n", "* number of edges (web links). \n", "* number of dangling nodes (pages with no out links)\n", "* histogram of in-degrees \n", "* list the top 20 pages with the largest in-degrees? \n", "* histogram of out-degrees\n", "* which the top 20 pages with the largest out-degrees?\n", "* visualize the sparsity pattern of $\\mathbf{A}$ or a submatrix of $\\mathbf{A}$ say `A[1:10000, 1:10000]`. \n", "\n", "**Hint**: For plots, you can use the [UnicodePlots.jl](https://github.com/Evizero/UnicodePlots.jl) package (`spy`, `histogram`, etc), which is fast for large data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.4 (5 pts) Dense linear algebra? \n", "\n", "Consider the following methods to obtain the page ranks of the `SNAP/web-Google` data. \n", "\n", "1. A dense linear system solver such as LU decomposition. \n", "2. A dense eigen-solver for asymmetric matrix. \n", "\n", "For the LU approach, estimate (1) the memory usage and (2) how long it will take assuming that the LAPACK functions can achieve the theoretical throughput of your computer. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.5 Iterative solvers\n", "\n", "Set the _teleportation_ parameter at $p = 0.85$. Consider the following methods for solving the PageRank problem. \n", "\n", "1. An iterative linear system solver such as GMRES. \n", "2. An iterative eigen-solver such as Arnoldi method.\n", "\n", "For iterative methods, we have many choices in Julia. See a list of existing Julia packages for linear solvers at this [page](https://jutho.github.io/KrylovKit.jl/stable/#Package-features-and-alternatives-1). The start-up code below uses the [KrylovKit.jl](https://github.com/Jutho/KrylovKit.jl) package. You can use other packages if you prefer. Make sure to utilize the special structure of $\\mathbf{P}$ (sparse + rank 1) to speed up the matrix-vector multiplication. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1 (15 pts)\n", "\n", "Let's implement a type `PageRankImPt` that mimics the matrix $\\mathbf{M} = \\mathbf{I} - \\mathbf{P}^T$. For iterative methods, all we need to provide are methods for evaluating $\\mathbf{M} \\mathbf{v}$ and $\\mathbf{M}^T \\mathbf{v}$ for arbitrary vector $\\mathbf{v}$." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "using BenchmarkTools, LinearAlgebra, SparseArrays, Revise\n", "\n", "# a type for the matrix M = I - P^T in PageRank problem\n", "struct PageRankImPt{TA <: Number, IA <: Integer, T <: AbstractFloat} <: AbstractMatrix{T}\n", " A :: SparseMatrixCSC{TA, IA} # adjacency matrix\n", " telep :: T\n", " # working arrays\n", " # TODO: whatever intermediate arrays you may want to pre-allocate\n", "end\n", "\n", "# constructor\n", "function PageRankImPt(A::SparseMatrixCSC, telep::T) where T <: AbstractFloat\n", " n = size(A, 1)\n", " # TODO: initialize and pre-allocate arrays\n", " PageRankImPt(A, telep)\n", "end\n", "\n", "LinearAlgebra.issymmetric(::PageRankImPt) = false\n", "Base.size(M::PageRankImPt) = size(M.A)\n", "# TODO: implement this function for evaluating M[i, j]\n", "Base.getindex(M::PageRankImPt, i, j) = M.telep\n", "\n", "# overwrite `out` by `(I - Pt) * v`\n", "function LinearAlgebra.mul!(\n", " out :: Vector{T}, \n", " M :: PageRankImPt{<:Number, <:Integer, T}, \n", " v :: Vector{T}) where T <: AbstractFloat\n", " # TODO: implement mul!(out, M, v)\n", " sleep(1e-2) # wait 10 ms as if your code takes 1ms\n", " return out\n", "end\n", "\n", "# overwrite `out` by `(I - P) * v`\n", "function LinearAlgebra.mul!(\n", " out :: Vector{T}, \n", " Mt :: Transpose{T, PageRankImPt{TA, IA, T}}, \n", " v :: Vector{T}) where {TA<:Number, IA<:Integer, T <: AbstractFloat}\n", " M = Mt.parent\n", " # TODO: implement mul!(out, transpose(M), v)\n", " sleep(1e-2) # wait 10 ms as if your code takes 1ms\n", " out\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check correctness. Note \n", "$$\n", "\\mathbf{M}^T \\mathbf{1} = \\mathbf{0}\n", "$$\n", "and\n", "$$\n", "\\mathbf{M} \\mathbf{x} = \\mathbf{0}\n", "$$\n", "for stationary distribution $\\mathbf{x}$.\n", "\n", "Download the solution file `pgrksol.csv.gz`. **Do not put this file in your Git**. You will lose points if you do. You can add a line `pgrksol.csv.gz` to your `.gitignore` file." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "916428-element Vector{Float64}:\n", " 3.3783428216975054e-5\n", " 2.0710155392568165e-6\n", " 3.663065984832893e-6\n", " 7.527510785028837e-7\n", " 8.63328599674051e-7\n", " 1.769418252415541e-6\n", " 2.431230382883396e-7\n", " 6.368417180141445e-7\n", " 4.744973703681939e-7\n", " 2.6895486110647536e-7\n", " 3.18574314847409e-6\n", " 7.375106374416742e-7\n", " 2.431230382883396e-7\n", " ⋮\n", " 1.1305006040148547e-6\n", " 4.874825281822915e-6\n", " 3.167946973112519e-6\n", " 9.72688040308568e-7\n", " 6.588614479285245e-7\n", " 7.737011774300648e-7\n", " 2.431230382883396e-7\n", " 1.6219204214797293e-6\n", " 3.912130060551738e-7\n", " 2.431230382883396e-7\n", " 7.296033831163157e-6\n", " 6.330939996912478e-7" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using CodecZlib, DelimitedFiles\n", "\n", "isfile(\"pgrksol.csv.gz\") || download(\"https://raw.githubusercontent.com/ucla-biostat-257-2021spring/ucla-biostat-257-2021spring.github.io/master/hw/hw3/pgrksol.csv.gz\")\n", "xsol = open(\"pgrksol.csv.gz\", \"r\") do io\n", " vec(readdlm(GzipDecompressorStream(io)))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You will lose all 35 points (Steps 1 and 2)** if the following statements throw AssertError." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "M = PageRankImPt(A, 0.85)\n", "n = size(M, 1)\n", "\n", "@assert transpose(M) * ones(n) ≈ zeros(n)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "@assert M * xsol ≈ zeros(n)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2 (20 pts)\n", "\n", "We want to benchmark the hot functions `mul!` to make sure they are efficient and allocate little memory." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 144 bytes\n", " allocs estimate: 5\n", " --------------\n", " minimum time: 10.274 ms (0.00% GC)\n", " median time: 12.783 ms (0.00% GC)\n", " mean time: 12.790 ms (0.00% GC)\n", " maximum time: 14.319 ms (0.00% GC)\n", " --------------\n", " samples: 355\n", " evals/sample: 1" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "M = PageRankImPt(A, 0.85)\n", "n = size(M, 1)\n", "v, out = ones(n), zeros(n)\n", "bm_mv = @benchmark mul!($out, $M, $v) setup=(fill!(out, 0); fill!(v, 1))" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "BenchmarkTools.Trial: \n", " memory estimate: 144 bytes\n", " allocs estimate: 5\n", " --------------\n", " minimum time: 10.146 ms (0.00% GC)\n", " median time: 12.720 ms (0.00% GC)\n", " mean time: 12.644 ms (0.00% GC)\n", " maximum time: 14.118 ms (0.00% GC)\n", " --------------\n", " samples: 359\n", " evals/sample: 1" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bm_mtv = @benchmark mul!($out, $(transpose(M)), $v) setup=(fill!(out, 0); fill!(v, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will lose 1 point for each 100 bytes memory allocation. So the points you will get is" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "17.12" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clamp(10 - median(bm_mv).memory / 100, 0, 10) + \n", "clamp(10 - median(bm_mtv).memory / 100, 0, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint**: My median run times are 30-40 ms and memory allocations are 0 bytes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3 (20 pts)\n", "\n", "Let's first try to solve the PageRank problem by the GMRES method for solving linear equations. 
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(value = ([3.5373439728225696e-5, 1.1625074089088258e-6, 7.4732619144138796e-6, 6.642899479479004e-7, 9.964349219218506e-7, 2.6571597917916015e-6, 1.660724869869751e-7, 6.642899479479004e-7, 3.321449739739502e-7, 3.321449739739502e-7 … 2.989304765765552e-6, 1.3285798958958008e-6, 2.4910873048046265e-6, 4.982174609609253e-7, 1.660724869869751e-7, 2.4910873048046265e-6, 3.321449739739502e-7, 1.660724869869751e-7, 7.4732619144138796e-6, 4.982174609609253e-7], ConvergenceInfo: one converged value after 0 iterations and 1 applications of the linear map;\n", "norms of residuals are given by (0.0,).\n", "), time = 0.068877865, bytes = 27636729, gctime = 0.017346028, gcstats = Base.GC_Diff(27636729, 3, 0, 79377, 3, 0, 17346028, 1, 0))" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using KrylovKit\n", "\n", "# normalize in-degrees to be the start point\n", "x0 = vec(sum(A, dims = 1)) .+ 1.0\n", "x0 ./= sum(x0)\n", "\n", "# right hand side\n", "b = zeros(n)\n", "\n", "# warm up (compilation)\n", "linsolve(M, b, x0, issymmetric = false, isposdef = false, maxiter = 1) \n", "# output is complex eigenvalue/eigenvector\n", "(x_gmres, info), time_gmres, = @timed linsolve(M, b, x0, issymmetric = false, isposdef = false)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check correctness. **You will lose all 20 points if the following statement throws `AssertError`.**" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "ename": "LoadError", "evalue": "AssertionError: norm(x_gmres - xsol) < 1.0e-8", "output_type": "error", "traceback": [ "AssertionError: norm(x_gmres - xsol) < 1.0e-8", "", "Stacktrace:", " [1] top-level scope", " @ In[24]:1", " [2] eval", " @ ./boot.jl:360 [inlined]", " [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", " @ Base ./loading.jl:1094" ] } ], "source": [ "@assert norm(x_gmres - xsol) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GMRES should be reasonably fast. The points you'll get is" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20.0" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clamp(20 / time_gmres * 20, 0, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint**: My runtime is about 7-8 seconds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4 (20 pts)\n", "\n", "Let's first try to solve the PageRank problem by the Arnoldi method for solving eigen problems. 
" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(value = ([0.0], [[0.0057161240427034184, 0.0001878538417789856, 0.0012076318400077645, 0.00010734505244513462, 0.00016101757866770192, 0.00042938020978053847, 2.6836263111283654e-5, 0.00010734505244513462, 5.367252622256731e-5, 5.367252622256731e-5 … 0.00048305273600310583, 0.00021469010489026923, 0.00040254394666925487, 8.050878933385096e-5, 2.6836263111283654e-5, 0.00040254394666925487, 5.367252622256731e-5, 2.6836263111283654e-5, 0.0012076318400077645, 8.050878933385096e-5]], ConvergenceInfo: one converged value after 1 iterations and 1 applications of the linear map;\n", "norms of residuals are given by (0.0,).\n", "), time = 0.089409929, bytes = 34596553, gctime = 0.0152803, gcstats = Base.GC_Diff(34596553, 4, 0, 79217, 16, 0, 15280300, 1, 0))" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# warm up (compilation)\n", "eigsolve(M, x0, 1, :SR, issymmetric = false, maxiter = 1)\n", "# output is complex eigenvalue/eigenvector\n", "(vals, vecs, info), time_arnoldi, = @timed eigsolve(M, x0, 1, :SR, issymmetric = false)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check correctness. **You will lose all 20 points if the following statement throws `AssertError`.**" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "@assert abs(Real(vals[1])) < 1e-8" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "ename": "LoadError", "evalue": "AssertionError: norm(x_arnoldi - xsol) < 1.0e-8", "output_type": "error", "traceback": [ "AssertionError: norm(x_arnoldi - xsol) < 1.0e-8", "", "Stacktrace:", " [1] top-level scope", " @ In[28]:3", " [2] eval", " @ ./boot.jl:360 [inlined]", " [3] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)", " @ Base ./loading.jl:1094" ] } ], "source": [ "x_arnoldi = abs.(Real.(vecs[1]))\n", "x_arnoldi ./= sum(x_arnoldi)\n", "@assert norm(x_arnoldi - xsol) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arnoldi should be reasonably fast. The points you'll get is" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20.0" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clamp(20 / time_arnoldi * 20, 0, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint**: My runtime is about 11-12 seconds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.6 (5 pts) Results\n", "\n", "List the top 20 pages you found and their corresponding PageRank score. Do they match the top 20 pages ranked according to in-degrees? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2.7 Be proud of yourself\n", "\n", "Go to your resume/cv and claim you have experience performing analysis on a network of one million nodes." 
] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.6.0", "language": "julia", "name": "julia-1.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.6.0" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "87px", "width": "252px" }, "navigate_menu": true, "number_sections": false, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }