{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Biostat 257 Homework 4\n", "\n", "**Due May 20 @ 11:59PM**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "versioninfo()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are going to try different numerical methods learned in class on the [Google PageRank problem](https://en.wikipedia.org/wiki/PageRank)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q1 (5 pts) Recognize structure\n", "\n", "Let $\\mathbf{A} \\in \\{0,1\\}^{n \\times n}$ be the connectivity matrix of $n$ web pages with entries\n", "$$\n", "\\begin{eqnarray*}\n", "\ta_{ij}= \\begin{cases}\n", "\t1 & \\text{if page $i$ links to page $j$} \\\\\n", "\t0 & \\text{otherwise}\n", "\t\\end{cases}.\n", "\\end{eqnarray*}\n", "$$\n", "$r_i = \\sum_j a_{ij}$ is the out-degree of page $i$. That is, $r_i$ is the number of links on page $i$. Imagine a random surfer exploring the space of $n$ pages according to the following rules. \n", "\n", "- From a page $i$ with $r_i>0$\n", " * with probability $p$, (s)he randomly chooses a link on page $i$ (uniformly) and follows that link to the next page \n", " * with probability $1-p$, (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page \n", "- From a page $i$ with $r_i=0$ (a dangling page), (s)he randomly chooses one page from the set of all $n$ pages (uniformly) and proceeds to that page \n", " \n", "The process defines a Markov chain on the space of $n$ pages. Write the transition matrix $\\mathbf{P}$ of the Markov chain as a sparse matrix plus a rank-1 matrix." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q2 Relate to numerical linear algebra\n", "\n", "According to standard Markov chain theory, the (random) position of the surfer converges to the stationary distribution $\\mathbf{x} = (x_1,\\ldots,x_n)^T$ of the Markov chain. 
$x_i$ has the natural interpretation of the proportion of times the surfer visits page $i$ in the long run. Therefore $\\mathbf{x}$ serves as page ranks: a higher $x_i$ means page $i$ is visited more often. It is well known that $\\mathbf{x}$ is the left eigenvector corresponding to the top eigenvalue 1 of the transition matrix $\\mathbf{P}$. That is, $\\mathbf{P}^T \\mathbf{x} = \\mathbf{x}$. Therefore $\\mathbf{x}$ can be solved as an **eigen-problem**. It can also be cast as **solving a linear system**. Since the row sums of $\\mathbf{P}$ are 1, $\\mathbf{P}$ is rank deficient. We can replace the first equation by $\\sum_{i=1}^n x_i = 1$.\n", "\n", "Hint: For iterative solvers, we don't need to replace the first equation. We can use the matrix $\\mathbf{I} - \\mathbf{P}^T$ directly if we start with a vector with all positive entries." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q3 (10 pts) Explore data\n", "\n", "Obtain the connectivity matrix `A` from the `SNAP/web-Google` data in the MatrixDepot package. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "using MatrixDepot\n", "\n", "md = mdopen(\"SNAP/web-Google\")\n", "# display documentation for the SNAP/web-Google data\n", "mdinfo(md)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# connectivity matrix\n", "A = md.A" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute summary statistics:\n", "* How much memory does `A` take? If converted to a `Matrix{Float64}` (don't do it!), how much memory will it take? \n", "* number of web pages\n", "* number of edges (web links) \n", "* number of dangling nodes (pages with no out links)\n", "* histogram of in-degrees \n", "* list the top 20 pages with the largest in-degrees 
\n", "* histogram of out-degrees\n", "* which are the top 20 pages with the largest out-degrees?\n", "* visualize the sparsity pattern of $\\mathbf{A}$ or a submatrix of $\\mathbf{A}$, say `A[1:10000, 1:10000]`. \n", "\n", "**Hint**: For plots, you can use the [UnicodePlots.jl](https://github.com/Evizero/UnicodePlots.jl) package (`spy`, `histogram`, etc.), which is fast for large data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q4 (5 pts) Dense linear algebra? \n", "\n", "Consider the following methods to obtain the page ranks of the `SNAP/web-Google` data. \n", "\n", "1. A dense linear system solver such as LU decomposition. \n", "2. A dense eigen-solver for an asymmetric matrix. \n", "\n", "For the LU approach, estimate (1) the memory usage and (2) how long it will take, assuming that the LAPACK functions can achieve the theoretical throughput of your computer. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q5 (75 pts) Iterative solvers\n", "\n", "Set the _teleportation_ parameter at $p = 0.85$. Consider the following methods for solving the PageRank problem. \n", "\n", "1. An iterative linear system solver such as GMRES. \n", "2. An iterative eigen-solver such as the Arnoldi method.\n", "\n", "For iterative methods, we have many choices in Julia. See a list of existing Julia packages for linear solvers at this [page](https://jutho.github.io/KrylovKit.jl/stable/#Package-features-and-alternatives-1). The start-up code below uses the [KrylovKit.jl](https://github.com/Jutho/KrylovKit.jl) package. You can use other packages if you prefer. Make sure to utilize the special structure of $\\mathbf{P}$ (sparse + rank-1) to speed up the matrix-vector multiplication. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 1 (15 pts)\n", "\n", "Let's implement a type `PageRankImPt` that mimics the matrix $\\mathbf{M} = \\mathbf{I} - \\mathbf{P}^T$. 
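\n",
"\n",
"Exploiting the sparse-plus-rank-1 structure, the action of $\\mathbf{P}^T$ on a vector splits into a sparse term plus a uniform (rank-1) term. Below is a minimal sketch on a toy 4-page graph; the split into `z` and `c` is one possible arrangement, not the required implementation:\n",
"\n",
"```julia\n",
"using LinearAlgebra, SparseArrays\n",
"\n",
"p = 0.85\n",
"A = sparse([0 1 1 0; 1 0 1 0; 0 0 0 1; 0 0 0 0.0])  # page 4 is dangling\n",
"n = size(A, 1)\n",
"r = vec(sum(A, dims = 2))  # out-degrees\n",
"v = rand(n)\n",
"\n",
"# sparse term: scale v[i] by p / r[i] on non-dangling pages\n",
"z = [r[i] > 0 ? p * v[i] / r[i] : 0.0 for i in 1:n]\n",
"# rank-1 term: every page receives an equal share c / n\n",
"c = sum(r[i] > 0 ? (1 - p) * v[i] : v[i] for i in 1:n)\n",
"Mv = v .- A' * z .- c / n  # (I - P^T) * v\n",
"```\n",
"\n",
"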
For iterative methods, all we need to provide are methods for evaluating $\\mathbf{M} \\mathbf{v}$ and $\\mathbf{M}^T \\mathbf{v}$ for an arbitrary vector $\\mathbf{v}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "using BenchmarkTools, LinearAlgebra, SparseArrays, Revise\n", "\n", "# a type for the matrix M = I - P^T in the PageRank problem\n", "struct PageRankImPt{TA <: Number, IA <: Integer, T <: AbstractFloat} <: AbstractMatrix{T}\n", " A :: SparseMatrixCSC{TA, IA} # adjacency matrix\n", " telep :: T\n", " # working arrays\n", " # TODO: whatever intermediate arrays you may want to pre-allocate\n", "end\n", "\n", "# constructor\n", "function PageRankImPt(A::SparseMatrixCSC, telep::T) where T <: AbstractFloat\n", " n = size(A, 1)\n", " # TODO: initialize and pre-allocate arrays\n", " PageRankImPt(A, telep)\n", "end\n", "\n", "LinearAlgebra.issymmetric(::PageRankImPt) = false\n", "Base.size(M::PageRankImPt) = size(M.A)\n", "# TODO: implement this function for evaluating M[i, j]\n", "Base.getindex(M::PageRankImPt, i, j) = M.telep\n", "\n", "# overwrite `out` by `(I - P^T) * v`\n", "function LinearAlgebra.mul!(\n", " out :: Vector{T}, \n", " M :: PageRankImPt{<:Number, <:Integer, T}, \n", " v :: Vector{T}\n", " ) where T <: AbstractFloat\n", " # TODO: implement mul!(out, M, v)\n", " sleep(1e-2) # placeholder: wait 10 ms as if this were your computation\n", " return out\n", "end\n", "\n", "# overwrite `out` by `(I - P) * v`\n", "function LinearAlgebra.mul!(\n", " out :: Vector{T}, \n", " Mt :: Transpose{T, PageRankImPt{TA, IA, T}}, \n", " v :: Vector{T}\n", " ) where {TA<:Number, IA<:Integer, T <: AbstractFloat}\n", " M = Mt.parent\n", " # TODO: implement mul!(out, transpose(M), v)\n", " sleep(1e-2) # placeholder: wait 10 ms as if this were your computation\n", " out\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check correctness. 
Note that\n", "$$\n", "\\mathbf{M}^T \\mathbf{1} = \\mathbf{0}\n", "$$\n", "and\n", "$$\n", "\\mathbf{M} \\mathbf{x} = \\mathbf{0}\n", "$$\n", "for the stationary distribution $\\mathbf{x}$.\n", "\n", "Download the solution file `pgrksol.csv.gz`. **Do not put this file in your Git**. You will lose points if you do. You can add a line `pgrksol.csv.gz` to your `.gitignore` file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "using CodecZlib, DelimitedFiles\n", "\n", "isfile(\"pgrksol.csv.gz\") || download(\"https://raw.githubusercontent.com/ucla-biostat-257/2022spring/master/hw/hw4/pgrksol.csv.gz\")\n", "xsol = open(\"pgrksol.csv.gz\", \"r\") do io\n", " vec(readdlm(GzipDecompressorStream(io)))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**You will lose all 35 points (Steps 1 and 2)** if the following statements throw an `AssertionError`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "M = PageRankImPt(A, 0.85)\n", "n = size(M, 1)\n", "\n", "#@assert transpose(M) * ones(n) ≈ zeros(n)\n", "@assert norm(transpose(M) * ones(n)) < 1e-12" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#@assert M * xsol ≈ zeros(n)\n", "@assert norm(M * xsol) < 1e-12" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 2 (20 pts)\n", "\n", "We want to benchmark the hot `mul!` methods to make sure they are efficient and allocate little memory."
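, "\n",
"\n",
"One way to keep `mul!` allocation-free is to traverse the CSC storage directly via `rowvals`, `nonzeros`, and `nzrange`. A sketch of such a kernel (the name `csc_tmul!` is made up), computing `A' * v` in place:\n",
"\n",
"```julia\n",
"using SparseArrays\n",
"\n",
"function csc_tmul!(out::Vector{Float64}, A::SparseMatrixCSC, v::Vector{Float64})\n",
"    rows, vals = rowvals(A), nonzeros(A)\n",
"    for j in 1:size(A, 2)  # (A' * v)[j] is the sum of A[i, j] * v[i]\n",
"        acc = 0.0\n",
"        for k in nzrange(A, j)\n",
"            acc += vals[k] * v[rows[k]]\n",
"        end\n",
"        out[j] = acc\n",
"    end\n",
"    out\n",
"end\n",
"```\n"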
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "M = PageRankImPt(A, 0.85)\n", "n = size(M, 1)\n", "v, out = ones(n), zeros(n)\n", "bm_mv = @benchmark mul!($out, $M, $v) setup=(fill!(out, 0); fill!(v, 1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bm_mtv = @benchmark mul!($out, $(transpose(M)), $v) setup=(fill!(out, 0); fill!(v, 1))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You will lose 1 point for each 100 bytes of memory allocation. So the points you will get are" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clamp(10 - median(bm_mv).memory / 100, 0, 10) + \n", "clamp(10 - median(bm_mtv).memory / 100, 0, 10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint**: My median run times are 30-40 ms and memory allocations are 0 bytes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 3 (20 pts)\n", "\n", "Let's first try to solve the PageRank problem by the GMRES method for solving linear equations. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "using KrylovKit\n", "\n", "# normalized in-degrees as the starting point\n", "x0 = vec(sum(A, dims = 1)) .+ 1.0\n", "x0 ./= sum(x0)\n", "\n", "# right-hand side\n", "b = zeros(n)\n", "\n", "# warm up (compilation)\n", "linsolve(M, b, x0, issymmetric = false, isposdef = false, maxiter = 1) \n", "# solve M * x = b and time it; linsolve returns (solution, info)\n", "(x_gmres, info), time_gmres, = @timed linsolve(M, b, x0, issymmetric = false, isposdef = false)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check correctness. 
**You will lose all 20 points if the following statement throws an `AssertionError`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@assert norm(x_gmres - xsol) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "GMRES should be reasonably fast. The points you'll get are" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clamp(20 / time_gmres * 20, 0, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint**: My runtime is about 7-8 seconds." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step 4 (20 pts)\n", "\n", "Let's now try to solve the PageRank problem by the Arnoldi method for solving eigen-problems. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# warm up (compilation)\n", "eigsolve(M, x0, 1, :SR, issymmetric = false, maxiter = 1)\n", "# output is a complex eigenvalue/eigenvector pair\n", "(vals, vecs, info), time_arnoldi, = @timed eigsolve(M, x0, 1, :SR, issymmetric = false)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check correctness. **You will lose all 20 points if the following statement throws an `AssertionError`.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "@assert abs(real(vals[1])) < 1e-8" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "x_arnoldi = abs.(real.(vecs[1]))\n", "x_arnoldi ./= sum(x_arnoldi)\n", "@assert norm(x_arnoldi - xsol) < 1e-8" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arnoldi should be reasonably fast. The points you'll get are" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "clamp(20 / time_arnoldi * 20, 0, 20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Hint**: My runtime is about 11-12 seconds." 
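, "\n",
"\n",
"As a sanity check independent of KrylovKit, the classical power method can also approximate the stationary distribution, since it is the fixed point of $\\mathbf{x} \\mapsto \\mathbf{P}^T \\mathbf{x}$. A sketch on a small dense transition matrix (`pr_power` is a made-up helper name, not part of the assignment):\n",
"\n",
"```julia\n",
"using LinearAlgebra\n",
"\n",
"function pr_power(P::Matrix{Float64}; tol = 1e-12, maxiter = 100_000)\n",
"    n = size(P, 1)\n",
"    x = fill(1.0 / n, n)  # start from the uniform distribution\n",
"    for _ in 1:maxiter\n",
"        xnew = P' * x\n",
"        xnew ./= sum(xnew)  # guard against numerical drift\n",
"        norm(xnew - x, 1) < tol && return xnew\n",
"        x = xnew\n",
"    end\n",
"    x\n",
"end\n",
"```\n"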
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q6 (5 pts) Results\n", "\n", "List the top 20 pages you found and their corresponding PageRank score. Do they match the top 20 pages ranked according to in-degrees? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Q7 Be proud of yourself\n", "\n", "Go to your resume/cv and claim you have experience performing analysis on a network of one million nodes." ] } ], "metadata": { "@webio": { "lastCommId": null, "lastKernelId": null }, "kernelspec": { "display_name": "Julia 1.7.1", "language": "julia", "name": "julia-1.7" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.7.1" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "87px", "width": "252px" }, "navigate_menu": true, "number_sections": false, "sideBar": true, "skip_h1_title": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": true, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 4 }