{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# The DAG of Julia packages\n", "\n", "## Problem statement\n", "\n", "In this tutorial, we show how LG in conjunction with other utility packages can be used for extracting the most recent directed acyclic graph (DAG) of the Julia package system. This information can be used for interactive data visualization with [D3](https://d3js.org/) like in the following links:\n", "\n", "- **The DAG of Julia packages:** https://juliohm.github.io/dataviz/DAG-of-Julia-packages\n", "- **Where are the Julians?** https://juliohm.github.io/dataviz/where-are-the-julians\n", "\n", "All the packages used in this notebook can be installed with:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mPackage HTTP is already installed\n", "\u001b[39m\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mPackage JSON is already installed\n", "\u001b[39m\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mPackage GitHub is already installed\n", "\u001b[39m\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mPackage LightGraphs is already installed\n", "\u001b[39m\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mPackage ProgressMeter is already installed\n", "\u001b[39m" ] } ], "source": [ "for dep in [\"HTTP\",\"JSON\",\"GitHub\",\"LightGraphs\",\"ProgressMeter\"]\n", " Pkg.add(dep)\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to be able to query information from GitHub without be misinterpreted as a malicious robot, you need to [create a personal token](https://github.com/settings/tokens) in your GitHub settings. Since this token is private, we ask you to save it as an environment variable in your operating system (e.g. set `GITHUB_AUTH` in your `.bashrc` file). This variable will be read in Julia and used for authentication as follows:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "WARNING: deprecated syntax \"abstract GitHubType\" at /home/juliohm/.julia/v0.6/GitHub/src/utils/GitHubType.jl:20.\n", "Use \"abstract type GitHubType end\" instead.\n", "\n", "WARNING: deprecated syntax \"typealias GitHubString Compat.UTF8String\" at /home/juliohm/.julia/v0.6/GitHub/src/utils/GitHubType.jl:22.\n", "Use \"const GitHubString = Compat.UTF8String\" instead.\n", "\n", "WARNING: deprecated syntax \"abstract Authorization\" at /home/juliohm/.julia/v0.6/GitHub/src/utils/auth.jl:6.\n", "Use \"abstract type Authorization end\" instead.\n" ] }, { "data": { "text/plain": [ "GitHub.OAuth2(8cda0d**********************************)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using HTTP\n", "using JSON\n", "using GitHub\n", "using LightGraphs\n", "using ProgressMeter\n", "\n", "# authenticate with GitHub to increase query limits\n", "mytoken = ENV[\"GITHUB_AUTH\"]\n", "myauth = GitHub.authenticate(mytoken)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After successful authentication, we are now ready to start coding. First, we extract the names of all registered packages in METADATA and assign to each of them a unique integer id:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dict{String,Int64} with 1500 entries:\n", " \"Levenshtein\" => 724\n", " \"ReadStat\" => 1141\n", " \"Discretizers\" => 326\n", " \"SchumakerSpline\" => 1209\n", " \"FredData\" => 455\n", " \"GaussQuadrature\" => 475\n", " \"RecurrenceAnalysis\" => 1147\n", " \"MKLSparse\" => 843\n", " \"AnsiColor\" => 20\n", " \"ProximalOperators\" => 1075\n", " \"Luxor\" => 776\n", " \"RobustLeastSquares\" => 1186\n", " \"Temporal\" => 1353\n", " \"Robotlib\" => 1184\n", " \"PiecewiseLinearOpt\" => 1026\n", " \"JLDArchives\" => 665\n", " \"MatrixDepot\" => 803\n", " \"CodeTools\" => 168\n", " \"NumericSuffixes\" => 935\n", " \"COBRA\" => 162\n", " \"Crypto\" => 234\n", " \"Mongo\" => 857\n", " \"ROOT\" => 1194\n", " \"MNIST\" => 849\n", " \"RandomMatrices\" => 1123\n", " ⋮ => ⋮" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find all packages in METADATA\n", "pkgs = readdir(Pkg.dir(\"METADATA\"))\n", "filterfunc = p -> isdir(joinpath(Pkg.dir(\"METADATA\"), p)) && p ∉ [\".git\",\".test\"]\n", "pkgs = filter(filterfunc, pkgs)\n", "\n", "# assign each package an id\n", "pkgdict = Dict{String,Int}()\n", "for (i,pkg) in enumerate(pkgs)\n", " push!(pkgdict, pkg => i)\n", "end\n", "pkgdict" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the ids, we can easily build the DAG of packages with LG:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Building graph...100% Time: 0:04:02\n" ] } ], "source": [ "# build DAG\n", "DAG = DiGraph(length(pkgs))\n", "@showprogress 1 \"Building graph...\" for pkg in pkgs\n", " children = Pkg.dependents(pkg)\n", " for c in children\n", " add_edge!(DAG, pkgdict[pkg], pkgdict[c])\n", " end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are interested in finding all the descendents of a package. In other words, we are interested in finding all packages that are influenced by a given package. In this context, we further want to save the level of dependency (or geodesic distance) from descendents to the package being queried. This is a straightforward operation in LG:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# find (indirect) descendents\n", "descendents = []\n", "for pkg in pkgs\n", " gdists = gdistances(DAG, pkgdict[pkg])\n", " desc = [Dict(\"id\"=>pkgs[v], \"level\"=>gdists[v]) for v in find(gdists .> 0)]\n", " push!(descendents, desc)\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each package, we also want to save information about who has contributed to the project. This task is easy to implement with the awesome [GitHub.jl](https://github.com/JuliaWeb/GitHub.jl) API. However, some of the packages registered in METADATA are hosted on different websites such as gitlab, for which an API is missing. We simply skip them and ask authors to migrate their code to GitHub if possible:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Finding contributors...100% Time: 0:12:27\n" ] } ], "source": [ "# find contributors\n", "pkgcontributors = []\n", "hostnames = []\n", "@showprogress 1 \"Finding contributors...\" for pkg in pkgs\n", " url = Pkg.Read.url(pkg)\n", " m = match(r\".*://([a-z.]*)/(.*)\\.git.*\", url)\n", " hostname = m[1]; reponame = m[2]\n", " if hostname == \"github.com\"\n", " users, _ = contributors(reponame, auth=myauth)\n", " usersdata = map(u -> (u[\"contributor\"].login, u[\"contributions\"]), users)\n", " pkgcontrib = [Dict(\"id\"=>u, \"contributions\"=>c) for (u,c) in usersdata]\n", " push!(pkgcontributors, pkgcontrib)\n", " push!(hostnames, hostname)\n", " else\n", " push!(pkgcontributors, [])\n", " push!(hostnames, hostname)\n", " end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We also extract the Julia version required in the last tag of a package. Both the lower and upper bounds are saved as well as a \"cleaned\" `major.minor` string for the lower bound, which is useful for data visualization:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# find required Julia version\n", "juliaversion = []\n", "for pkg in pkgs\n", " versiondir = joinpath(Pkg.dir(\"METADATA\"), pkg, \"versions\")\n", " if isdir(versiondir)\n", " latestversion = readdir(versiondir)[end]\n", " reqfile = joinpath(versiondir, latestversion, \"requires\")\n", " reqs = Pkg.Reqs.parse(reqfile)\n", " if \"julia\" ∈ keys(reqs)\n", " vinterval = reqs[\"julia\"].intervals[1]\n", " vmin = vinterval.lower\n", " vmax = vinterval.upper\n", " majorminor = \"v$(vmin.major).$(vmin.minor)\"\n", " push!(juliaversion, Dict(\"min\"=>string(vinterval.lower),\n", " \"max\"=>string(vinterval.upper),\n", " \"majorminor\"=>majorminor))\n", " else\n", " push!(juliaversion, Dict(\"min\"=>\"NA\", \"max\"=>\"NA\", \"majorminor\"=>\"NA\"))\n", " end\n", " else\n", " push!(juliaversion, Dict(\"min\"=>\"BOGUS\", \"max\"=>\"BOGUS\", \"majorminor\"=>\"BOGUS\"))\n", " end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we save the data in a JSON file:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# construct JSON\n", "nodes = [Dict(\"id\"=>pkgs[v],\n", " \"indegree\"=>indegree(DAG,v),\n", " \"outdegree\"=>outdegree(DAG,v),\n", " \"juliaversion\"=>juliaversion[v],\n", " \"descendents\"=>descendents[v],\n", " \"contributors\"=>pkgcontributors[v]) for v in vertices(DAG)]\n", "\n", "links = [Dict(\"source\"=>pkgs[src(e)], \"target\"=>pkgs[dst(e)]) for e in edges(DAG)]\n", "\n", "data = Dict(\"nodes\"=>nodes, \"links\"=>links)\n", "\n", "# write to file\n", "open(\"DAG-Julia-Pkgs.json\", \"w\") do f\n", " JSON.print(f, data, 2)\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Where are the Julians?\n", "\n", "Having extracted and saved the DAG of Julia packages, we take this opportunity to find out the Julians responsible for this amazing package system.\n", "\n", "We use LG again to build this social network:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Dict{String,Int64} with 1558 entries:\n", " \"credentiality\" => 496\n", " \"ZacCranko\" => 238\n", " \"Snnappie\" => 206\n", " \"benhamner\" => 387\n", " \"lynyus\" => 991\n", " \"iraikov\" => 781\n", " \"nstiurca\" => 1143\n", " \"pearlzli\" => 1180\n", " \"GunnarFarneback\" => 90\n", " \"njwilson23\" => 1135\n", " \"gustafsson\" => 719\n", " \"cgoldammer\" => 449\n", " \"garrison\" => 680\n", " \"lobingera\" => 977\n", " \"randyzwitch\" => 1232\n", " \"JonathanAnderson\" => 123\n", " \"madanim\" => 999\n", " \"Armavica\" => 20\n", " \"Matt5sean3\" => 148\n", " \"slangangular\" => 1356\n", " \"raphapr\" => 1236\n", " \"kuldeepdhaka\" => 946\n", " \"jdrugo\" => 818\n", " \"J-Revell\" => 103\n", " \"fserra\" => 668\n", " ⋮ => ⋮" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# find Julians on Github\n", "julians = []\n", "for pkgcontrib in pkgcontributors\n", " append!(julians, [julian[\"id\"].value for julian in pkgcontrib])\n", "end\n", "julians = sort(unique(julians))\n", "\n", "# assign each Julian an id\n", "juliandict = Dict{String,Int}()\n", "for (i,julian) in enumerate(julians)\n", " push!(juliandict, julian => i)\n", "end\n", "juliandict" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36m1558 Julians and 43978 connections\n", "\u001b[39m" ] } ], "source": [ "# build the social network\n", "socialnet = Graph(length(julians))\n", "contribdict = Dict{String,Int}()\n", "for pkgcontrib in pkgcontributors\n", " ids = [julian[\"id\"].value for julian in pkgcontrib]\n", " contribs = [julian[\"contributions\"] for julian in pkgcontrib]\n", " for i=1:length(ids)\n", " contribdict[ids[i]] = get(contribdict, ids[i], 0) + contribs[i]\n", " end\n", " for i=1:length(ids), j=1:i-1\n", " add_edge!(socialnet, juliandict[ids[i]], juliandict[ids[j]])\n", " end\n", "end\n", "\n", "njulians = nv(socialnet)\n", "nconnections = ne(socialnet)\n", "\n", "info(\"$njulians Julians and $nconnections connections\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each node of the social network, we use GitHub API to retrieve user information:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Retrieving Julian info...100% Time: 0:04:44\n" ] } ], "source": [ "# HTTP requests on https://api.github.com\n", "juliansinfo = []\n", "@showprogress 1 \"Retrieving Julian info...\" for julian in julians\n", " resp = HTTP.get(\"https://api.github.com/users/$julian?access_token=$mytoken\")\n", " htmlbody = identity(String(resp.body))\n", " push!(juliansinfo, JSON.Parser.parse(htmlbody))\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the user has typed an address on his profile, we find an approximate latitude/longitude with Google Maps geocoding API:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Geocoding Julian address...100% Time: 0:04:37\n" ] } ], "source": [ "locnames = []\n", "latitudes = []\n", "longitudes = []\n", "countries = []\n", "@showprogress 1 \"Geocoding Julian address...\" for julian in juliansinfo\n", " address = julian[\"location\"]\n", " if address ≠ nothing\n", " address = replace(address, \"–\", \"\")\n", " address = replace(address, \" \", \"+\")\n", " resp = HTTP.get(\"http://maps.google.com/maps/api/geocode/json?address=$address\")\n", " htmlbody = identity(String(resp.body))\n", " results = JSON.Parser.parse(htmlbody)[\"results\"]\n", " if length(results) > 0\n", " geoinfo = results[1]\n", " locname = geoinfo[\"formatted_address\"]\n", " loccoords = geoinfo[\"geometry\"][\"location\"]\n", " push!(locnames, locname)\n", " push!(latitudes, loccoords[\"lat\"])\n", " push!(longitudes, loccoords[\"lng\"])\n", " for comp in geoinfo[\"address_components\"]\n", " if \"country\" ∈ comp[\"types\"]\n", " push!(countries, comp[\"long_name\"])\n", " end\n", " end\n", " else\n", " push!(locnames, nothing)\n", " push!(latitudes, nothing)\n", " push!(longitudes, nothing)\n", " push!(countries, nothing)\n", " end\n", " else\n", " push!(locnames, nothing)\n", " push!(latitudes, nothing)\n", " push!(longitudes, nothing)\n", " push!(countries, nothing)\n", " end\n", "end" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we use JSON again to save the data:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true, "scrolled": false }, "outputs": [], "source": [ "# construct JSON\n", "usernodes = [Dict(\"id\"=>julian[\"login\"],\n", " \"name\"=>julian[\"name\"],\n", " \"avatar_url\"=>julian[\"avatar_url\"],\n", " \"contributions\"=>contribdict[julian[\"login\"]],\n", " \"location\"=>locnames[i],\n", " \"latitude\"=>latitudes[i],\n", " \"longitude\"=>longitudes[i],\n", " \"country\"=>countries[i]) for (i,julian) in enumerate(juliansinfo)]\n", "\n", "userlinks = [Dict(\"source\"=>julians[src(e)], \"target\"=>julians[dst(e)]) for e in edges(socialnet)]\n", "\n", "userdata = Dict(\"nodes\"=>usernodes, \"links\"=>userlinks)\n", "\n", "# write to file\n", "open(\"Julians.json\", \"w\") do f\n", " JSON.print(f, userdata, 2)\n", "end" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 0.6.2", "language": "julia", "name": "julia-0.6" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "0.6.2" } }, "nbformat": 4, "nbformat_minor": 2 }