{ "metadata": { "name": "", "signature": "sha256:f99ffdd41a3fa537c4bd4d91e511496fca421822d919877127ba2142d77f20d4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "The UK Web Archive offers a number of data download for analysis. One of them is a graph of links across the UK domain, and our goal is to outline how to process and visualise this data for a particular domain.\n", "\n", "The Dataset\n", "-----------\n", "\n", "The ~2.5 billion 200 OK responses in the [JISC UK Web Domain Dataset (1996-2010)]({{ site.baseurl }}/ukwa.ds.2/) dataset have been scanned for hyperlinks. For each link, we extract the host that the link targets, and use this to build up a picture of which hosts have linked to which other hosts, over time.\n", "\n", "This host-level link graph summarises the number of links between hosts, in each year. The data format is a slightly unusual, as you can see from this snippet:\n", "\n", " 1996|appserver.ed.ac.uk|portico.bl.uk 1\n", " 1996|art-www.acorn.co.uk|portico.bl.uk 1\n", " 1996|astra.ich.ucl.ac.uk|portico.bl.uk 1\n", " 1996|back.niss.ac.uk|portico.bl.uk 1\n", " 1996|beta.bids.ac.uk|portico.bl.uk 2\n", " 1996|blaiseweb.bl.uk|blaiseweb.bl.uk 4\n", " 1996|bonsai.iielr.dmu.ac.uk|portico.bl.uk 4\n", "\n", "There are two tab-separated columns. The first contains three bar-separated fields: the crawl year, the source host, and the target host. The second contains the number of linking URLs. Therefore, the first line:\n", "\n", "
\n",
      "1996|appserver.ed.ac.uk|portico.bl.uk   1\n",
      "
\n", "\n", "represents an assertion that, from the data crawled in 1996, we found one URL on the 'appserver.ed.ac.uk' host that contained a hyperlink to a resource held on 'portico.bl.uk'.\n", "\n", "Scale\n", "-----\n", "Visualising such a large data set is very difficult. A cutting edge algorithm was able to visualised the 1996 part ... see http://britishlibrary.typepad.co.uk/webarchive/2013/07/using-open-data-to-visualise-the-early-web.html\n", "\n", "Filtering By Domain\n", "-------------------\n", "One approach to making the dataset more manageable is to focus on a particular domain of interest. For example, using the compression-aware version of the grep tool, one can quickly pick out the links between the hosts on a particular domain. For example, this command extracts all the links relating to the British Libraries domain.\n", "\n", " % zgrep \"bl.uk\" host-linkage.tsv.gz | sort > bl-uk-linkage.tsv\n", " \n", "Note that the final stage sorts the data so that lines from each year appear together.\n", "\n", "Visualising Total Links Over Time\n", "---------------------------------\n", "\n", "Talk about chart-based visualisation, linking to Peter's work. see http://peterwebster.me/2014/01/28/distant-reading-the-webarchive/\n", "\n", "\n", "Dynamic Visualisation\n", "---------------------\n", "\n", "One very common strand of work I've seen over the years is that of looking out to the open source community and finding tools that can be reused. \n", "\n", "There are very sophisticated graph analysis tools, notably gephi, although time-dependent graphs are less well supported.\n", "\n", "However, another completely different tool may be of use here.\n", "\n", "### Visualisation using Gource ###\n", "[Gource](https://code.google.com/p/gource/) was designed as a tool for visualising how software has been changed over time.\n", "\n", "However, it has a [generic input log format](https://code.google.com/p/gource/wiki/CustomLogFormat) which we can re-use here:\n", "\n", " 1275543595|andrew|A|src/main.cpp|FF0000\n", "\n", "If we can get from\n", "\n", " 1996|appserver.ed.ac.uk|portico.bl.uk 1\n", "\n", "to something like\n", "\n", " 820454400|appserver.ed.ac.uk|A|uk/bl/portico/uk/ac/ed/appserver/1|FF0000\n", "\n", "we should find some interesting patterns emerge." ] }, { "cell_type": "code", "collapsed": false, "input": [ "with open('./host-linkage/bl-uk-host-linkage.dat', 'r') as f:\n", " for line in f:\n", " print(line)\n", " row = line.rstrip().replace('|','\\t').split(\"\\t\")\n", " print(row)\n", " break" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "1996|appserver.ed.ac.uk|portico.bl.uk\t1\n", "\n", "['1996', 'appserver.ed.ac.uk', 'portico.bl.uk', '1']\n" ] } ], "prompt_number": 6 }, { "cell_type": "code", "collapsed": false, "input": [ "import time\n", "\n", "# Open input and output files\n", "with open('./host-linkage/bl-uk-host-linkage.dat', 'r') as fin:\n", " with open('./host-linkage/linkage.log', 'w') as fout:\n", " counter = 0\n", " for line in fin:\n", " # Reformat:\n", " new_line = to_gource(line)\n", " fout.write(new_line)\n", " fout.write('\\n')\n", " \n", " # Also count:\n", " counter = counter + 1\n", " \n", " # Report progress:\n", " if( counter%10000 == 0 ):\n", " print(counter, line, new_line, '\\n')\n", " \n", " # Report outcome:\n", " print(\"Wrote {} lines.\".format(counter))\n", " fout.close()\n", " " ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10000 2001|www.orchardhotel.demon.co.uk|portico.bl.uk\t2\n", " 978307200|www.orchardhotel.demon.co.uk|A|uk/co/demon/orchardhotel/www|FF0000 \n", "\n", "20000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2003|portico.bl.uk|www.london-victorian-ring.com\t1\n", " 1041379200|www.london-victorian-ring.com|A|com/london-victorian-ring/www|0000FF \n", "\n", "30000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2004|portico.bl.uk|www.parlophone.co.uk\t4\n", " 1072915200|www.parlophone.co.uk|A|uk/co/parlophone/www|0000FF \n", "\n", "40000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2005|portico.bl.uk|www.biomedcentral.com\t3\n", " 1104537600|www.biomedcentral.com|A|com/biomedcentral/www|0000FF \n", "\n", "50000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2006|minos.bl.uk|www.theglobalsite.ac.uk\t1\n", " 1136073600|www.theglobalsite.ac.uk|A|uk/ac/theglobalsite/www|0000FF \n", "\n", "60000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2007|gopher.bl.uk|www.google.com\t5\n", " 1167609600|www.google.com|A|com/google/www|0000FF \n", "\n", "70000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2007|www.bl.uk|www.regione.emilia-romagna.it\t13\n", " 1167609600|www.regione.emilia-romagna.it|A|it/emilia-romagna/regione/www|0000FF \n", "\n", "80000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2009|www.bl.uk|www.bank.lv\t2\n", " 1230768000|www.bank.lv|A|lv/bank/www|0000FF \n", "\n", "Wrote 86281 lines." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 37 }, { "cell_type": "code", "collapsed": false, "input": [ "import time\n", "\n", "def to_gource(line):\n", " row = line.rstrip().replace('|','\\t').split(\"\\t\")\n", " timestamp = int(time.mktime(time.strptime(row[0], \"%Y\")))\n", " hostname = row[1]\n", " blhost = row[2]\n", " action = \"A\"\n", " colour = \"FF0000\"\n", " if( blhost.find(\"bl.uk\") == -1 ):\n", " hostname = row[2]\n", " blhost = row[1]\n", " colour = \"0000FF\"\n", " #path = '/'.join(reversed(blhost.split('.')))\n", " #path = path +'/' + '/'.join(reversed(hostname.split('.')))\n", " path = '/'.join(reversed(hostname.split('.')))\n", " return \"{}|{}|{}|{}|{}\".format(timestamp,hostname,action,path,colour)\n", " \n", "print(to_gource(\"1996|appserver.ed.ac.uk|portico.bl.uk\t1\"))" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "820454400|appserver.ed.ac.uk|A|uk/ac/ed/appserver|FF0000\n" ] } ], "prompt_number": 2 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Passing this to gource works ok\n", "\n", " gource host-linkage/linkage.log\n", " \n", "...BUT it's rather unclear what's going on, as each old link turns up afresh. We really need to keep the state from year to year so we can add/delete links and can tune the colour. RED for inlink, BLUE for outlink, GREEN for both.\n", "\n", "So, we change tack and store the whole graph first, so we can pick through it later.\n" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import time\n", "\n", "# Takes a line from the linkage dataset and converts it into the form\n", "# (year, link_path, link_source, link_num)\n", "# Where 'actor' is the host that created the link\n", "def transform_link(line):\n", " row = line.rstrip().replace('|','\\t').split(\"\\t\")\n", " year = row[0]\n", " link_source = row[1]\n", " link_target = row[2]\n", " host = row[1]\n", " blhost = row[2]\n", " link_num = row[3]\n", " if( blhost.find(\"bl.uk\") == -1 ):\n", " host = row[2]\n", " blhost = row[1]\n", " path = '/'.join(reversed(host.split('.')))\n", " return (year, path, link_source, link_num)\n", "\n", "# Open input and output files\n", "known = {}\n", "years = set()\n", "paths = set()\n", "counter = 0\n", "with open('./host-linkage/york-ac-uk-linkage.tsv', 'r') as fin:\n", " for line in fin:\n", " try:\n", " # Reformat:\n", " (year, path, link_source, link_num) = transform_link(line)\n", " if( link_source.find(\"york.ac.uk\") == -1 ):\n", " link_type = \"in\"\n", " else:\n", " link_type = \"out\"\n", " key = \"{}|{}|{}\".format(year, path, link_type)\n", " known[key] = link_num\n", " years.add(year)\n", " paths.add(path)\n", " \n", " except Exception as e:\n", " print(e)\n", " print(line)\n", " stop\n", " \n", " # Also count:\n", " counter = counter + 1\n", " \n", " # Report progress:\n", " if( counter%10000 == 0 ):\n", " print(counter, line, key, '\\n')\n", " \n", "# Report outcome:\n", "print(\"Processed {} lines.\".format(counter))\n" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "10000 1997|www.york.ac.uk|www.nsls.bnl.gov\t1\n", " 1997|gov/bnl/nsls/www|out \n", "\n", "20000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 1999|www.geo.ed.ac.uk|www-users.york.ac.uk\t2\n", " 1999|uk/ac/york/www-users|in \n", "\n", "30000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2000|www.together.creations.co.uk|neural13.cs.york.ac.uk\t2\n", " 2000|uk/ac/york/cs/neural13|in \n", "\n", "40000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2001|www.melcom.free-online.co.uk|www-users.york.ac.uk\t3\n", " 2001|uk/ac/york/www-users|in \n", "\n", "50000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2001|www-users.york.ac.uk|www.trafford.gov.uk\t1\n", " 2001|uk/gov/trafford/www|out \n", "\n", "60000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2002|missendenchurch.org.uk|www-users.york.ac.uk\t1\n", " 2002|uk/ac/york/www-users|in \n", "\n", "70000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2002|www-users.cs.york.ac.uk|sloan.stanford.edu\t4\n", " 2002|edu/stanford/sloan|out \n", "\n", "80000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2002|www-users.york.ac.uk|www.umr.edu\t5\n", " 2002|edu/umr/www|out \n", "\n", "90000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2003|ctiwebct.york.ac.uk|www.wfc.ac.uk\t4\n", " 2003|uk/ac/wfc/www|out \n", "\n", "100000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2003|www.physrev.york.ac.uk|www-sci.lib.uci.edu\t2\n", " 2003|edu/uci/lib/www-sci|out \n", "\n", "110000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2003|www-users.york.ac.uk|www.bmjpg.com\t7\n", " 2003|com/bmjpg/www|out \n", "\n", "120000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2003|www.york.ac.uk|www.etc.com.au\t2\n", " 2003|au/com/etc/www|out \n", "\n", "130000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2004|npg.york.ac.uk|nnsa.dl.ac.uk\t7\n", " 2004|uk/ac/dl/nnsa|out \n", "\n", "140000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2004|www-rr.york.ac.uk|www.thesaurus.com\t1\n", " 2004|com/thesaurus/www|out \n", "\n", "150000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2004|www-users.york.ac.uk|www.eee.bham.ac.uk\t5\n", " 2004|uk/ac/bham/eee/www|out \n", "\n", "160000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2004|www.york.ac.uk|www.htdig.org\t25\n", " 2004|org/htdig/www|out \n", "\n", "170000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2005|www.cpag.org.uk|www.york.ac.uk\t3\n", " 2005|uk/ac/york/www|in \n", "\n", "180000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2005|www-users.york.ac.uk|websitegarage.netscape.com\t4\n", " 2005|com/netscape/websitegarage|out \n", "\n", "190000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2005|www.york.ac.uk|www.mariestopes.org.uk\t2\n", " 2005|uk/org/mariestopes/www|out \n", "\n", "200000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2006|www.gwc.org.uk|www-users.york.ac.uk\t8\n", " 2006|uk/ac/york/www-users|in \n", "\n", "210000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2006|www.york.ac.uk|www.brentwood-council.gov.uk\t1\n", " 2006|uk/gov/brentwood-council/www|out \n", "\n", "220000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2007|www.aspies.co.uk|www.york.ac.uk\t18\n", " 2007|uk/ac/york/www|in \n", "\n", "230000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2007|www-users.york.ac.uk|www.pinegroup.com\t15\n", " 2007|com/pinegroup/www|out \n", "\n", "240000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2008|motility.york.ac.uk|www3.ncbi.nlm.nih.gov\t3\n", " 2008|gov/nih/nlm/ncbi/www3|out \n", "\n", "250000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2008|www-users.york.ac.uk|www.ntgateway.com\t1\n", " 2008|com/ntgateway/www|out \n", "\n", "260000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2009|www.bromhammillers.co.uk|www.york.ac.uk\t1\n", " 2009|uk/ac/york/www|in \n", "\n", "270000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2009|www.york.ac.uk|www.prowess.org.uk\t1\n", " 2009|uk/org/prowess/www|out \n", "\n", "280000" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 2010|www.york.ac.uk|www.colartz.com\t1\n", " 2010|com/colartz/www|out \n", "\n", "Processed 284247 lines." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 3 }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, now we can re-use the processed form and output as changes:" ] }, { "cell_type": "code", "collapsed": false, "input": [ "def get_state(year,path):\n", " key_in = \"{}|{}|{}\".format(year, path, \"in\")\n", " key_out = \"{}|{}|{}\".format(year, path, \"out\")\n", " state = None;\n", " if( key_in in known ):\n", " state = \"in\"\n", " if( key_out in known ):\n", " state = \"out\"\n", " if( key_in in known and key_out in known ):\n", " state = \"both\"\n", " return state;\n", "\n", "# Now process and output in the form \"1230768000|www.bank.lv|A|lv/bank/www|0000FF\" : \n", "changes = 0\n", "deletions = 0\n", "with open('./host-linkage/york-linkage.log', 'w') as fout:\n", "\n", " # Loop over all known years and paths:\n", " for year in sorted(years):\n", " timestamp = int(time.mktime(time.strptime(year, \"%Y\")))\n", " for path in paths:\n", " # Determine state for current year:\n", " current_state = get_state(year,path)\n", " previous_state = get_state(int(year)-1,path)\n", " host = '.'.join(reversed(path.split('/')))\n", " if( current_state != None and current_state != previous_state ):\n", " changes += 1\n", " if( previous_state == None ):\n", " action = \"A\"\n", " else:\n", " action = \"M\"\n", " # And now the in/out state:\n", " if( current_state == \"in\" ):\n", " colour = \"FFFF00\"\n", " elif( current_state == \"out\" ):\n", " colour = \"0000FF\"\n", " elif( current_state == \"both\" ):\n", " colour = \"00FF00\"\n", " agent = path\n", " fout.write(\"{}|{}|{}|{}/{}|{}\".format(timestamp,agent,action,path,host,colour) )\n", " fout.write('\\n')\n", " if( current_state == None and previous_state != None ):\n", " # This is the case of deleted links:\n", " deletions += 1\n", " action = \"D\"\n", " colour = \"000000\"\n", " agent = path\n", " fout.write(\"{}|{}|{}|{}/{}|{}\".format(timestamp,agent,action,path,host,colour) )\n", " fout.write('\\n')\n", " \n", " print(\"Output {} changes.\".format(changes))\n", " print(\"Output {} deletions.\".format(deletions))\n", " fout.close()\n", " " ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Output 61007 changes.\n", "Output 51613 deletions.\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, running gource with these options seems to work well (having the labels on obscures things too much).\n", "\n", " gource --max-file-lag 0.1 --title \"Links to/from bl.uk, 1996-2010.\" --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage.log\n", " \n", "Links from bl.uk are shown in blue. Links to bl.uk are shown in yellow. Both-ways in green.\n", "\n", "e.g. 1996\n", "\n", "\n", "\n", "and 2010 \n", "\n", "\n", "\n", "Similarly, I was able to create a [video](https://code.google.com/p/gource/wiki/Videos) suitable for YouTube:\n", "\n", " gource -o gource.ppm --max-file-lag 0.1 --title \"Links to/from bl.uk, 1996-2010.\" --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage.log \n", " ffmpeg -y -r 60 -f image2pipe -vcodec ppm -i gource.ppm -vcodec libx264 -preset ultrafast -pix_fmt yuv420p -crf 1 -threads 0 -bf 0 gource.mp4\n", "\n", "now here: https://www.youtube.com/watch?v=rX6Hix19_No" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Post-Generation Filtering ###\n", "\n", "e.g. to look at just the ac.uk bit:\n", "\n", " grep \"|uk/ac/\" linkage.log > linkage-ac.uk.log\n", " \n", "and then\n", "\n", " gource --hide bloom,users,usernames,dirnames,filenames host-linkage/linkage-ac.uk.log\n" ] } ], "metadata": {} } ] }