{ "metadata": { "name": "", "signature": "sha256:3e14059bb6b4eed1df615401561843cff4863b56c858377fd52ae19ef05b8267" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "The Listiness of Wikipedia" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Although it was only an aside, an answer to \"What is a Reference work?\" caught my attention at UC Berkeley iSchool's [March 21st Friday Afternoon Seminar by Michael Buckland](http://www.ischool.berkeley.edu/newsandevents/events/ias/20140321). One possible answer suggested was: works that are over 80% list.\n", "\n", "That definition, although it seems a bit short, was actually a serious suggestion published by Marcia Bates in 1986. [Bates, Marcia J. \"What Is a Reference Book: A Theoretical and Empirical Analysis.\" RQ 26 (Fall 1986): 37-57] In my opinion this is an elegant way to define reference works because, although heuristic, it is entirely quantitative. What is still needed, though, is a definition of a list. According to Bates, every book is a certain percentage list. Consider a classical monograph: it probably has a table of contents or an index, each of which is a list structure.\n", "\n", "At this point in my reading, I realised that it would be simple to identify which parts of Wikipedia articles are list. And so we could determine the percentage list, or the \"listiness\", of each Wikipedia article.\n", "\n" ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Method" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Analysing a May 2014 copy of English Wikipedia, we look at the listiness of all articles in the main namespace. 
To do this I used the `xml_dump` library from the excellent [mediawiki-utilities](https://pypi.python.org/pypi/mediawiki-utilities/0.2.1) by [@halfak](https://twitter.com/halfak).\n", "\n", "In [wikitext](https://www.wikidata.org/wiki/Q826308), list items are identified by a line-initial __*__ (unordered list) or __#__ (numbered list). Additionally, there are infoboxes, whose lines begin with | (pipe character), and tables, whose rows begin with |- (pipe dash). The percentage of lines that begin with any of these characters therefore determines the share of an article that is list, or its \"listiness\", as I am now coining it.\n", "\n", "We exclude redirect pages, that is, pages that are one line long and start with a '#' character. Pages that are not redirects we term 'canonical pages'. We exclude \"talk\" pages as well.\n", "\n", "So, for instance, we can look at the statistics for each of these line-starting characters. Below are the mean and standard deviation of the number of each kind of line start per page." ] }, { "cell_type": "heading", "level": 3, "metadata": {}, "source": [ "Results" ] }, { "cell_type": "code", "collapsed": false, "input": [ "canon_description = canonical.describe()\n", "canon_description.loc[['mean','std']]" ], "language": "python", "metadata": {}, "outputs": [ { "html": [ "
| | `*` | `#` | `|-` | `|` | total |
| --- | --- | --- | --- | --- | --- |
| mean | 6.660130 | 0.457347 | 3.392566 | 29.755385 | 2284.055053 |
| std | 33.667805 | 7.173435 | 24.983288 | 124.254805 | 4120.072758 |

2 rows \u00d7 5 columns
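
The counting step described in the Method section can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the code that produced the table above: the function names `count_line_starts`, `is_redirect` and `listiness` are my own, the `lines` key is simply the number of lines and may not correspond to the `total` column, and the input is assumed to be the raw wikitext of one page as a string.

```python
from collections import Counter

# Line prefixes treated as list markup. "|-" is tested before "|"
# because every table-row line also starts with a pipe.
LIST_PREFIXES = ("*", "#", "|-", "|")


def count_line_starts(wikitext):
    """Count lines by list-like prefix, plus the number of lines."""
    counts = Counter({prefix: 0 for prefix in LIST_PREFIXES})
    lines = wikitext.splitlines()
    for line in lines:
        for prefix in LIST_PREFIXES:
            if line.startswith(prefix):
                counts[prefix] += 1
                break  # count each line under one prefix only
    counts["lines"] = len(lines)
    return counts


def is_redirect(wikitext):
    """Redirects, as described above: one line long, starting with '#'."""
    lines = wikitext.strip().splitlines()
    return len(lines) == 1 and lines[0].startswith("#")


def listiness(wikitext):
    """Share of lines that start with any of the list-like prefixes."""
    counts = count_line_starts(wikitext)
    if counts["lines"] == 0:
        return 0.0
    list_lines = sum(counts[prefix] for prefix in LIST_PREFIXES)
    return list_lines / counts["lines"]


sample = (
    "Intro paragraph.\n"
    "* first item\n"
    "* second item\n"
    "# numbered\n"
    "{|\n"
    "|-\n"
    "| cell\n"
    "|}\n"
    "Closing text."
)
sample_counts = count_line_starts(sample)
sample_listiness = listiness(sample)
```

Note that the strict `startswith` test follows the description literally, so a line such as `|}` (the end of a table) counts as a pipe line, while `{|` does not.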

| | `*` | `#` | `|-` | `|` | total |
| --- | --- | --- | --- | --- | --- |
| `*` | 1.000000 | 0.016075 | 0.049516 | 0.045971 | -0.093864 |
| `#` | 0.016075 | 1.000000 | 0.019311 | 0.023440 | -0.031146 |
| `|-` | 0.049516 | 0.019311 | 1.000000 | 0.764544 | -0.045862 |
| `|` | 0.045971 | 0.023440 | 0.764544 | 1.000000 | -0.093092 |
| total | -0.093864 | -0.031146 | -0.045862 | -0.093092 | 1.000000 |

5 rows \u00d7 5 columns

| | all | listiest | ratio (listiest/all) |
| --- | --- | --- | --- |
| of | 640540 | 60163 | 0.093925 |
| list | 92978 | 34307 | 0.368980 |
| in | 488559 | 22937 | 0.046948 |
| the | 355866 | 17071 | 0.047970 |
| \u2013 | 31977 | 10836 | 0.338869 |
| for | 415190 | 9120 | 0.021966 |
| mens | 23931 | 5837 | 0.243910 |
| singles | 7986 | 5391 | 0.675056 |
| wikipediaarticles | 295972 | 5358 | 0.018103 |
| season | 41053 | 5323 | 0.129662 |
| and | 154046 | 5133 | 0.033321 |
| championships | 17045 | 5121 | 0.300440 |
| at | 47756 | 5066 | 0.106081 |
| world | 34675 | 4981 | 0.143648 |
| district | 51854 | 4780 | 0.092182 |
| by | 105340 | 4673 | 0.044361 |
| team | 36328 | 4256 | 0.117155 |
| county | 104867 | 4255 | 0.040575 |
| football | 53007 | 3853 | 0.072689 |
| wikipediawikiproject | 183711 | 3665 | 0.019950 |

20 rows \u00d7 3 columns

| | all | listiest | ratio (listiest/all) |
| --- | --- | --- | --- |
| pri | 728 | 664 | 0.912088 |
| vrh | 235 | 210 | 0.893617 |
| divisional | 410 | 325 | 0.792683 |
| vas | 365 | 279 | 0.764384 |
| gornji | 229 | 167 | 0.729258 |
| filmography | 526 | 371 | 0.705323 |
| secretariat | 463 | 325 | 0.701944 |
| numberone | 2394 | 1675 | 0.699666 |
| singles | 7986 | 5391 | 0.675056 |
| stakes | 1149 | 727 | 0.632724 |
| handicap | 342 | 213 | 0.622807 |
| billboard | 740 | 454 | 0.613514 |
| fia | 369 | 207 | 0.560976 |
| iaaf | 718 | 354 | 0.493036 |
| grade | 895 | 421 | 0.470391 |
| listings | 2959 | 1376 | 0.465022 |

16 rows \u00d7 3 columns
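
The `ratio (listiest/all)` column in the two word tables is the count in the `listiest` column divided by the count in the `all` column (for example, 60163 / 640540 ≈ 0.0939 for "of"). A rough sketch of how such a table could be built from page titles follows. Several things here are assumptions on my part rather than the method actually used: that the words come from page titles, that tokens are lower-cased with ASCII punctuation stripped (which would explain forms like "mens" and "numberone"), that every occurrence of a word is counted, and the names `title_words`, `word_ratios` and `min_count`. How the "listiest" set of pages is selected is not restated here, so it is simply taken as an input.

```python
import string
from collections import Counter

# Translation table that deletes ASCII punctuation (assumed tokenisation).
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def title_words(title):
    """Lower-case a title, drop ASCII punctuation, split on whitespace."""
    return title.lower().translate(_STRIP_PUNCT).split()


def word_ratios(all_titles, listiest_titles, min_count=1):
    """Return (word, all, listiest, ratio) rows, highest ratio first.

    `listiest_titles` is assumed to be a subset of `all_titles`.
    """
    all_counts = Counter(w for t in all_titles for w in title_words(t))
    listiest_counts = Counter(
        w for t in listiest_titles for w in title_words(t)
    )
    rows = []
    for word, n_listiest in listiest_counts.items():
        n_all = all_counts[word]
        # Skip rare words, and never divide by zero.
        if n_all >= max(min_count, 1):
            rows.append((word, n_all, n_listiest, n_listiest / n_all))
    return sorted(rows, key=lambda row: row[3], reverse=True)


all_titles = [
    "List of rivers of Wales",
    "Men's singles",
    "River Thames",
    "List of number-one singles",
]
listiest_titles = [
    "List of rivers of Wales",
    "List of number-one singles",
]
ratio_rows = {row[0]: row for row in word_ratios(all_titles, listiest_titles)}
```

Raising `min_count` filters out words that appear in only a handful of titles, which otherwise dominate the top of a ratio-sorted table.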
\n", "