{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Extract coordinates and keywords from Wikipedia\n", "\n", "Wikipedia allows you to download the entirety of its contents. For English only (and without talk pages) that comes out to about 16 GB of data in the zipped format. We're only interested in articles with coordinates specified for the article itself (rather than something the article refers to). So we need to 1) be able to read in a page, given either a page ID or the page's title, and 2) query all pages with coordinates with a certain display type.\n", "\n", "Wikipedia provides an index file to go with the database dump, that lets you read articles in batches of 100 using byte locations. But I'd rather have the exact index to all the pages; that will save time when we're iterating over a subset of the pages later. And compared to other things I'm doing, building my own index isn't terribly computationally expensive.\n", "\n", "First, let's tell it where the dump file is. I made a 10 MB sample file that we can use." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from pathlib import Path\n", "from wikiparse import config\n", "\n", "xml_filename = config.xml\n", "scratch_folder = Path(config.folder)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I could save lots of hard drive space and some code if I used the bzipped XML file provided by Wikipedia along with their index. But the computationally intensive part isn't indexing the XML dump, it's the page tokenization and clustering which take forever. I'll leave it as an exercise for the reader whether switching this part over to using the bzip dump is better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What we want to create\n", "\n", "For each page, we're going to have the start and end indices, the page number, title, and the coord tag. We'll ignore any pages that don't have the string `Coord` or `coord`.\n", "\n", "Here are the databases we're creating:\n", " - `index`: maps page number to the title, start and end indices, and raw coord tag\n", " - `coords`: maps page number to lots of attributes in a coord tag\n", " - `title`: maps page title to page number\n", " \n", "Here's an example of a coord tag from a Wikipedia article:\n", "```\n", "coord|42.440556|-98.148083|type:landmark|name=Ashfall Fossil Beds\n", "```\n", "\n", "Let's get started." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from wikiparse import geo_indexer\n", "import time\n", "pipeline_start = time.time()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "opening C:\\Users\\rowan\\Documents\\geowiki\\scratch\\index.db\n", "Ready. Metadata: []\n", "iterating 100.0% of pages took 73.28 minutes53k k \n", "7.67 % contained coordinates tag\n", "Creating title dictionary\n", "UNIQUE constraint failed: titles.title KBZZ (AM)\n", " 99.888%\t4.39ms / page \r" ] } ], "source": [ "indexer = geo_indexer.Indexer(xml_filename,\n", " scratch_folder=scratch_folder)\n", "indexer.load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "How many pages do we have? This is the number that have coord tags." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "647725" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(indexer.get_page_numbers())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check out a page." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\"Anatolia\" \n", "\n", "\n", "\n", "\n", "\n", " its eastern and southeastern borders are widely taken to be Turkeys eastern border. By some definitions, the Armenian Highlands lies beyond the boundary of the Anatolian plateau. The official name of this inland region is the Eastern Anatolia Region.\n", "\n", "The Ancient Anatolians ancient inhabitants of Anatolia spoke the now-extinct Anatolian languages, which were largely replaced by the Greek language starting from classical antiquity and during the Hellenistic period Hellenistic, Roman Empire Roman and Byzantine Empire Byzantine periods. Major Anatolian languages included Hittite language Hittite, Luwian language Luwian, and Lydian language Lydian, among other more poorly attested relatives. The Turkification of Anatolia began under the Seljuk Empire in the late 11th century and continued under the Ottoman Empire between the late 13th and early 20th centuries. However, various non-Turkic languages continue to be spoken by minorities in Anatolia today, including Kurdish languages Kurdi\n" ] } ], "source": [ "page = indexer.get_page_by_num(indexer.get_page_numbers()[5])\n", "print(f'\"{page.title}\" \\n\\n{page.text[:1000]}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We're losing _some_ of the actual text of the Wikipedia article, but generally the filter's doing a good job of keeping out the metadata and keeping in the text." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pipeline took 2.01 hours\n" ] } ], "source": [ "took = time.time() - pipeline_start\n", "if took < 60:\n", " print(\"pipeline took\", round(took, 2), \"seconds\")\n", "elif took < 3600:\n", " print(\"pipeline took\", round(took/60, 2), \"minutes\")\n", "else:\n", " print(\"pipeline took\", round(took/60/60, 2), \"hours\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" } }, "nbformat": 4, "nbformat_minor": 4 }