{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a SQLite database with indices, page titles, and coordinate strings, let's make a database where we extract all the metadata out of those coordinate strings so it's queryable.\n", "\n", "This should be run after the other notebook that extracts the coordinate strings." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "import csv\n", "import json\n", "from wikiparse import indexer, syntax_parser as sp\n", "import time\n", "import os\n", "import sqlite3\n", "import random" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "opening E:/enwiki-20190101-pages-articles-multistream.xml/scratch\\py3\\index.db\n", "current mapping 19.1 m pages\n", "\n", "__init__ complete\n" ] } ], "source": [ "dumps = indexer.load_dumps(build_index=False, scratch_folder='py3')\n", "english = dumps['en']" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [], "source": [ "# english.db.close()" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "c = english.cursor" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we create the database let's get a complete list of the entries we're going to want. That is, let's look at all the coordinate strings we've extracted from each page and extract the list of keywords from there. 
" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(66,\n", " 'Coord|32.7|-86.7|type:adm1st_region:US_dim:1000000_source:USGS|display=title'),\n", " (86, 'Coord|36|42|N|3|13|E|type:city'),\n", " (105, 'coord|42|30|N|1|31|E|display=inline,title'),\n", " (114,\n", " 'Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000|display=title|notes={{Cite gnis|1785533|State of Alaska'),\n", " (139,\n", " 'Coord|13|19|N|169|9|W|type:event|name=Apollo 11 splashdown||Coord|10|36|N|172|24|E|display=inline||Coord|13|19|N|169|9|W|display=inline'),\n", " (140,\n", " 'Coord|8|8|N|165|1|W|type:event|name=Apollo 8 landing||Coord|30|12|N|74|7|W|name=Apollo 8 S-IC impact||Coord|31|50|N|37|17|W|name=Apollo 8 S-II impact||Coord|8|8|N|165|1|W|name=Apollo 8 estimated splashdown'),\n", " (161,\n", " 'Coord|12|30|40|N|69|58|27|W|type:isle|display=title||Coord|12|31|07|N|70|02|09|W||Coord|12|31|01|N|70|02|04|W|'),\n", " (166, 'coord|0|N|25|W|region:ZZ_type:waterbody|display=inline,title'),\n", " (168, 'Coord|12|30|S|18|30|E|display=title||Coord|8|50|S|13|20|E|type:city'),\n", " (177,\n", " 'Coord|55|N|115|W|type:adm1st_scale:10000000_region:CA-AB|display=title')]" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "result = c.execute('''SELECT page_num,coords,title FROM indices WHERE coords != \"\"\n", "''').fetchall()\n", "coordStrings = {item[0]:item[1] for item in result}\n", "idx_to_title = {item[0]:item[2] for item in result}\n", "list(coordStrings.items())[:10]" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1152376, dict)" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(coordStrings), type(coordStrings)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If a page has more than one coordinate string, choose the one that's displayed at the top (`display=title`) or 
the first." ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [], "source": [ "for page_num in coordStrings:\n", "    if '||' in coordStrings[page_num]:\n", "        pageCoordStrings = coordStrings[page_num].split('||')\n", "        # Default to the first string; any string containing display=title overrides it.\n", "        coordStrings[page_num] = pageCoordStrings[0]\n", "        for s in pageCoordStrings:\n", "            if \"display=title\" in s:\n", "                coordStrings[page_num] = s" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[(66,\n", " 'Coord|32.7|-86.7|type:adm1st_region:US_dim:1000000_source:USGS|display=title'),\n", " (86, 'Coord|36|42|N|3|13|E|type:city'),\n", " (105, 'coord|42|30|N|1|31|E|display=inline,title'),\n", " (114,\n", " 'Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000|display=title|notes={{Cite gnis|1785533|State of Alaska'),\n", " (139, 'Coord|13|19|N|169|9|W|type:event|name=Apollo 11 splashdown'),\n", " (140, 'Coord|8|8|N|165|1|W|type:event|name=Apollo 8 landing'),\n", " (161, 'Coord|12|30|40|N|69|58|27|W|type:isle|display=title'),\n", " (166, 'coord|0|N|25|W|region:ZZ_type:waterbody|display=inline,title'),\n", " (168, 'Coord|12|30|S|18|30|E|display=title'),\n", " (177,\n", " 'Coord|55|N|115|W|type:adm1st_scale:10000000_region:CA-AB|display=title')]" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(coordStrings.items())[:10]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For some Coord templates, there's a `notes` parameter (https://en.wikipedia.org/wiki/Template:Coord#Examples) whose value contains more pipes that will cut off the rest of the template. Since that parameter seems to come after the other important tags, let's\n", "ignore this problem." 
] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'type': 'event', 'name': 'Apollo 8 landing'}\n", "{'region': 'US-AK_type', 'display': 'title', 'notes': '{{Cite gnis'}\n", "{'type': 'city'}\n" ] } ], "source": [ "def getKeywords(coordString, verbose=False):\n", "    if verbose:\n", "        print(coordString)\n", "    keywords = {}\n", "    rest = []\n", "    items = coordString.split('|')\n", "    for item in items:\n", "        if '=' in item:\n", "            keywords[item.split('=')[0]] = item.split('=')[1]\n", "        elif ':' in item:\n", "            # Underscore-joined groups (key:value_key:value) are not split further,\n", "            # so only the first key is kept and its value may carry the next key's name.\n", "            keywords[item.split(':')[0]] = item.split(':')[1]\n", "        else:\n", "            rest.append(item)\n", "    # return keywords, '|'.join(rest)\n", "    return keywords\n", "print(getKeywords('Coord|8|8|N|165|1|W|type:event|name=Apollo 8 landing'))\n", "print(getKeywords('Coord|64|N|150|W|region:US-AK_type:adm1st_scale:10000000|display=title|notes={{Cite gnis|1785533|State of Alaska'))\n", "print(getKeywords('Coord|36|42|N|3|13|E|type:city'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We only care about a subset of keywords, so let's make a whitelist." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "defaultdict(int,\n", " {'type': 287434,\n", " 'display': 1099820,\n", " 'region': 432293,\n", " 'notes': 1748,\n", " 'name': 20998,\n", " 'dim': 4491,\n", " 'format': 133330,\n", " 'source': 67662,\n", " 'globe': 2834,\n", " 'scale': 8045,\n", " 'id': 217,\n", " 'accessdate': 305,\n", " 'entrydate': 42,\n", " 'title': 270,\n", " 'journal': 9,\n", " 'volume': 9,\n", " 'number': 1,\n", " 'pages': 12,\n", " 'author1': 1,\n", " 'date': 62,\n", " 'doi': 8,\n", " 'bibcode': 2,\n", " 'display-authors': 1,\n", " 'label': 1,\n", " '3': 143,\n", " 'url': 151,\n", " 'DK_type': 1,\n", " 'trans-title': 2,\n", " 'language': 7,\n", " 'publisher': 99,\n", " 'type_landmark_region': 2,\n", " 'region:US': 1,\n", " '2': 5,\n", " \"The ''globe'' [[File\": 25,\n", " 'upright': 25,\n", " '': 5,\n", " 'first': 13,\n", " 'last': 19,\n", " 'work': 70,\n", " 'deadurl': 6,\n", " 'archiveurl': 5,\n", " 'archivedate': 5,\n", " 'df': 4,\n", " '234503_622630_region': 1,\n", " 'הערה': 1,\n", " 'landmark_region': 4,\n", " 'USGS': 3,\n", " '4': 8,\n", " '6': 3,\n", " 'Register of Historic Parks and Gardens]].Crow Lane Roundabout': 1,\n", " 'nosave': 8,\n", " 'website': 12,\n", " 'source:https://archnet.org/print/preview/sites': 1,\n", " 'For example': 1,\n", " '5': 7,\n", " 'elevation': 4,\n", " 'GNS]] coordinates adjusted using [[Google Maps]] and [http': 42,\n", " 'type;landmark_region': 2,\n", " 'author': 16,\n", " 'location': 9,\n", " 'GNS]] coordinates adjusted using [[Google Maps]], and [http': 23,\n", " '600700_212618_region': 1,\n", " '{{#expr': 457,\n", " 'display': 1,\n", " '{{#property': 5,\n", " 'Wtype': 1,\n", " 'ype': 3,\n", " 'region_US-WA_type': 1,\n", " 'island': 1,\n", " 'regio': 2,\n", " 'Genicoord]], [[User': 1,\n", " 'YetanotherGenisock]], [[User': 1,\n", " 'Genidealingwithfairuse]], [[User': 1,\n", " 'Liveware problem]], [[User': 1,\n", " \"It's Character Forming]], 
[[User\": 1,\n", " 'UK voteing account]], [[User': 1,\n", " 'Geniice]]
several punctuation accounts to push abusive names off the first page of [[Special': 1,\n", " 'December 2004 (A)]]
\\n[[Wikipedia': 1,\n", " 'September 2006 (B)]]
\\n[[Wikipedia': 1,\n", " 'March 2008 (A)]]
\\n[[Wikipedia': 1,\n", " 'August 2008 (A)]]
\\n[[Wikipedia': 1,\n", " 'region:US-WV_scale:10000_source:placeopedia:display': 1,\n", " 'region-iso': 1,\n", " 'conference': 1,\n", " 'conference-url': 1,\n", " 'reg': 4,\n", " 'tye': 1,\n", " 'range_coordinates': 1,\n", " 'nopp': 1,\n", " 'last5': 3,\n", " 'first5': 3,\n", " 'last6': 3,\n", " 'first6': 3,\n", " 'last7': 3,\n", " 'first7': 3,\n", " 'scales': 1,\n", " 'soutype': 1,\n", " 'Avadi': 1,\n", " 'E