{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "CWPK \\#50: Querying External Sources\n", "=======================================\n", "\n", "The Nearly Infinite Usefulness of SPARQL\n", "--------------------------\n", "\n", "
`altLabels` and `skos.definition`, existing crosswalks or mappings, longer descriptions, subsumption relations, related links, and interesting joins and intersections across external knowledge base content. Often, one is able to specify the format (serialization) of the desired results.\n",
"\n",
"The outputs from these external queries can be manipulated as strings, and then written to flat files useful for ingest into the various build routines. Of course, it is important that the format and CSV-nature of the results be maintained in a form that the build routines expect. One may alter the build formats or the extract formats, but to work they need to match on both ends.\n",
"\n",
"So, what we provide in today's installment are some guidelines and recipes for using SPARQL to obtain information you need and to write them to flat files. Because of their importance, we emphasize Wikidata and DBpedia (also a stand-in for Wikipedia) in our examples. Once populated, you may need to do some intermediate [wrangling](https://en.wikipedia.org/wiki/Data_wrangling) of these files to get them into shape for direct import. We covered that topic in brief in [**CWPK #36**](https://www.mkbergman.com/2374/cwpk-36-bulk-modification-techniques/), but really do not address file wrangling further here. There are way too many varieties to cover the topic in a meaningful way, though we certainly have examples in today's installment and across the entire **CWPK** series that should provide a useful foundation to your own efforts."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Choosing Access Method\n",
"There are not that many public SPARQL endpoints available, and some are not always up and available. But the endpoints that do exist, with their identification in the **Query Sources** section at the conclusion of today's installment, are often comprehensive and with high value. The two we will be emphasizing today, Wikidata and DBpedia (and, by extension, the [linked open data](https://en.wikipedia.org/wiki/Linked_data#Linked_open_data) (LOD) cloud beyond that), are among the most valuable. (Of course, many endpoints, like ones specific to a particular organization, are private, and can be parts of valuable, distributed information ecosystems.) Another notable endpoint worthy of your attention is the [LOD endpoint maintained](http://lod.openlinksw.com/sparql) by OpenLink Software.\n",
"\n",
"It is possible to query many of these sources directly online with an HTML interface, often also providing a choice of the output format desired. In some of the examples below, I provide a **Try it!** link that takes you directly to the source site and uses their native SPARQL interface. (Also, inspect the URI links for these **Try it!** options, since it shows how SPARQL gets communicated over the Web.) You may often find this is the fastest and cleanest way to get useful results, and sometimes better formatted than what our home-brewed options below produce. Your mileage may vary. In any case, it is useful to learn how to conduct direct SPARQL capabilities from within *cowpoke*. For that reason, I emphasize our home-brewed examples below."
]
},
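The **Try it!** links place the percent-encoded query after a `#` for the browser interface. A programmatic request instead carries the query as a `query` parameter, as the SPARQL protocol specifies. Here is a minimal sketch using only the standard library, assuming the Wikidata endpoint used later in this installment. The `format` parameter is an endpoint convenience that not every service honors, and `build_sparql_url` is a hypothetical helper name. No request is actually sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: the Wikidata endpoint that this installment queries later on.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_sparql_url(endpoint, query, fmt="json"):
    """Return a GET URL carrying the SPARQL query as a percent-encoded parameter.

    Hypothetical helper for illustration; 'format' is endpoint-specific.
    """
    params = {"query": query, "format": fmt}
    return endpoint + "?" + urlencode(params)


query = "SELECT ?subclass WHERE { ?subclass wdt:P279 wd:Q183366. } LIMIT 5"
url = build_sparql_url(ENDPOINT, query)
print(url)

# Decoding the URL recovers the original query text unchanged.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded)
```

Comparing the printed URL with a **Try it!** link shows the same query text under a slightly different encoding (spaces become `+` here rather than `%20`).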
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setting Up This Installment\n",
"Like we have been emphasing of late, we begin today's installment with our standard start-up instructions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from cowpoke.__main__ import *\n",
"from cowpoke.config import *\n",
"from owlready2 import *"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from SPARQLWrapper import SPARQLWrapper, JSON\n",
"from rdflib import Graph\n",
"\n",
"#sparql = SPARQLWrapper('http://dbpedia.org/sparql')\n",
"sparql = SPARQLWrapper('https://query.wikidata.org/sparql')\n",
"graph = world.as_rdflib_graph()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Of course, we actually have a very capable query method to our own internal stores:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"form_1 = list(graph.query_owlready(\"\"\"\n",
" PREFIX rc: VALUES
statement. This construct allows a listing of IDs to be passed to the query source. Depending on various endpoint limits, you may be able to list 1000 or more IDs in such a listing; experience with a given endpoint will dictate. If you use the `VALUES` construct, just make sure you are using the proper format and prefix (`wd:` in this instance for a Q item within Wikidata) in front of each value.\n",
"\n",
"##### Parent Class from Q IDs\n",
"The first query is to obtain the parent class from submitted listing of Q items. You may also **[Try it!](https://query.wikidata.org/#PREFIX%20schema%3A%20%3Chttp%3A%2F%2Fschema.org%2F%3E%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fwikilink%20%3FitemDescription%20%3FsubClass%20%3FsubClassLabel%20WHERE%20%7B%0A%20%20VALUES%20%3Fitem%20%7B%20wd%3AQ25297630%0A%20%20wd%3AQ537127%0A%20%20wd%3AQ16831714%0A%20%20wd%3AQ24398318%0A%20%20wd%3AQ11755880%0A%20%20wd%3AQ681337%0A%7D%0A%20%3Fitem%20wdt%3AP910%20%3FsubClass.%0A%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%7D)** directly from [Wikidata](https://query.wikidata.org/):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sparql.setQuery(\"\"\"\n",
"PREFIX schema: print
statements above to see how we can start varying outputs. Chances are you will need to do some string manipulation before your flat files are ready for ingest, but we can vary these specifications to get the initial output closer to our requirements.\n",
"\n"
]
},
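To cut down on later string manipulation, the JSON results can be flattened into CSV rows before they are written out. The sketch below assumes the standard SPARQL JSON results layout (`head.vars` plus `results.bindings`) and uses a small hand-written sample rather than real query output. The helper names `bindings_to_rows` and `write_csv` are hypothetical, and the column order would need to match whatever your build routines expect.

```python
import csv
import io

# Hand-written sample in the SPARQL JSON results layout (not real query output).
sample_results = {
    "head": {"vars": ["item", "itemLabel"]},
    "results": {"bindings": [
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q1"},
         "itemLabel": {"type": "literal", "value": "universe"}},
        # A binding may omit a variable, for example when an OPTIONAL finds nothing.
        {"item": {"type": "uri", "value": "http://www.wikidata.org/entity/Q2"}},
    ]},
}


def bindings_to_rows(results):
    """Return (header, rows); variables missing from a binding become ''."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append([binding.get(var, {}).get("value", "") for var in variables])
    return variables, rows


def write_csv(results, handle):
    """Write a header line and one CSV row per binding to an open text handle."""
    header, rows = bindings_to_rows(results)
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


# An in-memory buffer stands in for a flat file opened with newline=''.
buffer = io.StringIO()
write_csv(sample_results, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

With a live query, `results` from `sparql.query().convert()` would take the place of `sample_results`.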
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### subClass and Instance listings for Q ID\n",
"**[Try it!](https://query.wikidata.org/#%23subClass%20and%20Instance%20of%20Q%20ID%0A%0ASELECT%20%3Fsubclass%20%3FsubclassLabel%20%3Finstance%20%3FinstanceLabel%0AWHERE%0A%7B%0A%20%20%3Fsubclass%20wdt%3AP279%20wd%3AQ183366.%0A%20%20%3Finstance%20wdt%3AP31%20wd%3AQ183366.%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22en%22.%20%7D%0A%7D%0AORDER%20BY%20xsd%3Ainteger%28SUBSTR%28STR%28%3Fsubclass%29%2CSTRLEN%28%22http%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ%22%29%2B1%29%29)** as well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sparql.setQuery(\"\"\"\n",
"SELECT ?subclass ?subclassLabel ?instance ?instanceLabel\n",
"WHERE\n",
"{\n",
" ?subclass wdt:P279 wd:Q183366.\n",
" ?instance wdt:P31 wd:Q183366.\n",
" SERVICE wikibase:label { bd:serviceParam wikibase:language \"en\". }\n",
"}\n",
"ORDER BY xsd:integer(SUBSTR(STR(?subclass),STRLEN(\"http://www.wikidata.org/entity/Q\")+1))\n",
"\"\"\")\n",
"sparql.setReturnFormat(JSON)\n",
"results = sparql.query().convert()\n",
"\n",
"#for result in results[\"results\"][\"bindings\"]:\n",
"# print(result[\"item\"][\"value\"])\n",
"\n",
"print(results)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Useful Q Item Attributes\n",
"\n",
"**[Try it!](https://query.wikidata.org/#PREFIX%20schema%3A%20%3Chttp%3A%2F%2Fschema.org%2F%3E%0A%0ASELECT%20%3Fitem%20%3FitemLabel%20%3Fclass%20%3FclassLabel%20%3Fdescription%20%3Farticle%20%3FitemAltLabel%20WHERE%20%7B%0A%20%20VALUES%20%3Fitem%20%7B%20wd%3AQ1%20wd%3AQ2%20wd%3AQ3%20wd%3AQ4%20wd%3AQ5%20%7D%0A%20%20%3Fitem%20wdt%3AP31%20%3Fclass%3B%0A%20%20%20%20%20%20%20%20wdt%3AP5008%20%3Fproject.%0A%23%20%20%3Farticle%20rdfs%3Acomment%20%3Fdescription.%0A%20%20%0A%20%20%20OPTIONAL%20%7B%0A%20%20%20%20%3Farticle%20schema%3Aabout%20%3Fitem.%0A%20%20%20%20%3Farticle%20schema%3AisPartOf%20%3Chttps%3A%2F%2Fen.wikipedia.org%2F%3E.%0A%20%20%7D%0A%20%20%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%0A%7D)**"
]
},
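When the Q IDs come from a file or from an earlier query, the `VALUES` listing can be assembled in code rather than by hand. This is a minimal sketch, assuming bare IDs such as `Q1` and the `wd:` prefix. The helpers `values_clause` and `chunked` are hypothetical names, and the batch size of 2 only keeps the demonstration short; a workable size depends on the endpoint's limits.

```python
import re


def values_clause(q_ids, variable="?item", prefix="wd:"):
    """Build a SPARQL VALUES clause from bare Q IDs such as 'Q1'."""
    terms = []
    for q_id in q_ids:
        # Reject anything that is not a plain Q ID before it reaches the query.
        if not re.fullmatch(r"Q[1-9][0-9]*", q_id):
            raise ValueError(f"not a Q ID: {q_id!r}")
        terms.append(prefix + q_id)
    return f"VALUES {variable} {{ {' '.join(terms)} }}"


def chunked(items, size):
    """Yield successive batches of at most 'size' items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


q_ids = ["Q1", "Q2", "Q3", "Q4", "Q5"]
clauses = [values_clause(batch) for batch in chunked(q_ids, 2)]
for clause in clauses:
    print(clause)
```

Each clause can then be spliced into the `WHERE` block of a query, with one request issued per batch.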
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sparql.setQuery(\"\"\"\n",
"PREFIX schema: *.ipynb
file. It may take a bit of time for the interactive option to load.