{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "CWPK \\#47: Summary of the Extract-Build Roundtrip\n", "=======================================\n", "\n", "Here is the Master Listing of Extraction and Build Steps\n", "--------------------------\n", "\n", "
kko.superClasses
and rdfs.isDefinedyBy
properties. Some issues in CSV extraction and build settings were also discovered that led to excess quoting of strings. The \"official\" code, then, is what is contained in the *cowpoke* modules, and not necessarily exactly what is in the notebook pages.\n",
"\n",
"Therefore, of the many installments in this **CWPK** series, this present one is perhaps one of the most important for you to keep and reference. We will have occasion to summarize other steps in our series, but this installment is the most comprehensive view of the extract-and-build 'roundtrip' cycle."
]
},
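Since excess quoting was one of the issues uncovered, a small illustration may help show where it creeps in. This is a sketch using only Python's standard `csv` module (the buffer and sample value are hypothetical, not from the actual extraction files): a string that already carries quotes, written again under an aggressive quoting setting, gets its quotes doubled.

```python
import csv
import io

# Write a value that already carries quotes, using aggressive quoting.
buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_ALL).writerow(['"Generals"'])

# The embedded quotes are doubled and the field is wrapped again,
# which is the excess-quoting symptom seen in roundtrip files.
print(buf.getvalue().strip())   # """Generals"""

# Reading back with the default dialect recovers the original string,
# but repeated write-read cycles under mixed settings compound quotes.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)                      # ['"Generals"']
```

The fix is to use one consistent dialect for every write and read in the cycle.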
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary of Extraction and Build Steps\n",
"Here are the basic steps in a complete roundtrip from extracting to building the knowledge graph anew:\n",
"\n",
"1. Startup\n",
"\n",
"\n",
"2. Extraction\n",
" - Structure Extraction of Classes\n",
" - Structure Extraction of Properties\n",
" - Annotation Extraction of Classes\n",
" - Annotation Extraction of Properties\n",
" - Extraction of Mappings\n",
"\n",
" \n",
"3. Offline Development and Manipulation\n",
"\n",
"\n",
"4. Clean and Test Build Input Files\n",
"\n",
"\n",
"5. Build\n",
" - Build Class Structure\n",
" - Build Property Structure\n",
" - Build Class Annotations\n",
" - Build Property Annotations\n",
" - Ingest of Mappings\n",
" \n",
" \n",
"6. Test Build "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The order of extraction and building of classes and properties must begin each phase because we need to have these resources adequately registered to the knowledge graph. Once done, however, there is no ordering requirement for whether mapping or annotation proceeds next. Since annotation changes are always likely in every new version or build, I have listed them before mapping, but that is only a matter of preference.\n",
"\n",
"Each of these steps is described below, plus some key configuration settings as appropriate. We begin with our first step, startup:\n",
"\n",
"### 1. Startup"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from cowpoke.__main__ import *\n",
"from cowpoke.config import *"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will re-cap the entire breakdown and build process here. We first begin with structure extraction, first classes and then properties:\n",
"\n",
"### 2. Extraction\n",
"The purpose of a full extraction is to retrieve all assertions in KBpedia aside from those in the [upper](https://en.wikipedia.org/wiki/Upper_ontology) (also called [top-level](https://en.wikipedia.org/wiki/Upper_ontology)) KBpedia Knowledge Ontology, or [KKO](https://kbpedia.org/docs/kko-upper-structure/).\n",
"\n",
"#### A. Structure Extraction of Classes\n",
"We begin with the (mostly) hierarchical typologies and their linkage into KKO and with one another. Since all of the reference concepts in KBpedia are subsumed by the top-level category of Generals
, we can specify it alone as a means to retrieve all of the RCs in KBpedia:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### KEY CONFIG SETTINGS (see extract_deck in config.py) ###\n",
"# 'krb_src' : 'extract' # Set in master_deck\n",
"# 'descent_type' : 'descent',\n",
"# 'loop' : 'class_loop',\n",
"# 'loop_list' : custom_dict.values(), # Single 'Generals' specified \n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_struct_out.csv',\n",
"# 'render' : 'r_iri',\n",
"\n",
"def struct2_extractor(**extract_deck):\n",
" print('Beginning structure extraction . . .')\n",
"# 1 - render method goes here \n",
" r_default = ''\n",
" r_label = ''\n",
" r_iri = ''\n",
" render = extract_deck.get('render')\n",
" if render == 'r_default':\n",
" set_render_func(default_render_func)\n",
" elif render == 'r_label':\n",
" set_render_func(render_using_label)\n",
" elif render == 'r_iri':\n",
" set_render_func(render_using_iri)\n",
" else:\n",
" print('You have assigned an incorrect render method--execution stopping.')\n",
" return\n",
"# 2 - note about custom extractions\n",
" loop_list = extract_deck.get('loop_list')\n",
" loop = extract_deck.get('loop')\n",
" out_file = extract_deck.get('out_file')\n",
" class_loop = extract_deck.get('class_loop')\n",
" property_loop = extract_deck.get('property_loop')\n",
" descent_type = extract_deck.get('descent_type')\n",
" x = 1\n",
" cur_list = []\n",
" a_set = []\n",
" s_set = []\n",
" new_class = 'owl:Thing'\n",
"# 5 - what gets passed to 'output'\n",
" with open(out_file, mode='w', encoding='utf8', newline='') as output:\n",
" csv_out = csv.writer(output)\n",
" if loop == 'class_loop': \n",
" header = ['id', 'subClassOf', 'parent']\n",
" p_item = 'rdfs:subClassOf'\n",
" else:\n",
" header = ['id', 'subPropertyOf', 'parent']\n",
" p_item = 'rdfs:subPropertyOf'\n",
" csv_out.writerow(header) \n",
"# 3 - what gets passed to 'loop_list' \n",
" for value in loop_list:\n",
" print(' . . . processing', value) \n",
" root = eval(value)\n",
"# 4 - descendant or single here\n",
" if descent_type == 'descent':\n",
" a_set = root.descendants()\n",
" a_set = set(a_set)\n",
" s_set = a_set.union(s_set)\n",
" elif descent_type == 'single':\n",
" a_set = root\n",
" s_set.append(a_set)\n",
" else:\n",
" print('You have assigned an incorrect descent method--execution stopping.')\n",
" return \n",
" print(' . . . processing consolidated set.')\n",
" for s_item in s_set:\n",
" o_set = s_item.is_a\n",
" for o_item in o_set:\n",
" row_out = (s_item,p_item,o_item)\n",
" csv_out.writerow(row_out)\n",
" if loop == 'class_loop':\n",
" if s_item not in cur_list: \n",
" row_out = (s_item,p_item,new_class)\n",
" csv_out.writerow(row_out)\n",
" cur_list.append(s_item)\n",
" x = x + 1\n",
" print('Total unique IDs written to file:', x)\n",
" print('The structure extraction for the ', loop, 'is completed.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"struct2_extractor(**extract_deck)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### B. Structure Extraction of Properties"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"See above with the following changes/notes:\n",
"\n",
"\n", "### KEY CONFIG SETTINGS (see extract_deck in config.py) ###\n", "# 'krb_src' : 'extract' # Set in master_deck\n", "# 'descent_type' : 'descent',\n", "# 'loop' : 'property_loop',\n", "# 'loop_list' : prop_dict.values(),\n", "# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/extractions/properties/prop_struct_out.csv',\n", "# 'render' : 'r_default',\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### C. Annotation Extraction of Classes\n", "Annotations require a different method, though with a similar composition to the prior ones. It was during testing of the full extract-build roundtrip that I realized our initial class annotation extraction routine was missing for the
rdfs.isDefinedBy
and kko.superClassOf
properties. The code in extract.py
has been updated to reflect these changes. \n",
"\n",
"Again, we first begin with classes. **Note**: by convention, I have shifted a couple structural:"
]
},
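The enumerate-and-concatenate pattern repeated throughout the routine below packs multi-valued annotations (altLabels, editorial notes, and so on) into a single CSV field delimited by '||'. In isolation, the idiom amounts to this sketch (the sample labels are hypothetical):

```python
# Pack a multi-valued annotation into one delimited CSV field, and
# unpack it again on the build side; '||' is the delimiter used by
# the extraction and build routines.
def pack(values, sep='||'):
    return sep.join(str(v) for v in values)

def unpack(field, sep='||'):
    return field.split(sep) if field else []

labels = ['bird', 'avian', 'Aves']
field = pack(labels)
print(field)                     # bird||avian||Aves
print(unpack(field) == labels)   # True
```

Keeping pack and unpack symmetrical is what lets annotation fields survive the roundtrip intact.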
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### KEY CONFIG SETTINGS (see extract_deck in config.py) ### \n",
"# 'krb_src' : 'extract' # Set in master_deck\n",
"# 'descent_type' : 'descent',\n",
"# 'loop' : 'class_loop',\n",
"# 'loop_list' : custom_dict.values(), # Single 'Generals' specified \n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/extractions/classes/Generals_annot_out.csv',\n",
"# 'render' : 'r_label',\n",
"\n",
"def annot2_extractor(**extract_deck):\n",
" print('Beginning annotation extraction . . .') \n",
" r_default = ''\n",
" r_label = ''\n",
" r_iri = ''\n",
" render = extract_deck.get('render')\n",
" if render == 'r_default':\n",
" set_render_func(default_render_func)\n",
" elif render == 'r_label':\n",
" set_render_func(render_using_label)\n",
" elif render == 'r_iri':\n",
" set_render_func(render_using_iri)\n",
" else:\n",
" print('You have assigned an incorrect render method--execution stopping.')\n",
" return \n",
" loop_list = extract_deck.get('loop_list')\n",
" loop = extract_deck.get('loop')\n",
" out_file = extract_deck.get('out_file')\n",
" class_loop = extract_deck.get('class_loop')\n",
" property_loop = extract_deck.get('property_loop')\n",
" descent_type = extract_deck.get('descent_type')\n",
" \"\"\" These are internal counters used in this module's methods \"\"\"\n",
" p_set = []\n",
" a_ser = []\n",
" x = 1\n",
" cur_list = []\n",
" with open(out_file, mode='w', encoding='utf8', newline='') as output:\n",
" csv_out = csv.writer(output) \n",
" if loop == 'class_loop': \n",
" header = ['id', 'prefLabel', 'subClassOf', 'altLabel', \n",
" 'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']\n",
" else:\n",
" header = ['id', 'prefLabel', 'subPropertyOf', 'domain', 'range', \n",
" 'functional', 'altLabel', 'definition', 'editorialNote']\n",
" csv_out.writerow(header) \n",
" for value in loop_list: \n",
" print(' . . . processing', value) \n",
" root = eval(value) \n",
" if descent_type == 'descent':\n",
" p_set = root.descendants()\n",
" elif descent_type == 'single':\n",
" a_set = root\n",
" p_set.append(a_set)\n",
" else:\n",
" print('You have assigned an incorrect descent method--execution stopping.')\n",
" return \n",
" for p_item in p_set:\n",
" if p_item not in cur_list: \n",
" a_pref = p_item.prefLabel\n",
" a_pref = str(a_pref)[1:-1].strip('\"\\'') \n",
" a_sub = p_item.is_a\n",
" for a_id, a in enumerate(a_sub): \n",
" a_item = str(a)\n",
" if a_id > 0:\n",
" a_item = a_sub + '||' + str(a)\n",
" a_sub = a_item\n",
" if loop == 'property_loop': \n",
" a_item = ''\n",
" a_dom = p_item.domain\n",
" for a_id, a in enumerate(a_dom):\n",
" a_item = str(a)\n",
" if a_id > 0:\n",
" a_item = a_dom + '||' + str(a)\n",
" a_dom = a_item \n",
" a_dom = a_item\n",
" a_rng = p_item.range\n",
" a_rng = str(a_rng)[1:-1]\n",
" a_func = ''\n",
" a_item = ''\n",
" a_alt = p_item.altLabel\n",
" for a_id, a in enumerate(a_alt):\n",
" a_item = str(a)\n",
" if a_id > 0:\n",
" a_item = a_alt + '||' + str(a)\n",
" a_alt = a_item \n",
" a_alt = a_item\n",
" a_def = p_item.definition\n",
" a_def = str(a_def)[2:-2]\n",
" a_note = p_item.editorialNote\n",
" a_note = str(a_note)[1:-1]\n",
" if loop == 'class_loop': \n",
" a_isby = p_item.isDefinedBy\n",
" a_isby = str(a_isby)[2:-2]\n",
" a_isby = a_isby + '/'\n",
" a_item = ''\n",
" a_super = p_item.superClassOf\n",
" for a_id, a in enumerate(a_super):\n",
" a_item = str(a)\n",
" if a_id > 0:\n",
" a_item = a_super + '||' + str(a)\n",
" a_super = a_item \n",
" a_super = a_item\n",
" if loop == 'class_loop': \n",
" row_out = (p_item,a_pref,a_sub,a_alt,a_def,a_note,a_isby,a_super)\n",
" else:\n",
" row_out = (p_item,a_pref,a_sub,a_dom,a_rng,a_func,\n",
" a_alt,a_def,a_note)\n",
" csv_out.writerow(row_out) \n",
" cur_list.append(p_item)\n",
" x = x + 1\n",
" print('Total unique IDs written to file:', x) \n",
" print('The annotation extraction for the', loop, 'is completed.')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"annot2_extractor(**extract_deck)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"d=csv.get_dialect('excel')\n",
"print(\"Delimiter: \", d.delimiter)\n",
"print(\"Doublequote: \", d.doublequote)\n",
"print(\"Escapechar: \", d.escapechar)\n",
"print(\"lineterminator: \", repr(d.lineterminator))\n",
"print(\"quotechar: \", d.quotechar)\n",
"print(\"Quoting: \", d.quoting)\n",
"print(\"skipinitialspace: \", d.skipinitialspace)\n",
"print(\"strict: \", d.strict)"
]
},
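If the inspection above points to a dialect setting behind the quoting problems, one remedy (shown here as a sketch with a hypothetical dialect name, not part of *cowpoke*) is to register a custom dialect once and refer to it by name in every reader and writer:

```python
import csv
import io

# Register a dialect with minimal quoting: only fields that actually
# need quotes (embedded delimiters or quote characters) receive them.
csv.register_dialect('kb_minimal', delimiter=',', quotechar='"',
                     quoting=csv.QUOTE_MINIMAL, lineterminator='\r\n')

buf = io.StringIO()
csv.writer(buf, dialect='kb_minimal').writerow(['plain', 'needs,quote'])
print(buf.getvalue().strip())   # plain,"needs,quote"
```

Because the dialect is registered globally by name, every open file in the extract and build routines can share the identical settings.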
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### D. Annotation Extraction of Properties\n",
"\n",
"See above with the following changes/notes:\n",
"\n",
"\n", "### KEY CONFIG SETTINGS (see extract_deck in config.py) ### \n", "# 'krb_src' : 'extract' # Set in master_deck\n", "# 'descent_type' : 'descent',\n", "# 'loop' : 'property_loop',\n", "# 'loop_list' : prop_dict.values(), \n", "# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/extractions/properties/prop_annot_out.csv',\n", "# 'render' : 'r_default',\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### E. Extraction of Mappings\n", "Mappings to external sources is an integral part of KBpedia, as is likely the case for any similar, large-scale knowledge graph. As such, extractions of existing mappings is also a logical step in the overall extraction process.\n", "\n", "Though we will not address mappings until **CWPK #49**, those steps belong here in the overall set of procedures for the extract-build roundtrip process." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Offline Development and Manipulation\n", "The above extraction steps can capture changes over time that have been made with an ontology editing tool such as [Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)). Once that knowledge graph is at a state of readiness after using Protégé, and more major changes are desired to your knowledge graph, it is sometimes easier to work with flat files in bulk. I discussed some of my own steps using spreadsheets in [**CWPK #36**](https://www.mkbergman.com/2374/cwpk-36-bulk-modification-techniques/), and I will also walk through some refactorings using bulk files in our next installment, **CWPK #48**. That case study will help us see at least a few of the circumstances that warrant bulk refactoring. Major additions or changes to the typologies is also an occasion for such bulk activities.\n", "\n", "At any rate, this step in the overall roundtripping process is where such modifications are made before rebuilding the knowledge graph anew." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. 
Clean and Test Build Input Files\n", "We covered these topics in [**CWPK #45**](https://www.mkbergman.com/2387/cwpk-45-cleaning-and-file-pre-checks/). If you recall, cleaning and testing of input files occurs at this logical point, but we delayed discussing it in detail until we had covered the overall build process steps. This is why this sequence number for this installment appears a bit out of order." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Build\n", "The start of the build cycle is to have all structure, annotation, and mapping files in proper shape and vetted for encoding and quality. \n", "\n", "(**Note**: where 'Generals' is specified, keep the initial capitalization, since it is also generated as such from the extraction routines and is consistent with typology naming.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### A. Build Class Structure\n", "We start with the knowledge graph classes and their subsumption relationships, as specified in one or more class structure CSV input files. In this case, we are doing a full build, so we begin with the KKO and RC stubs, plus run our
Generals
typology since it is inclusive:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### KEY CONFIG SETTINGS (see build_deck in config.py) ### # Option 1: from Generals\n",
"# 'kb_src' : 'start' # Set in master_deck; only step with 'start'\n",
"# 'loop_list' : custom_dict.values(), # Single 'Generals' specified \n",
"# 'loop' : 'class_loop',\n",
"# 'base' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/', \n",
"# 'ext' : '_struct_out.csv', # Note change \n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',\n",
"\n",
"### KEY CONFIG SETTINGS (see build_deck in config.py) ### # Option 2: from all typologies\n",
"# 'kb_src' : 'start' # Set in master_deck; only step with 'start'\n",
"# 'loop_list' : typol_dict.values(), \n",
"# 'loop' : 'class_loop',\n",
"# 'base' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/', \n",
"# 'ext' : '.csv', # Note change \n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',\n",
"\n",
"from cowpoke.build import *\n",
"\n",
"def class2_struct_builder(**build_deck): \n",
" print('Beginning KBpedia class structure build . . .') \n",
" kko_list = typol_dict.values() \n",
" loop_list = build_deck.get('loop_list')\n",
" loop = build_deck.get('loop')\n",
" base = build_deck.get('base')\n",
" ext = build_deck.get('ext')\n",
" out_file = build_deck.get('out_file')\n",
" if loop is not 'class_loop':\n",
" print(\"Needs to be a 'class_loop'; returning program.\")\n",
" return\n",
" for loopval in loop_list:\n",
" print(' . . . processing', loopval) \n",
" frag = loopval.replace('kko.','')\n",
" in_file = (base + frag + ext)\n",
" with open(in_file, 'r', encoding='utf8') as input:\n",
" is_first_row = True\n",
" reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent']) \n",
" for row in reader:\n",
" r_id = row['id'] \n",
" r_parent = row['parent']\n",
" id = row_clean(r_id, iss='i_id') \n",
" id_frag = row_clean(r_id, iss='i_id_frag')\n",
" parent = row_clean(r_parent, iss='i_parent')\n",
" parent_frag = row_clean(r_parent, iss='i_parent_frag')\n",
" if is_first_row: \n",
" is_first_row = False\n",
" continue \n",
" with rc: \n",
" kko_id = None\n",
" kko_frag = None\n",
" if parent_frag == 'Thing': \n",
" if id in kko_list: \n",
" kko_id = id\n",
" kko_frag = id_frag\n",
" else: \n",
" id = types.new_class(id_frag, (Thing,)) \n",
" if kko_id != None: \n",
" with kko: \n",
" kko_id = types.new_class(kko_frag, (Thing,)) \n",
" with open(in_file, 'r', encoding='utf8') as input:\n",
" is_first_row = True\n",
" reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])\n",
" for row in reader: \n",
" r_id = row['id'] \n",
" r_parent = row['parent']\n",
" id = row_clean(r_id, iss='i_id')\n",
" id_frag = row_clean(r_id, iss='i_id_frag')\n",
" parent = row_clean(r_parent, iss='i_parent')\n",
" parent_frag = row_clean(r_parent, iss='i_parent_frag')\n",
" if is_first_row:\n",
" is_first_row = False\n",
" continue \n",
" with rc:\n",
" kko_id = None \n",
" kko_frag = None\n",
" kko_parent = None\n",
" kko_parent_frag = None\n",
" if parent_frag is not 'Thing':\n",
" if id in kko_list:\n",
" continue\n",
" elif parent in kko_list:\n",
" kko_id = id\n",
" kko_frag = id_frag\n",
" kko_parent = parent\n",
" kko_parent_frag = parent_frag\n",
" else: \n",
" var1 = getattr(rc, id_frag) \n",
" var2 = getattr(rc, parent_frag)\n",
" if var2 == None: \n",
" continue\n",
" else:\n",
" print(var1, var2)\n",
" var1.is_a.append(var2)\n",
" if kko_parent != None: \n",
" with kko: \n",
" if kko_id in kko_list: \n",
" continue\n",
" else:\n",
" var1 = getattr(rc, kko_frag)\n",
" var2 = getattr(kko, kko_parent_frag) \n",
" var1.is_a.append(var2)\n",
" with open(in_file, 'r', encoding='utf8') as input: \n",
" is_first_row = True\n",
" reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])\n",
" for row in reader: \n",
" r_id = row['id'] \n",
" r_parent = row['parent']\n",
" id = row_clean(r_id, iss='i_id')\n",
" id_frag = row_clean(r_id, iss='i_id_frag')\n",
" parent = row_clean(r_parent, iss='i_parent')\n",
" parent_frag = row_clean(r_parent, iss='i_parent_frag')\n",
" if is_first_row:\n",
" is_first_row = False\n",
" continue\n",
" if parent_frag == 'Thing': \n",
" var1 = getattr(rc, id_frag)\n",
" var2 = getattr(owl, parent_frag)\n",
" try:\n",
" var1.is_a.remove(var2)\n",
" except Exception:\n",
" continue\n",
" kb.save(out_file, format=\"rdfxml\") \n",
" print('KBpedia class structure build is complete.')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class2_struct_builder(**build_deck)"
]
},
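A note on the class-creation mechanics above: the builder mints each new reference concept with Python's standard `types.new_class`, and owlready2 registers the result in whatever ontology the enclosing `with rc:` block names. Stripped of the ontology machinery, the stdlib call behaves like this sketch (the `Thing` stand-in is a plain Python class here, not owlready2's):

```python
import types

# A plain stand-in for owl.Thing, just to show the mechanics.
Thing = type('Thing', (), {})

# Mint a class from a string fragment, as the builder does for each
# 'id_frag' read from the structure CSV.
id_frag = 'Mammal'
NewClass = types.new_class(id_frag, (Thing,))

print(NewClass.__name__)             # Mammal
print(issubclass(NewClass, Thing))   # True
```

This is why the builder works from string fragments: any identifier read from a CSV can become a first-class Python (and ontology) class at runtime.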
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### B. Build Property Structure\n",
"After classes, when then add property structure to the system. Note, however, that we now switch to our normal 'standard' kb source:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### KEY CONFIG SETTINGS (see build_deck in config.py) ### \n",
"# 'kb_src' : 'standard' # Set in master_deck\n",
"# 'loop_list' : prop_dict.values(), \n",
"# 'loop' : 'property_loop',\n",
"# 'base' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/', \n",
"# 'ext' : '_struct_out.csv', \n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',\n",
"# 'frag' : set in code block; see below\n",
"\n",
"def prop2_struct_builder(**build_deck):\n",
" print('Beginning KBpedia property structure build . . .')\n",
" loop_list = build_deck.get('loop_list')\n",
" loop = build_deck.get('loop')\n",
" base = build_deck.get('base')\n",
" ext = build_deck.get('ext')\n",
" out_file = build_deck.get('out_file')\n",
" if loop is not 'property_loop':\n",
" print(\"Needs to be a 'property_loop'; returning program.\")\n",
" return\n",
" for loopval in loop_list:\n",
" print(' . . . processing', loopval)\n",
" frag = 'prop' \n",
" in_file = (base + frag + ext)\n",
" print(in_file)\n",
" with open(in_file, 'r', encoding='utf8') as input:\n",
" is_first_row = True\n",
" reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'subPropertyOf', 'parent'])\n",
" for row in reader:\n",
" if is_first_row:\n",
" is_first_row = False \n",
" continue\n",
" r_id = row['id']\n",
" r_parent = row['parent']\n",
" value = r_parent.find('owl.')\n",
" if value == 0: \n",
" continue\n",
" value = r_id.find('rc.')\n",
" if value == 0:\n",
" id_frag = r_id.replace('rc.', '')\n",
" parent_frag = r_parent.replace('kko.', '')\n",
" var2 = getattr(kko, parent_frag) \n",
" with rc: \n",
" r_id = types.new_class(id_frag, (var2,))\n",
" kb.save(out_file, format=\"rdfxml\")\n",
" print(kbpedia)\n",
" print(out_file)\n",
" print('KBpedia property structure build is complete.') "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"prop2_struct_builder(**build_deck)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### C. Build Class Annotations\n",
"With the subsumption structure built, we next load our annotations, beginning with the class ones:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"### KEY CONFIG SETTINGS (see build_deck in config.py) ### \n",
"# 'kb_src' : 'standard' \n",
"# 'loop_list' : file_dict.values(), # see 'in_file'\n",
"# 'loop' : 'class_loop',\n",
"# 'in_file' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/classes/Generals_annot_out.csv',\n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts.csv',\n",
"\n",
"def class2_annot_build(**build_deck):\n",
" print('Beginning KBpedia class annotation build . . .')\n",
" loop_list = build_deck.get('loop_list')\n",
" loop = build_deck.get('loop')\n",
" class_loop = build_deck.get('class_loop')\n",
" out_file = build_deck.get('out_file')\n",
" if loop is not 'class_loop':\n",
" print(\"Needs to be a 'class_loop'; returning program.\")\n",
" return\n",
" for loopval in loop_list:\n",
" print(' . . . processing', loopval) \n",
" in_file = loopval\n",
" with open(in_file, 'r', encoding='utf8') as input:\n",
" is_first_row = True\n",
" reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subClassOf', \n",
" 'altLabel', 'definition', 'editorialNote', 'isDefinedBy', 'superClassOf']) \n",
" for row in reader:\n",
" r_id = row['id']\n",
" id = getattr(rc, r_id)\n",
" if id == None:\n",
" print(r_id)\n",
" continue\n",
" r_pref = row['prefLabel']\n",
" r_alt = row['altLabel']\n",
" r_def = row['definition']\n",
" r_note = row['editorialNote']\n",
" r_isby = row['isDefinedBy']\n",
" r_super = row['superClassOf']\n",
" if is_first_row: \n",
" is_first_row = False\n",
" continue \n",
" id.prefLabel.append(r_pref)\n",
" i_alt = r_alt.split('||')\n",
" if i_alt != ['']: \n",
" for item in i_alt:\n",
" id.altLabel.append(item)\n",
" id.definition.append(r_def) \n",
" i_note = r_note.split('||')\n",
" if i_note != ['']: \n",
" for item in i_note:\n",
" id.editorialNote.append(item)\n",
" id.isDefinedBy.append(r_isby)\n",
" i_super = r_super.split('||')\n",
" if i_super != ['']: \n",
" for item in i_super:\n",
" item = 'http://kbpedia.org/kko/rc/' + item\n",
"# Code block to be used if objectProperty; 5.5 hr load\n",
"# item = getattr(rc, item)\n",
"# if item == None:\n",
"# print('Failed assignment:', r_id, item)\n",
"# continue\n",
"# else: \n",
" id.superClassOf.append(item)\n",
" kb.save(out_file, format=\"rdfxml\") \n",
" print('KBpedia class annotation build is complete.') "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class2_annot_build(**build_deck)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### D. Build Property Annotations\n",
"And then the property annotations:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"### KEY CONFIG SETTINGS (see build_deck in config.py) ### \n",
"# 'kb_src' : 'standard' \n",
"# 'loop_list' : file_dict.values(), # see 'in_file'\n",
"# 'loop' : 'property_loop',\n",
"# 'in_file' : 'C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv',\n",
"# 'out_file' : 'C:/1-PythonProjects/kbpedia/v300/target/ontologies/kbpedia_reference_concepts.csv',\n",
"\n",
"def prop2_annot_build(**build_deck):\n",
" print('Beginning KBpedia property annotation build . . .')\n",
" xsd = kb.get_namespace('http://w3.org/2001/XMLSchema#')\n",
" wgs84 = kb.get_namespace('http://www.opengis.net/def/crs/OGC/1.3/CRS84') \n",
" loop_list = build_deck.get('loop_list')\n",
" loop = build_deck.get('loop')\n",
" out_file = build_deck.get('out_file')\n",
" x = 1\n",
" if loop is not 'property_loop':\n",
" print(\"Needs to be a 'property_loop'; returning program.\")\n",
" return\n",
" for loopval in loop_list:\n",
" print(' . . . processing', loopval) \n",
" in_file = loopval\n",
" with open(in_file, 'r', encoding='utf8') as input:\n",
" is_first_row = True\n",
" reader = csv.DictReader(input, delimiter=',', fieldnames=['id', 'prefLabel', 'subPropertyOf', 'domain', \n",
" 'range', 'functional', 'altLabel', 'definition', 'editorialNote']) \n",
" for row in reader:\n",
" r_id = row['id'] \n",
" r_pref = row['prefLabel']\n",
" r_dom = row['domain']\n",
" r_rng = row['range']\n",
" r_alt = row['altLabel']\n",
" r_def = row['definition']\n",
" r_note = row['editorialNote']\n",
" r_id = r_id.replace('rc.', '')\n",
" id = getattr(rc, r_id)\n",
" if id == None:\n",
" continue\n",
" if is_first_row: \n",
" is_first_row = False\n",
" continue\n",
" id.prefLabel.append(r_pref)\n",
" i_dom = r_dom.split('||')\n",
" if i_dom != ['']: \n",
" for item in i_dom:\n",
" if 'kko.' in item:\n",
" item = item.replace('kko.', '')\n",
" item = getattr(kko, item)\n",
" id.domain.append(item) \n",
" elif 'owl.' in item:\n",
" item = item.replace('owl.', '')\n",
" item = getattr(owl, item)\n",
" id.domain.append(item)\n",
" elif item == ['']:\n",
" continue \n",
" elif item != '':\n",
" item = getattr(rc, item)\n",
" if item == None:\n",
" continue\n",
" else:\n",
" id.domain.append(item) \n",
" else:\n",
" print('No domain assignment:', 'Item no:', x, item)\n",
" continue \n",
" if 'owl.' in r_rng:\n",
" r_rng = r_rng.replace('owl.', '')\n",
" r_rng = getattr(owl, r_rng)\n",
" id.range.append(r_rng)\n",
" elif 'string' in r_rng: \n",
" id.range = [str]\n",
" elif 'decimal' in r_rng:\n",
" id.range = [float]\n",
" elif 'anyuri' in r_rng:\n",
" id.range = [normstr]\n",
" elif 'boolean' in r_rng: \n",
" id.range = [bool]\n",
" elif 'datetime' in r_rng: \n",
" id.range = [datetime.datetime] \n",
" elif 'date' in r_rng: \n",
" id.range = [datetime.date] \n",
" elif 'time' in r_rng: \n",
" id.range = [datetime.time] \n",
" elif 'wgs84.' in r_rng:\n",
" r_rng = r_rng.replace('wgs84.', '')\n",
" r_rng = getattr(wgs84, r_rng)\n",
" id.range.append(r_rng) \n",
" elif r_rng == ['']:\n",
" print('r_rng = empty:', r_rng)\n",
" else:\n",
" print('r_rng = else:', r_rng, id)\n",
"# id.range.append(r_rng)\n",
" i_alt = r_alt.split('||') \n",
" if i_alt != ['']: \n",
" for item in i_alt:\n",
" id.altLabel.append(item)\n",
" id.definition.append(r_def) \n",
" i_note = r_note.split('||')\n",
" if i_note != ['']: \n",
" for item in i_note:\n",
" id.editorialNote.append(item)\n",
" x = x + 1 \n",
" kb.save(out_file, format=\"rdfxml\") \n",
" print('KBpedia property annotation build is complete.')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Beginning KBpedia property annotation build . . .\n",
" . . . processing C:/1-PythonProjects/kbpedia/v300/build_ins/properties/prop_annot_out.csv\n",
"r_rng = else: xsd.anyURI rc.release_notes\n",
"r_rng = else: xsd.anyURI rc.schema_version\n",
"r_rng = else: xsd.anyURI rc.unit_code\n",
"r_rng = else: xsd.anyURI rc.property_id\n",
"r_rng = else: xsd.anyURI rc.ticket_token\n",
"r_rng = else: xsd.anyURI rc.role_name\n",
"r_rng = else: xsd.anyURI rc.feature_list\n",
"r_rng = else: xsd.hexBinary rc.associated_media\n",
"r_rng = else: xsd.hexBinary rc.encoding\n",
"r_rng = else: xsd.hexBinary rc.encodings\n",
"r_rng = else: xsd.hexBinary rc.photo\n",
"r_rng = else: xsd.hexBinary rc.photos\n",
"r_rng = else: xsd.hexBinary rc.primary_image_of_page\n",
"r_rng = else: xsd.hexBinary rc.thumbnail\n",
"r_rng = else: xsd.anyURI rc.code_repository\n",
"r_rng = else: xsd.anyURI rc.content_url\n",
"r_rng = else: xsd.anyURI rc.discussion_url\n",
"r_rng = else: xsd.anyURI rc.download_url\n",
"r_rng = else: xsd.anyURI rc.embed_url\n",
"r_rng = else: xsd.anyURI rc.install_url\n",
"r_rng = else: xsd.anyURI rc.map\n",
"r_rng = else: xsd.anyURI rc.maps\n",
"r_rng = else: xsd.anyURI rc.payment_url\n",
"r_rng = else: xsd.anyURI rc.reply_to_url\n",
"r_rng = else: xsd.anyURI rc.service_url\n",
"r_rng = else: xsd.anyURI rc.significant_link\n",
"r_rng = else: xsd.anyURI rc.significant_links\n",
"r_rng = else: xsd.anyURI rc.target_url\n",
"r_rng = else: xsd.anyURI rc.thumbnail_url\n",
"r_rng = else: xsd.anyURI rc.tracking_url\n",
"r_rng = else: xsd.anyURI rc.url\n",
"r_rng = else: xsd.anyURI rc.related_link\n",
"r_rng = else: xsd.anyURI rc.genre_schema\n",
"r_rng = else: xsd.anyURI rc.same_as\n",
"r_rng = else: xsd.anyURI rc.action_platform\n",
"r_rng = else: xsd.anyURI rc.fees_and_commissions_specification\n",
"r_rng = else: xsd.anyURI rc.requirements\n",
"r_rng = else: xsd.anyURI rc.software_requirements\n",
"r_rng = else: xsd.anyURI rc.storage_requirements\n",
"r_rng = else: xsd.anyURI rc.artform\n",
"r_rng = else: xsd.anyURI rc.artwork_surface\n",
"r_rng = else: xsd.anyURI rc.course_mode\n",
"r_rng = else: xsd.anyURI rc.encoding_format\n",
"r_rng = else: xsd.anyURI rc.file_format_schema\n",
"r_rng = else: xsd.anyURI rc.named_position\n",
"r_rng = else: xsd.anyURI rc.surface\n",
"r_rng = else: wgs84 rc.geo_midpoint\n",
"r_rng = else: xsd.anyURI rc.memory_requirements\n",
"r_rng = else: wgs84 rc.aerodrome_reference_point\n",
"r_rng = else: wgs84 rc.coordinate_location\n",
"r_rng = else: wgs84 rc.coordinates_of_easternmost_point\n",
"r_rng = else: wgs84 rc.coordinates_of_northernmost_point\n",
"r_rng = else: wgs84 rc.coordinates_of_southernmost_point\n",
"r_rng = else: wgs84 rc.coordinates_of_the_point_of_view\n",
"r_rng = else: wgs84 rc.coordinates_of_westernmost_point\n",
"r_rng = else: wgs84 rc.geo\n",
"r_rng = else: xsd.anyURI rc.additional_type\n",
"r_rng = else: xsd.anyURI rc.application_category\n",
"r_rng = else: xsd.anyURI rc.application_sub_category\n",
"r_rng = else: xsd.anyURI rc.art_medium\n",
"r_rng = else: xsd.anyURI rc.sport_schema\n",
"KBpedia property annotation build is complete.\n"
]
}
],
"source": [
"prop2_annot_build(**build_deck)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### E. Ingest of Mappings\n",
"Mappings to external sources are an integral part of KBpedia, as is likely the case for any similar, large-scale knowledge graph. As such, ingest of new or revised mappings is also a logical step in the overall build process, and occurs at this point in the sequence.\n",
"\n",
"Though we will not address mappings until **CWPK #49**, those steps belong here in the overall set of procedures for the extract-build roundtrip process."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Test Build\n",
"We then conduct our series of logic tests ([**CWPK #43**](https://www.mkbergman.com/2384/cwpk-43-logic-testing-of-the-knowledge-graph-structure/)). This portion of the process may actually be the longest of all, given that it may take multiple iterations to pass all of these tests. However, in other circumstances, the build tests may also go quite quickly if relatively few changes were made between versions."
]
},
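Many of the logic tests require a reasoner, but some structural checks can run directly against the CSV assertions before any load. A cycle in the subsumption hierarchy, for instance, can be caught with a simple depth-first search; this sketch uses in-memory pairs rather than the actual build files:

```python
def find_cycle(pairs):
    """Depth-first search for a cycle in child -> parent subsumption pairs."""
    graph = {}
    for child, parent in pairs:
        graph.setdefault(child, set()).add(parent)
    WHITE, GRAY, BLACK = 0, 1, 2    # unvisited, in progress, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:       # back edge: a cycle exists
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))

good = [('rc.Mammal', 'rc.Animal'), ('rc.Animal', 'owl:Thing')]
bad = good + [('rc.Animal', 'rc.Mammal')]    # introduces a loop
print(find_cycle(good), find_cycle(bad))     # False True
```

Catching such problems in the flat files is far cheaper than discovering them through a failed reasoner run after a full build.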
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wrap Up\n",
"Of course, these steps could be embedded in an overall 'complete' extract and build routine, but I have not done so.\n",
"\n",
"Before we conclude this major part in our **CWPK** series, we next proceed to show how all of the steps may be combined to achieve a rather large re-factoring of all of KBpedia."
]
},
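For those who do want such a 'complete' routine, a thin driver is easy to sketch. The wrapper below is hypothetical (it is not part of *cowpoke*): it simply runs a list of step functions against a shared configuration deck, so the extraction and build routines from this installment could be plugged in unchanged.

```python
# Hypothetical driver: run each step function with the same keyword deck,
# in roundtrip order, and report which steps completed.
def run_roundtrip(steps, deck):
    completed = []
    for step in steps:
        step(**deck)
        completed.append(step.__name__)
    return completed

# With the cowpoke modules loaded, the calls might look like:
#   run_roundtrip([struct2_extractor, annot2_extractor], extract_deck)
#   run_roundtrip([class2_struct_builder, prop2_struct_builder,
#                  class2_annot_build, prop2_annot_build], build_deck)

def demo_step(**deck):      # stand-in so the sketch runs on its own
    pass

print(run_roundtrip([demo_step], {}))   # ['demo_step']
```

In practice each phase would still keep its own deck, since the extraction and build steps point at different input and output files.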
{
"cell_type": "markdown",
"metadata": {},
"source": [
 `*.ipynb` file. It may take a bit of time for the interactive option to load.