{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "CWPK \\#39: I/O and Structural Ingest\n", "=======================================\n", "\n", "Builds Are a More Complicated Workflow than Extractions\n", "--------------------------\n", "\n", "
`build_ins`. This directory is the location where we first put the files extracted from a prior version to be used as the starting basis for the new version (see *Figure 2* in [**CWPK #37**](https://www.mkbergman.com/2376/cwpk-37-organizing-the-code-base/)). It is also the directory where we place our starting ontology files, the stubs that bootstrap the locations for new properties and classes to be added. We also place our `fixes` inputs into this directory.\n",
"\n",
"Second, the result of our various build steps will generally be placed into a single sub-directory, the `targets` directory. This directory is the source for all completed builds used for analysis and extractions for external uses and new builds. It is also the source of the knowledge graph input when we are in an incremental update or 'fix' mode, since we want to modify the current build in progress, not always start from scratch. The `targets` directory is also the appropriate location for logging, statistics, and working 'scratchpad' subdirectories while we are working on a given build. \n",
"\n",
"To this structure I also add a `sandbox` directory for experiments and other material that does not fall within a conventional build paradigm. The `sandbox` material can either be total scratch or be copied manually to other locations if it has some other value.\n",
"\n",
"Please see *Figure 2* in [**CWPK #37**](https://www.mkbergman.com/2376/cwpk-37-organizing-the-code-base/) for the complete enumeration of these directory structures."
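These directory conventions can be captured in a small configuration block. The sketch below is illustrative only: the variable names are assumptions, and a temporary directory stands in for the `C:/1-PythonProjects/kbpedia/v300` project root used in this series:

```python
import os
import tempfile

# Illustrative project root; the CWPK series uses C:/1-PythonProjects/kbpedia/v300,
# but a temporary directory lets this sketch run anywhere
project = os.path.join(tempfile.gettempdir(), 'kbpedia', 'v300')

# The three build-related sub-directories discussed above
build_dirs = {
    'build_ins': os.path.join(project, 'build_ins'),  # prior extractions, stubs, fixes
    'targets':   os.path.join(project, 'targets'),    # completed builds, logs, scratchpads
    'sandbox':   os.path.join(project, 'sandbox'),    # experiments outside the build paradigm
}

for path in build_dirs.values():
    os.makedirs(path, exist_ok=True)                  # create if not already present
```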
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basic I/O Routines\n",
"Similar to what we did with the extraction side of the roundtrip, we will begin our structural builds (and the annotation ones two installments hence) in the interactive format of Jupyter Notebook. We will be able to progress cell-by-cell, running these routines or invoking them with the `shift+enter` convention. After our cleaning routines in **CWPK #45**, we will then be able to embed these interactive routines into `build` and `clean` modules in **CWPK #47** as part of the *cowpoke* package.\n",
"\n",
"From the get-go with the `build` module we need a more flexible load routine for *cowpoke*, one that enables us to specify different sources and targets for the specific build, the inputs and outputs, or I/O. We had already discovered in the extraction routines that we needed to bring three ontologies into our project namespace: [KKO](https://kbpedia.org/docs/kko-upper-structure/), the reference concepts of [KBpedia](https://kbpedia.org/), and [SKOS](https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System). We may also need to differentiate 'start' vs. 'fix' wrinkles in our builds. That leads to three different combinations of source and target: 'standard' (same as 'fixes'), 'start', and our optional 'sandbox', for our basic \"build\" I/O:\n",
"\n",
"`skos` and `kko` in our current effort. \n",
"\n",
"Third, it is essential to declare the namespaces for these imports under the current working ontology. Then, from that point forward, it is also essential to be cognizant that these separate namespaces need to be addressed explicitly. In the case of *cowpoke* and KBpedia, for example, we have classes from our governing upper ontology, KKO (with the namespace 'kko'), and the reference concepts of the full KBpedia (namespace 'rc'). More than one namespace in the working ontology does complicate matters quite a bit, but it is also the more realistic architecture and design approach. Part of the nature of semantic technologies is to promote interoperability among multiple knowledge graphs or ontologies, each of which will have at least one of its own namespaces. To do meaningful work across ontologies, it is important to understand these ontology ← → namespace distinctions.\n",
"\n",
"Here is how these assignments work out for our build routines, based on these considerations:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"kb = world.get_ontology(kbpedia).load()\n",
"rc = kb.get_namespace('http://kbpedia.org/kko/rc/') # need to make sure we set the namespace\n",
"\n",
"skos = world.get_ontology(skos_file).load()\n",
"kb.imported_ontologies.append(skos)\n",
"core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')\n",
"\n",
"kko = world.get_ontology(kko_file).load()\n",
"kb.imported_ontologies.append(kko)\n",
"kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#') # need to assign namespace to main onto ('kb')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have set up our initial build switches and defined our ontologies and related namespaces, we are ready to construct the code for our first build attempt. In this instance, we will be working with only a single class structure input file to the build, `typol_AudioInfo.csv`, which according to our 'start' build switch (see above) is found in the `kbpedia/v300/build_ins/typologies/` directory under our project location.\n",
"\n",
"The routine below needs to go through three different passes (at least as I have naively specified it!), and is fairly complicated. There are quite a few notes below the code listing explaining some of these steps. Also note that we will be defining this code block as a function, and the `import types` statement will be moved to the header in our eventual build module:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import types\n",
"\n",
"src_file = 'C:/1-PythonProjects/kbpedia/v300/build_ins/typologies/typol_AudioInfo.csv'\n",
"kko_list = typol_dict.values()\n",
"with open(src_file, 'r', encoding='utf8') as csv_file: # Note 1\n",
" is_first_row = True\n",
" reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent']) \n",
" for row in reader: ## Note 2: Pass 1: register class\n",
" id = row['id'] # Note 3\n",
" parent = row['parent'] # Note 3\n",
" id = id.replace('http://kbpedia.org/kko/rc/', 'rc.') # Note 4\n",
" id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')\n",
" id_frag = id.replace('rc.', '')\n",
" id_frag = id_frag.replace('kko.', '')\n",
" parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') \n",
" parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')\n",
" parent = parent.replace('owl:', 'owl.')\n",
" parent_frag = parent.replace('rc.', '')\n",
" parent_frag = parent_frag.replace('kko.', '')\n",
" parent_frag = parent_frag.replace('owl.', '')\n",
" if is_first_row: # Note 5\n",
" is_first_row = False\n",
" continue \n",
" with rc: # Note 6\n",
" kko_id = None\n",
" kko_frag = None\n",
" if parent_frag == 'Thing': # Note 7 \n",
" if id in kko_list: # Note 8\n",
" kko_id = id\n",
" kko_frag = id_frag\n",
" else: \n",
" id = types.new_class(id_frag, (Thing,)) # Note 6\n",
" if kko_id is not None: # Note 8\n",
" with kko: # same form as Note 6\n",
" kko_id = types.new_class(kko_frag, (Thing,)) \n",
"with open(src_file, 'r', encoding='utf8') as csv_file:\n",
" is_first_row = True\n",
" reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])\n",
" for row in reader: ## Note 2: Pass 2: assign parent\n",
" id = row['id']\n",
" parent = row['parent']\n",
" id = id.replace('http://kbpedia.org/kko/rc/', 'rc.') # Note 4\n",
" id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')\n",
" id_frag = id.replace('rc.', '')\n",
" id_frag = id_frag.replace('kko.', '')\n",
" parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') \n",
" parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')\n",
" parent = parent.replace('owl:', 'owl.')\n",
" parent_frag = parent.replace('rc.', '')\n",
" parent_frag = parent_frag.replace('kko.', '')\n",
" parent_frag = parent_frag.replace('owl.', '')\n",
" if is_first_row:\n",
" is_first_row = False\n",
" continue \n",
" with rc:\n",
" kko_id = None # Note 9\n",
" kko_frag = None\n",
" kko_parent = None\n",
" kko_parent_frag = None\n",
" if parent_frag != 'Thing': # Note 10\n",
" if parent in kko_list:\n",
" kko_id = id\n",
" kko_frag = id_frag\n",
" kko_parent = parent\n",
" kko_parent_frag = parent_frag\n",
" else: \n",
" var1 = getattr(rc, id_frag) # Note 11\n",
" var2 = getattr(rc, parent_frag)\n",
" if var2 is None: # Note 12\n",
" continue\n",
" else:\n",
" var1.is_a.append(var2) # Note 13\n",
" if kko_parent is not None: # Note 14 \n",
" with kko: \n",
" if kko_id in kko_list: # Note 15\n",
" continue\n",
" else:\n",
" var1 = getattr(rc, kko_frag) # Note 16\n",
" var2 = getattr(kko, kko_parent_frag)\n",
" var1.is_a.append(var2)\n",
"thing_list = [] # Note 17\n",
"with open(src_file, 'r', encoding='utf8') as csv_file:\n",
" is_first_row = True\n",
" reader = csv.DictReader(csv_file, delimiter=',', fieldnames=['id', 'subClassOf', 'parent'])\n",
" for row in reader: ## Note 2: Pass 3: remove owl.Thing\n",
" id = row['id']\n",
" parent = row['parent']\n",
" id = id.replace('http://kbpedia.org/kko/rc/', 'rc.') # Note 4\n",
" id = id.replace('http://kbpedia.org/ontologies/kko#', 'kko.')\n",
" id_frag = id.replace('rc.', '')\n",
" id_frag = id_frag.replace('kko.', '')\n",
" parent = parent.replace('http://kbpedia.org/kko/rc/', 'rc.') \n",
" parent = parent.replace('http://kbpedia.org/ontologies/kko#', 'kko.')\n",
" parent = parent.replace('owl:', 'owl.')\n",
" parent_frag = parent.replace('rc.', '')\n",
" parent_frag = parent_frag.replace('kko.', '')\n",
" parent_frag = parent_frag.replace('owl.', '')\n",
" if is_first_row:\n",
" is_first_row = False\n",
" continue\n",
" if parent_frag == 'Thing': # Note 18\n",
" if id in thing_list: # Note 17\n",
" continue\n",
" else:\n",
" if id in kko_list: # Note 19\n",
" var1 = getattr(kko, id_frag)\n",
" thing_list.append(id)\n",
" else: # Note 19\n",
" var1 = getattr(rc, id_frag)\n",
" var1.is_a.remove(owl.Thing)\n",
" thing_list.append(id)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The code block above was the most challenging to date in this **CWPK** series. Some of the lessons from working this out are offered in [**CWPK #21**](https://www.mkbergman.com/2353/cwpk-21-some-accumulated-tips/). Here are the notes that correspond to some of the statements made in the code above:\n",
"\n",
"1. This is a fairly standard CSV processing routine. However, note the 'fieldnames' that are assigned, which give us a basis, as the routine proceeds, for picking out individual column values by row\n",
"\n",
"\n",
"2. Each file processed requires three passes: *Pass #1* - registers each new item in the source file as a bona fide `owl:Class`; *Pass #2* - each new item, now properly registered to the system, is assigned its parent class; and *Pass #3* - each of the new items has its direct assignment to `owl:Class` removed to provide a cleaner hierarchy layout\n",
"\n",
"\n",
"3. We are assigning each row value to a local variable for processing during the loop\n",
"\n",
"\n",
"4. In this, and in the lines to follow, we are reducing the class string and its parent string from potentially its full IRI string to prefix + Name. This gives us the flexibility to have different format input files. We will eventually pull this repeated code each loop out into its own function\n",
"\n",
"\n",
"5. This is a standard approach in CSV file processing to skip the first header row in the file\n",
"\n",
"\n",
"6. There are a few methods apparently possible in owlready2 for assigning a class, but this form of looping over the ontology using the 'rc' namespace is the only version I was able to get to work successfully, with the assignment statement as shown in the second part of this method. Note the assignment to 'Thing' is in the form of a tuple, which is why there is a trailing comma\n",
"\n",
"\n",
"7. Via this check, we only pick up the initial class declarations in our input file, and skip over all of the others that set actual direct parents (which we deal with in *Pass #2*)\n",
"\n",
"\n",
"8. We check all of our input rows to see if the row class is already in our kko dictionary (kko_list, set above the routine) or not. If it is a `kko.Class`, we assign the row information to a new variable, which we then process outside of the 'rc' loop so as to not get the namespaces confused\n",
"\n",
"\n",
"9. Initializing all of this loop's variables to 'None'\n",
"\n",
"\n",
"10. Same processing checks as for *Pass #1*, except now we are checking on the parent values\n",
"\n",
"\n",
"11. This is an owlready2 tip, and a critical one, for getting a class type value from a string input; without this, the class assignment method (Note 13) fails\n",
"\n",
"\n",
"12. If var2 is not in the 'rc' namespace (in other words, it is in 'kko'), we skip the parent assignment in the 'rc' loop\n",
"\n",
"\n",
"13. This is another owlready2 method for assigning a class to a parent class. In this loop, given the checks performed, both parent and id are in the 'rc' namespace\n",
"\n",
"\n",
"14. As for *Pass #1*, we are now processing the 'kko' namespace items outside of the 'rc' namespace and in their own 'kko' namespace\n",
"\n",
"\n",
"15. We earlier picked up rows with parents in the 'kko' namespace; via this call, we also exclude rows with a 'kko' id, since our imported KKO ontology already has all kko class assignments set\n",
"\n",
"\n",
"16. We use the same parent class assignment method as in Note #11, but now for ids in the 'rc' namespace and parents in the 'kko' namespace. However, the routine so far also results in a long listing of classes directly under the `owl:Thing` root **(1)** in an ontology editor such as [Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)):\n",
"\n",
"`owl:Thing`. There may be multiple declarations in our build file, but we may only delete the assignment once from the knowledge base. The lookup to 'thing_list' prevents us from erroring when trying to delete a second or further time \n",
"\n",
"\n",
"18. We are selecting on 'Thing' because we want to *unassign* all of the temporary `owl:Thing` class assignments needed to provide placeholders in *Pass #1* (**Note**: recall in our structure extractor routines in [**CWPK #28**](https://www.mkbergman.com/2363/cwpk-28-extracting-structure-for-typologies/) we added an extra assignment to add an `owl:Thing` class definition so that all classes in the extracted files could be recognized and loaded by external ontology editors)\n",
"\n",
"19. We differentiate between 'rc' and 'kko' concepts because the kko ones are defined separately in the KKO ontology, used as one of our build stubs.\n",
"\n",
"\n",
"As you run this routine in real time from Jupyter Notebook, you can see which classes have been removed by inspecting:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"list(thing_list)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now inspect this loading of an individual typology into our stub. We need to preface our 'save' statement with the 'kb' ontology identifier. I have also chosen to use the 'working' directory for saving these temporary results:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"kb.save(file=r'C:/1-PythonProjects/kbpedia/v300/build_ins/working/kbpedia_reference_concepts.owl', format=\"rdfxml\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, phew! After much time and trial, I was able to get this code running successfully! Here is the output of the full routine:\n",
"\n",
"`*.ipynb` file. It may take a bit of time for the interactive option to load.