CWPK \#49: Mapping External Sources
=======================================

Sometimes KR is Kind of Tricky
--------------------------

<div style="float: left; width: 305px; margin-right: 10px;">

<img src="http://kbpedia.org/cwpk-files/cooking-with-kbpedia-305.png" title="Cooking with KBpedia" width="305" />

</div>

I think I pretty much lied to you about '[roundtripping](https://en.wikipedia.org/wiki/Round-trip_engineering)' and the importance of the last major section that we just completed in this [*Cooking with Python and KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) series. We had indeed accomplished the extraction-and-build cycle for classes, properties, and their annotations. These are the same roundtrip objectives we had been pursuing and doing with our [Clojure](https://en.wikipedia.org/wiki/Clojure) procedures for years. But we were incomplete. 

There is another purpose to KBpedia as a 'scaffolding' to external sources that is not captured in this initial 'roundtripping' sense. If you download KBpedia, right now as version 2.50, and load it up into an editor like [Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)), you will see the KBpedia structure and its properties and annotations, but you will ***not*** see the links to the external resources that KBpedia integrates. These mappings (correspondences) are provided in separate mapping files, and have been since KBpedia was first released. This has struck me for some time as an unneeded indirection. I think the better approach is to embed these links directly into KBpedia (or whatever your master [knowledge graph](https://en.wikipedia.org/wiki/Ontology_(information_science)) may be). 

OK, so if we do that, does that not become part of the baseline that needs to be roundtripped? And, if so, what is the best practice for representing these external relationships?

Like much that has an [open world approach](https://en.wikipedia.org/wiki/Open-world_assumption), these kinds of questions and open-ended project requirements can stretch into the unachievable future. I don't want to keep moving the goalposts. But I do think the role of top-level knowledge graphs is integration and coverage. The addition of key mappings to our basic expectation for our central knowledge graph seems appropriate and reasonable. Thus, for a 'truer' conceptualization of 'roundtripping', I now include mappings with our basic remit.

This correction (which, of course, is an addition), and some other utilities and tools explorations, are the focus of this next major **Part V** in our **CWPK** series. Think of this next major part as wrapping our basic knowledge graph into the tools of the trade.

### Representing External Resources
The standard way to handle links with an external source in most ontologies is to 'import' the external source. In OWL terms, this means importing the entire ontology and incorporating everything it contains. This is a rather blunderbuss approach. In Python, by contrast, as we have learned, we can limit our imports to specific methods or procedures.

With the standard OWL import command one brings the entire knowledge graph into the active space. Perhaps this is OK, though it does seem excessive if one only wants a small portion of the external resource, perhaps only for a specific class or predicate or three. But we have a more fundamental problem in that some of our external resources are not defined in a formal ontology subject to an import, but are just persistent URIs. Actually, with external knowledge bases like Wikipedia or Wikidata, this design is more common than one where a single ontology call can bring in all of the external structure.

If we are not bringing in a full ontology import -- or may not even be able to do so because the external resources are only accessible via a URI scheme and not a logical ontology -- then we will need to formulate a different linkage protocol. The one we have chosen to follow is this: give each external source its own mapping predicate, and then annotate any KBpedia resource that has a mapping with the link to the external source and its specific mapping predicate. This design decision means we cannot apply a reasoner to the external sources, but we can identify and collect them using SPARQL queries. This marks an important realization: we can now supplement our information access and aggregation for our knowledge graph not only through logical inferencing and subsumption chains, but also through directed queries that make all aspects of our information characterization, including annotations, available for selection and retrieval.

### External Mapping Approach
OK, given this design, we proceed to code up our build (load) routines to bring the external mappings into KBpedia. We begin with our standard start-up instructions:

In [1]:
from cowpoke.__main__ import *
from cowpoke.config import *
from owlready2 import *
import csv                                                                  # Used by mapping_builder below

We then develop our routine in the standard manner, including a header showing key configuration inputs. I explain some of the important implementation notes below the routine:

In [None]:
### KEY CONFIG SETTINGS (see build_deck in config.py) ###             
# 'kb_src'        : 'standard'                                        # Set in master_deck
# 'loop_list'     : mapping_dict.values(),                             
# 'base'          : 'C:/1-PythonProjects/kbpedia/v300/build_ins/mappings/',              
# 'ext'           : '.csv',                                         
# 'out_file'      : 'C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts.csv',


def mapping_builder(**build_deck):
    print('Beginning KBpedia mappings build . . .')
    loop_list = build_deck.get('loop_list')
    base = build_deck.get('base')
    ext = build_deck.get('ext')
    out_file = build_deck.get('out_file')
    for loopval in loop_list:                                               # Note 1
        print('   . . . processing', loopval)                             
        in_file = (base + loopval + ext)
        print(in_file)
        with open(in_file, 'r', encoding='utf8') as in_f:                   # Avoid shadowing the built-in input()
            is_first_row = True
            reader = csv.DictReader(in_f, delimiter=',', fieldnames=['s', 'p', 'o'])
            for row in reader:                                              # Note 2
                if is_first_row:
                    is_first_row = False                
                    continue
                r_s = row['s']                                              # Note 3
                if 'kko/rc' in r_s:                                         # Note 4
                    r_s = r_s.replace('http://kbpedia.org/kko/rc/', '')
                    r_s = getattr(rc, r_s)
                else:
                    r_s = r_s.replace('http://kbpedia.org/ontologies/kko#', '')
                    r_s = getattr(kko, r_s)                                 # Note 5        
                r_p = row['p']                                              # Note 6
                r_o = row['o']
                if loopval == 'dbpedia':                                    # Note 7                    
                    kb_frag = 'dbpedia_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.dbpedia_id.append(r_o)                              # Note 8
                elif loopval == 'dbpedia-ontology':
                    kb_frag = 'dbpedia_ontology_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.dbpedia_ontology_id.append(r_o)
                elif loopval == 'geonames':
                    kb_frag = 'geo_names_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.geo_names_id.append(r_o)
                elif loopval == 'schema.org':
                    kb_frag = 'schema_org_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.schema_org_id.append(r_o)
                elif loopval == 'wikidata':
                    kb_frag = 'wikidata_q_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.wikidata_q_id.append(r_o)
                elif loopval == 'wikipedia':
                    kb_frag = 'wikipedia_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.wikipedia_id.append(r_o)
                elif loopval == 'wikipedia-categories':
                    kb_frag = 'wikipedia_category_id'
                    kb_prop = getattr(rc, kb_frag)
                    r_s.wikipedia_category_id.append(r_o)
                else:
                    print(loopval, 'is not on list.')
                    continue                                                # No mapping property, so skip this row
                kko.mappingAsObject[r_s, kb_prop, r_o] = [r_p]              # Note 9         
                print(r_s)
    print('External mapping uploads are complete.')                

We set up our standard looping routine **(1)**, progressing through a new 'loop_list' that is contained in a new dictionary (<code>mapping_dict</code>) in our <code>config.py</code> file. Like our other build methods, we use the <code>csv</code> module and progress through the input files row-by-row **(2)**. We pick up our column values **(3)** and **(6)** for each row and assign them to local variables corresponding to the s-p-o '[triples](https://en.wikipedia.org/wiki/Semantic_triple)' for each row. We need to account for the fact that our KBpedia resources are under either the 'rc' or 'kko' namespaces **(4)**, and pare them down to their fragments (since the resources are stored on file using their complete IRIs). Because these are string values we are extracting from the input files, we need to look them up and convert them to their internal attribute form **(5)**, which are KBpedia classes in these instances.
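The namespace handling at **(4)** can be isolated into a small function. What follows is only a minimal sketch: <code>split_kbpedia_iri</code> is a hypothetical helper that is not part of *cowpoke*, and the two base IRIs are simply copied from the routine above.

```python
# Base IRIs copied from the mapping_builder routine above.
RC_BASE = 'http://kbpedia.org/kko/rc/'
KKO_BASE = 'http://kbpedia.org/ontologies/kko#'


def split_kbpedia_iri(iri):
    """Return (namespace_name, fragment) for a full KBpedia IRI.

    Hypothetical helper for illustration. It raises ValueError for an IRI
    outside the two known namespaces rather than guessing.
    """
    if iri.startswith(RC_BASE):
        return 'rc', iri[len(RC_BASE):]
    if iri.startswith(KKO_BASE):
        return 'kko', iri[len(KKO_BASE):]
    raise ValueError('Not a KBpedia IRI: ' + iri)


ns_name, frag = split_kbpedia_iri('http://kbpedia.org/kko/rc/Currency')
print(ns_name, frag)    # prints: rc Currency
```

Within the build routine, the fragment would then be looked up with <code>getattr(rc, frag)</code> or <code>getattr(kko, frag)</code>, as at **(5)**. Raising an error for an unrecognized IRI is a design choice of this sketch; the routine above instead treats every non-'rc' IRI as belonging to 'kko'.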

We have a long chain of <code>if</code>/<code>elif</code> statements **(7)** that detect the input file type and then pick the proper mapping predicate (there is one for each external mapping source). This long chain would seem to lend itself to a helper function of some sort, but I decided just to repeat the code blocks because they are simple and straightforward. (Rather than a 'dict' format I wanted something more akin to a 'record' where a single key may point to many different values, but that is an expansion of the 'dict' approach that actually seems a little complicated in Python.) All of the material up to this point is really set-up and processing for the two key [owlready2](http://www.lesfleursdunormal.fr/static/informatique/owlready/index_en.html) algorithms that we need to use.
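For readers who would prefer the helper-function route, here is one possible sketch using a plain dictionary. The file-name-to-property pairs are copied from the routine above; <code>MAPPING_PROPS</code> and <code>add_mapping</code> are hypothetical names, and a <code>SimpleNamespace</code> object stands in for a KBpedia class so that the example runs without owlready2 or a loaded knowledge graph.

```python
from types import SimpleNamespace

# Input file name (loopval) -> name of the mapping property under 'rc'.
# The pairs are copied from the if/elif chain in mapping_builder.
MAPPING_PROPS = {
    'dbpedia': 'dbpedia_id',
    'dbpedia-ontology': 'dbpedia_ontology_id',
    'geonames': 'geo_names_id',
    'schema.org': 'schema_org_id',
    'wikidata': 'wikidata_q_id',
    'wikipedia': 'wikipedia_id',
    'wikipedia-categories': 'wikipedia_category_id',
}


def add_mapping(subject, loopval, external_iri):
    """Append external_iri to the subject's mapping property for this source.

    Hypothetical helper. Returns the property name used, or None when the
    source is not on the list so the caller can skip the row.
    """
    kb_frag = MAPPING_PROPS.get(loopval)
    if kb_frag is None:
        print(loopval, 'is not on list.')
        return None
    getattr(subject, kb_frag).append(external_iri)
    return kb_frag


# Stand-in for a KBpedia class (an assumption made for this sketch only).
person = SimpleNamespace(dbpedia_ontology_id=[])
used = add_mapping(person, 'dbpedia-ontology', 'http://dbpedia.org/ontology/Person')
print(used, person.dbpedia_ontology_id)
```

The trade-off is the usual one: the dictionary keeps the source list in one place, while the repeated blocks in the routine keep each case visible at a glance.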

The first algorithm **(8)**, which we have seen before, associates the external mapped IRI with the current KBpedia class in the loop routine, and has the form:

<pre>
  subject.predicate.append(object)
</pre>

The only tricky part here is making sure that namespace prefixes are used as indicated and that we have converted the external mapping predicate to its internal form (<code>kb_prop</code>). This code is now associating the external IRI to the subject KBpedia class. However, we are still lacking the nature of the mapping predicate.

For this purpose, we need to "annotate the annotation", or basically reify the annotation with its specific mapping predicate. There is a format in owlready2 where we treat the entire s-p-o tuple as the target for the added annotation, as shown at **(9)**. The predicate for this purpose is <code>kko.mappingAsObject</code>. The basic format is:

<pre>
  kko.mappingAsObject[s, p, o] = 'specific mapping predicate'
</pre>

The 'specific mapping predicate' is either <code>owl:equivalentClass</code>, <code>rdfs:subClassOf</code>, <code>kko:superClassOf</code>, <code>kko:isAbout</code>, or <code>kko:isRelatedTo</code>. The <code>kko:mappingAsObject</code> is named as it is because this specifically assigned mapping predicate applies to the external IRI when it appears in the object slot of the s-p-o triple. In other words, the triple so formed has the s-p-o construction of:

<pre>
   KBpedia class - 'specific mapping predicate' - external IRI
</pre>

This is just as it appears in our external mapping files. It is important to keep this ordering straight, since reversing the order of the subject and object causes the inverse properties to be required (<code>rdfs:subClassOf</code> is the inverse of <code>kko:superClassOf</code>, <code>kko:isAbout</code> is the inverse of <code>kko:isRelatedTo</code>).
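To make the ordering point concrete, here is a small sketch of what reversing a mapping triple involves. The inverse pairs are the ones stated above; <code>owl:equivalentClass</code> is symmetric, so it maps to itself. The names <code>INVERSE_PREDICATE</code> and <code>flip_triple</code>, and the sample values, are hypothetical.

```python
# Inverse pairs as stated in the text; owl:equivalentClass is symmetric.
INVERSE_PREDICATE = {
    'rdfs:subClassOf': 'kko:superClassOf',
    'kko:superClassOf': 'rdfs:subClassOf',
    'kko:isAbout': 'kko:isRelatedTo',
    'kko:isRelatedTo': 'kko:isAbout',
    'owl:equivalentClass': 'owl:equivalentClass',
}


def flip_triple(s, p, o):
    """Swap subject and object, replacing the predicate with its inverse."""
    return o, INVERSE_PREDICATE[p], s


# Placeholder values for illustration only.
triple = ('rc.SomeClass', 'rdfs:subClassOf', 'http://example.org/External')
print(flip_triple(*triple))
```

Flipping a triple twice returns the original, which is a quick sanity check that the table is consistent.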

So, with our <code>mapping_builder</code> routine now set, we can run it:

In [None]:
mapping_builder(**build_deck)

Once processed, here is the form the mapping takes within the [Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_(software)) editor:

<div style="margin: 10px auto; display: table;">

<img src="files/mapping-annotation-protege.png" title="Mapping Annotations" height="800" alt="Mapping Annotations" />

</div>

<div style="margin: 10px auto; display: table; font-style: italic;">

Figure 1: Mapping Annotations

</div>

The first callout **(1)** indicates we successfully changed our hyphens to underscores, as discussed in the last major part. The second callout **(2)** shows how our mapping annotations now appear, with the annotation itself being annotated with the specific mapping predicate used.

Since we like how these mappings turned out, we decide to save the file:

In [None]:
kb.save('C:/1-PythonProjects/kbpedia/v300/targets/ontologies/kbpedia_reference_concepts_test.owl', format="rdfxml") 

Since these mappings add considerably to the file size, I actually save both mapped and unmapped versions for later use, depending on need.

### Completing the Roundtrip
Now that our builds contain external mappings, it is appropriate we also create extraction routines to get these assignments back out. First, we can list all of the KBpedia RCs that have a specific external mapping (<code>rc.dbpedia_ontology_id</code> in this case), which also returns a string value for the external IRI:

In [5]:
list(rc.dbpedia_ontology_id.get_relations())

[(rc.Person, 'http://dbpedia.org/ontology/Person'),
 (rc.SoccerPlayer, 'http://dbpedia.org/ontology/SoccerPlayer'),
 (rc.ConceptualWork, 'http://dbpedia.org/ontology/Work'),
 (rc.Bird, 'http://dbpedia.org/ontology/Bird'),
 (rc.TextualPCW, 'http://dbpedia.org/ontology/WrittenWork'),
 (rc.Scientist, 'http://dbpedia.org/ontology/Scientist'),
 (rc.BasketballPlayer, 'http://dbpedia.org/ontology/BasketballPlayer'),
 (rc.Artist, 'http://dbpedia.org/ontology/Artist'),
 (rc.GeopoliticalEntity,
  'http://dbpedia.org/ontology/GeopoliticalOrganisation'),
 (rc.GeopoliticalEntity, 'http://dbpedia.org/ontology/Region'),
 (rc.Lighthouse, 'http://dbpedia.org/ontology/Lighthouse'),
 (rc.FootballPlayer_American,
  'http://dbpedia.org/ontology/GridironFootballPlayer'),
 (rc.FootballPlayer_American,
  'http://dbpedia.org/ontology/AmericanFootballPlayer'),
 (rc.Organization, 'http://dbpedia.org/ontology/Organisation'),
 (rc.Organization, 'http://dbpedia.org/ontology/EmployersOrganisation'),
 (rc.Organizatio

Then, we can decompose those results to construct the s-p-o tuple format needed to retrieve the exact mapping predicate used:

In [11]:
list(kko.mappingAsObject[rc.Currency, rc.dbpedia_ontology_id, 'http://dbpedia.org/ontology/Currency']) 

['owl:equivalentClass']

With some string replacements and a reordering of the elements, we can reconstruct the input rows we used to load the external mappings in the first place, with our example becoming:

<pre>
  rc.Currency, owl:equivalentClass, http://dbpedia.org/ontology/Currency
</pre>

(Using 'rc.' rather than base IRI in this example.) For the moment, we will defer writing the full routine, but will eventually add it to our <code>extract.py</code> module. We will also embed this code block into a loop that writes our individual mapping files to disk, thereby completing the external mapping portion of our roundtrip.
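As a rough indication of where that routine might head, here is a minimal sketch of the reconstruction step. It is not the eventual <code>extract.py</code> code. To keep it runnable without a loaded knowledge graph, a plain list stands in for the output of <code>get_relations()</code> and a dictionary stands in for <code>kko.mappingAsObject</code>; <code>mapping_rows</code>, <code>lookup</code>, and the 's', 'p', 'o' header names are all assumptions.

```python
import csv
import io


def mapping_rows(relations, predicate_for, kb_prop):
    """Yield (s, p, o) rows from (subject, external IRI) pairs.

    relations     : iterable of (subject, external_iri) pairs, the shape
                    returned by get_relations() above
    predicate_for : callable (s, kb_prop, o) -> list of predicates; stands
                    in for kko.mappingAsObject[s, kb_prop, o]
    """
    for subj, ext_iri in relations:
        for pred in predicate_for(subj, kb_prop, ext_iri):
            yield subj, pred, ext_iri


# Stand-in data (assumptions) mirroring the rc.Currency example above.
relations = [('rc.Currency', 'http://dbpedia.org/ontology/Currency')]
annotations = {
    ('rc.Currency', 'rc.dbpedia_ontology_id',
     'http://dbpedia.org/ontology/Currency'): ['owl:equivalentClass'],
}


def lookup(s, p, o):
    """Return the mapping predicates recorded for this s-p-o tuple."""
    return annotations.get((s, p, o), [])


out = io.StringIO()
writer = csv.writer(out, lineterminator='\n')
writer.writerow(['s', 'p', 'o'])    # header names are an assumption
writer.writerows(mapping_rows(relations, lookup, 'rc.dbpedia_ontology_id'))
print(out.getvalue())
```

Swapping the <code>StringIO</code> buffer for a file opened per mapping source would give one output file per external source, matching the layout of the input files.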

 <div style="background-color:#ffecec; border:1px dotted #f5aca6; vertical-align:middle; margin:15px 60px; padding:8px;"> 
  <span style="font-weight: bold;">NOTE:</span> This article is part of the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/" style="font-style: italic;">Cooking with Python and KBpedia</a> series. See the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/"><strong>CWPK</strong> listing</a> for other articles in the series. <a href="http://kbpedia.org/">KBpedia</a> has its own Web site. The <em>cowpoke</em> Python <a href="https://github.com/Cognonto/cowpoke">code listing covering the series</a> is also available from GitHub.
  </div>

<div style="background-color:#ebf8e2; border:1px dotted #71c837; vertical-align:middle; margin:15px 60px; padding:8px;"> 

<span style="font-weight: bold;">NOTE:</span> This <strong>CWPK 
installment</strong> is available both as an online interactive
file <a href="https://mybinder.org/v2/gh/Cognonto/CWPK/master" ><img src="https://mybinder.org/badge_logo.svg" style="display:inline-block; vertical-align: middle;" /></a> or as a <a href="https://github.com/Cognonto/CWPK" title="CWPK notebook" alt="CWPK notebook">direct download</a> to use locally. Make sure and pick the correct installment number. For the online interactive option, pick the <code>*.ipynb</code> file. It may take a bit of time for the interactive option to load.</div>

<div style="background-color:#feeedc; border:1px dotted #f7941d; vertical-align:middle; margin:15px 60px; padding:8px;"> 
<div style="float: left; margin-right: 5px;"><img src="http://kbpedia.org/cwpk-files/warning.png" title="Caution!" width="32" /></div>I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to <a href="mailto:mike@mkbergman.com">notify me</a> should you make improvements.    

</div>