CWPK \#30: Extracting Annotations
=======================================

Everything Can Be Annotated in a Knowledge Graph
--------------------------

<div style="float: left; width: 305px; margin-right: 10px;">

<img src="http://kbpedia.org/cwpk-files/cooking-with-kbpedia-305.png" title="Cooking with KBpedia" width="305" />

</div>

We've seen in the previous two installments of this [*Cooking with Python and KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) series various ways to specify a subset population for driving an iterative process for extracting structure from [KBpedia](https://kbpedia.org/). We're going to retain that iterative approach, only change it now to extract annotations. Classes, properties, and instances (individuals) may all be annotated in [OWL](https://en.wikipedia.org/wiki/Web_Ontology_Language). We thus need to derive generalized approaches that can apply to any entity in a knowledge graph.

Annotations are information applied to a given entity in order to point to it, describe it, or identify it. As a best practices matter, there are certain fields we recommend be universally applied to annotate any given entity:

- A preferred label (<code>prefLabel</code>) that is the standard name or title for a thing
- Multiple alternative labels (<code>altLabel</code>) that capture the various ways a given thing may be referred to, including synonyms, acronyms, jargon, etc.
- A definition of the thing (<code>definition</code>)
- All labels should be tagged with a language tag in order to more readily support translation and use in multiple languages.

We may also find comments or notes associated with particular items. Further, in the case of object or data properties, we may have additional characterizations such as <code>domain</code> or <code>range</code> or functionality assigned to the item. We could have retrieved these characterizations as part of our structural extractions, but decided instead to include them in the annotation extraction pass (even though those characterizations are not annotative).

### Items to be Extracted During Annotation Pass
We can thus assemble a list of items that may be extracted during an annotation extraction pass. We could do these extractions in parts, since that is often the better approach during the inverse process of building our knowledge graph. However, given the number of annotation and related items that may be extracted, and the number of combinations of same, we decide as a matter of simplicity to extract all such information as a single record for each subject entity. We can later manipulate the large flat files so generated if we need to focus on subsets of them. We may revisit this question once we tackle the build side of this [roundtripping](https://en.wikipedia.org/wiki/Round-trip_format_conversion) process.

Some of the items that we will extract have multiple entries per subject. Parental class is one such item, as are alternative labels, which may number into the tens for a rather complete characterization. From our experience in the last installments we know we will need to set up some inner loops to accommodate such multiple entries. So, with these understandings, we can now compile a list of items that may be extracted on an annotation extraction pass, including whether the item is limited to a single entry, or may have many:

- IRI fragment name: single
- <code>prefLabel</code>: single
- <code>altLabel</code>: many
- superclass: many
- <code>definition</code>: single
- <code>editorialNote</code>: many
- mapping properties: many (a characterization that will grow over time)
- <code>comment</code>: many
- <code>domain</code>: single (object and data properties, only)
- <code>range</code>: single (object and data properties, only)
- functional type: single (object and data properties, only)

So, we decide to develop two variants of our code block: a standard one, and an expanded one that includes the object and data property additions. The IRI fragment name is the alias used internally in our Python programs and what gets concatenated with the base IRI to form the full IRI for the entity.
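
One way to pin down the two variants is to write the field lists out as data. The sketch below is only illustrative: the names `CLASS_FIELDS` and `PROPERTY_FIELDS`, and the column labels `id`, `mapping`, and `functional`, are assumptions made here, not part of KBpedia or owlready2. The order follows the list above rather than the order used in the code cells later on.

```python
# Hypothetical field specifications: (column name, may have many entries?)
CLASS_FIELDS = [
    ('id', False),             # IRI fragment name
    ('prefLabel', False),
    ('altLabel', True),
    ('superclass', True),
    ('definition', False),
    ('editorialNote', True),
    ('mapping', True),         # mapping properties
    ('comment', True),
]

# The expanded variant adds the three property-only characterizations
PROPERTY_FIELDS = CLASS_FIELDS + [
    ('domain', False),
    ('range', False),
    ('functional', False),
]

header = ','.join(name for name, _ in PROPERTY_FIELDS)
multi = [name for name, many in PROPERTY_FIELDS if many]
print(header)
print(multi)    # the fields that will need an inner loop
```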

Also, to maintain the idea of a single line per subject entity, we decide that: 1) we will separate multiple entries for a given item with the '||' ("double pipe") separator, which we use because it rarely, if ever, occurs in ordinary text and it is easy to spot when scanning code; and 2) we will not use full IRIs in order to aid record readability.
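
Here is a small sketch of those two conventions using plain Python strings. The helper names `to_field`, `from_field`, and `fragment` are hypothetical; the base IRI is the one for the `rc` namespace in this series.

```python
SEP = '||'
BASE = 'http://kbpedia.org/kko/rc/'

def to_field(values):
    """Collapse multiple entries into one delimited field."""
    return SEP.join(str(v) for v in values)

def from_field(field):
    """Split a delimited field back into its entries (an empty field gives no entries)."""
    return field.split(SEP) if field else []

def fragment(iri, base=BASE):
    """Drop the base IRI, keeping only the fragment name."""
    return iri[len(base):] if iri.startswith(base) else iri

alts = ['bag', 'bags', 'luggage']
field = to_field(alts)
print(field)                       # bag||bags||luggage
print(from_field(field) == alts)   # True
print(fragment(BASE + 'Luggage'))  # Luggage
```

Being able to split the field back into its entries is what will matter on the build side of the roundtrip.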

(BTW, if we decide over time to add other standard characterizations to our items we will adjust our routines accordingly.)

### Starting and Load
We again begin with our standard opening routine, except we have now substituted 'kbpedia' for 'main' in the first line, to make our reference going forward more specific:

<div style="background-color:#eee; border:1px dotted #aaa; vertical-align:middle; margin:15px 60px; padding:8px;"><strong>Which environment?</strong> The specific load routine you should choose below depends on whether you are using the online MyBinder service (the 'raw' version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, replace with the lines that are commented (<code>#</code>) out.</div>

In [1]:
kbpedia = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# kbpedia = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core' 
kko_file = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'


from owlready2 import *
world = World()
kb = world.get_ontology(kbpedia).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')               

skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)
core = world.get_namespace('http://www.w3.org/2004/02/skos/core#')

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
kko = kb.get_namespace('http://kbpedia.org/ontologies/kko#')

As always, we execute each cell as we progress down this notebook page by pressing <code>shift+enter</code> for the highlighted cell or by choosing Run from the notebook menu.

### Basic Extraction Set-up
We tackle the smaller (non-property) variant of our code block first, treating the extracted items listed above as the members of a Python list. We also choose to prefix our variables with <code>annot_</code>. We will start with a single item, forgoing the loop for the moment, to test whether we have gotten our correspondences right. For the class set-up we'll use the relatively small <code>rc.Luggage</code> class. (You may substitute any KBpedia RC as this item.)

In [2]:
s_item = rc.Luggage
annot_pref = s_item.prefLabel
annot_sup  = ''
# annot_sup  = s_item.superclass  # maybe it should be is_a
annot_alt  = s_item.altLabel
annot_def  = s_item.definition
annot_note = s_item.editorialNote
annot = [annot_pref, annot_sup, annot_alt, annot_def, annot_note]
print(annot)

[['baggage'], '', ['bag', 'bags', 'luggage'], ['This ‘class’ product category is drawn from UNSPSC, readily converted into other widely used systems. This product ‘class’ is an aggregate of luggage. This product category corresponds to the UNSPSC code: 53121500.'], []]


We need to add a few items to deal with specific property characteristics including <code>domain</code>, <code>range</code>, and functional type (which is blank in all of our cases):

In [12]:
s_item = kko.representations
annot_pref = s_item.prefLabel
annot_sup  = s_item.is_a
annot_dom  = s_item.domain
annot_rng  = s_item.range
annot_func = ''
annot_alt  = s_item.altLabel
annot_def  = s_item.definition
annot_note = s_item.editorialNote
annot = [annot_pref, annot_sup, annot_dom, annot_rng, annot_func, annot_alt, annot_def, annot_note]
print(annot)

[['representations'], [owl.AnnotationProperty], [], [], '', ['annotations', 'indexicals', 'metadata'], ['Pointers or indicators, including symbolic ones such as text or URLs, that draw attention to the actual or dynamic object.'], []]


KBpedia does not use functional properties at present. I leave a placeholder above, but have not worked out the [owlready2](https://owlready2.readthedocs.io/en/latest/intro.html) access methods.

### Working Out the Code Block
A quick inspection of these outputs flags a few areas of concern. We see that items are often enclosed in square brackets (list notation in Python), we have many quoted items, and we have (as we knew) multiple entries for some fields, especially <code>altLabel</code> and parents. In order to test our code block, we will need to have a test set loaded. We decide to keep on with <code>rc.Luggage</code>, but I throw in a length count. You can substitute any non-leaf RC into the code if you want a larger or smaller or different domain test set.

In [3]:
root = rc.Luggage
s_set=root.descendants()

len(s_set)

25

For the iteration part for the multiple entries, we begin with the code blocks used for the inner loops dealing with the structural backbone issues in [**CWPK #28**](https://www.mkbergman.com/2363/cwpk-28-extracting-structure-for-typologies/) and [**CWPK #29**](https://www.mkbergman.com/2364/cwpk-29-extracting-object-and-data-properties/). But the purpose of tracing inheritance is different than retrieving values for multiple attributes of a single entity. Maybe we should tackle what seems to be an easier concern to remove the enclosing brackets ('[ ]').

I also decide as we test out these code blocks that I would shorten the variable names to reduce the amount of typing and to reflect a more general procedure. So, all of the <code>annot_</code> prefixes from above become <code>a_</code>.

Poking around I first find a string replacement example, followed by the <code>.join</code> method for strings:

In [None]:
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_note = s_item.editorialNote
  a_     = [a_pref,a_sup,a_alt,a_def,a_note]
  def listToStringWithoutBrackets(a_):
    return str(a_).replace('[','').replace(']','')
  listToStringWithoutBrackets(a_)
  print( ','.join( repr(e) for e in a_ ) )
len(s_set)

I try another string substitution example with similarly disappointing results:

In [None]:
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_sup  = s_item.is_a
  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_note = s_item.editorialNote
  a_     = [s_item,a_pref,a_sup,a_alt,a_def,a_note]
  a_     = [str(i) for i in a_]
  a_out  = ','.join(a_).strip('[]') 
  print(a_out)
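
The behavior can be reproduced without loading KBpedia. In this sketch, plain Python lists stand in for the values owlready2 returns (the sample values are copied from the <code>rc.Luggage</code> output shown earlier). Python's `str.strip('[]')` only trims bracket characters from the two ends of the joined string, so the brackets between fields survive.

```python
# Stand-in values copied from the rc.Luggage output shown earlier
a_pref = ['baggage']
a_alt  = ['bag', 'bags', 'luggage']
a_note = []

a_ = [str(i) for i in [a_pref, a_alt, a_note]]
a_out = ','.join(a_).strip('[]')     # trims only the leading and trailing brackets
print(a_out)                         # 'baggage'],['bag', 'bags', 'luggage'],
```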


These and several other string approaches failed because we are dealing with result sets that may hold multiple entries. It seemed like the safest way to ensure the fields were treated as strings was to explicitly declare them as such, and then manipulate the string directly. So, in the code below, we grab the property from the entity, convert it to a string, and then remove the first and last characters of the entire string, which in our case are the brackets. Note in this test code that I also (temporarily) comment out the two fields where we have possibly multiple items that we want to loop over and concatenate into a single string entry:

In [None]:
a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]                   # this is one way to remove opening and closing characters ([ ])
#  a_sup  = s_item.is_a
#  a_alt  = s_item.altLabel
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

We still see brackets in the listing, but those are for the two properties we commented out. All other brackets are now gone. While I really do not like repeating the same bracket removal code multiple times, it works, and after spending perhaps more time than I care to admit trying to find a more elegant solution, I decide to accept the workable over the perfect. I am hoping when we loop over the elements for the two fields commented out that we will be extracting the element from each pass of the loop, and thus via processing will see the removal of their brackets. (That apparently is what happens in the loop steps below.)
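
If the repetition bothers you as well, the slice can be wrapped in a small helper. This is only a sketch: `unbracket` is a hypothetical name, and plain lists stand in for the owlready2 values. Note that the slice removes just the outer brackets; the quotes around each string remain, and a list with several entries keeps its internal commas.

```python
def unbracket(value):
    """Render a list as a string and drop the enclosing brackets."""
    return str(value)[1:-1]

print(unbracket(['baggage']))                  # 'baggage'
print(unbracket([]))                           # prints an empty line
print(unbracket(['bag', 'bags', 'luggage']))   # 'bag', 'bags', 'luggage'
```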

Now, it is time to tackle the harder question of collapsing (also called 'flattening') a field with multiple entries. The basic idea of this inner loop is to treat all elements as strings, loop over the multiple elements, grab the first one and end if there is only one, but to continue looping if there is more than one until the number of elements is met, and to add a 'double pipe' ('||') character string to the previously built elements before concatenating the current element. This order of putting the delimiter at the beginning of each loop result is to make sure our final string with all concatenated results does not end with a delimiter. The skipping of the first pass means no delimiter is added at the beginning of the first element, also good if there is only one element for a given entity, which is often the case.

Python has robust <code>for</code> and <code>while</code> statements. The approach I settled on for this example uses <code>enumerate</code>, which yields on each pass a tuple of the numeric index and the current element:

In [None]:
a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
#  a_sup  = s_item.is_a
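  a_alt  = s_item.altLabel
  a_item = ''                                 # reset the accumulator for each entity
  for a_id, a in enumerate(a_alt):            # here is the added inner loop as explained in text
    if a_id > 0:                              # every pass after the first adds the delimiter first
        a_item = a_item + '||'
    a_item = a_item + str(a)                  # then concatenate the current element
  a_alt  = a_item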
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

Now that the inner loop example is working we can duplicate the approach for the other inner loop and move on to putting a full working code block together.
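
The delimiter-first logic described in the text can also be tried in isolation. The sketch below uses plain lists rather than KBpedia entities, and `flatten` is a hypothetical helper name. It follows the same logic, then checks the result against Python's built-in `str.join`, which produces the same string in one call.

```python
def flatten(values, sep='||'):
    """Concatenate entries, adding the delimiter before every entry but the first."""
    out = ''
    for idx, val in enumerate(values):
        if idx > 0:
            out = out + sep
        out = out + str(val)
    return out

alts = ['bag', 'bags', 'luggage']
print(flatten(alts))                       # bag||bags||luggage
print(flatten(['evening bags']))           # evening bags
print(flatten(alts) == '||'.join(alts))    # True
```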

### Class Annotations
OK, so we appear ready to start finalizing the code block. We will start with class annotations because they have fewer fields to capture. The first step we want to do is to remove the pesky <code>rc.</code> namespace prefix in our output. Remember, this came from a tip in our last installment:

In [4]:
def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)

(How to set it back to the default is described in the prior installment.)

We also pick a class and its descendants to use in our prototype example. I also add a <code>len</code> statement in the code to indicate how many classes we will be processing in this example:

In [None]:
root = rc.Luggage
s_set=root.descendants()

len(s_set)

We now expand our code block to set our initial iterator to an empty string, fix (remove) the brackets, and process the two inner loops of the <code>altLabels</code> and parent classes putting the "double pipe" ('||') between entries:

In [7]:
a_item = ''
for s_item in s_set:
  a_pref = s_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
  a_sup  = s_item.is_a
  a_item = ''                          # reset the accumulator for each field
  for a_id, a in enumerate(a_sup): 
    if a_id > 0:                       # add the delimiter before every entry but the first
        a_item = a_item + '||'
    a_item = a_item + str(a)
  a_sup  = a_item
  a_alt  = s_item.altLabel
  a_item = ''
  for a_id, a in enumerate(a_alt): 
    if a_id > 0:
        a_item = a_item + '||'
    a_item = a_item + str(a)
  a_alt  = a_item
  a_def  = s_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = s_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(s_item,a_pref,a_sup,a_alt,a_def,a_note, sep=',')

EveningBag,'evening bag',Purse,evening bags,'The collection of all evening bags. A type of Purse. The collection EveningBag is an ArtifactTypeByGenericCategory and a SpatiallyDisjointObjectType.',
Gucci,'Gucci',Luggage,GUCCI,'Gucci (/ɡuːtʃi/; Italian pronunciation: [ˈɡuttʃi]) is an Italian luxury brand of fashion and leather goods, part of the Gucci Group, which is owned by the French holding company Kering. Gucci was founded by Guccio Gucci in Florence in 1921.Gucci generated about €4.2 billion in revenue worldwide in 2008 according to BusinessWeek and climbed to 41st position in the magazine\'s annual 2009 \\"Top Global 100 Brands\\" chart created by Interbrand; it ranked retained that rank in Interbrand\'s 2014 index. Gucci is also the biggest-selling Italian brand. Gucci operates about 278 directly operated stores worldwide as of September 2009, and it wholesales its products through franchisees and upscale department stores. In the year 2013, the brand was valued at US$12.1 billio

The routine now seems to be working how we want it, so we move on to accommodate the properties as well.
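
For readers who want to experiment without a loaded knowledge graph, here is a sketch of the same class routine with the repeated steps pulled into helpers. Everything in it is an assumption for illustration: `SimpleNamespace` objects stand in for owlready2 classes, the parent names and definition are placeholders rather than KBpedia data, and `unbracket`, `flatten`, and `class_record` are hypothetical names.

```python
from types import SimpleNamespace

def unbracket(value):
    """Drop the enclosing brackets from a list rendered as a string."""
    return str(value)[1:-1]

def flatten(values, sep='||'):
    """Collapse multiple entries into one delimited string."""
    return sep.join(str(v) for v in values)

def class_record(item):
    """Build the single comma-separated line for one entity."""
    return ','.join([
        item.name,
        unbracket(item.prefLabel),
        flatten(item.is_a),
        flatten(item.altLabel),
        unbracket(item.definition),
        unbracket(item.editorialNote),
    ])

# Stand-in entity; parents and definition are made-up placeholder values
luggage = SimpleNamespace(
    name='Luggage',
    prefLabel=['baggage'],
    is_a=['ParentA', 'ParentB'],
    altLabel=['bag', 'bags', 'luggage'],
    definition=['A placeholder definition.'],
    editorialNote=[],
)
record = class_record(luggage)
print(record)
```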

### Property Annotations
Again, we set the renderer to the 'clean' setting and now pick a property and its sub-properties to populate our working set:

In [2]:
def render_using_label(entity):
    return entity.label.first() or entity.name

set_render_func(render_using_label)

In [3]:
root = kko.representations
p_set=root.descendants()

len(p_set)

2901

This example has nearly 3000 sub-properties! That should make for an interesting example. We add our three new properties to the prior code block. We also make another change, which is to substitute the <code>p_</code> prefix (for properties) for the prior <code>s_</code> prefix (for subjects, whether classes or individuals):

In [None]:
a_item = ''
for p_item in p_set:
  a_pref = p_item.prefLabel
  a_pref  = str(a_pref)[1:-1]
  a_sup  = p_item.is_a
  a_item = ''                          # reset the accumulator for each field
  for a_id, a in enumerate(a_sup): 
    if a_id > 0:                       # add the delimiter before every entry but the first
        a_item = a_item + '||'
    a_item = a_item + str(a)
  a_sup  = a_item
  a_dom  = p_item.domain
  a_dom  = str(a_dom)[1:-1]
  a_rng  = p_item.range
  a_rng  = str(a_rng)[1:-1]
  a_func = ''
  a_alt  = p_item.altLabel
  a_item = ''
  for a_id, a in enumerate(a_alt): 
    if a_id > 0:
        a_item = a_item + '||'
    a_item = a_item + str(a)
  a_alt  = a_item
  a_def  = p_item.definition
  a_def  = str(a_def)[1:-1]
  a_note = p_item.editorialNote
  a_note  = str(a_note)[1:-1]
  print(p_item,a_pref,a_sup,a_dom,a_rng,a_func,a_alt,a_def,a_note, sep=',')

Fantastic! It seems that our basic annotation retrieval mechanisms are working properly. 

You may have noted the <code>sep=','</code> argument in the <code>print</code> statement. It means to add a comma separator between the output variables in the listing, a useful addition in Python 3 especially given our reliance on comma-separated value (CSV) files.
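
Here is a quick illustration of `sep`, along with one caveat worth keeping in mind: a definition that itself contains commas will look like extra columns when fields are simply printed with a comma between them. Python's standard `csv` module quotes such fields. The sample values below are made up for illustration.

```python
import csv
import io

row = ['Luggage', 'baggage', 'bag||bags||luggage', 'Bags, cases, and trunks.']

# Plain comma separation: the commas inside the last field look like extra columns
print(*row, sep=',')

# The csv module quotes any field that contains the delimiter
buf = io.StringIO()
csv.writer(buf).writerow(row)
line = buf.getvalue().strip()
print(line)

# Reading the quoted line back recovers the original four fields
restored = next(csv.reader(io.StringIO(line)))
print(restored == row)
```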

We are now largely done with the logic of our extractors. But, before we get to how to assemble the pieces in a working module, it is time for us to take a brief detour to learn about naming and writing output and saving to and reading from files. Since we will be using CSV files heavily, we also work that into next installment's discussion.

### Additional Documentation

The routines in this installment required much background reading and examples having to do with Python loops and string processing. Here are a few I found informative for today's **CWPK** installment:

- Nice [DataCamp discussion](https://www.datacamp.com/community/tutorials/18-most-common-python-list-questions-learn-python) of lists
- [Building CSV strings in Python](https://levelup.gitconnected.com/building-csv-strings-in-python-32934aed5a9e)
- [Strings, lists, tuples](http://www.openbookproject.net/books/bpp4awd/ch03.html)
- [Python lists and list manipulation](https://towardsdatascience.com/python-basics-6-lists-and-list-manipulation-a56be62b1f95)



 <div style="background-color:#efefff; border:1px dotted #ceceff; vertical-align:middle; margin:15px 60px; padding:8px;"> 
  <span style="font-weight: bold;">NOTE:</span> This article is part of the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/" style="font-style: italic;">Cooking with Python and KBpedia</a> series. See the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/"><strong>CWPK</strong> listing</a> for other articles in the series. <a href="http://kbpedia.org/">KBpedia</a> has its own Web site.
  </div>

<div style="background-color:#ebf8e2; border:1px dotted #71c837; vertical-align:middle; margin:15px 60px; padding:8px;"> 

<span style="font-weight: bold;">NOTE:</span> This <strong>CWPK 
installment</strong> is available both as an online interactive
file <a href="https://mybinder.org/v2/gh/Cognonto/CWPK/master" ><img src="https://mybinder.org/badge_logo.svg" style="display:inline-block; vertical-align: middle;" /></a> and as a <a href="https://github.com/Cognonto/CWPK" title="CWPK notebook" alt="CWPK notebook">direct download</a> to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the <code>*.ipynb</code> file. It may take a bit of time for the interactive option to load.</div>

<div style="background-color:#feeedc; border:1px dotted #f7941d; vertical-align:middle; margin:15px 60px; padding:8px;"> 
<div style="float: left; margin-right: 5px;"><img src="http://kbpedia.org/cwpk-files/warning.png" title="Caution!" width="32" /></div>I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to <a href="mailto:mike@mkbergman.com">notify me</a> should you make improvements.    

</div>