CWPK \#23: Text Searching KBpedia
==============================


Using the Direct Approach with Owlready2
--------------------------

<div style="float: left; width: 305px; margin-right: 10px;">

<img src="http://kbpedia.org/cwpk-files/cooking-with-kbpedia-305.png" title="Cooking with KBpedia" width="305" />

</div>

In this installment of the [*Cooking with Python and KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) series, we explore ways to directly search knowledge graph text from within the [owlready2](http://www.lesfleursdunormal.fr/static/informatique/owlready/index_en.html) [API](https://en.wikipedia.org/wiki/Application_programming_interface). We first introduced this topic in <strong>[CWPK #19](https://www.mkbergman.com/2349/cwpk-19-exploring-the-api-to-owl/)</strong>; we explain further some of the nuances here.

Recall that owlready2 uses its own local datastore, [SQLite](https://en.wikipedia.org/wiki/SQLite), for storing its knowledge graphs. Besides the search functionality added in Owlready2, we will also be taking advantage of the full-text search (FTS) functionality within SQLite.

### Load Full Knowledge Graph
To get started, we again load our working knowledge graph. In this instance we will use the full KBpedia knowledge graph, <code>kbpedia_reference_concepts.owl</code>, because it has a richer set of contents.

<div style="background-color:#eee; border:1px dotted #aaa; vertical-align:middle; margin:15px 60px; padding:8px;"><strong>Which environment?</strong> The specific load routine you should choose below depends on whether you are using the online MyBinder service (the 'raw' version) or local files. The example below is based on using local files (though replace with your own local directory specification). If loading from MyBinder, use this <a href="https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl">address for <code>kbpedia_reference_concepts.owl</code></a></div>

```python
# Local file locations; use the commented-out URLs instead when working from MyBinder
main = 'C:/1-PythonProjects/kbpedia/sandbox/kbpedia_reference_concepts.owl'
# main = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kbpedia_reference_concepts.owl'
skos_file = 'http://www.w3.org/2004/02/skos/core'
kko_file = 'C:/1-PythonProjects/owlready2/kg/kko.owl'
# kko_file = 'https://raw.githubusercontent.com/Cognonto/CWPK/master/sandbox/builds/ontologies/kko.owl'

from owlready2 import *

# Load the main knowledge graph into its own world
world = World()
kb = world.get_ontology(main).load()
rc = kb.get_namespace('http://kbpedia.org/kko/rc/')

# Load SKOS and KKO into the same world and register them as imports
skos = world.get_ontology(skos_file).load()
kb.imported_ontologies.append(skos)

kko = world.get_ontology(kko_file).load()
kb.imported_ontologies.append(kko)
```

To execute the load, press <code>shift+enter</code> to run the cell contents, or pick Run from the main menu.

Besides changing our absolute file input, note we have added another scoping assignment, <code>world</code>, to our load. Here <code>world</code> is an instance of Owlready2's <code>World</code> class, which encompasses the SQLite storage space used by Owlready2. Note we load all of our ontologies (knowledge graphs) into this world so that we may invoke some of the FTS functionality later in this installment.

### Basic Search Functions

As the [owlready2 documentation](https://owlready2.readthedocs.io/en/latest/onto.html#simple-queries) explains, Owlready2 contains some built-in search capabilities that can be performed with the <code>.search()</code> query method. This method can accept one or several keyword arguments:

- <code>iri</code> - for searching entities by their full IRIs
- <code>type</code> - for searching instances for a given class
- <code>subclass_of</code> - for searching subclasses of a given class
- <code>is_a</code> - for searching both instances and subclasses of a given class
- any object, data or annotation property name - for searching by the values of that property.

Special arguments that may be added to these arguments are:

- <code>_use_str_as_loc_str</code> - whether to treat plain Python strings as strings in any language (default is True)
- <code>_case_sensitive</code> - whether to take lower/upper case into consideration (default is True).

Our search queries may accept quoted phrases and prefix or suffix wildcards (*). Let's look at some examples combining these arguments and methods. Our first one is similar to what we presented in <strong>CWPK #19</strong>:

```python
world.search(iri = "*luggage*")
```

[]

Notice our result here is an empty set, in other words, no matches. Yet we know there are IRIs in KBpedia that include the term 'luggage'. We suspect the reason for not seeing a match is that the term might start with upper case in our IRIs. We will set the case sensitivity argument to false and try again:

```python
world.search(iri = "*luggage*", _case_sensitive = False)
```

Great! We are now seeing the results we expected.

Note in the query above that we used the wildcard (*) to allow for either prefix or suffix matches. As you can see from the results above, most of the search references match the interior part of the IRI string.
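To make the effect of the wildcard and the case-sensitivity switch concrete, here is a minimal sketch using only Python's standard <code>fnmatch</code> module. It illustrates the matching behavior described above and is not how Owlready2 implements <code>.search()</code>; the <code>match_iris</code> helper and the sample IRIs are made up for the example.

```python
from fnmatch import fnmatchcase

def match_iris(iris, pattern, case_sensitive=True):
    """Return the IRIs that match a '*' wildcard pattern (hypothetical helper)."""
    if case_sensitive:
        return [iri for iri in iris if fnmatchcase(iri, pattern)]
    # Case-insensitive: compare lower-cased copies of both pattern and IRI
    lowered = pattern.lower()
    return [iri for iri in iris if fnmatchcase(iri.lower(), lowered)]

# Sample IRIs invented for illustration only
sample_iris = [
    "http://kbpedia.org/kko/rc/Luggage",
    "http://kbpedia.org/kko/rc/LuggageTag",
    "http://kbpedia.org/kko/rc/Mammal",
]

strict = match_iris(sample_iris, "*luggage*")
relaxed = match_iris(sample_iris, "*luggage*", case_sensitive=False)

print(strict)   # nothing matches, since the sample IRIs use an upper-case 'L'
print(relaxed)  # the two IRIs containing 'Luggage'
```

Because the pattern has a wildcard on both sides, the term may sit anywhere within the IRI, which is why interior matches show up in the results.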

The <code>iri</code> argument takes a search string as its assignment. The other three keyword assignments noted above take an object name, as this next example shows:

```python
world.search(subclass_of=rc.Mammal)
```

We get a tremendous number of matches on this query, so much so that I cleared away the current cell output (via Cell &rarr; Current Outputs &rarr; Clear, when highlighting this cell). To winnow this result set further, we can combine search terms as the next example shows. We add to our initial search a string search of the IRIs to pick out those prior results that might represent 'Bats':

```python
world.search(subclass_of=rc.Mammal, iri = "*Bat*")
```

[rc.Bat-Mammal, rc.SacWingedBat, rc.BulldogBat, rc.FreeTailedBat, rc.HorseshoeBat, rc.SchreibersBat, rc.WesternSuckerFootedBat, rc.AfricanLongFingeredBat, rc.AfricanYellowBat, rc.AllensYellowBat, rc.AsianPartiColoredBat, rc.DaubentonsBat, rc.EasternRedBat, rc.GreatEveningBat, rc.GreaterTubeNosedBat, rc.GreyLongEaredBat, rc.HawaiianHoaryBat, rc.KobayashisBat, rc.LesserYellowBat, rc.LittleTubeNosedBat, rc.NewGuineaBigEaredBat, rc.NewZealandLongTailedBat, rc.NorthernLongEaredBat, rc.PallidBat, rc.SilverHairedBat, rc.SpottedBat, rc.AfricanSheathTailedBat, rc.AmazonianSacWingedBat, rc.BeccarisSheathTailedBat, rc.ChestnutSacWingedBat, rc.DarkSheathTailedBat, rc.EcuadorianSacWingedBat, rc.EgyptianTombBat, rc.FrostedSacWingedBat, rc.GraySacWingedBat, rc.GreaterSacWingedBat, rc.GreaterSheathTailedBat, rc.GreenhallsDogFacedBat, rc.HamiltonsTombBat, rc.HildegardesTombBat, rc.LargeEaredSheath-TailedBat, rc.LesserSacWingedBat, rc.LesserSheathTailedBat, rc.MauritianTombBat, rc.NorthernGhostBat, rc.P

Again, we get a large number of results. There are clearly many mammals and bats within the KBpedia reference graph!
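The way the two search terms narrow each other can be sketched with a toy example in plain Python. This is only an analogy under stated assumptions: ordinary Python classes stand in for reference concepts, and the <code>descendants</code> and <code>search</code> helpers are hypothetical, not part of Owlready2. A result must satisfy both conditions to be returned.

```python
from fnmatch import fnmatchcase

# Toy hierarchy standing in for reference concepts (names are illustrative)
class Mammal: pass
class Bat(Mammal): pass
class SpottedBat(Bat): pass
class Whale(Mammal): pass
class Battery: pass  # name matches *Bat* but it is not a Mammal

def descendants(cls):
    """Collect all direct and indirect subclasses of cls."""
    found = []
    for sub in cls.__subclasses__():
        found.append(sub)
        found.extend(descendants(sub))
    return found

def search(subclass_of, name_pattern):
    """Keep only descendants whose name also matches the wildcard pattern."""
    return sorted(
        c.__name__
        for c in descendants(subclass_of)
        if fnmatchcase(c.__name__, name_pattern)
    )

bats = search(Mammal, "*Bat*")
print(bats)  # Whale fails the name test; Battery fails the subclass test
```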

Per the listing above, there are a number of these pre-configured search arguments directly available through Owlready2.

### Full Text Search
We can also instruct the FTS system in SQLite to index additional fields. Since we are interested in a term we know occurs in KBpedia's annotations relating some reference concepts to the UN standard products and services codes ([UNSPSC](https://en.wikipedia.org/wiki/UNSPSC)), we try that search directly:

```python
world.search(entered = "*UNSPSC*")
```

[]

Hmm, this tells us there are no results. We must be missing an indexed field. So, let's instruct the system to add indexing to the <code>definition</code> property where we suspect the reference may occur. We do so using the <code>.append</code> method to add a new field for our RC definitions (<code>skos.definition</code>) to the available FTS index structure:

```python
world.full_text_search_properties.append(skos.definition)
```

Since this is just a simple assignment, when we Run the cell we get no output.

However, that assignment now allows us to invoke the internal FTS (full-text search) argument:

```python
world.search(definition = FTS("UNSPSC*"))
```

If you get an 'operational error' that means you did not Run the <code>.append</code> instruction above.

Like some of the other listings, this command returns a very large number of results, a couple of which are warnings we can ignore, so we again Clear the Cell. We can get a smaller listing with another keyword search, this time for the wildcarded 'gear*' search:

```python
world.search(definition = FTS("gear*"))
```

[rc.undercarriage, rc.number-of-forward-gears, rc.vehicle-transmission, rc.AutomaticTransmission, rc.BearingBushingWheelGear, rc.BevelGear, rc.Bicycle-MultiGear, rc.BoeingAH-64Apache, rc.BugattiVeyron, rc.ChildrensWebSite, rc.CombatSportsEvent, rc.Commercialism, rc.CyclingClothing, rc.Device-FunctionallyDefective, rc.FirstNorthAmericansNovels, rc.Fishery, rc.FreeDiving, rc.Game-EquipmentSet, rc.Gear, rc.GearManufacturingMachine, rc.Gearing-Mechanical, rc.GearlessElectricDrive, rc.Goggles, rc.Harness-Animal, rc.Helmet, rc.IlyushinIl-30, rc.LandingGearAssembly, rc.MachineProtocol, rc.Mechanism-Technology, rc.Overdrive-Mechanics, rc.PinionGear, rc.ProtectiveEquipment-Human, rc.ProtectiveGear, rc.ScubaGear, rc.ScubaSnorkelingGear, rc.ShockAndAwe-MilitaryTactic, rc.Supercharger, rc.TeacherTrainingProgram, rc.Trek, rc.Wheel, rc.WildernessBackpacking, rc.WormGear]

Notice in this search that we are able to use the suffix wildcard (*) character. However, unlike the standard Owlready2 search, we are not able to use a wildcard (*) prefix search.
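The token-based nature of full-text search can be seen directly in SQLite, using Python's standard <code>sqlite3</code> module. The following is a minimal sketch under stated assumptions: the table name <code>defs</code>, the helper functions, and the sample definitions are all invented for illustration, and the code makes no claim about how Owlready2 lays out its own index. It tries the FTS5 module first and falls back to FTS4 if the local SQLite build lacks it.

```python
import sqlite3

def create_fts_table(conn):
    """Create a one-column full-text table, trying FTS5 first and then FTS4."""
    for module in ("fts5", "fts4"):
        try:
            conn.execute(f"CREATE VIRTUAL TABLE defs USING {module}(definition)")
            return module
        except sqlite3.OperationalError:
            continue  # this SQLite build lacks the module; try the next one
    raise RuntimeError("no FTS module available in this SQLite build")

# Hypothetical names and definitions, written for this example
names = {1: "BevelGear", 2: "Gearing-Mechanical", 3: "Helmet", 4: "Mammal"}
definitions = [
    (1, "A gear with teeth cut on a conical surface."),
    (2, "Gearing transfers rotation between two shafts."),
    (3, "Protective headgear worn in many sports."),
    (4, "A warm-blooded vertebrate animal."),
]

conn = sqlite3.connect(":memory:")
create_fts_table(conn)
conn.executemany("INSERT INTO defs(rowid, definition) VALUES (?, ?)", definitions)

def fts_search(query):
    """Return the names whose definition matches the full-text query."""
    cursor = conn.execute(
        "SELECT rowid FROM defs WHERE defs MATCH ? ORDER BY rowid", (query,)
    )
    return [names[row[0]] for row in cursor]

exact_hits = fts_search("gear")    # whole-token match only
prefix_hits = fts_search("gear*")  # any token that starts with 'gear'

print(exact_hits)
print(prefix_hits)
```

The 'Helmet' entry is never returned, even though its definition contains 'headgear': the index works on whole tokens and token prefixes, so a term buried at the end of a longer word is not found.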

Since we have added a new indexed search table to our system, we may want to retain this capability. So, we decide to save the entire graph to the database, as the last example shows:

```python
world.set_backend(filename = 'cwpk-23-text-searching-kbpedia.db', exclusive = False)
```

This now means our world is tied to a persistent database file on disk.

If you run this multiple times you may get an operational error since you have already set the backend filename.

We can then <code>.save()</code> our work and exit the notebook.

```python
world.save()
```
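The two steps above, naming a database file and then saving, follow the usual pattern for a file-backed SQLite store. As a rough analogy only, here is a sketch with Python's standard <code>sqlite3</code> module showing that committed work survives closing and reopening the file. The file name <code>demo.db</code> and the <code>notes</code> table are hypothetical, and this code does not touch Owlready2 or its database.

```python
import os
import sqlite3
import tempfile

# Hypothetical database file in a fresh temporary directory
db_path = os.path.join(tempfile.mkdtemp(), "demo.db")

# First session: write something and commit it to the file
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE notes (body TEXT)")
conn.execute("INSERT INTO notes VALUES ('saved to disk')")
conn.commit()  # without the commit, the inserted row would be lost on close
conn.close()

# Second session: reopen the same file and read the row back
reopened = sqlite3.connect(db_path)
restored = [row[0] for row in reopened.execute("SELECT body FROM notes")]
reopened.close()

print(restored)  # ['saved to disk']
```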

### Additional Documentation

Here is additional information on the system's text searching capabilities:

- Standard [owlready2 search options](https://owlready2.readthedocs.io/en/latest/onto.html#simple-queries)
- [FTS search options](https://owlready2.readthedocs.io/en/latest/annotations.html#full-text-search-fts).


 <div style="background-color:#efefff; border:1px dotted #ceceff; vertical-align:middle; margin:15px 60px; padding:8px;"> 
  <span style="font-weight: bold;">NOTE:</span> This article is part of the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/" style="font-style: italic;">Cooking with Python and KBpedia</a> series. See the <a href="https://www.mkbergman.com/cooking-with-python-and-kbpedia/"><strong>CWPK</strong> listing</a> for other articles in the series. <a href="http://kbpedia.org/">KBpedia</a> has its own Web site.
  </div>

<div style="background-color:#ebf8e2; border:1px dotted #71c837; vertical-align:middle; margin:15px 60px; padding:8px;"> 

<span style="font-weight: bold;">NOTE:</span> This <strong>CWPK 
installment</strong> is available either as an online interactive
file <a href="https://mybinder.org/v2/gh/Cognonto/CWPK/master" ><img src="https://mybinder.org/badge_logo.svg" style="display:inline-block; vertical-align: middle;" /></a> or as a <a href="https://github.com/Cognonto/CWPK" title="CWPK notebook" alt="CWPK notebook">direct download</a> to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the <code>*.ipynb</code> file. It may take a bit of time for the interactive option to load.</div>

<div style="background-color:#feeedc; border:1px dotted #f7941d; vertical-align:middle; margin:15px 60px; padding:8px;"> 
<div style="float: left; margin-right: 5px;"><img src="http://kbpedia.org/cwpk-files/warning.png" title="Caution!" width="32" /></div>I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to <a href="mailto:mike@mkbergman.com">notify me</a> should you make improvements.    

</div>