{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "CWPK \\#16: Planning the Project \n", "===============================\n", "\n", "Most of the Effort in Coding is in the Planning\n", "-----------------------------------------------\n", "\n", "
\n", "\n", "\n", "\n", "
\n", "\n", "With the environment in place, it is now time to plan the project\n", "underlying this [*Cooking with Python and\n", "KBpedia*](https://www.mkbergman.com/cooking-with-python-and-kbpedia/) series. This installment formally begins **Part II** in our **CWPK** installments.\n", "\n", "Recall from the\n", "outset that our major objectives of this initiative, besides learning\n", "[Python](https://en.wikipedia.org/wiki/Python_%28programming_language%29)\n", "and gaining scripts, were to manage and exploit the\n", "[KBpedia](https://kbpedia.org/) knowledge graph, to expose its build and\n", "test procedures so that extensions or modifications to the baseline\n", "KBpedia may be possible by others, and to apply KBpedia to contemporary\n", "challenges in [machine\n", "learning](https://en.wikipedia.org/wiki/Machine_learning), [artificial\n", "intelligence](https://en.wikipedia.org/wiki/Artificial_intelligence),\n", "and [data\n", "interoperability](https://en.wikipedia.org/wiki/Interoperability). These\n", "broad objectives help to provide the organizational backbone to our\n", "plan.\n", "\n", "We can thus see three main parts to our project. The first part deals\n", "with managing, querying, and using KBpedia as distributed. The second\n", "part emphasizes the logical build and testing regimes for the graph and\n", "how those may be applied to extensions or modifications. The last part\n", "covers a variety of advanced applications of KBpedia or its progeny. As\n", "we define the tasks in these parts of the plan, we will also identify\n", "possible gaps in our current environment that we will need to rectify\n", "for progress to continue. Some of these gaps we can identify now and so\n", "filling them will be some of our most immediate tasks. Other gaps may\n", "only arise as we work through subsequent steps. In those instances we\n", "will need to fill the gaps as encountered. 
Lastly, in terms of scope,\n", "while our last part deals with advanced applications that we can term\n", "'complete' at some arbitrary number of applications, the truth is that\n", "applications are open-ended. We may continue to add to the roster of\n", "advanced applications as time and need allow.\n", "\n", "
\n", "\n", "Important Series Note: As first noted in CWPK #14, every new CWPK article from this installment onward is available as an interactive Jupyter Notebook page. The first interactive installment was actually CWPK #14, and we have reached back and made those earlier pages available as well.\n", "

\n", "Each of these new CWPK \n", "installments is available both as an online interactive\n", "file and as a direct download to use locally. For the online interactive option, pick one of the *.ipynb files. The MyBinder service we are using for the online interactive version maintains a Docker image for each project. Depending on how long it has been since someone last requested a CWPK interactive page, access may be rapid because the image is still in cache, or it may take a bit of time to generate a new image. We discuss this service more in CWPK #57.
\n", "\n", "\n", "### Part I: Using and Managing KBpedia\n", "\n", "Two immediate implications of the project plan arise as we begin to\n", "think it through. First, because of our learning and tech transfer\n", "objectives for the series, we have the opportunity to rely on the\n", "[electronic notebook](https://en.wikipedia.org/wiki/Notebook_interface)\n", "aspects of [Jupyter](https://en.wikipedia.org/wiki/Project_Jupyter) to\n", "deliver on these objectives. We thus need to better understand how to\n", "mix narrative, working code, and interactivity in our Jupyter Notebook\n", "pages. Second, since we need to bridge between Python programs and a\n", "knowledge graph written in\n", "[OWL](https://en.wikipedia.org/wiki/Web_Ontology_Language), we will need\n", "some form of application programming interface\n", "([API](https://en.wikipedia.org/wiki/Application_programming_interface))\n", "or bridge between these programmatic and semantic worlds. It, too, is a\n", "piece that needs to be put in place at the outset.\n", "\n", "This additional foundation then enables us to tackle key use and\n", "management aspects for the KBpedia knowledge graph. 
First among these\n", "tasks are the so-called\n", "[CRUD](https://en.wikipedia.org/wiki/Create,_read,_update_and_delete)\n", "(create-read-update-delete)\n", "activities for the structural components of a knowledge graph:\n", "\n", "- Add/delete/modify classes (concepts)\n", "- Add/delete/modify individuals (instances)\n", "- Add/delete/modify object properties\n", "- Add/delete/modify data properties and values\n", "- Add/delete/modify annotations.\n", "\n", "We also need to expand upon these basic management functions in areas\n", "such as:\n", "\n", "- Advanced class specifications\n", "- Advanced property specifications\n", "- Multi-lingual annotations\n", "- Load/save of ontologies (knowledge graphs)\n", "- Copy/rename ontologies.\n", "\n", "We also need to put in place means for querying KBpedia and using the\n", "[SPARQL](https://en.wikipedia.org/wiki/SPARQL) query language. We can\n", "enhance these basics with a rules language,\n", "[SWRL](https://en.wikipedia.org/wiki/Semantic_Web_Rule_Language).\n", "Because our use of the knowledge graph involves feeding inputs to\n", "third-party machine learners and natural language processors, we need to\n", "add scripts for writing outputs to file in various formats. We want to\n", "add to this listing some best practices and how we can package our\n", "scripts into reusable files and libraries.\n", "\n", "### Part II: Building, Testing, and Extending the Knowledge Graph\n", "\n", "Though KBpedia is certainly usable 'as is' for many tasks, importantly\n", "including as a common reference nexus for interoperating disparate data,\n", "maximum advantage arises when the knowledge graph encompasses the domain\n", "problem at hand. KBpedia is an excellent starting point for building\n", "such domain ontologies. 
By definition, the scope, breadth, and depth of\n", "a domain knowledge graph will differ from what is already in KBpedia.\n", "Some existing areas of KBpedia are likely not needed, others are\n", "missing, and connections and entity coverage will differ as well. This\n", "part of the project deals with building and logically testing the domain\n", "knowledge graph that morphs from the KBpedia starting point.\n", "\n", "For years now we have built KBpedia from scratch based on a suite of\n", "canonically formatted\n", "[CSV](https://en.wikipedia.org/wiki/Comma-separated_values) input files.\n", "These input files are written in a common\n", "[UTF-8](https://en.wikipedia.org/wiki/UTF-8) encoding and duplicate the\n", "kind of tuples found in an N3\n", "([Notation3](https://en.wikipedia.org/wiki/Notation3)) RDF/OWL file. As\n", "a build progresses through its steps, various consistency and logical\n", "tests are applied to ensure the coherence of the built graph. Builds\n", "that fail these tests are flagged with errors, which requires fixes to the\n", "input files before the build can resume and progress to completion. The\n", "knowledge graph that passes these logical tests might be used or altered\n", "by third-party tools, prominently including\n", "[Protégé](https://en.wikipedia.org/wiki/Prot%C3%A9g%C3%A9_%28software%29),\n", "during the use of and interaction with the graph. We thus also need\n", "methods for extracting the build files from an existing knowledge\n", "graph in order to feed the build process anew. These various workflows\n", "between graph and build scripts and tools are shown in Figure 1:\n", "\n", "
\n", "\n", "
\n", "\n", "
\n", "\n", "Figure 1: General Workflow of the KBpedia Project\n", "\n", "
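To make the CSV build inputs described above a little more concrete, here is a minimal sketch, not the actual KBpedia build code. It assumes a hypothetical three-column layout (subject, predicate, object) and made-up sample rows with an `ex:` prefix; the function names `check_encoding` and `read_triples` are likewise only illustrative. It shows two kinds of checks that a CSV transfer format invites: confirming that the bytes decode as UTF-8, and flagging rows that do not form a complete tuple.

```python
import csv
import io

# Hypothetical three-column layout (subject, predicate, object); the real
# KBpedia build files may be organized differently.
EXPECTED_FIELDS = 3


def check_encoding(raw):
    '''Return the decoded text if raw is valid UTF-8, else None.'''
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None


def read_triples(text):
    '''Split CSV text into well-formed triples and flagged row numbers.'''
    triples, flagged = [], []
    reader = csv.reader(io.StringIO(text))
    for row_number, row in enumerate(reader, start=1):
        # A usable row has exactly three non-empty fields.
        if len(row) == EXPECTED_FIELDS and all(cell.strip() for cell in row):
            triples.append(tuple(cell.strip() for cell in row))
        else:
            flagged.append(row_number)
    return triples, flagged


# Made-up sample rows; the second row is missing its object on purpose.
sample = '''ex:Mammal,rdfs:subClassOf,ex:Animal
ex:Bird,rdfs:subClassOf
ex:Mammal,skos:prefLabel,Mammal
'''.encode('utf-8')

text = check_encoding(sample)
triples, flagged = read_triples(text)
print(len(triples), 'triples read; flagged rows:', flagged)
```

Collecting the flagged row numbers, rather than stopping at the first problem, loosely mirrors the build routine described above, where flagged errors are fixed in the input files before the build resumes.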
\n", "\n", "This part of the plan will address all steps in the workflow shown in Figure 1. The use\n", "of CSV flat files as the canonical transfer form between the\n", "applications also means we need to have syntax and encoding checks in\n", "the process. Many of the instructions in this part deal with good\n", "practices for debugging and fixing inconsistent or unsatisfied graphs.\n", "At least as we have managed KBpedia to date, every new coherent release\n", "requires multiple build iterations until the errors are found and\n", "corrected. (This area has potential for more automation.)\n", "\n", "We will also spend time on the modular design of the KBpedia knowledge\n", "graph and the role of (potentially disjoint) typologies to organize and\n", "manage the entities represented by the graph. Here, too, we may want to\n", "modify individual typologies or add or delete entire ones in\n", "transitioning the baseline KBpedia to a responsive domain graph. We thus\n", "provide additional installments focused solely on typology construction,\n", "modification, and extension. Use and mapping of external sources are\n", "essential in this process, but are never cookie-cutter in nature. Having\n", "some general scripts available plus knowledge of creating new relevant\n", "Python scripts is most helpful to accommodate the diversity found in the\n", "wild.\n", "\n", "Fortunately, we have existing\n", "[Clojure](https://en.wikipedia.org/wiki/Clojure) code for most of these\n", "components so that our planning efforts amount more to a refactoring of\n", "an existing code base into another language. 
Hopefully, we will also be\n", "able to improve a bit on these existing scripts.\n", "\n", "### Part III: Advanced Applications\n", "\n", "Having full control of the knowledge graph, plus a working toolchest of\n", "applications and scripts, is a firm basis to use the now-tailored\n", "knowledge graph for machine learning and other advanced applications.\n", "The plan here is less clear than for the prior two parts, though we have\n", "[documented existing use cases with\n", "code](https://kbpedia.org/use-cases/) to draw upon. Major installments\n", "in this part are likely to cover creating machine learning training sets,\n", "creating corpora for unsupervised training, generating various types\n", "(word, statement, graph) of embedding models, selecting and generating\n", "sub-graphs, mapping external vocabularies, categorization, and natural\n", "language processing.\n", "\n", "Lastly, we reserve a task in this plan for setting up the knowledge\n", "graph on a remote server and creating access endpoints. This task is\n", "likely to occur at the transition between Parts II and III, though it\n", "may prove opportune to do it at other steps along the way.\n", "\n", "
\n", " NOTE: This article is part of the Cooking with Python and KBpedia series. See the CWPK listing for other articles in the series. KBpedia has its own Web site.\n", "
\n", "\n", "
\n", "\n", "NOTE: This CWPK \n", "installment is available both as an online interactive\n", "file and as a direct download to use locally. Make sure to pick the correct installment number. For the online interactive option, pick the *.ipynb file. It may take a bit of time for the interactive option to load.
\n", "\n", "
\n", "
I am at best an amateur with Python. There are likely more efficient methods for coding these steps than what I provide. I encourage you to experiment -- which is part of the fun of Python -- and to notify me should you make improvements. \n", "\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }