{"nbformat":4,"nbformat_minor":0,"metadata":{"colab":{"name":"Morph-KGC.ipynb","provenance":[],"collapsed_sections":[],"authorship_tag":"ABX9TyPDi7PZzZ7SP3K3EH++nCHd"},"kernelspec":{"name":"python3","display_name":"Python 3"},"language_info":{"name":"python"}},"cells":[{"cell_type":"markdown","source":["# **[Morph-KGC](https://github.com/oeg-upm/morph-kgc) Tutorial**\n","\n","**[Morph-KGC](https://github.com/oeg-upm/morph-kgc)** is an engine that constructs **[RDF](https://www.w3.org/TR/rdf11-concepts/)** and **[RDF-star](https://w3c.github.io/rdf-star/cg-spec/2021-12-17.html)** knowledge graphs from heterogeneous data sources with the **[R2RML](https://www.w3.org/TR/r2rml/)**, **[RML](https://rml.io/specs/rml/)** and **[RML-star](https://kg-construct.github.io/rml-star-spec/)** mapping languages. The full documentation of Morph-KGC is available in **[Read the Docs](https://morph-kgc.readthedocs.io/en/latest/)**.\n","\n","There are two different options to run Morph-KGC:\n","\n","- As a **library**, integrating with **[RDFLib](https://rdflib.readthedocs.io)** and **[Oxigraph](https://oxigraph.org/pyoxigraph)**.\n","- Via the **command line**.\n","\n","Morph-KGC currently supports the following input data formats:\n","- **Relational databases**: **[MySQL](https://www.mysql.com/)**, **[PostgreSQL](https://www.postgresql.org/)**, **[Oracle](https://www.oracle.com/database/)**, **[Microsoft SQL Server](https://www.microsoft.com/sql-server)**, **[MariaDB](https://mariadb.org/)**, **[SQLite](https://www.sqlite.org)**.\n","- **Tabular files**: **[CSV](https://en.wikipedia.org/wiki/Comma-separated_values)**, **[TSV](https://en.wikipedia.org/wiki/Tab-separated_values)**, **[Excel](https://www.microsoft.com/en-us/microsoft-365/excel)**, **[Parquet](https://parquet.apache.org/documentation/latest/)**, **[Feather](https://arrow.apache.org/docs/python/feather.html)**, **[ORC](https://orc.apache.org/)**, **[Stata](https://www.stata.com/)**, **[SAS](https://www.sas.com)**, **[SPSS](https://www.ibm.com/analytics/spss-statistics-software)**, **[ODS](https://en.wikipedia.org/wiki/OpenDocument)**.\n","- **Hierarchical files**: **[JSON](https://www.json.org)**, **[XML](https://www.w3.org/TR/xml/)**.\n","\n","This tutorial shows the different alternatives to run Morph-KGC.\n"],"metadata":{"id":"jQIDSTs1PKhg"}},{"cell_type":"markdown","source":["## **Load Knowledge Graph to [RDFLib](https://rdflib.readthedocs.io)**\n","\n","**[RDFLib](https://rdflib.readthedocs.io)** is the reference library to work with RDF in Python. Morph-KGC can be used as a **library** to create a knowledge graph and load it to RDFLib. In this example we will use the **[GTFS-Madrid-Bench](https://github.com/oeg-upm/gtfs-bench)** with **CSV** data. Morph-KGC allows to access mappings and data **remotely**, so we will use this functionality to avoid downloading the data and the mappings ourselves. The RML mappings are available [here](https://github.com/oeg-upm/morph-kgc/blob/main/examples/tutorial/mapping.gtfs.ttl) and the data is available [here](https://github.com/oeg-upm/morph-kgc/tree/main/examples/csv/data)."],"metadata":{"id":"bXYRQ_W2PdjU"}},{"cell_type":"markdown","source":["First of all, we need to **install** [Morph-KGC](https://pypi.org/project/morph-kgc) (this will also install [RDFLib](https://pypi.org/project/rdflib/) and [Oxigraph](https://pypi.org/project/pyoxigraph/))."],"metadata":{"id":"HlzLovomSjH9"}},{"cell_type":"code","source":["!pip install morph-kgc"],"metadata":{"id":"xM4SFlPkPmDZ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Now we just need to **import** Morph-KGC and we are ready to go!"],"metadata":{"id":"eEaaoTnkTJBu"}},{"cell_type":"code","execution_count":null,"metadata":{"id":"u4UZHlZUOxBL"},"outputs":[],"source":["import morph_kgc"]},{"cell_type":"markdown","source":["To run Morph-KGC it is neccesary to provide some information. This is done via a config **INI** file. When running Morph-KGC as a **library**, this configuration can be provided as an **string** or as a **file path**. Below there is a basic config file for our example provided as a string. The _config_ indicates the path to a mapping file."],"metadata":{"id":"90laT8Tvz4Gj"}},{"cell_type":"code","source":["config = \"\"\"\n"," [GTFS-Madrid-Bench]\n"," mappings: https://raw.githubusercontent.com/oeg-upm/morph-kgc/main/examples/tutorial/mapping.gtfs.ttl\n"," \"\"\""],"metadata":{"id":"fL5hqV6IWa0b"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We just need to call `materialize` passing our _config_ and Morph-KGC will create the knowledge graph and load it to RDFLib."],"metadata":{"id":"loiJCxBF01cp"}},{"cell_type":"code","source":["g = morph_kgc.materialize(config)"],"metadata":{"id":"RSFNbylgTVKy"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["**That is it!** Now we can work with our RDFLib graph: query, navigate or save the graph and more. For instance, below we query the knowledge graph with [query 3](https://github.com/oeg-upm/gtfs-bench/blob/master/queries/q3.rq) of the GTFS-Madrid-Bench."],"metadata":{"id":"Q7SDl-w3eZ9T"}},{"cell_type":"code","source":["q3 = \"\"\"\n"," PREFIX gtfs: \n"," PREFIX geo: \n"," PREFIX dct: \n","\n"," SELECT * WHERE {\n"," ?stop a gtfs:Stop . \n"," ?stop gtfs:locationType ?location .\n"," OPTIONAL { ?stop dct:description ?stopDescription . }\n"," OPTIONAL { \n"," ?stop geo:lat ?stopLat . \n"," ?stop geo:long ?stopLong .\n"," }\n"," OPTIONAL {?stop gtfs:wheelchairAccessible ?wheelchairAccessible . }\n"," FILTER (?location=)\n"," }\n"," \"\"\"\n","\n","q3_res = g.query(q3)\n","\n","for row in q3_res:\n"," print(row['stop'], row['stopLat'], row['stopLong'])"],"metadata":{"id":"1T-cECqEcBbS"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We could also have run Morph-KGC with the config from a file. Below we create the _config_ file writing it to disk. "],"metadata":{"id":"LrmEdLWC1q48"}},{"cell_type":"code","source":["# create the config file\n","!echo \"[GTFS-Madrid-Bench]\" > config.ini\n","!echo \"mappings: https://raw.githubusercontent.com/oeg-upm/morph-kgc/main/examples/tutorial/mapping.gtfs.ttl\" >> config.ini\n","\n","# show the config file\n","!cat config.ini"],"metadata":{"id":"t2L960lb127p"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We create our knowledge graph again, this time passing the file path to `materialize`."],"metadata":{"id":"m9X45Efu1JDr"}},{"cell_type":"code","source":["g = morph_kgc.materialize('config.ini')"],"metadata":{"id":"HPl_fvGD2BCJ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Usually the default configuration is enough for most use cases. However, in some cases we may need to tune Morph-KGC. For this we can use a `CONFIGURATION` section in the _config_ file. For instance, you can specify which values should be interpreted as NULL (e.g., _#N/A_). You can find the full list of configuration options in the **[documentation](https://morph-kgc.readthedocs.io/en/latest/documentation/#engine-configuration)**. Below you can see an example of a more detailed _config_ file."],"metadata":{"id":"vf4o9ih-2W49"}},{"cell_type":"code","source":["config = \"\"\"\n"," [CONFIGURATION]\n"," na_values: #N/A,,N/A\n"," logging_level: DEBUG\n","\n"," [GTFS-Madrid-Bench]\n"," mappings: https://raw.githubusercontent.com/oeg-upm/morph-kgc/main/examples/tutorial/mapping.gtfs.ttl\n"," \"\"\""],"metadata":{"id":"bkiaLXzN3SyV"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## **Load Knowledge Graph to [Oxigraph](https://oxigraph.org/pyoxigraph/)**\n","\n","While RDFLib provides much functionality, it does not support **[RDF-star](https://w3c.github.io/rdf-star/cg-spec/2021-12-17.html)** yet. Morph-KGC can create RDF-star knowledge graphs using **[RML-star](https://kg-construct.github.io/rml-star-spec/)** mappings and load them to **[Oxigraph](https://oxigraph.org/pyoxigraph/)**.\n","\n","The following example creates an RDF-star knowledge graph of scientific software metadata (the Morph-KGC software in this example), extracted with [SoMEF](https://github.com/KnowledgeCaptureAndDiscovery/somef). SoMEF extract some characteristics of the software which are annotated with the technique that was used to extract them and also with a confidence value. The **JSON** data is available [here](https://github.com/oeg-upm/morph-kgc/blob/main/examples/tutorial/oeg-upm_morph-kgc.json) and the RML-star mappings are available [here](https://github.com/oeg-upm/morph-kgc/blob/main/examples/tutorial/mapping.somef.ttl)."],"metadata":{"id":"5PTWCzIBjqhu"}},{"cell_type":"markdown","source":["As with RDFLib, we just need to create the _config_ and call `materialize_oxigraph`."],"metadata":{"id":"aO4hGPKK28U5"}},{"cell_type":"code","source":["import morph_kgc\n","\n","config = \"\"\"\n"," [SoMEF]\n"," mappings: https://raw.githubusercontent.com/oeg-upm/morph-kgc/main/examples/tutorial/mapping.somef.ttl\n"," \"\"\"\n","\n","g = morph_kgc.materialize_oxigraph(config)"],"metadata":{"id":"2dJNM_OwpXel"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["We loaded our knowledge graph to an Oxigraph store, we can now query it with **[SPARQL-star](https://w3c.github.io/rdf-star/cg-spec/editors_draft.html#sparql-star)**. The query below retrieves the license, the technique used to obtain the information and the confidence value."],"metadata":{"id":"D5j8W0vs3NZS"}},{"cell_type":"code","source":["q = \"\"\"\n"," PREFIX sd: \n"," PREFIX em: \n","\n"," SELECT * WHERE {\n"," ?sowtware a sd:Software .\n"," << ?software sd:license ?license >> em:confidence ?confidence .\n"," << ?software sd:license ?license >> em:technique ?technique .\n"," }\n"," \"\"\"\n","\n","q_res = g.query(q)\n","\n","for solution in q_res:\n"," print(solution['software'], solution ['license'], solution ['technique'], solution['confidence'])"],"metadata":{"id":"2TdYTy6kp2wN"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["## **Create Knowledge Graph via Command Line**"],"metadata":{"id":"vZJpXRcFuDkG"}},{"cell_type":"markdown","source":["Morph-KGC can also be executed from the **command line**. This is the most recommended option if you work with **large volumes of data**. As before, we need to create a config file. In this example we use again the data from the GTFS-Madrid-Bench."],"metadata":{"id":"Gpw6nbNC4msF"}},{"cell_type":"code","source":["# create the config file\n","!echo \"[GTFS-Madrid-Bench]\" > config.ini\n","!echo \"mappings: https://raw.githubusercontent.com/oeg-upm/morph-kgc/main/examples/tutorial/mapping.gtfs.ttl\" >> config.ini\n","\n","# show the config file\n","!cat config.ini"],"metadata":{"id":"SxksqKc3uKAN"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["The following command will create the knowledge graph and write it to a _knowledge-graph.nt_ file. You just need to provide the path to the _config_ file."],"metadata":{"id":"B7RoKRRF45dQ"}},{"cell_type":"code","source":["!python3 -m morph_kgc config.ini"],"metadata":{"id":"NGfkMb-VucxJ"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["Let's take a look to a subset of the generated RDF!"],"metadata":{"id":"7UHqM8W98VLp"}},{"cell_type":"code","source":["!head knowledge-graph.nt"],"metadata":{"id":"HJMLqWFp5apX"},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":["With the generated RDF we could for instance load it to RDFLib (or any triplestore) and pose queries."],"metadata":{"id":"UYChi6168hws"}},{"cell_type":"code","source":["import rdflib\n","\n","g = rdflib.Graph()\n","g.parse('knowledge-graph.nt')\n","\n","q3 = \"\"\"\n"," PREFIX gtfs: \n"," PREFIX geo: \n"," PREFIX dct: \n","\n"," SELECT * WHERE {\n"," ?stop a gtfs:Stop . \n"," ?stop gtfs:locationType ?location .\n"," OPTIONAL { ?stop dct:description ?stopDescription . }\n"," OPTIONAL { \n"," ?stop geo:lat ?stopLat . \n"," ?stop geo:long ?stopLong .\n"," }\n"," OPTIONAL {?stop gtfs:wheelchairAccessible ?wheelchairAccessible . }\n"," FILTER (?location=)\n"," }\n"," \"\"\"\n","\n","q3_res = g.query(q3)\n","\n","for row in q3_res:\n"," print(row['stop'], row['stopLat'], row['stopLong'])"],"metadata":{"id":"9tH1KM1Quv0_"},"execution_count":null,"outputs":[]}]}