{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Schema Matching with Valentine (pyJedAI intregrated)\n", "\n", "In this notebook we present the pyJedAI schema matching functionality. In general __Schema Matching__ looks for semantic correspondences between structures or models, such as database schemas, XML message formats, and ontologies, identifying different attribute names that describe the same feature (e.g., “profession” and “job” are semantically equivalent)\n", "\n", "## TU Delft tool for Schema Matching \n", "\n", "Website: https://delftdata.github.io/valentine\n", "\n", "> Valentine is an extensible open-source product to execute and organize large-scale automated matching processes on tabular data either for experimentation or deployment in real world data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. To enable proper evaluation, Valentine offers a fabricator for creating evaluation dataset pairs that respect specific semantics.\n", "\n", "## pyJedAI functionalities with Valentine\n", "- __Similarity Flooding__ constructs a graph representation of the schemas or ontologies to be matched, casting schema matching as a graph matching task. The nodes of the graph represent schema elements, such as attributes, or types, while the edges capture the relationships between these elements. The algorithm starts by initializing similarity scores based on initial mappings or heuristics. It then iteratively propagates similarities through the graph by 'flooding' the network with these scores: each node's similarity score is updated based on the scores of its neighbors, thus capturing the notion that similar nodes are often connected to other similar nodes. This process continues until the similarity scores converge or a specified number of iterations is reached. \n", "- Cupid is a comprehensive approach that involves the four steps: __[Coma matcher it is required to have java (jre) installed]__\n", " - preprocessing, which normalizes schema elements by expanding abbreviations and removing special characters to ensure that linguistic comparisons are effective, \n", " - linguistic matching, which uses a thesaurus or lexical database, like WordNet, to find semantic relationships (synonyms, hypernyms, etc.) between names and descriptions of schema elements, assigning linguistic similarity scores. \n", " - structural matching, which infers the context and relatedness of elements from the hierarchical structure of the schema (e.g., parent-child relationships), \n", " - similarity aggregation, which combines linguistic and structural similarities using a weighted average that can be adjusted to prioritize one aspect over the other depending on the context. \n", "- __COMA__ constitutes a flexible approach for combining several complementary techniques, such as linguistic matching based on names and descriptions, structural matching based on schema hierarchies, and constraint-based matching considering keys and cardinalities. These matchers can be executed in parallel or sequentially, and their results can be aggregated using different strategies, such as weighted averages or maximum selection. The end result is a consolidated similarity matrix that reflects the combined evidence from all applied matchers. \n", "- __EmbDI__ uses a compact graph-based representation that captures the relationships between the elements of the given schema, facilitating the extraction of sentences that effectively describe the similarity between elements. The resulting sentences are used to learn embeddings that are crafted for the data at hand, using optimizations to improve their quality. \n", "- __SemProp__ is composite approach that combines syntactic with semantic similarity. First, it applies the Semantic Matcher, which leverages the semantic similarity between schema elements, as it is determined by pre-trained word embeddings. This similarity is used to yield coherent groups, i.e., clusters of words for which the average cosine similarity of their embedding vectors is higher than a specific threshold. Pairs of attributes that cannot be matched in this way, are associated with syntactic similarities based on the textual similarity of their instances (i.e., attribute values) and attribute names. \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "pycharm": { "is_executing": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python 3.9.16\n" ] } ], "source": [ "!python --version" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Libraries injection" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "pycharm": { "is_executing": true } }, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "\n", "from pyjedai.utils import (\n", " text_cleaning_method,\n", " print_clusters,\n", " print_blocks,\n", " print_candidate_pairs\n", ")\n", "from pyjedai.evaluation import Evaluation\n", "from pyjedai.datamodel import Data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data reading and pyJedAI-formating" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "d1 = pd.read_csv(r\"C:\\Users\\nikol\\Desktop\\GitHub\\pyJedAI-Dev\\data\\ccer\\schema_matching\\authors\\authors1.csv\")\n", "d2 = pd.read_csv(r\"C:\\Users\\nikol\\Desktop\\GitHub\\pyJedAI-Dev\\data\\ccer\\schema_matching\\authors\\authors2.csv\")\n", "gt = pd.read_csv(r\"C:\\Users\\nikol\\Desktop\\GitHub\\pyJedAI-Dev\\data\\ccer\\schema_matching\\authors\\pairs.csv\")\n", "\n", "\n", "data = Data(\n", " dataset_1=d1,\n", " attributes_1=['EID','Authors','Cited by','Title','Year','Source tittle','DOI'],\n", " id_column_name_1='EID',\n", " dataset_2=d2,\n", " attributes_2=['EID','Authors','Cited by','Country','Document Type','City','Access Type','aggregationType'],\n", " id_column_name_2='EID',\n", " ground_truth=gt,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Call pyJedAI methods using Valentine" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from pyjedai.schema.matching import ValentineMethodBuilder, ValentineSchemaMatching\n", "\n", "sm = ValentineSchemaMatching(ValentineMethodBuilder.cupid_matcher())\n", "sm.process(data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Get matches" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches = sm.get_matches()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Print matches" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm.print_matches()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sm.evaluate()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 4 }