{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "# Clean-Clean ER\n", "\n", "In this notebook we present the pyJedAI approach on the well-known Abt-Buy dataset. Clean-Clean ER is the task of link discovery/deduplication between two sets of entities.\n" ] }, { "cell_type": "markdown", "id": "9c49d2b7-11b5-40b3-9341-de98608dde13", "metadata": {}, "source": [ "Dataset: __Abt-Buy dataset__ (D1)\n", "\n", "The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1076 entities from abt.com and 1076 entities from buy.com as well as a gold standard (perfect mapping) with 1076 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description and product price." ] }, { "cell_type": "markdown", "id": "744b3017-9a5c-4d3c-8e0a-fe39b069b647", "metadata": {}, "source": [ "## How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more: [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": null, "id": "029a5825-799d-4c3f-a6cd-a75e257cadcc", "metadata": {}, "outputs": [], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 2, "id": "462695ec-3af1-4048-9971-9ed0bce0f07b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: pyjedai\n", "Version: 0.1.0\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis , Jakub Maciejewski , Manolis Koubarakis \n", "License: Apache Software License 2.0\n", "Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages\n", "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, 
pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n", "Required-by: \n" ] } ], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "7b4c62c5-6581-4d2e-9d44-c7c02f43d441", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "6db50d83-51d8-4c95-9f27-30ef867338f2", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph" ] }, { "cell_type": "code", "execution_count": 2, "id": "4d4e6a90-9fd8-4f7a-bf4f-a5b994e0adfb", "metadata": {}, "outputs": [], "source": [ "import pyjedai\n", "from pyjedai.utils import (\n", " text_cleaning_method,\n", " print_clusters,\n", " print_blocks,\n", " print_candidate_pairs\n", ")\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "markdown", "id": "451bf970-4425-487b-8756-776abb9536ea", "metadata": {}, "source": [ "# Workflow Architecture\n", "\n", "![workflow-example.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/docs/img/workflow-example.png?raw=true)" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": {}, "source": [ "# Data Reading\n", "\n", "To run, pyJedAI only needs the initial data to be transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Abt-Buy dataset is provided as .csv files. 
\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "e6aabec4-ef4f-4267-8c1e-377054e669d2", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "code", "execution_count": 4, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "d1 = pd.read_csv(\"./../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False)\n", "d2 = pd.read_csv(\"./../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False)\n", "gt = pd.read_csv(\"./../data/ccer/D2/gt.csv\", sep='|', engine='python')\n", "\n", "data = Data(dataset_1=d1,\n", " id_column_name_1='id',\n", " dataset_2=d2,\n", " id_column_name_2='id',\n", " ground_truth=gt)" ] }, { "cell_type": "markdown", "id": "5d8a8a78-858e-4c79-90fe-197a68e95e11", "metadata": {}, "source": [ "pyJedAI offers also dataset analysis methods (more will be developed)" ] }, { "cell_type": "code", "execution_count": 5, "id": "7cb87af2-adda-49e0-82cc-b1a5f7a595ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Data Report\n", "***************************************************************************************************************************\n", "Type of Entity Resolution: Clean-Clean\n", "Dataset 1 (D1):\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tMemory usage [KB]: 563.56\n", "\tAttributes:\n", "\t\t name\n", "\t\t description\n", "\t\t price\n", "Dataset 2 (D2):\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tMemory usage [KB]: 336.63\n", "\tAttributes:\n", "\t\t name\n", "\t\t description\n", "\t\t price\n", "\n", "Total number of entities: 2152\n", "Number of matching pairs in ground-truth: 1076\n", 
"───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "data.print_specs()" ] }, { "cell_type": "code", "execution_count": 6, "id": "b822d7c0-19a2-4050-9554-c35a208bb848", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>id</th><th>name</th><th>description</th><th>price</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>Sony Turntable - PSLX350H</td><td>Sony Turntable - PSLX350H/ Belt Drive System/ ...</td><td></td></tr>\n",
"<tr><th>1</th><td>1</td><td>Bose Acoustimass 5 Series III Speaker System -...</td><td>Bose Acoustimass 5 Series III Speaker System -...</td><td>399</td></tr>\n",
"<tr><th>2</th><td>2</td><td>Sony Switcher - SBV40S</td><td>Sony Switcher - SBV40S/ Eliminates Disconnecti...</td><td>49</td></tr>\n",
"<tr><th>3</th><td>3</td><td>Sony 5 Disc CD Player - CDPCE375</td><td>Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change...</td><td></td></tr>\n",
"<tr><th>4</th><td>4</td><td>Bose 27028 161 Bookshelf Pair Speakers In Whit...</td><td>Bose 161 Bookshelf Speakers In White - 161WH/ ...</td><td>158</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " id name \\\n", "0 0 Sony Turntable - PSLX350H \n", "1 1 Bose Acoustimass 5 Series III Speaker System -... \n", "2 2 Sony Switcher - SBV40S \n", "3 3 Sony 5 Disc CD Player - CDPCE375 \n", "4 4 Bose 27028 161 Bookshelf Pair Speakers In Whit... \n", "\n", " description price \n", "0 Sony Turntable - PSLX350H/ Belt Drive System/ ... \n", "1 Bose Acoustimass 5 Series III Speaker System -... 399 \n", "2 Sony Switcher - SBV40S/ Eliminates Disconnecti... 49 \n", "3 Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change... \n", "4 Bose 161 Bookshelf Speakers In White - 161WH/ ... 158 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_1.head(5)" ] }, { "cell_type": "code", "execution_count": 7, "id": "5c26b595-5e02-4bfc-8e79-e476ab2830ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>id</th><th>name</th><th>description</th><th>price</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>Linksys EtherFast EZXS88W Ethernet Switch - EZ...</td><td>Linksys EtherFast 8-Port 10/100 Switch (New/Wo...</td><td></td></tr>\n",
"<tr><th>1</th><td>1</td><td>Linksys EtherFast EZXS55W Ethernet Switch</td><td>5 x 10/100Base-TX LAN</td><td></td></tr>\n",
"<tr><th>2</th><td>2</td><td>Netgear ProSafe FS105 Ethernet Switch - FS105NA</td><td>NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw...</td><td></td></tr>\n",
"<tr><th>3</th><td>3</td><td>Belkin Pro Series High Integrity VGA/SVGA Moni...</td><td>1 x HD-15 - 1 x HD-15 - 10ft - Beige</td><td></td></tr>\n",
"<tr><th>4</th><td>4</td><td>Netgear ProSafe JFS516 Ethernet Switch</td><td>Netgear ProSafe 16 Port 10/100 Rackmount Switc...</td><td></td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " id name \\\n", "0 0 Linksys EtherFast EZXS88W Ethernet Switch - EZ... \n", "1 1 Linksys EtherFast EZXS55W Ethernet Switch \n", "2 2 Netgear ProSafe FS105 Ethernet Switch - FS105NA \n", "3 3 Belkin Pro Series High Integrity VGA/SVGA Moni... \n", "4 4 Netgear ProSafe JFS516 Ethernet Switch \n", "\n", " description price \n", "0 Linksys EtherFast 8-Port 10/100 Switch (New/Wo... \n", "1 5 x 10/100Base-TX LAN \n", "2 NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw... \n", "3 1 x HD-15 - 1 x HD-15 - 10ft - Beige \n", "4 Netgear ProSafe 16 Port 10/100 Rackmount Switc... " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_2.head(5)" ] }, { "cell_type": "code", "execution_count": 8, "id": "b3c9827e-a08a-47b2-a7f2-6f3f72184a17", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>D1</th><th>D2</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>206</td><td>216</td></tr>\n",
"<tr><th>1</th><td>60</td><td>46</td></tr>\n",
"<tr><th>2</th><td>182</td><td>160</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " D1 D2\n", "0 206 216\n", "1 60 46\n", "2 182 160" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ground_truth.head(3)" ] }, { "cell_type": "markdown", "id": "19891fc5-960e-4df1-a72a-e4533a74a761", "metadata": {}, "source": [ "### Data cleaning step (optional)\n", "\n", "pyJedAI offers 4 types of text cleaning/processing. \n", "\n", "- Stopwords removal\n", "- Punctuation removal\n", "- Numbers removal\n", "- Unicodes removal" ] }, { "cell_type": "code", "execution_count": 9, "id": "e471e48c-c882-4c74-b94f-4c1dff9fa36c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] /home/konstantinos/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "data.clean_dataset(remove_stopwords = False, \n", " remove_punctuation = False, \n", " remove_numbers = False,\n", " remove_unicodes = False)" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "## Block Building\n", "\n", "It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly using a transformation, based on its equality or on its similarity with other keys.\n", "\n", "The following methods are currently supported:\n", "\n", "- Standard/Token Blocking\n", "- Sorted Neighborhood\n", "- Extended Sorted Neighborhood\n", "- Q-Grams Blocking\n", "- Extended Q-Grams Blocking\n", "- Suffix Arrays Blocking\n", "- Extended Suffix Arrays Blocking" ] }, { "cell_type": "code", "execution_count": 10, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/conda/miniconda3/envs/pyjedai/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. 
Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from pyjedai.block_building import (\n", " StandardBlocking,\n", " QGramsBlocking,\n", " ExtendedQGramsBlocking,\n", " SuffixArraysBlocking,\n", " ExtendedSuffixArraysBlocking,\n", ")" ] }, { "cell_type": "code", "execution_count": 11, "id": "9741f0c4-6250-455f-9c88-b8dc61ab7d4d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Standard Blocking: 100%|██████████| 2152/2152 [00:00<00:00, 18652.37it/s]\n" ] } ], "source": [ "bb = StandardBlocking()\n", "blocks = bb.build_blocks(data, attributes_1=['name'], attributes_2=['name'])" ] }, { "cell_type": "code", "execution_count": 12, "id": "d2d9ae46-28fa-4438-87b7-ba901c75bd99", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Standard Blocking\n", "Method info: Creates one block for every token in the attribute values of at least two entities.\n", "Parameters: Parameter-Free method\n", "Attributes from D1:\n", "\tname\n", "Attributes from D2:\n", "\tname\n", "Runtime: 0.1172 seconds\n" ] } ], "source": [ "bb.report()" ] }, { "cell_type": "code", "execution_count": 13, "id": "b0ac846d-0f13-4b90-b4c8-688054ed7ffe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Standard Blocking\n", "***************************************************************************************************************************\n", "Method name: Standard Blocking\n", "Parameters: \n", "Runtime: 0.1172 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 0.45% \n", "\tRecall: 99.54%\n", 
"\tF1-score: 0.90%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Classification report:\n", "\tTrue positives: 1071\n", "\tFalse positives: 236447\n", "\tTrue negatives: 1156695\n", "\tFalse negatives: 5\n", "\tTotal comparisons: 237518\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bb.evaluate(blocks, with_classification_report=True)" ] }, { "cell_type": "markdown", "id": "eeb516cf-43cd-4f02-88f8-6dfef7c5e20e", "metadata": {}, "source": [ "## Block Purging\n", "\n", "__Optional step__\n", "\n", "Discards the blocks exceeding a certain number of comparisons. \n" ] }, { "cell_type": "code", "execution_count": 14, "id": "725426e2-0af8-4295-baff-92653c841fdd", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_cleaning import BlockPurging" ] }, { "cell_type": "code", "execution_count": 15, "id": "7997b2b6-9629-44f0-a66d-5bc4fea28fb6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Block Purging: 100%|██████████| 2934/2934 [00:00<00:00, 373522.98it/s]\n" ] } ], "source": [ "bp = BlockPurging()\n", "cleaned_blocks = bp.process(blocks, data, tqdm_disable=False)" ] }, { "cell_type": "code", "execution_count": 16, "id": "d8842b00-8765-449f-bdb7-f9b2206e91c7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Block Purging\n", "Method info: Discards the blocks exceeding a certain number of comparisons.\n", "Parameters: \n", "\tSmoothing factor: 1.025\n", "\tMax Comparisons per Block: 3224.0\n", "Runtime: 0.0116 seconds\n" ] } ], "source": [ "bp.report()" ] }, { "cell_type": "code", "execution_count": 17, "id": "bfbef308-2ae0-4a2b-aec4-1bac09e426a1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"***************************************************************************************************************************\n", " Method: Block Purging\n", "***************************************************************************************************************************\n", "Method name: Block Purging\n", "Parameters: \n", "\tSmoothing factor: 1.025\n", "\tMax Comparisons per Block: 3224.0\n", "Runtime: 0.0116 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 1.12% \n", "\tRecall: 98.61%\n", "\tF1-score: 2.21%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bp.evaluate(cleaned_blocks)" ] }, { "cell_type": "markdown", "id": "9f9e77d5-c906-431a-bdc7-68dc9c00cc31", "metadata": { "tags": [] }, "source": [ "## Block Cleaning\n", "\n", "___Optional step___\n", "\n", "Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities." 
] }, { "cell_type": "code", "execution_count": 18, "id": "9c2c0e42-485a-444e-9161-975f30d21a02", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_cleaning import BlockFiltering" ] }, { "cell_type": "code", "execution_count": 19, "id": "bf5c20ac-b16a-484d-82b0-61ecb9e7f3ea", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Block Filtering: 100%|██████████| 3/3 [00:00<00:00, 112.49it/s]\n" ] } ], "source": [ "bf = BlockFiltering(ratio=0.8)\n", "filtered_blocks = bf.process(cleaned_blocks, data, tqdm_disable=False)" ] }, { "cell_type": "code", "execution_count": 20, "id": "25fd0be0-91c3-4d0b-b596-c66dccba3c79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Block Filtering\n", "***************************************************************************************************************************\n", "Method name: Block Filtering\n", "Parameters: \n", "\tRatio: 0.8\n", "Runtime: 0.0297 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 2.56% \n", "\tRecall: 96.10%\n", "\tF1-score: 4.99%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] }, { "data": { "text/plain": [ "{'Precision %': 2.562450436161776,\n", " 'Recall %': 96.09665427509294,\n", " 'F1 %': 4.991792990248141,\n", " 'True Positives': 1034,\n", " 'False Positives': 39318,\n", " 'True Negatives': 1156658,\n", " 'False Negatives': 42}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bf.evaluate(filtered_blocks)" ] }, { "cell_type": "markdown", "id": "9cd12048-bd0c-4571-ba70-488d46afcdd6", "metadata": {}, "source": [ "## Comparison Cleaning 
- Meta Blocking\n", "\n", "___Optional step___\n", "\n", "Similar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.\n", "\n", "The following methods are currently supported:\n", "\n", "- Comparison Propagation\n", "- Cardinality Edge Pruning (CEP)\n", "- Cardinality Node Pruning (CNP)\n", "- Weighted Edge Pruning (WEP)\n", "- Weighted Node Pruning (WNP)\n", "- Reciprocal Cardinality Node Pruning (ReCNP)\n", "- Reciprocal Weighted Node Pruning (ReWNP)\n", "- BLAST\n", "\n", "Most of these methods are Meta-blocking techniques. All methods are optional but competitive, in the sense that only one of them can be part of an ER workflow. For more details on the functionality of these methods, see here. They can be combined with one of the following weighting schemes:\n", "\n", "- Aggregate Reciprocal Comparisons Scheme (ARCS)\n", "- Common Blocks Scheme (CBS)\n", "- Enhanced Common Blocks Scheme (ECBS)\n", "- Jaccard Scheme (JS)\n", "- Enhanced Jaccard Scheme (EJS)" ] }, { "cell_type": "code", "execution_count": 21, "id": "1f7d75f3-6bed-482d-a572-c3b4927236a5", "metadata": {}, "outputs": [], "source": [ "from pyjedai.comparison_cleaning import (\n", " WeightedEdgePruning,\n", " WeightedNodePruning,\n", " CardinalityEdgePruning,\n", " CardinalityNodePruning,\n", " BLAST,\n", " ReciprocalCardinalityNodePruning,\n", " ReciprocalWeightedNodePruning,\n", " ComparisonPropagation\n", ")" ] }, { "cell_type": "code", "execution_count": 22, "id": "c92e0ca3-5591-4620-b3f4-012a23637416", "metadata": {}, "outputs": [], "source": [ "mb = WeightedEdgePruning(weighting_scheme='EJS')\n", "candidate_pairs_blocks = mb.process(filtered_blocks, data, tqdm_disable=True)" ] }, { "cell_type": "code", "execution_count": 23, "id": "f469e387-e135-4945-b97f-da14d391c6b1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"***************************************************************************************************************************\n", " Method: Weighted Edge Pruning\n", "***************************************************************************************************************************\n", "Method name: Weighted Edge Pruning\n", "Parameters: \n", "\tNode centric: False\n", "\tWeighting scheme: EJS\n", "Runtime: 0.1928 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 10.86% \n", "\tRecall: 91.45%\n", "\tF1-score: 19.41%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = mb.evaluate(candidate_pairs_blocks)" ] }, { "cell_type": "markdown", "id": "5ea46276-a97c-4767-a3ef-0fe705186cfa", "metadata": {}, "source": [ "### Want to export pairs in this step?\n", "\n", "Every step provides a method named `export_to_df` that exports all pairs in dataframe. If you wish to export them in a file use `.to_csv` from pandas." ] }, { "cell_type": "code", "execution_count": 24, "id": "0e842336-5e82-4221-ad7f-aac80471f04b", "metadata": {}, "outputs": [], "source": [ "pairs_df=mb.export_to_df(candidate_pairs_blocks)" ] }, { "cell_type": "code", "execution_count": 25, "id": "1143b23a-60db-4de7-b0bd-dd86ae3e4f34", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead><tr><th></th><th>id1</th><th>id2</th></tr></thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>205</td></tr>\n",
"<tr><th>1</th><td>0</td><td>193</td></tr>\n",
"<tr><th>2</th><td>0</td><td>53</td></tr>\n",
"<tr><th>3</th><td>0</td><td>55</td></tr>\n",
"<tr><th>4</th><td>0</td><td>697</td></tr>\n",
"</tbody>\n",
"</table>
" ], "text/plain": [ " id1 id2\n", "0 0 205\n", "1 0 193\n", "2 0 53\n", "3 0 55\n", "4 0 697" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairs_df.head(5)" ] }, { "cell_type": "markdown", "id": "6aeff39a-b51b-4166-a55b-f8452ec258a7", "metadata": {}, "source": [ "## Entity Matching\n", "\n", "It compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities." ] }, { "cell_type": "code", "execution_count": 26, "id": "f479d967-8bac-4870-99bd-68c01e75747b", "metadata": {}, "outputs": [], "source": [ "from pyjedai.matching import EntityMatching" ] }, { "cell_type": "code", "execution_count": 27, "id": "ae7b1e6a-e937-44fe-bfe5-34696ea1156c", "metadata": {}, "outputs": [], "source": [ "em = EntityMatching(\n", " metric='cosine',\n", " tokenizer='char_tokenizer',\n", " vectorizer='tfidf',\n", " qgram=3,\n", " similarity_threshold=0.0\n", ")\n", "\n", "pairs_graph = em.predict(candidate_pairs_blocks, data, tqdm_disable=True)" ] }, { "cell_type": "code", "execution_count": 28, "id": "4d606bfc-3265-4042-93f3-22a1117c4886", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(pairs_graph)" ] }, { "cell_type": "code", "execution_count": 29, "id": "4a2a5f4a-6ffa-4c16-ae49-ff4fec4c467d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Entity Matching\n", "***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tMetric: cosine\n", "\tAttributes: None\n", "\tSimilarity threshold: 0.0\n", "\tTokenizer: char_tokenizer\n", "\tVectorizer: tfidf\n", "\tQgrams: 3\n", "Runtime: 0.4898 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 10.86% \n", "\tRecall: 91.45%\n", "\tF1-score: 19.41%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = em.evaluate(pairs_graph)" ] }, { "cell_type": "markdown", "id": "07ecd3d3-aa47-447c-af4d-cdd4744ca7c1", "metadata": {}, "source": [ "### How to set a valid similarity threshold?\n", "\n", "Configure similariy threshold with a Grid-Search or with an Optuna search. Also pyJedAI provides some visualizations on the distributions of the scores.\n", "\n", "For example with a classic histogram:\n" ] }, { "cell_type": "code", "execution_count": 30, "id": "c04c0482-a0d2-4ebf-b9c8-f6cff1863e5d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_all_weights()" ] }, { "cell_type": "markdown", "id": "800c9f29-1260-40fb-bdf0-34cc9ada4aa3", "metadata": {}, "source": [ "Or with a range 0.1 from 0.0 to 1.0 grouping:" ] }, { "cell_type": "code", "execution_count": 31, "id": "3e624fb5-cb48-4081-b90f-0e59adf88d26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distribution-% of predicted scores: [13.551092474067536, 28.8126241447804, 25.5131317589936, 17.325093798278527, 9.00463473846833, 3.8402118737585518, 1.4566320900463474, 0.4634738468329287, 0.03310527477378062, 0.0]\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_scores()" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "## Entity Clustering\n", "\n", "It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object." ] }, { "cell_type": "code", "execution_count": 32, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering" ] }, { "cell_type": "code", "execution_count": 33, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ccc = UniqueMappingClustering()\n", "clusters = ccc.process(pairs_graph, data, similarity_threshold=0.17)" ] }, { "cell_type": "code", "execution_count": 34, "id": "5b52a534-691a-48be-b5e9-c073dc04b154", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Unique Mapping Clustering\n", "Method info: Prunes all edges with a weight lower than t, sorts the remaining ones indecreasing weight/similarity and iteratively forms a partition forthe top-weighted pair as long as none of its entities has alreadybeen matched to some other.\n", "Parameters: \n", "\tSimilarity Threshold: 0.17\n", "\n", "Runtime: 0.0320 seconds\n" ] } ], "source": [ "ccc.report()" ] }, { "cell_type": "code", "execution_count": 35, "id": "00bc2e82-9bc1-4119-b8cb-4a1c18afee19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Unique Mapping Clustering\n", "***************************************************************************************************************************\n", "Method name: 
Unique Mapping Clustering\n", "Parameters: \n", "\tSimilarity Threshold: 0.17\n", "Runtime: 0.0320 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 92.69% \n", "\tRecall: 86.06%\n", "\tF1-score: 89.25%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ccc.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "315369d8-6564-44d4-aea0-14034b54cf16", "metadata": {}, "source": [ "
\n", "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.18" }, "vscode": { "interpreter": { "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094" } } }, "nbformat": 4, "nbformat_minor": 5 }