{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "# Clean-Clean Entity Resolution Tutorial\n", "\n", "This notebook demonstrates the pyJedAI workflow on the well-known Abt-Buy dataset. Clean-Clean ER is the task of link discovery/deduplication between two sets of entities.\n" ] }, { "cell_type": "markdown", "id": "9c49d2b7-11b5-40b3-9341-de98608dde13", "metadata": {}, "source": [ "Dataset: __Abt-Buy dataset__ (D1)\n", "\n", "The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. It contains 1076 entities from Abt.com and 1076 entities from Buy.com, as well as a gold standard (perfect mapping) with 1076 matching record pairs between the two data sources. The attributes common to both data sources are product name, product description and product price." ] }, { "cell_type": "markdown", "id": "744b3017-9a5c-4d3c-8e0a-fe39b069b647", "metadata": {}, "source": [ "## How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more details, see [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": null, "id": "029a5825-799d-4c3f-a6cd-a75e257cadcc", "metadata": {}, "outputs": [], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 2, "id": "462695ec-3af1-4048-9971-9ed0bce0f07b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: pyjedai\n", "Version: 0.1.0\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis , Jakub Maciejewski , Manolis Koubarakis \n", "License: Apache Software License 2.0\n", "Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages\n", "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, 
optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n", "Required-by: \n" ] } ], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "7b4c62c5-6581-4d2e-9d44-c7c02f43d441", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 3, "id": "6db50d83-51d8-4c95-9f27-30ef867338f2", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph" ] }, { "cell_type": "code", "execution_count": 4, "id": "4d4e6a90-9fd8-4f7a-bf4f-a5b994e0adfb", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /home/jm/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import pyjedai\n", "from pyjedai.utils import (\n", " text_cleaning_method,\n", " print_clusters,\n", " print_blocks,\n", " print_candidate_pairs\n", ")\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "markdown", "id": "451bf970-4425-487b-8756-776abb9536ea", "metadata": {}, "source": [ "# Workflow Architecture\n", "\n", "![workflow-example.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/docs/img/workflow-example.png?raw=true)" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": {}, "source": [ "# Data Reading\n", "\n", "To run, pyJedAI only needs the initial data to be transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Abt-Buy dataset is provided as .csv files. 
\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "e6aabec4-ef4f-4267-8c1e-377054e669d2", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "code", "execution_count": 6, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "d1 = pd.read_csv(\"./../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False)\n", "d2 = pd.read_csv(\"./../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False)\n", "gt = pd.read_csv(\"./../data/ccer/D2/gt.csv\", sep='|', engine='python')\n", "\n", "data = Data(dataset_1=d1,\n", " id_column_name_1='id',\n", " dataset_2=d2,\n", " id_column_name_2='id',\n", " ground_truth=gt)" ] }, { "cell_type": "markdown", "id": "5d8a8a78-858e-4c79-90fe-197a68e95e11", "metadata": {}, "source": [ "pyJedAI also offers dataset analysis methods (more will be developed)." ] }, { "cell_type": "code", "execution_count": 7, "id": "7cb87af2-adda-49e0-82cc-b1a5f7a595ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------- Data -------------------------\n", "Type of Entity Resolution: Clean-Clean\n", "Dataset-1:\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tAttributes: \n", "\t\t ['name', 'description', 'price']\n", "Dataset-2:\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tAttributes: \n", "\t\t ['name', 'description', 'price']\n", "\n", "Total number of entities: 2152\n", "Number of matching pairs in ground-truth: 1076\n", "-------------------------------------------------------- \n", "\n" ] } ], "source": [ "data.print_specs()" ] }, { "cell_type": "code", "execution_count": 8, "id": "b822d7c0-19a2-4050-9554-c35a208bb848", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamedescriptionprice
00Sony Turntable - PSLX350HSony Turntable - PSLX350H/ Belt Drive System/ ...
11Bose Acoustimass 5 Series III Speaker System -...Bose Acoustimass 5 Series III Speaker System -...399
22Sony Switcher - SBV40SSony Switcher - SBV40S/ Eliminates Disconnecti...49
33Sony 5 Disc CD Player - CDPCE375Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change...
44Bose 27028 161 Bookshelf Pair Speakers In Whit...Bose 161 Bookshelf Speakers In White - 161WH/ ...158
\n", "
" ], "text/plain": [ " id name \\\n", "0 0 Sony Turntable - PSLX350H \n", "1 1 Bose Acoustimass 5 Series III Speaker System -... \n", "2 2 Sony Switcher - SBV40S \n", "3 3 Sony 5 Disc CD Player - CDPCE375 \n", "4 4 Bose 27028 161 Bookshelf Pair Speakers In Whit... \n", "\n", " description price \n", "0 Sony Turntable - PSLX350H/ Belt Drive System/ ... \n", "1 Bose Acoustimass 5 Series III Speaker System -... 399 \n", "2 Sony Switcher - SBV40S/ Eliminates Disconnecti... 49 \n", "3 Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change... \n", "4 Bose 161 Bookshelf Speakers In White - 161WH/ ... 158 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_1.head(5)" ] }, { "cell_type": "code", "execution_count": 9, "id": "5c26b595-5e02-4bfc-8e79-e476ab2830ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamedescriptionprice
00Linksys EtherFast EZXS88W Ethernet Switch - EZ...Linksys EtherFast 8-Port 10/100 Switch (New/Wo...
11Linksys EtherFast EZXS55W Ethernet Switch5 x 10/100Base-TX LAN
22Netgear ProSafe FS105 Ethernet Switch - FS105NANETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw...
33Belkin Pro Series High Integrity VGA/SVGA Moni...1 x HD-15 - 1 x HD-15 - 10ft - Beige
44Netgear ProSafe JFS516 Ethernet SwitchNetgear ProSafe 16 Port 10/100 Rackmount Switc...
\n", "
" ], "text/plain": [ " id name \\\n", "0 0 Linksys EtherFast EZXS88W Ethernet Switch - EZ... \n", "1 1 Linksys EtherFast EZXS55W Ethernet Switch \n", "2 2 Netgear ProSafe FS105 Ethernet Switch - FS105NA \n", "3 3 Belkin Pro Series High Integrity VGA/SVGA Moni... \n", "4 4 Netgear ProSafe JFS516 Ethernet Switch \n", "\n", " description price \n", "0 Linksys EtherFast 8-Port 10/100 Switch (New/Wo... \n", "1 5 x 10/100Base-TX LAN \n", "2 NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw... \n", "3 1 x HD-15 - 1 x HD-15 - 10ft - Beige \n", "4 Netgear ProSafe 16 Port 10/100 Rackmount Switc... " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_2.head(5)" ] }, { "cell_type": "code", "execution_count": 10, "id": "b3c9827e-a08a-47b2-a7f2-6f3f72184a17", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
D1D2
0206216
16046
2182160
\n", "
" ], "text/plain": [ " D1 D2\n", "0 206 216\n", "1 60 46\n", "2 182 160" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ground_truth.head(3)" ] }, { "cell_type": "markdown", "id": "19891fc5-960e-4df1-a72a-e4533a74a761", "metadata": {}, "source": [ "### Data cleaning step (optional)\n", "\n", "pyJedAI offers 4 types of text cleaning/processing. \n", "\n", "- Stopwords removal\n", "- Punctuation removal\n", "- Numbers removal\n", "- Unicodes removal" ] }, { "cell_type": "code", "execution_count": 11, "id": "e471e48c-c882-4c74-b94f-4c1dff9fa36c", "metadata": {}, "outputs": [], "source": [ "data.clean_dataset(remove_stopwords = False, \n", " remove_punctuation = False, \n", " remove_numbers = False,\n", " remove_unicodes = False)" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "## Block Building\n", "\n", "It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. 
Blocks are then extracted, possibly using a transformation, based on its equality or on its similarity with other keys.\n", "\n", "The following methods are currently supported:\n", "\n", "- Standard/Token Blocking\n", "- Sorted Neighborhood\n", "- Extended Sorted Neighborhood\n", "- Q-Grams Blocking\n", "- Extended Q-Grams Blocking\n", "- Suffix Arrays Blocking\n", "- Extended Suffix Arrays Blocking" ] }, { "cell_type": "code", "execution_count": 12, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_building import (\n", " StandardBlocking,\n", " QGramsBlocking,\n", " ExtendedQGramsBlocking,\n", " SuffixArraysBlocking,\n", " ExtendedSuffixArraysBlocking,\n", ")" ] }, { "cell_type": "code", "execution_count": 13, "id": "9741f0c4-6250-455f-9c88-b8dc61ab7d4d", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "bd8dfde5fb8e4dbd95cd43e14a2eb850", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Standard Blocking: 0%| | 0/2152 [00:00\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
id1id2
00205
10193
2053
3055
40697
\n", "" ], "text/plain": [ " id1 id2\n", "0 0 205\n", "1 0 193\n", "2 0 53\n", "3 0 55\n", "4 0 697" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pairs_df.head(5)" ] }, { "cell_type": "markdown", "id": "6aeff39a-b51b-4166-a55b-f8452ec258a7", "metadata": {}, "source": [ "## Entity Matching\n", "\n", "It compares pairs of entity profiles, associating every pair with a similarity in [0,1]. Its output comprises the similarity graph, i.e., an undirected, weighted graph where the nodes correspond to entities and the edges connect pairs of compared entities." ] }, { "cell_type": "code", "execution_count": 28, "id": "f479d967-8bac-4870-99bd-68c01e75747b", "metadata": {}, "outputs": [], "source": [ "from pyjedai.matching import EntityMatching" ] }, { "cell_type": "code", "execution_count": 29, "id": "ae7b1e6a-e937-44fe-bfe5-34696ea1156c", "metadata": {}, "outputs": [], "source": [ "em = EntityMatching(\n", " metric='cosine',\n", " tokenizer='char_tokenizer',\n", " vectorizer='tfidf',\n", " qgram=3,\n", " similarity_threshold=0.0\n", ")\n", "\n", "pairs_graph = em.predict(candidate_pairs_blocks, data, tqdm_disable=True)" ] }, { "cell_type": "code", "execution_count": 30, "id": "4d606bfc-3265-4042-93f3-22a1117c4886", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(pairs_graph)" ] }, { "cell_type": "code", "execution_count": 31, "id": "4a2a5f4a-6ffa-4c16-ae49-ff4fec4c467d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Entity Matching\n", "***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tMetric: cosine\n", "\tAttributes: None\n", "\tSimilarity threshold: 0.0\n", "\tTokenizer: char_tokenizer\n", "\tVectorizer: tfidf\n", "\tQgrams: 3\n", "Runtime: 2.3469 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 10.86% \n", "\tRecall: 91.45%\n", "\tF1-score: 19.41%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = em.evaluate(pairs_graph)" ] }, { "cell_type": "markdown", "id": "07ecd3d3-aa47-447c-af4d-cdd4744ca7c1", "metadata": {}, "source": [ "### How to set a valid similarity threshold?\n", "\n", "The similarity threshold can be configured with a grid search or with an Optuna search. pyJedAI also provides some visualizations of the distribution of the scores.\n", "\n", "For example, with a classic histogram:\n" ] }, { "cell_type": "code", "execution_count": 32, "id": "c04c0482-a0d2-4ebf-b9c8-f6cff1863e5d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_all_weights()" ] }, { "cell_type": "markdown", "id": "800c9f29-1260-40fb-bdf0-34cc9ada4aa3", "metadata": {}, "source": [ "Or with the scores grouped into bins of width 0.1 over the range 0.0 to 1.0:" ] }, { "cell_type": "code", "execution_count": 33, "id": "3e624fb5-cb48-4081-b90f-0e59adf88d26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distribution-% of predicted scores: [13.551092474067536, 28.8126241447804, 25.5131317589936, 17.325093798278527, 9.00463473846833, 3.8402118737585518, 1.4566320900463474, 0.4634738468329287, 0.03310527477378062, 0.0]\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_scores()" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "## Entity Clustering\n", "\n", "It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object." ] }, { "cell_type": "code", "execution_count": 34, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering" ] }, { "cell_type": "code", "execution_count": 35, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ccc = UniqueMappingClustering()\n", "clusters = ccc.process(pairs_graph, data, similarity_threshold=0.17)" ] }, { "cell_type": "code", "execution_count": 36, "id": "5b52a534-691a-48be-b5e9-c073dc04b154", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Unique Mapping Clustering\n", "Method info: Prunes all edges with a weight lower than t, sorts the remaining ones indecreasing weight/similarity and iteratively forms a partition forthe top-weighted pair as long as none of its entities has alreadybeen matched to some other.\n", "Parameters: None\n", "Runtime: 0.1299 seconds\n" ] } ], "source": [ "ccc.report()" ] }, { "cell_type": "code", "execution_count": 37, "id": "00bc2e82-9bc1-4119-b8cb-4a1c18afee19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Unique Mapping Clustering\n", "***************************************************************************************************************************\n", "Method name: Unique Mapping Clustering\n", 
"Parameters: \n", "Runtime: 0.1299 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 92.69% \n", "\tRecall: 86.06%\n", "\tF1-score: 89.25%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ccc.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "315369d8-6564-44d4-aea0-14034b54cf16", "metadata": {}, "source": [ "
\n", "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "vscode": { "interpreter": { "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094" } } }, "nbformat": 4, "nbformat_minor": 5 }