{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "
Clean-Clean Entity Resolution Tutorial\n", "
\n", "\n", "\n", "In this notebook we present the pyJedAI approach on the well-known Abt-Buy dataset. Clean-Clean ER is the task of link discovery/deduplication between two sets of entities." ] }, { "cell_type": "markdown", "id": "9c49d2b7-11b5-40b3-9341-de98608dde13", "metadata": {}, "source": [ "Dataset: __Abt-Buy dataset__\n", "\n", "The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1076 entities from Abt.com and 1076 entities from Buy.com, as well as a gold standard (perfect mapping) with 1076 matching record pairs between the two data sources. The attributes common to the two data sources are product name, product description, and product price." ] }, { "cell_type": "markdown", "id": "744b3017-9a5c-4d3c-8e0a-fe39b069b647", "metadata": {}, "source": [ "# Installation\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more information: [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": 1, "id": "029a5825-799d-4c3f-a6cd-a75e257cadcc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyjedai in c:\\users\\nikol\\anaconda3\\lib\\site-packages (0.0.5)\n", "Requirement already satisfied: strsim>=0.0.3 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (0.0.3)\n", "Requirement already satisfied: regex>=2022.6.2 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (2022.6.2)\n", "Requirement already satisfied: pandocfilters>=1.5 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (1.5.0)\n", "Requirement already satisfied: strsimpy>=0.2.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (0.2.1)\n", "Requirement already satisfied: seaborn>=0.11 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (0.11.2)\n", "Requirement already satisfied: gensim>=4.2.0 in 
c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (4.2.0)\n", "Requirement already satisfied: matplotlib>=3.1.3 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (3.5.3)\n", "Requirement already satisfied: matplotlib-inline>=0.1.3 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (0.1.6)\n", "Requirement already satisfied: numpy>=1.21 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (1.21.2)\n", "Requirement already satisfied: faiss-cpu>=1.7 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (1.7.2)\n", "Requirement already satisfied: transformers>=4.21 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (4.21.3)\n", "Requirement already satisfied: nltk>=3.7 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (3.7)\n", "Requirement already satisfied: scipy>=1.7 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (1.7.1)\n", "Requirement already satisfied: pandas>=0.25.3 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (1.3.4)\n", "Requirement already satisfied: tomli in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (2.0.1)\n", "Requirement already satisfied: tqdm>=4.64 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (4.64.0)\n", "Requirement already satisfied: optuna>=3.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (3.0.1)\n", "Requirement already satisfied: rdfpandas>=1.1.5 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (1.1.5)\n", "Requirement already satisfied: networkx>=2.3 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (2.6.3)\n", "Requirement already satisfied: PyYAML>=6.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (6.0)\n", "Requirement already satisfied: pandas-profiling>=3.2 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (3.2.0)\n", "Requirement already satisfied: rdflib>=6.1.1 in 
c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (6.1.1)\n", "Requirement already satisfied: sentence-transformers>=2.2 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pyjedai) (2.2.2)\n", "Requirement already satisfied: Cython==0.29.28 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from gensim>=4.2.0->pyjedai) (0.29.28)\n", "Requirement already satisfied: smart-open>=1.8.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from gensim>=4.2.0->pyjedai) (5.1.0)\n", "Requirement already satisfied: pyparsing>=2.2.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (3.0.4)\n", "Requirement already satisfied: python-dateutil>=2.7 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (2.8.2)\n", "Requirement already satisfied: packaging>=20.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (21.3)\n", "Requirement already satisfied: pillow>=6.2.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (8.4.0)\n", "Requirement already satisfied: cycler>=0.10 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (0.10.0)\n", "Requirement already satisfied: fonttools>=4.22.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (4.25.0)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib>=3.1.3->pyjedai) (1.3.1)\n", "Requirement already satisfied: six in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cycler>=0.10->matplotlib>=3.1.3->pyjedai) (1.16.0)\n", "Requirement already satisfied: traitlets in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from matplotlib-inline>=0.1.3->pyjedai) (5.1.1)\n", "Requirement already satisfied: click in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from nltk>=3.7->pyjedai) (8.0.3)\n", "Requirement already satisfied: joblib in 
c:\\users\\nikol\\anaconda3\\lib\\site-packages (from nltk>=3.7->pyjedai) (1.1.0)\n", "Requirement already satisfied: typing-extensions>=3.10.0.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from optuna>=3.0->pyjedai) (4.3.0)\n", "Requirement already satisfied: colorlog in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from optuna>=3.0->pyjedai) (6.6.0)\n", "Requirement already satisfied: cmaes>=0.8.2 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from optuna>=3.0->pyjedai) (0.8.2)\n", "Requirement already satisfied: sqlalchemy>=1.1.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from optuna>=3.0->pyjedai) (1.4.22)\n", "Requirement already satisfied: cliff in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from optuna>=3.0->pyjedai) (3.10.0)\n", "Requirement already satisfied: alembic in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from optuna>=3.0->pyjedai) (1.7.5)\n", "Requirement already satisfied: pytz>=2017.3 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas>=0.25.3->pyjedai) (2021.3)\n", "Requirement already satisfied: pydantic>=1.8.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (1.10.2)\n", "Requirement already satisfied: htmlmin>=0.1.12 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (0.1.12)\n", "Requirement already satisfied: visions[type_image_path]==0.7.4 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (0.7.4)\n", "Requirement already satisfied: tangled-up-in-unicode==0.2.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (0.2.0)\n", "Requirement already satisfied: jinja2>=2.11.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (3.0.2)\n", "Requirement already satisfied: requests>=2.24.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (2.26.0)\n", "Requirement already 
satisfied: multimethod>=1.4 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (1.9)\n", "Requirement already satisfied: missingno>=0.4.2 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (0.5.1)\n", "Requirement already satisfied: phik>=0.11.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (0.12.2)\n", "Requirement already satisfied: markupsafe~=2.1.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from pandas-profiling>=3.2->pyjedai) (2.1.1)\n", "Requirement already satisfied: attrs>=19.3.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=3.2->pyjedai) (21.2.0)\n", "Requirement already satisfied: imagehash in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from visions[type_image_path]==0.7.4->pandas-profiling>=3.2->pyjedai) (4.3.1)\n", "Requirement already satisfied: importlib-metadata in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from rdflib>=6.1.1->pyjedai) (4.8.1)\n", "Requirement already satisfied: setuptools in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from rdflib>=6.1.1->pyjedai) (58.0.4)\n", "Requirement already satisfied: isodate in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from rdflib>=6.1.1->pyjedai) (0.6.1)\n", "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from requests>=2.24.0->pandas-profiling>=3.2->pyjedai) (2022.9.14)\n", "Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from requests>=2.24.0->pandas-profiling>=3.2->pyjedai) (1.26.7)\n", "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from requests>=2.24.0->pandas-profiling>=3.2->pyjedai) (3.3)\n", "Requirement already satisfied: charset-normalizer~=2.0.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from 
requests>=2.24.0->pandas-profiling>=3.2->pyjedai) (2.0.4)\n", "Requirement already satisfied: torch>=1.6.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from sentence-transformers>=2.2->pyjedai) (1.12.1)\n", "Requirement already satisfied: scikit-learn in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from sentence-transformers>=2.2->pyjedai) (1.0.1)\n", "Requirement already satisfied: sentencepiece in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from sentence-transformers>=2.2->pyjedai) (0.1.97)\n", "Requirement already satisfied: torchvision in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from sentence-transformers>=2.2->pyjedai) (0.13.1)\n", "Requirement already satisfied: huggingface-hub>=0.4.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from sentence-transformers>=2.2->pyjedai) (0.10.0)\n", "Requirement already satisfied: filelock in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from huggingface-hub>=0.4.0->sentence-transformers>=2.2->pyjedai) (3.3.1)\n", "Requirement already satisfied: greenlet!=0.4.17 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from sqlalchemy>=1.1.0->optuna>=3.0->pyjedai) (1.1.1)\n", "Requirement already satisfied: colorama in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from tqdm>=4.64->pyjedai) (0.4.4)\n", "Requirement already satisfied: tokenizers!=0.11.3,<0.13,>=0.11.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from transformers>=4.21->pyjedai) (0.12.1)\n", "Requirement already satisfied: Mako in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from alembic->optuna>=3.0->pyjedai) (1.1.4)\n", "Requirement already satisfied: importlib-resources in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from alembic->optuna>=3.0->pyjedai) (5.4.0)\n", "Requirement already satisfied: stevedore>=2.0.1 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cliff->optuna>=3.0->pyjedai) (3.5.0)\n", "Requirement already satisfied: autopage>=0.4.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from 
cliff->optuna>=3.0->pyjedai) (0.4.0)\n", "Requirement already satisfied: cmd2>=1.0.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cliff->optuna>=3.0->pyjedai) (2.3.2)\n", "Requirement already satisfied: PrettyTable>=0.7.2 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cliff->optuna>=3.0->pyjedai) (2.4.0)\n", "Requirement already satisfied: pbr!=2.1.0,>=2.0.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cliff->optuna>=3.0->pyjedai) (5.6.0)\n", "Requirement already satisfied: wcwidth>=0.1.7 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cmd2>=1.0.0->cliff->optuna>=3.0->pyjedai) (0.2.5)\n", "Requirement already satisfied: pyreadline in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cmd2>=1.0.0->cliff->optuna>=3.0->pyjedai) (2.1)\n", "Requirement already satisfied: pyperclip>=1.6 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from cmd2>=1.0.0->cliff->optuna>=3.0->pyjedai) (1.8.2)\n", "Requirement already satisfied: zipp>=0.5 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from importlib-metadata->rdflib>=6.1.1->pyjedai) (3.6.0)\n", "Requirement already satisfied: PyWavelets in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from imagehash->visions[type_image_path]==0.7.4->pandas-profiling>=3.2->pyjedai) (1.1.1)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in c:\\users\\nikol\\anaconda3\\lib\\site-packages (from scikit-learn->sentence-transformers>=2.2->pyjedai) (2.2.0)\n" ] } ], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 2, "id": "462695ec-3af1-4048-9971-9ed0bce0f07b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: pyjedai\n", "Version: 0.0.5\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis \n", "License: Apache Software License 2.0\n", "Location: 
c:\\users\\nikol\\anaconda3\\lib\\site-packages\n", "Requires: PyYAML, optuna, scipy, gensim, pandocfilters, numpy, rdflib, pandas, transformers, regex, strsim, tqdm, networkx, seaborn, rdfpandas, strsimpy, matplotlib-inline, matplotlib, pandas-profiling, tomli, nltk, faiss-cpu, sentence-transformers\n", "Required-by: \n" ] } ], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "7b4c62c5-6581-4d2e-9d44-c7c02f43d441", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 3, "id": "6db50d83-51d8-4c95-9f27-30ef867338f2", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph" ] }, { "cell_type": "code", "execution_count": 4, "id": "4d4e6a90-9fd8-4f7a-bf4f-a5b994e0adfb", "metadata": {}, "outputs": [], "source": [ "import pyjedai\n", "from pyjedai.utils import (\n", " text_cleaning_method,\n", " print_clusters,\n", " print_blocks,\n", " print_candidate_pairs\n", ")\n", "from pyjedai.evaluation import Evaluation, write" ] }, { "cell_type": "markdown", "id": "451bf970-4425-487b-8756-776abb9536ea", "metadata": {}, "source": [ "# Workflow Architecture\n", "\n", "![workflow-example.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/documentation/workflow-example.png?raw=true)" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": {}, "source": [ "# Data Reading\n", "\n", "To run, pyJedAI needs only the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Abt-Buy dataset is provided as .csv files. 
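

As a minimal sketch of that preparation step, the records below are hypothetical semi-structured data (made up for illustration, not read from the Abt-Buy files) whose fields differ from record to record. pandas aligns them on the union of their keys, and the result is cast to strings in the same way as the CSV files in this tutorial. Only pandas is used here; how pyJedAI consumes the DataFrame is left to the cells that follow.

```python
import pandas as pd

# Hypothetical semi-structured records: not every record has every field.
records = [
    {'id': 0, 'name': 'Sony Turntable - PSLX350H'},
    {'id': 1, 'name': 'Sony Switcher - SBV40S', 'price': '49'},
    {'id': 2, 'name': 'Linksys EtherFast EZXS55W Ethernet Switch',
     'description': '5 x 10/100Base-TX LAN'},
]

# pandas builds one column per key seen in any record; absent fields become NaN.
df = pd.DataFrame(records)

# Replace missing values with empty strings, then cast every column to str,
# mirroring the na_filter=False and astype(str) calls used for the CSV files.
df = df.fillna('').astype(str)

print(df.columns.tolist())
print(df)
```

Calling `fillna('')` before `astype(str)` keeps missing fields as empty strings rather than the literal text `nan`.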
\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "e6aabec4-ef4f-4267-8c1e-377054e669d2", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "code", "execution_count": 8, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "d1 = pd.read_csv(\"./../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False).astype(str)\n", "d2 = pd.read_csv(\"./../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False).astype(str)\n", "gt = pd.read_csv(\"./../data/ccer/D2/gt.csv\", sep='|', engine='python').astype(str)\n", "\n", "data = Data(dataset_1=d1,\n", " id_column_name_1='id',\n", " dataset_2=d2,\n", " id_column_name_2='id',\n", " ground_truth=gt)" ] }, { "cell_type": "markdown", "id": "5d8a8a78-858e-4c79-90fe-197a68e95e11", "metadata": {}, "source": [ "pyJedAI also offers dataset analysis methods (more will be developed)." ] }, { "cell_type": "code", "execution_count": 9, "id": "7cb87af2-adda-49e0-82cc-b1a5f7a595ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------- Data -------------------------\n", "Type of Entity Resolution: Clean-Clean\n", "Dataset-1:\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tAttributes: \n", "\t\t ['name', 'description', 'price']\n", "Dataset-2:\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tAttributes: \n", "\t\t ['name', 'description', 'price']\n", "\n", "Total number of entities: 2152\n", "Number of matching pairs in ground-truth: 1076\n", "-------------------------------------------------------- \n", "\n" ] } ], "source": [ "data.print_specs()" ] }, { "cell_type": "code", "execution_count": 10, "id": "b822d7c0-19a2-4050-9554-c35a208bb848", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>id</th><th>name</th><th>description</th><th>price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>Sony Turntable - PSLX350H</td><td>Sony Turntable - PSLX350H/ Belt Drive System/ ...</td><td></td></tr>\n",
" <tr><th>1</th><td>1</td><td>Bose Acoustimass 5 Series III Speaker System -...</td><td>Bose Acoustimass 5 Series III Speaker System -...</td><td>399</td></tr>\n",
" <tr><th>2</th><td>2</td><td>Sony Switcher - SBV40S</td><td>Sony Switcher - SBV40S/ Eliminates Disconnecti...</td><td>49</td></tr>\n",
" <tr><th>3</th><td>3</td><td>Sony 5 Disc CD Player - CDPCE375</td><td>Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change...</td><td></td></tr>\n",
" <tr><th>4</th><td>4</td><td>Bose 27028 161 Bookshelf Pair Speakers In Whit...</td><td>Bose 161 Bookshelf Speakers In White - 161WH/ ...</td><td>158</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id name \\\n", "0 0 Sony Turntable - PSLX350H \n", "1 1 Bose Acoustimass 5 Series III Speaker System -... \n", "2 2 Sony Switcher - SBV40S \n", "3 3 Sony 5 Disc CD Player - CDPCE375 \n", "4 4 Bose 27028 161 Bookshelf Pair Speakers In Whit... \n", "\n", " description price \n", "0 Sony Turntable - PSLX350H/ Belt Drive System/ ... \n", "1 Bose Acoustimass 5 Series III Speaker System -... 399 \n", "2 Sony Switcher - SBV40S/ Eliminates Disconnecti... 49 \n", "3 Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change... \n", "4 Bose 161 Bookshelf Speakers In White - 161WH/ ... 158 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_1.head(5)" ] }, { "cell_type": "code", "execution_count": 11, "id": "5c26b595-5e02-4bfc-8e79-e476ab2830ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>id</th><th>name</th><th>description</th><th>price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>Linksys EtherFast EZXS88W Ethernet Switch - EZ...</td><td>Linksys EtherFast 8-Port 10/100 Switch (New/Wo...</td><td></td></tr>\n",
" <tr><th>1</th><td>1</td><td>Linksys EtherFast EZXS55W Ethernet Switch</td><td>5 x 10/100Base-TX LAN</td><td></td></tr>\n",
" <tr><th>2</th><td>2</td><td>Netgear ProSafe FS105 Ethernet Switch - FS105NA</td><td>NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw...</td><td></td></tr>\n",
" <tr><th>3</th><td>3</td><td>Belkin Pro Series High Integrity VGA/SVGA Moni...</td><td>1 x HD-15 - 1 x HD-15 - 10ft - Beige</td><td></td></tr>\n",
" <tr><th>4</th><td>4</td><td>Netgear ProSafe JFS516 Ethernet Switch</td><td>Netgear ProSafe 16 Port 10/100 Rackmount Switc...</td><td></td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id name \\\n", "0 0 Linksys EtherFast EZXS88W Ethernet Switch - EZ... \n", "1 1 Linksys EtherFast EZXS55W Ethernet Switch \n", "2 2 Netgear ProSafe FS105 Ethernet Switch - FS105NA \n", "3 3 Belkin Pro Series High Integrity VGA/SVGA Moni... \n", "4 4 Netgear ProSafe JFS516 Ethernet Switch \n", "\n", " description price \n", "0 Linksys EtherFast 8-Port 10/100 Switch (New/Wo... \n", "1 5 x 10/100Base-TX LAN \n", "2 NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw... \n", "3 1 x HD-15 - 1 x HD-15 - 10ft - Beige \n", "4 Netgear ProSafe 16 Port 10/100 Rackmount Switc... " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_2.head(5)" ] }, { "cell_type": "code", "execution_count": 12, "id": "b3c9827e-a08a-47b2-a7f2-6f3f72184a17", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
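<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>D1</th><th>D2</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>206</td><td>216</td></tr>\n",
" <tr><th>1</th><td>60</td><td>46</td></tr>\n",
" <tr><th>2</th><td>182</td><td>160</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>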
" ], "text/plain": [ " D1 D2\n", "0 206 216\n", "1 60 46\n", "2 182 160" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ground_truth.head(3)" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "# Block Building\n", "\n", "It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly using a transformation, based on its equality or on its similarity with other keys.\n", "\n", "The following methods are currently supported:\n", "\n", "- Standard/Token Blocking\n", "- Sorted Neighborhood\n", "- Extended Sorted Neighborhood\n", "- Q-Grams Blocking\n", "- Extended Q-Grams Blocking\n", "- Suffix Arrays Blocking\n", "- Extended Suffix Arrays Blocking" ] }, { "cell_type": "code", "execution_count": 13, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Created embeddings directory at: C:\\Users\\nikol\\Desktop\\test\\tutorials\\.embeddings\n" ] } ], "source": [ "from pyjedai.block_building import (\n", " StandardBlocking,\n", " QGramsBlocking,\n", " ExtendedQGramsBlocking,\n", " SuffixArraysBlocking,\n", " ExtendedSuffixArraysBlocking,\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "id": "9741f0c4-6250-455f-9c88-b8dc61ab7d4d", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "e8f45ed2068a488d8987344463a830bd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Suffix Arrays Blocking: 0%| | 0/2152 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(pairs_graph)" ] }, { "cell_type": "code", "execution_count": 38, "id": "4a2a5f4a-6ffa-4c16-ae49-ff4fec4c467d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"***************************************************************************************************************************\n", " Μethod: Entity Matching\n", "***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tTokenizer: white_space_tokenizer\n", "\tMetric: dice\n", "\tSimilarity Threshold: 0.5\n", "Runtime: 2.2469 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 97.14% \n", "\tRecall: 3.16%\n", "\tF1-score: 6.12%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = EM.evaluate(pairs_graph)" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "# Entity Clustering\n", "\n", "It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object." 
] }, { "cell_type": "code", "execution_count": 39, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering" ] }, { "cell_type": "code", "execution_count": 41, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ccc = ConnectedComponentsClustering()\n", "clusters = ccc.process(pairs_graph, data)" ] }, { "cell_type": "code", "execution_count": 42, "id": "5b52a534-691a-48be-b5e9-c073dc04b154", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Connected Components Clustering\n", "Method info: Gets equivalence clusters from the transitive closure of the similarity graph.\n", "Parameters: None\n", "Runtime: 0.0010 seconds\n" ] } ], "source": [ "ccc.report()" ] }, { "cell_type": "code", "execution_count": 45, "id": "00bc2e82-9bc1-4119-b8cb-4a1c18afee19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Connected Components Clustering\n", "***************************************************************************************************************************\n", "Method name: Connected Components Clustering\n", "Parameters: \n", "Runtime: 0.0010 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 97.14% \n", "\tRecall: 3.16%\n", "\tF1-score: 6.12%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ccc.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "315369d8-6564-44d4-aea0-14034b54cf16", "metadata": {}, "source": [ "
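
The F1-score reported above is the harmonic mean of precision and recall, so a very low recall pulls it down even when precision is high. As a quick check, the sketch below recomputes it from the rounded precision and recall figures copied from the evaluation output; the variable names are only illustrative.

```python
# Rounded figures copied from the evaluation output above.
precision = 0.9714
recall = 0.0316

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f'F1-score: {100 * f1:.2f}%')
```
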
\n", "
\n", "K. Nikoletos, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" }, "vscode": { "interpreter": { "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094" } } }, "nbformat": 4, "nbformat_minor": 5 }