{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "# Dirty Entity Resolution Tutorial\n", "\n", "In this notebook we present the pyJedAI approach on the well-known Cora dataset. Dirty ER is the process of deduplicating a single set of entities." ] }, { "cell_type": "markdown", "id": "5274855c-ba95-49b1-ba68-4f50ca2bbd89", "metadata": {}, "source": [ "# How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more: [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": 1, "id": "4697d149-c1a4-4767-9ed1-14444485e409", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python 3.8.17\n" ] } ], "source": [ "!python --version" ] }, { "cell_type": "code", "execution_count": 2, "id": "776843a2-570d-4d87-bb1d-b5b61f07b1da", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Requirement already satisfied: pyjedai in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (0.1.0)\n", "Requirement already satisfied: gensim>=4.2.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (4.3.2)\n", "Requirement already satisfied: matplotlib>=3.1.3 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (3.7.2)\n", "Requirement already satisfied: matplotlib-inline>=0.1.3 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (0.1.6)\n", "Requirement already satisfied: networkx>=2.3 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (3.1)\n", "Requirement already satisfied: nltk>=3.7 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (3.8.1)\n", "Requirement already satisfied: numpy>=1.21 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (1.23.5)\n", "Requirement already 
satisfied: pandas>=0.25.3 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (2.0.3)\n", "Requirement already satisfied: pandas-profiling>=3.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (3.6.6)\n", "Requirement already satisfied: pandocfilters>=1.5 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (1.5.0)\n", "Requirement already satisfied: PyYAML>=6.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (6.0.1)\n", "Requirement already satisfied: rdflib>=6.1.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (7.0.0)\n", "Requirement already satisfied: rdfpandas>=1.1.5 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (1.1.6)\n", "Requirement already satisfied: regex>=2022.6.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (2023.8.8)\n", "Requirement already satisfied: scipy>=1.7 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (1.10.1)\n", "Requirement already satisfied: seaborn>=0.11 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (0.12.2)\n", "Requirement already satisfied: strsim>=0.0.3 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (0.0.3)\n", "Requirement already satisfied: strsimpy>=0.2.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (0.2.1)\n", "Requirement already satisfied: tqdm>=4.64 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (4.66.1)\n", "Requirement already satisfied: transformers>=4.21 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (4.32.0)\n", "Requirement already satisfied: sentence-transformers>=2.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (2.2.2)\n", "Requirement already 
satisfied: faiss-cpu>=1.7 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (1.7.4)\n", "Requirement already satisfied: optuna>=3.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (3.3.0)\n", "Requirement already satisfied: py-stringmatching>=0.4 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (0.4.3)\n", "Requirement already satisfied: ordered-set>=4.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (4.1.0)\n", "Requirement already satisfied: plotly>=5.16.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (5.16.1)\n", "Requirement already satisfied: tomli in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (2.0.1)\n", "Requirement already satisfied: valentine>=0.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pyjedai) (0.1.7)\n", "Requirement already satisfied: smart-open>=1.8.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from gensim>=4.2.0->pyjedai) (6.3.0)\n", "Requirement already satisfied: contourpy>=1.0.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (1.1.0)\n", "Requirement already satisfied: cycler>=0.10 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (0.11.0)\n", "Requirement already satisfied: fonttools>=4.22.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (4.42.1)\n", "Requirement already satisfied: kiwisolver>=1.0.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (1.4.5)\n", "Requirement already satisfied: packaging>=20.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (23.0)\n", "Requirement already satisfied: pillow>=6.2.0 in 
/home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (10.0.0)\n", "Requirement already satisfied: pyparsing<3.1,>=2.3.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (3.0.9)\n", "Requirement already satisfied: python-dateutil>=2.7 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (2.8.2)\n", "Requirement already satisfied: importlib-resources>=3.2.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib>=3.1.3->pyjedai) (5.2.0)\n", "Requirement already satisfied: traitlets in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from matplotlib-inline>=0.1.3->pyjedai) (5.7.1)\n", "Requirement already satisfied: click in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from nltk>=3.7->pyjedai) (8.1.7)\n", "Requirement already satisfied: joblib in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from nltk>=3.7->pyjedai) (1.3.2)\n", "Requirement already satisfied: alembic>=1.5.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from optuna>=3.0->pyjedai) (1.11.3)\n", "Requirement already satisfied: cmaes>=0.10.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from optuna>=3.0->pyjedai) (0.10.0)\n", "Requirement already satisfied: colorlog in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from optuna>=3.0->pyjedai) (6.7.0)\n", "Requirement already satisfied: sqlalchemy>=1.3.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from optuna>=3.0->pyjedai) (2.0.20)\n", "Requirement already satisfied: pytz>=2020.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pandas>=0.25.3->pyjedai) (2022.7)\n", "Requirement already satisfied: tzdata>=2022.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pandas>=0.25.3->pyjedai) (2023.3)\n", 
"Requirement already satisfied: ydata-profiling in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from pandas-profiling>=3.2->pyjedai) (4.5.1)\n", "Requirement already satisfied: tenacity>=6.2.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from plotly>=5.16.0->pyjedai) (8.2.3)\n", "Requirement already satisfied: six in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from py-stringmatching>=0.4->pyjedai) (1.16.0)\n", "Requirement already satisfied: isodate<0.7.0,>=0.6.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from rdflib>=6.1.1->pyjedai) (0.6.1)\n", "Requirement already satisfied: torch>=1.6.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sentence-transformers>=2.2->pyjedai) (2.0.1)\n", "Requirement already satisfied: torchvision in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sentence-transformers>=2.2->pyjedai) (0.15.2)\n", "Requirement already satisfied: scikit-learn in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sentence-transformers>=2.2->pyjedai) (1.3.0)\n", "Requirement already satisfied: sentencepiece in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sentence-transformers>=2.2->pyjedai) (0.1.99)\n", "Requirement already satisfied: huggingface-hub>=0.4.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sentence-transformers>=2.2->pyjedai) (0.16.4)\n", "Requirement already satisfied: filelock in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from transformers>=4.21->pyjedai) (3.12.2)\n", "Requirement already satisfied: requests in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from transformers>=4.21->pyjedai) (2.31.0)\n", "Requirement already satisfied: tokenizers!=0.11.3,<0.14,>=0.11.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from transformers>=4.21->pyjedai) (0.13.3)\n", "Requirement 
already satisfied: safetensors>=0.3.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from transformers>=4.21->pyjedai) (0.3.3)\n", "Requirement already satisfied: anytree<2.9,>=2.8 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from valentine>=0.1->pyjedai) (2.8.0)\n", "Requirement already satisfied: chardet<6.0.0,>=5.0.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from valentine>=0.1->pyjedai) (5.2.0)\n", "Requirement already satisfied: levenshtein<1.0,>=0.20.7 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from valentine>=0.1->pyjedai) (0.21.1)\n", "Requirement already satisfied: PuLP<3.0,>=2.5.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from valentine>=0.1->pyjedai) (2.7.0)\n", "Requirement already satisfied: pot<1.0,>=0.8.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from valentine>=0.1->pyjedai) (0.9.1)\n", "Requirement already satisfied: Mako in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from alembic>=1.5.0->optuna>=3.0->pyjedai) (1.2.4)\n", "Requirement already satisfied: typing-extensions>=4 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from alembic>=1.5.0->optuna>=3.0->pyjedai) (4.7.1)\n", "Requirement already satisfied: importlib-metadata in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from alembic>=1.5.0->optuna>=3.0->pyjedai) (6.0.0)\n", "Requirement already satisfied: fsspec in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from huggingface-hub>=0.4.0->sentence-transformers>=2.2->pyjedai) (2023.6.0)\n", "Requirement already satisfied: zipp>=3.1.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from importlib-resources>=3.2.0->matplotlib>=3.1.3->pyjedai) (3.11.0)\n", "Requirement already satisfied: rapidfuzz<4.0.0,>=2.3.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from 
levenshtein<1.0,>=0.20.7->valentine>=0.1->pyjedai) (3.2.0)\n", "Requirement already satisfied: greenlet!=0.4.17 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sqlalchemy>=1.3.0->optuna>=3.0->pyjedai) (2.0.2)\n", "Requirement already satisfied: sympy in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (1.12)\n", "Requirement already satisfied: jinja2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (3.1.2)\n", "Requirement already satisfied: nvidia-cuda-nvrtc-cu11==11.7.99 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.7.99)\n", "Requirement already satisfied: nvidia-cuda-runtime-cu11==11.7.99 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.7.99)\n", "Requirement already satisfied: nvidia-cuda-cupti-cu11==11.7.101 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.7.101)\n", "Requirement already satisfied: nvidia-cudnn-cu11==8.5.0.96 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (8.5.0.96)\n", "Requirement already satisfied: nvidia-cublas-cu11==11.10.3.66 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.10.3.66)\n", "Requirement already satisfied: nvidia-cufft-cu11==10.9.0.58 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (10.9.0.58)\n", "Requirement already satisfied: nvidia-curand-cu11==10.2.10.91 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) 
(10.2.10.91)\n", "Requirement already satisfied: nvidia-cusolver-cu11==11.4.0.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.4.0.1)\n", "Requirement already satisfied: nvidia-cusparse-cu11==11.7.4.91 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.7.4.91)\n", "Requirement already satisfied: nvidia-nccl-cu11==2.14.3 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (2.14.3)\n", "Requirement already satisfied: nvidia-nvtx-cu11==11.7.91 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (11.7.91)\n", "Requirement already satisfied: triton==2.0.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (2.0.0)\n", "Requirement already satisfied: setuptools in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (68.0.0)\n", "Requirement already satisfied: wheel in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (0.38.4)\n", "Requirement already satisfied: cmake in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from triton==2.0.0->torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (3.27.2)\n", "Requirement already satisfied: lit in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from triton==2.0.0->torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (16.0.6)\n", "Requirement already satisfied: charset-normalizer<4,>=2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from requests->transformers>=4.21->pyjedai) (2.0.4)\n", "Requirement already 
satisfied: idna<4,>=2.5 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from requests->transformers>=4.21->pyjedai) (3.4)\n", "Requirement already satisfied: urllib3<3,>=1.21.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from requests->transformers>=4.21->pyjedai) (1.26.16)\n", "Requirement already satisfied: certifi>=2017.4.17 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from requests->transformers>=4.21->pyjedai) (2023.7.22)\n", "Requirement already satisfied: threadpoolctl>=2.0.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from scikit-learn->sentence-transformers>=2.2->pyjedai) (3.2.0)\n", "Requirement already satisfied: pydantic<2,>=1.8.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (1.10.12)\n", "Requirement already satisfied: visions[type_image_path]==0.7.5 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (0.7.5)\n", "Requirement already satisfied: htmlmin==0.1.12 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (0.1.12)\n", "Requirement already satisfied: phik<0.13,>=0.11.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (0.12.3)\n", "Requirement already satisfied: multimethod<2,>=1.4 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (1.9.1)\n", "Requirement already satisfied: statsmodels<1,>=0.13.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (0.14.0)\n", "Requirement already satisfied: typeguard<3,>=2.13.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) 
(2.13.3)\n", "Requirement already satisfied: imagehash==4.3.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (4.3.1)\n", "Requirement already satisfied: wordcloud>=1.9.1 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (1.9.2)\n", "Requirement already satisfied: dacite>=1.8 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from ydata-profiling->pandas-profiling>=3.2->pyjedai) (1.8.1)\n", "Requirement already satisfied: PyWavelets in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from imagehash==4.3.1->ydata-profiling->pandas-profiling>=3.2->pyjedai) (1.4.1)\n", "Requirement already satisfied: attrs>=19.3.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from visions[type_image_path]==0.7.5->ydata-profiling->pandas-profiling>=3.2->pyjedai) (22.1.0)\n", "Requirement already satisfied: tangled-up-in-unicode>=0.0.4 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from visions[type_image_path]==0.7.5->ydata-profiling->pandas-profiling>=3.2->pyjedai) (0.2.0)\n", "Requirement already satisfied: MarkupSafe>=2.0 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from jinja2->torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (2.1.1)\n", "Requirement already satisfied: patsy>=0.5.2 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from statsmodels<1,>=0.13.2->ydata-profiling->pandas-profiling>=3.2->pyjedai) (0.5.3)\n", "Requirement already satisfied: mpmath>=0.19 in /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages (from sympy->torch>=1.6.0->sentence-transformers>=2.2->pyjedai) (1.3.0)\n" ] } ], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 3, "id": "6d2e5cf7-ff2e-4271-9242-fe3d638263e9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": 
[ "Name: pyjedai\n", "Version: 0.1.0\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis , Jakub Maciejewski , Manolis Koubarakis \n", "License: Apache Software License 2.0\n", "Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages\n", "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n", "Required-by: \n" ] } ], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "15d28272-269a-4e87-bb03-0a45a5492a06", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 4, "id": "a0890ce6-3a10-4e66-913f-78095bd786a1", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /home/jm/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph\n", "\n", "from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": { "tags": [] }, "source": [ "## Reading the dataset\n", "\n", "To run, pyJedAI only needs the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Cora dataset is provided as .csv files.\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "### pyjedai module\n", "\n", "The Data module offers a number of options:\n", "- Selects the attributes (columns) of the dataframe, in D1 (and in D2)\n", "- Prints a detailed text analysis\n", "- Stores a hidden mapping of the ids, creating it if it does not exist." ] }, { "cell_type": "code", "execution_count": 5, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "\n", "# The Cora files are pipe-delimited; the ground-truth file has no header row\n", "d1 = pd.read_csv(\"./../data/der/cora/cora.csv\", sep='|')\n", "gt = pd.read_csv(\"./../data/der/cora/cora_gt.csv\", sep='|', header=None)\n", "attr = ['Entity Id','author', 'title']" ] }, { "cell_type": "markdown", "id": "fda32323-c74d-4374-b322-5c11a175c3ea", "metadata": {}, "source": [ "Data is the connecting module of all steps of the workflow." ] }, { "cell_type": "code", "execution_count": 6, "id": "e257597d-ea77-4090-ba34-e1038d8f9a0d", "metadata": {}, "outputs": [], "source": [ "data = Data(\n", "    dataset_1=d1,\n", "    id_column_name_1='Entity Id',\n", "    ground_truth=gt,\n", "    attributes_1=attr\n", ")" ] }, { "cell_type": "markdown", "id": "93464edd-b88a-40d9-aa4d-7fe1523db662", "metadata": {}, "source": [ "## Workflow with Block Cleaning Methods\n", "\n", "In this notebook we build the architecture below:\n", "\n", "![workflow1-cora.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/docs/img/workflow1-cora.png?raw=true)\n", "\n" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "## Block Building\n", "\n", "Block building clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. 
Blocks are then extracted, possibly after a transformation, based on each key's equality or similarity with other keys.\n", "\n", "The following methods are currently supported:\n", "\n", "- Standard/Token Blocking\n", "- Sorted Neighborhood\n", "- Extended Sorted Neighborhood\n", "- Q-Grams Blocking\n", "- Extended Q-Grams Blocking\n", "- Suffix Arrays Blocking\n", "- Extended Suffix Arrays Blocking" ] }, { "cell_type": "code", "execution_count": 7, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_building import (\n", "    StandardBlocking,\n", "    QGramsBlocking,\n", "    SuffixArraysBlocking,\n", "    ExtendedSuffixArraysBlocking,\n", "    ExtendedQGramsBlocking\n", ")" ] }, { "cell_type": "code", "execution_count": 8, "id": "7ee34038-1352-440e-8c34-98c5cf036523", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "8605af0eba7d43c3972e821952576ebd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Suffix Arrays Blocking: 0%| | 0/1295 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(pairs_graph)" ] }, { "cell_type": "code", "execution_count": 23, "id": "00bc2e82-9bc1-4119-b8cb-4a1c18afee19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Entity Matching\n", "***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tMetric: jaccard\n", "\tAttributes: None\n", "\tSimilarity threshold: 0.0\n", "\tTokenizer: white_space_tokenizer\n", "\tVectorizer: None\n", "\tQgrams: 1\n", "Runtime: 7.9103 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", 
"Performance:\n", "\tPrecision: 74.46% \n", "\tRecall: 48.51%\n", "\tF1-score: 58.75%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = em.evaluate(pairs_graph)" ] }, { "cell_type": "markdown", "id": "607ee1fb-bed4-4751-9fa8-429874457f72", "metadata": {}, "source": [ "### How to set a valid similarity threshold?\n", "\n", "Tune the similarity threshold with a grid search or with an Optuna search. pyJedAI also provides visualizations of the score distributions.\n", "\n", "For example, with a classic histogram:\n" ] }, { "cell_type": "code", "execution_count": 24, "id": "c04c0482-a0d2-4ebf-b9c8-f6cff1863e5d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_all_weights()" ] }, { "cell_type": "markdown", "id": "75b6218c-3983-413e-a904-53bba33ea381", "metadata": {}, "source": [ "Or with the scores grouped into bins of width 0.1 over the range 0.0 to 1.0:" ] }, { "cell_type": "code", "execution_count": 25, "id": "3e624fb5-cb48-4081-b90f-0e59adf88d26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distribution-% of predicted scores: [3.0010718113612005, 7.725973561986424, 13.844230082172205, 23.97284744551626, 18.854948195784207, 13.71025366202215, 8.145766345123258, 4.948195784208646, 3.858520900321544, 1.9381922115041088]\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_scores()" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "## Entity Clustering\n", "\n", "It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object." ] }, { "cell_type": "code", "execution_count": 26, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering" ] }, { "cell_type": "code", "execution_count": 27, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ec = ConnectedComponentsClustering()\n", "clusters = ec.process(pairs_graph, data, similarity_threshold=0.3)" ] }, { "cell_type": "code", "execution_count": 28, "id": "3d2aa574", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Connected Components Clustering\n", "***************************************************************************************************************************\n", "Method name: Connected Components Clustering\n", "Parameters: \n", "Runtime: 0.0894 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 76.51% \n", "\tRecall: 51.54%\n", "\tF1-score: 61.59%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ec.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "6e44642b-b0d9-4f8d-9fe4-4fca2ad716aa", "metadata": {}, "source": [ "# Workflow with Similarity Joins\n", "\n", "In this 
notebook we build the architecture below:\n", "\n", "![workflow2-cora.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/documentation/workflow2-cora.png?raw=true)\n", "\n" ] }, { "cell_type": "markdown", "id": "2b07ea13-58cb-498a-949a-4702ea9ee4ce", "metadata": { "tags": [] }, "source": [ "## Data Reading" ] }, { "cell_type": "markdown", "id": "a0b7acf5-32b3-45b1-8347-197dfa869fb9", "metadata": {}, "source": [ "Data is the connecting module of all steps of the workflow." ] }, { "cell_type": "code", "execution_count": 29, "id": "3006b051-8348-4922-a627-56441a1db7b7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "d1 = pd.read_csv(\"./../data/der/cora/cora.csv\", sep='|')\n", "gt = pd.read_csv(\"./../data/der/cora/cora_gt.csv\", sep='|', header=None)\n", "attr = ['Entity Id','author', 'title']\n", "data = Data(\n", "    dataset_1=d1,\n", "    id_column_name_1='Entity Id',\n", "    ground_truth=gt,\n", "    attributes_1=attr\n", ")" ] }, { "cell_type": "markdown", "id": "b3eedb4c-b86f-4f98-abf2-9d6f7d0271c5", "metadata": {}, "source": [ "## Similarity Joins" ] }, { "cell_type": "code", "execution_count": 37, "id": "afd97b7e-4bf8-4256-b9fc-a301e413e834", "metadata": {}, "outputs": [], "source": [ "from pyjedai.joins import EJoin, TopKJoin" ] }, { "cell_type": "code", "execution_count": 38, "id": "7bc8ba43-b059-4839-8958-0b31fab95e46", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "c518d093ae0047529a0ff47fa6e0696e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "EJoin (jaccard): 0%| | 0/2590 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(g)" ] }, { "cell_type": "code", "execution_count": 42, "id": "a4c57244-0c98-4598-8ae2-06cdd0c48385", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"***************************************************************************************************************************\n", " Μethod: Top-K Join\n", "***************************************************************************************************************************\n", "Method name: Top-K Join\n", "Parameters: \n", "\tsimilarity_threshold: 0.25547445255474455\n", "\tK: 20\n", "\tmetric: jaccard\n", "\ttokenization: qgrams\n", "\tqgrams: 3\n", "Runtime: 33.9919 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 58.34% \n", "\tRecall: 63.75%\n", "\tF1-score: 60.92%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] }, { "data": { "text/plain": [ "{'Precision %': 58.340434597358325,\n", " 'Recall %': 63.74534450651769,\n", " 'F1 %': 60.923248053392655,\n", " 'True Positives': 10954,\n", " 'False Positives': 7822,\n", " 'True Negatives': 814451.0,\n", " 'False Negatives': 6230}" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topk_join.evaluate(g)" ] }, { "cell_type": "markdown", "id": "e686ba55-cbbe-4133-8256-fa5339e70720", "metadata": {}, "source": [ "## Entity Clustering" ] }, { "cell_type": "code", "execution_count": 43, "id": "e48627ef-0be6-45dd-b3ba-863818eadbfb", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering" ] }, { "cell_type": "code", "execution_count": 44, "id": "a9d5a28d-79f0-479f-b154-f5f7e3661897", "metadata": {}, "outputs": [], "source": [ "ccc = ConnectedComponentsClustering()\n", "\n", "clusters = ccc.process(g, data)" ] }, { "cell_type": "code", "execution_count": 45, "id": "f95539fe-2569-4569-8929-2f4220f14157", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"***************************************************************************************************************************\n", " Μethod: Connected Components Clustering\n", "***************************************************************************************************************************\n", "Method name: Connected Components Clustering\n", "Parameters: \n", "Runtime: 0.1218 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 2.05% \n", "\tRecall: 100.00%\n", "\tF1-score: 4.02%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ccc.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "778b9b19-0964-4095-a293-a8336fe0607a", "metadata": {}, "source": [ "
\n", "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "eb71b039-9d16-4e13-a484-d9a64e7b96bb", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.17" } }, "nbformat": 4, "nbformat_minor": 5 }