{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "# Similarity Joins Tutorial\n", "\n", "In this notebook we present the pyJedAI approach on the well-known Cora dataset using a Similarity Joins workflow.\n", "\n", "\n", "![workflow2-cora.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/docs/img/workflow2-cora.png?raw=true)\n" ] }, { "cell_type": "markdown", "id": "5274855c-ba95-49b1-ba68-4f50ca2bbd89", "metadata": {}, "source": [ "## How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more: [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": 1, "id": "4697d149-c1a4-4767-9ed1-14444485e409", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Python 3.8.17\n" ] } ], "source": [ "!python --version" ] }, { "cell_type": "code", "execution_count": null, "id": "776843a2-570d-4d87-bb1d-b5b61f07b1da", "metadata": {}, "outputs": [], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 3, "id": "6d2e5cf7-ff2e-4271-9242-fe3d638263e9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: pyjedai\n", "Version: 0.1.0\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis , Jakub Maciejewski , Manolis Koubarakis \n", "License: Apache Software License 2.0\n", "Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages\n", "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n", "Required-by: \n" ] } 
], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "15d28272-269a-4e87-bb03-0a45a5492a06", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 4, "id": "a0890ce6-3a10-4e66-913f-78095bd786a1", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /home/jm/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph\n", "\n", "from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": { "tags": [] }, "source": [ "## Reading the dataset\n", "\n", "To run, pyJedAI needs only the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Cora dataset is provided as .csv files.\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "### pyjedai module\n", "\n", "The Data module offers a number of options:\n", "- Selects the attributes (columns) of the DataFrame, in D1 (and in D2)\n", "- Prints a detailed text analysis\n", "- Stores a hidden mapping of the ids, creating it if it does not exist." ] }, { "cell_type": "code", "execution_count": 5, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "\n", "d1 = pd.read_csv(\"./../data/der/cora/cora.csv\", sep='|')\n", "gt = pd.read_csv(\"./../data/der/cora/cora_gt.csv\", sep='|', header=None)\n", "attr = ['Entity Id', 'author', 'title']" ] }, { "cell_type": "markdown", "id": "fda32323-c74d-4374-b322-5c11a175c3ea", "metadata": {}, "source": [ "Data is the module that connects all steps of the workflow." ] }, { "cell_type": "code", "execution_count": 6, "id": "e257597d-ea77-4090-ba34-e1038d8f9a0d", "metadata": {}, "outputs": [], "source": [ "data = Data(\n", "    dataset_1=d1,\n", "    id_column_name_1='Entity Id',\n", "    ground_truth=gt,\n", "    attributes_1=attr\n", ")" ] }, { "cell_type": "markdown", "id": "5ee7b50f-8b4c-4668-a44c-1ebf8c7a1242", "metadata": {}, "source": [ "## Similarity Joins\n", "\n", "__Available algorithms:__\n", "\n", "- EJoin\n", "- TopKJoin" ] }, { "cell_type": "code", "execution_count": 7, "id": "afd97b7e-4bf8-4256-b9fc-a301e413e834", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/jm/public-pyJedAI/pyJedAI/src/pyjedai/joins.py:13: TqdmExperimentalWarning: Using `tqdm.autonotebook.tqdm` in notebook mode. Use `tqdm.tqdm` instead to force console mode (e.g. 
in jupyter console)\n", " from tqdm.autonotebook import tqdm\n" ] } ], "source": [ "from pyjedai.joins import EJoin, TopKJoin" ] }, { "cell_type": "code", "execution_count": 8, "id": "7bc8ba43-b059-4839-8958-0b31fab95e46", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6251232d68a342e48e2250e40085d73a", "version_major": 2, "version_minor": 0 }, "text/plain": [ "EJoin (jaccard): 0%| | 0/2590 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(g)" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "## Entity Clustering\n", "\n", "It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object." ] }, { "cell_type": "code", "execution_count": 11, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering" ] }, { "cell_type": "code", "execution_count": 12, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ec = ConnectedComponentsClustering()\n", "clusters = ec.process(g, data, similarity_threshold=0.3)" ] }, { "cell_type": "code", "execution_count": 13, "id": "3d2aa574", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Connected Components Clustering\n", "***************************************************************************************************************************\n", "Method name: Connected Components Clustering\n", "Parameters: \n", "Runtime: 0.3853 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", 
"Performance:\n", "\tPrecision: 48.42% \n", "\tRecall: 93.19%\n", "\tF1-score: 63.73%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ec.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "778b9b19-0964-4095-a293-a8336fe0607a", "metadata": {}, "source": [ "
\n", "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "dc77e8ab-087c-48c9-8ee3-56bddba98ea2", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }