{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "# PyTorch and FAISS workflow\n", "\n", "A pyJedAI workflow that builds blocks from pre-trained PyTorch embeddings with FAISS nearest-neighbor search, then clusters the matched entities.\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "34082d2f-2446-46d8-acff-842c16959120", "metadata": {}, "source": [ "## How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4598710b-6c72-462a-a261-3bd1c5694fe4", "metadata": {}, "outputs": [], "source": [ "%pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": null, "id": "f6145391-26bd-4e96-ba93-57e986e61047", "metadata": {}, "outputs": [], "source": [ "%pip show pyjedai" ] }, { "cell_type": "markdown", "id": "80dfcdfa-c357-4856-bca4-1ed17b0238d1", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "6db50d83-51d8-4c95-9f27-30ef867338f2", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph\n", "\n", "from pyjedai.evaluation import Evaluation\n", "from pyjedai.datamodel import Data" ] }, { "cell_type": "code", "execution_count": 2, "id": "b4398804-8bf6-44a6-b324-e628011c17a8", "metadata": {}, "outputs": [], "source": [ "# Load the two pipe-separated datasets (Abt and Buy) as strings, plus the ground-truth file\n", "d1 = pd.read_csv(\"../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False).astype(str)\n", "d2 = pd.read_csv(\"../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False).astype(str)\n", "gt = pd.read_csv(\"../data/ccer/D2/gt.csv\", sep='|', engine='python')\n", "\n", "# Use every column except the first one as an attribute\n", "attr1 = d1.columns[1:].to_list()\n", "attr2 = d2.columns[1:].to_list()\n", "\n", "data = Data(dataset_1=d1,\n", " attributes_1=attr1,\n", " id_column_name_1='id',\n", " dataset_2=d2,\n", " attributes_2=attr2,\n", " id_column_name_2='id',\n", " ground_truth=gt)" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "# Block Building\n", "\n", "## Pre-trained PyTorch & Gensim embeddings\n", "\n", "Available embeddings:\n", "\n", "- Gensim: `{'fasttext', 'glove', 'word2vec'}`\n", "- PyTorch Sentence 
transformers: `{'smpnet', 'st5', 'sdistilroberta', 'sminilm', 'sent_glove'}`\n", "- PyTorch Word transformers: `{'bert', 'distilbert', 'roberta', 'xlnet', 'albert'}`\n", "\n", "## FAISS\n", "\n", "`faiss.IndexIVFFlat` is an inverted file index with coarse quantization: the indexed vectors are partitioned into clusters, and a query is compared only against the vectors in the clusters closest to it. This makes nearest-neighbor search over a large set of vectors efficient. In the block-building call below, `top_k` sets how many nearest neighbors are retrieved per entity and `similarity_distance` selects the distance metric.\n" ] }, { "cell_type": "code", "execution_count": 3, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/conda/miniconda3/envs/pypi_dependencies/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding" ] }, { "cell_type": "code", "execution_count": 4, "id": "b25e2601-a871-41dd-a48c-bf72382d0f13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building blocks via Embeddings-NN Block Building [sminilm, faiss]\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Embeddings-NN Block Building [sminilm, faiss, cuda]: 100%|██████████| 2152/2152 [00:21<00:00, 101.10it/s]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "disable True\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "# Embed the entities with the 'sminilm' sentence transformer and search neighbors with FAISS\n", "emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',\n", " similarity_search='faiss')\n", "\n", "blocks, g = emb.build_blocks(data,\n", " top_k=5,\n", " similarity_distance='euclidean',\n", " load_embeddings_if_exist=False,\n", " save_embeddings=False,\n", " with_entity_matching=True)" ] }, { "cell_type": "code", "execution_count": 5, "id": "661ddf8a-537c-4935-a159-4cfb7bb2325a", "metadata": 
{}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Embeddings-NN Block Building\n", "***************************************************************************************************************************\n", "Method name: Embeddings-NN Block Building\n", "Parameters: \n", "\tVectorizer: sminilm\n", "\tSimilarity-Search: faiss\n", "\tTop-K: 5\n", "\tVector size: 384\n", "Runtime: 21.9882 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 18.75% \n", "\tRecall: 93.77%\n", "\tF1-score: 31.26%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Classification report:\n", "\tTrue positives: 1009\n", "\tFalse positives: 4371\n", "\tTrue negatives: 1156633\n", "\tFalse negatives: 67\n", "\tTotal comparisons: 5380\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Statistics:\n", " FAISS:\n", "\tIndices shape returned after search: (1076, 5)\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] }, { "data": { "text/plain": [ "{'Precision %': 18.7546468401487,\n", " 'Recall %': 93.77323420074349,\n", " 'F1 %': 31.257744733581166,\n", " 'True Positives': 1009,\n", " 'False Positives': 4371,\n", " 'True Negatives': 1156633,\n", " 'False Negatives': 67}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "emb.evaluate(blocks, with_classification_report=True, with_stats=True)" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "# Entity Clustering" ] }, { 
"cell_type": "code", "execution_count": 6, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering" ] }, { "cell_type": "code", "execution_count": 7, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ccc = UniqueMappingClustering()\n", "clusters = ccc.process(g, data, similarity_threshold=0.63)" ] }, { "cell_type": "code", "execution_count": 8, "id": "22ced5c0-0e3c-4a09-a32a-19abbbe2e8d7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Unique Mapping Clustering\n", "***************************************************************************************************************************\n", "Method name: Unique Mapping Clustering\n", "Parameters: \n", "\tSimilarity Threshold: 0.63\n", "Runtime: 0.0330 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 83.41% \n", "\tRecall: 67.29%\n", "\tF1-score: 74.49%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Classification report:\n", "\tTrue positives: 724\n", "\tFalse positives: 144\n", "\tTrue negatives: 1156348\n", "\tFalse negatives: 352\n", "\tTotal comparisons: 868\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] }, { "data": { "text/plain": [ "{'Precision %': 83.41013824884793,\n", " 'Recall %': 67.28624535315984,\n", " 'F1 %': 74.48559670781893,\n", " 'True Positives': 724,\n", " 'False Positives': 144,\n", " 'True Negatives': 1156348,\n", " 'False Negatives': 352}" ] }, "execution_count": 8, 
"metadata": {}, "output_type": "execute_result" } ], "source": [ "ccc.evaluate(clusters, with_classification_report=True)" ] }, { "cell_type": "markdown", "id": "f811abf6-e6f0-45ec-b907-85d09b229fda", "metadata": {}, "source": [ "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "Apache License 2.0\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" }, "vscode": { "interpreter": { "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094" } } }, "nbformat": 4, "nbformat_minor": 5 }