{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "96ec678e-b20c-4213-8616-542010f46342",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "# pyJedAI with pyTorch pre-trained embeddings and FAISS\n",
    "\n",
    "<div align=\"center\"> \n",
    "    <img src=\"https://cdn.icon-icons.com/icons2/2699/PNG/512/pytorch_logo_icon_169823.png\" style=\"width: 250px;\"/>\n",
    "</div>\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34082d2f-2446-46d8-acff-842c16959120",
   "metadata": {},
   "source": [
    "## How to install?\n",
    "\n",
    "pyJedAI is an open-source library that can be installed from PyPI.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4598710b-6c72-462a-a261-3bd1c5694fe4",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install pyjedai -U"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f6145391-26bd-4e96-ba93-57e986e61047",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Name: pyjedai\n",
      "Version: 0.1.0\n",
      "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n",
      "Home-page: \n",
      "Author: \n",
      "Author-email: Konstantinos Nikoletos <nikoletos.kon@gmail.com>, George Papadakis <gpapadis84@gmail.com>, Jakub Maciejewski <jacobb.maciejewski@gmail.com>, Manolis Koubarakis <koubarak@di.uoa.gr>\n",
      "License: Apache Software License 2.0\n",
      "Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages\n",
      "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n",
      "Required-by: \n"
     ]
    }
   ],
   "source": [
    "!pip show pyjedai"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "80dfcdfa-c357-4856-bca4-1ed17b0238d1",
   "metadata": {},
   "source": [
    "Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "6db50d83-51d8-4c95-9f27-30ef867338f2",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package stopwords to /home/jm/nltk_data...\n",
      "[nltk_data]   Package stopwords is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "import sys\n",
    "import pandas as pd\n",
    "import networkx\n",
    "from networkx import draw, Graph\n",
    "\n",
    "from pyjedai.evaluation import Evaluation\n",
    "from pyjedai.datamodel import Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "b4398804-8bf6-44a6-b324-e628011c17a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "d1 = pd.read_csv(\"../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False).astype(str)\n",
    "d2 = pd.read_csv(\"../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False).astype(str)\n",
    "gt = pd.read_csv(\"../data/ccer/D2/gt.csv\", sep='|', engine='python')\n",
    "\n",
    "attr1 = d1.columns[1:].to_list()\n",
    "attr2 = d2.columns[1:].to_list()\n",
    "\n",
    "data = Data(dataset_1=d1,\n",
    "            attributes_1=attr1,\n",
    "            id_column_name_1='id',\n",
    "            dataset_2=d2,\n",
    "            attributes_2=attr2,\n",
    "            id_column_name_2='id',\n",
    "            ground_truth=gt)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c068252-4a69-405a-a320-c2875ec08ea5",
   "metadata": {},
   "source": [
    "# Block Building\n",
    "\n",
    "## Pre-trained pyTorch & GENSIM embeddings\n",
    "\n",
    "Available embeddings:\n",
    "\n",
    "- Gensim: `{ 'fasttext', 'glove', 'word2vec'}`\n",
    "- pyTorch Sentence transformers : `{'smpnet','st5','sdistilroberta','sminilm','sent_glove'}`\n",
    "- pyTorch Word transformers :`{'bert', 'distilbert', 'roberta', 'xlnet', 'albert'}`\n",
    "\n",
    "## FAISS\n",
    "\n",
    "faiss.IndexIVFFlat is an implementation of an inverted file index with coarse quantization. This index is used to efficiently search for nearest neighbors of a query vector in a large dataset of vectors. Here's a brief explanation of the parameters used in this index:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "9c1b6213-a218-40cf-bc72-801b77d28da9",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b25e2601-a871-41dd-a48c-bf72382d0f13",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Building blocks via Embeddings-NN Block Building [sminilm, faiss]\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "2ef190a70d734a2991a614f5ffb8b3dd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Embeddings-NN Block Building [sminilm, faiss]:   0%|          | 0/2152 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Device selected:  cpu\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "ff5bf46b24354720a680a3f8136f4c80",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)5dded/.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "4ab5ad6038ad4603a6d313ddb5f5dcbb",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e5b1b695c1dd4c85866d780830141a86",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)4d81d5dded/README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "6b4f4d389ec143288371a6ab4f8db9bf",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)81d5dded/config.json:   0%|          | 0.00/573 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f0b3cc8627174dc492977a12e607b3e6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "3658fe1b1f4b4a158a906de980aaec1e",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)ded/data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "593df076d3fc4b59a46f8faa6a79b3c6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading pytorch_model.bin:   0%|          | 0.00/134M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "24742cdfd8e841acbd5835b9a1fb0ee4",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "171c9b926022495f903aebf2003c0b3f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "f06686557dcf486985b562243c6755db",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)5dded/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "971d0562bc484d56ae58f9d75d39701f",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)okenizer_config.json:   0%|          | 0.00/352 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "954473e8c9e145a39b49c3a1ece9fca6",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)dded/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "07b7c69ab6214083b323acd988b677bd",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)4d81d5dded/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "e8ed944a207043229f2ff40a8563d1b3",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)1d5dded/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',\n",
    "                                similarity_search='faiss')\n",
    "\n",
    "blocks, g = emb.build_blocks(data,\n",
    "                             top_k=5,\n",
    "                             similarity_distance='euclidean',\n",
    "                             load_embeddings_if_exist=False,\n",
    "                             save_embeddings=False,\n",
    "                             with_entity_matching=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "661ddf8a-537c-4935-a159-4cfb7bb2325a",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "***************************************************************************************************************************\n",
      "                                         Μethod:  Embeddings-NN Block Building\n",
      "***************************************************************************************************************************\n",
      "Method name: Embeddings-NN Block Building\n",
      "Parameters: \n",
      "\tVectorizer: sminilm\n",
      "\tSimilarity-Search: faiss\n",
      "\tTop-K: 5\n",
      "\tVector size: 384\n",
      "Runtime: 157.0182 seconds\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
      "Performance:\n",
      "\tPrecision:      9.38% \n",
      "\tRecall:        93.77%\n",
      "\tF1-score:      17.05%\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
      "Classification report:\n",
      "\tTrue positives: 1009\n",
      "\tFalse positives: 9751\n",
      "\tTrue negatives: 1156633\n",
      "\tFalse negatives: 67\n",
      "\tTotal comparisons: 10760\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
      "Statistics:\n",
      " FAISS:\n",
      "\tIndices shape returned after search: (1076, 5)\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'Precision %': 9.37732342007435,\n",
       " 'Recall %': 93.77323420074349,\n",
       " 'F1 %': 17.049678945589726,\n",
       " 'True Positives': 1009,\n",
       " 'False Positives': 9751,\n",
       " 'True Negatives': 1156633,\n",
       " 'False Negatives': 67}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "emb.evaluate(blocks, with_classification_report=True, with_stats=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb",
   "metadata": {},
   "source": [
    "# Entity Clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7",
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyjedai.clustering import ConnectedComponentsClustering, UniqueMappingClustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9",
   "metadata": {},
   "outputs": [],
   "source": [
    "ccc = UniqueMappingClustering()\n",
    "clusters = ccc.process(g, data, similarity_threshold=0.63)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "22ced5c0-0e3c-4a09-a32a-19abbbe2e8d7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "***************************************************************************************************************************\n",
      "                                         Μethod:  Unique Mapping Clustering\n",
      "***************************************************************************************************************************\n",
      "Method name: Unique Mapping Clustering\n",
      "Parameters: \n",
      "Runtime: 0.1209 seconds\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
      "Performance:\n",
      "\tPrecision:     83.41% \n",
      "\tRecall:        67.29%\n",
      "\tF1-score:      74.49%\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n",
      "Classification report:\n",
      "\tTrue positives: 724\n",
      "\tFalse positives: 144\n",
      "\tTrue negatives: 1156348\n",
      "\tFalse negatives: 352\n",
      "\tTotal comparisons: 868\n",
      "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'Precision %': 83.41013824884793,\n",
       " 'Recall %': 67.28624535315984,\n",
       " 'F1 %': 74.48559670781893,\n",
       " 'True Positives': 724,\n",
       " 'False Positives': 144,\n",
       " 'True Negatives': 1156348,\n",
       " 'False Negatives': 352}"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ccc.evaluate(clusters, with_classification_report=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f811abf6-e6f0-45ec-b907-85d09b229fda",
   "metadata": {},
   "source": [
    "<hr>\n",
    "<div align=\"right\">\n",
    "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n",
    "</div>\n",
    "<div align=\"right\">\n",
    "<a href=\"https://github.com/Nikoletos-K/pyJedAI/blob/main/LICENSE\">Apache License 2.0</a>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f74144c-b934-4f63-9020-a66e628be7de",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "vscode": {
   "interpreter": {
    "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}