{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "\n", "# pyJedAI with PyTorch pre-trained embeddings and FAISS\n", "\n", "
\n", " \n", "
\n" ] }, { "cell_type": "markdown", "id": "34082d2f-2446-46d8-acff-842c16959120", "metadata": {}, "source": [ "## How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "4598710b-6c72-462a-a261-3bd1c5694fe4", "metadata": {}, "outputs": [], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 2, "id": "f6145391-26bd-4e96-ba93-57e986e61047", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: pyjedai\n", "Version: 0.1.0\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis , Jakub Maciejewski , Manolis Koubarakis \n", "License: Apache Software License 2.0\n", "Location: /home/jm/anaconda3/envs/pyjedai-new/lib/python3.8/site-packages\n", "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n", "Required-by: \n" ] } ], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "80dfcdfa-c357-4856-bca4-1ed17b0238d1", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 3, "id": "6db50d83-51d8-4c95-9f27-30ef867338f2", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to /home/jm/nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph\n", "\n", "from pyjedai.evaluation import Evaluation\n", "from pyjedai.datamodel import Data" ] }, { "cell_type": "code", 
"execution_count": 4, "id": "b4398804-8bf6-44a6-b324-e628011c17a8", "metadata": {}, "outputs": [], "source": [ "d1 = pd.read_csv(\"../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False).astype(str)\n", "d2 = pd.read_csv(\"../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False).astype(str)\n", "gt = pd.read_csv(\"../data/ccer/D2/gt.csv\", sep='|', engine='python')\n", "\n", "attr1 = d1.columns[1:].to_list()\n", "attr2 = d2.columns[1:].to_list()\n", "\n", "data = Data(dataset_1=d1,\n", "            attributes_1=attr1,\n", "            id_column_name_1='id',\n", "            dataset_2=d2,\n", "            attributes_2=attr2,\n", "            id_column_name_2='id',\n", "            ground_truth=gt)" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "# Block Building\n", "\n", "## Pre-trained PyTorch & Gensim embeddings\n", "\n", "Available embeddings:\n", "\n", "- Gensim: `{'fasttext', 'glove', 'word2vec'}`\n", "- PyTorch sentence transformers: `{'smpnet', 'st5', 'sdistilroberta', 'sminilm', 'sent_glove'}`\n", "- PyTorch word transformers: `{'bert', 'distilbert', 'roberta', 'xlnet', 'albert'}`\n", "\n", "## FAISS\n", "\n", "`faiss.IndexIVFFlat` implements an inverted file (IVF) index with a coarse quantizer and uncompressed (flat) vector storage. It is used to search efficiently for the nearest neighbors of a query vector in a large dataset of vectors. 
Here's a brief explanation of the parameters used in this index:\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [], "source": [ "from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding" ] }, { "cell_type": "code", "execution_count": 6, "id": "b25e2601-a871-41dd-a48c-bf72382d0f13", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building blocks via Embeddings-NN Block Building [sminilm, faiss]\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2ef190a70d734a2991a614f5ffb8b3dd", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Embeddings-NN Block Building [sminilm, faiss]: 0%| | 0/2152 [00:00\n", "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] }, { "cell_type": "code", "execution_count": null, "id": "3f74144c-b934-4f63-9020-a66e628be7de", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" }, "vscode": { "interpreter": { "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094" } } }, "nbformat": 4, "nbformat_minor": 5 }