{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "# Progressive ER\n", "\n", "__Progressive Clean-Clean Entity Resolution Tutorial.__ In this notebook we present the pyJedAI approach on the well-known Abt-Buy dataset. Clean-Clean ER is the link discovery/deduplication task between two sets of entities.\n" ] }, { "cell_type": "markdown", "id": "9c49d2b7-11b5-40b3-9341-de98608dde13", "metadata": {}, "source": [ "Dataset: __Abt-Buy dataset__ (D1)\n", "\n", "The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1076 entities from abt.com and 1076 entities from buy.com as well as a gold standard (perfect mapping) with 1076 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description and product price." ] }, { "cell_type": "markdown", "id": "744b3017-9a5c-4d3c-8e0a-fe39b069b647", "metadata": {}, "source": [ "## How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more: [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": 1, "id": "029a5825-799d-4c3f-a6cd-a75e257cadcc", "metadata": {}, "outputs": [], "source": [ "!pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": 2, "id": "462695ec-3af1-4048-9971-9ed0bce0f07b", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Name: pyjedai\n", "Version: 0.1.3\n", "Summary: An open-source library that builds powerful end-to-end Entity Resolution workflows.\n", "Home-page: \n", "Author: \n", "Author-email: Konstantinos Nikoletos , George Papadakis , Jakub Maciejewski , Manolis Koubarakis \n", "License: Apache Software License 2.0\n", "Location: c:\\users\\nikol\\anaconda3\\envs\\d31\\lib\\site-packages\n", "Requires: faiss-cpu, gensim, matplotlib, matplotlib-inline, 
networkx, nltk, numpy, optuna, ordered-set, pandas, pandas-profiling, pandocfilters, plotly, py-stringmatching, PyYAML, rdflib, rdfpandas, regex, scipy, seaborn, sentence-transformers, strsim, strsimpy, tomli, tqdm, transformers, valentine\n", "Required-by: \n" ] } ], "source": [ "!pip show pyjedai" ] }, { "cell_type": "markdown", "id": "7b4c62c5-6581-4d2e-9d44-c7c02f43d441", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 3, "id": "6db50d83-51d8-4c95-9f27-30ef867338f2", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph" ] }, { "cell_type": "code", "execution_count": 4, "id": "4d4e6a90-9fd8-4f7a-bf4f-a5b994e0adfb", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "[nltk_data] Downloading package stopwords to\n", "[nltk_data] C:\\Users\\nikol\\AppData\\Roaming\\nltk_data...\n", "[nltk_data] Package stopwords is already up-to-date!\n" ] } ], "source": [ "import pyjedai\n", "from pyjedai.utils import (\n", " text_cleaning_method,\n", " print_clusters,\n", " print_blocks,\n", " print_candidate_pairs\n", ")\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": {}, "source": [ "# Data Reading\n", "\n", "To run, pyJedAI only needs the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Abt-Buy dataset is provided as .csv files. 
\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "e6aabec4-ef4f-4267-8c1e-377054e669d2", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "code", "execution_count": 7, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "d1 = pd.read_csv(\"./../data/ccer/D2/abt.csv\", sep='|', engine='python', na_filter=False)\n", "d2 = pd.read_csv(\"./../data/ccer/D2/buy.csv\", sep='|', engine='python', na_filter=False)\n", "gt = pd.read_csv(\"./../data/ccer/D2/gt.csv\", sep='|', engine='python')\n", "\n", "data = Data(dataset_1=d1,\n", " id_column_name_1='id',\n", " dataset_2=d2,\n", " id_column_name_2='id',\n", " ground_truth=gt)" ] }, { "cell_type": "markdown", "id": "5d8a8a78-858e-4c79-90fe-197a68e95e11", "metadata": {}, "source": [ "pyJedAI also offers dataset analysis methods (more will be developed)." ] }, { "cell_type": "code", "execution_count": 8, "id": "7cb87af2-adda-49e0-82cc-b1a5f7a595ef", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "------------------------- Data -------------------------\n", "Type of Entity Resolution: Clean-Clean\n", "Dataset-1:\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tAttributes: \n", "\t\t ['name', 'description', 'price']\n", "Dataset-2:\n", "\tNumber of entities: 1076\n", "\tNumber of NaN values: 0\n", "\tAttributes: \n", "\t\t ['name', 'description', 'price']\n", "\n", "Total number of entities: 2152\n", "Number of matching pairs in ground-truth: 1076\n", "-------------------------------------------------------- \n", "\n" ] } ], "source": [ "data.print_specs()" ] }, { "cell_type": "code", "execution_count": 9, "id": "b822d7c0-19a2-4050-9554-c35a208bb848", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>id</th><th>name</th><th>description</th><th>price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>Sony Turntable - PSLX350H</td><td>Sony Turntable - PSLX350H/ Belt Drive System/ ...</td><td></td></tr>\n",
" <tr><th>1</th><td>1</td><td>Bose Acoustimass 5 Series III Speaker System -...</td><td>Bose Acoustimass 5 Series III Speaker System -...</td><td>399</td></tr>\n",
" <tr><th>2</th><td>2</td><td>Sony Switcher - SBV40S</td><td>Sony Switcher - SBV40S/ Eliminates Disconnecti...</td><td>49</td></tr>\n",
" <tr><th>3</th><td>3</td><td>Sony 5 Disc CD Player - CDPCE375</td><td>Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change...</td><td></td></tr>\n",
" <tr><th>4</th><td>4</td><td>Bose 27028 161 Bookshelf Pair Speakers In Whit...</td><td>Bose 161 Bookshelf Speakers In White - 161WH/ ...</td><td>158</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id name \\\n", "0 0 Sony Turntable - PSLX350H \n", "1 1 Bose Acoustimass 5 Series III Speaker System -... \n", "2 2 Sony Switcher - SBV40S \n", "3 3 Sony 5 Disc CD Player - CDPCE375 \n", "4 4 Bose 27028 161 Bookshelf Pair Speakers In Whit... \n", "\n", " description price \n", "0 Sony Turntable - PSLX350H/ Belt Drive System/ ... \n", "1 Bose Acoustimass 5 Series III Speaker System -... 399 \n", "2 Sony Switcher - SBV40S/ Eliminates Disconnecti... 49 \n", "3 Sony 5 Disc CD Player- CDPCE375/ 5 Disc Change... \n", "4 Bose 161 Bookshelf Speakers In White - 161WH/ ... 158 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_1.head(5)" ] }, { "cell_type": "code", "execution_count": 10, "id": "5c26b595-5e02-4bfc-8e79-e476ab2830ef", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>id</th><th>name</th><th>description</th><th>price</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>0</td><td>Linksys EtherFast EZXS88W Ethernet Switch - EZ...</td><td>Linksys EtherFast 8-Port 10/100 Switch (New/Wo...</td><td></td></tr>\n",
" <tr><th>1</th><td>1</td><td>Linksys EtherFast EZXS55W Ethernet Switch</td><td>5 x 10/100Base-TX LAN</td><td></td></tr>\n",
" <tr><th>2</th><td>2</td><td>Netgear ProSafe FS105 Ethernet Switch - FS105NA</td><td>NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw...</td><td></td></tr>\n",
" <tr><th>3</th><td>3</td><td>Belkin Pro Series High Integrity VGA/SVGA Moni...</td><td>1 x HD-15 - 1 x HD-15 - 10ft - Beige</td><td></td></tr>\n",
" <tr><th>4</th><td>4</td><td>Netgear ProSafe JFS516 Ethernet Switch</td><td>Netgear ProSafe 16 Port 10/100 Rackmount Switc...</td><td></td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " id name \\\n", "0 0 Linksys EtherFast EZXS88W Ethernet Switch - EZ... \n", "1 1 Linksys EtherFast EZXS55W Ethernet Switch \n", "2 2 Netgear ProSafe FS105 Ethernet Switch - FS105NA \n", "3 3 Belkin Pro Series High Integrity VGA/SVGA Moni... \n", "4 4 Netgear ProSafe JFS516 Ethernet Switch \n", "\n", " description price \n", "0 Linksys EtherFast 8-Port 10/100 Switch (New/Wo... \n", "1 5 x 10/100Base-TX LAN \n", "2 NETGEAR FS105 Prosafe 5 Port 10/100 Desktop Sw... \n", "3 1 x HD-15 - 1 x HD-15 - 10ft - Beige \n", "4 Netgear ProSafe 16 Port 10/100 Rackmount Switc... " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dataset_2.head(5)" ] }, { "cell_type": "code", "execution_count": 11, "id": "b3c9827e-a08a-47b2-a7f2-6f3f72184a17", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
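<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\"><th></th><th>D1</th><th>D2</th></tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr><th>0</th><td>206</td><td>216</td></tr>\n",
" <tr><th>1</th><td>60</td><td>46</td></tr>\n",
" <tr><th>2</th><td>182</td><td>160</td></tr>\n",
" </tbody>\n",
"</table>\n",
"</div>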
" ], "text/plain": [ " D1 D2\n", "0 206 216\n", "1 60 46\n", "2 182 160" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.ground_truth.head(3)" ] }, { "cell_type": "markdown", "id": "19891fc5-960e-4df1-a72a-e4533a74a761", "metadata": {}, "source": [ "### Data cleaning step (optional)\n", "\n", "pyJedAI offers 4 types of text cleaning/processing. \n", "\n", "- Stopwords removal\n", "- Punctuation removal\n", "- Numbers removal\n", "- Unicodes removal" ] }, { "cell_type": "code", "execution_count": 12, "id": "e471e48c-c882-4c74-b94f-4c1dff9fa36c", "metadata": {}, "outputs": [], "source": [ "data.clean_dataset(remove_stopwords = False, \n", " remove_punctuation = False, \n", " remove_numbers = False,\n", " remove_unicodes = False)" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "# Block Building\n", "\n", "It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. Blocks are then extracted, possibly using a transformation, based on its equality or on its similarity with other keys.\n", "\n", "The following methods are currently supported:\n", "\n", "- Standard/Token Blocking\n", "- Sorted Neighborhood\n", "- Extended Sorted Neighborhood\n", "- Q-Grams Blocking\n", "- Extended Q-Grams Blocking\n", "- Suffix Arrays Blocking\n", "- Extended Suffix Arrays Blocking" ] }, { "cell_type": "code", "execution_count": 13, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/conda/miniconda3/envs/pyjedai-progressive/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. 
See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from pyjedai.block_building import (\n", " StandardBlocking,\n", " QGramsBlocking,\n", " ExtendedQGramsBlocking,\n", " SuffixArraysBlocking,\n", " ExtendedSuffixArraysBlocking,\n", ")" ] }, { "cell_type": "code", "execution_count": 14, "id": "9741f0c4-6250-455f-9c88-b8dc61ab7d4d", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Standard Blocking: 100%|██████████| 2152/2152 [00:00<00:00, 21279.99it/s]\n" ] } ], "source": [ "bb = StandardBlocking()\n", "blocks = bb.build_blocks(data, attributes_1=['name', 'description'], attributes_2=['name', 'description'])" ] }, { "cell_type": "code", "execution_count": 15, "id": "36fcbb3f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Standard Blocking\n", "Method info: Creates one block for every token in the attribute values of at least two entities.\n", "Parameters: Parameter-Free method\n", "Attributes from D1:\n", "\tname, description\n", "Attributes from D2:\n", "\tname, description\n", "Runtime: 0.1018 seconds\n" ] } ], "source": [ "bb.report()" ] }, { "cell_type": "code", "execution_count": 16, "id": "ce797c53", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Standard Blocking\n", "***************************************************************************************************************************\n", "Method name: Standard Blocking\n", "Parameters: \n", "Runtime: 0.1018 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 0.12% \n", "\tRecall: 99.81%\n", "\tF1-score: 0.25%\n", 
"───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Classification report:\n", "\tTrue positives: 1074\n", "\tFalse positives: 874536\n", "\tTrue negatives: 1156698\n", "\tFalse negatives: 2\n", "\tTotal comparisons: 875610\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bb.evaluate(blocks, with_classification_report=True)" ] }, { "cell_type": "markdown", "id": "eeb516cf-43cd-4f02-88f8-6dfef7c5e20e", "metadata": {}, "source": [ "## Block Purging\n", "\n", "__Optional step__\n", "\n", "Discards the blocks exceeding a certain number of comparisons. \n" ] }, { "cell_type": "code", "execution_count": 17, "id": "725426e2-0af8-4295-baff-92653c841fdd", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_cleaning import BlockPurging" ] }, { "cell_type": "code", "execution_count": 18, "id": "7997b2b6-9629-44f0-a66d-5bc4fea28fb6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Block Purging: 100%|██████████| 4096/4096 [00:00<00:00, 838778.89it/s]\n" ] } ], "source": [ "bp = BlockPurging()\n", "cleaned_blocks = bp.process(blocks, data, tqdm_disable=False)" ] }, { "cell_type": "code", "execution_count": 19, "id": "88888fdc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Block Purging\n", "Method info: Discards the blocks exceeding a certain number of comparisons.\n", "Parameters: \n", "\tSmoothing factor: 1.025\n", "\tMax Comparisons per Block: 11845.0\n", "Runtime: 0.0060 seconds\n" ] } ], "source": [ "bp.report()" ] }, { "cell_type": "code", "execution_count": 20, "id": "4ff69547", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " 
Μethod: Block Purging\n", "***************************************************************************************************************************\n", "Method name: Block Purging\n", "Parameters: \n", "\tSmoothing factor: 1.025\n", "\tMax Comparisons per Block: 11845.0\n", "Runtime: 0.0060 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 0.26% \n", "\tRecall: 99.81%\n", "\tF1-score: 0.52%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bp.evaluate(cleaned_blocks)" ] }, { "cell_type": "markdown", "id": "9f9e77d5-c906-431a-bdc7-68dc9c00cc31", "metadata": { "tags": [] }, "source": [ "## Block Cleaning\n", "\n", "___Optional step___\n", "\n", "Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities." 
] }, { "cell_type": "code", "execution_count": 21, "id": "9c2c0e42-485a-444e-9161-975f30d21a02", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_cleaning import BlockFiltering" ] }, { "cell_type": "code", "execution_count": 22, "id": "bf5c20ac-b16a-484d-82b0-61ecb9e7f3ea", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Block Filtering: 100%|██████████| 3/3 [00:00<00:00, 94.76it/s]\n" ] } ], "source": [ "bf = BlockFiltering(ratio=0.8)\n", "filtered_blocks = bf.process(cleaned_blocks, data, tqdm_disable=False)" ] }, { "cell_type": "code", "execution_count": 23, "id": "cdde6eff", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Μethod: Block Filtering\n", "***************************************************************************************************************************\n", "Method name: Block Filtering\n", "Parameters: \n", "\tRatio: 0.8\n", "Runtime: 0.0326 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 0.68% \n", "\tRecall: 99.26%\n", "\tF1-score: 1.35%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] }, { "data": { "text/plain": [ "{'Precision %': 0.6797958066528331,\n", " 'Recall %': 99.25650557620817,\n", " 'F1 %': 1.3503432754674995,\n", " 'True Positives': 1068,\n", " 'False Positives': 156038,\n", " 'True Negatives': 1156692,\n", " 'False Negatives': 8}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "bf.evaluate(filtered_blocks)" ] }, { "cell_type": "markdown", "id": "9cd12048-bd0c-4571-ba70-488d46afcdd6", "metadata": {}, "source": [ "## Progressive Entity Resolution\n", "\n", 
"___Scheduling + Emission + Matching___\n", "\n", "Progressive Entity Resolution (PER) consists of the above three stages. Specifically:\n", "\n", "**1. Scheduling -** This step is similar to Comparison Cleaning. We extract a subset of the original fully connected dataset in which each entity could be a duplicate candidate for any other entity. This is done by deriving neighborhoods for each entity, which contain its duplicate candidates.\n", "\n", "**2. Emission -** We iterate over the previously derived neighborhoods following a wide variety of algorithms (BFS, DFS, Hybrid etc.) and we extract the final candidate pairs. The number of emissions is limited by our *budget*.\n", "\n", "**3. Matching -** The candidate pairs are evaluated on the premise of being true duplicates. PER methods allow for the calculation of cumulative recall and as a result give us the possibility of deriving AUCs and plotting ROCs for different budget limitations.\n", "\n", "The following workflows are currently supported:\n", "\n", "\n", "* **NN workflows -**\n", "Progressive Vector Based BB (EmbeddingsNNBPM)\n", "\n", "* **Join workflows -**\n", "Base/Vector Based Progressive TopKJoin (TopKJoinPM)\n", "\n", "* **MB (Hash Based) workflows -**\n", "Progressive CEP (GlobalTopPM), \n", "Progressive CNP (LocalTopPM)\n", "\n", "* **Sorted Neighborhood workflows -**\n", "Global Progressive Sorted Neighborhood (GlobalPSNM), \n", "Local Progressive Sorted Neighborhood (LocalPSNM)\n", "\n", "* **Scheduling workflows -**\n", "Progressive Entity Scheduling (PESM)" ] }, { "cell_type": "code", "execution_count": 24, "id": "d301c271-eee3-46cd-8793-d1672f8cdc61", "metadata": {}, "outputs": [], "source": [ "from pyjedai.prioritization import (\n", " GlobalTopPM,\n", " LocalTopPM,\n", " EmbeddingsNNBPM,\n", " GlobalPSNM,\n", " LocalPSNM,\n", " RandomPM,\n", " PESM,\n", " TopKJoinPM\n", ")" ] }, { "cell_type": "code", "execution_count": 25, "id": "0947f01f", "metadata": {}, "outputs": [], "source": [ "# 
Maximum number of candidate pair emissions that can be passed to matching\n", "BUDGET=10000\n", "# Emission Algorithm (DFS/BFS/HB/TOP)\n", "ALGORITHM=\"BFS\"\n", "# Identification Context - defines which dataset is the source and which is the target (inorder/reverse/bilateral)\n", "# Non-inorder indexing makes sense only in the context of NN and Join PER workflows\n", "# The other workflows conduct entity identification in both dataset directions\n", "INDEXING=\"inorder\"" ] }, { "cell_type": "markdown", "id": "411abad8", "metadata": {}, "source": [ "### NN PER (Vector Based)" ] }, { "cell_type": "code", "execution_count": null, "id": "4cf2b9b5", "metadata": {}, "outputs": [], "source": [ "ennbpm = EmbeddingsNNBPM(language_model=\"sminilm\",\n", " number_of_nearest_neighbors=10,\n", " similarity_search=\"faiss\",\n", " similarity_function=\"euclidean\",\n", " similarity_threshold=0.0\n", " )\n", "\n", "# NN PER workflows don't require blocks in order to define neighborhoods\n", "# Entities are vectorized and a similarity function is applied (e.g. 
faiss)\n", "# In an attempt to cluster similar entities into neighborhoods\n", "ennbpm_candidates = ennbpm.predict(data=data,\n", " blocks=None,\n", " budget=BUDGET,\n", " algorithm=ALGORITHM,\n", " indexing=INDEXING\n", " )\n", "\n" ] }, { "cell_type": "markdown", "id": "327b689d", "metadata": {}, "source": [ "### JN PER (Join)" ] }, { "cell_type": "code", "execution_count": 27, "id": "803b809a", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Top-K Join Progressive Matching: 0%| | 0/2048 [00:00" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 183 ms, sys: 112 ms, total: 294 ms\n", "Wall time: 176 ms\n" ] } ], "source": [ "%%time\n", "progressive_matchers_evaluator = Evaluation(data)\n", "progressive_matchers_evaluator.evaluate_auc_roc(matchers = matchers, proportional = False)" ] }, { "cell_type": "code", "execution_count": 32, "id": "608e466d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NN WORKFLOW:\n", "Total Emissions: 10000\n", "Cumulative Recall: 0.9684014869888475\n", "Normalized AUC: 0.8500531917068019\n" ] } ], "source": [ "print(\"NN WORKFLOW:\")\n", "\n", "print(f'Total Emissions: {ennbpm.get_total_emissions()}')\n", "print(f'Cumulative Recall: {ennbpm.get_cumulative_recall()}')\n", "print(f'Normalized AUC: {ennbpm.get_normalized_auc()}')" ] }, { "cell_type": "code", "execution_count": 33, "id": "c6800626", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "JOIN WORKFLOW:\n", "Total Emissions: 10000\n", "Cumulative Recall: 0.9693308550185874\n", "Normalized AUC: 0.8705726081666761\n" ] } ], "source": [ "print(\"JOIN WORKFLOW:\")\n", "\n", "print(f'Total Emissions: {tkjpm.get_total_emissions()}')\n", "print(f'Cumulative Recall: {tkjpm.get_cumulative_recall()}')\n", "print(f'Normalized AUC: {tkjpm.get_normalized_auc()}')" ] }, { "cell_type": "code", "execution_count": 
34, "id": "352a3cb1", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MB WORKFLOW:\n", "Total Emissions: 10000\n", "Cumulative Recall: 0.9163568773234201\n", "Normalized AUC: 0.8116456941666359\n" ] } ], "source": [ "print(\"MB WORKFLOW:\")\n", "\n", "print(f'Total Emissions: {ltpm.get_total_emissions()}')\n", "print(f'Cumulative Recall: {ltpm.get_cumulative_recall()}')\n", "print(f'Normalized AUC: {ltpm.get_normalized_auc()}')" ] }, { "cell_type": "code", "execution_count": 35, "id": "edd29dbc", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SN WORKFLOW:\n", "Total Emissions: 10000\n", "Cumulative Recall: 0.7620817843866171\n", "Normalized AUC: 0.6066398007039242\n" ] } ], "source": [ "print(\"SN WORKFLOW:\")\n", "\n", "print(f'Total Emissions: {gpsnm.get_total_emissions()}')\n", "print(f'Cumulative Recall: {gpsnm.get_cumulative_recall()}')\n", "print(f'Normalized AUC: {gpsnm.get_normalized_auc()}')" ] }, { "cell_type": "markdown", "id": "315369d8-6564-44d4-aea0-14034b54cf16", "metadata": {}, "source": [ "
\n", "
\n", "J. Maciejewski, K. Nikoletos, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" }, "vscode": { "interpreter": { "hash": "824e5f4123a1a5b690f910010b2896a5dc6379151ca1c56e0c0465c15ebbd094" } } }, "nbformat": 4, "nbformat_minor": 5 }