{ "cells": [ { "cell_type": "markdown", "id": "96ec678e-b20c-4213-8616-542010f46342", "metadata": {}, "source": [ "\n", "# Dirty ER\n", "\n", "In this notebook we present the pyJedAI approach on the well-known Cora dataset. Dirty ER is the process of deduplicating a single set of entities." ] }, { "cell_type": "markdown", "id": "5274855c-ba95-49b1-ba68-4f50ca2bbd89", "metadata": {}, "source": [ "# How to install?\n", "\n", "pyJedAI is an open-source library that can be installed from PyPI.\n", "\n", "For more: [pypi.org/project/pyjedai/](https://pypi.org/project/pyjedai/)" ] }, { "cell_type": "code", "execution_count": null, "id": "4697d149-c1a4-4767-9ed1-14444485e409", "metadata": {}, "outputs": [], "source": [ "!python --version" ] }, { "cell_type": "code", "execution_count": null, "id": "776843a2-570d-4d87-bb1d-b5b61f07b1da", "metadata": {}, "outputs": [], "source": [ "%pip install pyjedai -U" ] }, { "cell_type": "code", "execution_count": null, "id": "6d2e5cf7-ff2e-4271-9242-fe3d638263e9", "metadata": {}, "outputs": [], "source": [ "%pip show pyjedai" ] }, { "cell_type": "markdown", "id": "15d28272-269a-4e87-bb03-0a45a5492a06", "metadata": {}, "source": [ "Imports" ] }, { "cell_type": "code", "execution_count": 1, "id": "a0890ce6-3a10-4e66-913f-78095bd786a1", "metadata": {}, "outputs": [], "source": [ "import os\n", "import sys\n", "import pandas as pd\n", "import networkx\n", "from networkx import draw, Graph\n", "\n", "from pyjedai.utils import print_clusters, print_blocks, print_candidate_pairs\n", "from pyjedai.evaluation import Evaluation" ] }, { "cell_type": "markdown", "id": "af77914f-5e76-4da8-a0ad-1c53e0111a0f", "metadata": { "tags": [] }, "source": [ "## Reading the dataset\n", "\n", "To run, pyJedAI only needs the initial data transformed into a pandas DataFrame. Hence, pyJedAI can work with any structured or semi-structured data. In this case, the Cora dataset is provided as .csv files.\n", "\n", "
\n", " \n", "
\n", "\n", "\n", "### pyjedai module\n", "\n", "The Data module offers a number of options:\n", "- Selecting the parameters (columns) of the dataframe, in D1 (and in D2)\n", "- Printing a detailed text analysis\n", "- Storing a hidden mapping of the ids, creating it if it does not exist." ] }, { "cell_type": "code", "execution_count": 2, "id": "3d3feb89-1406-4c90-a1aa-dc2cf4707739", "metadata": {}, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "\n", "d1 = pd.read_csv(\"./../data/der/cora/cora.csv\", sep='|')\n", "gt = pd.read_csv(\"./../data/der/cora/cora_gt.csv\", sep='|', header=None)\n", "attr = ['author', 'title']" ] }, { "cell_type": "markdown", "id": "fda32323-c74d-4374-b322-5c11a175c3ea", "metadata": {}, "source": [ "Data is the connecting module of all steps of the workflow." ] }, { "cell_type": "code", "execution_count": 3, "id": "e257597d-ea77-4090-ba34-e1038d8f9a0d", "metadata": {}, "outputs": [], "source": [ "data = Data(\n", " dataset_1=d1,\n", " id_column_name_1='Entity Id',\n", " ground_truth=gt,\n", " attributes_1=attr,\n", " dataset_name_1=\"CORA\"\n", ")" ] }, { "cell_type": "markdown", "id": "93464edd-b88a-40d9-aa4d-7fe1523db662", "metadata": {}, "source": [ "## Workflow with Block Cleaning Methods\n", "\n", "In this notebook we build the architecture below:\n", "\n", "![workflow1-cora.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/docs/img/workflow1-cora.png?raw=true)\n", "\n" ] }, { "cell_type": "markdown", "id": "9c068252-4a69-405a-a320-c2875ec08ea5", "metadata": {}, "source": [ "## Block Building\n", "\n", "It clusters entities into overlapping blocks in a lazy manner that relies on unsupervised blocking keys: every token in an attribute value forms a key. 
Blocks are then extracted, possibly using a transformation, based on its equality or on its similarity with other keys.\n", "\n", "The following methods are currently supported:\n", "\n", "- Standard/Token Blocking\n", "- Sorted Neighborhood\n", "- Extended Sorted Neighborhood\n", "- Q-Grams Blocking\n", "- Extended Q-Grams Blocking\n", "- Suffix Arrays Blocking\n", "- Extended Suffix Arrays Blocking" ] }, { "cell_type": "code", "execution_count": 4, "id": "9c1b6213-a218-40cf-bc72-801b77d28da9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/conda/miniconda3/envs/pypi_dependencies/lib/python3.9/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "from pyjedai.block_building import (\n", " StandardBlocking,\n", " QGramsBlocking,\n", " SuffixArraysBlocking,\n", " ExtendedSuffixArraysBlocking,\n", " ExtendedQGramsBlocking\n", ")" ] }, { "cell_type": "code", "execution_count": 5, "id": "7ee34038-1352-440e-8c34-98c5cf036523", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Suffix Arrays Blocking: 100%|██████████| 1295/1295 [00:00<00:00, 7419.94it/s]\n" ] } ], "source": [ "bb = SuffixArraysBlocking(suffix_length=2)\n", "blocks = bb.build_blocks(data)" ] }, { "cell_type": "code", "execution_count": 6, "id": "b0ac846d-0f13-4b90-b4c8-688054ed7ffe", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Suffix Arrays Blocking\n", "***************************************************************************************************************************\n", "Method name: Suffix Arrays Blocking\n", "Parameters: \n", "\tSuffix length: 2\n", 
"\tMaximum Block Size: 53\n", "Runtime: 0.1759 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 4.40% \n", "\tRecall: 75.75%\n", "\tF1-score: 8.31%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bb.evaluate(blocks)" ] }, { "cell_type": "markdown", "id": "9fd79ed0-4073-4789-9f9c-09e61da1f4ef", "metadata": {}, "source": [ "## Block Purging\n", "\n", "__Optional step__\n", "\n", "Discards the blocks exceeding a certain number of comparisons. \n" ] }, { "cell_type": "code", "execution_count": 7, "id": "ca78a044-589d-48c1-b508-72d5d3205a1c", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_cleaning import BlockPurging" ] }, { "cell_type": "code", "execution_count": 8, "id": "56f77f40-1c76-4f03-b0dd-5bb592f586e0", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Block Purging: 100%|██████████| 2842/2842 [00:00<00:00, 127747.12it/s]\n" ] } ], "source": [ "bp = BlockPurging()\n", "cleaned_blocks = bp.process(blocks, data, tqdm_disable=False)" ] }, { "cell_type": "code", "execution_count": 9, "id": "9ad1950f-9c89-484e-926e-54892dcd93d6", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Method name: Block Purging\n", "Method info: Discards the blocks exceeding a certain number of comparisons.\n", "Parameters: \n", "\tSmoothing factor: 1.025\n", "\tMax Comparisons per Block: 1378.0\n", "Runtime: 0.0283 seconds\n" ] } ], "source": [ "bp.report()" ] }, { "cell_type": "code", "execution_count": 10, "id": "e3a7cbdf-7ed5-499d-b8da-aa7da3167660", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: 
Block Purging\n", "***************************************************************************************************************************\n", "Method name: Block Purging\n", "Parameters: \n", "\tSmoothing factor: 1.025\n", "\tMax Comparisons per Block: 1378.0\n", "Runtime: 0.0283 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 4.40% \n", "\tRecall: 75.75%\n", "\tF1-score: 8.31%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bp.evaluate(cleaned_blocks)" ] }, { "cell_type": "markdown", "id": "62b067d8-fbc4-4689-ba72-7ab5baf88a70", "metadata": { "tags": [] }, "source": [ "## Block Cleaning\n", "\n", "___Optional step___\n", "\n", "Its goal is to clean a set of overlapping blocks from unnecessary comparisons, which can be either redundant (i.e., repeated comparisons that have already been executed in a previously examined block) or superfluous (i.e., comparisons that involve non-matching entities). Its methods operate on the coarse level of individual blocks or entities." 
] }, { "cell_type": "code", "execution_count": 11, "id": "9c2c0e42-485a-444e-9161-975f30d21a02", "metadata": {}, "outputs": [], "source": [ "from pyjedai.block_cleaning import BlockFiltering" ] }, { "cell_type": "code", "execution_count": 12, "id": "bf5c20ac-b16a-484d-82b0-61ecb9e7f3ea", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Block Filtering: 100%|██████████| 3/3 [00:00<00:00, 51.87it/s]\n" ] } ], "source": [ "bc = BlockFiltering(ratio=0.9)\n", "blocks = bc.process(blocks, data)" ] }, { "cell_type": "code", "execution_count": 13, "id": "25fd0be0-91c3-4d0b-b596-c66dccba3c79", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Block Filtering\n", "***************************************************************************************************************************\n", "Method name: Block Filtering\n", "Parameters: \n", "\tRatio: 0.9\n", "Runtime: 0.0615 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 5.21% \n", "\tRecall: 74.08%\n", "\tF1-score: 9.73%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = bc.evaluate(blocks)" ] }, { "cell_type": "markdown", "id": "9cd12048-bd0c-4571-ba70-488d46afcdd6", "metadata": {}, "source": [ "## Comparison Cleaning - Meta Blocking\n", "\n", "___Optional step___\n", "\n", "Similar to Block Cleaning, this step aims to clean a set of blocks from both redundant and superfluous comparisons. 
Unlike Block Cleaning, its methods operate on the finer granularity of individual comparisons.\n", "\n", "The following methods are currently supported:\n", "\n", "- Comparison Propagation\n", "- Cardinality Edge Pruning (CEP)\n", "- Cardinality Node Pruning (CNP)\n", "- Weighted Edge Pruning (WEP)\n", "- Weighted Node Pruning (WNP)\n", "- Reciprocal Cardinality Node Pruning (ReCNP)\n", "- Reciprocal Weighted Node Pruning (ReWNP)\n", "- BLAST\n", "\n", "Most of these methods are Meta-blocking techniques. All methods are optional but competitive, in the sense that only one of them can be part of an ER workflow. For more details on the functionality of these methods, see the pyJedAI documentation. They can be combined with one of the following weighting schemes:\n", "\n", "- Aggregate Reciprocal Comparisons Scheme (ARCS)\n", "- Common Blocks Scheme (CBS)\n", "- Enhanced Common Blocks Scheme (ECBS)\n", "- Jaccard Scheme (JS)\n", "- Enhanced Jaccard Scheme (EJS)" ] }, { "cell_type": "code", "execution_count": 14, "id": "1f7d75f3-6bed-482d-a572-c3b4927236a5", "metadata": {}, "outputs": [], "source": [ "from pyjedai.comparison_cleaning import (\n", " WeightedEdgePruning,\n", " WeightedNodePruning,\n", " CardinalityEdgePruning,\n", " CardinalityNodePruning,\n", " BLAST,\n", " ReciprocalCardinalityNodePruning,\n", " ComparisonPropagation\n", ")" ] }, { "cell_type": "code", "execution_count": 15, "id": "c92e0ca3-5591-4620-b3f4-012a23637416", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Weighted Edge Pruning: 0%| | 0/1295 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(pairs_graph)" ] }, { "cell_type": "code", "execution_count": 20, "id": "00bc2e82-9bc1-4119-b8cb-4a1c18afee19", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Entity Matching\n", 
"***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tMetric: jaccard\n", "\tAttributes: None\n", "\tSimilarity threshold: 0.0\n", "\tTokenizer: white_space_tokenizer\n", "\tVectorizer: None\n", "\tQgrams: 1\n", "Runtime: 1.8509 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 77.91% \n", "\tRecall: 43.85%\n", "\tF1-score: 56.12%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = em.evaluate(pairs_graph)" ] }, { "cell_type": "markdown", "id": "30a4c55c", "metadata": {}, "source": [ "### Experimenting with the attributes selected in Matching step" ] }, { "cell_type": "markdown", "id": "316ec67f", "metadata": {}, "source": [ "Giving a `list` of attributes (subset of initial), the user can experiment with the attributes that are selected in the matching step. The user can select the attributes that are used in the matching step, and the attributes that are used in the blocking step." 
] }, { "cell_type": "code", "execution_count": 21, "id": "3c49107c", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Entity Matching (jaccard, white_space_tokenizer): 100%|██████████| 727/727 [00:00<00:00, 864.19it/s] \n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Entity Matching\n", "***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tMetric: jaccard\n", "\tAttributes: ['author']\n", "\tSimilarity threshold: 0.0\n", "\tTokenizer: white_space_tokenizer\n", "\tVectorizer: None\n", "\tQgrams: 1\n", "Runtime: 0.8427 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 77.53% \n", "\tRecall: 41.53%\n", "\tF1-score: 54.09%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "em = EntityMatching(\n", " metric='jaccard',\n", " similarity_threshold=0.0,\n", " attributes=['author']\n", ")\n", "\n", "authors_pairs_graph = em.predict(blocks, data)\n", "_ = em.evaluate(authors_pairs_graph)" ] }, { "cell_type": "markdown", "id": "4f016ac7", "metadata": {}, "source": [ "Giving weights as `dict`. Adding a weight factor to each attribute. 
" ] }, { "cell_type": "code", "execution_count": 22, "id": "da9845bb", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Entity Matching (jaccard, white_space_tokenizer): 11%|█ | 80/727 [00:00<00:00, 786.33it/s]" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Entity Matching (jaccard, white_space_tokenizer): 100%|██████████| 727/727 [00:01<00:00, 477.20it/s]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Entity Matching\n", "***************************************************************************************************************************\n", "Method name: Entity Matching\n", "Parameters: \n", "\tMetric: jaccard\n", "\tAttributes: {'author': 0.2, 'title': 0.8}\n", "\tSimilarity threshold: 0.0\n", "\tTokenizer: white_space_tokenizer\n", "\tVectorizer: None\n", "\tQgrams: 1\n", "Runtime: 1.5248 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 77.91% \n", "\tRecall: 43.85%\n", "\tF1-score: 56.12%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "weights = {\n", " 'author': 0.2,\n", " 'title': 0.8\n", "}\n", "\n", "em = EntityMatching(\n", " metric='jaccard',\n", " similarity_threshold=0.0,\n", " attributes=weights\n", ")\n", "\n", "weights_pairs_graph = em.predict(blocks, data)\n", "_ = em.evaluate(weights_pairs_graph)" ] }, { "cell_type": "markdown", "id": "607ee1fb-bed4-4751-9fa8-429874457f72", "metadata": {}, "source": [ "### How to set a valid similarity threshold?\n", "\n", "Configure similariy threshold with a Grid-Search or with an Optuna search. 
pyJedAI also provides some visualizations of the score distributions.\n", "\n", "For example, with a classic histogram:\n" ] }, { "cell_type": "code", "execution_count": 23, "id": "c04c0482-a0d2-4ebf-b9c8-f6cff1863e5d", "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_all_weights()" ] }, { "cell_type": "markdown", "id": "75b6218c-3983-413e-a904-53bba33ea381", "metadata": {}, "source": [ "Or with a range 0.1 from 0.0 to 1.0 grouping:" ] }, { "cell_type": "code", "execution_count": 24, "id": "3e624fb5-cb48-4081-b90f-0e59adf88d26", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Distribution-% of predicted scores: [0.175783269568814, 1.199462309998966, 1.5303484644814394, 3.732809430255403, 3.898252507496639, 6.545341743356427, 8.168751938786063, 8.334195016027298, 34.8361079516079, 17.79547099576052]\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "em.plot_distribution_of_scores()" ] }, { "cell_type": "markdown", "id": "93b72120-4578-4d5c-a408-a24ee78bf6cb", "metadata": {}, "source": [ "## Entity Clustering\n", "\n", "It takes as input the similarity graph produced by Entity Matching and partitions it into a set of equivalence clusters, with every cluster corresponding to a distinct real-world object." ] }, { "cell_type": "code", "execution_count": 25, "id": "500d2ef7-7017-4dba-bbea-acdba8abf5b7", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering" ] }, { "cell_type": "code", "execution_count": 26, "id": "aebd9329-3a4b-48c9-bd05-c7bd4aed3ca9", "metadata": {}, "outputs": [], "source": [ "ec = ConnectedComponentsClustering()\n", "clusters = ec.process(pairs_graph, data, similarity_threshold=0.3)" ] }, { "cell_type": "code", "execution_count": 27, "id": "3d2aa574", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Connected Components Clustering\n", "***************************************************************************************************************************\n", "Method name: Connected Components Clustering\n", "Parameters: \n", "\tSimilarity Threshold: 0.3\n", "Runtime: 0.0396 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 76.45% \n", "\tRecall: 43.99%\n", "\tF1-score: 55.85%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ec.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "6e44642b-b0d9-4f8d-9fe4-4fca2ad716aa", "metadata": {}, "source": [ "# Workflow with 
Similarity Joins\n", "\n", "In this notebook we build the architecture below:\n", "\n", "![workflow2-cora.png](https://github.com/AI-team-UoA/pyJedAI/blob/main/documentation/workflow2-cora.png?raw=true)\n", "\n" ] }, { "cell_type": "markdown", "id": "2b07ea13-58cb-498a-949a-4702ea9ee4ce", "metadata": { "tags": [] }, "source": [ "## Data Reading" ] }, { "cell_type": "markdown", "id": "a0b7acf5-32b3-45b1-8347-197dfa869fb9", "metadata": {}, "source": [ "Data is the connecting module of all steps of the workflow." ] }, { "cell_type": "code", "execution_count": 28, "id": "3006b051-8348-4922-a627-56441a1db7b7", "metadata": { "tags": [] }, "outputs": [], "source": [ "from pyjedai.datamodel import Data\n", "d1 = pd.read_csv(\"./../data/der/cora/cora.csv\", sep='|')\n", "gt = pd.read_csv(\"./../data/der/cora/cora_gt.csv\", sep='|', header=None)\n", "attr = ['Entity Id', 'author', 'title']\n", "data = Data(\n", " dataset_1=d1,\n", " id_column_name_1='Entity Id',\n", " ground_truth=gt,\n", " attributes_1=attr\n", ")" ] }, { "cell_type": "markdown", "id": "b3eedb4c-b86f-4f98-abf2-9d6f7d0271c5", "metadata": {}, "source": [ "## Similarity Joins" ] }, { "cell_type": "code", "execution_count": 29, "id": "afd97b7e-4bf8-4256-b9fc-a301e413e834", "metadata": {}, "outputs": [], "source": [ "from pyjedai.joins import EJoin, TopKJoin" ] }, { "cell_type": "code", "execution_count": 30, "id": "7bc8ba43-b059-4839-8958-0b31fab95e46", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "EJoin (jaccard): 0%| | 0/1295 [00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "draw(g)" ] }, { "cell_type": "code", "execution_count": 34, "id": "a4c57244-0c98-4598-8ae2-06cdd0c48385", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Top-K Join\n", 
"***************************************************************************************************************************\n", "Method name: Top-K Join\n", "Parameters: \n", "\tsimilarity_threshold: 0.25547445255474455\n", "\tK: 20\n", "\tmetric: jaccard\n", "\ttokenization: qgrams\n", "\tqgrams: 3\n", "Runtime: 15.0497 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 58.34% \n", "\tRecall: 63.75%\n", "\tF1-score: 60.92%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] }, { "data": { "text/plain": [ "{'Precision %': 58.340434597358325,\n", " 'Recall %': 63.74534450651769,\n", " 'F1 %': 60.923248053392655,\n", " 'True Positives': 10954,\n", " 'False Positives': 7822,\n", " 'True Negatives': 814451.0,\n", " 'False Negatives': 6230}" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "topk_join.evaluate(g)" ] }, { "cell_type": "markdown", "id": "e686ba55-cbbe-4133-8256-fa5339e70720", "metadata": {}, "source": [ "## Entity Clustering" ] }, { "cell_type": "code", "execution_count": 35, "id": "e48627ef-0be6-45dd-b3ba-863818eadbfb", "metadata": {}, "outputs": [], "source": [ "from pyjedai.clustering import ConnectedComponentsClustering" ] }, { "cell_type": "code", "execution_count": 36, "id": "a9d5a28d-79f0-479f-b154-f5f7e3661897", "metadata": {}, "outputs": [], "source": [ "ccc = ConnectedComponentsClustering()\n", "\n", "clusters = ccc.process(g, data)" ] }, { "cell_type": "code", "execution_count": 37, "id": "f95539fe-2569-4569-8929-2f4220f14157", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "***************************************************************************************************************************\n", " Method: Connected Components Clustering\n", 
"***************************************************************************************************************************\n", "Method name: Connected Components Clustering\n", "Parameters: \n", "\tSimilarity Threshold: None\n", "Runtime: 0.1237 seconds\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n", "Performance:\n", "\tPrecision: 2.05% \n", "\tRecall: 100.00%\n", "\tF1-score: 4.02%\n", "───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────\n" ] } ], "source": [ "_ = ccc.evaluate(clusters)" ] }, { "cell_type": "markdown", "id": "778b9b19-0964-4095-a293-a8336fe0607a", "metadata": {}, "source": [ "
\n", "
\n", "K. Nikoletos, J. Maciejewski, G. Papadakis & M. Koubarakis\n", "
\n", "
\n", "Apache License 2.0\n", "
" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.0" } }, "nbformat": 4, "nbformat_minor": 5 }