{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a part of the ***Magellan*** ecosystem, in this tutorial we will show how to use `deepmatcher` together with `Magellan`, to perform an end-to-end Entity Matching (EM) task. Specifically, `Magellan` performs EM in a **two-stage** fashion where given two tables,\n", "1. Perform **blocking** on the two tables by removing obvious non-matching tuple pairs to get a candidate set K.\n", "2. Perform **matching** to predict each pair in K as match or non-match. This stage consists of the following substeps:\n", " 1. *Magellan* first helps the user take a sample S from K.\n", " 2. The user labels all pairs in S as the training data.\n", " 3. A classifer L will be learned on S.\n", " 4. The user applies L to K to predict each pair as match or non-match.\n", "\n", "Given the above workflow, `deepmatcher` fits in the step C, which is to take advantage of deep learning to learn a classifer. For the rest of the tutorial, we will use a real example, which is to match songs across two tables from iTunes and Amazon Music, to go through the workflow.\n", "\n", "For more information on `Magellan`, please go to the project website: https://sites.google.com/site/anhaidgroup/projects/magellan." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 0. Preparation\n", "In order to use ***Magellan***, we need to first install it. The easiest way is to use \"pip\" as follow (please consult the package website for other installation options)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "try:\n", " import py_entitymatching as em\n", "except:\n", " !pip install py_entitymatching" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import sys" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "sys.path.append('/Users/pradap/Documents/Research/Python-Package/anhaid/deepmatcher')" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import deepmatcher as dm\n", "import py_entitymatching as em\n", "\n", "import os" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 1. Load data\n", "We first load the two input tables in the csv format using `Magellan`, that contain songs from iTunes and Amazon Music. These two tables are in the \"example\" directory included in our package." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# The path to the two input tables.\n", "path_A = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableA.csv')\n", "path_B = os.path.join('.', 'sample_data', 'itunes-amazon', 'tableB.csv')" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.\n", "WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.\n" ] } ], "source": [ "# Load the two tables.\n", "A = em.read_csv_metadata(path_A, key='id')\n", "B = em.read_csv_metadata(path_B, key='id')" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of tuples in A: 6907\n", "Number of tuples in B: 55923\n", "Number of tuples in A X B (i.e the cartesian product): 386260161\n" ] } ], "source": [ "# Basic information about the tables.\n", "print('Number of tuples in A: ' + str(len(A)))\n", "print('Number of tuples in B: ' + str(len(B)))\n", "print('Number of tuples in A X B (i.e the cartesian product): ' + str(len(A)*len(B)))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idSong_NameArtist_NameAlbum_NameGenrePriceCopyRightTimeReleased
00Runaway TrainCamWelcome to Cam Country - EPCountry , Music , Contemporary Country , Honky Tonk$ 0.992015 Sony Music Entertainment3:0131-Mar-15
11Track 14OmiMe 4 UPop/Rock , Music , Pop , Dance , R&B / SoulAlbum Only2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ...3:41NaN
22Lips Are MovinMeghan TrainorTitlePop , Music , Pop/Rock , Dance , Rock , Teen Pop$ 1.292014 , 2015 Epic Records , a division of Sony Music Entertainment3:019-Jan-15
33I Want a Hippopotamus for ChristmasA Great Big WorldI 'll Be Home For ChristmasHoliday , Music$ 1.29Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment2:2024-Nov-14
44CreditMeghan TrainorTitle ( Deluxe )Pop , Music , Rock , Pop/Rock , Dance , Teen Pop$ 1.292014 , 2015 Epic Records , a division of Sony Music Entertainment2:519-Jan-15
\n", "
" ], "text/plain": [ " id Song_Name Artist_Name \\\n", "0 0 Runaway Train Cam \n", "1 1 Track 14 Omi \n", "2 2 Lips Are Movin Meghan Trainor \n", "3 3 I Want a Hippopotamus for Christmas A Great Big World \n", "4 4 Credit Meghan Trainor \n", "\n", " Album_Name \\\n", "0 Welcome to Cam Country - EP \n", "1 Me 4 U \n", "2 Title \n", "3 I 'll Be Home For Christmas \n", "4 Title ( Deluxe ) \n", "\n", " Genre Price \\\n", "0 Country , Music , Contemporary Country , Honky Tonk $ 0.99 \n", "1 Pop/Rock , Music , Pop , Dance , R&B / Soul Album Only \n", "2 Pop , Music , Pop/Rock , Dance , Rock , Teen Pop $ 1.29 \n", "3 Holiday , Music $ 1.29 \n", "4 Pop , Music , Rock , Pop/Rock , Dance , Teen Pop $ 1.29 \n", "\n", " CopyRight \\\n", "0 2015 Sony Music Entertainment \n", "1 2015 Ultra Records , LLC under exclusive license to Columbia Records , a Division of Sony Music ... \n", "2 2014 , 2015 Epic Records , a division of Sony Music Entertainment \n", "3 Compilation ( P ) 2014 Epic Records , a division of Sony Music Entertainment \n", "4 2014 , 2015 Epic Records , a division of Sony Music Entertainment \n", "\n", " Time Released \n", "0 3:01 31-Mar-15 \n", "1 3:41 NaN \n", "2 3:01 9-Jan-15 \n", "3 2:20 24-Nov-14 \n", "4 2:51 9-Jan-15 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The first few tuples in table A.\n", "A.head()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idSong_NameArtist_NameAlbum_NameGenrePriceCopyRightTimeReleased
00Saxophone Stomp [ Explicit ]Rusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.3:20September 16 , 2014
11I Wan na Mingle [ feat . Pusher ]Rusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.2:36September 16 , 2014
22LytahRusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.3:48September 16 , 2014
33Slappy PappyRusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.3:55September 16 , 2014
44Fushion JamRusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.1:48September 16 , 2014
\n", "
" ], "text/plain": [ " id Song_Name Artist_Name \\\n", "0 0 Saxophone Stomp [ Explicit ] Rusko \n", "1 1 I Wan na Mingle [ feat . Pusher ] Rusko \n", "2 2 Lytah Rusko \n", "3 3 Slappy Pappy Rusko \n", "4 4 Fushion Jam Rusko \n", "\n", " Album_Name Genre Price \\\n", "0 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "1 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "2 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "3 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "4 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "\n", " CopyRight \\\n", "0 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "1 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "2 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "3 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "4 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "\n", " Time Released \n", "0 3:20 September 16 , 2014 \n", "1 2:36 September 16 , 2014 \n", "2 3:48 September 16 , 2014 \n", "3 3:55 September 16 , 2014 \n", "4 1:48 September 16 , 2014 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The first few tuples in table B.\n", "B.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 2. Block tables to get the candidate set\n", "We first perform blocking on A and B to get a candidate set K, by removing obvious non-matching tuple pairs. `Magellan` supports four different types of blocker: (1) attribute equivalence, (2) overlap, (3) rule-based, and (4) black-box. Typically, users need to mix and match these blockers with the debugging functionality provided in `Magellan` to get a good candidate set. (Developing a good blocker is not the focus of this tutorial. For more information on developing and debugging a blocker, please consult the user manual of `Magellan`.)\n", "\n", "Here we show an example of blocking. Observe that matching tuple pairs should have some common words in the album name, so we first create an overlap blocker on the attribute name \"Album_Name\" with threshold 2, to remove all pairs with word overlap less 2 in that attribute." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% [##############################] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:18\n" ] } ], "source": [ "# Create an overlap blocker in Magellan and apply it to A and B to get the candidate set K1 which is in the format of \n", "# a dataframe. The \"l_out_attrs\" and \"r_out_attrs\" parameters indicate the columns that will be included in K1 from A\n", "# and B respectively.\n", "ob = em.OverlapBlocker()\n", "K1 = ob.block_tables(A, B, 'Album_Name', 'Album_Name',\n", " l_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'], \n", " r_output_attrs=['Song_Name', 'Artist_Name', 'Album_Name', 'Genre', 'Price', 'CopyRight', 'Time', 'Released'],\n", " overlap_size=2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "3152021" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The number of tuple pairs in K1.\n", "len(K1)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
_idltable_idrtable_idltable_Song_Nameltable_Artist_Nameltable_Album_Nameltable_Genreltable_Priceltable_CopyRightltable_Timeltable_Releasedrtable_Song_Namertable_Artist_Namertable_Album_Namertable_Genrertable_Pricertable_CopyRightrtable_Timertable_Released
0039950Rudolph BlueBrian FechinoA Rock By The Sea Christmas : : Volume 2Holiday , Music , Rock$ 0.992011 Hubbub !1:578-Nov-11Saxophone Stomp [ Explicit ]Rusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.3:20September 16 , 2014
1139951Rudolph BlueBrian FechinoA Rock By The Sea Christmas : : Volume 2Holiday , Music , Rock$ 0.992011 Hubbub !1:578-Nov-11I Wan na Mingle [ feat . Pusher ]Rusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.2:36September 16 , 2014
2239952Rudolph BlueBrian FechinoA Rock By The Sea Christmas : : Volume 2Holiday , Music , Rock$ 0.992011 Hubbub !1:578-Nov-11LytahRusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.3:48September 16 , 2014
3339953Rudolph BlueBrian FechinoA Rock By The Sea Christmas : : Volume 2Holiday , Music , Rock$ 0.992011 Hubbub !1:578-Nov-11Slappy PappyRusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.3:55September 16 , 2014
4439954Rudolph BlueBrian FechinoA Rock By The Sea Christmas : : Volume 2Holiday , Music , Rock$ 0.992011 Hubbub !1:578-Nov-11Fushion JamRusko! ( Volume 2 ) [ Explicit ]Dance & Electronic , Dubstep$ 1.29( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc.1:48September 16 , 2014
\n", "
" ], "text/plain": [ " _id ltable_id rtable_id ltable_Song_Name ltable_Artist_Name \\\n", "0 0 3995 0 Rudolph Blue Brian Fechino \n", "1 1 3995 1 Rudolph Blue Brian Fechino \n", "2 2 3995 2 Rudolph Blue Brian Fechino \n", "3 3 3995 3 Rudolph Blue Brian Fechino \n", "4 4 3995 4 Rudolph Blue Brian Fechino \n", "\n", " ltable_Album_Name ltable_Genre \\\n", "0 A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock \n", "1 A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock \n", "2 A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock \n", "3 A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock \n", "4 A Rock By The Sea Christmas : : Volume 2 Holiday , Music , Rock \n", "\n", " ltable_Price ltable_CopyRight ltable_Time ltable_Released \\\n", "0 $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 \n", "1 $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 \n", "2 $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 \n", "3 $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 \n", "4 $ 0.99 2011 Hubbub ! 1:57 8-Nov-11 \n", "\n", " rtable_Song_Name rtable_Artist_Name \\\n", "0 Saxophone Stomp [ Explicit ] Rusko \n", "1 I Wan na Mingle [ feat . Pusher ] Rusko \n", "2 Lytah Rusko \n", "3 Slappy Pappy Rusko \n", "4 Fushion Jam Rusko \n", "\n", " rtable_Album_Name rtable_Genre rtable_Price \\\n", "0 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "1 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "2 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "3 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "4 ! ( Volume 2 ) [ Explicit ] Dance & Electronic , Dubstep $ 1.29 \n", "\n", " rtable_CopyRight \\\n", "0 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "1 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "2 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "3 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "4 ( C ) 2014 FMLY Under Exclusive License To Universal Music Canada Inc. \n", "\n", " rtable_Time rtable_Released \n", "0 3:20 September 16 , 2014 \n", "1 2:36 September 16 , 2014 \n", "2 3:48 September 16 , 2014 \n", "3 3:55 September 16 , 2014 \n", "4 1:48 September 16 , 2014 " ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The first few tuple pairs in K1\n", "K1.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Notice that K1 has more than 3 million pair in it, which is too large to considered for matching. So we create an overlap blocking with threshold 1 on the attribute \"Artist_Name\" for K1, to filter all pairs in K1 that don't share any word in \"Artist_Name\". And we get the candidate set K2." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% [##############################] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:29\n" ] } ], "source": [ "# Create a new overlap blocker to remove pairs from K1 that have no common word in \"Artist_Name\".\n", "K2 = ob.block_candset(K1, 'Artist_Name', 'Artist_Name', overlap_size=1)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "167180" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The number of tuple pairs in K2.\n", "len(K2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After the second blocking step, now we have a candidate set K2 with about 170K pairs. But we think it is still a bit larger to consider for matching. So we apply a third blocker, which is an overlap blocker on the attribute \"Song_Name\", to further reduce the size of the candidate set." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "0% [##############################] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:01\n" ] } ], "source": [ "# Apply the third overlap blocker.\n", "K3 = ob.block_candset(K2, 'Song_Name', 'Song_Name', overlap_size=1)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "38015" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The number of tuples pairs in K3.\n", "len(K3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we have a candidate set with 38K pairs which is reasonable, so we take K3 as the final candidate set. We save the candidate to the disk in the csv format for future reuse." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "path_K = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv')\n", "K3.to_csv(path_K, index=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Step 3. Match tuple pairs in the candidate set\n", "In this stage we will match tuple pairs in the candidate set to predict each of them as match or non-match. This is the part that `deepmatcher` will be involved. Specifically, it consists of the following steps:\n", "1. Take a sample S from the candidate set K, and label all pairs in S.\n", "2. Train a classifier L using S. Specifically, we will train a classifier using `deepmatcher`.\n", "3. Apply L to the candidiate set K." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Sample and label the candidate set\n", "We first take a random sample S from the candidate set K using `Magellan`. Here for example, we sample 500 pairs for K. Then we label each pair as match (enter 1) or non-match (enter 0) and use S as the training data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Take a sample of 500 pairs from the candidate set.\n", "S = em.sample_table(K3, 500)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Label the sample S in a GUI. Enter 1 for match and 0 for non-match.\n", "G = em.label_table(S, 'gold')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the purposes of this tutorial, we will load in a pre-labeled dataset (of 539 tuple pairs) included in this package." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# The path to the labeled data file.\n", "path_G = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'gold.csv')" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:py_entitymatching.io.parsers:Metadata file is not present in the given path; proceeding to read the csv file.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Number of labeled pairs: 539\n" ] } ], "source": [ "# Load the labeled data into a dataframe.\n", "G = em.read_csv_metadata(path_G, \n", " key='_id',\n", " ltable=A, rtable=B, \n", " fk_ltable='ltable_id', fk_rtable='rtable_id')\n", "print('Number of labeled pairs:', len(G))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Train a classifier using labeled data\n", "Once we have the labeled data, we use `deepmatcher` to train a classifier. The first thing we need to do is to split the data for training purpose. In this example, we split the labeled data into three parts: training, validation and test data, with the ratio of 3:1:1. (For now we only support spliting the labeled data into three parts train/valid/test, where the validation set is used for selecting the best model during the training epochs.) For the purpose of caching data and progressive training, we will first save the split parts to disk in the format of csv files, then load them back in. The cache file will be saved during the loading procedure. For subsequent training runs, the cache file will be used to save preprocessing time on the raw csv files, unless the csv files are modified (in this case, new cache file will be generated)." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# The directory where the data splits will be saved.\n", "split_path = os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial')" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# Split labeled data into train, valid, and test csv files to disk, with the split ratio of 3:1:1.\n", "dm.data.split(G, split_path, 'train.csv', 'valid.csv', 'test.csv',\n", " [3, 1, 1])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:deepmatcher.data.dataset:Rebuilding data cache because: ['One or more data files have been modified.']\n", "\n", "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/train.csv\"\n", "0% [############################# ] 100% | ETA: 00:00:00\n", "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/valid.csv\"\n", "0% [############################# ] 100% | ETA: 00:00:00\n", "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/test.csv\"\n", "0% [############################# ] 100% | ETA: 00:00:00\n", "Building vocabulary\n", "0% [#] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n", "\n", "Computing principal components\n", "0% [#] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] } ], "source": [ "# Load the training data files from the disk. Ignore the \"left_id\" and \"right_id\" \n", "# columns for data preprocessing.\n", "# The 'use_magellan_convention' parameter asks deepmatcher to use Magellan's \n", "# naming convention for the left and right table column name prefixes \n", "# (\"ltable_\", and \"rtable_\"), and also to consider \"_id\" as the ID column.\n", "train, validation, test = dm.data.process(\n", " path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial'),\n", " cache='train_cache.pth',\n", " train='train.csv',\n", " validation='valid.csv',\n", " test='test.csv',\n", " use_magellan_convention=True,\n", " ignore_columns=('ltable_id', 'rtable_id'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we get the training data, we can use `deepmatcher` to train a classifier. Here we train a hybrid model." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Create a hybrid model.\n", "model = dm.MatchingModel(attr_summarizer='hybrid')" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "* Number of trainable parameters: 17757810\n", "===> TRAIN Epoch 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:04\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 1 || Run Time: 4.2 | Load Time: 0.4 || F1: 39.60 | Prec: 31.25 | Rec: 54.05 || Ex/s: 69.89\n", "\n", "===> EVAL Epoch 1\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 1 || Run Time: 0.6 | Load Time: 0.2 || F1: 60.61 | Prec: 50.00 | Rec: 76.92 || Ex/s: 146.30\n", "\n", "* Best F1: 60.60606060606061\n", "Saving best model...\n", "Done.\n", "---------------------\n", "\n", "===> TRAIN Epoch 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:04\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 2 || Run Time: 3.9 | Load Time: 0.4 || F1: 65.62 | Prec: 53.39 | Rec: 85.14 || Ex/s: 75.12\n", "\n", "===> EVAL Epoch 2\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 2 || Run Time: 0.6 | Load Time: 0.2 || F1: 76.00 | Prec: 79.17 | Rec: 73.08 || Ex/s: 146.34\n", "\n", "* Best F1: 76.0\n", "Saving best model...\n", "Done.\n", "---------------------\n", "\n", "===> TRAIN Epoch 3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:03\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 3 || Run Time: 3.8 | Load Time: 0.4 || F1: 87.27 | Prec: 79.12 | Rec: 97.30 || Ex/s: 76.32\n", "\n", "===> EVAL Epoch 3\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 3 || Run Time: 0.6 | Load Time: 0.2 || F1: 84.21 | Prec: 77.42 | Rec: 92.31 || Ex/s: 143.25\n", "\n", "* Best F1: 84.21052631578948\n", "Saving best model...\n", "Done.\n", "---------------------\n", "\n", "===> TRAIN Epoch 4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:03\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 4 || Run Time: 3.8 | Load Time: 0.4 || F1: 90.57 | Prec: 84.71 | Rec: 97.30 || Ex/s: 76.22\n", "\n", "===> EVAL Epoch 4\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 4 || Run Time: 0.6 | Load Time: 0.2 || F1: 82.76 | Prec: 75.00 | Rec: 92.31 || Ex/s: 146.11\n", "\n", "---------------------\n", "\n", "===> TRAIN Epoch 5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:04\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 5 || Run Time: 3.9 | Load Time: 0.4 || F1: 97.37 | Prec: 94.87 | Rec: 100.00 || Ex/s: 74.96\n", "\n", "===> EVAL Epoch 5\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 5 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.57 | Prec: 88.89 | Rec: 92.31 || Ex/s: 146.11\n", "\n", "* Best F1: 90.56603773584906\n", "Saving best model...\n", "Done.\n", "---------------------\n", "\n", "===> TRAIN Epoch 6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:04\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 6 || Run Time: 3.9 | Load Time: 0.4 || F1: 98.67 | Prec: 97.37 | Rec: 100.00 || Ex/s: 74.94\n", "\n", "===> EVAL Epoch 6\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 6 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.20 | Prec: 92.00 | Rec: 88.46 || Ex/s: 145.73\n", "\n", "---------------------\n", "\n", "===> TRAIN Epoch 7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:03\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 7 || Run Time: 3.8 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.88\n", "\n", "===> EVAL Epoch 7\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 7 || Run Time: 0.6 | Load Time: 0.2 || F1: 90.20 | Prec: 92.00 | Rec: 88.46 || Ex/s: 145.56\n", "\n", "---------------------\n", "\n", "===> TRAIN Epoch 8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:03\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 8 || Run Time: 3.9 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.16\n", "\n", "===> EVAL Epoch 8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 8 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.76\n", "\n", "* Best F1: 92.00000000000001\n", "Saving best model...\n", "Done.\n", "---------------------\n", "\n", "===> TRAIN Epoch 9\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:03\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 9 || Run Time: 3.7 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 77.50\n", "\n", "===> EVAL Epoch 9\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 9 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.27\n", "\n", "---------------------\n", "\n", "===> TRAIN Epoch 10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:04\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 10 || Run Time: 3.9 | Load Time: 0.4 || F1: 99.33 | Prec: 98.67 | Rec: 100.00 || Ex/s: 75.28\n", "\n", "===> EVAL Epoch 10\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [█] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:00:00\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 10 || Run Time: 0.6 | Load Time: 0.2 || F1: 92.00 | Prec: 95.83 | Rec: 88.46 || Ex/s: 145.95\n", "\n", "---------------------\n", "\n", "Loading best model...\n", "Training done.\n" ] }, { "data": { "text/plain": [ "92.00000000000001" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Train the hybrid model with 10 training epochs, batch size of 16, positive-to-negative \n", "# ratio to be 3. We save the best model (with the \n", "# highest F1 score on the validation set) to 'hybrid_model.pth'.\n", "model.run_train(\n", " train,\n", " validation,\n", " epochs=10,\n", " batch_size=16,\n", " best_save_path='hybrid_model.pth',\n", " pos_neg_ratio=3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we have the trained classifier, we can evaluate the accuracy using the test data." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===> EVAL Epoch 8\n", "Finished Epoch 8 || Run Time: 0.3 | Load Time: 0.2 || F1: 92.31 | Prec: 90.91 | Rec: 93.75 || Ex/s: 208.04\n", "\n" ] }, { "data": { "text/plain": [ "92.3076923076923" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Evaluate the accuracy on the test data.\n", "model.run_eval(test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Apply the trained classifier to the candidate set\n", "Now we have a trained classifer, we can apply it to the candidate set to predict each pair as match or non-match. To achieve this, we first load the trained model and the candidate set." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "# Load the model.\n", "model = dm.MatchingModel(attr_summarizer='hybrid')\n", "model.load_state('hybrid_model.pth')" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "Reading and processing data from \"./sample_data/itunes-amazon/e2e-tutorial/candidate.csv\"\n", "0% [##############################] 100% | ETA: 00:00:00" ] } ], "source": [ "# Load the candidate set. Note that the trained model is an input parameter as we need to trained \n", "# model for candidate set preprocessing.\n", "candidate = dm.data.process_unlabeled(\n", " path=os.path.join('.', 'sample_data', 'itunes-amazon', 'e2e-tutorial', 'candidate.csv'),\n", " trained_model=model,\n", " ignore_columns=('ltable_id', 'rtable_id'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After we load the candidate set, now we make the predictions using the `run_prediction` function. The *output_attribute* argument indicates the columns that will be included in the prediction table. Here \n", "we use all of the columns in the candidate set." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "===> PREDICT Epoch 8\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "0% [██████████████████████████████] 100% | ETA: 00:00:00\n", "Total time elapsed: 00:02:18\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Finished Epoch 8 || Run Time: 97.2 | Load Time: 41.5 || F1: 0.00 | Prec: 0.00 | Rec: 0.00 || Ex/s: 0.00\n", "\n" ] } ], "source": [ "# Predict the pairs in the candidate set and return a dataframe containing the pair id with \n", "# the score of being a match.\n", "predictions = model.run_prediction(candidate, output_attributes=list(candidate.get_raw_table().columns))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below shows a few pairs in the prediction table. Note that this table is not sorted in the descending order of the \n", "match scores." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
match_scoreltable_idrtable_idltable_Song_Nameltable_Artist_Nameltable_Album_Nameltable_Genreltable_Priceltable_CopyRightltable_Timeltable_Releasedrtable_Song_Namertable_Artist_Namertable_Album_Namertable_Genrertable_Pricertable_CopyRightrtable_Timertable_Released
_id
32370.9604612583228#Selfie ( Instrumental Mix )The Chainsmokers#Selfie ( Instrumental Mix ) - SingleDance , Music , Electronic$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc.3:0320-Mar-14#SELFIE ( Instrumental Mix )The Chainsmokers#SELFIE ( Instrumental Mix )Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:03March 20 , 2014
32740.2944732003229#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Will Sparks Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:52June 17 , 2014
32750.2944732574229#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Will Sparks Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:52June 17 , 2014
32760.2944732575229#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Will Sparks Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:52June 17 , 2014
33590.1613232003230#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Botnek Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:36June 17 , 2014
\n", "
" ], "text/plain": [ " match_score ltable_id rtable_id ltable_Song_Name \\\n", "_id \n", "3237 0.960461 2583 228 #Selfie ( Instrumental Mix ) \n", "3274 0.294473 2003 229 #SELFIE ( Caked Up Remix ) \n", "3275 0.294473 2574 229 #SELFIE ( Caked Up Remix ) \n", "3276 0.294473 2575 229 #SELFIE ( Caked Up Remix ) \n", "3359 0.161323 2003 230 #SELFIE ( Caked Up Remix ) \n", "\n", " ltable_Artist_Name ltable_Album_Name \\\n", "_id \n", "3237 The Chainsmokers #Selfie ( Instrumental Mix ) - Single \n", "3274 The Chainsmokers #SELFIE ( The Remixes ) - Single \n", "3275 The Chainsmokers #SELFIE ( The Remixes ) - Single \n", "3276 The Chainsmokers #SELFIE ( The Remixes ) - Single \n", "3359 The Chainsmokers #SELFIE ( The Remixes ) - Single \n", "\n", " ltable_Genre ltable_Price \\\n", "_id \n", "3237 Dance , Music , Electronic $ 1.29 \n", "3274 Electronic , Music , Dance $ 1.29 \n", "3275 Electronic , Music , Dance $ 1.29 \n", "3276 Electronic , Music , Dance $ 1.29 \n", "3359 Electronic , Music , Dance $ 1.29 \n", "\n", " ltable_CopyRight \\\n", "_id \n", "3237 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. \n", "3274 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "3275 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "3276 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "3359 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "\n", " ltable_Time ltable_Released rtable_Song_Name \\\n", "_id \n", "3237 3:03 20-Mar-14 #SELFIE ( Instrumental Mix ) \n", "3274 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) \n", "3275 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) \n", "3276 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) \n", "3359 3:15 17-Jun-14 #Selfie ( Botnek Remix ) \n", "\n", " rtable_Artist_Name rtable_Album_Name \\\n", "_id \n", "3237 The Chainsmokers #SELFIE ( Instrumental Mix ) \n", "3274 The Chainsmokers #Selfie ( The Remixes ) \n", "3275 The Chainsmokers #Selfie ( The Remixes ) \n", "3276 The Chainsmokers #Selfie ( The Remixes ) \n", "3359 The Chainsmokers #Selfie ( The Remixes ) \n", "\n", " rtable_Genre rtable_Price \\\n", "_id \n", "3237 Dance & Electronic $ 1.29 \n", "3274 Electronica , Dance & Electronic $ 1.29 \n", "3275 Electronica , Dance & Electronic $ 1.29 \n", "3276 Electronica , Dance & Electronic $ 1.29 \n", "3359 Electronica , Dance & Electronic $ 1.29 \n", "\n", " rtable_CopyRight \\\n", "_id \n", "3237 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3274 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3275 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3276 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3359 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "\n", " rtable_Time rtable_Released \n", "_id \n", "3237 3:03 March 20 , 2014 \n", "3274 3:52 June 17 , 2014 \n", "3275 3:52 June 17 , 2014 \n", "3276 3:52 June 17 , 2014 \n", "3359 3:36 June 17 , 2014 " ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can simply manipulate the prediction table above to get the pairs that we want. Suppose we are only interested in pairs with the match score more over 0.9, and we want to display them sorting in the descending order on match score. This can be achieved as follow." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
match_scoreltable_idrtable_idltable_Song_Nameltable_Artist_Nameltable_Album_Nameltable_Genreltable_Priceltable_CopyRightltable_Timeltable_Releasedrtable_Song_Namertable_Artist_Namertable_Album_Namertable_Genrertable_Pricertable_CopyRightrtable_Timertable_Released
_id
14724660.995653246230798It 's Never Easy to Say GoodbyeKenny ChesneyMe and YouCountry , Music , Urban Cowboy , Contemporary Country$ 0.991996 BMG Entertainment4:441-Jan-90It 's Never Easy To Say GoodbyeKenny ChesneyMe And YouCountry$ 0.99( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC4:44January 1 , 1990
14724650.995653245330798It 's Never Easy to Say GoodbyeKenny ChesneyMe and YouCountry , Music , Urban Cowboy , Contemporary Country$ 0.991996 BMG Entertainment4:441-Jan-90It 's Never Easy To Say GoodbyeKenny ChesneyMe And YouCountry$ 0.99( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC4:44January 1 , 1990
14724640.995653243830798It 's Never Easy to Say GoodbyeKenny ChesneyMe and YouCountry , Music , Urban Cowboy , Contemporary Country$ 0.991996 BMG Entertainment4:441-Jan-90It 's Never Easy To Say GoodbyeKenny ChesneyMe And YouCountry$ 0.99( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC4:44January 1 , 1990
14724630.995653243530798It 's Never Easy to Say GoodbyeKenny ChesneyMe and YouCountry , Music , Urban Cowboy , Contemporary Country$ 0.991996 BMG Entertainment4:441-Jan-90It 's Never Easy To Say GoodbyeKenny ChesneyMe And YouCountry$ 0.99( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC4:44January 1 , 1990
14724620.995653243030798It 's Never Easy to Say GoodbyeKenny ChesneyMe and YouCountry , Music , Urban Cowboy , Contemporary Country$ 0.991996 BMG Entertainment4:441-Jan-90It 's Never Easy To Say GoodbyeKenny ChesneyMe And YouCountry$ 0.99( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC4:44January 1 , 1990
\n", "
" ], "text/plain": [ " match_score ltable_id rtable_id ltable_Song_Name \\\n", "_id \n", "1472466 0.995653 2462 30798 It 's Never Easy to Say Goodbye \n", "1472465 0.995653 2453 30798 It 's Never Easy to Say Goodbye \n", "1472464 0.995653 2438 30798 It 's Never Easy to Say Goodbye \n", "1472463 0.995653 2435 30798 It 's Never Easy to Say Goodbye \n", "1472462 0.995653 2430 30798 It 's Never Easy to Say Goodbye \n", "\n", " ltable_Artist_Name ltable_Album_Name \\\n", "_id \n", "1472466 Kenny Chesney Me and You \n", "1472465 Kenny Chesney Me and You \n", "1472464 Kenny Chesney Me and You \n", "1472463 Kenny Chesney Me and You \n", "1472462 Kenny Chesney Me and You \n", "\n", " ltable_Genre ltable_Price \\\n", "_id \n", "1472466 Country , Music , Urban Cowboy , Contemporary Country $ 0.99 \n", "1472465 Country , Music , Urban Cowboy , Contemporary Country $ 0.99 \n", "1472464 Country , Music , Urban Cowboy , Contemporary Country $ 0.99 \n", "1472463 Country , Music , Urban Cowboy , Contemporary Country $ 0.99 \n", "1472462 Country , Music , Urban Cowboy , Contemporary Country $ 0.99 \n", "\n", " ltable_CopyRight ltable_Time ltable_Released \\\n", "_id \n", "1472466 1996 BMG Entertainment 4:44 1-Jan-90 \n", "1472465 1996 BMG Entertainment 4:44 1-Jan-90 \n", "1472464 1996 BMG Entertainment 4:44 1-Jan-90 \n", "1472463 1996 BMG Entertainment 4:44 1-Jan-90 \n", "1472462 1996 BMG Entertainment 4:44 1-Jan-90 \n", "\n", " rtable_Song_Name rtable_Artist_Name rtable_Album_Name \\\n", "_id \n", "1472466 It 's Never Easy To Say Goodbye Kenny Chesney Me And You \n", "1472465 It 's Never Easy To Say Goodbye Kenny Chesney Me And You \n", "1472464 It 's Never Easy To Say Goodbye Kenny Chesney Me And You \n", "1472463 It 's Never Easy To Say Goodbye Kenny Chesney Me And You \n", "1472462 It 's Never Easy To Say Goodbye Kenny Chesney Me And You \n", "\n", " rtable_Genre rtable_Price \\\n", "_id \n", "1472466 Country $ 0.99 \n", "1472465 Country $ 0.99 \n", "1472464 Country $ 0.99 \n", "1472463 Country $ 0.99 \n", "1472462 Country $ 0.99 \n", "\n", " rtable_CopyRight \\\n", "_id \n", "1472466 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC \n", "1472465 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC \n", "1472464 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC \n", "1472463 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC \n", "1472462 ( C ) 2011 MPL Communications Ltd/Inc under exclusive license to StarCon LLC \n", "\n", " rtable_Time rtable_Released \n", "_id \n", "1472466 4:44 January 1 , 1990 \n", "1472465 4:44 January 1 , 1990 \n", "1472464 4:44 January 1 , 1990 \n", "1472463 4:44 January 1 , 1990 \n", "1472462 4:44 January 1 , 1990 " ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "high_score_pairs = predictions[predictions['match_score'] >= 0.9].sort_values(by=['match_score'], ascending=False)\n", "high_score_pairs.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you have seen, `deepmatcher` does not include a column indicating match / non-match predictions in the output table. You can easily create such a column by thresholding the `match_score` at 0.5 (which is what `deepmatcher` uses as the threshold to compute F1), or using a threshold value or your choice." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
match_scorematch_predictionltable_idrtable_idltable_Song_Nameltable_Artist_Nameltable_Album_Nameltable_Genreltable_Priceltable_CopyRightltable_Timeltable_Releasedrtable_Song_Namertable_Artist_Namertable_Album_Namertable_Genrertable_Pricertable_CopyRightrtable_Timertable_Released
_id
32370.96046112583228#Selfie ( Instrumental Mix )The Chainsmokers#Selfie ( Instrumental Mix ) - SingleDance , Music , Electronic$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc.3:0320-Mar-14#SELFIE ( Instrumental Mix )The Chainsmokers#SELFIE ( Instrumental Mix )Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:03March 20 , 2014
32740.29447302003229#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Will Sparks Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:52June 17 , 2014
32750.29447302574229#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Will Sparks Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:52June 17 , 2014
32760.29447302575229#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Will Sparks Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:52June 17 , 2014
33590.16132302003230#SELFIE ( Caked Up Remix )The Chainsmokers#SELFIE ( The Remixes ) - SingleElectronic , Music , Dance$ 1.292014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I...3:1517-Jun-14#Selfie ( Botnek Remix )The Chainsmokers#Selfie ( The Remixes )Electronica , Dance & Electronic$ 1.29( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin...3:36June 17 , 2014
\n", "
" ], "text/plain": [ " match_score match_prediction ltable_id rtable_id \\\n", "_id \n", "3237 0.960461 1 2583 228 \n", "3274 0.294473 0 2003 229 \n", "3275 0.294473 0 2574 229 \n", "3276 0.294473 0 2575 229 \n", "3359 0.161323 0 2003 230 \n", "\n", " ltable_Song_Name ltable_Artist_Name \\\n", "_id \n", "3237 #Selfie ( Instrumental Mix ) The Chainsmokers \n", "3274 #SELFIE ( Caked Up Remix ) The Chainsmokers \n", "3275 #SELFIE ( Caked Up Remix ) The Chainsmokers \n", "3276 #SELFIE ( Caked Up Remix ) The Chainsmokers \n", "3359 #SELFIE ( Caked Up Remix ) The Chainsmokers \n", "\n", " ltable_Album_Name ltable_Genre \\\n", "_id \n", "3237 #Selfie ( Instrumental Mix ) - Single Dance , Music , Electronic \n", "3274 #SELFIE ( The Remixes ) - Single Electronic , Music , Dance \n", "3275 #SELFIE ( The Remixes ) - Single Electronic , Music , Dance \n", "3276 #SELFIE ( The Remixes ) - Single Electronic , Music , Dance \n", "3359 #SELFIE ( The Remixes ) - Single Electronic , Music , Dance \n", "\n", " ltable_Price \\\n", "_id \n", "3237 $ 1.29 \n", "3274 $ 1.29 \n", "3275 $ 1.29 \n", "3276 $ 1.29 \n", "3359 $ 1.29 \n", "\n", " ltable_CopyRight \\\n", "_id \n", "3237 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , Inc. \n", "3274 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "3275 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "3276 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "3359 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordings , I... \n", "\n", " ltable_Time ltable_Released rtable_Song_Name \\\n", "_id \n", "3237 3:03 20-Mar-14 #SELFIE ( Instrumental Mix ) \n", "3274 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) \n", "3275 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) \n", "3276 3:15 17-Jun-14 #Selfie ( Will Sparks Remix ) \n", "3359 3:15 17-Jun-14 #Selfie ( Botnek Remix ) \n", "\n", " rtable_Artist_Name rtable_Album_Name \\\n", "_id \n", "3237 The Chainsmokers #SELFIE ( Instrumental Mix ) \n", "3274 The Chainsmokers #Selfie ( The Remixes ) \n", "3275 The Chainsmokers #Selfie ( The Remixes ) \n", "3276 The Chainsmokers #Selfie ( The Remixes ) \n", "3359 The Chainsmokers #Selfie ( The Remixes ) \n", "\n", " rtable_Genre rtable_Price \\\n", "_id \n", "3237 Dance & Electronic $ 1.29 \n", "3274 Electronica , Dance & Electronic $ 1.29 \n", "3275 Electronica , Dance & Electronic $ 1.29 \n", "3276 Electronica , Dance & Electronic $ 1.29 \n", "3359 Electronica , Dance & Electronic $ 1.29 \n", "\n", " rtable_CopyRight \\\n", "_id \n", "3237 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3274 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3275 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3276 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "3359 ( C ) 2014 Dim Mak Inc. under exclusive license to Republic Records , a Division of UMG Recordin... \n", "\n", " rtable_Time rtable_Released \n", "_id \n", "3237 3:03 March 20 , 2014 \n", "3274 3:52 June 17 , 2014 \n", "3275 3:52 June 17 , 2014 \n", "3276 3:52 June 17 , 2014 \n", "3359 3:36 June 17 , 2014 " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "predictions['match_prediction'] = predictions['match_score'].apply(lambda score: 1 if score >= 0.5 else 0)\n", "\n", "# Reorder columns to avoid scrolling...\n", "predictions = predictions[['match_score', 'match_prediction'] + predictions.columns.values[1:-1].tolist()]\n", "predictions.head()" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Reset index as Magellan requires the key to be a column in the table\n", "predictions.reset_index(inplace=True)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Update metadata in the catalog. This information can later be used by triggers to modify the labels from \n", "# the learning-based matcher \n", "em.set_key(predictions, '_id')\n", "em.set_fk_ltable(predictions, 'ltable_id')\n", "em.set_fk_rtable(predictions, 'rtable_id')\n", "em.set_ltable(predictions, A)\n", "em.set_rtable(predictions, B)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.4" } }, "nbformat": 4, "nbformat_minor": 2 }