{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"# Step-by-step example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This tutorial walks through an example end to end. The main steps are:\n",
"\n",
"- Dataset analysis\n",
"- Construct RLTK datasets\n",
"- Blocking\n",
"- Pairwise comparison\n",
"- Evaluation\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Dataset analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The dataset used here is an artificial one constructed from DBLP and Scholar data. Let's take a look."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# initialization\n",
"import os\n",
"from datetime import datetime\n",
"import pandas as pd\n",
"from IPython.display import display\n",
"import rltk"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>id</th>\n",
"      <th>names</th>\n",
"      <th>date</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>journals/sigmod/HummerLW02</td>\n",
"      <td>W Hümmer, W Lehner, H Wedekind</td>\n",
"      <td>2018-12-24</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>conf/vldb/AgrawalS94</td>\n",
"      <td>R Agrawal, R Srikant</td>\n",
"      <td>2018-12-22</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>conf/vldb/Brin95</td>\n",
"      <td>S Brin</td>\n",
"      <td>2018-12-26</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>conf/vldb/ChakravarthyKAK94</td>\n",
"      <td>S Chakravarthy, V Krishnaprasad, E Anwar, S Kim</td>\n",
"      <td>2018-12-29</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>conf/vldb/MedianoCD94</td>\n",
"      <td>M Mediano, M Casanova, M Dreux</td>\n",
"      <td>2018-12-26</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" id \\\n",
"0 journals/sigmod/HummerLW02 \n",
"1 conf/vldb/AgrawalS94 \n",
"2 conf/vldb/Brin95 \n",
"3 conf/vldb/ChakravarthyKAK94 \n",
"4 conf/vldb/MedianoCD94 \n",
"\n",
" names date \n",
"0 W Hümmer, W Lehner, H Wedekind 2018-12-24 \n",
"1 R Agrawal, R Srikant 2018-12-22 \n",
"2 S Brin 2018-12-26 \n",
"3 S Chakravarthy, V Krishnaprasad, E Anwar, S Kim 2018-12-29 \n",
"4 M Mediano, M Casanova, M Dreux 2018-12-26 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_dblp = pd.read_csv('resources/dblp.csv', parse_dates=False)\n",
"df_dblp.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\">\n",
"      <th></th>\n",
"      <th>date</th>\n",
"      <th>id</th>\n",
"      <th>names</th>\n",
"    </tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr>\n",
"      <th>0</th>\n",
"      <td>26, Dec 2018</td>\n",
"      <td>ek26aiEheesJ</td>\n",
"      <td>M Fernandez, J Kang, A Levy, D Suciu</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>1</th>\n",
"      <td>29, Dec 2018</td>\n",
"      <td>rmtEGXAXHKIJ</td>\n",
"      <td>S Adali, KS Candan, Y Papakonstantinou, VS</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>2</th>\n",
"      <td>27, Dec 2018</td>\n",
"      <td>D0z0BDnbnFcJ</td>\n",
"      <td>S Christodoulakis</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>3</th>\n",
"      <td>28, Dec 2018</td>\n",
"      <td>noTo81QxmHQJ</td>\n",
"      <td>ACMS Anthology, P Edition</td>\n",
"    </tr>\n",
"    <tr>\n",
"      <th>4</th>\n",
"      <td>28, Dec 2018</td>\n",
"      <td>l0W27c1C3NwJ</td>\n",
"      <td>W Litwin, MA Neimat, DA Schneider</td>\n",
"    </tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" date id names\n",
"0 26, Dec 2018 ek26aiEheesJ M Fernandez, J Kang, A Levy, D Suciu\n",
"1 29, Dec 2018 rmtEGXAXHKIJ S Adali, KS Candan, Y Papakonstantinou, VS \n",
"2 27, Dec 2018 D0z0BDnbnFcJ S Christodoulakis\n",
"3 28, Dec 2018 noTo81QxmHQJ ACMS Anthology, P Edition\n",
"4 28, Dec 2018 l0W27c1C3NwJ W Litwin, MA Neimat, DA Schneider"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_scholar = pd.read_json('resources/scholar.jl', lines=True, convert_dates=False)\n",
"df_scholar.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"At a glance, both datasets have `id`, `date` and `names` columns. Two things stand out:\n",
"\n",
"- The dates use different formats (`2018-12-24` in DBLP, `26, Dec 2018` in Scholar).\n",
"- Each `names` column holds several names separated by commas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Construct RLTK datasets\n",
"\n",
"In RLTK, a data collection is called a `Dataset` and each \"row\" is a `Record` instance. To construct a `Dataset`, you read data from a source with a specific `Reader`. Each item is then exposed as a Python dict, `raw_object`, which is used to build a `Record` instance according to the schema (a concrete subclass of `Record`) you define.\n",
"\n",
"For DBLP:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"class DBLP(rltk.Record):\n",
"    @property\n",
"    def id(self):\n",
"        return self.raw_object['id']\n",
"\n",
"    @property\n",
"    def date(self):\n",
"        return self.raw_object['date']\n",
"\n",
"    @property\n",
"    def names(self):\n",
"        return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"journals/sigmod/HummerLW02 2018-12-24 ['W Hümmer', 'W Lehner', 'H Wedekind']\n",
"conf/vldb/AgrawalS94 2018-12-22 ['R Agrawal', 'R Srikant']\n",
"conf/vldb/Brin95 2018-12-26 ['S Brin']\n",
"conf/vldb/ChakravarthyKAK94 2018-12-29 ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim']\n",
"conf/vldb/MedianoCD94 2018-12-26 ['M Mediano', 'M Casanova', 'M Dreux']\n",
"conf/vldb/SistlaYH94 2018-12-25 ['A Sistla', 'C Yu', 'R Haddad']\n",
"journals/sigmod/PourabbasR00 2018-12-20 ['E Pourabbas', 'M Rafanelli']\n",
"conf/sigmod/MelnikRB03 2018-12-21 ['S Melnik', 'E Rahm', 'P Bernstein']\n",
"conf/sigmod/ZhangDWEMPMDR03 2018-12-26 ['X Zhang', 'K Dimitrova', 'L Wang', 'M El-Sayed', 'B Murphy', 'B Pielech', 'M Mulchandani', 'L Ding', 'E Rundensteiner']\n",
"conf/sigmod/ZhouWGGZWXYF03 2018-12-27 ['A Zhou', 'Q Wang', 'Z Guo', 'X Gong', 'S Zheng', 'H Wu', 'J Xiao', 'K Yue', 'W Fan']\n"
]
}
],
"source": [
"ds_dblp = rltk.Dataset(rltk.CSVReader('resources/dblp.csv'), record_class=DBLP)\n",
"\n",
"for r_dblp in ds_dblp.head():\n",
"    print(r_dblp.id, r_dblp.date, r_dblp.names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For scholar:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"@rltk.remove_raw_object\n",
"class Scholar(rltk.Record):\n",
"    @rltk.cached_property\n",
"    def id(self):\n",
"        return self.raw_object['id']\n",
"\n",
"    @rltk.cached_property\n",
"    def date(self):\n",
"        return datetime.strptime(self.raw_object['date'], '%d, %b %Y').date().strftime('%Y-%m-%d')\n",
"\n",
"    @rltk.cached_property\n",
"    def names(self):\n",
"        return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ek26aiEheesJ 2018-12-26 ['M Fernandez', 'J Kang', 'A Levy', 'D Suciu']\n",
"rmtEGXAXHKIJ 2018-12-29 ['S Adali', 'KS Candan', 'Y Papakonstantinou', 'VS']\n",
"D0z0BDnbnFcJ 2018-12-27 ['S Christodoulakis']\n",
"noTo81QxmHQJ 2018-12-28 ['ACMS Anthology', 'P Edition']\n",
"l0W27c1C3NwJ 2018-12-28 ['W Litwin', 'MA Neimat', 'DA Schneider']\n",
"IkNOhDqEY18J 2018-12-26 ['S Acharya', 'PB Gibbons']\n",
"6QZGeKna5lgJ 2018-12-23 ['T Gri']\n",
"XFCkL9QhTjIJ 2018-12-25 ['K Koperski', 'J Han']\n",
"9Wo54Wyh_X8J 2018-12-23 ['H Garcia-Molina', 'S Raghavan']\n",
"9uxj2XzGt9UJ 2018-12-28 ['M Flickner', 'H Sawhney', 'W Niblack', 'J Ashley', 'Q']\n"
]
}
],
"source": [
"ds_scholar = rltk.Dataset(rltk.JsonLinesReader('resources/scholar.jl'), record_class=Scholar)\n",
"\n",
"for r_scholar in ds_scholar.head():\n",
"    print(r_scholar.id, r_scholar.date, r_scholar.names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `cached_property` decorator means the property value is pre-computed while the `Dataset` is being generated. This is especially useful when the transformation behind a property is time-consuming (e.g., tokenization, vectorization). `remove_raw_object` releases the space taken by `raw_object` once all properties have been cached.\n",
"\n",
"If you prefer to do data cleaning and manipulation in a `pandas.DataFrame`, you can easily build a `Dataset` from it."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"ek26aiEheesJ 26, Dec 2018 M Fernandez, J Kang, A Levy, D Suciu\n",
"rmtEGXAXHKIJ 29, Dec 2018 S Adali, KS Candan, Y Papakonstantinou, VS \n",
"D0z0BDnbnFcJ 27, Dec 2018 S Christodoulakis\n",
"noTo81QxmHQJ 28, Dec 2018 ACMS Anthology, P Edition\n",
"l0W27c1C3NwJ 28, Dec 2018 W Litwin, MA Neimat, DA Schneider\n",
"IkNOhDqEY18J 26, Dec 2018 S Acharya, PB Gibbons\n",
"6QZGeKna5lgJ 23, Dec 2018 T Gri\n",
"XFCkL9QhTjIJ 25, Dec 2018 K Koperski, J Han\n",
"9Wo54Wyh_X8J 23, Dec 2018 H Garcia-Molina, S Raghavan\n",
"9uxj2XzGt9UJ 28, Dec 2018 M Flickner, H Sawhney, W Niblack, J Ashley, Q \n"
]
}
],
"source": [
"# do the data transformation in df_scholar first, then:\n",
"\n",
"class Scholar2(rltk.AutoGeneratedRecord):\n",
"    pass\n",
"\n",
"ds_scholar2 = rltk.Dataset(rltk.DataFrameReader(df_scholar), record_class=Scholar2)\n",
"\n",
"for r_scholar2 in ds_scholar2.head():\n",
"    print(r_scholar2.id, r_scholar2.date, r_scholar2.names)"
]
},
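{
"cell_type": "markdown",
"metadata": {},
"source": [
"The comment in the cell above leaves the pandas transformation itself open. The sketch below shows one way it might look, on a small hand-made DataFrame. The sample rows and the name `df_sample` are illustrative assumptions, not part of the original data pipeline. A frame cleaned this way could then be passed to `rltk.DataFrameReader` in the same way as `df_scholar` above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Hypothetical sample rows mirroring the Scholar data shown earlier.\n",
"df_sample = pd.DataFrame({\n",
"    'id': ['ek26aiEheesJ', 'D0z0BDnbnFcJ'],\n",
"    'date': ['26, Dec 2018', '27, Dec 2018'],\n",
"    'names': ['M Fernandez, J Kang, A Levy, D Suciu', 'S Christodoulakis'],\n",
"})\n",
"\n",
"# Normalize dates such as '26, Dec 2018' to the ISO style used by the DBLP data.\n",
"df_sample['date'] = pd.to_datetime(df_sample['date'], format='%d, %b %Y').dt.strftime('%Y-%m-%d')\n",
"\n",
"# Split the comma-separated names into a list of stripped names.\n",
"df_sample['names'] = df_sample['names'].apply(lambda s: [n.strip() for n in s.split(',')])\n",
"\n",
"print(df_sample)"
]
},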
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Blocking\n",
"\n",
"Blocking eliminates obviously impossible pairs up front, which greatly reduces the number of unnecessary comparisons.\n",
"\n",
"In this example, date is an ideal key for blocking."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"bg = rltk.HashBlockGenerator()\n",
"block = bg.generate(\n",
"    bg.block(ds_dblp, property_='date'),\n",
"    bg.block(ds_scholar, property_='date')\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what a block contains, grouped by key, iterate over the `key_set_adapter` of the block object. A block is stored in a concrete `KeySetAdapter` (`MemoryKeySetAdapter` by default)."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"('2018-12-24', {('Scholar', 'BTalXWt3faUJ'), ('Scholar', 'bTYTn8VG5hIJ'), ('Scholar', 'sHJ914nPZtUJ'), ('DBLP', 'conf/sigmod/CherniackZ96'), ('Scholar', 'c9Humx2-EMgJ'), ('Scholar', 'YMcmy4FOXi8J'), ('Scholar', 'W1IcM8IUwAEJ'), ('DBLP', 'journals/sigmod/Yang94'), ('Scholar', 'wLNJcNvsulkJ'), ('DBLP', 'journals/sigmod/BohmR94'), ('DBLP', 'conf/sigmod/SimmenSM96'), ('DBLP', 'conf/sigmod/TatarinovIHW01'), ('DBLP', 'journals/vldb/BarbaraI95'), ('DBLP', 'conf/vldb/RohmBSS02'), ('Scholar', 'XVP8s4K0Bg4J'), ('Scholar', 'jfkafZcMjgIJ'), ('DBLP', 'conf/vldb/CosleyLP02'), ('DBLP', 'journals/sigmod/HummerLW02')})\n",
"('2018-12-22', {('DBLP', 'journals/sigmod/SilberschatzSU96'), ('Scholar', 'ckrgSn0vBOMJ'), ('Scholar', 'cIJQ0qxrkMIJ'), ('Scholar', 'ZnWLup8HMkUJ'), ('DBLP', 'journals/tods/StolboushkinT98'), ('Scholar', '-iaSLKFHwUkJ'), ('DBLP', 'journals/tods/FernandezKSMT02'), ('Scholar', 'soiN2U4tXykJ'), ('Scholar', 'x4HkJDEYFmYJ'), ('DBLP', 'journals/tods/FranklinCL97'), ('DBLP', 'conf/vldb/AgrawalS94'), ('DBLP', 'conf/sigmod/GibbonsM98')})\n",
"('2018-12-26', {('DBLP', 'conf/vldb/RothS97'), ('DBLP', 'journals/sigmod/DogacDKOONEHAKKM95'), ('DBLP', 'conf/sigmod/ZhangDWEMPMDR03'), ('Scholar', 'ek26aiEheesJ'), ('Scholar', '1hkVjoUg8hUJ'), ('Scholar', 'F2ecYx97F2sJ'), ('DBLP', 'journals/vldb/LiR99'), ('Scholar', 'rDObsYKVroMJ'), ('DBLP', 'conf/sigmod/AcharyaGPR99a'), ('Scholar', 'IkNOhDqEY18J'), ('Scholar', 'fXziEl_Htv8J'), ('Scholar', 'LxyVmHubIfUJ'), ('DBLP', 'conf/sigmod/FernandezFKLS97'), ('Scholar', 'qwjRkZuiMHsJ'), ('DBLP', 'conf/sigmod/NybergBCGL94'), ('DBLP', 'conf/sigmod/BreunigKKS01'), ('DBLP', 'conf/sigmod/LometW98'), ('DBLP', 'conf/vldb/MedianoCD94'), ('Scholar', 'Ko9e8CH2Si4J'), ('Scholar', 'DwwSuaisX5QJ'), ('Scholar', 'oAO74aolStoJ'), ('Scholar', 'jXvsW6VxbMYJ'), ('DBLP', 'conf/vldb/Brin95')})\n",
"('2018-12-29', {('Scholar', 'Ph7ZpmdNOPEJ'), ('Scholar', 'OmYc0wE1j4kJ'), ('DBLP', 'journals/sigmod/SouzaS99'), ('Scholar', 'rmtEGXAXHKIJ'), ('Scholar', '3M_0Kd8NNjgJ'), ('DBLP', 'conf/vldb/ChakravarthyKAK94'), ('DBLP', 'conf/sigmod/AdaliCPS96'), ('Scholar', 'tbZ0J3HLI18J'), ('DBLP', 'journals/sigmod/KappelR98'), ('DBLP', 'conf/vldb/MeccaCM01')})\n",
"('2018-12-25', {('Scholar', 'f1wgD54UUKwJ'), ('Scholar', 'RusJdYPDgQ4J'), ('Scholar', 'zkbTv93Zp1UJ'), ('Scholar', 'S8x6zjXc9oAJ'), ('Scholar', '0aJOXauNqYIJ'), ('Scholar', 'XFCkL9QhTjIJ'), ('DBLP', 'conf/vldb/SistlaYH94'), ('DBLP', 'conf/sigmod/HanKS97'), ('Scholar', 'xF8s5N7oUIMJ'), ('DBLP', 'journals/sigmod/FlorescuLM98'), ('Scholar', '_jl3bN2QlE4J'), ('Scholar', '0HlMHEPJRH4J'), ('DBLP', 'conf/sigmod/MamoulisP99'), ('DBLP', 'conf/sigmod/TatarinovVBSSZ02'), ('DBLP', 'conf/vldb/DeutschPT99'), ('DBLP', 'journals/tods/CliffordDIJS97'), ('DBLP', 'journals/vldb/HarrisR96'), ('DBLP', 'conf/sigmod/HuangSW94')})\n"
]
}
],
"source": [
"for idx, b in enumerate(block.key_set_adapter):\n",
"    if idx == 5:\n",
"        break\n",
"    print(b)"
]
},
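{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the effect of blocking concrete, here is a plain-Python illustration of the idea: group records by key, then pair up only records that share a key. It is a sketch of the concept under simplifying assumptions, not a description of how `HashBlockGenerator` is implemented. The helper `group_by_key` and the tiny record lists (ids and dates taken from the outputs above) are made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import defaultdict\n",
"from itertools import product\n",
"\n",
"# (id, date) tuples; a small hand-picked subset of the records shown above.\n",
"dblp_sample = [\n",
"    ('conf/vldb/Brin95', '2018-12-26'),\n",
"    ('conf/vldb/MedianoCD94', '2018-12-26'),\n",
"    ('conf/vldb/AgrawalS94', '2018-12-22'),\n",
"]\n",
"scholar_sample = [\n",
"    ('ek26aiEheesJ', '2018-12-26'),\n",
"    ('IkNOhDqEY18J', '2018-12-26'),\n",
"    ('D0z0BDnbnFcJ', '2018-12-27'),\n",
"]\n",
"\n",
"def group_by_key(records):\n",
"    # Hypothetical helper: map each blocking key to the ids that carry it.\n",
"    groups = defaultdict(list)\n",
"    for record_id, key in records:\n",
"        groups[key].append(record_id)\n",
"    return groups\n",
"\n",
"dblp_groups = group_by_key(dblp_sample)\n",
"scholar_groups = group_by_key(scholar_sample)\n",
"\n",
"# Only keys present on both sides can produce candidate pairs.\n",
"candidates = [\n",
"    pair\n",
"    for key in dblp_groups.keys() & scholar_groups.keys()\n",
"    for pair in product(dblp_groups[key], scholar_groups[key])\n",
"]\n",
"\n",
"print(len(candidates), 'candidate pairs instead of', len(dblp_sample) * len(scholar_sample))"
]
},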
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Pairwise comparison\n",
"\n",
"Now let's find the real matches among all candidate pairs.\n",
"\n",
"First, you need to decide how to measure the similarity of two records."
]
},
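{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"def is_pair(r1, r2):\n",
"    # Compare names position by position in sorted order; zip() stops at the shorter list.\n",
"    for n1, n2 in zip(sorted(r1.names), sorted(r2.names)):\n",
"        # Reject the pair when the edit distance exceeds a third of the shorter name's length.\n",
"        if rltk.levenshtein_distance(n1, n2) > min(len(n1), len(n2)) / 3:\n",
"            return False\n",
"    return True"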
]
},
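{
"cell_type": "markdown",
"metadata": {},
"source": [
"The threshold rule in `is_pair` can be tried out without RLTK. The sketch below uses a plain-Python edit distance as a stand-in for `rltk.levenshtein_distance` (assumed here to use unit costs for insert, delete and substitute) and works on name lists directly. The names `levenshtein` and `is_pair_names` are hypothetical helpers for this illustration. Because `zip` stops at the shorter list, a record with fewer names can still be accepted, which is consistent with pairs such as `['E Rahm', 'P Bernstein']` and `['E Rahm']` in the notebook's comparison output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def levenshtein(a, b):\n",
"    # Classic dynamic-programming edit distance with unit costs.\n",
"    previous = list(range(len(b) + 1))\n",
"    for i, ca in enumerate(a, start=1):\n",
"        current = [i]\n",
"        for j, cb in enumerate(b, start=1):\n",
"            current.append(min(\n",
"                previous[j] + 1,               # deletion\n",
"                current[j - 1] + 1,            # insertion\n",
"                previous[j - 1] + (ca != cb),  # substitution (free if equal)\n",
"            ))\n",
"        previous = current\n",
"    return previous[-1]\n",
"\n",
"def is_pair_names(names1, names2):\n",
"    # Same rule as is_pair, applied to plain lists of names.\n",
"    for n1, n2 in zip(sorted(names1), sorted(names2)):\n",
"        if levenshtein(n1, n2) > min(len(n1), len(n2)) / 3:\n",
"            return False\n",
"    return True\n",
"\n",
"# One typo in a nine-character name stays under the threshold.\n",
"print(is_pair_names(['R Agrawal', 'R Srikant'], ['R Sfikant', 'R Agrawal']))\n",
"# Only the first sorted name on each side is compared here.\n",
"print(is_pair_names(['E Rahm', 'P Bernstein'], ['E Rahm']))\n",
"# Unrelated names differ in most characters.\n",
"print(is_pair_names(['S Brin'], ['L Yang']))"
]
},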
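{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then run the comparison over the record pairs. Note that the call below does not pass the `block` built earlier, so it iterates over every DBLP/Scholar pair; to restrict the comparison to candidate pairs generated within blocks, pass `block=block` to `rltk.get_record_pairs`."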
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['W Hümmer', 'W Lehner', 'H Wedekind'] ['W Huemmer', 'W Lehner', 'H Wedekind']\n",
"['R Agrawal', 'R Srikant'] ['R Sfikant', 'R Agrawal']\n",
"['S Brin'] ['S Brin']\n",
"['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim'] ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'SK Kim']\n",
"['A Sistla', 'C Yu', 'R Haddad'] ['AP Sistla', 'CT Yu', 'R Haddad']\n",
"['E Pourabbas', 'M Rafanelli'] ['E Pourabbas', 'M Rafanelli']\n",
"['S Melnik', 'E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']\n",
"['S Melnik', 'E Rahm', 'P Bernstein'] ['E Rahm']\n",
"['L Libkin'] ['L Libkin']\n",
"['I Tatarinov', 'S Viglas', 'K Beyer', 'J Shanmugasundaram', 'E Shekita', 'C Zhang'] ['X Zhang']\n",
"['J Gray', 'G Graefe'] ['J Gray', 'G Graefe']\n",
"['D Florescu', 'A Levy', 'A Mendelzon'] ['F Levy']\n",
"['G Kappel', 'W Retschitzegger'] ['G Kappel', 'W Retschitzegger']\n",
"['I Tatarinov', 'Z Ives', 'A Halevy', 'D Weld'] ['I Tatarinov', 'ZG Ives', 'AY Halevy', 'DS Weld']\n",
"['A Silberschatz', 'M Stonebraker', 'J Ullman'] ['A Silberschatz', 'M Stonebraker', 'J Ullman']\n",
"['R Baeza-Yates', 'G Navarro'] ['R Baeza-Yates', 'G Navarro']\n",
"['P Buneman', 'L Raschid', 'J Ullman'] ['P Buneman', 'L Raschid', 'JD Ullman']\n",
"['K Böhm', 'T Rakow'] ['K Bohme', 'TC Rakow']\n",
"['H Darwen', 'C Date'] ['H Darwen', 'CJ Date']\n",
"['M Lee', 'M Kitsuregawa', 'B Ooi', 'K Tan', 'A Mondal'] ['ML Lee', 'M Kitsuregawa', 'BC Ooi', 'KL Tan', 'A Mondal']\n",
"['N Mamoulis', 'D Papadias'] ['N Mamoulis', 'D Papadias']\n",
"['S Acharya', 'P Gibbons', 'V Poosala', 'S Ramaswamy'] ['S Acharya', 'PB Gibbons']\n",
"['L Yang'] ['L Yang']\n",
"['G Manku', 'S Rajagopalan', 'B Lindsay'] ['GS Manku', 'S Rajagopalan', 'BG Lindsay']\n",
"['P Brown'] ['P Brown']\n",
"['D Lomet', 'G Weikum'] ['D Lomet', 'G Weikum']\n",
"['S Berchtold', 'D Keim'] ['S Berchtold', 'DA Keim']\n",
"['P Gibbons', 'Y Matias'] ['PB Gibbons', 'Y Matias']\n",
"['J Hellerstein', 'P Haas', 'H Wang'] ['JM Hellerstein', 'JP Haas', 'HJ Wang']\n",
"['J Hellerstein', 'P Haas', 'H Wang'] ['L Yang']\n",
"['B Adelberg', 'H Garcia-Molina', 'J Widom'] ['B Adelberg', 'H Garcia-Molina', 'J Widom']\n",
"['J Han', 'K Koperski', 'N Stefanovic'] ['K Koperski', 'J Han']\n",
"['D Simmen', 'E Shekita', 'T Malkemus'] ['DE Simmen', 'EJ Shekita', 'T Malkemus']\n",
"['M Fernandez', 'D Florescu', 'J Kang', 'A Levy', 'D Suciu'] ['F Levy']\n",
"['A Deutsch', 'L Popa', 'V Tannen'] ['A Deutsch', 'L Popa', 'V Tannen']\n",
"['K Mogi', 'M Kitsuregawa'] ['K Mogi', 'M Kitsuregawa']\n",
"['J Shanmugasundaram', 'K Tufte', 'C Zhang', 'G He', 'D DeWitt', 'J Naughton'] ['X Zhang']\n",
"['P Hung', 'H Yeung', 'K Karlapalem'] ['PCK Hung', 'HP Yeung', 'K Karlapalem']\n",
"['R Srikant', 'R Agrawal'] ['R Sfikant', 'R Agrawal']\n",
"['M Cherniack', 'S Zdonik'] ['M Chemiack', 'S Zdonik']\n",
"['G Gardarin', 'F Machuca', 'P Pucheral'] ['G Gardarin', 'F Machuca']\n",
"['T Griffin', 'L Libkin'] ['L Libkin']\n",
"['M Roth', 'P Schwarz'] ['PM Schwarz', 'MT Roth']\n",
"['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['F Levy']\n",
"['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['S Dar', 'HV Jagadish', 'AY Levy', 'D Srivastava']\n",
"['M Carey', 'D DeWitt'] ['MJ Carey', 'DJ DeWitt']\n",
"['K Sagonas', 'T Swift', 'D Warren'] ['K Sagonas', 'T Swift', 'DS Warren']\n",
"['V Raghavan'] ['V ay Raghavan']\n",
"['X Wang', 'M Cherniack'] ['X Wang', 'M Cherniack']\n",
"['M Petrovic', 'I Burcea', 'H Jacobsen'] ['M Petrovic', 'I Burcea', 'HA Jacobsen']\n",
"['S Raghavan', 'H Garcia-Molina'] ['H Garcia-Molina', 'S Raghavan']\n",
"['D Cosley', 'S Lawrence', 'D Pennock'] ['D Cosley', 'S Lawrence', 'DM Pennock']\n",
"['K Goldman', 'N Lynch'] ['KJ Goldman', 'N Lynch']\n",
"['S Guo', 'W Sun', 'M Weiss'] ['S Guo', 'W Sun', 'MA Weiss']\n",
"['W Litwin', 'M Neimat', 'D Schneider'] ['W Litwin', 'MA Neimat', 'DA Schneider']\n",
"['A Stolboushkin', 'M Taitslin'] ['AP Stolboushkin', 'MA Taitslin']\n",
"['V Verykios', 'G Moustakides', 'M Elfeky'] ['VS Verykios', 'GV Moustakides', 'MG Elfeky']\n",
"['C Lee', 'C Shih', 'Y Chen'] ['C Lee', 'CS Shih', 'YH Chen']\n",
"['E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']\n",
"['E Rahm', 'P Bernstein'] ['E Rahm']\n",
"['S Sarawagi'] ['S Sarawagi']\n",
"['E Harris', 'K Ramamohanarao'] ['EP Harris', 'K Ramamohanarao']\n",
"['D Barbará', 'T Imielinski'] ['D Barbara', 'T Imielinski']\n",
"['A Dan', 'P Yu', 'J Chung'] ['A Dan', 'PS Yu', 'JY Chung']\n",
"['B Hammond'] ['B Hammond']\n"
]
}
],
"source": [
"for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):\n",
"    if is_pair(r_dblp, r_scholar):\n",
"        print(r_dblp.names, r_scholar.names)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluation\n",
"\n",
"How well does the chosen strategy perform? RLTK has a built-in evaluation module for benchmarking.\n",
"\n",
"The first step is to label data to obtain the ground truth."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"gt = rltk.GroundTruth()\n",
"with open('resources/dblp_scholar_gt.csv') as f:\n",
"    for d in rltk.CSVReader(f):  # Python's csv reader could be used here instead\n",
"        gt.add_positive(d['idDBLP'], d['idScholar'])\n",
"gt.generate_all_negatives(ds_dblp, ds_scholar, range_in_gt=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`Trial` records all the results for further evaluation. It must be associated with a `GroundTruth`."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"precision: 0.8615384615384616 recall: 0.5894736842105263 f-measure: 0.7\n",
"tp: 56\n",
"fp: 9\n",
"tn: 8824\n",
"fn: 39\n"
]
}
],
"source": [
"trial = rltk.Trial(gt)\n",
"for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):\n",
"    if is_pair(r_dblp, r_scholar):\n",
"        trial.add_positive(r_dblp, r_scholar)\n",
"    else:\n",
"        trial.add_negative(r_dblp, r_scholar)\n",
"trial.evaluate()\n",
"print('precision:', trial.precision, 'recall:', trial.recall, 'f-measure:', trial.f_measure)\n",
"print('tp:', len(trial.true_positives_list))\n",
"print('fp:', len(trial.false_positives_list))\n",
"print('tn:', len(trial.true_negatives_list))\n",
"print('fn:', len(trial.false_negatives_list))"
]
}
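,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sanity check, the reported metrics can be recomputed from the confusion counts printed above. This sketch assumes the standard definitions of precision, recall and F-measure (the harmonic mean of the two); it does not call RLTK, and the variable names are chosen for this illustration only."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Counts copied from the trial output above.\n",
"tp, fp, fn = 56, 9, 39\n",
"\n",
"precision = tp / (tp + fp)  # share of predicted pairs that are correct\n",
"recall = tp / (tp + fn)     # share of true pairs that were found\n",
"f_measure = 2 * precision * recall / (precision + recall)\n",
"\n",
"print('precision:', precision, 'recall:', recall, 'f-measure:', f_measure)"
]
}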
],
"metadata": {
"kernelspec": {
"display_name": "rltk",
"language": "python",
"name": "rltk"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.2"
}
},
"nbformat": 4,
"nbformat_minor": 1
}