{ "cells": [ { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# Step-by-step example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, you will go through an example end to end. Here are the main steps you will go through:\n", "\n", "- Dataset analysis\n", "- Construct RLTK datasets\n", "- Blocking\n", "- Pairwise comparison\n", "- Evaluation\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dataset analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset used here is an artificial dataset which contructed from DBLP and Scholar data. Let's take a look." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# initialization\n", "import os\n", "from datetime import datetime\n", "import pandas as pd\n", "from IPython.display import display\n", "import rltk" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
idnamesdate
0journals/sigmod/HummerLW02W Hümmer, W Lehner, H Wedekind2018-12-24
1conf/vldb/AgrawalS94R Agrawal, R Srikant2018-12-22
2conf/vldb/Brin95S Brin2018-12-26
3conf/vldb/ChakravarthyKAK94S Chakravarthy, V Krishnaprasad, E Anwar, S Kim2018-12-29
4conf/vldb/MedianoCD94M Mediano, M Casanova, M Dreux2018-12-26
\n", "
" ], "text/plain": [ " id \\\n", "0 journals/sigmod/HummerLW02 \n", "1 conf/vldb/AgrawalS94 \n", "2 conf/vldb/Brin95 \n", "3 conf/vldb/ChakravarthyKAK94 \n", "4 conf/vldb/MedianoCD94 \n", "\n", " names date \n", "0 W Hümmer, W Lehner, H Wedekind 2018-12-24 \n", "1 R Agrawal, R Srikant 2018-12-22 \n", "2 S Brin 2018-12-26 \n", "3 S Chakravarthy, V Krishnaprasad, E Anwar, S Kim 2018-12-29 \n", "4 M Mediano, M Casanova, M Dreux 2018-12-26 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_dblp = pd.read_csv('resources/dblp.csv', parse_dates=False)\n", "df_dblp.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
dateidnames
026, Dec 2018ek26aiEheesJM Fernandez, J Kang, A Levy, D Suciu
129, Dec 2018rmtEGXAXHKIJS Adali, KS Candan, Y Papakonstantinou, VS
227, Dec 2018D0z0BDnbnFcJS Christodoulakis
328, Dec 2018noTo81QxmHQJACMS Anthology, P Edition
428, Dec 2018l0W27c1C3NwJW Litwin, MA Neimat, DA Schneider
\n", "
" ], "text/plain": [ " date id names\n", "0 26, Dec 2018 ek26aiEheesJ M Fernandez, J Kang, A Levy, D Suciu\n", "1 29, Dec 2018 rmtEGXAXHKIJ S Adali, KS Candan, Y Papakonstantinou, VS \n", "2 27, Dec 2018 D0z0BDnbnFcJ S Christodoulakis\n", "3 28, Dec 2018 noTo81QxmHQJ ACMS Anthology, P Edition\n", "4 28, Dec 2018 l0W27c1C3NwJ W Litwin, MA Neimat, DA Schneider" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_scholar = pd.read_json('resources/scholar.jl', lines=True, convert_dates=False)\n", "df_scholar.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By a glance, it's easy to find out that both datasets have id, date and names.\n", "\n", "- Dates have different formats\n", "- Names columns contains many names separated by comma." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Construct RLTK datasets\n", "\n", "In RLTK, the data collection is named `Dataset` and each \"row\" is a `Record` instance. In order to construct a `Dataset`, you need to read data from source by a specific `Reader`, then the data is presented in a Python Dict `raw_object` which can be use to construct `Record` instance by the schema (concrete class of `Record`) you definded.\n", "\n", "For DBLP:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "class DBLP(rltk.Record):\n", " @property\n", " def id(self):\n", " return self.raw_object['id']\n", " \n", " @property\n", " def date(self):\n", " return self.raw_object['date']\n", " \n", " @property\n", " def names(self):\n", " return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "journals/sigmod/HummerLW02 2018-12-24 ['W Hümmer', 'W Lehner', 'H Wedekind']\n", "conf/vldb/AgrawalS94 2018-12-22 ['R Agrawal', 'R Srikant']\n", "conf/vldb/Brin95 2018-12-26 ['S Brin']\n", "conf/vldb/ChakravarthyKAK94 2018-12-29 ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim']\n", "conf/vldb/MedianoCD94 2018-12-26 ['M Mediano', 'M Casanova', 'M Dreux']\n", "conf/vldb/SistlaYH94 2018-12-25 ['A Sistla', 'C Yu', 'R Haddad']\n", "journals/sigmod/PourabbasR00 2018-12-20 ['E Pourabbas', 'M Rafanelli']\n", "conf/sigmod/MelnikRB03 2018-12-21 ['S Melnik', 'E Rahm', 'P Bernstein']\n", "conf/sigmod/ZhangDWEMPMDR03 2018-12-26 ['X Zhang', 'K Dimitrova', 'L Wang', 'M El-Sayed', 'B Murphy', 'B Pielech', 'M Mulchandani', 'L Ding', 'E Rundensteiner']\n", "conf/sigmod/ZhouWGGZWXYF03 2018-12-27 ['A Zhou', 'Q Wang', 'Z Guo', 'X Gong', 'S Zheng', 'H Wu', 'J Xiao', 'K Yue', 'W Fan']\n" ] } ], "source": [ "ds_dblp = rltk.Dataset(rltk.CSVReader('resources/dblp.csv'), record_class=DBLP)\n", "\n", "for r_dblp in ds_dblp.head():\n", " print(r_dblp.id, r_dblp.date, r_dblp.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For scholar:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "@rltk.remove_raw_object\n", "class Scholar(rltk.Record):\n", " @rltk.cached_property\n", " def id(self):\n", " return self.raw_object['id']\n", " \n", " @rltk.cached_property\n", " def date(self):\n", " return datetime.strptime(self.raw_object['date'], '%d, %b %Y').date().strftime('%Y-%m-%d')\n", " \n", " @rltk.cached_property\n", " def names(self):\n", " return list(map(lambda x: x.strip(), self.raw_object['names'].split(',')))" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ek26aiEheesJ 2018-12-26 ['M Fernandez', 'J Kang', 'A Levy', 'D Suciu']\n", "rmtEGXAXHKIJ 2018-12-29 ['S Adali', 'KS Candan', 'Y Papakonstantinou', 'VS']\n", "D0z0BDnbnFcJ 2018-12-27 ['S Christodoulakis']\n", "noTo81QxmHQJ 2018-12-28 ['ACMS Anthology', 'P Edition']\n", "l0W27c1C3NwJ 2018-12-28 ['W Litwin', 'MA Neimat', 'DA Schneider']\n", "IkNOhDqEY18J 2018-12-26 ['S Acharya', 'PB Gibbons']\n", "6QZGeKna5lgJ 2018-12-23 ['T Gri']\n", "XFCkL9QhTjIJ 2018-12-25 ['K Koperski', 'J Han']\n", "9Wo54Wyh_X8J 2018-12-23 ['H Garcia-Molina', 'S Raghavan']\n", "9uxj2XzGt9UJ 2018-12-28 ['M Flickner', 'H Sawhney', 'W Niblack', 'J Ashley', 'Q']\n" ] } ], "source": [ "ds_scholar = rltk.Dataset(rltk.JsonLinesReader('resources/scholar.jl'), record_class=Scholar)\n", "\n", "for r_scholar in ds_scholar.head():\n", " print(r_scholar.id, r_scholar.date, r_scholar.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Decorator `cached_property` means the property value will be pre-computed while generating the `Dataset`, it's especially useful to cache the value while the transformation of property is time consuming (e.g., tokenization, vectorization). `remove_raw_object` is used to release the space of `raw_object` after all properties are being cached.\n", "\n", "If you prefer to do data cleaning and manipulation in `pandas.Dataframe`, you can build `Dataset` from it easily." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ek26aiEheesJ 26, Dec 2018 M Fernandez, J Kang, A Levy, D Suciu\n", "rmtEGXAXHKIJ 29, Dec 2018 S Adali, KS Candan, Y Papakonstantinou, VS \n", "D0z0BDnbnFcJ 27, Dec 2018 S Christodoulakis\n", "noTo81QxmHQJ 28, Dec 2018 ACMS Anthology, P Edition\n", "l0W27c1C3NwJ 28, Dec 2018 W Litwin, MA Neimat, DA Schneider\n", "IkNOhDqEY18J 26, Dec 2018 S Acharya, PB Gibbons\n", "6QZGeKna5lgJ 23, Dec 2018 T Gri\n", "XFCkL9QhTjIJ 25, Dec 2018 K Koperski, J Han\n", "9Wo54Wyh_X8J 23, Dec 2018 H Garcia-Molina, S Raghavan\n", "9uxj2XzGt9UJ 28, Dec 2018 M Flickner, H Sawhney, W Niblack, J Ashley, Q \n" ] } ], "source": [ "# do data tranformation in df_scholar first, then:\n", "\n", "class Scholar2(rltk.AutoGeneratedRecord):\n", " pass\n", "\n", "ds_scholar2 = rltk.Dataset(rltk.DataFrameReader(df_scholar), record_class=Scholar2)\n", "\n", "for r_scholar2 in ds_scholar2.head():\n", " print(r_scholar2.id, r_scholar2.date, r_scholar2.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Blocking\n", "\n", "Blocking can be used to eliminate obvious impossible pairs then greatly reduce unnecessary comparisons.\n", "\n", "In this example, date is an ideal key for blocking." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "bg = rltk.HashBlockGenerator()\n", "block = bg.generate(\n", " bg.block(ds_dblp, property_='date'),\n", " bg.block(ds_scholar, property_='date')\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to know what's in a block aggregated by key, you can iterate on the `key_set_adapter` in block object. Block is stored in a concrete `KeySetAdapter` (default is `MemoryKeySetAdapter`)." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('2018-12-24', {('Scholar', 'BTalXWt3faUJ'), ('Scholar', 'bTYTn8VG5hIJ'), ('Scholar', 'sHJ914nPZtUJ'), ('DBLP', 'conf/sigmod/CherniackZ96'), ('Scholar', 'c9Humx2-EMgJ'), ('Scholar', 'YMcmy4FOXi8J'), ('Scholar', 'W1IcM8IUwAEJ'), ('DBLP', 'journals/sigmod/Yang94'), ('Scholar', 'wLNJcNvsulkJ'), ('DBLP', 'journals/sigmod/BohmR94'), ('DBLP', 'conf/sigmod/SimmenSM96'), ('DBLP', 'conf/sigmod/TatarinovIHW01'), ('DBLP', 'journals/vldb/BarbaraI95'), ('DBLP', 'conf/vldb/RohmBSS02'), ('Scholar', 'XVP8s4K0Bg4J'), ('Scholar', 'jfkafZcMjgIJ'), ('DBLP', 'conf/vldb/CosleyLP02'), ('DBLP', 'journals/sigmod/HummerLW02')})\n", "('2018-12-22', {('DBLP', 'journals/sigmod/SilberschatzSU96'), ('Scholar', 'ckrgSn0vBOMJ'), ('Scholar', 'cIJQ0qxrkMIJ'), ('Scholar', 'ZnWLup8HMkUJ'), ('DBLP', 'journals/tods/StolboushkinT98'), ('Scholar', '-iaSLKFHwUkJ'), ('DBLP', 'journals/tods/FernandezKSMT02'), ('Scholar', 'soiN2U4tXykJ'), ('Scholar', 'x4HkJDEYFmYJ'), ('DBLP', 'journals/tods/FranklinCL97'), ('DBLP', 'conf/vldb/AgrawalS94'), ('DBLP', 'conf/sigmod/GibbonsM98')})\n", "('2018-12-26', {('DBLP', 'conf/vldb/RothS97'), ('DBLP', 'journals/sigmod/DogacDKOONEHAKKM95'), ('DBLP', 'conf/sigmod/ZhangDWEMPMDR03'), ('Scholar', 'ek26aiEheesJ'), ('Scholar', '1hkVjoUg8hUJ'), ('Scholar', 'F2ecYx97F2sJ'), ('DBLP', 'journals/vldb/LiR99'), ('Scholar', 'rDObsYKVroMJ'), ('DBLP', 'conf/sigmod/AcharyaGPR99a'), ('Scholar', 'IkNOhDqEY18J'), ('Scholar', 'fXziEl_Htv8J'), ('Scholar', 'LxyVmHubIfUJ'), ('DBLP', 'conf/sigmod/FernandezFKLS97'), ('Scholar', 'qwjRkZuiMHsJ'), ('DBLP', 'conf/sigmod/NybergBCGL94'), ('DBLP', 'conf/sigmod/BreunigKKS01'), ('DBLP', 'conf/sigmod/LometW98'), ('DBLP', 'conf/vldb/MedianoCD94'), ('Scholar', 'Ko9e8CH2Si4J'), ('Scholar', 'DwwSuaisX5QJ'), ('Scholar', 'oAO74aolStoJ'), ('Scholar', 'jXvsW6VxbMYJ'), ('DBLP', 'conf/vldb/Brin95')})\n", "('2018-12-29', {('Scholar', 'Ph7ZpmdNOPEJ'), ('Scholar', 'OmYc0wE1j4kJ'), ('DBLP', 'journals/sigmod/SouzaS99'), ('Scholar', 'rmtEGXAXHKIJ'), ('Scholar', '3M_0Kd8NNjgJ'), ('DBLP', 'conf/vldb/ChakravarthyKAK94'), ('DBLP', 'conf/sigmod/AdaliCPS96'), ('Scholar', 'tbZ0J3HLI18J'), ('DBLP', 'journals/sigmod/KappelR98'), ('DBLP', 'conf/vldb/MeccaCM01')})\n", "('2018-12-25', {('Scholar', 'f1wgD54UUKwJ'), ('Scholar', 'RusJdYPDgQ4J'), ('Scholar', 'zkbTv93Zp1UJ'), ('Scholar', 'S8x6zjXc9oAJ'), ('Scholar', '0aJOXauNqYIJ'), ('Scholar', 'XFCkL9QhTjIJ'), ('DBLP', 'conf/vldb/SistlaYH94'), ('DBLP', 'conf/sigmod/HanKS97'), ('Scholar', 'xF8s5N7oUIMJ'), ('DBLP', 'journals/sigmod/FlorescuLM98'), ('Scholar', '_jl3bN2QlE4J'), ('Scholar', '0HlMHEPJRH4J'), ('DBLP', 'conf/sigmod/MamoulisP99'), ('DBLP', 'conf/sigmod/TatarinovVBSSZ02'), ('DBLP', 'conf/vldb/DeutschPT99'), ('DBLP', 'journals/tods/CliffordDIJS97'), ('DBLP', 'journals/vldb/HarrisR96'), ('DBLP', 'conf/sigmod/HuangSW94')})\n" ] } ], "source": [ "for idx, b in enumerate(block.key_set_adapter):\n", " if idx == 5: break\n", " print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pairwise comparison\n", "\n", "Now let's find out real pairs in all candidate pairs.\n", "\n", "First of all, you need to figure out how to measure two records." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "def is_pair(r1, r2):\n", " for n1, n2 in zip(sorted(r1.names), sorted(r2.names)):\n", " if rltk.levenshtein_distance(n1, n2) > min(len(n1), len(n2)) / 3:\n", " return False\n", " return True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then, make comparison on all candidate pairs (generated within blocks)." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['W Hümmer', 'W Lehner', 'H Wedekind'] ['W Huemmer', 'W Lehner', 'H Wedekind']\n", "['R Agrawal', 'R Srikant'] ['R Sfikant', 'R Agrawal']\n", "['S Brin'] ['S Brin']\n", "['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'S Kim'] ['S Chakravarthy', 'V Krishnaprasad', 'E Anwar', 'SK Kim']\n", "['A Sistla', 'C Yu', 'R Haddad'] ['AP Sistla', 'CT Yu', 'R Haddad']\n", "['E Pourabbas', 'M Rafanelli'] ['E Pourabbas', 'M Rafanelli']\n", "['S Melnik', 'E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']\n", "['S Melnik', 'E Rahm', 'P Bernstein'] ['E Rahm']\n", "['L Libkin'] ['L Libkin']\n", "['I Tatarinov', 'S Viglas', 'K Beyer', 'J Shanmugasundaram', 'E Shekita', 'C Zhang'] ['X Zhang']\n", "['J Gray', 'G Graefe'] ['J Gray', 'G Graefe']\n", "['D Florescu', 'A Levy', 'A Mendelzon'] ['F Levy']\n", "['G Kappel', 'W Retschitzegger'] ['G Kappel', 'W Retschitzegger']\n", "['I Tatarinov', 'Z Ives', 'A Halevy', 'D Weld'] ['I Tatarinov', 'ZG Ives', 'AY Halevy', 'DS Weld']\n", "['A Silberschatz', 'M Stonebraker', 'J Ullman'] ['A Silberschatz', 'M Stonebraker', 'J Ullman']\n", "['R Baeza-Yates', 'G Navarro'] ['R Baeza-Yates', 'G Navarro']\n", "['P Buneman', 'L Raschid', 'J Ullman'] ['P Buneman', 'L Raschid', 'JD Ullman']\n", "['K Böhm', 'T Rakow'] ['K Bohme', 'TC Rakow']\n", "['H Darwen', 'C Date'] ['H Darwen', 'CJ Date']\n", "['M Lee', 'M Kitsuregawa', 'B Ooi', 'K Tan', 'A Mondal'] ['ML Lee', 'M Kitsuregawa', 'BC Ooi', 'KL Tan', 'A Mondal']\n", "['N Mamoulis', 'D Papadias'] ['N Mamoulis', 'D Papadias']\n", "['S Acharya', 'P Gibbons', 'V Poosala', 'S Ramaswamy'] ['S Acharya', 'PB Gibbons']\n", "['L Yang'] ['L Yang']\n", "['G Manku', 'S Rajagopalan', 'B Lindsay'] ['GS Manku', 'S Rajagopalan', 'BG Lindsay']\n", "['P Brown'] ['P Brown']\n", "['D Lomet', 'G Weikum'] ['D Lomet', 'G Weikum']\n", "['S Berchtold', 'D Keim'] ['S Berchtold', 'DA Keim']\n", "['P Gibbons', 'Y Matias'] ['PB Gibbons', 'Y Matias']\n", "['J Hellerstein', 'P Haas', 'H Wang'] ['JM Hellerstein', 'JP Haas', 'HJ Wang']\n", "['J Hellerstein', 'P Haas', 'H Wang'] ['L Yang']\n", "['B Adelberg', 'H Garcia-Molina', 'J Widom'] ['B Adelberg', 'H Garcia-Molina', 'J Widom']\n", "['J Han', 'K Koperski', 'N Stefanovic'] ['K Koperski', 'J Han']\n", "['D Simmen', 'E Shekita', 'T Malkemus'] ['DE Simmen', 'EJ Shekita', 'T Malkemus']\n", "['M Fernandez', 'D Florescu', 'J Kang', 'A Levy', 'D Suciu'] ['F Levy']\n", "['A Deutsch', 'L Popa', 'V Tannen'] ['A Deutsch', 'L Popa', 'V Tannen']\n", "['K Mogi', 'M Kitsuregawa'] ['K Mogi', 'M Kitsuregawa']\n", "['J Shanmugasundaram', 'K Tufte', 'C Zhang', 'G He', 'D DeWitt', 'J Naughton'] ['X Zhang']\n", "['P Hung', 'H Yeung', 'K Karlapalem'] ['PCK Hung', 'HP Yeung', 'K Karlapalem']\n", "['R Srikant', 'R Agrawal'] ['R Sfikant', 'R Agrawal']\n", "['M Cherniack', 'S Zdonik'] ['M Chemiack', 'S Zdonik']\n", "['G Gardarin', 'F Machuca', 'P Pucheral'] ['G Gardarin', 'F Machuca']\n", "['T Griffin', 'L Libkin'] ['L Libkin']\n", "['M Roth', 'P Schwarz'] ['PM Schwarz', 'MT Roth']\n", "['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['F Levy']\n", "['D Srivastava', 'S Dar', 'H Jagadish', 'A Levy'] ['S Dar', 'HV Jagadish', 'AY Levy', 'D Srivastava']\n", "['M Carey', 'D DeWitt'] ['MJ Carey', 'DJ DeWitt']\n", "['K Sagonas', 'T Swift', 'D Warren'] ['K Sagonas', 'T Swift', 'DS Warren']\n", "['V Raghavan'] ['V ay Raghavan']\n", "['X Wang', 'M Cherniack'] ['X Wang', 'M Cherniack']\n", "['M Petrovic', 'I Burcea', 'H Jacobsen'] ['M Petrovic', 'I Burcea', 'HA Jacobsen']\n", "['S Raghavan', 'H Garcia-Molina'] ['H Garcia-Molina', 'S Raghavan']\n", "['D Cosley', 'S Lawrence', 'D Pennock'] ['D Cosley', 'S Lawrence', 'DM Pennock']\n", "['K Goldman', 'N Lynch'] ['KJ Goldman', 'N Lynch']\n", "['S Guo', 'W Sun', 'M Weiss'] ['S Guo', 'W Sun', 'MA Weiss']\n", "['W Litwin', 'M Neimat', 'D Schneider'] ['W Litwin', 'MA Neimat', 'DA Schneider']\n", "['A Stolboushkin', 'M Taitslin'] ['AP Stolboushkin', 'MA Taitslin']\n", "['V Verykios', 'G Moustakides', 'M Elfeky'] ['VS Verykios', 'GV Moustakides', 'MG Elfeky']\n", "['C Lee', 'C Shih', 'Y Chen'] ['C Lee', 'CS Shih', 'YH Chen']\n", "['E Rahm', 'P Bernstein'] ['S Melnik', 'E Rahm', 'PA Bernstein']\n", "['E Rahm', 'P Bernstein'] ['E Rahm']\n", "['S Sarawagi'] ['S Sarawagi']\n", "['E Harris', 'K Ramamohanarao'] ['EP Harris', 'K Ramamohanarao']\n", "['D Barbará', 'T Imielinski'] ['D Barbara', 'T Imielinski']\n", "['A Dan', 'P Yu', 'J Chung'] ['A Dan', 'PS Yu', 'JY Chung']\n", "['B Hammond'] ['B Hammond']\n" ] } ], "source": [ "for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):\n", " if is_pair(r_dblp, r_scholar):\n", " print(r_dblp.names, r_scholar.names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation\n", "\n", "How do I know the performance of the strategy that I use? Evaluation is a built-in module for benchmarking.\n", "\n", "The first step is to label data to get ground truth." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "gt = rltk.GroundTruth()\n", "with open('resources/dblp_scholar_gt.csv') as f:\n", " for d in rltk.CSVReader(f): # this can be replace to python csv reader\n", " gt.add_positive(d['idDBLP'], d['idScholar'])\n", "gt.generate_all_negatives(ds_dblp, ds_scholar, range_in_gt=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Trial` is used to records all the result for further evaluation. It needs to have an associated `GroundTruth`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "precison: 0.8615384615384616 recall: 0.5894736842105263 f-measure: 0.7\n", "tp: 56\n", "fp: 9\n", "tn: 8824\n", "fn: 39\n" ] } ], "source": [ "trial = rltk.Trial(gt)\n", "for r_dblp, r_scholar in rltk.get_record_pairs(ds_dblp, ds_scholar):\n", " if is_pair(r_dblp, r_scholar):\n", " trial.add_positive(r_dblp, r_scholar)\n", " else:\n", " trial.add_negative(r_dblp, r_scholar)\n", "trial.evaluate()\n", "print('precison:', trial.precision, 'recall:', trial.recall, 'f-measure:', trial.f_measure)\n", "print('tp:', len(trial.true_positives_list))\n", "print('fp:', len(trial.false_positives_list))\n", "print('tn:', len(trial.true_negatives_list))\n", "print('fn:', len(trial.false_negatives_list))" ] } ], "metadata": { "kernelspec": { "display_name": "rltk", "language": "python", "name": "rltk" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.2" } }, "nbformat": 4, "nbformat_minor": 1 }