{ "cells": [ { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "# Record Linkage\n", "\n", "----\n", "\n", "This notebook will provide you with an instruction into Record Linkage using Python. Upon completion of this notebook you will be able to apply record linkage techniques using the `recordlinkage` package to combine data from different sources in Python. It will lead you through all the steps necessary for a sucessful record linkage starting with data preparation including pre-processing, cleaning and standardization of data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Principles of Record Linkage\n", "The goal of record linkage is to determine if pairs of records describe the same entity. For instance, this is important for removing duplicates from a data source or joining two separate data sources together. Record linkage also goes by the terms data matching, merge/purge, duplication detection, de-duping, reference matching, entity resolution, disambiguation, co-reference/anaphora in various fields.\n", "\n", "There are several approaches to record linkage that include \n", " - exact matching, \n", " - rule-based linking and \n", " - probabilistic linking. \n", "- An example of **exact matching** is joining records based on a direct identifier. This is what we have already done in SQL by joining tables.\n", "- **Rule-based matching** involves applying a cascading set of rules that reflect the domain knowledge of the records being linked. \n", "- In **probabilistic record linkages**, linkage weights are estimated to calculate the probability of a certain match.\n", "\n", "In practical applications you will need record linkage techniques to combine information addressing the same entity that is stored in different data sources. Record linkage will also help you to address the quality of different data sources. For example, if one of your databases has missing values you might be able to fill those by finding an identical pair in a different data source. Overall, the main applications of record linkage are\n", " 1. Merging two or more data files. \n", " 2. Identifying the intersection of the two data sets. \n", " 3. Updating data files (with the data row of the other data files) and imputing missing data.\n", " 4. Entity disambiguation and de-duplication." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Motivation: Linking Patents to University IPEDS code\n", "\n", "In this notebook we show an example of linking a subset of patent data to universities. In both datasets we have university name and location (city, state) that we can use for the linkage. The data that we will use are stored in the `university.db` database." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting Started with Record Linkage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Import of Packages\n", "Python provides us with some tools we can use for record linkages so we don't have to start from scratch and code our own linkage algorithms. So before we start we need to load the package **recordlinkage**. To fully function, this package uses other packages which also need to be imported. Thus we are adding more packages to the ones you are already familiar with." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# data manipulation and machine learning\n", "import pandas as pd\n", "import scipy\n", "import sklearn\n", "from sqlite3 import connect\n", "\n", "# record linkage package\n", "import recordlinkage as rl\n", "from recordlinkage.preprocessing import clean, phonenumbers, phonetic" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# to create a connection to the database, \n", "# we need to pass the name of the database \n", "\n", "DB = 'university.db'\n", "\n", "conn = connect(DB)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Getting Patents and IPEDS Data\n", "\n", "Before we get started on linking two datasets, we need to first bring in our datasets. We'll be linking data from two sources: `uspto_org_location` and `ipeds_location`. We'll do this by bringing in the appropriate tables from the database." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Specify the PatentsView data table 'uspto_org_location'\n", "\n", "query = '''\n", "SELECT *\n", "FROM uspto\n", "'''\n", "# Read it into a pandas dataframe\n", "\n", "uspto_org = pd.read_sql(query,conn)\n", "uspto_org = uspto_org[['assignee_id','assignee_organization','assignee_city','assignee_state']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View the table\n", "uspto_org.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the IPEDS university data\n", "\n", "query = '''\n", "SELECT *\n", "FROM ipeds\n", "'''\n", "# Read it into a pandas dataframe\n", "\n", "ipeds = pd.read_sql(query,conn)\n", "ipeds = ipeds[['unitid','instnm','city','stabbr']]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# View the table\n", "ipeds.head()" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "## The Importance of Pre-Processing\n", "Data pre-processing is an important step in a data analysis project in general, in record linkage applications in particular. The goal of pre-processing is to transform messy data into a dataset that can be used in a project workflow.\n", "\n", "Linking records from different data sources comes with different challenges that need to be addressed by the analyst. The analyst must determine whether or not two entities (individuals, businesses, geographical units) on two different files are the same. This determination is not always easy. In most of the cases there is no common uniquely identifing characteristic for an entity. For example, is Bob Miller from New York the same person as Bob Miller from Chicago in a given dataset? This determination has to be executed carefully because consequences of wrong linkages may be substantial (is person X the same person as the person X on the list of identified terrorists). Pre-processing can help to make better informed decisions.\n", "\n", "Pre-processing can be difficult because there are a lot of things to keep in mind. For example, data input errors, such as typos, misspellings, truncation, abbreviations, and missing values need to be corrected. The most common reason why matching projects fail is lack of time and resources for data cleaning. 
\n", "\n", "In the following section we will walk you through some pre-processing steps, these include but are not limited to removing spaces, parsing fields, and standardizing strings." ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true, "toc-nb-collapsed": true }, "source": [ "### Clean Patent Data\n", "We will start by cleaning and preprocessing the patent data. We need to remove whitespaces, make sure that everything is in lower case, and harmonize all the other information we need for the linkage. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The record linkage package comes with a built-in cleaning function we can use. The `clean()` function removes any characters such as '-', '.', '/', '\\', ':', brackets of all types, and also lowercases by default." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Cleaning names (using the record linkage package tool, see imports)\n", "\n", "uspto_org['assignee_organization'] = clean(uspto_org['assignee_organization'])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using `.str.replace()`, we can replace all instances of a white space in a name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Concatenate strings by removing white space\n", "uspto_org['assignee_organization'] = uspto_org['assignee_organization'].str.replace(' ','')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's view the finalized names in the patent data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we are done with the inital data prep work for the patent file. Please keep in mind that we just provided some examples for you to demonstrate the process. You can add as many further steps to it as necessary. " ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false }, "source": [ "### Phonetic Processing\n", "\n", "Sometimes, words or names are recorded differently because they are written down as they sound. This can result in failed matches, because the same institution or individual will technically have different written names, even though the names would sound identically when pronounced out loud. To avoid these issues, we will add one more thing: a soundex (a phonetic algorithm for indexing names by sound, as pronounced in English)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `phonetic()` function is used to convert strings into their corresponding phonetic codes. This is particularly useful when comparing names where different possible spellings make it difficult to find exact matches \n", "(e.g. Jillian and Gillian)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's add a column called `phonetic_name` to our existing table, which will contain the result of applying a `phonetic` function to the assignee organization name (the phonetic transcription of the name). We are using a method called NYSIIS - the New York State Identification and Intelligence System phonetic code. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org[\"phonetic_name\"] = phonetic(uspto_org[\"assignee_organization\"], method=\"nysiis\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

### Checkpoint 1: Pre-process IPEDS data
\n", "\n", "Let's do the same pre-processing steps for the IPEDS data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use function `clean()` from above on the column with the university name in the IPEDS file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds['instnm'] = clean(ipeds['instnm'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use function`.str.replace()` from above to replace all instances of white space." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds['instnm'] = ipeds['instnm'].str.replace(' ','')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the results with the organization names in the patent file." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find phonetic transcriptions of university names in the IPEDS table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds[\"phonetic_name\"] = phonetic(ipeds['instnm'], method='nysiis')" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Record Linkage\n", "\n", "We've done some basic pre-processing of the data, using some of the very useful functions in `recordlinkage.preprocessing`. Now, let's move on to the actual record linkage portion. Though we can dive right in with comparing two names and checking if they match, this process can actually have a lot of nuance to it. For example, you should consider how long this process will take if you have extremely large datasets, with millions and millions of rows to check against millions and millions of rows. In addition, you should consider how strict you want your matching to be. For example, you want to make sure you catch any typos or common misspellings, but want to avoid relaxing the match condition to the point that anything will match with anything." ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true, "toc-nb-collapsed": true }, "source": [ "### Indexing\n", "\n", "Indexing allows you to create candidate links, which basically means identifying pairs of data rows which might refer to the same real world entity. This is also called the comparison space (matrix). There are different ways to index data. The easiest is to create a full index and consider every pair a match. This is also the least efficient method, because we will be comparing every row of one dataset with every row of the other dataset.\n", "\n", "If we had 10,000 records in data frame A and 100,000 records in data frame B, we would have 1,000,000,000 candidate links. You can see that comparing over a full index is getting inefficient when working with big data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can do better if we actually include our knowledge about the data to eliminate bad link from the start. This can be done through blocking. The `recordlinkage` package gives you multiple options for this. 
For example, you can block by using variables, which means that only links exactly equal on the specified values will be kept. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we will block on city and state, to narrow down the number of candidate links." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to make sure that the column names that we want to block on are the same in both datasets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Which columns do we need to rename in both datasets, if we want to link on columns `city` and `state`?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Rename the patent dataset columns\n", "uspto_org = uspto_org.rename(columns={'assignee_city':'city'})\n", "uspto_org = uspto_org.rename(columns={'assignee_state':'state'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the IPEDS data, the city column already has the target name. Rename the column `stabbr` to `state`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds = ipeds.rename(columns={'stabbr':'state'})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that our datasets have the same column names, we can block on them." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "indexerBL = rl.BlockIndex(on=['city', 'state'])\n", "candidate_links = indexerBL.index(ipeds, uspto_org)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Returns pairs of candidate records by their index number in the respective tables\n", "candidate_links" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the first pair of candidate links blocked on city and state: (1, 264)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds.iloc[1]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org.iloc[264]" ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true, "toc-nb-collapsed": true }, "source": [ "### Record Comparison\n", "\n", "After you have created a set of candidate links, you’re ready to begin comparing the records associated with each candidate link. In `recordlinkage` you must initialize a Compare object prior to performing any comparison functionality between records. The code block below initializes the comparison object." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the compare object \n", "compare_cl = rl.Compare()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `Compare.string()` method generates a score based on well-known string-comparison algorithms. For this example, the Jaro-Winkler distance is used (it was developed specifically with record linkage applications in mind) - words with more characters in common have a higher Jaro-Winkler value than those with fewer characters in common. 
The output value is normalized to fall between 0 (completely dissimilar strings) and 1 (exact match on strings). (Information about other string-comparison methods is included in the References section below)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you remember, we already did exact matching on `city` and `state`, when we did the blocking above and created the candidate links." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will use the string method to compare the university names and their phonetic transcriptions." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to specify the respective columns with organization names in both datasets, the method, and the threshold. In this case, for all string pairs with a Jaro-Winkler similarity above 0.85, a 1 will be returned, and otherwise a 0." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the compare object \n", "compare_cl = rl.Compare()\n", "\n", "compare_cl.string('instnm','assignee_organization', method='jarowinkler', threshold=0.85,label='name')\n", "compare_cl.string('phonetic_name','phonetic_name', method='jarowinkler', threshold=0.85,label='phonetic_name')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The comparison of record pairs starts when the `compute` method is called. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## All attribute comparisons are stored in a DataFrame, with the features as columns and the record pairs as rows.\n", "\n", "features = compare_cl.compute(candidate_links, ipeds, uspto_org)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "features[features['name'] == 1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check how many record pairs we get where one or both of the comparison attributes match." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Simple classification: check for how many attributes the records are identical by summing the comparison results.\n", "features.sum(axis=1).value_counts().sort_index(ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can make a decision now, and consider as matches all those record pairs which, in our case, matched on both attributes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches = features[features.sum(axis=1) > 1] # Filter by those cases which matched on more than 1 attribute\n", "print(len(matches))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 280 record pairs which had an exact match on `city` and `state`, and a similarity above the 0.85 threshold on both the university name and the phonetic transcription of the name." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's merge these matches back to the original dataframes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our `matches` dataframe has a MultiIndex - two index levels to the left, which correspond to the `ipeds` table and the `uspto_org` table respectively."
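] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a side note: a vectorized alternative to the row-by-row loop we walk through below is to pull out a whole level of the MultiIndex at once with `.get_level_values()`. The cell below is just a minimal sketch of that shortcut; the step-by-step loop that follows shows the same logic more explicitly." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sketch of a vectorized alternative to the loop used below:\n", "# level 0 of the MultiIndex holds the ipeds row labels,\n", "# level 1 holds the uspto_org row labels\n", "ipeds_rows = ipeds.loc[matches.index.get_level_values(0)]\n", "uspto_rows = uspto_org.loc[matches.index.get_level_values(1)]\n", "print(len(ipeds_rows), len(uspto_rows)) # one row per matched pair"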
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access each matching pair individually, for example, the first one:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches.index[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also do the following: first, pull all the indices for the `ipeds` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matches.index[0][0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will pull all corresponding rows from the `ipeds` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds_results = [] # Create an empty list\n", "\n", "for match in matches.index: # For every pair in matches (index)\n", " df = pd.DataFrame(ipeds.loc[[match[0]]]) # Get the location in the original table, convert to dataframe\n", " ipeds_results.append(df) # Append to a list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds_results[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we concatenate the list of dataframes into one dataframe." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds_concat = pd.concat(ipeds_results) " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds_concat.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We do the same for the `uspto` table." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_results = [] # Create an empty list\n", "\n", "for i in matches.index: # For every pair in matches (index)\n", " df = pd.DataFrame(uspto_org.loc[[i[1]]]) # Get the location in the original table, convert to dataframe\n", " uspto_results.append(df) # Append to a list\n", "\n", "uspto_concat = pd.concat(uspto_results) # Concatenate into one dataframe" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_concat.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we need to combine two tables on the index - notice that our tables right now have indices from the original tables. We can reset the index using `.reset_index()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds_concat = ipeds_concat.reset_index()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_concat = uspto_concat.reset_index()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our tables have the same index on which we can combine two tables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ipeds_concat.head()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_concat.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we concatenate these two tables using `.concat()`." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matched = pd.concat([ipeds_concat,uspto_concat],axis=1) # Specify axis=1 to concatenate horizontally" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "matched.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that have merged our matches together, examine them. Remember that we matched our strings on 85% similarity and we blocked on city and state - that's why in our table we can see a match in row 3 between `universityofmobile` and `university of south alabama`, as they are from the same city and state, and the names have 85% in similarity, according to the Jaro-Winkler distance.\n", "Try using a different threshold. You can also use a different string-matching algorithm (please see below in the References)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

### Checkpoint 2: Record Linkage Decisions
\n", "\n", "What are some decisions we had to make as we went through the record linkage process above? What if we had made different choices instead? Try doing the record linkage with a few different options and see how many matches you get as you vary the approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For example, you can try Levenshtein distance in the string-matching part of the notebook. To see all available methods, search for the `recordlinkage` package in Python documentation and navigate to the section called `Comparison`, which lists available methods in string matching." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Fellegi Sunter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's do this with a machine learning classifier. Supervised learning algorithms need training data. Training data is data for which the true match status is known for each comparison vector. \n", "\n", "In the example in this section, we will consider true matches those where we block (find an exact match) on the university name, city, and state. \n", "\n", "We will need to rename the columns with organization names, so they match on both datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "uspto_org = uspto_org.rename(columns={'assignee_organization':'organization'})\n", "ipeds = ipeds.rename(columns={'instnm':'organization'})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's consider these the true matches\n", "indexerBL = rl.BlockIndex(on=['organization','city', 'state'])\n", "true_matches = indexerBL.index(ipeds, uspto_org)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's see how many true matches we get\n", "len(true_matches)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's use the features of the first 100,000 features from above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Generate Training Data and index\n", "ml_pairs = features[0:100000]\n", "ml_matches_index = ml_pairs.index & true_matches" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "len(ml_matches_index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " The Naive Bayes classifier is a probabilistic classifier. The probabilistic record linkage framework by Fellegi and Sunter (1969) is the most well-known probabilistic classification method for record linkage. Later, it was proved that the Fellegi and Sunter method is mathematically equivalent to the Naive Bayes method in case of assuming independence between comparison variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Train the classifier\n", "nb = rl.NaiveBayesClassifier()\n", "nb.learn(ml_pairs, ml_matches_index)\n", "\n", "## Predict the match status for all record pairs\n", "result_nb = nb.predict(features)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's see how many matches were predicted by a classifier\n", "len(result_nb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Evaluation\n", "\n", "The last step is to evaluate the results of the record linkage. We will cover this in more detail in the machine learning session. This is just for completeness." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Confusion matrix - we include the total number of true matches, the predicted matches, and the total number of records to predict on\n", "conf_nb = rl.confusion_matrix(true_matches, result_nb, len(features))\n", "conf_nb" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Precision and Accuracy\n", "precision = rl.precision(conf_nb)\n", "accuracy = rl.accuracy(conf_nb)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## Precision and Accuracy\n", "print(precision)\n", "print(accuracy)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "## The F-score for this classification is\n", "rl.fscore(conf_nb)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **Optional**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Regular Expressions - regex\n", "\n", "We can extract information from strings by using regex search commands.\n", "\n", "When defining a regular expression search pattern, it is a good idea to start out by writing down, explicitly, in plain English, what you are trying to search for and exactly how you identify when you've found a match.\n", "For example, if we look at an author field formatted as \" , \", in plain English, this is how I would explain where to find the last name: \"starting from the beginning of the line, take all the characters until you see a comma.\"\n", "\n", "We can build a regular expression that captures this idea from the following components:\n", "- ^ Matches beginning of the line\n", "- . Matches any character\n", "- .+ A modifier that means \"match one or more of the preceding expression\"\n", "\n", "In a regular expression, there are special reserved characters and character classes like those in the list above. Anything that is not a special character or class is just looked for explicitly (for example, a comma is not a special character in regular expressions, so if it is in a regular expression pattern, the regular expression processor will just be looking for a comma in the string, at that point in the pattern).\n", "\n", "Note: if you want to actually look for one of these reserved characters, it must be escaped, so that, for example, the expression looks for a literal period, rather than the special regular expression meaning of a period. To escape a reserved character in a regular expression, precede it with a back slash ( \".\" ).\n", "This results in the regular expression: ^.+,\n", "\n", "We start at the beginning of the line ( \"^\" ), matching any characters ( \".+\" ) until we come to the literal character of a comma ( \",\" ).\n", "\n", "In python, to use a regular expression like this to search for matches in a given string, we use the built-in \"re\" package ( https://docs.python.org/2/library/re.html ), specifically the \"re.search()\" method. To use \"re.search()\", pass it first the regular expression you want to use to search, enclosed in quotation marks, and then the string you want to search within. \n", "\n", "#### REGEX CHEATSHEET\n", "\n", "\n", " - abc... Letters\n", " - 123... Digits\n", " - \\d Any Digit\n", " - \\D Any non-Digit Character\n", " - . Any Character\n", " - \\. 
Period\n", " - [a,b,c] Only a, b or c\n", " - [^a,b,c] Not a,b, or c\n", " - [a-z] Characters a to z\n", " - [0-9] Numbers 0 to 9\n", " - \\w any Alphanumeric chracter\n", " - \\W any non-Alphanumeric character\n", " - {m} m Repetitions\n", " - {m,n} m to n repetitions\n", " - * Zero or more repetitions\n", " - + One or more repetitions\n", " - ? Optional Character\n", " - \\s any Whitespace\n", " - \\S any non-Whitespace character\n", " - ^...$ Starts & Ends\n", " - (...) Capture Group\n", " - (a(bc)) Capture sub-Group\n", " - (.*) Capture All\n", " - (abc|def) Capture abc or def\n", " \n", "#### EXAMPLES\n", " - (\\d\\d|\\D) will match 22X, 23G, 56H, etc...\n", " - \\w will match any characters between 0-9 or a-z\n", " - \\w{1-3} will match any alphanumeric character of a length of 1 to 3. " ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": true, "toc-nb-collapsed": true }, "source": [ "## References and Further Readings\n", "\n", "\n", "### Parsing\n", "\n", "* Python online documentation: https://docs.python.org/2/library/string.html#deprecated-string-functions\n", "* Python 2.7 Tutorial(Splitting and Joining Strings): http://www.pitt.edu/~naraehan/python2/split_join.html\n", "\n", "### Regular Expression\n", "\n", "* Python documentation: https://docs.python.org/2/library/re.html#regular-expression-syntax\n", "* Online regular expression tester (good for learning): http://regex101.com/\n", "\n", "### String Comparators\n", "\n", "* GitHub page of jellyfish: https://github.com/jamesturk/jellyfish\n", "* Different distances that measure the differences between strings:\n", " - Levenshtein distance: https://en.wikipedia.org/wiki/Levenshtein_distance\n", " - Damerau–Levenshtein distance: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance\n", " - Jaro–Winkler distance: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance\n", " - Hamming distance: https://en.wikipedia.org/wiki/Hamming_distance\n", " - Match rating approach: https://en.wikipedia.org/wiki/Match_rating_approach\n", "\n", "### Fellegi-Sunter Record Linkage \n", "\n", "* Introduction to Probabilistic Record Linkage: http://www.bristol.ac.uk/media-library/sites/cmm/migrated/documents/problinkage.pdf\n", "* Paper Review: https://www.cs.umd.edu/class/spring2012/cmsc828L/Papers/HerzogEtWires10.pdf\n", "\n" ] } ], "metadata": { "anaconda-cloud": {}, "hide_input": false, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }