Coverage for nltk.metrics.distance : 64%
# Natural Language Toolkit: Distance Metrics
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Edward Loper <edloper@gradient.cis.upenn.edu>
#         Steven Bird <sb@csse.unimelb.edu.au>
#         Tom Lippincott <tom@cs.columbia.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
#
Distance Metrics.
Compute the distance between two items (usually strings). As metrics, they must satisfy the following three requirements:
1. d(a, a) = 0
2. d(a, b) >= 0
3. d(a, c) <= d(a, b) + d(b, c)
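The three requirements can be spot-checked by brute force on a small sample. The following is a minimal sketch, not part of the module; the helper `satisfies_metric_axioms` and both example distance functions are hypothetical names introduced only for illustration.

```python
from itertools import product


def satisfies_metric_axioms(d, items):
    """Check the three requirements on every triple drawn from `items`."""
    for a, b, c in product(items, repeat=3):
        if d(a, a) != 0:                 # requirement 1
            return False
        if d(a, b) < 0:                  # requirement 2
            return False
        if d(a, c) > d(a, b) + d(b, c):  # requirement 3 (triangle inequality)
            return False
    return True


def abs_difference(a, b):
    return abs(a - b)


def squared_difference(a, b):
    return (a - b) ** 2


sample = [0, 1, 2, 5]
print(satisfies_metric_axioms(abs_difference, sample))      # True
print(satisfies_metric_axioms(squared_difference, sample))  # False
```

The squared difference fails on the triple (0, 1, 2): d(0, 2) = 4, while d(0, 1) + d(1, 2) = 2. A passing check on a sample is only evidence, not a proof, that a function is a metric.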
"""
""" Calculate the Levenshtein edit-distance between two strings. The edit distance is the number of characters that need to be substituted, inserted, or deleted, to transform s1 into s2. For example, transforming "rain" to "shine" requires three steps, consisting of two substitutions and one insertion: "rain" -> "sain" -> "shin" -> "shine". These operations could have been done in other orders, but at least three steps are needed.
:param s1, s2: The strings to be analysed
:type s1: str
:type s2: str
:rtype: int
"""
# set up a 2-D array
# iterate over the array
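The two comments above ("set up a 2-D array", "iterate over the array") suggest the usual dynamic-programming approach. The report does not show the function body, so the following is a sketch of the textbook algorithm under that assumption, not the module's actual code; `levenshtein_sketch` is a hypothetical name.

```python
def levenshtein_sketch(s1, s2):
    """Textbook dynamic-programming edit distance (illustrative only)."""
    rows, cols = len(s1) + 1, len(s2) + 1

    # set up a 2-D array: dist[i][j] is the distance between s1[:i] and s2[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters of s1[:i]
    for j in range(cols):
        dist[0][j] = j  # insert all j characters of s2[:j]

    # iterate over the array, filling each cell from its three neighbours
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                      # deletion
                dist[i][j - 1] + 1,                      # insertion
                dist[i - 1][j - 1] + substitution_cost,  # substitution or match
            )
    return dist[-1][-1]


print(levenshtein_sketch("rain", "shine"))  # 3
```

The result for "rain" and "shine" agrees with the three steps described in the docstring.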
"""Simple equality test.
0.0 if the labels are identical, 1.0 if they are different.
>>> from nltk.metrics import binary_distance
>>> binary_distance(1,1)
0.0

>>> binary_distance(1,3)
1.0
"""
else:
"""Distance metric comparing set-similarity.
"""
"""Distance metric that takes into account partial agreement when multiple labels are assigned.
>>> from nltk.metrics import masi_distance
>>> masi_distance(set([1,2]),set([1,2,3,4]))
0.5

Passonneau 2005, Measuring Agreement on Set-Valued Items (MASI) for Semantic and Pragmatic Annotation.
"""
"""Krippendorff'1 interval distance metric
>>> from nltk.metrics import interval_distance
>>> interval_distance(1,10)
81

Krippendorff 1980, Content Analysis: An Introduction to its Methodology
"""
# return pow(list(label1)[0]-list(label2)[0],2)
except:
    print("non-numeric labels not supported with interval distance")
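The doctest result (81 for the labels 1 and 10) and the commented-out `pow(..., 2)` line are both consistent with a squared difference. Here is a minimal sketch under that assumption; `squared_interval_distance` is a hypothetical name, and it raises an exception where the fragment above prints a message, which is a design choice of this sketch rather than the module's behaviour.

```python
def squared_interval_distance(label1, label2):
    """Squared difference between two numeric labels (illustrative only)."""
    try:
        return (label1 - label2) ** 2
    except TypeError:
        # Assumption: fail loudly instead of printing, so callers notice.
        raise ValueError("non-numeric labels not supported with interval distance")


print(squared_interval_distance(1, 10))  # 81
```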
"""Higher-order function to test presence of a given label
""" return lambda x,y: 1.0*((label in x) == (label in y))
return lambda x,y:abs((float(1.0/len(x)) - float(1.0/len(y))))*(label in x and label in y) or 0.0*(label not in x and label not in y) or abs((float(1.0/len(x))))*(label in x and label not in y) or ((float(1.0/len(y))))*(label not in x and label in y)
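Both lambdas above are closures over `label`: the outer function fixes the label, and the returned function compares two label collections. The sketch below wraps the simpler presence expression, copied from the `return` line shown earlier, in a hypothetical function `make_presence_test` to show how such a closure is called.

```python
def make_presence_test(label):
    """Return a function that scores agreement on the presence of `label`."""
    return lambda x, y: 1.0 * ((label in x) == (label in y))


agrees_on_a = make_presence_test("a")
print(agrees_on_a({"a", "b"}, {"a"}))  # 1.0, "a" is in both sets
print(agrees_on_a({"a", "b"}, {"c"}))  # 0.0, "a" is in only one set
print(agrees_on_a({"b"}, {"c"}))       # 1.0, "a" is in neither set
```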
data = {}
for l in open(file):
    labelA, labelB, dist = l.strip().split("\t")
    labelA = frozenset([labelA])
    labelB = frozenset([labelB])
    data[frozenset([labelA,labelB])] = float(dist)
return lambda x,y:data[frozenset([x,y])]
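The fragment above builds a lookup table from a tab-separated file with one `labelA<TAB>labelB<TAB>distance` triple per line. The following self-contained sketch shows the same idea with an in-memory file; `load_distance_table` and the sample labels are hypothetical. Unlike the fragment, it keys the table on plain labels rather than single-element frozensets.

```python
import io


def load_distance_table(lines):
    """Build a symmetric distance lookup from tab-separated lines."""
    table = {}
    for line in lines:
        label_a, label_b, dist = line.strip().split("\t")
        # A frozenset key makes the lookup independent of argument order.
        table[frozenset([label_a, label_b])] = float(dist)
    return lambda x, y: table[frozenset([x, y])]


# Made-up sample data standing in for a real distance file.
sample_file = io.StringIO("cat\tdog\t0.4\ncat\tcar\t0.9\n")
distance = load_distance_table(sample_file)
print(distance("cat", "dog"))  # 0.4
print(distance("dog", "cat"))  # 0.4
```

A pair that is missing from the table raises `KeyError`, so a table built this way has to list every pair it will be asked about.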
s1 = "rain"
s2 = "shine"
print("Edit distance between '%s' and '%s':" % (s1,s2), edit_distance(s1, s2))

s1 = set([1,2,3,4])
s2 = set([3,4,5])
print("s1:", s1)
print("s2:", s2)
print("Binary distance:", binary_distance(s1, s2))
print("Jaccard distance:", jaccard_distance(s1, s2))
print("MASI distance:", masi_distance(s1, s2))
demo()
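The demo prints a Jaccard distance for two sets, but the report does not show the body of `jaccard_distance`. The sketch below uses the standard definition, (|A ∪ B| − |A ∩ B|) / |A ∪ B|, and may differ in detail from the module's code; `jaccard_sketch` is a hypothetical name, and returning 0.0 for two empty sets is a choice made here, not something the source states.

```python
def jaccard_sketch(set1, set2):
    """Standard Jaccard distance between two sets (illustrative only)."""
    union = set1 | set2
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 0.0
    intersection = set1 & set2
    return (len(union) - len(intersection)) / len(union)


s1 = {1, 2, 3, 4}
s2 = {3, 4, 5}
print(jaccard_sketch(s1, s2))  # 0.6
```

For the demo's sets the union has five elements and the intersection two, giving (5 − 2) / 5 = 0.6.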