# Natural Language Toolkit: WordNet
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Steven Bethard <Steven.Bethard@colorado.edu>
#         Steven Bird <sb@csse.unimelb.edu.au>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
#         Nitin Madnani <nmadnani@ets.org>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
######################################################################
## Table of Contents
######################################################################
## - Constants
## - Data Classes
##    - WordNetError
##    - Lemma
##    - Synset
## - WordNet Corpus Reader
## - WordNet Information Content Corpus Reader
## - Similarity Metrics
## - Demo
######################################################################
## Constants
######################################################################
#: Positive infinity (for similarity functions)
_INF = 1e300
#{ Part-of-speech constants
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#}
#: A table of strings that are used to express verb frames.
VERB_FRAME_STRINGS = (
    None,
    "Something %s",
    "Somebody %s",
    "It is %sing",
    "Something is %sing PP",
    "Something %s something Adjective/Noun",
    "Something %s Adjective/Noun",
    "Somebody %s Adjective",
    "Somebody %s something",
    "Somebody %s somebody",
    "Something %s somebody",
    "Something %s something",
    "Something %s to somebody",
    "Somebody %s on something",
    "Somebody %s somebody something",
    "Somebody %s something to somebody",
    "Somebody %s something from somebody",
    "Somebody %s somebody with something",
    "Somebody %s somebody of something",
    "Somebody %s something on somebody",
    "Somebody %s somebody PP",
    "Somebody %s something PP",
    "Somebody %s PP",
    "Somebody's (body part) %s",
    "Somebody %s somebody to INFINITIVE",
    "Somebody %s somebody INFINITIVE",
    "Somebody %s that CLAUSE",
    "Somebody %s to somebody",
    "Somebody %s to INFINITIVE",
    "Somebody %s whether INFINITIVE",
    "Somebody %s somebody into V-ing something",
    "Somebody %s something with something",
    "Somebody %s INFINITIVE",
    "Somebody %s VERB-ing",
    "It %s that CLAUSE",
    "Something %s INFINITIVE",
)
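# Example (illustrative, not from the original source): each frame string
# carries a "%s" placeholder that is filled in with a lemma's form when a
# frame is rendered for a particular verb:
#
#     >>> VERB_FRAME_STRINGS[8]
#     'Somebody %s something'
#     >>> VERB_FRAME_STRINGS[8] % 'eats'
#     'Somebody eats something'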
######################################################################
## Data Classes
######################################################################
"""An exception class for wordnet-related errors."""
"""A common base class for lemmas and synsets."""
    def instance_hyponyms(self):
        return self._related('~i')

    def substance_holonyms(self):
        return self._related('#s')

    def part_holonyms(self):
        return self._related('#p')

    def member_meronyms(self):
        return self._related('%m')

    def substance_meronyms(self):
        return self._related('%s')

    def part_meronyms(self):
        return self._related('%p')

    def attributes(self):
        return self._related('=')

    def entailments(self):
        return self._related('*')

    def causes(self):
        return self._related('>')

    def also_sees(self):
        return self._related('^')

    def verb_groups(self):
        return self._related('$')
""" The lexical entry for a single morphological form of a sense-disambiguated word.
Create a Lemma from a "<word>.<pos>.<number>.<lemma>" string where: <word> is the morphological stem identifying the synset <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB <number> is the sense number, counting from 0. <lemma> is the morphological form of interest
Note that <word> and <lemma> can be different, e.g. the Synset 'salt.n.03' has the Lemmas 'salt.n.03.salt', 'salt.n.03.saltiness' and 'salt.n.03.salinity'.
Lemma attributes:
    - name: The canonical name of this lemma.
    - synset: The synset that this lemma belongs to.
    - syntactic_marker: For adjectives, the WordNet string identifying the
      syntactic position relative to the modified noun. See:
      http://wordnet.princeton.edu/man/wninput.5WN.html#sect10
      For all other parts of speech, this attribute is None.
Lemma methods:
Lemmas have the following methods for retrieving related Lemmas. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Lemmas:
    - antonyms
    - hypernyms, instance_hypernyms
    - hyponyms, instance_hyponyms
    - member_holonyms, substance_holonyms, part_holonyms
    - member_meronyms, substance_meronyms, part_meronyms
    - topic_domains, region_domains, usage_domains
    - attributes
    - derivationally_related_forms
    - entailments
    - causes
    - also_sees
    - verb_groups
    - similar_tos
    - pertainyms
    """
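    # Example session (an illustrative sketch, assuming the standard NLTK
    # WordNet corpus is installed; exact output may vary with the WordNet
    # version):
    #
    #     >>> from nltk.corpus import wordnet as wn
    #     >>> good = wn.lemma('good.a.01.good')
    #     >>> good.antonyms()
    #     [Lemma('bad.a.01.bad')]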
    # formerly _from_synset_info
    def __init__(self, wordnet_corpus_reader, synset, name,
                 lexname_index, lex_id, syntactic_marker):
    def _related(self, relation_symbol):
        get_synset = self._wordnet_corpus_reader._synset_from_pos_and_offset
        return [get_synset(pos, offset).lemmas[lemma_index]
                for pos, offset, lemma_index
                in self.synset._lemma_pointers[self.name, relation_symbol]]
"""Return the frequency count for this Lemma"""
"""Create a Synset from a "<lemma>.<pos>.<number>" string where: <lemma> is the word's morphological stem <pos> is one of the module attributes ADJ, ADJ_SAT, ADV, NOUN or VERB <number> is the sense number, counting from 0.
Synset attributes:
    - name: The canonical name of this synset, formed using the first
      lemma of this synset. Note that this may be different from the name
      passed to the constructor if that string used a different lemma to
      identify the synset.
    - pos: The synset's part of speech, matching one of the module level
      attributes ADJ, ADJ_SAT, ADV, NOUN or VERB.
    - lemmas: A list of the Lemma objects for this synset.
    - definition: The definition for this synset.
    - examples: A list of example strings for this synset.
    - offset: The offset in the WordNet dict file of this synset.
    - lexname: The name of the lexicographer file containing this synset.
Synset methods:
Synsets have the following methods for retrieving related Synsets. They correspond to the names for the pointer symbols defined here: http://wordnet.princeton.edu/man/wninput.5WN.html#sect3 These methods all return lists of Synsets.
    - hypernyms, instance_hypernyms
    - hyponyms, instance_hyponyms
    - member_holonyms, substance_holonyms, part_holonyms
    - member_meronyms, substance_meronyms, part_meronyms
    - attributes
    - entailments
    - causes
    - also_sees
    - verb_groups
    - similar_tos
Additionally, Synsets support the following methods specific to the hypernym relation:
    - root_hypernyms
    - common_hypernyms
    - lowest_common_hypernyms
Note that Synsets do not support the following relations because these are defined by WordNet as lexical relations:
    - antonyms
    - derivationally_related_forms
    - pertainyms
    """
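    # Example session (an illustrative sketch, assuming the standard NLTK
    # WordNet corpus; the order of returned synsets may vary with the
    # WordNet version):
    #
    #     >>> from nltk.corpus import wordnet as wn
    #     >>> dog = wn.synset('dog.n.01')
    #     >>> dog.hypernyms()
    #     [Synset('domestic_animal.n.01'), Synset('canine.n.02')]
    #     >>> dog.member_holonyms()
    #     [Synset('pack.n.06')]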
    # All of these attributes get initialized by
    # WordNetCorpusReader._synset_from_pos_and_line()
            return True
        else:

    def root_hypernyms(self):
        """Get the topmost hypernyms of this synset in WordNet."""

                    next_synset.instance_hypernyms()
                else:

        # Simpler implementation which makes incorrect assumption that
        # hypernym hierarchy is acyclic:
        #
        #     if not self.hypernyms():
        #         return [self]
        #     else:
        #         return list(set(root for h in self.hypernyms()
        #                         for root in h.root_hypernyms()))

    def max_depth(self):
        """
        :return: The length of the longest hypernym path from this
            synset to the root.
        """
else:
""" :return: The length of the shortest hypernym path from this synset to the root. """
else:
"""Return the transitive closure of source under the rel relationship, breadth-first
>>> from nltk.corpus import wordnet as wn >>> dog = wn.synset('dog.n.01') >>> hyp = lambda s:s.hypernyms() >>> list(dog.closure(hyp)) [Synset('domestic_animal.n.01'), Synset('canine.n.02'), Synset('animal.n.01'), Synset('carnivore.n.01'), Synset('organism.n.01'), Synset('placental.n.01'), Synset('living_thing.n.01'), Synset('mammal.n.01'), Synset('whole.n.02'), Synset('vertebrate.n.01'), Synset('object.n.01'), Synset('chordate.n.01'), Synset('physical_entity.n.01'), Synset('entity.n.01')]
"""
""" Get the path(s) from this synset to the root, where each path is a list of the synset nodes traversed on the way to the root.
        :return: A list of lists, where each list gives the node sequence
            connecting the initial ``Synset`` node and a root node.
        """
""" Find all synsets that are hypernyms of this synset and the other synset.
:type other: Synset :param other: other input synset. :return: The synsets that are hypernyms of both synsets. """ for self_synsets in self._iter_hypernym_lists() for self_synset in self_synsets) for other_synsets in other._iter_hypernym_lists() for other_synset in other_synsets)
"""Get the lowest synset that both synsets have as a hypernym."""
else:
""" Get the path(s) from this synset to the root, counting the distance of each node from the initial node on the way. A set of (synset, distance) tuples is returned.
        :type distance: int
        :param distance: the distance (number of edges) from this hypernym to
            the original hypernym ``Synset`` on which this method was called.
        :return: A set of ``(Synset, int)`` tuples where each ``Synset`` is
            a hypernym of the first ``Synset``.
        """
""" Returns the distance of the shortest path linking the two synsets (if one exists). For each synset, all the ancestor nodes and their distances are recorded and compared. The ancestor node common to both synsets that can be reached with the minimum number of traversals is used. If no ancestor nodes are common, None is returned. If a node is compared with itself 0 is returned.
        :type other: Synset
        :param other: The Synset to which the shortest path will be found.
        :return: The number of edges in the shortest path connecting the two
            nodes, or None if no path exists.
        """
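        # Example (illustrative; with WordNet 3.0, dog.n.01 and cat.n.01
        # meet at carnivore.n.01, two edges above each, so the shortest
        # path distance is 4):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
        #     >>> dog.shortest_path_distance(cat)
        #     4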
return 0
        # Transform each distance list into a dictionary. In cases where
        # there are duplicate nodes in the list (due to there being multiple
        # paths to the root) the duplicate with the shortest distance from
        # the original node is entered.
else:
        # For each ancestor synset common to both subject synsets, find the
        # connecting path length. Return the shortest of these.
""" >>> from nltk.corpus import wordnet as wn >>> dog = wn.synset('dog.n.01') >>> hyp = lambda s:s.hypernyms() >>> from pprint import pprint >>> pprint(dog.tree(hyp)) [Synset('dog.n.01'), [Synset('domestic_animal.n.01'), [Synset('animal.n.01'), [Synset('organism.n.01'), [Synset('living_thing.n.01'), [Synset('whole.n.02'), [Synset('object.n.01'), [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]], [Synset('canine.n.02'), [Synset('carnivore.n.01'), [Synset('placental.n.01'), [Synset('mammal.n.01'), [Synset('vertebrate.n.01'), [Synset('chordate.n.01'), [Synset('animal.n.01'), [Synset('organism.n.01'), [Synset('living_thing.n.01'), [Synset('whole.n.02'), [Synset('object.n.01'), [Synset('physical_entity.n.01'), [Synset('entity.n.01')]]]]]]]]]]]]]] """
elif cut_mark: tree += [cut_mark]
    # interface to similarity methods
    def path_similarity(self, other, verbose=False, simulate_root=True):
        """
        Path Distance Similarity: Return a score denoting how similar two
        word senses are, based on the shortest path that connects the senses
        in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0
        to 1, except in those cases where a path cannot be found (will only
        be true for verbs as there are many distinct verb taxonomies), in
        which case None is returned. A score of 1 represents identity i.e.
        comparing a sense with itself will return 1.
        :type other: Synset
        :param other: The ``Synset`` that this ``Synset`` is being compared to.
        :type simulate_root: bool
        :param simulate_root: The various verb taxonomies do not
            share a single root which disallows this metric from working for
            synsets that are not connected. This flag (True by default)
            creates a fake root that connects all the taxonomies. Set it to
            false to disable this behavior. For the noun taxonomy, there is
            usually a default root except for WordNet version 1.6. If you are
            using wordnet 1.6, a fake root will be added for nouns as well.
        :return: A score denoting the similarity of the two ``Synset``
            objects, normally between 0 and 1. None is returned if no
            connecting path could be found. 1 is returned if a ``Synset`` is
            compared with itself.
        """
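        # Example (illustrative; with WordNet 3.0 the dog/cat shortest path
        # distance is 4, so the score is 1 / (4 + 1)):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> wn.synset('dog.n.01').path_similarity(wn.synset('cat.n.01'))
        #     0.2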
else:
""" Leacock Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d is the taxonomy depth.
        :type other: Synset
        :param other: The ``Synset`` that this ``Synset`` is being compared to.
        :type simulate_root: bool
        :param simulate_root: The various verb taxonomies do not
            share a single root which disallows this metric from working for
            synsets that are not connected. This flag (True by default)
            creates a fake root that connects all the taxonomies. Set it to
            false to disable this behavior. For the noun taxonomy, there is
            usually a default root except for WordNet version 1.6. If you are
            using wordnet 1.6, a fake root will be added for nouns as well.
        :return: A score denoting the similarity of the two ``Synset``
            objects, normally greater than 0. None is returned if no
            connecting path could be found. If a ``Synset`` is compared with
            itself, the maximum score is returned, which varies depending on
            the taxonomy depth.
        """
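        # Example (illustrative; the score depends on the taxonomy depth, so
        # it varies with the WordNet version; with WordNet 3.0 the maximum
        # noun depth is 19, giving -log(5 / (2 * 19)) for the dog/cat pair):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
        #     >>> dog.lch_similarity(cat)  # doctest: +ELLIPSIS
        #     2.028...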
        if self.pos != other.pos:
            raise WordNetError('Computing the lch similarity requires ' + \
                               '%s and %s to have the same part of speech.' % \
                               (self, other))
else:
""" Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Previously, the scores computed by this implementation did _not_ always agree with those given by Pedersen's Perl implementation of WordNet Similarity. However, with the addition of the simulate_root flag (see below), the score for verbs now almost always agree but not always for nouns.
The LCS does not necessarily feature in the shortest path connecting the two senses, as it is by definition the common ancestor deepest in the taxonomy, not closest to the two senses. Typically, however, it will so feature. Where multiple candidates for the LCS exist, that whose shortest path to the root node is the longest will be selected. Where the LCS has multiple paths to the root, the longer path is used for the purposes of the calculation.
        :type other: Synset
        :param other: The ``Synset`` that this ``Synset`` is being compared to.
        :type simulate_root: bool
        :param simulate_root: The various verb taxonomies do not
            share a single root which disallows this metric from working for
            synsets that are not connected. This flag (True by default)
            creates a fake root that connects all the taxonomies. Set it to
            false to disable this behavior. For the noun taxonomy, there is
            usually a default root except for WordNet version 1.6. If you are
            using wordnet 1.6, a fake root will be added for nouns as well.
        :return: A float score denoting the similarity of the two ``Synset``
            objects, normally greater than zero. If no connecting path between
            the two senses can be found, None is returned.
        """
# If no LCS was found return None
        # Get the longest path from the LCS to the root,
        # including a correction:
        # - add one because the calculations include both the start and end
        #   nodes
        # Note: No need for an additional add-one correction for non-nouns
        # to account for an imaginary root node because that is now
        # automatically handled by simulate_root
        # if subsumer.pos != NOUN:
        #     depth += 1
        # Get the shortest path from the LCS to each of the synsets it is
        # subsuming.  Add this to the LCS path length to get the path
        # length from each synset to the root.

            return None
""" Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
        :type other: Synset
        :param other: The ``Synset`` that this ``Synset`` is being compared to.
        :type ic: dict
        :param ic: an information content object (as returned by
            ``nltk.corpus.wordnet_ic.ic()``).
        :return: A float score denoting the similarity of the two ``Synset``
            objects. Synsets whose LCS is the root node of the taxonomy will
            have a score of 0 (e.g. N['dog'][0] and N['table'][0]).
        """
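        # Example (illustrative; scores vary with the IC corpus used):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> from nltk.corpus import wordnet_ic
        #     >>> brown_ic = wordnet_ic.ic('ic-brown.dat')
        #     >>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
        #     >>> dog.res_similarity(cat, brown_ic)  # doctest: +ELLIPSIS
        #     7.911...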
""" Jiang-Conrath Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
        :type other: Synset
        :param other: The ``Synset`` that this ``Synset`` is being compared to.
        :type ic: dict
        :param ic: an information content object (as returned by
            ``nltk.corpus.wordnet_ic.ic()``).
        :return: A float score denoting the similarity of the two ``Synset``
            objects.
        """
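        # Example (illustrative; 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)) with
        # Brown-based IC values):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> from nltk.corpus import wordnet_ic
        #     >>> brown_ic = wordnet_ic.ic('ic-brown.dat')
        #     >>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
        #     >>> dog.jcn_similarity(cat, brown_ic)  # doctest: +ELLIPSIS
        #     0.449...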
return _INF
        # If either of the input synsets are the root synset, or have a
        # frequency of 0 (sparse data problem), return 0.
            return 0
""" Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
        :type other: Synset
        :param other: The ``Synset`` that this ``Synset`` is being compared to.
        :type ic: dict
        :param ic: an information content object (as returned by
            ``nltk.corpus.wordnet_ic.ic()``).
        :return: A float score denoting the similarity of the two ``Synset``
            objects, in the range 0 to 1.
        """
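        # Example (illustrative; 2 * IC(lcs) / (IC(s1) + IC(s2)) with
        # SemCor-based IC values):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> from nltk.corpus import wordnet_ic
        #     >>> semcor_ic = wordnet_ic.ic('ic-semcor.dat')
        #     >>> dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')
        #     >>> dog.lin_similarity(cat, semcor_ic)  # doctest: +ELLIPSIS
        #     0.886...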
""" :return: An iterator over ``Synset`` objects that are either proper hypernyms or instance of hypernyms of the synset. """ for synset in todo for hypernym in (synset.hypernyms() + \ synset.instance_hypernyms()) if hypernym not in seen]
######################################################################
## WordNet Corpus Reader
######################################################################
""" A corpus reader used to access wordnet or its variants. """
#{ Part-of-speech constants #}
#{ Filename constants #}
#{ Part of speech constants #}
    #: A list of file identifiers for all the fileids used by this
    #: corpus reader.
    _FILES = (
        'index.adj', 'index.adv', 'index.noun', 'index.verb',
        'data.adj', 'data.adv', 'data.noun', 'data.verb',
        'adj.exc', 'adv.exc', 'noun.exc', 'verb.exc',
    )
""" Construct a new wordnet corpus reader, with the given root directory. """ encoding=self._ENCODING)
"""A index that provides the file offset
Map from lemma -> pos -> synset_index -> offset"""
"""A cache so we don't have to reconstuct synsets
Map from pos -> offset -> synset"""
"""A lookup for the maximum depth of each part of speech. Useful for the lch similarity metric. """
# Load the lexnames
# Load the indices for lemmas and synset offsets
# load the exception file data into memory
# parse each line of the file (ignoring comment lines)
# get the lemma and part-of-speech
# get the number of synsets for this lemma
# get the pointer symbols for all synsets of this lemma
# same as number of synsets
# get number of senses ranked according to frequency
# get synset offsets
            # raise more informative error with file name and line number
            except (AssertionError, ValueError) as e:
                tup = ('index.%s' % suffix), (i + 1), e
                raise WordNetError('file %s, line %i: %s' % tup)
# map lemmas and parts of speech to synsets
# load the exception file data into memory
""" Compute the max depth for the given part of speech. This is used by the lch similarity metric. """ except RuntimeError: print(ii)
            version = match.group(1)
            fh.seek(0)
            return version
    #////////////////////////////////////////////////////////////
    # Loading Lemmas
    #////////////////////////////////////////////////////////////

            raise WordNetError('no lemma %r in %r' % (lemma_name, synset_name))
# Keys are case sensitive and always lower-case
# open the key -> synset file if necessary
        # Find the synset for the lemma.
            raise WordNetError("No synset found for key %r" % key)
        # return the corresponding lemma
            raise WordNetError("No lemma found for key %r" % key)
    #////////////////////////////////////////////////////////////
    # Loading Synsets
    #////////////////////////////////////////////////////////////

        # split name into lemma, part of speech and synset number
        # get the offset for this synset
        except KeyError:
            message = 'no lemma %r with part of speech %r'
            raise WordNetError(message % (lemma, pos))
        except IndexError:
            n_senses = len(self._lemma_pos_offset_map[lemma][pos])
            message = "lemma %r with part of speech %r has only %i %s"
            if n_senses == 1:
                tup = lemma, pos, n_senses, "sense"
            else:
                tup = lemma, pos, n_senses, "senses"
            raise WordNetError(message % tup)
# load synset information from the appropriate file
        # some basic sanity checks on loaded attributes
            message = ('adjective satellite requested but only plain '
                       'adjective found for lemma %r')
            raise WordNetError(message % lemma)
# Return the synset object.
""" Return an open file pointer for the data file for the given part of speech. """
# Check to see if the synset is in the cache
# Construct a new (empty) synset.
# parse the entry for this synset
        # parse out the definitions and examples from the gloss
            else:
# split the other info into fields
# get the offset
# determine the lexicographer file name
# get the part of speech
            # create Lemma objects for each lemma
                # get the lemma name
                # get the lex_id (used for sense_keys)
                # If the lemma has a syntactic marker, extract it.
                # create the lemma object
                              lex_id, syn_mark)
            # collect the pointer tuples
                else:
            # read the verb frames
            else:
                # read the plus sign
                # read the frame and lemma number
                # lemma number of 00 means all words in the synset
                                                 lemma.name)
                # only a specific word in the synset
                else:
                                                     lemma.name)
        # raise a more informative error with line text
        except ValueError as e:
            raise WordNetError('line %r: %s' % (data_file_line, e))
        # set sense keys for Lemma objects - note that this has to be
        # done afterwards so that the relations are available
            else:
                                 lemma._lexname_index, lemma._lex_id,
                                 head_name, head_id)
# the canonical name is based on the first lemma
    #////////////////////////////////////////////////////////////
    # Retrieve synsets and lemmas.
    #////////////////////////////////////////////////////////////

    def synsets(self, lemma, pos=None):
        """Load all synsets with a given lemma and part of speech tag.
        If no pos is specified, all synsets for all parts of speech
        will be loaded.
        """
                for p in pos
                for form in self._morphy(lemma, p)
                for offset in index[form].get(p, [])]
"""Return all Lemma objects with a name matching the specified lemma name and part of speech tag. Matches any part of speech tag if none is specified.""" for synset in self.synsets(lemma, pos) for lemma_obj in synset.lemmas if lemma_obj.name == lemma]
"""Return all lemma names for all synsets for the given part of speech tag. If pos is not specified, all synsets for all parts of speech will be used. """ if pos is None: return iter(self._lemma_pos_offset_map) else: return (lemma for lemma in self._lemma_pos_offset_map if pos in self._lemma_pos_offset_map[lemma])
"""Iterate over all synsets with a given part of speech tag. If no pos is specified, all synsets for all parts of speech will be loaded. """ else:
        # generate all synsets for each part of speech
            # Open the file for reading.  Note that we cannot re-use
            # the file pointers from self._data_file_map here, because
            # we're defining an iterator, and those file pointers might
            # be moved while we're not looking.
                pos_tag = ADJ
            # generate synsets for each line in the POS file
                # See if the synset is cached
                else:
                    # Otherwise, parse the line
                # adjective satellites are in the same file as
                # adjectives so only yield the synset if it's actually
                # a satellite
                    if synset.pos == pos_tag:
                        yield synset
                # for all other POS tags, yield all synsets (this means
                # that adjectives also include adjective satellites)
                else:
            # close the extra file handle we opened
            else:
    #////////////////////////////////////////////////////////////
    # Misc
    #////////////////////////////////////////////////////////////

    def lemma_count(self, lemma):
        """Return the frequency count for this Lemma"""
        # open the count file if we haven't already
        # find the key in the counts file and return the count
        else:
    def res_similarity(self, synset1, synset2, ic, verbose=False):
        return synset1.res_similarity(synset2, ic, verbose)

    def jcn_similarity(self, synset1, synset2, ic, verbose=False):
        return synset1.jcn_similarity(synset2, ic, verbose)

    def lin_similarity(self, synset1, synset2, ic, verbose=False):
        return synset1.lin_similarity(synset2, ic, verbose)
    #////////////////////////////////////////////////////////////
    # Morphy
    #////////////////////////////////////////////////////////////
    # Morphy, adapted from Oliver Steele's pywordnet
    def morphy(self, form, pos=None):
        """
        Find a possible base form for the given form, with the given
        part of speech, by checking WordNet's list of exceptional
        forms, and by recursively stripping affixes for this part of
        speech until a form in WordNet is found.
        >>> from nltk.corpus import wordnet as wn
        >>> wn.morphy('dogs')
        'dog'
        >>> wn.morphy('churches')
        'church'
        >>> wn.morphy('aardwolves')
        'aardwolf'
        >>> wn.morphy('abaci')
        'abacus'
        >>> wn.morphy('hardrock', wn.ADV)
        >>> wn.morphy('book', wn.NOUN)
        'book'
        >>> wn.morphy('book', wn.ADJ)
        """
else:
        # get the first one we find
        else:
    MORPHOLOGICAL_SUBSTITUTIONS = {
        NOUN: [('s', ''), ('ses', 's'), ('ves', 'f'), ('xes', 'x'),
               ('zes', 'z'), ('ches', 'ch'), ('shes', 'sh'),
               ('men', 'man'), ('ies', 'y')],
        VERB: [('s', ''), ('ies', 'y'), ('es', 'e'), ('es', ''),
               ('ed', 'e'), ('ed', ''), ('ing', 'e'), ('ing', '')],
        ADJ: [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
        ADV: []}
        # from jordanbg:
        # Given an original string x
        # 1. Apply rules once to the input to get y1, y2, y3, etc.
        # 2. Return all that are in the database
        # 3. If there are no matches, keep applying rules until you either
        #    find a match or you can't go any further
                     for form in forms
                     for old, new in substitutions
                     if form.endswith(old)]
# 0. Check the exception lists
# 1. Apply rules once to the input to get y1, y2, y3, etc.
# 2. Return all that are in the database (and check the original too)
# 3. If there are no matches, keep applying rules until we find a match
# Return an empty list if we can't find anything
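        # A minimal self-contained sketch of steps 0-3 above (illustrative
        # only; the toy `exceptions` dict and `lexicon` set stand in for
        # WordNet's exception files and index, and real lookups go through
        # self._morphy):
        #
        #     def morphy_sketch(form, substitutions, exceptions, lexicon):
        #         if form in exceptions:              # 0. exception lists
        #             return exceptions[form]
        #         forms = [form]
        #         while forms:
        #             # 1. apply each (suffix, replacement) rule once
        #             forms = [f[:-len(old)] + new
        #                      for f in forms
        #                      for old, new in substitutions
        #                      if f.endswith(old)]
        #             # 2./3. return the first candidate in the database,
        #             # else keep applying rules to the new candidates
        #             for f in forms:
        #                 if f in lexicon:
        #                     return f
        #         return None                         # nothing found
        #
        #     >>> subs = [('s', ''), ('ches', 'ch')]
        #     >>> morphy_sketch('churches', subs, {'geese': 'goose'},
        #     ...               {'church', 'goose'})
        #     'church'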
    #////////////////////////////////////////////////////////////
    # Create information content from corpus
    #////////////////////////////////////////////////////////////
    def ic(self, corpus, weight_senses_equally=False, smoothing=1.0):
        """
        Creates an information content lookup dictionary from a corpus.
        :type corpus: CorpusReader
        :param corpus: The corpus from which we create an information
            content dictionary.
        :type weight_senses_equally: bool
        :param weight_senses_equally: If this is True, gives all
            possible senses equal weight rather than dividing by the
            number of possible senses.  (If a word has 3 senses, each
            sense gets 0.3333 per appearance when this is False, 1.0 when
            it is true.)
        :param smoothing: How much do we smooth synset counts (default is 1.0)
        :type smoothing: float
        :return: An information content dictionary
        """
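        # Example (illustrative; mirrors standard NLTK usage, building IC
        # values from the genesis corpus with no smoothing):
        #
        #     >>> from nltk.corpus import wordnet as wn
        #     >>> from nltk.corpus import genesis
        #     >>> genesis_ic = wn.ic(genesis, False, 0.0)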
# Initialize the counts with the smoothing value
# Distribute weight among possible synsets
# Add the weight to the root
######################################################################
## WordNet Information Content Corpus Reader
######################################################################
""" A corpus reader for the WordNet information content corpus. """
    # this load function would be more efficient if the data was pickled
    # Note that we can't use NLTK's frequency distributions because
    # synsets are overlapping (each instance of a synset also counts
    # as an instance of its hypernyms)
    def ic(self, icfile):
        """
        Load an information content file from the wordnet_ic corpus
        and return a dictionary.  This dictionary has just two keys,
        NOUN and VERB, whose values are dictionaries that map from
        synsets to information content values.
        :type icfile: str
        :param icfile: The name of the wordnet_ic file (e.g. "ic-brown.dat")
        :return: An information content dictionary
        """
        # Store root count.
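        # Example (illustrative; loads the Brown-based IC file shipped
        # with the wordnet_ic corpus):
        #
        #     >>> from nltk.corpus import wordnet_ic
        #     >>> brown_ic = wordnet_ic.ic('ic-brown.dat')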
######################################################################
# Similarity metrics
######################################################################
# TODO: Add in the option to manually add a new root node; this will be
# useful for verb similarity as there exist multiple verb taxonomies.
# More information about the metrics is available at
# http://marimba.d.umn.edu/similarity/measures.html
def path_similarity(synset1, synset2, verbose=False, simulate_root=True):
    return synset1.path_similarity(synset2, verbose, simulate_root)

def lch_similarity(synset1, synset2, verbose=False, simulate_root=True):
    return synset1.lch_similarity(synset2, verbose, simulate_root)

def wup_similarity(synset1, synset2, verbose=False, simulate_root=True):
    return synset1.wup_similarity(synset2, verbose, simulate_root)

def res_similarity(synset1, synset2, ic, verbose=False):
    return synset1.res_similarity(synset2, ic, verbose)

def jcn_similarity(synset1, synset2, ic, verbose=False):
    return synset1.jcn_similarity(synset2, ic, verbose)

def lin_similarity(synset1, synset2, ic, verbose=False):
    return synset1.lin_similarity(synset2, ic, verbose)
""" Finds the least common subsumer of two synsets in a WordNet taxonomy, where the least common subsumer is defined as the ancestor node common to both input synsets whose shortest path to the root node is the longest.
:type synset1: Synset :param synset1: First input synset. :type synset2: Synset :param synset2: Second input synset. :return: The ancestor synset common to both input synsets which is also the LCS. """ subsumer = None max_min_path_length = -1
subsumers = synset1.common_hypernyms(synset2)
if verbose: print("> Subsumers1:", subsumers)
    # Eliminate those synsets which are ancestors of other synsets in the
    # set of subsumers.
    eliminated = set()
    hypernym_relation = lambda s: s.hypernyms() + s.instance_hypernyms()
    for s1 in subsumers:
        for s2 in subsumers:
            if s2 in s1.closure(hypernym_relation):
                eliminated.add(s2)
    if verbose:
        print("> Eliminated:", eliminated)
subsumers = [s for s in subsumers if s not in eliminated]
if verbose: print("> Subsumers2:", subsumers)
    # Calculate the length of the shortest path to the root for each
    # subsumer.  Select the subsumer with the longest of these.
for candidate in subsumers:
        paths_to_root = candidate.hypernym_paths()
        min_path_length = -1

        for path in paths_to_root:
            if min_path_length < 0 or len(path) < min_path_length:
                min_path_length = len(path)

        if min_path_length > max_min_path_length:
            max_min_path_length = min_path_length
            subsumer = candidate
if verbose: print("> LCS Subsumer by depth:", subsumer) return subsumer
""" Get the information content of the least common subsumer that has the highest information content value. If two nodes have no explicit common subsumer, assume that they share an artificial root node that is the hypernym of all explicit roots.
:type synset1: Synset :param synset1: First input synset. :type synset2: Synset :param synset2: Second input synset. Must be the same part of speech as the first synset. :type ic: dict :param ic: an information content object (as returned by ``load_ic()``). :return: The information content of the two synsets and their most informative subsumer """ raise WordNetError('Computing the least common subsumer requires ' + \ '%s and %s to have the same part of speech.' % \ (synset1, synset2))
        subsumer_ic = 0
    else:
print("> LCS Subsumer by content:", subsumer_ic)
# Utility functions
    except KeyError:
        msg = 'Information content file has no entries for part-of-speech: %s'
        raise WordNetError(msg % synset.pos)
        return _INF
    else:
    # get the part of speech (NOUN or VERB) from the information content
    # record (each identifier has a 'n' or 'v' suffix)
else:
######################################################################
# Demo
######################################################################
def demo():
    import nltk
    print('loading wordnet')
    wn = WordNetCorpusReader(nltk.data.find('corpora/wordnet'))
    print('done loading')
    S = wn.synset
    L = wn.lemma

    print('getting a synset for go')
    move_synset = S('go.v.21')
    print(move_synset.name, move_synset.pos, move_synset.lexname)
    print(move_synset.lemma_names)
    print(move_synset.definition)
    print(move_synset.examples)

    zap_n = ['zap.n.01']
    zap_v = ['zap.v.01', 'zap.v.02', 'nuke.v.01', 'microwave.v.01']

    def _get_synsets(synset_strings):
        return [S(synset) for synset in synset_strings]

    zap_n_synsets = _get_synsets(zap_n)
    zap_v_synsets = _get_synsets(zap_v)
    zap_synsets = set(zap_n_synsets + zap_v_synsets)

    print(zap_n_synsets)
    print(zap_v_synsets)

    print("Navigations:")
    print(S('travel.v.01').hypernyms())
    print(S('travel.v.02').hypernyms())
    print(S('travel.v.03').hypernyms())

    print(L('zap.v.03.nuke').derivationally_related_forms())
    print(L('zap.v.03.atomize').derivationally_related_forms())
    print(L('zap.v.03.atomise').derivationally_related_forms())
    print(L('zap.v.03.zap').derivationally_related_forms())

    print(S('dog.n.01').member_holonyms())
    print(S('dog.n.01').part_meronyms())

    print(S('breakfast.n.1').hypernyms())
    print(S('meal.n.1').hyponyms())
    print(S('Austen.n.1').instance_hypernyms())
    print(S('composer.n.1').instance_hyponyms())

    print(S('faculty.n.2').member_meronyms())
    print(S('copilot.n.1').member_holonyms())

    print(S('table.n.2').part_meronyms())
    print(S('course.n.7').part_holonyms())

    print(S('water.n.1').substance_meronyms())
    print(S('gin.n.1').substance_holonyms())

    print(L('leader.n.1.leader').antonyms())
    print(L('increase.v.1.increase').antonyms())

    print(S('snore.v.1').entailments())
    print(S('heavy.a.1').similar_tos())
    print(S('light.a.1').attributes())
    print(S('heavy.a.1').attributes())

    print(L('English.a.1.English').pertainyms())

    print(S('person.n.01').root_hypernyms())
    print(S('sail.v.01').root_hypernyms())
    print(S('fall.v.12').root_hypernyms())

    print(S('person.n.01').lowest_common_hypernyms(S('dog.n.01')))

    print(S('dog.n.01').path_similarity(S('cat.n.01')))
    print(S('dog.n.01').lch_similarity(S('cat.n.01')))
    print(S('dog.n.01').wup_similarity(S('cat.n.01')))

    wnic = WordNetICCorpusReader(nltk.data.find('corpora/wordnet_ic'),
                                 r'.*\.dat')
    ic = wnic.ic('ic-brown.dat')
    print(S('dog.n.01').jcn_similarity(S('cat.n.01'), ic))

    ic = wnic.ic('ic-semcor.dat')
    print(S('dog.n.01').lin_similarity(S('cat.n.01'), ic))

    print(S('code.n.03').topic_domains())
    print(S('pukka.a.01').region_domains())
    print(S('freaky.a.01').usage_domains())


if __name__ == '__main__':
    demo()