Coverage for nltk.corpus.reader.plaintext : 64%
# Natural Language Toolkit: Plaintext Corpus Reader
#
# Copyright (C) 2001-2012 NLTK Project
# Author: Steven Bird <sb@ldc.upenn.edu>
#         Edward Loper <edloper@gradient.cis.upenn.edu>
#         Nitin Madnani <nmadnani@umiacs.umd.edu>
# URL: <http://www.nltk.org/>
# For license information, see LICENSE.TXT
"""
A reader for corpora that consist of plaintext documents.
"""
""" Reader for corpora that consist of plaintext documents. Paragraphs are assumed to be split using blank lines. Sentences and words can be tokenized using the default tokenizers, or by custom tokenizers specificed as parameters to the constructor.
    This corpus reader can be customized (e.g., to skip preface
    sections of specific document formats) by creating a subclass and
    overriding the ``CorpusView`` class variable.
    """
"""The corpus view class used by this reader. Subclasses of ``PlaintextCorpusReader`` may specify alternative corpus view classes (e.g., to skip the preface sections of documents.)"""
    def __init__(self, root, fileids,
                 word_tokenizer=WordPunctTokenizer(),
                 sent_tokenizer=nltk.data.LazyLoader(
                     'tokenizers/punkt/english.pickle'),
                 para_block_reader=read_blankline_block,
                 encoding=None):
        """
        Construct a new plaintext corpus reader for a set of documents
        located at the given root directory.  Example usage:
            >>> root = '/usr/local/share/nltk_data/corpora/webtext/'
            >>> reader = PlaintextCorpusReader(root, '.*\.txt')
        :param root: The root directory for this corpus.
        :param fileids: A list or regexp specifying the fileids in this corpus.
        :param word_tokenizer: Tokenizer for breaking sentences or
            paragraphs into words.
        :param sent_tokenizer: Tokenizer for breaking paragraphs
            into sentences.
        :param para_block_reader: The block reader used to divide the
            corpus into paragraph blocks.
        """
        CorpusReader.__init__(self, root, fileids, encoding)
        self._word_tokenizer = word_tokenizer
        self._sent_tokenizer = sent_tokenizer
        self._para_block_reader = para_block_reader
""" :return: the given file(s) as a single string. :rtype: str """ if fileids is None: fileids = self._fileids elif isinstance(fileids, compat.string_types): fileids = [fileids] return concat([self.open(f, sourced).read() for f in fileids])
""" :return: the given file(s) as a list of words and punctuation symbols. :rtype: list(str) """ # Once we require Python 2.5, use source=(fileid if sourced else None) encoding=enc, source=fileid) for (path, enc, fileid) in self.abspaths(fileids, True, True)]) else: encoding=enc) for (path, enc, fileid) in self.abspaths(fileids, True, True)])
""" :return: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. :rtype: list(list(str)) """ raise ValueError('No sentence tokenizer for this corpus') encoding=enc, source=fileid) for (path, enc, fileid) in self.abspaths(fileids, True, True)]) else: return concat([self.CorpusView(path, self._read_sent_block, encoding=enc) for (path, enc, fileid) in self.abspaths(fileids, True, True)])
""" :return: the given file(s) as a list of paragraphs, each encoded as a list of sentences, which are in turn encoded as lists of word strings. :rtype: list(list(list(str))) """ if self._sent_tokenizer is None: raise ValueError('No sentence tokenizer for this corpus') if sourced: return concat([self.CorpusView(path, self._read_para_block, encoding=enc, source=fileid) for (path, enc, fileid) in self.abspaths(fileids, True, True)]) else: return concat([self.CorpusView(path, self._read_para_block, encoding=enc) for (path, enc, fileid) in self.abspaths(fileids, True, True)])
    def _read_sent_block(self, stream):
        sents = []
        for para in self._para_block_reader(stream):
            sents.extend([self._word_tokenizer.tokenize(sent)
                          for sent in self._sent_tokenizer.tokenize(para)])
        return sents
    def _read_para_block(self, stream):
        paras = []
        for para in self._para_block_reader(stream):
            paras.append([self._word_tokenizer.tokenize(sent)
                          for sent in self._sent_tokenizer.tokenize(para)])
        return paras
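Under the default settings, the reading methods above therefore nest blank-line paragraph blocks, punkt sentences, and WordPunct tokens. A minimal stdlib sketch of that nesting, using the `\w+|[^\w\s]+` pattern that `WordPunctTokenizer` is built on and, as a stand-in for the punkt model, a naive split after sentence-final punctuation:

```python
import re

# WordPunctTokenizer's pattern: runs of word characters, or runs of
# non-word, non-space characters.
WORDPUNCT = re.compile(r'\w+|[^\w\s]+')

def naive_sents(para):
    """Stand-in for the punkt sentence tokenizer: split after . ! ?"""
    return [s for s in re.split(r'(?<=[.!?])\s+', para.strip()) if s]

def paras(text):
    """Nest text the way PlaintextCorpusReader.paras() does:
    blank-line paragraph blocks -> sentences -> word/punct tokens."""
    blocks = [p for p in re.split(r'\n\s*\n', text) if p.strip()]
    return [[WORDPUNCT.findall(sent) for sent in naive_sents(para)]
            for para in blocks]

print(paras("Hello world. Nice day!\n\nSecond paragraph."))
# -> [[['Hello', 'world', '.'], ['Nice', 'day', '!']],
#     [['Second', 'paragraph', '.']]]
```

The real reader differs in that it reads the file lazily through a `CorpusView` rather than holding the whole string in memory.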
PlaintextCorpusReader): """ A reader for plaintext corpora whose documents are divided into categories based on their file identifiers. """ """ Initialize the corpus reader. Categorization arguments (``cat_pattern``, ``cat_map``, and ``cat_file``) are passed to the ``CategorizedCorpusReader`` constructor. The remaining arguments are passed to the ``PlaintextCorpusReader`` constructor. """
    def _resolve(self, fileids, categories):
        if fileids is not None and categories is not None:
            raise ValueError('Specify fileids or categories, not both')
        if categories is not None:
            return self.fileids(categories)
        else:
            return fileids
    def raw(self, fileids=None, categories=None):
        return PlaintextCorpusReader.raw(
            self, self._resolve(fileids, categories))
    def words(self, fileids=None, categories=None):
        return PlaintextCorpusReader.words(
            self, self._resolve(fileids, categories))
    def sents(self, fileids=None, categories=None):
        return PlaintextCorpusReader.sents(
            self, self._resolve(fileids, categories))
    def paras(self, fileids=None, categories=None):
        return PlaintextCorpusReader.paras(
            self, self._resolve(fileids, categories))
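With the `cat_pattern` argument, categories are derived by matching a regular expression against each fileid. A hedged stdlib sketch of that mapping (the fileids shown mimic the movie_reviews layout, where the first path component names the category; the helper name is ours, not NLTK's):

```python
import re

def categories_from_fileids(fileids, cat_pattern):
    """Map each fileid to the category captured by group 1 of
    cat_pattern, as CategorizedCorpusReader's cat_pattern does."""
    mapping = {}
    for fileid in fileids:
        m = re.match(cat_pattern, fileid)
        if m:
            mapping[fileid] = [m.group(1)]
    return mapping

fileids = ['pos/cv000_29590.txt', 'neg/cv001_19502.txt']
print(categories_from_fileids(fileids, r'(neg|pos)/.*'))
# -> {'pos/cv000_29590.txt': ['pos'], 'neg/cv001_19502.txt': ['neg']}
```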
# is there a better way?
""" Reader for Europarl corpora that consist of plaintext documents. Documents are divided into chapters instead of paragraphs as for regular plaintext documents. Chapters are separated using blank lines. Everything is inherited from ``PlaintextCorpusReader`` except that: - Since the corpus is pre-processed and pre-tokenized, the word tokenizer should just split the line at whitespaces. - For the same reason, the sentence tokenizer should just split the paragraph at line breaks. - There is a new 'chapters()' method that returns chapters instead instead of paragraphs. - The 'paras()' method inherited from PlaintextCorpusReader is made non-functional to remove any confusion between chapters and paragraphs for Europarl. """
    def _read_word_block(self, stream):
        words = []
        for i in range(20): # Read 20 lines at a time.
            words.extend(stream.readline().split())
        return words
    def _read_sent_block(self, stream):
        sents = []
        for para in self._para_block_reader(stream):
            sents.extend([sent.split() for sent in para.splitlines()])
        return sents
    def _read_para_block(self, stream):
        paras = []
        for para in self._para_block_reader(stream):
            paras.append([sent.split() for sent in para.splitlines()])
        return paras
""" :return: the given file(s) as a list of chapters, each encoded as a list of sentences, which are in turn encoded as lists of word strings. :rtype: list(list(list(str))) """ return concat([self.CorpusView(fileid, self._read_para_block, encoding=enc) for (fileid, enc) in self.abspaths(fileids, True)])
    def paras(self, fileids=None):
        raise NotImplementedError('The Europarl corpus reader does not '
                                  'support paragraphs. Please use '
                                  'chapters() instead.')
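Because the Europarl text is pre-tokenized, the block readers above reduce to line and whitespace splitting. A minimal sketch of the per-chapter sentence reading (the sample chapter text is invented for illustration):

```python
def europarl_sents(chapter):
    """One sentence per line, tokens split at whitespace, as in
    EuroparlCorpusReader._read_sent_block for a single chapter block."""
    return [line.split() for line in chapter.splitlines() if line.strip()]

chapter = "Resumption of the session\nI declare resumed the session .\n"
print(europarl_sents(chapter))
# -> [['Resumption', 'of', 'the', 'session'],
#     ['I', 'declare', 'resumed', 'the', 'session', '.']]
```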