# Properties of Corpora

In [1]:
from nltk.corpus import brown

## Corpora are Collections of Files

In [17]:
brown.root

FileSystemPathPointer('/home/tmb/nltk_data/corpora/brown')

In [18]:
brown.readme()

'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'

In [15]:
brown.fileids()[:10]

['ca01',
 'ca02',
 'ca03',
 'ca04',
 'ca05',
 'ca06',
 'ca07',
 'ca08',
 'ca09',
 'ca10']

Each file in a corpus can have its own encoding; by default, files are treated as ASCII and returned as `str`.

In [16]:
brown.encoding("ca01")
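Why the encoding matters can be seen without any corpus at all. The following is a minimal sketch in plain Python 3 (an assumption; the notebook cells themselves are Python 2), using an arbitrary German sample word. It shows that the same text produces different bytes under different encodings, and that decoding with the wrong one either fails or silently garbles the text.

```python
# Arbitrary sample word containing one non-ASCII character.
text = "Tragödie"

utf8_bytes = text.encode("utf-8")      # 'ö' is stored as two bytes
latin1_bytes = text.encode("latin-1")  # 'ö' is stored as one byte
print(len(utf8_bytes), len(latin1_bytes))

# Decoding with the matching encoding recovers the original text.
round_trip = utf8_bytes.decode("utf-8")

# Decoding with a mismatched encoding succeeds but yields garbled text.
garbled = utf8_bytes.decode("latin-1")
print(repr(garbled))

# Decoding as ASCII fails outright, because bytes above 127 are not ASCII.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)
```

A corpus reader that records an encoding per file can do this decoding for you, so downstream code only ever sees text.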

Files can also be grouped into categories.

In [19]:
brown.categories()

['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']

## Accessing Content

The corpus abstraction spares you from dealing with individual files, encodings, and similar details.

You can access all the raw text, words, or sentences of a corpus through a single object.
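To make the idea concrete, here is a minimal sketch of such an abstraction in plain Python 3. The class `TinyCorpus`, its crude regular-expression tokenizer, and the two sample files are all hypothetical and written only for illustration; this is not how NLTK implements its readers.

```python
import os
import re
import tempfile


class TinyCorpus:
    """Hypothetical stand-in for a corpus reader; not NLTK's implementation."""

    def __init__(self, root, pattern, encoding="utf-8"):
        self.root = root
        self.encoding = encoding
        # Keep the files whose whole name matches the regular expression.
        self._fileids = sorted(
            name for name in os.listdir(root) if re.fullmatch(pattern, name)
        )

    def fileids(self):
        return list(self._fileids)

    def raw(self, fileids=None):
        # Concatenate the decoded contents of the selected files.
        parts = []
        for fileid in self._select(fileids):
            path = os.path.join(self.root, fileid)
            with open(path, encoding=self.encoding) as handle:
                parts.append(handle.read())
        return "".join(parts)

    def words(self, fileids=None):
        # Crude tokenizer: runs of word characters or runs of punctuation.
        return re.findall(r"\w+|[^\w\s]+", self.raw(fileids))

    def _select(self, fileids):
        if fileids is None:
            return self._fileids
        if isinstance(fileids, str):
            return [fileids]
        return list(fileids)


# Build a throwaway two-file corpus and query it through one object.
samples = [("a.txt", "Hello, world.\n"), ("b.txt", "Second file!\n")]
with tempfile.TemporaryDirectory() as root:
    for name, content in samples:
        with open(os.path.join(root, name), "w", encoding="utf-8") as handle:
            handle.write(content)
    tiny = TinyCorpus(root, r".*\.txt")
    file_ids = tiny.fileids()
    all_words = tiny.words()
    b_words = tiny.words("b.txt")

print(file_ids)
print(all_words)
print(b_words)
```

The caller never opens a file or names an encoding; it asks for words, optionally restricted to some file IDs, and the reader handles the rest.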


In [54]:
brown.raw()[:100]

'\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn'

In [2]:
brown.words()[:10]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of']

In [5]:
for s in brown.sents()[:10]:
    print(s[:5])

['The', 'Fulton', 'County', 'Grand', 'Jury']
['The', 'jury', 'further', 'said', 'in']
['The', 'September-October', 'term', 'jury', 'had']
['``', 'Only', 'a', 'relative', 'handful']
['The', 'jury', 'said', 'it', 'did']
['It', 'recommended', 'that', 'Fulton', 'legislators']
['The', 'grand', 'jury', 'commented', 'on']
['Merger', 'proposed']
['However', ',', 'the', 'jury', 'said']
['The', 'City', 'Purchasing', 'Department', ',']


In [6]:
brown.tagged_words()[:10]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN')]

In [8]:
brown.tagged_sents()[0][:10]

[('The', 'AT'),
 ('Fulton', 'NP-TL'),
 ('County', 'NN-TL'),
 ('Grand', 'JJ-TL'),
 ('Jury', 'NN-TL'),
 ('said', 'VBD'),
 ('Friday', 'NR'),
 ('an', 'AT'),
 ('investigation', 'NN'),
 ('of', 'IN')]
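The raw text shown earlier stores each token as `word/tag` in lower case, while `tagged_words()` returns `(word, TAG)` pairs with upper-case tags. The sketch below, in plain Python 3, shows one way that conversion could be done. The function `parse_tagged` is a hypothetical helper written for this illustration, not the reader's actual code; NLTK ships a comparable helper, `nltk.tag.str2tuple`, for single tokens.

```python
def parse_tagged(raw_text):
    """Split whitespace-separated 'word/tag' tokens into (word, TAG) pairs."""
    pairs = []
    for token in raw_text.split():
        # Split on the last slash so a word containing '/' stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: keep the token and leave the tag unknown.
            pairs.append((token, None))
        else:
            pairs.append((word, tag.upper()))
    return pairs


# The first 100 characters of the raw Brown text, as printed above.
sample = (
    "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl "
    "said/vbd Friday/nr an/at investigation/nn"
)
tagged = parse_tagged(sample)
for pair in tagged:
    print(pair)
```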

# Reading New Corpora

In [20]:
import nltk.corpus.reader

In [30]:
# Match files in the current directory that start with "f" or "t" and end in ".txt".
corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", r"[ft].*\.txt", encoding="utf8")

In [31]:
corpus.fileids()

['faust.txt', 'tomsawyer.txt']

In [32]:
corpus.raw()[:100]

u'Faust: Der Trag\xf6die erster Teil\n\nJohann Wolfgang von Goethe\n\n\nZueignung.\n\nIhr naht euch wieder, schw'

In [33]:
corpus.paras()[:2]

[[[u'Faust', u':', u'Der', u'Trag\xf6die', u'erster', u'Teil']],
 [[u'Johann', u'Wolfgang', u'von', u'Goethe']]]

In [39]:
print(corpus.sents()[500])

[u'FAUST', u':', u'Vor', u'jenem', u'droben', u'steht', u'geb\xfcckt', u',', u'Der', u'helfen', u'lehrt', u'und', u'H\xfclfe', u'schickt', u'.']


In [40]:
print(corpus.words()[500:510])

[u'heute', u'!', u'DICHTER', u':', u'O', u'sprich', u'mir', u'nicht', u'von', u'jener']


In [44]:
from nltk import Text
text = Text(corpus.words("tomsawyer.txt"))

In [47]:
text.concordance("with")

Building index...
Displaying 25 of 647 matches:
" TOM !" No answer . " What ' s gone with that boy , I wonder ? You TOM !" No 
ding down and punching under the bed with the broom , and so she needed breath
eded breath to punctuate the punches with . She resurrected nothing but the ca
 - brother ) Sid was already through with his part of the work ( picking up ch
et vanity to believe she was endowed with a talent for dark and mysterious dip
 sewed . " Bother ! Well , go ' long with you . I ' d made sure you ' d played
 didn ' t think you sewed his collar with white thread , but it ' s black ." "
it ' s black ." " Why , I did sew it with white ! Tom !" But Tom did not wait 
 Confound it ! sometimes she sews it with white , and sometimes she sews it wi
th white , and sometimes she sews it with black . I wish to geeminy she ' d st
f it , and he strode down the street with his mouth full of harmony __________
ure is concerned , the advantage was with the boy , not the astronomer . The s

In [48]:
text.similar("with")

Building word-context index...
and in on to for of was at into up s that through but if just upon
what as by


In [50]:
text.common_contexts(["with","as"])

but_the is_a long_you up_a
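To give a rough idea of what `concordance` and `common_contexts` compute, here is a minimal sketch in plain Python 3. The functions `kwic_lines` and `shared_contexts` and the sample sentence are hypothetical, written only for this illustration; NLTK's own implementation builds indexes and differs in its details.

```python
def kwic_lines(tokens, word, width=20):
    """Return one keyword-in-context line per occurrence of `word`."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Take a bounded window of neighbours, then trim to `width` characters.
        left = " ".join(tokens[max(0, i - width):i])[-width:]
        right = " ".join(tokens[i + 1:i + 1 + width])[:width]
        lines.append(left.rjust(width) + " " + token + " " + right)
    return lines


def shared_contexts(tokens, words):
    """Return the (left, right) neighbour pairs that every word occurs in."""
    shared = None
    for word in words:
        target = word.lower()
        contexts = {
            (tokens[i - 1].lower(), tokens[i + 1].lower())
            for i in range(1, len(tokens) - 1)
            if tokens[i].lower() == target
        }
        shared = contexts if shared is None else shared & contexts
    return sorted(shared or [])


# Invented sample sentence, already split into tokens.
tokens = "it is with a friend , and it is as a friend , that she went with him".split()

kwic = kwic_lines(tokens, "with")
for line in kwic:
    print(line)

contexts = shared_contexts(tokens, ["with", "as"])
print(" ".join(left + "_" + right for left, right in contexts))
```

A pair such as `is_a` means that both words appear between "is" and "a" somewhere in the text, which is the same `left_right` notation used in the output above.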
