{ "cells": [ { "cell_type": "markdown", "id": "1ce9fe94", "metadata": {}, "source": [ "# Properties of Corpora" ] }, { "cell_type": "code", "execution_count": 1, "id": "90005a64", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk.corpus import brown" ] }, { "cell_type": "markdown", "id": "2980d0d7", "metadata": {}, "source": [ "## Corpora are Collections of Files" ] }, { "cell_type": "code", "execution_count": 17, "id": "47104c75", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "FileSystemPathPointer('/home/tmb/nltk_data/corpora/brown')" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.root" ] }, { "cell_type": "code", "execution_count": 18, "id": "9886db10", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'BROWN CORPUS\\n\\nA Standard Corpus of Present-Day Edited American\\nEnglish, for use with Digital Computers.\\n\\nby W. N. Francis and H. Kucera (1964)\\nDepartment of Linguistics, Brown University\\nProvidence, Rhode Island, USA\\n\\nRevised 1971, Revised and Amplified 1979\\n\\nhttp://www.hit.uib.no/icame/brown/bcm.html\\n\\nDistributed with the permission of the copyright holder,\\nredistribution permitted.\\n'" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.readme()" ] }, { "cell_type": "code", "execution_count": 15, "id": "1d690b25", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['ca01',\n", " 'ca02',\n", " 'ca03',\n", " 'ca04',\n", " 'ca05',\n", " 'ca06',\n", " 'ca07',\n", " 'ca08',\n", " 'ca09',\n", " 'ca10']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.fileids()[:10]" ] }, { "cell_type": "markdown", "id": "2e7d5c4b", "metadata": {}, "source": [ "Files may have different encodings; the default is ASCII processed as `str`." 
] }, { "cell_type": "code", "execution_count": 16, "id": "d2efd5df", "metadata": { "collapsed": false }, "outputs": [], "source": [ "brown.encoding(\"ca01\")" ] }, { "cell_type": "markdown", "id": "f9b70616", "metadata": {}, "source": [ "Files may also be in different categories." ] }, { "cell_type": "code", "execution_count": 19, "id": "39af3549", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['adventure',\n", " 'belles_lettres',\n", " 'editorial',\n", " 'fiction',\n", " 'government',\n", " 'hobbies',\n", " 'humor',\n", " 'learned',\n", " 'lore',\n", " 'mystery',\n", " 'news',\n", " 'religion',\n", " 'reviews',\n", " 'romance',\n", " 'science_fiction']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.categories()" ] }, { "cell_type": "markdown", "id": "0ce78685", "metadata": {}, "source": [ "## Accessing Content" ] }, { "cell_type": "markdown", "id": "9fe8b232", "metadata": {}, "source": [ "The corpus abstraction allows you to avoid having to deal with individual files, encodings, etc.\n", "\n", "That is, you can access all the words, all the text, all the sentences etc. 
in a corpus from a single object.\n" ] }, { "cell_type": "code", "execution_count": 54, "id": "caddbbd4", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "'\\n\\n\\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn'" ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.raw()[:100]" ] }, { "cell_type": "code", "execution_count": 2, "id": "f281d7f8", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['The',\n", " 'Fulton',\n", " 'County',\n", " 'Grand',\n", " 'Jury',\n", " 'said',\n", " 'Friday',\n", " 'an',\n", " 'investigation',\n", " 'of']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.words()[:10]" ] }, { "cell_type": "code", "execution_count": 5, "id": "ed836db2", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['The', 'Fulton', 'County', 'Grand', 'Jury']\n", "['The', 'jury', 'further', 'said', 'in']\n", "['The', 'September-October', 'term', 'jury', 'had']\n", "['``', 'Only', 'a', 'relative', 'handful']\n", "['The', 'jury', 'said', 'it', 'did']\n", "['It', 'recommended', 'that', 'Fulton', 'legislators']\n", "['The', 'grand', 'jury', 'commented', 'on']\n", "['Merger', 'proposed']\n", "['However', ',', 'the', 'jury', 'said']\n", "['The', 'City', 'Purchasing', 'Department', ',']\n" ] } ], "source": [ "for s in brown.sents()[:10]:\n", "    print(s[:5])" ] }, { "cell_type": "code", "execution_count": 6, "id": "0ac68d8b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('The', 'AT'),\n", " ('Fulton', 'NP-TL'),\n", " ('County', 'NN-TL'),\n", " ('Grand', 'JJ-TL'),\n", " ('Jury', 'NN-TL'),\n", " ('said', 'VBD'),\n", " ('Friday', 'NR'),\n", " ('an', 'AT'),\n", " ('investigation', 'NN'),\n", " ('of', 'IN')]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], 
"source": [ "brown.tagged_words()[:10]" ] }, { "cell_type": "code", "execution_count": 8, "id": "3f909b3b", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('The', 'AT'),\n", " ('Fulton', 'NP-TL'),\n", " ('County', 'NN-TL'),\n", " ('Grand', 'JJ-TL'),\n", " ('Jury', 'NN-TL'),\n", " ('said', 'VBD'),\n", " ('Friday', 'NR'),\n", " ('an', 'AT'),\n", " ('investigation', 'NN'),\n", " ('of', 'IN')]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "brown.tagged_sents()[0][:10]" ] }, { "cell_type": "markdown", "id": "edac51c0", "metadata": {}, "source": [ "# Reading New Corpora" ] }, { "cell_type": "code", "execution_count": 20, "id": "79f45f8a", "metadata": { "collapsed": false }, "outputs": [], "source": [ "import nltk.corpus.reader" ] }, { "cell_type": "code", "execution_count": 30, "id": "88ef0e47", "metadata": { "collapsed": false }, "outputs": [], "source": [ "corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(\".\",r\"[ft].*txt\",encoding=\"utf8\")" ] }, { "cell_type": "code", "execution_count": 31, "id": "045a15a0", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "['faust.txt', 'tomsawyer.txt']" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.fileids()" ] }, { "cell_type": "code", "execution_count": 32, "id": "d25b1b3d", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "u'Faust: Der Trag\\xf6die erster Teil\\n\\nJohann Wolfgang von Goethe\\n\\n\\nZueignung.\\n\\nIhr naht euch wieder, schw'" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.raw()[:100]" ] }, { "cell_type": "code", "execution_count": 33, "id": "b69ac46c", "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[[[u'Faust', u':', u'Der', u'Trag\\xf6die', u'erster', u'Teil']],\n", " [[u'Johann', u'Wolfgang', u'von', u'Goethe']]]" ] }, 
"execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "corpus.paras()[:2]" ] }, { "cell_type": "code", "execution_count": 39, "id": "9331354f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'FAUST', u':', u'Vor', u'jenem', u'droben', u'steht', u'geb\\xfcckt', u',', u'Der', u'helfen', u'lehrt', u'und', u'H\\xfclfe', u'schickt', u'.']\n" ] } ], "source": [ "print corpus.sents()[500]" ] }, { "cell_type": "code", "execution_count": 40, "id": "68818aff", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[u'heute', u'!', u'DICHTER', u':', u'O', u'sprich', u'mir', u'nicht', u'von', u'jener']\n" ] } ], "source": [ "print corpus.words()[500:510]" ] }, { "cell_type": "code", "execution_count": 44, "id": "23376f31", "metadata": { "collapsed": false }, "outputs": [], "source": [ "from nltk import Text\n", "text = Text(corpus.words(\"tomsawyer.txt\"))" ] }, { "cell_type": "code", "execution_count": 47, "id": "1b2dcb00", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building index...\n", "Displaying 25 of 647 matches:\n", "\" TOM !\" No answer . \" What ' s gone with that boy , I wonder ? You TOM !\" No \n", "ding down and punching under the bed with the broom , and so she needed breath\n", "eded breath to punctuate the punches with . She resurrected nothing but the ca\n", " - brother ) Sid was already through with his part of the work ( picking up ch\n", "et vanity to believe she was endowed with a talent for dark and mysterious dip\n", " sewed . \" Bother ! Well , go ' long with you . I ' d made sure you ' d played\n", " didn ' t think you sewed his collar with white thread , but it ' s black .\" \"\n", "it ' s black .\" \" Why , I did sew it with white ! Tom !\" But Tom did not wait \n", " Confound it ! 
sometimes she sews it with white , and sometimes she sews it wi\n", "th white , and sometimes she sews it with black . I wish to geeminy she ' d st\n", "f it , and he strode down the street with his mouth full of harmony __________\n", "ure is concerned , the advantage was with the boy , not the astronomer . The s\n", "art , don ' t you ? I could lick you with one hand tied behind me , if I wante\n", "do it .\" \" Well I will , if you fool with me .\" \" Oh yes -- I ' ve seen whole \n", "n ' t either .\" So they stood , each with a foot placed at an angle as a brace\n", " angle as a brace , and both shoving with might and main , and glowering at ea\n", "d main , and glowering at each other with hate . But neither could get an adva\n", "nd flushed , each relaxed his strain with watchful caution , and Tom said : \" \n", "other on you , and he can thrash you with his little finger , and I ' ll make \n", "it so .\" Tom drew a line in the dust with his big toe , and said : \" I dare yo\n", " out of his pocket and held them out with derision . Tom struck them to the gr\n", "er ' s nose , and covered themselves with dust and glory . Presently the confu\n", "tride the new boy , and pounding him with his fists . 
\" Holler ' nuff !\" said \n", "Better look out who you ' re fooling with next time .\" The new boy went off br\n", "ht him out .\" To which Tom responded with jeers , and started off in high feat\n" ] } ], "source": [ "text.concordance(\"with\")" ] }, { "cell_type": "code", "execution_count": 48, "id": "487f92bf", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Building word-context index...\n", "and in on to for of was at into up s that through but if just upon\n", "what as by\n" ] } ], "source": [ "text.similar(\"with\")" ] }, { "cell_type": "code", "execution_count": 50, "id": "fdaee38c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "but_the is_a long_you up_a\n" ] } ], "source": [ "text.common_contexts([\"with\",\"as\"])" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }