{ "cells": [ { "cell_type": "markdown", "id": "3468f54a", "metadata": {}, "source": [ "# Making the Brown Corpus Readable" ] }, { "cell_type": "markdown", "id": "3906c193", "metadata": {}, "source": [ "Here's a simple example of developing a command line removing the tags from the Brown corpus files." ] }, { "cell_type": "code", "execution_count": 1, "id": "4b86f5be", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CONTENTS\r\n", "README\r\n", "ca01\r\n", "ca02\r\n", "ca03\r\n", "ca04\r\n", "ca05\r\n", "ca06\r\n", "ca07\r\n", "ca08\r\n" ] } ], "source": [ "!ls brown/. | head" ] }, { "cell_type": "markdown", "id": "ce619cd5", "metadata": {}, "source": [ "We should probably look at the README for the definition of the tag file, but let's just figure this out." ] }, { "cell_type": "code", "execution_count": 2, "id": "88e3f236", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BROWN CORPUS\r\n", "\r\n", "A Standard Corpus of Present-Day Edited American\r\n", "English, for use with Digital Computers.\r\n", "\r\n", "by W. N. Francis and H. Kucera (1964)\r\n", "Department of Linguistics, Brown University\r\n", "Providence, Rhode Island, USA\r\n", "\r\n", "Revised 1971, Revised and Amplified 1979\r\n" ] } ], "source": [ "!head brown/README" ] }, { "cell_type": "markdown", "id": "7a58cf6a", "metadata": {}, "source": [ "Here's the first 10 lines from the file `brown/ca07`." ] }, { "cell_type": "code", "execution_count": 3, "id": "c9f39ff7", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\tResentment/nn welled/vbd up/rp yesterday/nr among/in Democratic/jj-tl district/nn leaders/nns and/cc some/dti county/nn leaders/nns at/in reports/nns that/cs Mayor/nn-tl Wagner/np had/hvd decided/vbn to/to seek/vb a/at third/od term/nn with/in Paul/np R./np Screvane/np and/cc Abraham/np D./np Beame/np as/cs running/vbg mates/nns ./.\r\n", "\r\n", "\r\n", "\tAt/in the/at same/ap time/nn reaction/nn among/in anti-organization/jj Democratic/jj-tl leaders/nns and/cc in/in the/at Liberal/jj-tl party/nn to/in the/at Mayor's/nn$-tl reported/vbn plan/nn was/bedz generally/rb favorable/jj ./.\r\n", "\r\n", "\r\n", "\tSome/dti anti-organization/jj Democrats/nps saw/vbd in/in the/at program/nn an/at opportunity/nn to/to end/vb the/at bitter/jj internal/jj fight/nn within/in the/at Democratic/jj-tl party/nn that/wps has/hvz been/ben going/vbg on/rp for/in the/at last/ap three/cd years/nns ./.\r\n", "\r\n" ] } ], "source": [ "!sed 10q brown/ca07" ] }, { "cell_type": "markdown", "id": "7e376c42", "metadata": {}, "source": [ "The main thing is that every word or punctuation is followed by a `/something`.\n", "We can remove that with a simple regular expression.\n", "Well, it's not quite so simple...\n", "\n", "- We want to replace `/`, but that's already the regular expression delimiter, so we need to escape it: `\\/`\n", "- the `g` is needed because we want to replace all occurrences" ] }, { "cell_type": "code", "execution_count": 7, "id": "6106ffad", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "\tResentment welled up yesterday among Democratic district leaders and some county leaders at reports that Mayor Wagner had decided to seek a third term with Paul R. Screvane and Abraham D. Beame as running mates .\r\n", "\r\n", "\r\n", "\tAt the same time reaction among anti-organization Democratic leaders and in the Liberal party to the Mayor's reported plan was generally favorable .\r\n", "\r\n", "\r\n", "\tSome anti-organization Democrats saw in the program an opportunity to end the bitter internal fight within the Democratic party that has been going on for the last three years .\r\n", "\r\n" ] } ], "source": [ "!sed 's/\\/[^ ]*//g;10q' brown/ca07" ] }, { "cell_type": "markdown", "id": "a78c6e56", "metadata": {}, "source": [ "Let's now clean up the whitespace at the beginning of the line. `\\t` is a shorthand for the tab character." ] }, { "cell_type": "code", "execution_count": 8, "id": "1f7307db", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "\r\n", "Resentment welled up yesterday among Democratic district leaders and some county leaders at reports that Mayor Wagner had decided to seek a third term with Paul R. Screvane and Abraham D. Beame as running mates .\r\n", "\r\n", "\r\n", "At the same time reaction among anti-organization Democratic leaders and in the Liberal party to the Mayor's reported plan was generally favorable .\r\n", "\r\n", "\r\n", "Some anti-organization Democrats saw in the program an opportunity to end the bitter internal fight within the Democratic party that has been going on for the last three years .\r\n", "\r\n" ] } ], "source": [ "!sed 's/\\/[^ ]*//g;s/^[ \\t]*//;10q' brown/ca07" ] }, { "cell_type": "markdown", "id": "211b87da", "metadata": {}, "source": [ "There are a lot of blank lines; the `cat -s` (squeeze) command gets rid of them." ] }, { "cell_type": "code", "execution_count": 9, "id": "32f18008", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Resentment welled up yesterday among Democratic district leaders and some county leaders at reports that Mayor Wagner had decided to seek a third term with Paul R. Screvane and Abraham D. Beame as running mates .\r\n", "\r\n", "At the same time reaction among anti-organization Democratic leaders and in the Liberal party to the Mayor's reported plan was generally favorable .\r\n", "\r\n", "Some anti-organization Democrats saw in the program an opportunity to end the bitter internal fight within the Democratic party that has been going on for the last three years .\r\n", "\r\n" ] } ], "source": [ "!sed 's/\\/[^ ]*//g;s/^[ \\t]*//;10q' brown/ca07 | cat -s" ] }, { "cell_type": "markdown", "id": "8135c8f7", "metadata": {}, "source": [ "Now we still have a problem with extra spaces before punctuation.\n", "We can fix that with another regular expression.\n", "This one contains *grouping* `\\(...\\)` and a backwards reference to the group `\\1`" ] }, { "cell_type": "code", "execution_count": 13, "id": "99222f49", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Resentment welled up yesterday among Democratic district leaders and some county leaders at reports that Mayor Wagner had decided to seek a third term with Paul R. Screvane and Abraham D. Beame as running mates.\r\n", "\r\n", "At the same time reaction among anti-organization Democratic leaders and in the Liberal party to the Mayor's reported plan was generally favorable.\r\n", "\r\n", "Some anti-organization Democrats saw in the program an opportunity to end the bitter internal fight within the Democratic party that has been going on for the last three years.\r\n", "\r\n" ] } ], "source": [ "!sed 's/\\/[^ ]*//g;s/^[ \\t]*//;s/ \\([.,]\\)/\\1/;10q' brown/ca07 | cat -s" ] }, { "cell_type": "markdown", "id": "f8378292", "metadata": {}, "source": [ "Finally, let's wrap the long lines back around. The `fmt` command is handy for that." ] }, { "cell_type": "code", "execution_count": 14, "id": "4e221ba4", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\r\n", "Resentment welled up yesterday among Democratic district leaders and\r\n", "some county leaders at reports that Mayor Wagner had decided to seek a\r\n", "third term with Paul R. Screvane and Abraham D. Beame as running mates.\r\n", "\r\n", "At the same time reaction among anti-organization Democratic leaders and\r\n", "in the Liberal party to the Mayor's reported plan was generally favorable.\r\n", "\r\n", "Some anti-organization Democrats saw in the program an opportunity to\r\n", "end the bitter internal fight within the Democratic party that has been\r\n", "going on for the last three years.\r\n", "\r\n" ] } ], "source": [ "!sed 's/\\/[^ ]*//g;s/^[ \\t]*//;s/ \\([.,]\\)/\\1/;10q' brown/ca07 | cat -s | fmt" ] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }