{ "cells": [ { "cell_type": "markdown", "id": "69e195ea", "metadata": {}, "source": [ "# Operating on Directory Trees" ] }, { "cell_type": "code", "execution_count": 46, "id": "5871d5a9", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CONTENTS ca41\t\t cc10 ce18 cf24 cg18 cg60 ch27 cj39 ck01 cl14 cn26\r\n", "README\t ca42\t\t cc11 ce19 cf25 cg19 cg61 ch28 cj40 ck02 cl15 cn27\r\n", "ca01\t ca43\t\t cc12 ce20 cf26 cg20 cg62 ch29 cj41 ck03 cl16 cn28\r\n", "ca02\t ca44\t\t cc13 ce21 cf27 cg21 cg63 ch30 cj42 ck04 cl17 cn29\r\n", "ca03\t categories.pickle cc14 ce22 cf28 cg22 cg64 cj01 cj43 ck05 cl18 cp01\r\n", "ca04\t cats.txt\t cc15 ce23 cf29 cg23 cg65 cj02 cj44 ck06 cl19 cp02\r\n", "ca05\t cb01\t\t cc16 ce24 cf30 cg24 cg66 cj03 cj45 ck07 cl20 cp03\r\n", "ca06\t cb02\t\t cc17 ce25 cf31 cg25 cg67 cj04 cj46 ck08 cl21 cp04\r\n", "ca07\t cb03\t\t cd01 ce26 cf32 cg26 cg68 cj05 cj47 ck09 cl22 cp05\r\n", "ca08\t cb04\t\t cd02 ce27 cf33 cg27 cg69 cj06 cj48 ck10 cl23 cp06\r\n", "ca09\t cb05\t\t cd03 ce28 cf34 cg28 cg70 cj07 cj49 ck11 cl24 cp07\r\n", "ca10\t cb06\t\t cd04 ce29 cf35 cg29 cg71 cj08 cj50 ck12 cm01 cp08\r\n", "ca11\t cb07\t\t cd05 ce30 cf36 cg30 cg72 cj09 cj51 ck13 cm02 cp09\r\n", "ca12\t cb08\t\t cd06 ce31 cf37 cg31 cg73 cj10 cj52 ck14 cm03 cp10\r\n", "ca13\t cb09\t\t cd07 ce32 cf38 cg32 cg74 cj11 cj53 ck15 cm04 cp11\r\n", "ca14\t cb10\t\t cd08 ce33 cf39 cg33 cg75 cj12 cj54 ck16 cm05 cp12\r\n", "ca15\t cb11\t\t cd09 ce34 cf40 cg34 ch01 cj13 cj55 ck17 cm06 cp13\r\n", "ca16\t cb12\t\t cd10 ce35 cf41 cg35 ch02 cj14 cj56 ck18 cn01 cp14\r\n", "ca17\t cb13\t\t cd11 ce36 cf42 cg36 ch03 cj15 cj57 ck19 cn02 cp15\r\n", "ca18\t cb14\t\t cd12 cf01 cf43 cg37 ch04 cj16 cj58 ck20 cn03 cp16\r\n", "ca19\t cb15\t\t cd13 cf02 cf44 cg38 ch05 cj17 cj59 ck21 cn04 cp17\r\n", "ca20\t cb16\t\t cd14 cf03 cf45 cg39 ch06 cj18 cj60 ck22 cn05 cp18\r\n", "ca21\t cb17\t\t cd15 cf04 cf46 cg40 ch07 cj19 cj61 ck23 cn06 cp19\r\n", "ca22\t 
cb18\t\t cd16 cf05 cf47 cg41 ch08 cj20 cj62 ck24 cn07 cp20\r\n", "ca23\t cb19\t\t cd17 cf06 cf48 cg42 ch09 cj21 cj63 ck25 cn08 cp21\r\n", "ca24\t cb20\t\t ce01 cf07 cg01 cg43 ch10 cj22 cj64 ck26 cn09 cp22\r\n", "ca25\t cb21\t\t ce02 cf08 cg02 cg44 ch11 cj23 cj65 ck27 cn10 cp23\r\n", "ca26\t cb22\t\t ce03 cf09 cg03 cg45 ch12 cj24 cj66 ck28 cn11 cp24\r\n", "ca27\t cb23\t\t ce04 cf10 cg04 cg46 ch13 cj25 cj67 ck29 cn12 cp25\r\n", "ca28\t cb24\t\t ce05 cf11 cg05 cg47 ch14 cj26 cj68 cl01 cn13 cp26\r\n", "ca29\t cb25\t\t ce06 cf12 cg06 cg48 ch15 cj27 cj69 cl02 cn14 cp27\r\n", "ca30\t cb26\t\t ce07 cf13 cg07 cg49 ch16 cj28 cj70 cl03 cn15 cp28\r\n", "ca31\t cb27\t\t ce08 cf14 cg08 cg50 ch17 cj29 cj71 cl04 cn16 cp29\r\n", "ca32\t cc01\t\t ce09 cf15 cg09 cg51 ch18 cj30 cj72 cl05 cn17 cr01\r\n", "ca33\t cc02\t\t ce10 cf16 cg10 cg52 ch19 cj31 cj73 cl06 cn18 cr02\r\n", "ca34\t cc03\t\t ce11 cf17 cg11 cg53 ch20 cj32 cj74 cl07 cn19 cr03\r\n", "ca35\t cc04\t\t ce12 cf18 cg12 cg54 ch21 cj33 cj75 cl08 cn20 cr04\r\n", "ca36\t cc05\t\t ce13 cf19 cg13 cg55 ch22 cj34 cj76 cl09 cn21 cr05\r\n", "ca37\t cc06\t\t ce14 cf20 cg14 cg56 ch23 cj35 cj77 cl10 cn22 cr06\r\n", "ca38\t cc07\t\t ce15 cf21 cg15 cg57 ch24 cj36 cj78 cl11 cn23 cr07\r\n", "ca39\t cc08\t\t ce16 cf22 cg16 cg58 ch25 cj37 cj79 cl12 cn24 cr08\r\n", "ca40\t cc09\t\t ce17 cf23 cg17 cg59 ch26 cj38 cj80 cl13 cn25 cr09\r\n" ] } ], "source": [ "!ls brown" ] }, { "cell_type": "markdown", "id": "6ca95130", "metadata": {}, "source": [ "Let's look at operating on directory trees, a fairly common operation\n", "when dealing with files.\n", "\n", "It's common to want to search through a directory tree of files for matches.\n", "These days, `grep` has a built-in option for that, but let's\n", "see whether we can write that in some other (and more flexible) way." 
] }, { "cell_type": "code", "execution_count": 15, "id": "730058e3", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 107 3132 28944\r\n" ] } ], "source": [ "!grep -r nuclear brown/. | wc" ] }, { "cell_type": "markdown", "id": "68e357ca", "metadata": {}, "source": [ "The first thing people tend to do is look at the `find` command,\n", "notice its `-exec` option, and write something like the command below.\n", "Do not use this kind of command; `-exec` is rarely the right thing to use:\n", "in this form (terminated by `\\;`) it starts a separate `grep` process for every file, which is quite inefficient;\n", "it is limited in what you can do with it; and the syntax and quoting can get tricky.\n", "Its character and word counts also differ from the `grep -r` run, because `grep` does not prefix matches with the file name when it is given a single file." ] }, { "cell_type": "code", "execution_count": 44, "id": "294b7597", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 107 3099 27553\r\n" ] } ], "source": [ "!find brown/. -type f -exec grep nuclear '{}' \\; | wc # DO NOT USE" ] }, { "cell_type": "markdown", "id": "bff52247", "metadata": {}, "source": [ "A better way of dealing with this is the `xargs` command.\n", "It takes a partial command as its arguments, reads a list of file names\n", "on its standard input, and then runs the command with those file names\n", "appended, passing many names to each invocation. It can also run the commands\n", "in parallel (see the `-P` option), and there are even distributed versions of it." ] }, { "cell_type": "code", "execution_count": 16, "id": "1a7bdc18", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 107 3132 28944\r\n" ] } ], "source": [ "!find brown/. | xargs grep nuclear | wc " ] }, { "cell_type": "markdown", "id": "411423c5", "metadata": {}, "source": [ "To deal properly with file names containing spaces, use one of the following two commands (look at the manual pages to see why they work). The latter is probably better behaved, since most UNIX commands expect line-oriented input, not null-terminated input. Note, however, that `-d` is a GNU `xargs` extension and that the line-oriented form still breaks on file names containing newlines."
] }, { "cell_type": "code", "execution_count": 42, "id": "c95295cb", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 107 3132 28944\r\n" ] } ], "source": [ "!find brown/. -print0 | xargs -0 grep nuclear | wc " ] }, { "cell_type": "code", "execution_count": 45, "id": "e3bee5c4", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 107 3132 28944\r\n" ] } ], "source": [ "!find brown/. | xargs -d '\\n' grep nuclear | wc # THIS REALLY SHOULD BE THE DEFAULT" ] }, { "cell_type": "markdown", "id": "8dbbbdff", "metadata": {}, "source": [ "The `-l` option to `grep` tells it only to list the names of matching files.\n", "So, if we want to know the number of matching files (instead of the number of\n", "matching lines), we use this command:" ] }, { "cell_type": "code", "execution_count": 36, "id": "7058a80e", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "brown/./cj72\r\n", "brown/./cj74\r\n", "brown/./cb21\r\n", "brown/./ch21\r\n", "brown/./cg03\r\n" ] } ], "source": [ "!find brown/. | xargs grep -l nuclear | sed 5q" ] }, { "cell_type": "code", "execution_count": 37, "id": "36d4d79a", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 35 35 455\r\n" ] } ], "source": [ "!find brown/. 
| xargs grep -l nuclear | wc " ] }, { "cell_type": "markdown", "id": "fd5632f0", "metadata": {}, "source": [ "Since the output of `find` is just a list of lines, we can apply filters to it as well,\n", "for example searching for specific file names, file name extensions, or other conditions.\n", "So, if we want to look for the term `nuclear` only in the `ch` files of the Brown corpus,\n", "we can use this command:" ] }, { "cell_type": "code", "execution_count": 39, "id": "84bfda2f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 3 3 39\r\n" ] } ], "source": [ "!find brown/. | fgrep brown/./ch | xargs grep -l nuclear | wc" ] }, { "cell_type": "markdown", "id": "339900bb", "metadata": {}, "source": [ "We can even put another `grep` in between to filter further, here keeping only the files that mention both `Kennedy` and `nuclear`:" ] }, { "cell_type": "code", "execution_count": 41, "id": "196285a8", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 11 11 143\r\n" ] } ], "source": [ "!find brown/. | xargs grep -l Kennedy | xargs grep -l nuclear | wc" ] }, { "cell_type": "markdown", "id": "2037a2de", "metadata": {}, "source": [ "Finally, let's add our little `sed` script back in to format the output.\n", "The last `grep` now uses `-h` instead of `-l`, so it prints the matching lines without file names,\n", "and the `sed` script deletes everything from each `/` up to the next space."
] }, { "cell_type": "code", "execution_count": 47, "id": "169b8d14", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Until Moscow resumed nuclear testing last September 1 , the US and UK had released more than twice as much radiation into the atmosphere as the Russians , and the fallout from the earlier blasts is still coming down .\r\n", "On October 19 , after the Soviets had detonated at least 20 nuclear devices , Ambassador Stevenson warned the UN General Assembly that this country , in `` self protection '' , might have to resume above-ground tests .\r\n", "Now , of course , that the Russians are the nuclear villains , radiation is a nastier word than it was in the mid , when the US was testing in the atmosphere .\r\n", "After a nuclear blast , one bureaucrat suggested in those halcyon days , about all you had to do was haul out the broom and sweep off your sidewalks and roof .\r\n", "Can thermonuclear war be set off by accident ? ?\r\n", "`` E '' stands for `` execution '' -- the moment a `` go order '' would unleash an American nuclear strike .\r\n", "Work is under way to see whether new restraining devices should be installed on all nuclear weapons .\r\n", "Only the President is permitted to authorize the use of nuclear weapons .\r\n", "The President cannot personally remove the safety devices from every nuclear trigger .\r\n", "However , the system is designed , ingeniously and hopefully , so that no one man could initiate a thermonuclear war .\r\n", "sed: couldn't flush stdout: Broken pipe\r\n" ] } ], "source": [ "!find brown/. | xargs grep -l Kennedy | xargs grep -h nuclear | sed 's/\\/[^ ]*//g;s/^\\s//' | head" ] }, { "cell_type": "code", "execution_count": null, "id": "0aa1a43e", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }