{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "69e195ea",
   "metadata": {},
   "source": [
    "# Operating on Directory Trees"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "id": "5871d5a9",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CONTENTS  ca41\t\t     cc10  ce18  cf24  cg18  cg60  ch27  cj39  ck01  cl14  cn26\r\n",
      "README\t  ca42\t\t     cc11  ce19  cf25  cg19  cg61  ch28  cj40  ck02  cl15  cn27\r\n",
      "ca01\t  ca43\t\t     cc12  ce20  cf26  cg20  cg62  ch29  cj41  ck03  cl16  cn28\r\n",
      "ca02\t  ca44\t\t     cc13  ce21  cf27  cg21  cg63  ch30  cj42  ck04  cl17  cn29\r\n",
      "ca03\t  categories.pickle  cc14  ce22  cf28  cg22  cg64  cj01  cj43  ck05  cl18  cp01\r\n",
      "ca04\t  cats.txt\t     cc15  ce23  cf29  cg23  cg65  cj02  cj44  ck06  cl19  cp02\r\n",
      "ca05\t  cb01\t\t     cc16  ce24  cf30  cg24  cg66  cj03  cj45  ck07  cl20  cp03\r\n",
      "ca06\t  cb02\t\t     cc17  ce25  cf31  cg25  cg67  cj04  cj46  ck08  cl21  cp04\r\n",
      "ca07\t  cb03\t\t     cd01  ce26  cf32  cg26  cg68  cj05  cj47  ck09  cl22  cp05\r\n",
      "ca08\t  cb04\t\t     cd02  ce27  cf33  cg27  cg69  cj06  cj48  ck10  cl23  cp06\r\n",
      "ca09\t  cb05\t\t     cd03  ce28  cf34  cg28  cg70  cj07  cj49  ck11  cl24  cp07\r\n",
      "ca10\t  cb06\t\t     cd04  ce29  cf35  cg29  cg71  cj08  cj50  ck12  cm01  cp08\r\n",
      "ca11\t  cb07\t\t     cd05  ce30  cf36  cg30  cg72  cj09  cj51  ck13  cm02  cp09\r\n",
      "ca12\t  cb08\t\t     cd06  ce31  cf37  cg31  cg73  cj10  cj52  ck14  cm03  cp10\r\n",
      "ca13\t  cb09\t\t     cd07  ce32  cf38  cg32  cg74  cj11  cj53  ck15  cm04  cp11\r\n",
      "ca14\t  cb10\t\t     cd08  ce33  cf39  cg33  cg75  cj12  cj54  ck16  cm05  cp12\r\n",
      "ca15\t  cb11\t\t     cd09  ce34  cf40  cg34  ch01  cj13  cj55  ck17  cm06  cp13\r\n",
      "ca16\t  cb12\t\t     cd10  ce35  cf41  cg35  ch02  cj14  cj56  ck18  cn01  cp14\r\n",
      "ca17\t  cb13\t\t     cd11  ce36  cf42  cg36  ch03  cj15  cj57  ck19  cn02  cp15\r\n",
      "ca18\t  cb14\t\t     cd12  cf01  cf43  cg37  ch04  cj16  cj58  ck20  cn03  cp16\r\n",
      "ca19\t  cb15\t\t     cd13  cf02  cf44  cg38  ch05  cj17  cj59  ck21  cn04  cp17\r\n",
      "ca20\t  cb16\t\t     cd14  cf03  cf45  cg39  ch06  cj18  cj60  ck22  cn05  cp18\r\n",
      "ca21\t  cb17\t\t     cd15  cf04  cf46  cg40  ch07  cj19  cj61  ck23  cn06  cp19\r\n",
      "ca22\t  cb18\t\t     cd16  cf05  cf47  cg41  ch08  cj20  cj62  ck24  cn07  cp20\r\n",
      "ca23\t  cb19\t\t     cd17  cf06  cf48  cg42  ch09  cj21  cj63  ck25  cn08  cp21\r\n",
      "ca24\t  cb20\t\t     ce01  cf07  cg01  cg43  ch10  cj22  cj64  ck26  cn09  cp22\r\n",
      "ca25\t  cb21\t\t     ce02  cf08  cg02  cg44  ch11  cj23  cj65  ck27  cn10  cp23\r\n",
      "ca26\t  cb22\t\t     ce03  cf09  cg03  cg45  ch12  cj24  cj66  ck28  cn11  cp24\r\n",
      "ca27\t  cb23\t\t     ce04  cf10  cg04  cg46  ch13  cj25  cj67  ck29  cn12  cp25\r\n",
      "ca28\t  cb24\t\t     ce05  cf11  cg05  cg47  ch14  cj26  cj68  cl01  cn13  cp26\r\n",
      "ca29\t  cb25\t\t     ce06  cf12  cg06  cg48  ch15  cj27  cj69  cl02  cn14  cp27\r\n",
      "ca30\t  cb26\t\t     ce07  cf13  cg07  cg49  ch16  cj28  cj70  cl03  cn15  cp28\r\n",
      "ca31\t  cb27\t\t     ce08  cf14  cg08  cg50  ch17  cj29  cj71  cl04  cn16  cp29\r\n",
      "ca32\t  cc01\t\t     ce09  cf15  cg09  cg51  ch18  cj30  cj72  cl05  cn17  cr01\r\n",
      "ca33\t  cc02\t\t     ce10  cf16  cg10  cg52  ch19  cj31  cj73  cl06  cn18  cr02\r\n",
      "ca34\t  cc03\t\t     ce11  cf17  cg11  cg53  ch20  cj32  cj74  cl07  cn19  cr03\r\n",
      "ca35\t  cc04\t\t     ce12  cf18  cg12  cg54  ch21  cj33  cj75  cl08  cn20  cr04\r\n",
      "ca36\t  cc05\t\t     ce13  cf19  cg13  cg55  ch22  cj34  cj76  cl09  cn21  cr05\r\n",
      "ca37\t  cc06\t\t     ce14  cf20  cg14  cg56  ch23  cj35  cj77  cl10  cn22  cr06\r\n",
      "ca38\t  cc07\t\t     ce15  cf21  cg15  cg57  ch24  cj36  cj78  cl11  cn23  cr07\r\n",
      "ca39\t  cc08\t\t     ce16  cf22  cg16  cg58  ch25  cj37  cj79  cl12  cn24  cr08\r\n",
      "ca40\t  cc09\t\t     ce17  cf23  cg17  cg59  ch26  cj38  cj80  cl13  cn25  cr09\r\n"
     ]
    }
   ],
   "source": [
    "!ls brown"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ca95130",
   "metadata": {},
   "source": [
    "Let's look at operating on directory trees, a fairly common task\n",
    "when dealing with files.\n",
    "\n",
    "It's common to want to search through a directory tree of files for matches.\n",
    "These days, `grep` has a built-in option for that (`-r`), but let's\n",
    "see whether we can write that in some other (and more flexible) way."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "730058e3",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    107    3132   28944\r\n"
     ]
    }
   ],
   "source": [
    "!grep -r nuclear brown/. | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "68e357ca",
   "metadata": {},
   "source": [
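    "The first thing people tend to do is look at the `find` command,\n",
    "see its `-exec` option, and write something like the command below.\n",
    "Do not use this kind of command; `-exec` is rarely the right tool\n",
    "because it is quite inefficient (with `\\;` it starts a separate `grep` process for every file),\n",
    "because it is limited in what you can do with it,\n",
    "and because the syntax and quoting can get tricky.\n",
    "Note that `grep` omits the file name prefix when it is given a single file,\n",
    "which is why the word and character counts below differ from the other runs."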
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 44,
   "id": "294b7597",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    107    3099   27553\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. -type f -exec grep nuclear '{}' \\; | wc # DO NOT USE"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bff52247",
   "metadata": {},
   "source": [
    "A better way of dealing with this is the `xargs` command.\n",
    "It takes a partial command as its arguments, reads a list of file names\n",
    "on its standard input, and then runs the command with those\n",
    "file names appended, packing many names into each invocation.\n",
    "It can also run several invocations in parallel (and there are even distributed\n",
    "versions of it)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "1a7bdc18",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    107    3132   28944\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | xargs grep nuclear | wc "
   ]
  },
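  {
   "cell_type": "markdown",
   "id": "b7c1e2a0",
   "metadata": {},
   "source": [
    "As a minimal sketch of the parallel mode mentioned above, assuming an `xargs` that supports\n",
    "the `-P` and `-n` options (GNU and BSD versions do): `-P 4` runs up to four `grep` processes\n",
    "at once and `-n 50` hands each one at most 50 file names. Both numbers are arbitrary choices\n",
    "for illustration, and the order of the output lines may vary between runs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d4f09a13",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!find brown/. -type f | xargs -P 4 -n 50 grep nuclear | wc"
   ]
  },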
  {
   "cell_type": "markdown",
   "id": "411423c5",
   "metadata": {},
   "source": [
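    "To deal properly with file names containing spaces, you need to use one of the following two commands (look at the manual pages to see why they work). The first separates file names with NUL characters; the second tells `xargs` to split its input only at newlines (`-d` is a GNU `xargs` option). The latter is probably better behaved, since most UNIX commands expect line-oriented input, not NUL-terminated input."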
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "c95295cb",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    107    3132   28944\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. -print0 | xargs -0 grep nuclear | wc "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "id": "e3bee5c4",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    107    3132   28944\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | xargs -d '\\n' grep nuclear | wc   # THIS REALLY SHOULD BE THE DEFAULT"
   ]
  },
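  {
   "cell_type": "markdown",
   "id": "e81c5f27",
   "metadata": {},
   "source": [
    "A small sketch of why this matters. It assumes the `%%bash` cell magic is available and uses a\n",
    "hypothetical scratch directory created with `mktemp`, which is not part of the Brown corpus.\n",
    "With plain `xargs`, the file name is split at its blanks and `grep` complains about files that\n",
    "do not exist; with NUL-separated names, the file is found."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3fa6c2d9",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "%%bash\n",
    "# Scratch directory and file name are made up for this demonstration\n",
    "tmp=$(mktemp -d)\n",
    "printf 'nuclear test\\n' > \"$tmp/file with spaces.txt\"\n",
    "# Plain xargs splits the name at blanks, so grep is handed three bogus names\n",
    "find \"$tmp\" -type f | xargs grep -l nuclear || true\n",
    "# NUL-separated names reach grep intact\n",
    "find \"$tmp\" -type f -print0 | xargs -0 grep -l nuclear\n",
    "# Clean up the scratch directory\n",
    "rm -r \"$tmp\""
   ]
  },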
  {
   "cell_type": "markdown",
   "id": "8dbbbdff",
   "metadata": {},
   "source": [
    "The `-l` option to `grep` tells it to list only the names of matching files.\n",
    "So, if we want to know which files match, and how many (instead of the number of\n",
    "matching lines), we use these commands:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "7058a80e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "brown/./cj72\r\n",
      "brown/./cj74\r\n",
      "brown/./cb21\r\n",
      "brown/./ch21\r\n",
      "brown/./cg03\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | xargs grep -l nuclear | sed 5q"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "36d4d79a",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     35      35     455\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | xargs grep -l nuclear | wc "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fd5632f0",
   "metadata": {},
   "source": [
    "Since the output of `find` is just a list of lines, we can apply filters to it as well,\n",
    "for example searching for specific file names, file name extensions, or other conditions.\n",
    "So, if we want to look for the term `nuclear` only in the `ch` files of the Brown corpus,\n",
    "we can use this command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "84bfda2f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "      3       3      39\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | grep -F brown/./ch | xargs grep -l nuclear | wc"
   ]
  },
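  {
   "cell_type": "markdown",
   "id": "9c24d7b1",
   "metadata": {},
   "source": [
    "As an alternative sketch, `find` can do this kind of selection itself with its `-name` test,\n",
    "which matches a shell pattern against the last component of each path. Assuming the corpus\n",
    "directory is flat, as the listing at the top suggests, this should select the same `ch` files."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ad0e6f8",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "!find brown/. -type f -name 'ch*' | xargs grep -l nuclear | wc"
   ]
  },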
  {
   "cell_type": "markdown",
   "id": "339900bb",
   "metadata": {},
   "source": [
    "We can even chain a second `grep -l` in there, so that only files containing both `Kennedy` and `nuclear` are counted:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "196285a8",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     11      11     143\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | xargs grep -l Kennedy | xargs grep -l nuclear | wc"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2037a2de",
   "metadata": {},
   "source": [
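    "Finally, let's add our little `sed` script back in to format the output.\n",
    "The `-h` option makes `grep` omit the file name prefix, and the script removes\n",
    "every tag (a `/` and everything up to the next space) and one leading whitespace character."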
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "id": "169b8d14",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Until Moscow resumed nuclear testing last September 1 , the US and UK had released more than twice as much radiation into the atmosphere as the Russians , and the fallout from the earlier blasts is still coming down .\r\n",
      "On October 19 , after the Soviets had detonated at least 20 nuclear devices , Ambassador Stevenson warned the UN General Assembly that this country , in `` self protection '' , might have to resume above-ground tests .\r\n",
      "Now , of course , that the Russians are the nuclear villains , radiation is a nastier word than it was in the mid , when the US was testing in the atmosphere .\r\n",
      "After a nuclear blast , one bureaucrat suggested in those halcyon days , about all you had to do was haul out the broom and sweep off your sidewalks and roof .\r\n",
      "Can thermonuclear war be set off by accident ? ?\r\n",
      "`` E '' stands for `` execution '' -- the moment a `` go order '' would unleash an American nuclear strike .\r\n",
      "Work is under way to see whether new restraining devices should be installed on all nuclear weapons .\r\n",
      "Only the President is permitted to authorize the use of nuclear weapons .\r\n",
      "The President cannot personally remove the safety devices from every nuclear trigger .\r\n",
      "However , the system is designed , ingeniously and hopefully , so that no one man could initiate a thermonuclear war .\r\n",
      "sed: couldn't flush stdout: Broken pipe\r\n"
     ]
    }
   ],
   "source": [
    "!find brown/. | xargs grep -l Kennedy | xargs grep -h nuclear | sed 's/\\/[^ ]*//g;s/^\\s//' | head"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0aa1a43e",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}