{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c6a93818",
   "metadata": {},
   "source": [
    "# A Database Join on Text Files from the Unix Command Line"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3981c748",
   "metadata": {},
   "source": [
    "Let's perform a simple database join from the UNIX command line.\n",
    "\n",
    "We start with two files, `household-ppp.tsv`, a tab-separated file of median household incomes by country, and `college-degrees-perc.tsv`, the percent of people with Bachelor or higher degrees in the age 25-64.\n",
    "\n",
    "To look at the contents of a file, we use the `cat` command.\n",
    "\n",
    "(The `!` in this worksheet means that the rest of the line is a UNIX command.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "id": "a5d44112",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tLuxembourg\t34407\r\n",
      "2\tUnited States\t31111\r\n",
      "3\tNorway\t31011\r\n",
      "4\tIceland\t28166\r\n",
      "5\tAustralia\t26915\r\n",
      "6\tSwitzerland\t26844\r\n",
      "7\tCanada\t25363\r\n",
      "8\tUnited Kingdom\t25168\r\n",
      "9\tIreland\t24677\r\n",
      "10\tAustria\t24114\r\n",
      "11\tNetherlands\t24024\r\n",
      "12\tSweden\t22889\r\n",
      "13\tDenmark\t22461\r\n",
      "14\tBelgium\t21532\r\n",
      "15\tGermany\t21241\r\n",
      "16\tFinland\t20875\r\n",
      "17\tNew Zealand\t20679\r\n",
      "18\tFrance\t19615\r\n",
      "19\tJapan\t19432\r\n",
      "20\tSouth Korea\t19179\r\n"
     ]
    }
   ],
   "source": [
    "!cat household-ppp.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "653d3116",
   "metadata": {},
   "source": [
    "These are spaced out in an odd way; the reason is that there isn't space characters but tab characters.\n",
    "We can make these visible with the `-vT` option to `cat`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "id": "d4947202",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1^ILuxembourg^I34407\r\n",
      "2^IUnited States^I31111\r\n",
      "3^INorway^I31011\r\n",
      "4^IIceland^I28166\r\n",
      "5^IAustralia^I26915\r\n",
      "6^ISwitzerland^I26844\r\n",
      "7^ICanada^I25363\r\n",
      "8^IUnited Kingdom^I25168\r\n",
      "9^IIreland^I24677\r\n",
      "10^IAustria^I24114\r\n",
      "11^INetherlands^I24024\r\n",
      "12^ISweden^I22889\r\n",
      "13^IDenmark^I22461\r\n",
      "14^IBelgium^I21532\r\n",
      "15^IGermany^I21241\r\n",
      "16^IFinland^I20875\r\n",
      "17^INew Zealand^I20679\r\n",
      "18^IFrance^I19615\r\n",
      "19^IJapan^I19432\r\n",
      "20^ISouth Korea^I19179\r\n"
     ]
    }
   ],
   "source": [
    "!cat -vT household-ppp.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9557658",
   "metadata": {},
   "source": [
    "We can also format this more nicely by putting it throught the `column` command.\n",
    "There are many such commands available at the command line.\n",
    "You won't remember them all, but if you need something, instead of writing\n",
    "something yourself, take a minute or two looking in the man pages or on Google.\n",
    "\n",
    "The scary looking expression with the printf is just saying \"put a TAB\" character here.\n",
    "We only need it here in the notebook because there's no easy way of typing a literal tab\n",
    "and because the notebook uses /bin/sh.\n",
    "On the command line, you just type a literal TAB with Control-V TAB,\n",
    "or in Bash you can write `$'\\t'`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "708ba99c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1   Luxembourg      34407\r\n",
      "2   United States   31111\r\n",
      "3   Norway          31011\r\n",
      "4   Iceland         28166\r\n",
      "5   Australia       26915\r\n",
      "6   Switzerland     26844\r\n",
      "7   Canada          25363\r\n",
      "8   United Kingdom  25168\r\n",
      "9   Ireland         24677\r\n",
      "10  Austria         24114\r\n",
      "11  Netherlands     24024\r\n",
      "12  Sweden          22889\r\n",
      "13  Denmark         22461\r\n",
      "14  Belgium         21532\r\n",
      "15  Germany         21241\r\n",
      "16  Finland         20875\r\n",
      "17  New Zealand     20679\r\n",
      "18  France          19615\r\n",
      "19  Japan           19432\r\n",
      "20  South Korea     19179\r\n"
     ]
    }
   ],
   "source": [
    "!cat household-ppp.tsv | column -t -s \"`printf '\\t'`\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dea84fcc",
   "metadata": {},
   "source": [
    "Let's look at the second table.  We use `head` instead of `cat`; `head` shows us just the initial contents of a file.\n",
    "Another useful command might be `more` or `less`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 78,
   "id": "6c5304c8",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tNorway\t32\r\n",
      "2\tUnited States\t31\r\n",
      "3\tNetherlands\t29\r\n",
      "4\tIceland\t26\r\n",
      "5\tDenmark\t25\r\n",
      "6\tNew Zealand\t25\r\n",
      "7\tCanada\t25\r\n",
      "8\tSouth Korea\t24\r\n",
      "9\tAustralia\t24\r\n",
      "10\tSweden\t23\r\n"
     ]
    }
   ],
   "source": [
    "!head college-degrees-perc.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0e02a3ee",
   "metadata": {},
   "source": [
    "Some people prefer using a more general tool called `sed` (stream editor).\n",
    "The following command says copy and edit files (with no editing to be done), and whe you hit line 7, then quit."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "id": "b7d8c359",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1\tNorway\t32\r\n",
      "2\tUnited States\t31\r\n",
      "3\tNetherlands\t29\r\n",
      "4\tIceland\t26\r\n",
      "5\tDenmark\t25\r\n",
      "6\tNew Zealand\t25\r\n",
      "7\tCanada\t25\r\n"
     ]
    }
   ],
   "source": [
    "!sed 7q college-degrees-perc.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1f3c83d7",
   "metadata": {},
   "source": [
    "There are other kinds of edits we can make with `sed`, for example.\n",
    "\n",
    "This command say:\n",
    "\n",
    "- substitute `United States` with `USA` (the strings can be regular expressions)\n",
    "- substitute `United Kingdom` with `UK`\n",
    "- translate all spaces to underscores\n",
    "- delete any line not containing either US or UK\n",
    "\n",
    "In fact, `sed` really is a little programming language with one letter commands.\n",
    "The commands themselves are used in many apps in similar forms (e.g., `ed`, `vi`, etc.)\n",
    "\n",
    "- s/old/new/g - text substitution with regular expressions\n",
    "- /target/ - string search\n",
    "- a - add line\n",
    "- d - delete line\n",
    "- y/old/new/ - replace characters\n",
    "- ! - negation\n",
    "- ; - statement separator\n",
    "\n",
    "You don't have to remember all these commands, but you should know that they are there.\n",
    "And a few standard `sed` commands are useful to know."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "id": "2f5c8120",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2\tUSA\t31\r\n",
      "11\tUK\t23\r\n"
     ]
    }
   ],
   "source": [
    "!sed 's/United States/USA/;s/United Kingdom/UK/;y/ /_/;/US\\|UK/!d' college-degrees-perc.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "924c58a8",
   "metadata": {},
   "source": [
    "In order to perform a relational join, we need to sort on the fields we are joining on.\n",
    "The sort command has many options.\n",
    "You can find out about them from the manual page."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 81,
   "id": "8fb3c0b1",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "SORT(1)                             User Commands                             SORT(1)\r\n",
      "\r\n",
      "\r\n",
      "\r\n",
      "NAME\r\n",
      "       sort - sort lines of text files\r\n",
      "\r\n",
      "SYNOPSIS\r\n",
      "       sort [OPTION]... [FILE]...\r\n",
      "       sort [OPTION]... --files0-from=F\r\n",
      "\r\n",
      "DESCRIPTION\r\n",
      "       Write sorted concatenation of all FILE(s) to standard output.\r\n",
      "\r\n",
      "       Mandatory  arguments  to  long  options  are  mandatory for short options too.\r\n",
      "       Ordering options:\r\n",
      "\r\n",
      "       -b, --ignore-leading-blanks\r\n",
      "              ignore leading blanks\r\n",
      "\r\n",
      "       -d, --dictionary-order\r\n",
      "              consider only blanks and alphanumeric characters\r\n",
      "\r\n",
      "       -f, --ignore-case\r\n",
      "              fold lower case to upper case characters\r\n",
      "\r\n",
      "       -g, --general-numeric-sort\r\n",
      "              compare according to general numerical value\r\n",
      "\r\n",
      "       -i, --ignore-nonprinting\r\n"
     ]
    }
   ],
   "source": [
    "!man sort | sed 30q"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "253aaf36",
   "metadata": {},
   "source": [
    "You can also find out about them from the `--help` argument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 82,
   "id": "18f61a4c",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Usage: sort [OPTION]... [FILE]...\r\n",
      "  or:  sort [OPTION]... --files0-from=F\r\n",
      "Write sorted concatenation of all FILE(s) to standard output.\r\n",
      "\r\n",
      "Mandatory arguments to long options are mandatory for short options too.\r\n",
      "Ordering options:\r\n",
      "\r\n",
      "  -b, --ignore-leading-blanks  ignore leading blanks\r\n",
      "  -d, --dictionary-order      consider only blanks and alphanumeric characters\r\n",
      "  -f, --ignore-case           fold lower case to upper case characters\r\n",
      "  -g, --general-numeric-sort  compare according to general numerical value\r\n",
      "  -i, --ignore-nonprinting    consider only printable characters\r\n",
      "  -M, --month-sort            compare (unknown) < `JAN' < ... < `DEC'\r\n",
      "  -h, --human-numeric-sort    compare human readable numbers (e.g., 2K 1G)\r\n",
      "  -n, --numeric-sort          compare according to string numerical value\r\n",
      "  -R, --random-sort           sort by random hash of keys\r\n",
      "      --random-source=FILE    get random bytes from FILE\r\n",
      "  -r, --reverse               reverse the result of comparisons\r\n",
      "      --sort=WORD             sort according to WORD:\r\n",
      "                                general-numeric -g, human-numeric -h, month -M,\r\n",
      "                                numeric -n, random -R, version -V\r\n",
      "  -V, --version-sort          natural sort of (version) numbers within text\r\n",
      "\r\n",
      "Other options:\r\n",
      "\r\n",
      "      --batch-size=NMERGE   merge at most NMERGE inputs at once;\r\n",
      "                            for more use temp files\r\n",
      "  -c, --check, --check=diagnose-first  check for sorted input; do not sort\r\n",
      "  -C, --check=quiet, --check=silent  like -c, but do not report first bad line\r\n",
      "      --compress-program=PROG  compress temporaries with PROG;\r\n",
      "                              decompress them with PROG -d\r\n",
      "      --debug               annotate the part of the line used to sort,\r\n",
      "                              and warn about questionable usage to stderr\r\n",
      "      --files0-from=F       read input from the files specified by\r\n",
      "                            NUL-terminated names in file F;\r\n",
      "                            If F is - then read names from standard input\r\n",
      "  -k, --key=POS1[,POS2]     start a key at POS1 (origin 1), end it at POS2\r\n",
      "                            (default end of line).  See POS syntax below\r\n",
      "  -m, --merge               merge already sorted files; do not sort\r\n",
      "  -o, --output=FILE         write result to FILE instead of standard output\r\n",
      "  -s, --stable              stabilize sort by disabling last-resort comparison\r\n",
      "  -S, --buffer-size=SIZE    use SIZE for main memory buffer\r\n",
      "  -t, --field-separator=SEP  use SEP instead of non-blank to blank transition\r\n",
      "  -T, --temporary-directory=DIR  use DIR for temporaries, not $TMPDIR or /tmp;\r\n",
      "                              multiple options specify multiple directories\r\n",
      "      --parallel=N          change the number of sorts run concurrently to N\r\n",
      "  -u, --unique              with -c, check for strict ordering;\r\n",
      "                              without -c, output only the first of an equal run\r\n",
      "  -z, --zero-terminated     end lines with 0 byte, not newline\r\n",
      "      --help     display this help and exit\r\n",
      "      --version  output version information and exit\r\n",
      "\r\n",
      "POS is F[.C][OPTS], where F is the field number and C the character position\r\n",
      "in the field; both are origin 1.  If neither -t nor -b is in effect, characters\r\n",
      "in a field are counted from the beginning of the preceding whitespace.  OPTS is\r\n",
      "one or more single-letter ordering options, which override global ordering\r\n",
      "options for that key.  If no key is given, use the entire line as the key.\r\n",
      "\r\n",
      "SIZE may be followed by the following multiplicative suffixes:\r\n",
      "% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.\r\n",
      "\r\n",
      "With no FILE, or when FILE is -, read standard input.\r\n",
      "\r\n",
      "*** WARNING ***\r\n",
      "The locale specified by the environment affects sort order.\r\n",
      "Set LC_ALL=C to get the traditional sort order that uses\r\n",
      "native byte values.\r\n",
      "\r\n",
      "Report sort bugs to bug-coreutils@gnu.org\r\n",
      "GNU coreutils home page: <http://www.gnu.org/software/coreutils/>\r\n",
      "General help using GNU software: <http://www.gnu.org/gethelp/>\r\n",
      "Report sort translation bugs to <http://translationproject.org/team/>\r\n",
      "For complete documentation, run: info coreutils 'sort invocation'\r\n"
     ]
    }
   ],
   "source": [
    "!sort --help"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9264dec",
   "metadata": {},
   "source": [
    "We want to join on the country name.  So let's sort on that field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 83,
   "id": "4a15d58f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5\tAustralia\t26915\r\n",
      "10\tAustria\t24114\r\n",
      "14\tBelgium\t21532\r\n",
      "7\tCanada\t25363\r\n",
      "13\tDenmark\t22461\r\n",
      "16\tFinland\t20875\r\n",
      "18\tFrance\t19615\r\n",
      "15\tGermany\t21241\r\n",
      "4\tIceland\t28166\r\n",
      "9\tIreland\t24677\r\n",
      "19\tJapan\t19432\r\n",
      "1\tLuxembourg\t34407\r\n",
      "11\tNetherlands\t24024\r\n",
      "17\tNew Zealand\t20679\r\n",
      "3\tNorway\t31011\r\n",
      "20\tSouth Korea\t19179\r\n",
      "12\tSweden\t22889\r\n",
      "6\tSwitzerland\t26844\r\n",
      "8\tUnited Kingdom\t25168\r\n",
      "2\tUnited States\t31111\r\n"
     ]
    }
   ],
   "source": [
    "!sort -k 2 household-ppp.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "247a4afb",
   "metadata": {},
   "source": [
    "That seems to have worked.  So let's do that for both files now.\n",
    "So, let's sort both files and then join them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "29f56702",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Australia\t5\t26915\t9\t24\r\n",
      "Belgium\t14\t21532\t24\t14\r\n",
      "Canada\t7\t25363\t7\t25\r\n",
      "Denmark\t13\t22461\t5\t25\r\n",
      "Finland\t16\t20875\t13\t21\r\n",
      "France\t18\t19615\t21\t16\r\n",
      "Germany\t15\t21241\t20\t16\r\n",
      "Iceland\t4\t28166\t4\t26\r\n",
      "Ireland\t9\t24677\t14\t21\r\n",
      "Japan\t19\t19432\t12\t23\r\n",
      "Luxembourg\t1\t34407\t19\t18\r\n",
      "Netherlands\t11\t24024\t3\t29\r\n",
      "New Zealand\t17\t20679\t6\t25\r\n",
      "Norway\t3\t31011\t1\t32\r\n",
      "South Korea\t20\t19179\t8\t24\r\n",
      "Sweden\t12\t22889\t10\t23\r\n",
      "Switzerland\t6\t26844\t15\t21\r\n",
      "United Kingdom\t8\t25168\t11\t23\r\n",
      "United States\t2\t31111\t2\t31\r\n"
     ]
    }
   ],
   "source": [
    "!sort -k 2 household-ppp.tsv > A\n",
    "!sort -k 2 college-degrees-perc.tsv > B\n",
    "!join -t \"`printf '\\t'`\" -j 2 A B"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e424d0d1",
   "metadata": {},
   "source": [
    "Each table contained a rank that we aren't interested in.\n",
    "So we need to cut out fields 1, 3, and 5.\n",
    "But the output of join is space separated and cut wants a tab separated input,\n",
    "so we use the `tr` command to translate spaced to tabs.\n",
    "We could also have used `sed \"y/ /\\t/\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "8c03b76f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Luxembourg\t34407\t18\r\n",
      "United States\t31111\t31\r\n",
      "Norway\t31011\t32\r\n",
      "Iceland\t28166\t26\r\n",
      "Australia\t26915\t24\r\n",
      "Switzerland\t26844\t21\r\n",
      "Canada\t25363\t25\r\n",
      "United Kingdom\t25168\t23\r\n",
      "Ireland\t24677\t21\r\n",
      "Netherlands\t24024\t29\r\n",
      "Sweden\t22889\t23\r\n",
      "Denmark\t22461\t25\r\n",
      "Belgium\t21532\t14\r\n",
      "Germany\t21241\t16\r\n",
      "Finland\t20875\t21\r\n",
      "New Zealand\t20679\t25\r\n",
      "France\t19615\t16\r\n",
      "Japan\t19432\t23\r\n",
      "South Korea\t19179\t24\r\n"
     ]
    }
   ],
   "source": [
    "!join -t \"`printf '\\t'`\" -j 2 A B | cut -f 1,3,5 | sort -t \"`printf '\\t'`\" -r -k 2 | tee table.tsv"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b5fdea0",
   "metadata": {},
   "source": [
    "Note that with the `tee` command, we simultaneously saved the output in a file and displayed it on standard output.\n",
    "\n",
    "Let's now reformat this output a little more nicely, using the `column` command.\n",
    "\n",
    "Note the use of \"(...)\" at the shell to combine the output from three commands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "f4d77395",
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "COUNTRY         FAMILY-INCOME-PPP  PERC-UNIVERSITY\r\n",
      "Luxembourg      34407              18\r\n",
      "United States   31111              31\r\n",
      "Norway          31011              32\r\n",
      "Iceland         28166              26\r\n",
      "Australia       26915              24\r\n",
      "Switzerland     26844              21\r\n",
      "Canada          25363              25\r\n",
      "United Kingdom  25168              23\r\n",
      "Ireland         24677              21\r\n",
      "Netherlands     24024              29\r\n",
      "Sweden          22889              23\r\n",
      "Denmark         22461              25\r\n",
      "Belgium         21532              14\r\n",
      "Germany         21241              16\r\n",
      "Finland         20875              21\r\n",
      "New Zealand     20679              25\r\n",
      "France          19615              16\r\n",
      "Japan           19432              23\r\n",
      "South Korea     19179              24\r\n"
     ]
    }
   ],
   "source": [
    "!(echo \"COUNTRY\\tFAMILY-INCOME-PPP\\tPERC-UNIVERSITY\"; cat table.tsv) | column -t -s \"`printf '\\t'`\" "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "05d41516",
   "metadata": {},
   "source": [
    "This may seem like an awful lot of work to join two small tables.\n",
    "Why not just write a Python script or load the data into a spreadsheet?\n",
    "\n",
    "- The commands on arbitrarily large files, even those that don't fit in memory.\n",
    "- These utilities handle Unicode, locales, and search order correctly.\n",
    "- These utilities know how to handle disk caching etc. correctly, and they can use multiple cores.\n",
    "- If you're skilled in both Python and UNIX, this will still be a lot faster to write on the command line.\n",
    "\n",
    "In short, you'd have a hard time beating them for large problems with any code you are likely to be able to write.\n",
    "\n",
    "NB: Your *locale* is defined by your environment variable `LC_ALL` and affects things like sort order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "id": "bcc8751f",
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}