{ "cells": [ { "cell_type": "markdown", "id": "c6a93818", "metadata": {}, "source": [ "# A Database Join on Text Files from the Unix Command Line" ] }, { "cell_type": "markdown", "id": "3981c748", "metadata": {}, "source": [ "Let's perform a simple database join from the UNIX command line.\n", "\n", "We start with two files, `household-ppp.tsv`, a tab-separated file of median household incomes by country, and `college-degrees-perc.tsv`, the percent of people with Bachelor or higher degrees in the age 25-64.\n", "\n", "To look at the contents of a file, we use the `cat` command.\n", "\n", "(The `!` in this worksheet means that the rest of the line is a UNIX command.)" ] }, { "cell_type": "code", "execution_count": 76, "id": "a5d44112", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\tLuxembourg\t34407\r\n", "2\tUnited States\t31111\r\n", "3\tNorway\t31011\r\n", "4\tIceland\t28166\r\n", "5\tAustralia\t26915\r\n", "6\tSwitzerland\t26844\r\n", "7\tCanada\t25363\r\n", "8\tUnited Kingdom\t25168\r\n", "9\tIreland\t24677\r\n", "10\tAustria\t24114\r\n", "11\tNetherlands\t24024\r\n", "12\tSweden\t22889\r\n", "13\tDenmark\t22461\r\n", "14\tBelgium\t21532\r\n", "15\tGermany\t21241\r\n", "16\tFinland\t20875\r\n", "17\tNew Zealand\t20679\r\n", "18\tFrance\t19615\r\n", "19\tJapan\t19432\r\n", "20\tSouth Korea\t19179\r\n" ] } ], "source": [ "!cat household-ppp.tsv" ] }, { "cell_type": "markdown", "id": "653d3116", "metadata": {}, "source": [ "These are spaced out in an odd way; the reason is that there isn't space characters but tab characters.\n", "We can make these visible with the `-vT` option to `cat`." ] }, { "cell_type": "code", "execution_count": 77, "id": "d4947202", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1^ILuxembourg^I34407\r\n", "2^IUnited States^I31111\r\n", "3^INorway^I31011\r\n", "4^IIceland^I28166\r\n", "5^IAustralia^I26915\r\n", "6^ISwitzerland^I26844\r\n", "7^ICanada^I25363\r\n", "8^IUnited Kingdom^I25168\r\n", "9^IIreland^I24677\r\n", "10^IAustria^I24114\r\n", "11^INetherlands^I24024\r\n", "12^ISweden^I22889\r\n", "13^IDenmark^I22461\r\n", "14^IBelgium^I21532\r\n", "15^IGermany^I21241\r\n", "16^IFinland^I20875\r\n", "17^INew Zealand^I20679\r\n", "18^IFrance^I19615\r\n", "19^IJapan^I19432\r\n", "20^ISouth Korea^I19179\r\n" ] } ], "source": [ "!cat -vT household-ppp.tsv" ] }, { "cell_type": "markdown", "id": "e9557658", "metadata": {}, "source": [ "We can also format this more nicely by putting it throught the `column` command.\n", "There are many such commands available at the command line.\n", "You won't remember them all, but if you need something, instead of writing\n", "something yourself, take a minute or two looking in the man pages or on Google.\n", "\n", "The scary looking expression with the printf is just saying \"put a TAB\" character here.\n", "We only need it here in the notebook because there's no easy way of typing a literal tab\n", "and because the notebook uses /bin/sh.\n", "On the command line, you just type a literal TAB with Control-V TAB,\n", "or in Bash you can write `$'\\t'`." ] }, { "cell_type": "code", "execution_count": 21, "id": "708ba99c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 Luxembourg 34407\r\n", "2 United States 31111\r\n", "3 Norway 31011\r\n", "4 Iceland 28166\r\n", "5 Australia 26915\r\n", "6 Switzerland 26844\r\n", "7 Canada 25363\r\n", "8 United Kingdom 25168\r\n", "9 Ireland 24677\r\n", "10 Austria 24114\r\n", "11 Netherlands 24024\r\n", "12 Sweden 22889\r\n", "13 Denmark 22461\r\n", "14 Belgium 21532\r\n", "15 Germany 21241\r\n", "16 Finland 20875\r\n", "17 New Zealand 20679\r\n", "18 France 19615\r\n", "19 Japan 19432\r\n", "20 South Korea 19179\r\n" ] } ], "source": [ "!cat household-ppp.tsv | column -t -s \"`printf '\\t'`\"" ] }, { "cell_type": "markdown", "id": "dea84fcc", "metadata": {}, "source": [ "Let's look at the second table. We use `head` instead of `cat`; `head` shows us just the initial contents of a file.\n", "Another useful command might be `more` or `less`." ] }, { "cell_type": "code", "execution_count": 78, "id": "6c5304c8", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\tNorway\t32\r\n", "2\tUnited States\t31\r\n", "3\tNetherlands\t29\r\n", "4\tIceland\t26\r\n", "5\tDenmark\t25\r\n", "6\tNew Zealand\t25\r\n", "7\tCanada\t25\r\n", "8\tSouth Korea\t24\r\n", "9\tAustralia\t24\r\n", "10\tSweden\t23\r\n" ] } ], "source": [ "!head college-degrees-perc.tsv" ] }, { "cell_type": "markdown", "id": "0e02a3ee", "metadata": {}, "source": [ "Some people prefer using a more general tool called `sed` (stream editor).\n", "The following command says copy and edit files (with no editing to be done), and whe you hit line 7, then quit." ] }, { "cell_type": "code", "execution_count": 79, "id": "b7d8c359", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\tNorway\t32\r\n", "2\tUnited States\t31\r\n", "3\tNetherlands\t29\r\n", "4\tIceland\t26\r\n", "5\tDenmark\t25\r\n", "6\tNew Zealand\t25\r\n", "7\tCanada\t25\r\n" ] } ], "source": [ "!sed 7q college-degrees-perc.tsv" ] }, { "cell_type": "markdown", "id": "1f3c83d7", "metadata": {}, "source": [ "There are other kinds of edits we can make with `sed`, for example.\n", "\n", "This command say:\n", "\n", "- substitute `United States` with `USA` (the strings can be regular expressions)\n", "- substitute `United Kingdom` with `UK`\n", "- translate all spaces to underscores\n", "- delete any line not containing either US or UK\n", "\n", "In fact, `sed` really is a little programming language with one letter commands.\n", "The commands themselves are used in many apps in similar forms (e.g., `ed`, `vi`, etc.)\n", "\n", "- s/old/new/g - text substitution with regular expressions\n", "- /target/ - string search\n", "- a - add line\n", "- d - delete line\n", "- y/old/new/ - replace characters\n", "- ! - negation\n", "- ; - statement separator\n", "\n", "You don't have to remember all these commands, but you should know that they are there.\n", "And a few standard `sed` commands are useful to know." ] }, { "cell_type": "code", "execution_count": 80, "id": "2f5c8120", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\tUSA\t31\r\n", "11\tUK\t23\r\n" ] } ], "source": [ "!sed 's/United States/USA/;s/United Kingdom/UK/;y/ /_/;/US\\|UK/!d' college-degrees-perc.tsv" ] }, { "cell_type": "markdown", "id": "924c58a8", "metadata": {}, "source": [ "In order to perform a relational join, we need to sort on the fields we are joining on.\n", "The sort command has many options.\n", "You can find out about them from the manual page." ] }, { "cell_type": "code", "execution_count": 81, "id": "8fb3c0b1", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "SORT(1) User Commands SORT(1)\r\n", "\r\n", "\r\n", "\r\n", "NAME\r\n", " sort - sort lines of text files\r\n", "\r\n", "SYNOPSIS\r\n", " sort [OPTION]... [FILE]...\r\n", " sort [OPTION]... --files0-from=F\r\n", "\r\n", "DESCRIPTION\r\n", " Write sorted concatenation of all FILE(s) to standard output.\r\n", "\r\n", " Mandatory arguments to long options are mandatory for short options too.\r\n", " Ordering options:\r\n", "\r\n", " -b, --ignore-leading-blanks\r\n", " ignore leading blanks\r\n", "\r\n", " -d, --dictionary-order\r\n", " consider only blanks and alphanumeric characters\r\n", "\r\n", " -f, --ignore-case\r\n", " fold lower case to upper case characters\r\n", "\r\n", " -g, --general-numeric-sort\r\n", " compare according to general numerical value\r\n", "\r\n", " -i, --ignore-nonprinting\r\n" ] } ], "source": [ "!man sort | sed 30q" ] }, { "cell_type": "markdown", "id": "253aaf36", "metadata": {}, "source": [ "You can also find out about them from the `--help` argument." ] }, { "cell_type": "code", "execution_count": 82, "id": "18f61a4c", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: sort [OPTION]... [FILE]...\r\n", " or: sort [OPTION]... --files0-from=F\r\n", "Write sorted concatenation of all FILE(s) to standard output.\r\n", "\r\n", "Mandatory arguments to long options are mandatory for short options too.\r\n", "Ordering options:\r\n", "\r\n", " -b, --ignore-leading-blanks ignore leading blanks\r\n", " -d, --dictionary-order consider only blanks and alphanumeric characters\r\n", " -f, --ignore-case fold lower case to upper case characters\r\n", " -g, --general-numeric-sort compare according to general numerical value\r\n", " -i, --ignore-nonprinting consider only printable characters\r\n", " -M, --month-sort compare (unknown) < `JAN' < ... < `DEC'\r\n", " -h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)\r\n", " -n, --numeric-sort compare according to string numerical value\r\n", " -R, --random-sort sort by random hash of keys\r\n", " --random-source=FILE get random bytes from FILE\r\n", " -r, --reverse reverse the result of comparisons\r\n", " --sort=WORD sort according to WORD:\r\n", " general-numeric -g, human-numeric -h, month -M,\r\n", " numeric -n, random -R, version -V\r\n", " -V, --version-sort natural sort of (version) numbers within text\r\n", "\r\n", "Other options:\r\n", "\r\n", " --batch-size=NMERGE merge at most NMERGE inputs at once;\r\n", " for more use temp files\r\n", " -c, --check, --check=diagnose-first check for sorted input; do not sort\r\n", " -C, --check=quiet, --check=silent like -c, but do not report first bad line\r\n", " --compress-program=PROG compress temporaries with PROG;\r\n", " decompress them with PROG -d\r\n", " --debug annotate the part of the line used to sort,\r\n", " and warn about questionable usage to stderr\r\n", " --files0-from=F read input from the files specified by\r\n", " NUL-terminated names in file F;\r\n", " If F is - then read names from standard input\r\n", " -k, --key=POS1[,POS2] start a key at POS1 (origin 1), end it at POS2\r\n", " (default end of line). See POS syntax below\r\n", " -m, --merge merge already sorted files; do not sort\r\n", " -o, --output=FILE write result to FILE instead of standard output\r\n", " -s, --stable stabilize sort by disabling last-resort comparison\r\n", " -S, --buffer-size=SIZE use SIZE for main memory buffer\r\n", " -t, --field-separator=SEP use SEP instead of non-blank to blank transition\r\n", " -T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;\r\n", " multiple options specify multiple directories\r\n", " --parallel=N change the number of sorts run concurrently to N\r\n", " -u, --unique with -c, check for strict ordering;\r\n", " without -c, output only the first of an equal run\r\n", " -z, --zero-terminated end lines with 0 byte, not newline\r\n", " --help display this help and exit\r\n", " --version output version information and exit\r\n", "\r\n", "POS is F[.C][OPTS], where F is the field number and C the character position\r\n", "in the field; both are origin 1. If neither -t nor -b is in effect, characters\r\n", "in a field are counted from the beginning of the preceding whitespace. OPTS is\r\n", "one or more single-letter ordering options, which override global ordering\r\n", "options for that key. If no key is given, use the entire line as the key.\r\n", "\r\n", "SIZE may be followed by the following multiplicative suffixes:\r\n", "% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.\r\n", "\r\n", "With no FILE, or when FILE is -, read standard input.\r\n", "\r\n", "*** WARNING ***\r\n", "The locale specified by the environment affects sort order.\r\n", "Set LC_ALL=C to get the traditional sort order that uses\r\n", "native byte values.\r\n", "\r\n", "Report sort bugs to bug-coreutils@gnu.org\r\n", "GNU coreutils home page: \r\n", "General help using GNU software: \r\n", "Report sort translation bugs to \r\n", "For complete documentation, run: info coreutils 'sort invocation'\r\n" ] } ], "source": [ "!sort --help" ] }, { "cell_type": "markdown", "id": "e9264dec", "metadata": {}, "source": [ "We want to join on the country name. So let's sort on that field." ] }, { "cell_type": "code", "execution_count": 83, "id": "4a15d58f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "5\tAustralia\t26915\r\n", "10\tAustria\t24114\r\n", "14\tBelgium\t21532\r\n", "7\tCanada\t25363\r\n", "13\tDenmark\t22461\r\n", "16\tFinland\t20875\r\n", "18\tFrance\t19615\r\n", "15\tGermany\t21241\r\n", "4\tIceland\t28166\r\n", "9\tIreland\t24677\r\n", "19\tJapan\t19432\r\n", "1\tLuxembourg\t34407\r\n", "11\tNetherlands\t24024\r\n", "17\tNew Zealand\t20679\r\n", "3\tNorway\t31011\r\n", "20\tSouth Korea\t19179\r\n", "12\tSweden\t22889\r\n", "6\tSwitzerland\t26844\r\n", "8\tUnited Kingdom\t25168\r\n", "2\tUnited States\t31111\r\n" ] } ], "source": [ "!sort -k 2 household-ppp.tsv" ] }, { "cell_type": "markdown", "id": "247a4afb", "metadata": {}, "source": [ "That seems to have worked. So let's do that for both files now.\n", "So, let's sort both files and then join them." ] }, { "cell_type": "code", "execution_count": 4, "id": "29f56702", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Australia\t5\t26915\t9\t24\r\n", "Belgium\t14\t21532\t24\t14\r\n", "Canada\t7\t25363\t7\t25\r\n", "Denmark\t13\t22461\t5\t25\r\n", "Finland\t16\t20875\t13\t21\r\n", "France\t18\t19615\t21\t16\r\n", "Germany\t15\t21241\t20\t16\r\n", "Iceland\t4\t28166\t4\t26\r\n", "Ireland\t9\t24677\t14\t21\r\n", "Japan\t19\t19432\t12\t23\r\n", "Luxembourg\t1\t34407\t19\t18\r\n", "Netherlands\t11\t24024\t3\t29\r\n", "New Zealand\t17\t20679\t6\t25\r\n", "Norway\t3\t31011\t1\t32\r\n", "South Korea\t20\t19179\t8\t24\r\n", "Sweden\t12\t22889\t10\t23\r\n", "Switzerland\t6\t26844\t15\t21\r\n", "United Kingdom\t8\t25168\t11\t23\r\n", "United States\t2\t31111\t2\t31\r\n" ] } ], "source": [ "!sort -k 2 household-ppp.tsv > A\n", "!sort -k 2 college-degrees-perc.tsv > B\n", "!join -t \"`printf '\\t'`\" -j 2 A B" ] }, { "cell_type": "markdown", "id": "e424d0d1", "metadata": {}, "source": [ "Each table contained a rank that we aren't interested in.\n", "So we need to cut out fields 1, 3, and 5.\n", "But the output of join is space separated and cut wants a tab separated input,\n", "so we use the `tr` command to translate spaced to tabs.\n", "We could also have used `sed \"y/ /\\t/\"`." ] }, { "cell_type": "code", "execution_count": 5, "id": "8c03b76f", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Luxembourg\t34407\t18\r\n", "United States\t31111\t31\r\n", "Norway\t31011\t32\r\n", "Iceland\t28166\t26\r\n", "Australia\t26915\t24\r\n", "Switzerland\t26844\t21\r\n", "Canada\t25363\t25\r\n", "United Kingdom\t25168\t23\r\n", "Ireland\t24677\t21\r\n", "Netherlands\t24024\t29\r\n", "Sweden\t22889\t23\r\n", "Denmark\t22461\t25\r\n", "Belgium\t21532\t14\r\n", "Germany\t21241\t16\r\n", "Finland\t20875\t21\r\n", "New Zealand\t20679\t25\r\n", "France\t19615\t16\r\n", "Japan\t19432\t23\r\n", "South Korea\t19179\t24\r\n" ] } ], "source": [ "!join -t \"`printf '\\t'`\" -j 2 A B | cut -f 1,3,5 | sort -t \"`printf '\\t'`\" -r -k 2 | tee table.tsv" ] }, { "cell_type": "markdown", "id": "9b5fdea0", "metadata": {}, "source": [ "Note that with the `tee` command, we simultaneously saved the output in a file and displayed it on standard output.\n", "\n", "Let's now reformat this output a little more nicely, using the `column` command.\n", "\n", "Note the use of \"(...)\" at the shell to combine the output from three commands." ] }, { "cell_type": "code", "execution_count": 18, "id": "f4d77395", "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "COUNTRY FAMILY-INCOME-PPP PERC-UNIVERSITY\r\n", "Luxembourg 34407 18\r\n", "United States 31111 31\r\n", "Norway 31011 32\r\n", "Iceland 28166 26\r\n", "Australia 26915 24\r\n", "Switzerland 26844 21\r\n", "Canada 25363 25\r\n", "United Kingdom 25168 23\r\n", "Ireland 24677 21\r\n", "Netherlands 24024 29\r\n", "Sweden 22889 23\r\n", "Denmark 22461 25\r\n", "Belgium 21532 14\r\n", "Germany 21241 16\r\n", "Finland 20875 21\r\n", "New Zealand 20679 25\r\n", "France 19615 16\r\n", "Japan 19432 23\r\n", "South Korea 19179 24\r\n" ] } ], "source": [ "!(echo \"COUNTRY\\tFAMILY-INCOME-PPP\\tPERC-UNIVERSITY\"; cat table.tsv) | column -t -s \"`printf '\\t'`\" " ] }, { "cell_type": "markdown", "id": "05d41516", "metadata": {}, "source": [ "This may seem like an awful lot of work to join two small tables.\n", "Why not just write a Python script or load the data into a spreadsheet?\n", "\n", "- The commands on arbitrarily large files, even those that don't fit in memory.\n", "- These utilities handle Unicode, locales, and search order correctly.\n", "- These utilities know how to handle disk caching etc. correctly, and they can use multiple cores.\n", "- If you're skilled in both Python and UNIX, this will still be a lot faster to write on the command line.\n", "\n", "In short, you'd have a hard time beating them for large problems with any code you are likely to be able to write.\n", "\n", "NB: Your *locale* is defined by your environment variable `LC_ALL` and affects things like sort order." ] }, { "cell_type": "code", "execution_count": 85, "id": "bcc8751f", "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": {}, "nbformat": 4, "nbformat_minor": 5 }