#+TITLE: Shell Script Resilience
#+SUBTITLE: Overview of the Problem Space
#+AUTHOR: J. Greg Davidson
#+DATE: 18 October 2022
#+OPTIONS: toc:nil
#+OPTIONS: num:nil
# +OPTIONS: date:nil
# +OPTIONS: author:nil

* Shell Script Resilience

This is an OrgMode Document. If you are reading it with Emacs you can run the
examples as well as change them and run them again.

** Version 1: The Naive Script

If you know how to issue commands to a shell manually, then you can trivially
create a naive shell script:
- Create a new text file using your favorite editor
- Enter your desired commands into it
- Make the script executable with =chmod +x SCRIPT-FILE-NAME=
- To run it, either
  - specify a path to it when you run it
    - e.g. type =./SCRIPT-FILE-NAME= at the Shell Prompt
  - place it in a directory which is on your =$PATH=
    - e.g. =mv SCRIPT-FILE-NAME ~/bin=
    - so you can run it simply as =SCRIPT-FILE-NAME=

#+begin_src sh :results output
echo Hello from ${0##*/}
echo `date -I` is a good day to run!
#+end_src

#+RESULTS:
: Hello from sh
: 2022-10-18 is a good day to run!

*** Let's criticise version 1

A great thing about storing commands in a script is that we (and others) can
now re-execute these commands as often as we like far into the future.

This is also the problem: the people running the script (including a future
version of you) do not necessarily understand these commands, *and* the script
can fail because it may be used differently than you initially used it, *and*
the resources invoked and accessed by the script may change in ways which
cause it to fail.

**** Specifying an Interpreter

The executable file is still just a text file, which won't do anything by
itself. Giving it execute permission doesn't say which program should be run
to /Interpret/ the script. So we get a system default, which may or may not be
the program we intended when we (or whoever) originally wrote the script.

We let the system know which program we would like it to use to /Interpret/
our /Script/ with a [[https://en.wikipedia.org/wiki/Shebang_(Unix)][Shebang Line]] as the first line of the file.

If you're using =bash=, you can find out where a program is located on your
local system at the current time with the builtin =type= command.

#+begin_src bash :results output
type bash
#+end_src

#+RESULTS:
: bash is /usr/bin/bash

If you're using =bash=, you can get brief documentation about /any/ builtin
command using the builtin =help= command.

#+begin_src bash :results output
help type
#+end_src

**** Documenting The Script

Let's assume that you've looked up the features behind
- =${0##*/}=
- =date -I=
- And nesting commands inside of `back quotes`

Will someone looking at this script in the future (possibly you, after you've
forgotten some of these things) understand what they mean?

You should document anything non-obvious
- The shell ignores everything from an unquoted =#= at the start of a word to
  the end of the line
- If a short comment will do, give it
- Otherwise link to a more complete explanation elsewhere

**** Controlling Shell Variables

It's good to use shell variables for any content which might change or which
needs to be repeated, or simply content we wish to document by giving it a
name which explains its purpose or makes it easier to think about.

The value of a shell variable can contain nearly any string of characters,
including spaces and special characters which, unless quoted, may activate
shell features unexpectedly. Thus we almost always quote the initial value of
a shell variable with either 'single quotes' or "double quotes" and we almost
always quote a shell expansion with "double quotes".

In those rare cases where we don't quote shell values or expansions, a comment
should explain why!

** Version 2: A Little Better

There is nothing wrong with creating a naive script, as long as you upgrade it
before using it again, and especially before giving it to anyone else to use!

So let's upgrade our script:

#+begin_src bash :results output
#!/usr/bin/bash
# Strip the directories off the executable program path
script_name="${0##*/}"
echo Hello from "$script_name"
# Embed the ISO date in our message
echo "`date -I` is a good day to run!"
#+end_src

#+RESULTS:
: Hello from bash
: 2022-10-18 is a good day to run!

There's not a lot of point making this script more resilient, but what about a
script which administers essential services?

** Ensuring the expected context

When you're issuing a command interactively, there's a certain background
context:
- You're logged in under a particular user account with particular permissions
- On a system with particular versions of programs and libraries
- You have a particular /Current Working Directory/
- Your particular /Environment Variables/ are set in particular ways
- You have particular configuration files with particular contents

All of the above particulars can affect what, if anything, a command you issue
might do.

In addition to that context, consider the state of any resources your command
might access:
- other files and directories on your local system
- services and other processes on your local system
- services and resources on other systems across the Internet

When you start to issue a series of familiar commands manually and one of them
gives an unexpected result (especially an error message!), your /Common Sense/
tells you to
- stop and find out what happened
- and take corrective action before continuing with the rest of your intended
  commands
  - if they're even still relevant!

*Scripts have no Common Sense!*

Unless you add explicit code to your script, it will simply barrel on,
executing the rest of the commands willy-nilly!
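
A minimal sketch of that barrelling on, assuming a made-up directory
=/nonexistent/scratch-area= which does not exist on your system. The names
=scratch=, =start_dir= and =guarded= are only for illustration. The first =cd=
fails, yet the following commands still run in the original directory; the
second attempt adds one explicit check.

#+begin_src bash :results output
# Hypothetical scratch directory, assumed not to exist on this system
scratch='/nonexistent/scratch-area'
start_dir="$PWD"

cd "$scratch" 2>/dev/null    # this command fails ...
# ... yet the script carries on, still in the directory where it started
if [ "$PWD" = "$start_dir" ]; then
  echo "the failed cd was ignored; still in $PWD"
fi

# The same step with an explicit check added
if cd "$scratch" 2>/dev/null; then
  guarded='continued'
else
  guarded='stopped'
  echo "cannot cd to $scratch; skipping the remaining steps"
fi
#+end_src

Imagine that the command after the unchecked =cd= had been =rm -rf *=: it
would have run in whatever directory the script happened to start in.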

** Fully Automating Complex Scripts

If we want to build complex artifacts and store them into databases or the
filesystem and/or change the state of the system or some subsystem in complex
ways, we would certainly prefer using a script, especially if we're going to
want to do similar tasks repeatedly.

Using a script
1. documents the process
2. saves labor
3. increases reliability

But 2 and 3 are only true if the script can detect and handle errors.
- Stopping with a transcript is only semi-automation

*** Error Detection Strategies

All processes (commands) in a Posix environment return an /Exit Status/.
- By convention, 0 means success, non-0 means failure or some other
  exceptional result
  - Note that this is the opposite of traditional Boolean values!
- The /Exit Status/ of the /Last Command/ is available in the =$?=
  pseudo-variable.

Some processes require explicit integrity tests
- The /Posix/ environment provides many often-helpful tools
  - =cmp= will compare two files that should be the same
  - =test= has lots of built-in tests
  - The =case= and =expr= commands can do pattern matching
  - etc.
- The =make= tool is often used to organize test scripts
  - =make test= is a frequent part of a build process

*** Error Recovery Strategies

Once a problem has been detected, error recovery needs to
- Capture what happened
- Restore the system to a known state
- Diagnose the problem
- Document and log the problem
- Execute an alternative process if there is one
- Indicate failure if we're out of alternatives

Coding this is usually done with /Exit Codes/ which control
- the =if= and =while= commands
- the Boolean operators =!= (not), =&&= (and then), =||= (or else)
- Here's a [[file://Reference-Sheets/bash-metas.pdf][handy reference sheet]] on these

In many cases a script is just one part of a more complex automated process,
so all it has to do is exit with a non-zero Exit Status, e.g. with =exit 1=,
although it's best to have different non-zero statuses for different kinds of
failure.
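
The ideas above can be tried out with a small sketch. The function
=check_file= and its status values 2 and 3 are invented for this illustration,
and we assume the common =mktemp= utility is available. The function returns a
different non-zero status for each kind of failure, and the caller uses =$?=,
=&&= and =||= to react.

#+begin_src bash :results output
# check_file is a hypothetical helper: one distinct status per kind of failure
check_file() {
  [ -e "$1" ] || return 2    # 2: the file is missing
  [ -s "$1" ] || return 3    # 3: the file exists but is empty
  return 0                   # 0: success
}

sample="$(mktemp)"           # mktemp creates an empty temporary file

check_file "$sample"
empty_status=$?              # save $? at once: the next command overwrites it
echo "empty file: status $empty_status"

echo 'some content' > "$sample"
check_file "$sample" && echo "file with content: OK"

rm -f "$sample"
check_file "$sample" || echo "removed file: status $?"
#+end_src

Run as written, this should report status 3 for the empty file and status 2
for the removed one, so a calling script can tell the two failures apart.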

A top level script may need to alert humans that an important process has
failed. This should /never/ be done by popping up a notification on a user's
screen asking them to report an error! A script should be able to send a text,
email, etc. or file a trouble ticket, etc. to bring the problem to the
attention of the right person in a timely fashion. Scripts can also monitor a
trouble ticket system or repeatedly check a system which is out of order and
escalate an issue when fixes are not occurring within an expected timeframe.

** How Do We Code When Things Might Fail?

At first blush it seems obvious what we should do if things might fail. We
simply use =if/else= statements to account for all possibilities. We'll start
out with just reporting problems, leaving it up to a human to read the problem
reports and deal with them. But we could add more code anywhere to do cleanup,
try fixes and alternatives, etc.

#+begin_src sh
pgm="${0##*/}"    # script name, for error messages
archive_url='https://ftp.postgresql.org/pub/source/v15.0/postgresql-15.0.tar.bz2'
if type wget >/dev/null 2>&1; then
  wget "$archive_url"
else
  >&2 echo "$pgm error: missing program wget; aborting"
  exit 1
fi
#+end_src

but then the =wget= command could fail, so maybe we better do

#+begin_src sh
pgm="${0##*/}"    # script name, for error messages
project='postgresql-15.0'
project_dir="$HOME/Projects/$project"
archive_file="$project.tar.bz2"
archive_url="https://ftp.postgresql.org/pub/source/v15.0/$archive_file"
mkdir -p "$project_dir"
cd "$project_dir"
git init
if type wget >/dev/null 2>&1; then
  if wget "$archive_url"; then
    tar xf "$archive_file"   # extract the downloaded file, not the URL
    cd "$project"            # the tarball unpacks into a directory named after the project
    ./configure
    make
    make check               # PostgreSQL names its test target "check"
    make install
  else
    >&2 echo "$pgm error: wget can't get $archive_url; aborting"
    exit 2
  fi
else
  >&2 echo "$pgm error: missing program wget; aborting"
  exit 1
fi
#+end_src

- What if we're missing any of the commands =git=, =tar= or =make=?
- Suppose we have all the programs: what can still fail?
- How would we need to write this to immediately report when something failed?
- How might we reverse any side effects before exiting?
- How might we log or communicate any problems appropriately?
- Do we care why a command might have failed?
  - Do we have the wrong version of the program?
  - Is the data given to it not as expected?
  - Are we missing permissions?

*** An Organized Semi-Automated Approach

We're still just going to report problems if they occur, so our script is
still not fully automating the task.

This next attempt is going to be tedious. When you get bored with reading it,
skip to the next section where we make it better!

#+begin_src bash
#!/usr/bin/bash -u
pgm='install-pgsql-v1'
project='postgresql-15.0'
project_dir="$HOME/Projects/$project"
archive_dir='https://ftp.postgresql.org/pub/source/v15.0'
archive_file="$project.tar.bz2"
archive_url="$archive_dir/$archive_file"
mkdir -p "$project_dir" || {
  >&2 echo "$pgm error: Can't create directory $project_dir"
  exit 1
}
cd "$project_dir" || {
  >&2 echo "$pgm error: Can't cd to directory $project_dir"
  exit 2
}
for p in git wget tar make; do
  type "$p" >/dev/null 2>&1 || {
    >&2 echo "$pgm error: Missing required command $p"
    exit 3
  }
  # How might we check the required versions?
done
git init || {
  >&2 echo "$pgm error: git init failed in directory $project_dir"
  exit 4
}
wget "$archive_url" || {
  >&2 echo "$pgm error: wget failed to get $archive_url"
  exit 5
}
tar xf "$archive_file" || {
  >&2 echo "$pgm error: tar failed to extract $archive_file"
  exit 6
}
# The tarball unpacks into a directory named after the project
cd "$project" || {
  >&2 echo "$pgm error: Can't cd to extracted directory $project"
  exit 7
}
./configure || {
  >&2 echo "$pgm error: configure of $archive_file failed in $project_dir"
  exit 8
}
make || {
  >&2 echo "$pgm error: make of $archive_file failed in $project_dir"
  exit 9
}
# PostgreSQL names its test target "check"
make check || {
  >&2 echo "$pgm error: test of $archive_file failed in $project_dir"
  exit 10
}
make install || {
  >&2 echo "$pgm error: install of $archive_file from $project_dir failed"
  exit 11
}
#+end_src

Hmm, that's a lot of boilerplate, and we haven't even added any fallback code!
- Let's see if we can simplify it first!

*** A Better Organized Semi-Automated Approach

#+begin_src bash
#!/usr/bin/bash -u
# expanding undefined variables will cause an error (-u in effect)

# set variables for clarity and multiple use
pgm='install-pgsql-v2'
project='postgresql-15.0'
project_dir="$HOME/Projects/$project"
archive_dir='https://ftp.postgresql.org/pub/source/v15.0'
archive_file="$project.tar.bz2"
archive_url="$archive_dir/$archive_file"

# define some handy functions -- could be imported from a library!
report() { local level="$1"; shift; >&2 echo "$pgm $level: $*"; }
error() { report error "$*"; return 1; }
# shift drops the exit code so that it is not printed as part of the message
error_exit() { local code="$1"; shift; report error "$*"; exit "$code"; }

mkdir -p "$project_dir" || error_exit 1 "Can't create directory $project_dir"
cd "$project_dir" || error_exit 2 "Can't cd to directory $project_dir"
for p in git wget tar make; do
  type "$p" >/dev/null 2>&1 || error_exit 3 "Missing required command $p"
  # How might we check the required versions?
done
git init || error_exit 4 "git init failed in directory $project_dir"
wget "$archive_url" || error_exit 5 "wget failed to get $archive_url"
tar xf "$archive_file" || error_exit 6 "tar failed to extract $archive_file"
# The tarball unpacks into a directory named after the project
cd "$project" || error_exit 7 "Can't cd to extracted directory $project"
./configure || error_exit 8 "configure of $project failed in $project_dir"
make || error_exit 9 "make of $project failed in $project_dir"
# PostgreSQL names its test target "check"
make check || error_exit 10 "test of $project failed in $project_dir"
make install || error_exit 11 "install of $project from $project_dir failed"
#+end_src

Anywhere we have an =error_exit= we could put back in a block to clean things
up, or we could be even fancier:

#+begin_src bash
{ git init || error "git init failed in directory $project_dir"; } &&
{ wget "$archive_url" || error "wget failed to get $archive_url"; } &&
{ tar xf "$archive_file" || error "tar failed to extract $archive_file"; } || {
  cd ..
  rm -rf "$project_dir"
  error_exit 4 "Removed botched $project_dir"
}
#+end_src

*** A Super-Organized Semi-Automated Approach

#+begin_src bash
#!/usr/bin/bash -u
# expanding undefined variables will cause an error (-u in effect)

# set variables for clarity and multiple use
pgm='install-pgsql-v3'
project='postgresql-15.0'
project_dir="$HOME/Projects/$project"
archive_dir='https://ftp.postgresql.org/pub/source/v15.0'
archive_file="$project.tar.bz2"
archive_url="$archive_dir/$archive_file"

# define some handy functions -- could be imported from a library!
try_code=10                   # non-zero and unique
try() {
  (( try_code++ ))            # each call gets its own failure code
  if "$@"; then
    echo "OK: $*"
  else
    >&2 echo "$pgm FAILED: $*"; exit "$try_code"
  fi
}

# check for the existence of required programs
# (not wrapped in try: try would exit before our own message could be printed)
for p in git wget tar make; do
  type "$p" >/dev/null 2>&1 || {
    >&2 echo "$pgm error: Missing required command $p"
    exit 1
  }
  # How might we check the required versions?
done

# Now the business logic
try mkdir -p "$project_dir"
try cd "$project_dir"
try git init
try wget "$archive_url"
try tar xf "$archive_file"
try cd "$project"             # the tarball unpacks into a directory named after the project
try ./configure
try make
try make check                # PostgreSQL names its test target "check"
try make install
#+end_src

Now where might we put fixup, fallback, cleanup or tactical communication
code?

*** Criticism

We've achieved some success in reducing boilerplate
- After we've defined variables and functions
- And checked for existence of the required programs
- We have about the same number of commands and complexity

We still need to address
- actually dealing with failure
  - diagnosing the source of the problem
  - trying any known fixes or alternatives
- removing (perhaps to a study area) any messes left behind
  - whether there's success or failure
- logging and communicating appropriately

** Examples of Resilient Scripts

- [[file:shell-script-example-pginstall.org][A Custom Installation of PostgreSQL from Source]]
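
As a small self-contained illustration of removing messes whether there's
success or failure, here is a sketch using the shell's =trap= builtin. It is
not taken from the linked example: the names =scratch_parent=, =work_dir= and
=demo_status= and the exit code 7 are made up, we assume =mktemp -d= is
available, and the subshell stands in for a whole script.

#+begin_src bash :results output
scratch_parent="$(mktemp -d)"    # throwaway parent directory for the demo
(
  work_dir="$scratch_parent/work"
  mkdir "$work_dir"
  # Runs on any exit from this subshell, whether success or failure
  trap 'rm -rf "$work_dir"; echo "cleaned up $work_dir"' EXIT
  echo "working in $work_dir"
  false || exit 7                # simulate a failing step
  echo "never reached"
)
demo_status=$?                   # the trap does not change the exit status
echo "subshell exit status: $demo_status"
[ -e "$scratch_parent/work" ] || echo "no mess left behind"
rmdir "$scratch_parent"
#+end_src

The cleanup is registered once, right after the mess is created, so every
later exit path is covered without repeating the cleanup code at each step.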