#+TITLE: Shell Script Resilience
#+SUBTITLE: Overview of the Problem Space
#+AUTHOR: J. Greg Davidson
#+DATE: 18 October 2022
#+OPTIONS: toc:nil 
#+OPTIONS: num:nil
# +OPTIONS: date:nil 
# +OPTIONS: author:nil 

* Shell Script Resilience

This is an OrgMode Document. If you are reading it with Emacs you can run the
examples as well as change them and run them again.

** Version 1: The Naive Script

If you know how to issue commands to a shell manually, then you can trivially
create a naive shell script:
- Create a new text file using your favorite editor
- Enter your desired commands into it
- Make the script executable with =chmod +x SCRIPT-FILE-NAME=
- To run it, either
      - specify a path to it when you run it
            - e.g. type =./SCRIPT-FILE-NAME= at the Shell Prompt
      - place it in a directory which is on your =$PATH=
            - e.g. =mv SCRIPT-FILE-NAME ~/bin=
            - so you can run it simply as =SCRIPT-FILE-NAME=

#+begin_src sh :results output
  echo Hello from ${0##*/}
  echo `date -I` is a good day to run!
#+end_src

#+RESULTS:
: Hello from sh
: 2022-10-18 is a good day to run!

*** Let's criticise version 1

A great thing about storing commands in a script is that we (and others) can now
re-execute these commands as often as we like far into the future. This is also
the problem: The people running the script (including a future version of you)
do not necessarily understand these commands *and* the script can fail because
the script may be used differently than you initially used it *and* the
resources invoked and accessed by the script may change in ways which cause the
script to fail.

**** Specifying an Interpreter

The excutable file is still just a text file, which won't do anything by itself.

Giving it execute permission doesn't say which program should be run to
/Interpret/ the script. So we get a system default, which may or may not be the
program we intended when we (or whoever) originally wrote the script.

We let the system know which program we would like it to use to /Interpret/ our
/Script/ with a [[https://en.wikipedia.org/wiki/Shebang_(Unix)][Shebang Line]] as the first line of the file.

If you're using =bash=, you can find out where a program is located on your
local system at the current time with the builtin =type= command.

#+begin_src bash :results output
 type bash 
#+end_src

#+RESULTS:
: bash is /usr/bin/bash

If you're using =bash=, you can get brief documentation about /any/ builtin
command using the builtin =help= command.

#+begin_src bash :results output
 help type
#+end_src

**** Documenting The Script

Let's assume that you've looked up the features behind
- ${0##*/}
- date -I
- And nesting commands inside of `back quotes`

Will someone looking at this script in the future
- possibly you, after you've forgotten some of these things
understand what these things mean?

You should document anything non-obvious
- The shell ignores anything after an unquoted # character
- If a short comment will do, give it
- Otherwise link to a more complete explanation elsewhere
 
**** Controlling Shell Variables

It's good to use shell variables for any content which might change or which
needs to be repeated, or simply content we wish to document by giving it a name
which explains its purpose or makes it easier to think about.

The value of a shell variable can contain nearly any string of characters,
including spaces and special characters which unless quoted may activate shell
features unexpectedly. Thus we almost always quote the initial value of a shell
variable with either 'single quotes' or "double quotes" and we almost always
quote a shell expansion with "double quotes". In those rare cases where we don't
quote shell values or expansions, a comment should explain why!

** Version 2: A Little Better

There is nothing wrong with creating a naive script, as long as you upgrade it
before using it again, and especially before giving it to anyone else to use!

So let's upgrade our script:

#+begin_src bash :results output
  #!/usr/bin/bash
  # Strip the directories off the executable program path
  script_name="${0##*/}"
  echo Hello from "$script_name"
  # Embed the ISO date in our message
  echo "`date -I` is a good day to run!"
#+end_src

#+RESULTS:
: Hello from bash
: 2022-10-18 is a good day to run!

There's not a lot of point making this script more resilient, but what about a
script which administers essential services?

** Ensuring the expected context

When you're issuing a command interactively, there's a certain background
context:
- You're logged in under a particular user account with particular permissions
- On a system with particular versions of programs and libraries
- You have a particular /Current Working Directory/
- Your particular /Environment Variables/ are set in particular ways
- You have particular configuration files with particular contents

All of the above particulars can effect what, if anything, a command
you issue might do.  In addition to that context
- consider the state of any resources your command might access
	- other files and directories on your local system
	- services and other processes on your local system
	- services and resources on other systems across the Internet

When you start to issue a series of familiar commands manually and one
of them gives an unexpected result
- especially an error message!
your /Common Sense/ tells you to
- stop and find out what happened
- and take corrective action
before continuing with the rest of your intended commands
- if they're even still relevant!

*Scripts have no Common Sense!* Unless you add explicit code to your script, it
will simply barrel on, executing the rest of the commands willy-nilly!

** Fully Automating Complex Scripts

If we want to build complex artifacts and store them into databases or the
filesystem and/or changes the state of the system or some subsystem in complex
ways, we would certainly prefer using a script, especially if we're going to
want to do similar tasks repeatedly.

Using a script
1. documents the process
2. saves labor
3. increases reliability
But 2 and 3 are only true if the script can detect and handle errors.
- Stopping with a transcript is only semi-automation

*** Error Detection Strategies

All processes (commands) in a Posix environment return an /Exit Status/.
- By convention, 0 means success, non-0 means something weird happened
      - Note that this is the opposite of traditional Boolean values!
- The /Exit Status/ of the /Last Command/ is available in the =$?= pseudo-variable.

Some processes require explicit integrity tests
- The /Posix/ environment provides some has many often helpful tools
      - =cmp= will compare two files that should be the same
      - =test= has lots of built-in tests
      - The =case= and =expr= commands can do pattern matching
      - etc.
- The =make= tool is often used to organize tests scripts
      - =make test= is a frequent part of a build process

*** Error Recovery Strategies

Once a problem has been detected, error recovery needs to
- Capture what happened
- Restore the system to a known state
- Diagnose the problem
- Document and log the problem
- Execute an alternative process if there is one
- Indicate failure if we're out of alternatives

Coding this is usually done with /Exit Codes/ which control
- the =if= and =while= commands
- the Boolean operators =!= (not), =&&= (and then), =||= (or else)
      - Here's a [[file://Reference-Sheets/bash-metas.pdf][handy reference sheet]] on such

In many cases a script is just one part of a more complex automated process, so
all it has to do is exit with a non-zero Exit Status, e.g. with =exit 1= --
although it's best to have different non-zero statuses for different kinds of
failure.

A top level script may need to alert humans that an important process has
failed. This should /never/ be done by popping up a notification on a user's
screen asking them to report an error! A script should be able to send a text,
email, etc. or file a trouble ticket, etc. to bring attention to the problem by
the right person in a timely fashion. Scripts can also monitor a trouble ticket
system or repeatedly check a system which is out of order and escalate an issue
when fixes are not occurring within an expected timeframe.

** How Do We Code When Things Might Fail?

At first blush it seems obvious what we should do if things might fail. We
simply use =if/else= statements to account for all possibilities.

We'll start out with just reporting problems, leaving it up to a human to read
the problem reports and deal with them.  But we could add more code anywhere to do
cleanup, try fixes and alternatives, etc.

#+begin_src sh
  archive_url='https://ftp.postgresql.org/pub/source/v15.0/postgresql-15.0.tar.bz2'
  if type wget >/dev/null; then
      wget "$archive_url"
  else
      >&2 "$pgm error: missing program wget; aborting"
      exit 1
  fi
#+end_src

but then the =wget= command could fail, so maybe we better do

#+begin_src sh
  project='postgresql-15.0'
  project_dir="$HOME/Projects/$project"
  archive_url="https://ftp.postgresql.org/pub/source/v15.0/$project.tar.bz2"
  mkdir -p "$project_dir"
  cd "$project_dir"
  git init
  if type wget >/dev/null; then
      if wget "$archive_url"; then
          tar xf "$archive_url"
          ./configure
          make
          make test
          make install
      else
          >&2 "$pgm error: wget can't get $archive_url; aborting"
          exit 2
      fi
  else
      >&2 "$pgm error: missing program wget; aborting"
      exit 1
  fi
#+end_src

- What if we're missing any of the commands =git=, =tar= or =make=?
- Suppose we have all the programs - what can still fail?
- How would we need to write this to immediately report when something failed?
- How might we reverse any side effects before exiting?
- How might we log or communicate any problems appropriately?
- Do we care why a command might have failed?
      - We have the wrong version of the program?
      - The data given to it is not as expected
      - We're missing permissions

*** An Organized Semi-Automated Approach

We're still just going to report problems if they occur, so our script is still
not fully automating the task.

This next attempt is going to be tedious. When you get bored with reading it,
skip to the next section where we make it better!

#+begin_src bash
  #!/usr/bin/bash -u
  pgm='install-pgsql-v1'
  project='postgresql-15.0'
  project_dir="$HOME/Projects/$project"
  archive_dir='https://ftp.postgresql.org/pub/source/v15.0'
  archive_file="$project.tar.bz2"
  archive_url="$archive_dir/$archive_file"
  mkdir -p "$project_dir" || {
      >&2 echo "$pgm error: Can't create directory $project_dir"
      exit 1
  }
  cd "$project_dir" || {
      >&2 echo "$pgm error: Can't cd to directory $project_dir"
      exit 2
  }
  for p in git wget tar; do
    type "$p" >/dev/null || {
      >&2 echo "$pgm error: Missing required command $p"
      exit 3
    }
    # How might we check the required versions?
  done
  git init || {
      >&2 echo "$pgm error: git init failed in directory $project_dir"
      exit 4
  }
  wget "$archive_url" || {
      >&2 echo "$pgm error: wget failed to get $archive_url"
      exit 5
  }
  tar xf "$archive_file" || {
      >&2 echo "$pgm error: tar failed to extract $archive_file"
      exit 6
  }
  ./configure || {
      >&2 echo "$pgm error: configure of $archive_file failed in $project_dir"
      exit 7
  }
  make || {
      >&2 echo "$pgm error: make of $archive_file failed in $project_dir"
      exit 8
  }
  make test || {
      >&2 echo "$pgm error: test of $archive_file failed in $project_dir"
      exit  9
  }
  make install || {
      >&2 echo "$pgm error: install of $archive_file from $project_dir failed"
      exit  10
  }
#+end_src

Hmm, that's a lot of boiler plate, and we haven't even added any fallback code!
- Let's see if we can simplify it first!

*** A Better Organized Semi-Automated Approach

#+begin_src bash
  #!/usr/bin/bash -u
  # expanding undefined variables will cause an error (-u in effect)
  # set variables for clarity and multiple use
  pgm='install-pgsql-v2'
  project='postgresql-15.0'
  project_dir="$HOME/Projects/$project"
  archive_dir='https://ftp.postgresql.org/pub/source/v15.0'
  archive_file="$project.tar.bz2"
  archive_url="$archive_dir/$archive_file"
  # define some handy functions -- could be imported from a library!
  report() { local level="$1"; shift; >&2 echo "$pgm $level: $*"; }
  error() { report error "$*"; return 1; }
  error_exit() { local code="$1"; report error "$*"; exit "$code"; }
  mkdir -p "$project_dir" || error_exit 1 "Can't create directory $project_dir"
  cd "$project_dir" || error_exit 2 "Can't cd to directory $project_dir"
  for p in git wget tar; do
    type "$p" >/dev/null || error_exit 3 "Missing required command $p"
    # How might we check the required versions?
  done
  git init || error_exit 4 "git init failed in directory $project_dir"
  wget "$archive_url" || error_exit 5 "wget failed to get $archive_url"
  tar xf "$archive_file" || error_exit 6 "tar failed to extract $archive_file"
  ./configure || error_exit 7 "configure of $project failed in $project_dir"
  make || error_exit 8 "make of $project failed in $project_dir"
  make test || error_exit 9 "test of $project failed in $project_dir"
  make install || error_exit 10 "install of $project from $project_dir failed"
#+end_src

Anywhere where we have an =error_exit= we could put back in a block to clean
things up, or we could be even fancier:

#+begin_src bash
  { git init || error "git init failed in directory $project_dir"; } &&
  { wget "$archive_url" || error "wget failed to get $archive_url"; } &&
  { tar xf "$archive_file" || error "tar failed to extract $archive_file"; } || {
      cd ..
      rm -rf "$project_dir"
      error_exit 4 "Removed botched $project_dir"
  }
#+end_src

*** A Super-Organized Semi-Automated Approach

#+begin_src bash
  #!/usr/bin/bash -u
  # expanding undefined variables will cause an error (-u in effect)
  # set variables for clarity and multiple use
  pgm='install-pgsql-v3'
  project='postgresql-15.0'
  project_dir="$HOME/Projects/$project"
  archive_dir='https://ftp.postgresql.org/pub/source/v15.0'
  archive_file="$project.tar.bz2"
  archive_url="$archive_dir/$archive_file"
  # define some handy functions -- could be imported from a library!
  try_code=10			# non-zero and unique
  try() {
    (( try_code++ ))	# increment the failure code
    if "$@"; then echo "OK: $@"
    else echo "$pgm FAILED: $@"; exit "$try_code"
    fi
  }
  # check for the existence of required programs
  for p in git wget tar; do
    try type "$p" >/dev/null || {
        >&2 echo "Missing required command $p"
        exit 1
    }
    # How might we check the required versions?
  done
  # Now the business logic
  try mkdir -p "$project_dir"
  try cd "$project_dir"
  try git init
  try wget "$archive_url"
  try tar xf "$archive_file"
  try ./configure
  try make
  try make test
  try make install
#+end_src

Now where might we put fixup, fallback, cleanup or tactical communication code?

*** Criticism

We've achieved some success in reducing boiler plate
- After we've defined variables and functions
- And checked for existence of the required programs
- We have about the same number of commands and complexity

We still need to deal with
- actually dealing with failure
      - diagnosing the source of the problem
      - trying any known fixes or alternatives
      - removing (perhaps to a study area) any messes left behind
- whether there's success or failure
      - logging and communicating appropriately

** Examples of Resilient Scripts

- [[file:shell-script-example-pginstall.org][A Custom Installation of PostgreSQL from Source]]