---
title: "Linux Basics"
author: "Dr. Hua Zhou"
date: "Jan 11, 2018"
output: ioslides_presentation
subtitle: Biostat M280
bibliography: ../bib-HZ.bib
---
## Why Linux
Linux is _the_ most common platform for scientific computing.
- Open source and community support.
- Things break; when they break using Linux, it's easy to fix.
- Scalability: portable devices (Android, iOS), laptops, servers, clusters, and super computers.
- E.g. UCLA Hoffmann2 cluster runs on Linux.
- Cost: it's free!
## [Distributions of Linux](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg)
- Debian/Ubuntu is a popular choice for personal computers.
- RHEL/CentOS is popular on servers.
- The teaching server for this class runs CentOS 7.
- Mac OS was originally derived from Unix/Linux (Darwin kernel). It is POSIX compliant. Most shell commands we review here apply to Mac OS terminal as well. Windows/DOS, unfortunately, is a totally different breed.
## {.smaller}
- Show distribution/version on Linux:
```{bash}
cat /etc/*-release
```
----
- Show distribution/version on Mac:
```{bash, eval=FALSE}
sw_vers -productVersion
```
or
```{bash, eval=FALSE}
system_profiler SPSoftwareDataType
```
# Linux shells
## Linux shells
- A shell translates commands to OS instructions.
- Most commonly used shells include `bash`, `csh`, `tcsh`, `zsh`, etc.
- Sometimes a script or a command does not run simply because it's written for another shell.
- We mostly use `bash` shell commands in this class.
----
- Determine the current shell:
```{bash}
echo $SHELL
```
- List available shells:
```{bash}
cat /etc/shells
```
----
- Change to another shell:
```{bash, eval=FALSE}
exec bash -l
```
The `-l` option indicates it should be a login shell.
- Change your login shell permanently:
```{bash, eval=FALSE}
chsh -s /bin/bash userid
```
Then log out and log in.
## Bash completion
Bash provides the following standard completion for the Linux users by default. Much less typing errors and time!
- Pathname completion.
- Filename completion.
- Variablename completion: `echo $[TAB][TAB]`.
- Username completion: `cd ~[TAB][TAB]`.
- Hostname completion `ssh huazhou@[TAB][TAB]`.
- It can also be customized to auto-complete other stuff such as options and command's arguments. Google `bash completion` for more information.
# Navigate file system
## Linux directory structure
- Upon log in, user is at his/her home directory.
## Move around the file system {.smaller}
- `pwd` prints absolute path to the current working directory:
```{bash}
pwd
```
- `ls` lists contents of a directory:
```{bash}
ls
```
## {.smaller}
- `ls -l` lists detailed contents of a directory:
```{bash}
ls -l
```
## {.smaller}
- `ls -al` lists all contents of a directory, including those start with `.` (hidden folders):
```{bash, small=TRUE}
ls -al
```
----
- `..` denotes the parent of current working directory.
- `.` denotes the current working directory.
- `~` denotes user's home directory.
- `/` denotes the root directory.
- `cd ..` changes to parent directory.
- `cd` or `cd ~` changes to home directory.
- `cd /` changes to root directory.
## File permissions
----
- `chmod g+x file` makes a file executable to group members.
- `chmod 751 file` sets permission `rwxr-x--x` to a file.
- `groups userid` shows which group(s) a user belongs to:
```{bash}
groups huazhou
```
## Manipulate files and directories
- `cp` copies file to a new location.
- `mv` moves file to a new location.
- `touch` creates a text file; if file already exists, it's left unchanged.
- `rm` deletes a file.
- `mkdir` creates a new directory.
- `rmdir` deletes an _empty_ directory.
- `rm -rf` deletes a directory and all contents in that directory (be cautious using the `-f` option ...).
## Find files {.smaller}
- `locate` locates a file by name:
```{bash}
locate linux.Rmd
```
- `which` locates a program:
```{bash}
which R
```
----
- `find` is similar to `locate` but has more functionalities, e.g., select files by age, size, permissions, .... , and is ubiquitous.
```{bash}
find linux.Rmd
```
```{bash}
find /home/huazhou -name linux.Rmd
```
## Wildcard characters {.smaller}
| Wildcard | Matches |
|------------|-------------------------------------|
| `?` | any single character |
| `*` | any character 0 or more times |
| `+` | one or more preceding pattern |
| `^` | beginning of the line |
| `$` | end of the line |
| `[set]` | any character in set |
| `[!set]` | any character not in set |
| `[a-z]` | any lowercase letter |
| `[0-9]` | any number (same as `[0123456789]`) |
## {.smaller}
-
```{bash}
# all png files in current folder
ls -l *.png
```
## Regular expression
- Wildcards are examples of _regular expressions_.
- Regular expressions are a powerful tool to efficiently sift through large amounts of text: record linking, data cleaning, scraping data from website or other data-feed.
- Google `regular expressions` to learn.
# Work with text files
## View/peek text files
- `cat` prints the contents of a file:
```{bash, size='smallsize'}
cat linux.Rmd
```
----
- `head -l` prints the first $l$ lines of a file:
```{bash}
head linux.Rmd
```
----
- `tail -l` prints the last $l$ lines of a file:
```{bash}
tail linux.Rmd
```
## `less` is more; `more` is less
- `more` browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the `q` key.
- `less` is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input.
- `less` doesn't need to read the whole file, i.e., it loads files faster than `more`.
## `grep`
`grep` prints lines that match an expression:
- Show lines that contain string `CentOS`:
```{bash}
# quotes not necessary if not a regular expression
grep 'CentOS' linux.Rmd
```
----
- Search multiple text files:
```{bash}
grep 'CentOS' *.Rmd
```
----
- Show matching line numbers:
```{bash}
grep -n 'CentOS' linux.Rmd
```
----
- Find all files in current directory with `.png` extension:
```{bash}
ls | grep '\.png$'
```
- Find all directories in the current directory:
```{bash}
ls -al | grep '^d'
```
## `sed`
- `sed` is a stream editor.
- Replace `CentOS` by `RHEL` in a text file:
```{bash}
sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
```
## `awk`
- `awk` is a filter and report writer.
- Print sorted list of login names:
```{bash}
awk -F: '{ print $1 }' /etc/passwd | sort | head -5
```
----
- Print number of lines in a file, as `NR` stands for Number of Rows:
```{bash}
awk 'END { print NR }' /etc/passwd
```
or
```{bash}
wc -l /etc/passwd
```
or
```{bash}
wc -l < /etc/passwd
```
----
- Print login names with UID in range `1000-1035`:
```{bash}
awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd
```
- Print login names and log-in shells in comma-seperated format:
```{bash}
awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
```
----
- Print login names and indicate those with UID>1000 as `vip`:
```{bash}
awk -F: -v status="" '{OFS = ","}
{if ($3 >= 1000) status="vip"; else status="regular"}
{print $1, status}' /etc/passwd
```
## Piping and redirection
- `|` sends output from one command as input of another command.
- `>` directs output from one command to a file.
- `>>` appends output from one command to a file.
- `<` reads input from a file.
- Combinations of shell commands (`grep`, `sed`, `awk`, ...), piping and redirection, and regular expressions allow us pre-process and reformat huge text files efficiently.
- See HW1.
## Text editors
Source: [Editor War](http://en.wikipedia.org/wiki/Editor_war) on Wikipedia.
## Emacs
- `Emacs` is a powerful text editor with extensive support for many languages including `R`, $\LaTeX$, `python`, and `C/C++`; however it's _not_ installed by default on many Linux distributions.
- Basic survival commands:
- `emacs filename` to open a file with emacs.
- `CTRL-x CTRL-f` to open an existing or new file.
- `CTRL-x CTRX-s` to save.
- `CTRL-x CTRL-w` to save as.
- `CTRL-x CTRL-c` to quit.
----
- Google `emacs cheatsheet`
`C-` means hold the `control` key, and press ``.
`M-` means press the `Esc` key once, and press ``.
## Vi
- `Vi` is ubiquitous (POSIX standard). Learn at least its basics; otherwise you can edit nothing on some clusters.
- Basic survival commands:
- `vi filename` to start editing a file.
- `vi` is a _modal_ editor: _insert_ mode and _normal_ mode. Pressing `i` switches from the normal mode to insert mode. Pressing `ESC` switches from the insert mode to normal mode.
- `:x` quits `vi` and saves changes.
- `:q!` quits vi without saving latest changes.
- `:w` saves changes.
- `:wq` quits `vi` and saves changes.
----
- Google `vi cheatsheet`
## IDE (Integrated Development Environment)
- Statisticians write a lot of code. Critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.
- R Studio, Eclipse, Emacs, Matlab, Visual Studio, etc.
# Processes
## Processes
- OS runs processes on behalf of user.
- Each process has Process ID (PID), Username (UID), Parent process ID (PPID), Time and data process started (STIME), time running (TIME), etc.
```{bash}
ps
```
----
- All current running processes:
```{bash}
ps -eaf
```
----
- All Python processes:
```{bash}
ps -eaf | grep python
```
----
- Process with PID=1:
```{bash}
ps -fp 1
```
----
- All processes owned by a user:
```{bash}
ps -fu huazhou
```
## Kill processes
- Kill process with PID=1001:
```{bash, eval=FALSE}
kill 1001
```
- Kill all R processes.
```{bash, eval=FALSE}
killall -r R
```
## `top`
- `top` prints realtime process information (very useful).
```{bash, eval=FALSE}
top
```
# Secure shell (SSH)
## SSH
SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.
- On Linux or Mac, access the teaching server by
```{bash, eval=FALSE}
ssh username@35.227.165.60
```
- Windows machines need the [PuTTY](http://www.putty.org) program (free).
## Use keys over password
- Key authentication is more secure than password. Most passwords are weak.
- Script or a program may need to systematically SSH into other machines.
- Log into multiple machines using the same key.
- Seamless use of many services: Git, svn, Amazon EC2 cloud service, parallel computing on multiple hosts, etc.
- Many servers only allow key authentication and do not accept password authentication.
## Key authentication
----
- _Public key_. Put on the machine(s) you want to log in.
- _Private key_. Put on your own computer. Consider this as the actual key in your pocket; never give to others.
- Messages from server to your computer is encrypted with your public key. It can only be decrypted using your private key.
- Messages from your computer to server is signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).
## Steps for generate keys {.smaller}
- On Linux or Mac, to generate a key pair:
```{bash, eval=FALSE}
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
```
- `[KEY_FILENAME]` is the name that you want to use for your SSH key files. For example, a filename of `my-ssh-key` generates a private key file named `my-ssh-key` and a public key file named `my-ssh-key.pub`.
- `[USERNAME]` is the user for whom you will apply this SSH key.
- Use a (optional) paraphrase different form password.
- Set correct permissions on the `.ssh` folder and key files
```{bash, eval=FALSE}
chmod 400 ~/.ssh/[KEY_FILENAME]
```
----
- Append the public key to the `~/.ssh/authorized_keys` file of any Linux machine we want to SSH to, e.g.,
```{bash, eval=FALSE}
ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60
```
- Test your new key.
```{bash, eval=FALSE}
ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60
```
- Now you don't need password each time you connect from your machine to the teaching server.
----
- If you set paraphrase when generating keys, you'll be prompted for the paraphrase each time the private key is used. Avoid repeatedly entering the paraphrase by using `ssh-agent` on Linux/Mac or Pagent on Windows.
- Same key pair can be used between any two machines. We don't need to regenerate keys for each new connection.
- For Windows users, the private key generated by `ssh-keygen` cannot be directly used by PuTTY; use PuTTYgen for conversion. Then let PuTTYgen use the converted private key. Read [tutorial](https://www.digitalocean.com/community/tutorials/how-to-create-ssh-keys-with-putty-to-connect-to-a-vps).
## Transfer files between machines
- `scp` securely transfers files between machines using SSH.
```{bash, eval=FALSE}
## copy file from local to remote
scp localfile username@35.227.165.60:/pathtofolder
```
```{bash, eval=FALSE}
## copy file from remote to local
scp username@35.227.165.60:/pathtofile pathtolocalfolder
```
- `sftp` is FTP via SSH.
- GUIs for Windows (WinSCP) or Mac (Cyberduck).
- (My preferred way) Use a **version control system** to sync project files between different machines and systems.
## Line breaks in text files
- Windows uses a pair of `CR` and `LF` for line breaks.
- Linux/Unix uses an `LF` character only.
- MacOS X also uses a single `LF` character. But old Mac OS used a single `CR` character for line breaks.
- If transferred in binary mode (bit by bit) between OSs, a text file could look a mess.
- Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred.
# Run R in Linux
## Interactive mode
- Start R in the interactive mode by typing `R` in shell.
- Then run R script by
```{r, eval=FALSE}
source("script.R")
```
## Batch mode {.smaller}
- Demo script [`meanEst.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/meanEst.R) implements an (terrible) estimator of mean
$$
{\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{x_i \text{ is prime}}}{\sum_{i=1}^n 1_{x_i \text{ is prime}}}.
$$
```{bash, echo=FALSE}
cat meanEst.R
```
----
- To run your R code non-interactively aka in batch mode, we have at least two options:
```{bash, eval=FALSE}
# default output to meanEst.Rout
R CMD BATCH meanEst.R
```
or
```{bash, eval=FALSE}
# output to stdout
Rscript meanEst.R
```
- Typically automate batch calls using a scripting language, e.g., Python, perl, and shell script.
## Pass arguments to R scripts
- Specify arguments in `R CMD BATCH`:
```{bash, eval=FALSE}
R CMD BATCH '--args mu=1 sig=2 kap=3' script.R
```
- Specify arguments in `Rscript`:
```{bash, eval=FALSE}
Rscript script.R mu=1 sig=2 kap=3
```
- Parse command line arguments using magic formula
```{r, eval=FALSE}
for (arg in commandArgs(T)) {
eval(parse(text=arg))
}
```
in R script. After calling the above code, all command line arguments will be available in the global namespace.
----
- To understand the magic formula `commandArgs`, run R by:
```{bash, eval=FALSE}
R '--args mu=1 sig=2 kap=3'
```
and then issue commands in R
```{r, eval=FALSE}
commandArgs()
commandArgs(TRUE)
```
----
- Understand the magic formula `parse` and `eval`:
```{r, error=TRUE}
rm(list=ls())
print(x)
parse(text="x=3")
eval(parse(text="x=3"))
print(x)
```
## {.smaller}
- [`runSim.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/runSim.R) has components: (1) method implementation, (2) data generator with unspecified parameter `n`, (3) estimation based on generated data, and (4) **command argument parser**.
```{bash, echo=FALSE}
cat runSim.R
```
----
- Call `runSim.R` with sample size `n=100`:
```{bash}
R CMD BATCH '--args n=100' runSim.R
```
or
```{bash}
Rscript runSim.R n=100
```
## Run long jobs
- Many statistical computing tasks take long: simulation, MCMC, etc.
- `nohup` command in Linux runs program(s) immune to hangups and writes output to `nohup.out` by default. Logging out will _not_ kill the process; we can log in later to check status and results.
- `nohup` is POSIX standard thus available on Linux and MacOS.
- Run `runSim.R` in background and writes output to `nohup.out`:
```{bash}
nohup Rscript runSim.R n=100 &
```
## screen
- `screen` is another popular utility, but not installed by default.
- Typical workflow using `screen`.
0. Access remote server using `ssh`.
0. Start jobs in batch mode.
0. Detach jobs.
0. Exit from server, wait for jobs to finish.
0. Access remote server using `ssh`.
0. Re-attach jobs, check on progress, get results, etc.
## Use R to call R
R in conjuction with `nohup` or `screen` can be used to orchestrate a large simulation study.
- It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation.
- We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc.
- Python in many ways makes a better _glue_; we may discuss this later in the course.
----
- Suppose we have
- [`runSim.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/runSim.R) which runs a simulation based on command line argument `n`.
- A large collection of `n` values that we want to use in our simulation study.
- Access to a server with 128 cores.
- Option 1: manually call `runSim.R` for each setting.
- Option 2: automate calls using R and `nohup`. [autoSim.R](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/autoSim.R)
----
-
```{bash}
cat autoSim.R
```
----
-
```{bash}
Rscript autoSim.R
```
```{bash, echo=FALSE, eval=TRUE}
rm n*.txt *.Rout
```
- Now we just need write a script to collect results from the output files.