--- title: "Linux Basics" author: "Dr. Hua Zhou" date: "Jan 11, 2018" output: ioslides_presentation subtitle: Biostat M280 bibliography: ../bib-HZ.bib --- ## Why Linux Linux is _the_ most common platform for scientific computing. - Open source and community support. - Things break; when they break using Linux, it's easy to fix. - Scalability: portable devices (Android, iOS), laptops, servers, clusters, and super computers. - E.g. UCLA Hoffmann2 cluster runs on Linux. - Cost: it's free! ## [Distributions of Linux](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg) - Debian/Ubuntu is a popular choice for personal computers. - RHEL/CentOS is popular on servers. - The teaching server for this class runs CentOS 7. - Mac OS was originally derived from Unix/Linux (Darwin kernel). It is POSIX compliant. Most shell commands we review here apply to Mac OS terminal as well. Windows/DOS, unfortunately, is a totally different breed. ## {.smaller} - Show distribution/version on Linux: ```{bash} cat /etc/*-release ``` ---- - Show distribution/version on Mac: ```{bash, eval=FALSE} sw_vers -productVersion ``` or ```{bash, eval=FALSE} system_profiler SPSoftwareDataType ``` # Linux shells ## Linux shells - A shell translates commands to OS instructions. - Most commonly used shells include `bash`, `csh`, `tcsh`, `zsh`, etc. - Sometimes a script or a command does not run simply because it's written for another shell. - We mostly use `bash` shell commands in this class. ---- - Determine the current shell: ```{bash} echo $SHELL ``` - List available shells: ```{bash} cat /etc/shells ``` ---- - Change to another shell: ```{bash, eval=FALSE} exec bash -l ``` The `-l` option indicates it should be a login shell. - Change your login shell permanently: ```{bash, eval=FALSE} chsh -s /bin/bash userid ``` Then log out and log in. ## Bash completion Bash provides the following standard completion for the Linux users by default. Much less typing errors and time! - Pathname completion. - Filename completion. - Variablename completion: `echo $[TAB][TAB]`. - Username completion: `cd ~[TAB][TAB]`. - Hostname completion `ssh huazhou@[TAB][TAB]`. - It can also be customized to auto-complete other stuff such as options and command's arguments. Google `bash completion` for more information. # Navigate file system ## Linux directory structure
- Upon log in, user is at his/her home directory. ## Move around the file system {.smaller} - `pwd` prints absolute path to the current working directory: ```{bash} pwd ``` - `ls` lists contents of a directory: ```{bash} ls ``` ## {.smaller} - `ls -l` lists detailed contents of a directory: ```{bash} ls -l ``` ## {.smaller} - `ls -al` lists all contents of a directory, including those start with `.` (hidden folders): ```{bash, small=TRUE} ls -al ``` ---- - `..` denotes the parent of current working directory. - `.` denotes the current working directory. - `~` denotes user's home directory. - `/` denotes the root directory. - `cd ..` changes to parent directory. - `cd` or `cd ~` changes to home directory. - `cd /` changes to root directory. ## File permissions
---- - `chmod g+x file` makes a file executable to group members. - `chmod 751 file` sets permission `rwxr-x--x` to a file. - `groups userid` shows which group(s) a user belongs to: ```{bash} groups huazhou ``` ## Manipulate files and directories - `cp` copies file to a new location. - `mv` moves file to a new location. - `touch` creates a text file; if file already exists, it's left unchanged. - `rm` deletes a file. - `mkdir` creates a new directory. - `rmdir` deletes an _empty_ directory. - `rm -rf` deletes a directory and all contents in that directory (be cautious using the `-f` option ...). ## Find files {.smaller} - `locate` locates a file by name: ```{bash} locate linux.Rmd ``` - `which` locates a program: ```{bash} which R ``` ---- - `find` is similar to `locate` but has more functionalities, e.g., select files by age, size, permissions, .... , and is ubiquitous. ```{bash} find linux.Rmd ``` ```{bash} find /home/huazhou -name linux.Rmd ``` ## Wildcard characters {.smaller} | Wildcard | Matches | |------------|-------------------------------------| | `?` | any single character | | `*` | any character 0 or more times | | `+` | one or more preceding pattern | | `^` | beginning of the line | | `$` | end of the line | | `[set]` | any character in set | | `[!set]` | any character not in set | | `[a-z]` | any lowercase letter | | `[0-9]` | any number (same as `[0123456789]`) | ## {.smaller} - ```{bash} # all png files in current folder ls -l *.png ``` ## Regular expression - Wildcards are examples of _regular expressions_. - Regular expressions are a powerful tool to efficiently sift through large amounts of text: record linking, data cleaning, scraping data from website or other data-feed. - Google `regular expressions` to learn. # Work with text files ## View/peek text files - `cat` prints the contents of a file: ```{bash, size='smallsize'} cat linux.Rmd ``` ---- - `head -l` prints the first $l$ lines of a file: ```{bash} head linux.Rmd ``` ---- - `tail -l` prints the last $l$ lines of a file: ```{bash} tail linux.Rmd ``` ## `less` is more; `more` is less - `more` browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the `q` key. - `less` is also a pager, but has more functionalities, e.g., scroll upwards and downwards through the input. - `less` doesn't need to read the whole file, i.e., it loads files faster than `more`. ## `grep` `grep` prints lines that match an expression: - Show lines that contain string `CentOS`: ```{bash} # quotes not necessary if not a regular expression grep 'CentOS' linux.Rmd ``` ---- - Search multiple text files: ```{bash} grep 'CentOS' *.Rmd ``` ---- - Show matching line numbers: ```{bash} grep -n 'CentOS' linux.Rmd ``` ---- - Find all files in current directory with `.png` extension: ```{bash} ls | grep '\.png$' ``` - Find all directories in the current directory: ```{bash} ls -al | grep '^d' ``` ## `sed` - `sed` is a stream editor. - Replace `CentOS` by `RHEL` in a text file: ```{bash} sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL ``` ## `awk` - `awk` is a filter and report writer. - Print sorted list of login names: ```{bash} awk -F: '{ print $1 }' /etc/passwd | sort | head -5 ``` ---- - Print number of lines in a file, as `NR` stands for Number of Rows: ```{bash} awk 'END { print NR }' /etc/passwd ``` or ```{bash} wc -l /etc/passwd ``` or ```{bash} wc -l < /etc/passwd ``` ---- - Print login names with UID in range `1000-1035`: ```{bash} awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd ``` - Print login names and log-in shells in comma-seperated format: ```{bash} awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd ``` ---- - Print login names and indicate those with UID>1000 as `vip`: ```{bash} awk -F: -v status="" '{OFS = ","} {if ($3 >= 1000) status="vip"; else status="regular"} {print $1, status}' /etc/passwd ``` ## Piping and redirection - `|` sends output from one command as input of another command. - `>` directs output from one command to a file. - `>>` appends output from one command to a file. - `<` reads input from a file. - Combinations of shell commands (`grep`, `sed`, `awk`, ...), piping and redirection, and regular expressions allow us pre-process and reformat huge text files efficiently. - See HW1. ## Text editors
Source: [Editor War](http://en.wikipedia.org/wiki/Editor_war) on Wikipedia. ## Emacs - `Emacs` is a powerful text editor with extensive support for many languages including `R`, $\LaTeX$, `python`, and `C/C++`; however it's _not_ installed by default on many Linux distributions. - Basic survival commands: - `emacs filename` to open a file with emacs. - `CTRL-x CTRL-f` to open an existing or new file. - `CTRL-x CTRX-s` to save. - `CTRL-x CTRL-w` to save as. - `CTRL-x CTRL-c` to quit. ---- - Google `emacs cheatsheet`
`C-
## IDE (Integrated Development Environment) - Statisticians write a lot of code. Critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc. - R Studio, Eclipse, Emacs, Matlab, Visual Studio, etc. # Processes ## Processes - OS runs processes on behalf of user. - Each process has Process ID (PID), Username (UID), Parent process ID (PPID), Time and data process started (STIME), time running (TIME), etc. ```{bash} ps ``` ---- - All current running processes: ```{bash} ps -eaf ``` ---- - All Python processes: ```{bash} ps -eaf | grep python ``` ---- - Process with PID=1: ```{bash} ps -fp 1 ``` ---- - All processes owned by a user: ```{bash} ps -fu huazhou ``` ## Kill processes - Kill process with PID=1001: ```{bash, eval=FALSE} kill 1001 ``` - Kill all R processes. ```{bash, eval=FALSE} killall -r R ``` ## `top` - `top` prints realtime process information (very useful). ```{bash, eval=FALSE} top ```
# Secure shell (SSH) ## SSH SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network. - On Linux or Mac, access the teaching server by ```{bash, eval=FALSE} ssh username@35.227.165.60 ``` - Windows machines need the [PuTTY](http://www.putty.org) program (free). ## Use keys over password - Key authentication is more secure than password. Most passwords are weak. - Script or a program may need to systematically SSH into other machines. - Log into multiple machines using the same key. - Seamless use of many services: Git, svn, Amazon EC2 cloud service, parallel computing on multiple hosts, etc. - Many servers only allow key authentication and do not accept password authentication. ## Key authentication
---- - _Public key_. Put on the machine(s) you want to log in. - _Private key_. Put on your own computer. Consider this as the actual key in your pocket; never give to others. - Messages from server to your computer is encrypted with your public key. It can only be decrypted using your private key. - Messages from your computer to server is signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication). ## Steps for generate keys {.smaller} - On Linux or Mac, to generate a key pair: ```{bash, eval=FALSE} ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME] ``` - `[KEY_FILENAME]` is the name that you want to use for your SSH key files. For example, a filename of `my-ssh-key` generates a private key file named `my-ssh-key` and a public key file named `my-ssh-key.pub`. - `[USERNAME]` is the user for whom you will apply this SSH key. - Use a (optional) paraphrase different form password. - Set correct permissions on the `.ssh` folder and key files ```{bash, eval=FALSE} chmod 400 ~/.ssh/[KEY_FILENAME] ``` ---- - Append the public key to the `~/.ssh/authorized_keys` file of any Linux machine we want to SSH to, e.g., ```{bash, eval=FALSE} ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60 ``` - Test your new key. ```{bash, eval=FALSE} ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@35.227.165.60 ``` - Now you don't need password each time you connect from your machine to the teaching server. ---- - If you set paraphrase when generating keys, you'll be prompted for the paraphrase each time the private key is used. Avoid repeatedly entering the paraphrase by using `ssh-agent` on Linux/Mac or Pagent on Windows. - Same key pair can be used between any two machines. We don't need to regenerate keys for each new connection. - For Windows users, the private key generated by `ssh-keygen` cannot be directly used by PuTTY; use PuTTYgen for conversion. Then let PuTTYgen use the converted private key. Read [tutorial](https://www.digitalocean.com/community/tutorials/how-to-create-ssh-keys-with-putty-to-connect-to-a-vps). ## Transfer files between machines - `scp` securely transfers files between machines using SSH. ```{bash, eval=FALSE} ## copy file from local to remote scp localfile username@35.227.165.60:/pathtofolder ``` ```{bash, eval=FALSE} ## copy file from remote to local scp username@35.227.165.60:/pathtofile pathtolocalfolder ``` - `sftp` is FTP via SSH. - GUIs for Windows (WinSCP) or Mac (Cyberduck). - (My preferred way) Use a **version control system** to sync project files between different machines and systems. ## Line breaks in text files - Windows uses a pair of `CR` and `LF` for line breaks. - Linux/Unix uses an `LF` character only. - MacOS X also uses a single `LF` character. But old Mac OS used a single `CR` character for line breaks. - If transferred in binary mode (bit by bit) between OSs, a text file could look a mess. - Most transfer programs automatically switch to text mode when transferring text files and perform conversion of line breaks between different OSs; but I used to run into problems using WinSCP. Sometimes you have to tell WinSCP explicitly a text file is being transferred. # Run R in Linux ## Interactive mode - Start R in the interactive mode by typing `R` in shell. - Then run R script by ```{r, eval=FALSE} source("script.R") ``` ## Batch mode {.smaller} - Demo script [`meanEst.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/meanEst.R) implements an (terrible) estimator of mean $$ {\widehat \mu}_n = \frac{\sum_{i=1}^n x_i 1_{x_i \text{ is prime}}}{\sum_{i=1}^n 1_{x_i \text{ is prime}}}. $$ ```{bash, echo=FALSE} cat meanEst.R ``` ---- - To run your R code non-interactively aka in batch mode, we have at least two options: ```{bash, eval=FALSE} # default output to meanEst.Rout R CMD BATCH meanEst.R ``` or ```{bash, eval=FALSE} # output to stdout Rscript meanEst.R ``` - Typically automate batch calls using a scripting language, e.g., Python, perl, and shell script. ## Pass arguments to R scripts - Specify arguments in `R CMD BATCH`: ```{bash, eval=FALSE} R CMD BATCH '--args mu=1 sig=2 kap=3' script.R ``` - Specify arguments in `Rscript`: ```{bash, eval=FALSE} Rscript script.R mu=1 sig=2 kap=3 ``` - Parse command line arguments using magic formula ```{r, eval=FALSE} for (arg in commandArgs(T)) { eval(parse(text=arg)) } ``` in R script. After calling the above code, all command line arguments will be available in the global namespace. ---- - To understand the magic formula `commandArgs`, run R by: ```{bash, eval=FALSE} R '--args mu=1 sig=2 kap=3' ``` and then issue commands in R ```{r, eval=FALSE} commandArgs() commandArgs(TRUE) ``` ---- - Understand the magic formula `parse` and `eval`: ```{r, error=TRUE} rm(list=ls()) print(x) parse(text="x=3") eval(parse(text="x=3")) print(x) ``` ## {.smaller} - [`runSim.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/runSim.R) has components: (1) method implementation, (2) data generator with unspecified parameter `n`, (3) estimation based on generated data, and (4) **command argument parser**. ```{bash, echo=FALSE} cat runSim.R ``` ---- - Call `runSim.R` with sample size `n=100`: ```{bash} R CMD BATCH '--args n=100' runSim.R ``` or ```{bash} Rscript runSim.R n=100 ``` ## Run long jobs - Many statistical computing tasks take long: simulation, MCMC, etc. - `nohup` command in Linux runs program(s) immune to hangups and writes output to `nohup.out` by default. Logging out will _not_ kill the process; we can log in later to check status and results. - `nohup` is POSIX standard thus available on Linux and MacOS. - Run `runSim.R` in background and writes output to `nohup.out`: ```{bash} nohup Rscript runSim.R n=100 & ``` ## screen - `screen` is another popular utility, but not installed by default. - Typical workflow using `screen`. 0. Access remote server using `ssh`. 0. Start jobs in batch mode. 0. Detach jobs. 0. Exit from server, wait for jobs to finish. 0. Access remote server using `ssh`. 0. Re-attach jobs, check on progress, get results, etc. ## Use R to call R R in conjuction with `nohup` or `screen` can be used to orchestrate a large simulation study. - It can be more elegant, transparent, and robust to parallelize jobs corresponding to different scenarios (e.g., different generative models) outside of the code used to do statistical computation. - We consider a simulation study in R but the same approach could be used with code written in Julia, Matlab, Python, etc. - Python in many ways makes a better _glue_; we may discuss this later in the course. ---- - Suppose we have - [`runSim.R`](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/runSim.R) which runs a simulation based on command line argument `n`. - A large collection of `n` values that we want to use in our simulation study. - Access to a server with 128 cores. - Option 1: manually call `runSim.R` for each setting. - Option 2: automate calls using R and `nohup`. [autoSim.R](http://hua-zhou.github.io/teaching/biostatm280-2018winter/slides/02-linux/autoSim.R) ---- - ```{bash} cat autoSim.R ``` ---- - ```{bash} Rscript autoSim.R ``` ```{bash, echo=FALSE, eval=TRUE} rm n*.txt *.Rout ``` - Now we just need write a script to collect results from the output files.