---
title: "Linux Basics"
author: "Dr. Hua Zhou @ UCLA"
date: "Jan 7, 2020"
subtitle: Biostat 203B
output: 
  # ioslides_presentation: default
  html_document:
    toc: true
    toc_depth: 4
bibliography: ../bib-HZ.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, comment = '')
```

## Why Linux

Linux is _the_ most common platform for scientific computing and deployment of data science tools.

- Open source and community support.

- Things break; when they break using Linux, it's easy to fix.

- Scalability: portable devices (Android), laptops, servers, clusters, and supercomputers.
    - E.g. UCLA Hoffman2 cluster runs on Linux; most machines in cloud (AWS, Azure, GCP) run on Linux.

- Cost: it's free!

## [Distributions of Linux](http://upload.wikimedia.org/wikipedia/commons/1/1b/Linux_Distribution_Timeline.svg)

- Debian/Ubuntu is a popular choice for personal computers.

- RHEL/CentOS is popular on servers.

- The teaching server for this class runs CentOS 7.

- MacOS is derived from Unix (its Darwin core builds on BSD), not from Linux. It is POSIX compliant. Most shell commands we review here apply to MacOS terminal as well. Windows/DOS, unfortunately, is a totally different breed.

- Show distribution/version on Linux:
```{bash}
cat /etc/*-release
```

----

- Show distribution/version on MacOS:
```{bash, eval=FALSE}
sw_vers -productVersion
```
or
```{bash, eval=FALSE}
system_profiler SPSoftwareDataType
```

# Linux shells

## Shells

- A shell translates commands to OS instructions.

- Most commonly used shells include `bash`, `csh`, `tcsh`, `zsh`, etc.

- The default shell in MacOS changed from `bash` to `zsh` in MacOS v10.15.

- Sometimes a command or a script does not run simply because it's written for another shell.

- We mostly use `bash` shell commands in this class.
- Determine your login shell (the shell currently running can differ; `echo $0` shows it):
```{bash}
echo $SHELL
```

- List available shells:
```{bash}
cat /etc/shells
```

- Change to another shell:
```{bash, eval=FALSE}
exec bash -l
```
The `-l` option indicates it should be a login shell.

- Change your login shell permanently:
```{bash, eval=FALSE}
chsh -s /bin/bash userid
```
Then log out and log in.

## Command history and bash completion

We can navigate to previous/next commands with the up and down arrow keys. (The `pushd` and `popd` commands maintain a stack of directories, not of commands.)

Bash provides the following standard completions by default. Far fewer typing errors and much less time!

- Pathname completion.

- Filename completion.

- Variablename completion: `echo $[TAB][TAB]`.

- Username completion: `cd ~[TAB][TAB]`.

- Hostname completion: `ssh hwachou@[TAB][TAB]`.

- It can also be customized to auto-complete other things such as options and a command's arguments. Google `bash completion` for more information.

## `man` is man's best friend

Online help for shell commands: `man commandname`.

```{bash}
# display documentation for the ls command
man ls
```

# Navigate file system

## Linux directory structure
- Upon login, the user is in his/her home directory.

- `tree` command (if installed) displays directory structure. `tree -L levels` displays the tree `levels` directories deep.
```{bash}
# display only directories in levels 1, 2 from root directory
tree -d -L 2 /
```

## Move around the file system

- `pwd` prints absolute path to the current working directory:
```{bash}
pwd
```

- `ls` lists contents of a directory:
```{bash}
ls
```

- `ls -l` lists detailed contents of a directory:
```{bash}
ls -l
```

- `ls -al` lists all contents of a directory, including those that start with `.` (hidden files and folders):
```{bash, small=TRUE}
ls -al
```

- `..` denotes the parent of current working directory.

- `.` denotes the current working directory.

- `~` denotes user's home directory.

- `/` denotes the root directory.

- `cd ..` changes to parent directory.

- `cd` or `cd ~` changes to home directory.

- `cd /` changes to root directory.

## File permissions
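
A permission string such as `rwxr-x--x` consists of three triplets (owner, group, others), and each octal digit in a `chmod` mode is the sum of read (4), write (2), and execute (1). The sketch below illustrates that mapping with a hypothetical helper, `octal_to_symbolic`, which is not a standard command; it assumes a plain three-digit mode and ignores special bits such as setuid.

```bash
# Hypothetical helper (not a standard tool): translate an octal mode
# such as 751 into its symbolic form, one digit at a time.
octal_to_symbolic() {
  mode=$1
  out=""
  while [ -n "$mode" ]; do
    rest=${mode#?}             # everything after the first character
    digit=${mode%"$rest"}      # the first character itself
    mode=$rest
    # Each digit is the sum of read (4), write (2), and execute (1).
    case $digit in
      0) out="${out}---" ;;
      1) out="${out}--x" ;;
      2) out="${out}-w-" ;;
      3) out="${out}-wx" ;;
      4) out="${out}r--" ;;
      5) out="${out}r-x" ;;
      6) out="${out}rw-" ;;
      7) out="${out}rwx" ;;
      *) echo "not an octal digit: $digit" >&2; return 1 ;;
    esac
  done
  printf '%s\n' "$out"
}

octal_to_symbolic 751   # prints rwxr-x--x
octal_to_symbolic 644   # prints rw-r--r--
```

Reading the result left to right gives the permissions of the owner, the group, and everyone else, in that order.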
----

- `chmod g+x file` makes a file executable to group members.

- `chmod 751 file` sets permission `rwxr-x--x` to a file.

- `groups userid` shows which group(s) a user belongs to:
```{bash}
groups huazhou
```

## Manipulate files and directories

- `cp` copies file to a new location.

- `mv` moves file to a new location.

- `touch` creates an empty file; if the file already exists, its contents are left unchanged (only its timestamps are updated).

- `rm` deletes a file.

- `mkdir` creates a new directory.

- `rmdir` deletes an _empty_ directory.

- `rm -rf` deletes a directory and all contents in that directory (be cautious using the `-f` option ...).

## Find files

- `locate` locates a file by name (needs the `mlocate` program installed):
```{bash}
locate linux.Rmd
```

- `find` is similar to `locate` but has more functionality (e.g., selecting files by age, size, or permissions) and is ubiquitous.
```{bash}
find linux.Rmd
```
```{bash}
find /home/huazhou -name linux.Rmd
```

- `which` locates a program (executable file):
```{bash}
which R
```

## Wildcard characters

| Wildcard   | Matches                                                        |
|------------|----------------------------------------------------------------|
| `?`        | any single character                                           |
| `*`        | any character 0 or more times                                  |
| `+`        | one or more of the preceding pattern (regular expressions)     |
| `^`        | beginning of the line (regular expressions)                    |
| `$`        | end of the line (regular expressions)                          |
| `[set]`    | any character in set                                           |
| `[!set]`   | any character not in set                                       |
| `[a-z]`    | any lowercase letter                                           |
| `[0-9]`    | any number (same as `[0123456789]`)                            |

```{bash}
# all png files in current folder
ls -l *.png
```

## Regular expression

- Wildcards are a simple form of pattern matching; _regular expressions_ are a more general and more powerful pattern language.

- Regular expressions are a powerful tool to efficiently sift through large amounts of text: record linking, data cleaning, scraping data from websites or other data feeds.

- Google `regular expressions` to learn.
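
The difference between shell wildcards and the regular expressions used by tools such as `grep` can be seen in a scratch directory. The following is a minimal sketch, assuming `mktemp` and `grep` are available; the file names are made up for illustration.

```bash
# Scratch directory (created with mktemp) so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.png b.png ab.png notes.txt xpng

# Wildcards are expanded by the shell before the command runs.
set -- ?.png       # `?` matches exactly one character: a.png and b.png
glob_single=$#
set -- *.png       # `*` matches any run of characters: all three .png files
glob_any=$#

# Regular expressions are interpreted by the program (grep here), not the shell.
# An unescaped `.` matches ANY character, so the loose pattern also hits xpng.
regex_loose=$(ls | grep -c '.png$')
regex_strict=$(ls | grep -c '\.png$')   # escaped dot: literal ".png" only

printf 'wildcard ?.png: %s files, wildcard *.png: %s files\n' "$glob_single" "$glob_any"
printf 'loose regex: %s names, escaped-dot regex: %s names\n' "$regex_loose" "$regex_strict"

# Clean up the scratch directory.
cd - > /dev/null && rm -r "$workdir"
```

The counts differ because `.` means "a literal dot" to the shell but "any character" to `grep`, which is why regular expressions usually go inside single quotes with special characters escaped.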
# Work with text files

## View/peek text files

- `cat` prints the contents of a file:
```{bash, size='smallsize'}
cat runSim.R
```

----

- `head` prints the first 10 lines of a file:
```{bash}
head runSim.R
```

- `head -l` prints the first $l$ lines of a file:
```{bash}
head -15 runSim.R
```

- `tail` prints the last 10 lines of a file:
```{bash}
tail runSim.R
```

- `tail -l` prints the last $l$ lines of a file:
```{bash}
tail -15 runSim.R
```

----

- Questions:
    - How to see the 11th line of the file and nothing else?
    - What about the 11th to the last line?

## Piping and redirection

- `|` sends output from one command as input of another command.

- `>` directs output from one command to a file.

- `>>` appends output from one command to a file.

- `<` reads input from a file.

- Combinations of shell commands (`grep`, `sed`, `awk`, ...), piping and redirection, and regular expressions allow us to pre-process and reformat huge text files efficiently.

- See HW1.

## `less` is more; `more` is less

- `more` browses a text file screen by screen (only downwards). Scroll down one page (paging) by pressing the spacebar; exit by pressing the `q` key.

- `less` is also a pager, but has more functionality, e.g., scrolling upwards and downwards through the input.

- `less` doesn't need to read the whole file, i.e., it loads files faster than `more`.

## `grep`

`grep` prints lines that match an expression:

- Show lines that contain string `CentOS`:
```{bash}
# quotes not necessary if not a regular expression
grep 'CentOS' linux.Rmd
```

- Search multiple text files:
```{bash}
grep 'CentOS' *.Rmd
```

- Show matching line numbers:
```{bash}
grep -n 'CentOS' linux.Rmd
```

- Find all files in current directory with `.png` extension:
```{bash}
# the dot is escaped so that it matches a literal "." rather than any character
ls | grep '\.png$'
```

- Find all directories in the current directory:
```{bash}
ls -al | grep '^d'
```

## `sed`

- `sed` is a stream editor.
- Replace `CentOS` by `RHEL` in a text file:
```{bash}
sed 's/CentOS/RHEL/' linux.Rmd | grep RHEL
```

## `awk`

- `awk` is a filter and report writer.

- First let's display first lines of the file `/etc/passwd`:
```{bash}
head /etc/passwd
```
Each line contains fields (1) user name, (2) password, (3) user ID, (4) group ID, (5) user ID info, (6) home directory, and (7) command shell, separated by `:`.

- Print sorted list of login names:
```{bash}
awk -F: '{ print $1 }' /etc/passwd | sort | head -10
```

- Print number of lines in a file, as `NR` stands for Number of Records:
```{bash}
awk 'END { print NR }' /etc/passwd
```
or
```{bash}
wc -l /etc/passwd
```
or (not displaying file name)
```{bash}
wc -l < /etc/passwd
```

- Print the entries of users with UID in range `1000-1035`:
```{bash}
awk -F: '{if ($3 >= 1000 && $3 <= 1035) print}' /etc/passwd
```

- Print login names and log-in shells in comma-separated format:
```{bash}
awk -F: '{OFS = ","} {print $1, $7}' /etc/passwd
```

- Print login names and indicate those with UID >= 1000 as `vip`:
```{bash}
awk -F: -v status="" '{OFS = ","} {if ($3 >= 1000) status="vip"; else status="regular"} {print $1, status}' /etc/passwd
```

## Text editors
Source: [Editor War](http://en.wikipedia.org/wiki/Editor_war) on Wikipedia.

## Emacs

- `Emacs` is a powerful text editor with extensive support for many languages including `R`, $\LaTeX$, `python`, and `C/C++`; however it's _not_ installed by default on many Linux distributions.

- Basic survival commands:
    - `emacs filename` to open a file with emacs.
    - `CTRL-x CTRL-f` to open an existing or new file.
    - `CTRL-x CTRL-s` to save.
    - `CTRL-x CTRL-w` to save as.
    - `CTRL-x CTRL-c` to quit.

- Google `emacs cheatsheet`.
`C-
## IDE (Integrated Development Environment)

- Statisticians write a lot of code. It is critical to adopt a good IDE that goes beyond code editing: syntax highlighting, executing code within editor, debugging, profiling, version control, etc.

- RStudio, Eclipse, Emacs, Matlab, Visual Studio, etc.

# Processes

## Cancel a non-responding program

- Press `Ctrl+C` to cancel a non-responding or long-running program.

## Processes

- OS runs processes on behalf of user.

- Each process has Process ID (PID), Username (UID), Parent process ID (PPID), Time and date process started (STIME), time running (TIME), etc.
```{bash}
ps
```

- All current running processes:
```{bash}
ps -eaf
```

- All Python processes:
```{bash}
ps -eaf | grep python
```

- Process with PID=1:
```{bash}
ps -fp 1
```

- All processes owned by a user:
```{bash}
ps -fu huazhou
```

## Kill processes

- Kill process with PID=1001:
```{bash, eval=FALSE}
kill 1001
```

- Kill all R processes.
```{bash, eval=FALSE}
killall -r R
```

## `top`

- `top` prints realtime process information (very useful).
```{bash, eval=FALSE}
top
```
- Exit the `top` program by pressing the `q` key.

# Secure shell (SSH)

## SSH

SSH (secure shell) is the dominant cryptographic network protocol for secure network connection via an insecure network.

- On Linux or Mac terminal, access the teaching server by
```{bash, eval=FALSE}
ssh username@server.ucla-biostat-203b.com
```

- For Windows users, there are 3 ways: (1) (highly recommended) [Git Bash](git bash commands), (2) (not recommended) [PuTTY](http://www.putty.org) program (free), or (3) (may be an overkill for this class) use WSL for Windows to install a full-fledged Linux system within Windows.

## Use keys over password

- Key authentication is more secure than password. Most passwords are weak.

- A script or a program may need to systematically SSH into other machines.

- Log into multiple machines using the same key.

- Seamless use of many services: Git, AWS or Google cloud service, parallel computing on multiple hosts, Travis CI (continuous integration) etc.

- Many servers only allow key authentication and do not accept password authentication.

## Key authentication
----

- _Public key_. Put on the machine(s) you want to log in.

- _Private key_. Put on your own computer. Consider this as the actual key in your pocket; never give to others.

- Messages from server to your computer are encrypted with your public key. They can only be decrypted using your private key.

- Messages from your computer to server are signed with your private key (digital signatures) and can be verified by anyone who has your public key (authentication).

## Steps to generate keys

- On Linux, Mac, or Windows Git Bash, to generate a key pair:
```{bash, eval=FALSE}
ssh-keygen -t rsa -f ~/.ssh/[KEY_FILENAME] -C [USERNAME]
```
    - `[KEY_FILENAME]` is the name that you want to use for your SSH key files. For example, a filename of `id_rsa` generates a private key file named `id_rsa` and a public key file named `id_rsa.pub`.
    - `[USERNAME]` is the user for whom you will apply this SSH key.
    - Use an (optional) passphrase different from your password.

- Set correct permissions on the `.ssh` folder and key files
```{bash, eval=FALSE}
# only the owner may enter the .ssh folder
chmod 700 ~/.ssh
# only the owner may read the private key
chmod 400 ~/.ssh/[KEY_FILENAME]
```

----

- Append the public key to the `~/.ssh/authorized_keys` file of any Linux machine we want to SSH to, e.g.,
```{bash, eval=FALSE}
ssh-copy-id -i ~/.ssh/[KEY_FILENAME] [USERNAME]@server.ucla-biostat-203b.com
```

- Test your new key.
```{bash, eval=FALSE}
ssh -i ~/.ssh/[KEY_FILENAME] [USERNAME]@server.ucla-biostat-203b.com
```

- Now you don't need a password each time you connect from your machine to the teaching server.

----

- If you set a passphrase when generating keys, you'll be prompted for the passphrase each time the private key is used. Avoid repeatedly entering the passphrase by using `ssh-agent` on Linux/Mac or Pageant on Windows.

- The same key pair can be used between any two machines. We don't need to regenerate keys for each new connection.

## Transfer files between machines

- `scp` securely transfers files between machines using SSH.
```{bash, eval=FALSE}
## copy file from local to remote
scp [LOCALFILE] [USERNAME]@server.ucla-biostat-203b.com:/[PATH_TO_FOLDER]
```
```{bash, eval=FALSE}
## copy file from remote to local
scp [USERNAME]@server.ucla-biostat-203b.com:/[PATH_TO_FILE] [PATH_TO_LOCAL_FOLDER]
```

- `sftp` is an FTP-like file transfer program that runs over SSH.

- `Globus` is a GUI program for securely transferring files between machines. To use Globus you will have to go to