--- title: "Biostat 203B Homework 1" subtitle: Due Jan 27 @ 11:59PM author: Your Name and UID format: html: theme: cosmo number-sections: true toc: true toc-depth: 4 toc-location: left code-fold: false knitr: opts_chunk: cache: false echo: true fig.align: 'center' fig.width: 6 fig.height: 4 message: FALSE --- Display machine information for reproducibility: ```{r} #| eval: false sessionInfo() ``` ## Q1. Git/GitHub **No handwritten homework reports are accepted for this course.** We work with Git and GitHub. Efficient and abundant use of Git, e.g., frequent and well-documented commits, is an important criterion for grading your homework. 1. Apply for the [Student Developer Pack](https://education.github.com/pack) at GitHub using your UCLA email. You'll get GitHub Pro account for free (unlimited public and private repositories). 2. Create a **private** repository `biostat-203b-2023-winter` and add `Hua-Zhou` and `tomokiokuno0528` as your collaborators with write permission. 3. Top directories of the repository should be `hw1`, `hw2`, ... Maintain two branches `master` and `develop`. The `develop` branch will be your main playground, the place where you develop solution (code) to homework problems and write up report. The `master` branch will be your presentation area. Submit your homework files (Quarto file `qmd`, `html` file converted by Quarto, all code and extra data sets to reproduce results) in `main` branch. 4. After each homework due date, course reader and instructor will check out your `master` branch for grading. Tag each of your homework submissions with tag names `hw1`, `hw2`, ... Tagging time will be used as your submission time. That means if you tag your `hw1` submission after deadline, penalty points will be deducted for late submission. 5. After this course, you can make this repository public and use it to demonstrate your skill sets on job market. ## Q2. Data ethics training This exercise (and later in this course) uses the [MIMIC-IV data](https://mimic-iv.mit.edu), a freely accessible critical care database developed by the MIT Lab for Computational Physiology. Follow the instructions at to (1) complete the CITI `Data or Specimens Only Research` course and (2) obtain the PhysioNet credential for using the MIMIC-IV data. Display the verification links to your completion report and completion certificate here. (Hint: The CITI training takes a couple hours and the PhysioNet credentialing takes a couple days; do not leave it to the last minute.) ## Q3. Linux Shell Commands 1. The `~/mimic` folder within the Docker container contains data sets from MIMIC-IV. Refer to the documentation for details of data files. ```{bash} #| eval: false ls -l ~/mimic ``` Please, do **not** put these data files into Git; they are big. Do **not** copy them into your directory. Do **not** decompress the gz data files. These create unnecessary big files on storage and are not big data friendly practices. Just read from the data folder `~/mimic` directly in following exercises. Use Bash commands to answer following questions. 2. Display the contents in the folders `core`, `hosp`, `icu`. Why are these data files distributed as `.csv.gz` files instead of `.csv` (comma separated values) files? Read the page to understand what's in each folder. 3. Briefly describe what bash commands `zcat`, `zless`, `zmore`, and `zgrep` do. 4. What's the output of the following bash script? ```{bash} #| eval: false for datafile in ~/mimic/core/*.gz do ls -l $datafile done ``` Display the number of lines in each data file using a similar loop. 5. Display the first few lines of `admissions.csv.gz`. How many rows are in this data file? How many unique patients (identified by `subject_id`) are in this data file? (Hint: combine Linux commands `zcat`, `head`/`tail`, `awk`, `sort`, `uniq`, `wc`, and so on.) 6. What are the possible values taken by each of the variable `admission_type`, `admission_location`, `insurance`, and `ethnicity`? Also report the count for each unique value of these variables. (Hint: combine Linux commands `zcat`, `head`/`tail`, `awk`, `uniq -c`, `wc`, and so on.) ## Q4. Who's popular in Price and Prejudice 1. You and your friend just have finished reading *Pride and Prejudice* by Jane Austen. Among the four main characters in the book, Elizabeth, Jane, Lydia, and Darcy, your friend thinks that Darcy was the most mentioned. You, however, are certain it was Elizabeth. Obtain the full text of the novel from and save to your local folder. ```{bash} #| eval: false wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt ``` Explain what `wget -nc` does. Do **not** put this text file `pg42671.txt` in Git. Complete the following loop to tabulate the number of times each of the four characters is mentioned using Linux commands. ```{bash} #| eval: false wget -nc http://www.gutenberg.org/cache/epub/42671/pg42671.txt for char in Elizabeth Jane Lydia Darcy do echo $char: # some bash commands here done ``` 2. What's the difference between the following two commands? ```{bash} #| eval: false echo 'hello, world' > test1.txt ``` and ```{bash} #| eval: false echo 'hello, world' >> test2.txt ``` 3. Using your favorite text editor (e.g., `vi`), type the following and save the file as `middle.sh`: ```{bash eval=FALSE} #!/bin/sh # Select lines from the middle of a file. # Usage: bash middle.sh filename end_line num_lines head -n "$2" "$1" | tail -n "$3" ``` Using `chmod` to make the file executable by the owner, and run ```{bash} #| eval: false ./middle.sh pg42671.txt 20 5 ``` Explain the output. Explain the meaning of `"$1"`, `"$2"`, and `"$3"` in this shell script. Why do we need the first line of the shell script? ## Q5. More fun with Linux Try following commands in Bash and interpret the results: `cal`, `cal 2021`, `cal 9 1752` (anything unusual?), `date`, `hostname`, `arch`, `uname -a`, `uptime`, `who am i`, `who`, `w`, `id`, `last | head`, `echo {con,pre}{sent,fer}{s,ed}`, `time sleep 5`, `history | tail`.