{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Git/GitHub + `bedtools` +TSCC + YOU!\n", "\n", "To bring it all together, you will combine everything you've learned so far about UNIX, bash shell, TSCC, git, and github to collaboratively write a submitter script to TSCC.\n", "\n", "1. Identify your randomly assigned partner that was emailed out.\n", "2. Pick one person to be reponsible for exercises 1-3 from `6_tf_binding_promoters.ipynb` (person1) and the the other to be responsible for exercises 4-6 (person2)\n", "3. Have person1 create a GitHub repo called `biom262-hw1` and write a short `README.md` file\n", "4. Have person2 add a LICENSE file copied from the \"Text of the UC Copyright Notice\" section [here](https://confluence.crbs.ucsd.edu/display/CRBS/Releasing+Open+Source+Software+at+UCSD)\n", "5. Collaboratively work on a submitter script called `tf_binding.sh` to TSCC which has all the `##PBS` flags shown in the TSCC Quick Start Guide below, and contribute the lines of code that you are responsible for. Use `git blame tf_binding.sh` to make sure person1 is to \"blame\" for exercises 1-3 and person2 is to \"blame\" for exercises 4-5. The \"blame\" of the remaining file doesn't matter.\n", "6. Have person1 add the line: `echo \"Hello I am a message in standard out (stdout)\"` and have person2 add the line `echo \"Hello I am a message in standard error (stderr) >&2\"` (the `>&2` outputs to \"secondary\" aka \"error\" output)\n", "7. Have both people `add`, `commit`, and `push` their changes to the server. Are there merge conflicts? How do you solve them?\n", "7. Submit your script to TSCC!\n", "8. Wait for it to run.\n", "9. Check the output. Is it correct? Were the correct files added?\n", "9. Add the resulting `.o#####` and `.e#####` (or if you were fancy and redirected your `stdout` and `stderr` to something else then include those files)\n", "10. Your final repository should have at least five files:\n", "\n", "```\n", "LICENSE\n", "README.txt\n", "tf_binding.sh\n", "tf_binding.sh.o######\n", "tf_binding.sh.e######\n", "```\n", "\n", "Run `git blame` on `tf_binding.sh`. You should get output that looks like this, which has commit ID in the first column, the name of the user, the timestamp, and the line number for each line. 
"\n",
"Run `git blame` on `tf_binding.sh`. You should get output that looks like this, with the commit ID in the first column, followed by the name of the user, the timestamp, and the line number for each line. Make sure different people wrote their exercise sections, and the `echo` lines too.\n",
"\n",
"```\n",
"53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 1) #!/bin/bash\n",
"53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 2) #PBS -q hotel\n",
"34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 3) #PBS -l nodes=2:ppn=2\n",
"34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 4) #PBS -l walltime=00:20:00\n",
"53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 5) #PBS -N tf_binding.sh\n",
"53317857 (Jaclyn Einstein 2016-01-19 15:26:59 -0800 6) \n",
"34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 7) cd ~/code/biom262-hw1/data\n",
"34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 8) module load biotools\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 9) \n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 10) #Exercise 1\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 11) # Filter the tf.bed file for only the NFKB\\n\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 12) awk '{if($4==\"NFKB\") print}' tf.bed > tf.nfkb.bed\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 13) \n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 14) #Exercise 2\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 15) # Filter only the rows of the gtf file that contain the features of type \"transcript\"\\n\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 16) awk '{if($3==\"transcript\") print}' gencode.v19.annotation.chr22.gtf > gencode.v19.annotation.chr22.transcript.gtf\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 17) \n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 18) #Exercise 3\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 19) # Use bedtools to find promoters (2000 bases upstream of gene)\\n\n",
"b63c5509 (Jaclyn Einstein 2016-01-18 16:37:52 -0800 20) bedtools flank -i gencode.v19.annotation.chr22.transcript.gtf -g hg19.genome -l 2000 -r 0 -s > gencode.v19.annotation.chr22.transcript.promoter.gtf\n",
"0c09c96b (Jaclyn Einstein 2016-01-18 17:47:54 -0800 21) \n",
"88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 22) #Exercise 4\n",
"88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 23) bedtools intersect -a gencode.v19.annotation.chr22.transcript.promoter.gtf -b tf.nfkb.bed >gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf\n",
"88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 24) \n",
"88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 25) #Exercise 5\n",
"88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 26) bedtools getfasta -fi GRCh37.p13.chr22.fa -bed gencode.v19.annotation.chr22.transcript.promoter.nfkb.gtf -s -fo gencode.v19.annotation.chr22.transcript.promoter.nfkb.fasta\n",
"88ed4261 (Ben Rubin 2016-01-19 18:15:18 -0800 27) \n",
"34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 28) echo \"Hello I am a message in standard out (stdout)\"\n",
"34b09512 (Jaclyn Einstein 2016-01-20 18:01:55 -0800 29) echo \"Hello I am a message in standard error (stderr)\" >&2\n",
"```\n",
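"\n",
"As an optional tip, `git blame` can also restrict itself to a range of lines with `-L`, which is handy for checking who is to \"blame\" for just one exercise. Using the line numbers from the example output above:\n",
"\n",
"```\n",
"git blame -L 22,26 tf_binding.sh\n",
"```\n",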
"\n",
"***Note: You will need to add your partner as a collaborator to your repository so they can push to your repo.***\n",
"\n",
"Feel free to include the `hg19.genome` file or any other `bed` or `gtf` files that make your life easier.\n",
"\n",
"Resources for TSCC:\n",
"* [TSCC Quick Start guide](http://www.sdsc.edu/support/user_guides/tscc-quick-start.html)\n",
"* [TSCC User Guide](http://www.sdsc.edu/support/user_guides/tscc.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## What is this `>&2` garbage and what does it have to do with `stdout` vs `stderr`?\n",
"\n",
"To get the most out of this, follow along with the commands and type them out on the terminal.\n",
"\n",
"So far, you have become quite familiar with `stdout` and `stderr` without even knowing it! Whenever you run a command that prints something to the terminal, that output is `stdout`. For example, let's take `ls`. Change directories to your home directory, then type `ls`.\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ cd\n",
"[ucsd-train01@tscc-login2 ~]$ ls\n",
"Anaconda3-2.4.1-Linux-x86_64.sh bin code notebooks test_script.sh test_script.sh.e3962194 test_script.sh.o3962194\n",
"```\n",
"\n",
"You can write this output to a file using `>`, which redirects the `stdout` to a file:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ ls > ls.txt\n",
"```\n",
"\n",
"Notice that there was no output! Where did it go? Well, let's look at `ls.txt`:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ cat ls.txt \n",
"Anaconda3-2.4.1-Linux-x86_64.sh\n",
"bin\n",
"code\n",
"ls.txt\n",
"notebooks\n",
"test_script.sh\n",
"test_script.sh.e3962194\n",
"test_script.sh.o3962194\n",
"```\n",
"\n",
"This put the output of `ls` into a file! It also put each entry on its own line instead of separating them with spaces. This is a special feature of `ls`: when it detects that its output is going to a file (one you'll probably want to `grep`), it puts each entry on its own line to make your life easier.\n",
"\n",
"### So why do we need `stderr`?\n",
"\n",
"To learn about `stderr`, let's first use a command that we *know* will fail. Let's try to `ls` a directory that doesn't exist.\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder\n",
"ls: cannot access nonexistent_folder: No such file or directory\n",
"```\n",
"\n",
"Now let's save the `stdout` to a file.\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder > ls_nonexistent_folder.txt\n",
"ls: cannot access nonexistent_folder: No such file or directory\n",
"```\n",
"\n",
"Wait, that still put stuff out on the terminal? Didn't we save this output? Let's take a look.\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ cat ls_nonexistent_folder.txt \n",
"```\n",
"\n",
"Hmm. This is empty. What's going on?\n",
"\n",
"### `stdout`: The other white meat\n",
"\n",
"When you use `>`, the technical jargon for what you are doing is \"redirecting standard output to a file.\" Specifically, `stdout` is considered the \"first\" output of a program and is given the number \"`1`\". So you secretly specified the \"`1`\" already! Try the `ls` command again, but use `1>` to save the `stdout`. You should get the same output as if you did it without the `1` (do it to convince yourself).\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ ls 1> ls1.txt\n",
"[ucsd-train01@tscc-login2 ~]$ cat ls1.txt\n",
"Anaconda3-2.4.1-Linux-x86_64.sh\n",
"bin\n",
"code\n",
"ls1.txt\n",
"ls_nonexistent_folder.txt\n",
"ls.txt\n",
"notebooks\n",
"test_script.sh\n",
"test_script.sh.e3962194\n",
"test_script.sh.o3962194\n",
"```\n",
"\n",
"Now, `stderr` is officially the \"second\" output and is given the number \"`2`\". Let's try saving the `stderr` from our failing `ls` command from before:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ ls nonexistent_folder 2> ls_nonexistent_folder.txt\n",
"```\n",
"\n",
"Notice that now there was no output! Let's check out the file we created.\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ cat ls_nonexistent_folder.txt \n",
"ls: cannot access nonexistent_folder: No such file or directory\n",
"```\n",
"\n",
"Ah-ha! This is ***exactly*** the output we *would* have seen had we not \"redirected the stderr\" with `2>`.\n",
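"\n",
"Since `stdout` is stream `1` and `stderr` is stream `2`, you can also capture both at once. Here is a small sketch of two patterns you will see a lot (the file names are just examples -- try them yourself):\n",
"\n",
"```\n",
"# stdout and stderr go to two different files\n",
"ls nonexistent_folder > out.txt 2> err.txt\n",
"\n",
"# both streams go to the same file: \"send 2 to wherever 1 is going\"\n",
"ls nonexistent_folder > both.txt 2>&1\n",
"```\n",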
"\n",
"### Why are we doing #6?\n",
"\n",
"You may be wondering what the point of this step is:\n",
"\n",
"> 6. Have person1 add the line `echo \"Hello I am a message in standard out (stdout)\"` and have person2 add the line `echo \"Hello I am a message in standard error (stderr)\" >&2` (the `>&2` redirects the output to the \"second\" stream, aka the \"error\" output).\n",
"\n",
"The idea is that you'll end up with different output in the files specified by the \"`-o`\" and \"`-e`\" flags of the TSCC PBS submitter script below. That literally means for one person to add the stdout line and the other person to add the stderr line (see the example script at the [very bottom](#Another-example)).\n",
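"\n",
"To make that concrete, here is a rough picture of what you might see after your job finishes, assuming the default `.o######`/`.e######` output names and that nothing else in your script prints to the screen -- the two `echo` lines should land in different files:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ cat tf_binding.sh.o*\n",
"Hello I am a message in standard out (stdout)\n",
"[ucsd-train01@tscc-login2 ~]$ cat tf_binding.sh.e*\n",
"Hello I am a message in standard error (stderr)\n",
"```\n",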
"\n",
"Now that you've practiced a bit with `stderr`, try doing this on the terminal:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ echo \"Hello I am a message in standard error (stderr)\" >&2 | cat > asdf\n",
"```\n",
"\n",
"This should output:\n",
"\n",
"```\n",
"Hello I am a message in standard error (stderr)\n",
"```\n",
"\n",
"But what about this?\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ echo \"Hello I am a message in standard output (stdout)\" | cat > asdf\n",
"```\n",
"\n",
"Why does the above command have no output?\n",
"\n",
"### Exercise: What is in `asdf`?\n",
"\n",
"What are the contents of `asdf`? Why?\n",
"\n",
"### Additional reading, if you feel like it\n",
"\n",
"Read more about standard streams on [Wikipedia](https://en.wikipedia.org/wiki/Standard_streams), or on [this](http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO-3.html) awesome website, which has a really great bash overview/tutorial that I still reference to this day. Plus, [this](http://www.jstorimer.com/blogs/workingwithcode/7766119-when-to-use-stderr-instead-of-stdout) long article explains when to use `stdout` vs `stderr` very well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"## Example scripts\n",
"\n",
"What is PBS? PBS stands for \"Public Broadcasting Service.\" Just kidding, it stands for [Portable Batch System](https://en.wikipedia.org/wiki/Portable_Batch_System). It is one of many scheduling and resource management systems out there (resources = the compute nodes attached to TSCC, plus their processors and memory).\n",
"\n",
"### Simple submitter script example - one processor, one node\n",
"\n",
"Below is an example submitter script in a file called `basewise_conservation.sh`; after it, I'll explain line-by-line what everything is doing.\n",
"\n",
"You submit the script below with:\n",
"\n",
"    qsub basewise_conservation.sh" ] }, { "cell_type": "raw", "metadata": {}, "source": [
"#!/bin/bash\n",
"#PBS -N basewise_conservation\n",
"#PBS -o basewise_conservation.sh.out\n",
"#PBS -e basewise_conservation.sh.err\n",
"#PBS -V\n",
"#PBS -l walltime=24:00:00\n",
"#PBS -l nodes=1:ppn=1\n",
"#PBS -A yeo-group\n",
"#PBS -q home\n",
"\n",
"# Go to the directory from which the script was called\n",
"cd $PBS_O_WORKDIR\n",
"python /home/obotvinnik/ipython_notebook/singlecell/manuscript/0._Data_collection/basewise_conseration.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [
"#### Line 1\n",
"\n",
"    #!/bin/bash\n",
"\n",
"This line specifies the use of [`bash`](https://en.wikipedia.org/wiki/Bash_(Unix_shell)), aka the \"Bourne-Again shell\", vs. the [Bourne (`sh`) shell](https://en.wikipedia.org/wiki/Bourne_shell) or [C-shell (`csh`)](https://en.wikipedia.org/wiki/C_shell). The differences between the shells are rather esoteric and can get quite religious, so I won't go into them. So far, we've been using `bash`, so stick to that.\n",
"\n",
"\n",
"#### Line 2\n",
"\n",
"    #PBS -N basewise_conservation\n",
"\n",
"This \"`-N`\" flag is the name of the job you'll see when you do `qstat`. I recommend doing `qstat -u $USER` because then you'll see only YOUR jobs. In fact, I recommend adding this alias to your `~/.bashrc`:\n",
"\n",
"    alias qme=\"qstat -u $USER\"\n",
" \n",
"The double quotes are important because they tell the computer to \"evaluate\" the dollar-sign variables inside. Check out the difference between double quotes here:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ echo \"Who am I? I am $USER\"\n",
"Who am I? I am ucsd-train01\n",
"```\n",
"\n",
"And single quotes here:\n",
"\n",
"```\n",
"[ucsd-train01@tscc-login2 ~]$ echo 'Who am I? I am $USER' \n",
"Who am I? I am $USER\n",
"```\n",
"\n",
"#### Line 3\n",
"\n",
"    #PBS -o basewise_conservation.sh.out\n",
" \n",
"This \"`-o`\" flag tells the PBS compute cluster where to save the output from `stdout`. The path is relative to the directory the script was submitted from, i.e. if you change directories in the script itself to `biom262/weeks/week01/data`, but you\n",
"submitted the script from your home directory `~`, then this file will be in\n",
"`~/basewise_conservation.sh.out`, not `biom262/weeks/week01/data`.\n",
"\n",
"\n",
"#### Line 4\n",
"\n",
"    #PBS -e basewise_conservation.sh.err\n",
"\n",
"This \"`-e`\" flag tells the supercomputing cluster where to save the output from `stderr`. The same folder location conventions apply here as they do for the output in line 3 above.\n",
"\n",
"#### Line 5\n",
"\n",
"    #PBS -V\n",
" \n",
"This \"`-V`\" flag ensures that you have access to all the programs in your `PATH` that you did before you jumped over to the TSCC compute node: it passes your current environment variables (including your `PATH`) along to the job, much like Anaconda prepends itself to your `PATH` to make sure you get the Anaconda Python and not any other Python.\n",
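"\n",
"If you ever want to convince yourself that `-V` worked, a quick (optional) sanity check you could drop into any submitter script is to print the environment the job actually sees:\n",
"\n",
"```\n",
"# Show the PATH the job inherited and which python it would run\n",
"echo $PATH\n",
"which python\n",
"```\n",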
"\n",
"#### Line 6\n",
"\n",
"    #PBS -l walltime=24:00:00\n",
"\n",
"This is the first argument you've seen with the \"`-l`\" (\"dash ell\") flag, which specifies the job's resource list, i.e. properties of the job that describe how many resources it needs (time and computers). This one specifies the maximum amount of wall-clock time the script can use -- here, 24 hours. Your script will probably only need 10 minutes, which you can specify with `walltime=00:10:00`.\n",
"\n",
"#### Line 7\n",
"\n",
"    #PBS -l nodes=1:ppn=1\n",
" \n",
"This is the second line with the \"`-l`\" flag. This time, it's specifying the number of nodes (i.e. literal computers) and the number of processors to use on those nodes. For your code, you will only need one node and one processor.\n",
"\n",
"The maximum number of processors per node is `16`. In all of the submission scripts for this class, you'll need a **maximum** of one node. If you ever need more, **always** increase the processors first, then the nodes. The example from TSCC is really bad:\n",
"\n",
"    nodes=10:ppn=2\n",
"\n",
"Using the room/chair analogy where a node is a room and a processor is a chair, that's like asking for 10 rooms and only using two chairs in each of them! That's a strange request, so avoid lopsided requests where the number of nodes is far greater than the number of processors. The main thing to remember is that there's a maximum of 16 chairs per room, and to increase the number of chairs (processors) before the number of rooms (nodes).\n",
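"\n",
"To make the \"chairs before rooms\" advice concrete, a sensible progression (purely illustrative -- your homework only ever needs `nodes=1:ppn=1`) would look something like this:\n",
"\n",
"```\n",
"#PBS -l nodes=1:ppn=1     # this class: one chair in one room\n",
"#PBS -l nodes=1:ppn=8     # need more power? fill more chairs in the same room\n",
"#PBS -l nodes=1:ppn=16    # the room is now full (16 is the max ppn)\n",
"#PBS -l nodes=2:ppn=16    # only then ask for a second room\n",
"```\n",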
] }, { "cell_type": "raw", "metadata": {}, "source": [ "#!/bin/bash\n", "#PBS -N tf_binding\n", "#PBS -o tf_binding.sh.out\n", "#PBS -e tf_binding.sh.err\n", "#PBS -V\n", "#PBS -l walltime=00:10:00\n", "#PBS -l nodes=1:ppn=1\n", "#PBS -q hotel\n", "\n", "grep NFKB tf.bed > tf.nkfb.bed\n", "\n", "echo \"Hello I am a message in standard out (stdout)\"\n", "echo \"Hello I am a message in standard error (stderr)\" >&2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that you can submit this with:\n", "\n", " qsub tf_binding.sh\n", "\n", "OR, if you had a file called \"`grep_tf.sh`\" that only had the line \"`grep NFKB tf.bed > tf.nkfb.bed`\" in it, you could have written out all this stuff:\n", "\n", " qsub -N tf_binding -o tf_binding.sh.out -e tf_binding.sh.err -V -l walltime=00:10:00 -l nodes=1:ppn=1 -q hotel grep_tf.sh\n", "\n", "Which specifies everything we did in the script in a single line. I don't like this method because it's harder to read and is not as reproducible. So the `#PBS` syntax is a shortcut for writing all that other stuff on the command line, and it's easier to keep track of than the different commands you ran.\n", "\n", "PHEW! that was a lot!" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.5.1" } }, "nbformat": 4, "nbformat_minor": 0 }