### Running in Docker container on Ostrich

#### Started Docker container with the following command:

```docker run -p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:/owl_web -v /Users/sam/gitrepos/LabDocs/jupyter_nbs/sam/:/jupyter_nbs -it f99537d7e06a```

The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files on Owl/home and Owl/web accessible to the Docker container.

Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:

```jupyter notebook```

This is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.

The Docker container is running on an image created from this [Dockerfile (Git commit 443bc42)](https://github.com/sr320/LabDocs/blob/443bc425cd36d23a07cf12625f38b7e3a397b9be/code/dockerfiles/Dockerfile.bio)

In [1]:
%%bash
date

Wed Dec 14 15:52:40 UTC 2016


### Check computer specs

In [2]:
%%bash
hostname

4bd1957ce190


In [3]:
%%bash
lscpu

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 8
Socket(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Model name: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
Stepping: 5
CPU MHz: 2260.998
BogoMIPS: 4521.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K


### Bloated notebook analysis

In [5]:
%%bash
ls -lh /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

-rw-r--r-- 1 srlab staff 104M Dec 8 12:09 /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb


#### That notebook is over 100MB in size, which is too large for hosting on GitHub. Additionally, the notebook crashes the browser (and sometimes the computer) due to the ridiculous number of output lines generated by the ```wget``` command. Let's look at some more details.
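One way to keep this kind of bloat out of a notebook is to send a command's verbose output to a log file and display only the last few lines. The sketch below is an illustration only: `run_quiet` and the log file are hypothetical names, and `seq` stands in for the noisy command.

```bash
# Hypothetical helper (an assumption for illustration):
# run a command, write all of its output to a log file,
# and print only the last 5 lines of that log.
run_quiet() {
    log="$1"
    shift
    "$@" > "$log" 2>&1
    status=$?
    tail -n 5 "$log"
    return "$status"
}

# Stand-in for a noisy command: seq prints 1000 lines,
# but only the final 5 are captured for display.
log_file="$(mktemp)"
shown="$(run_quiet "$log_file" seq 1 1000)"
echo "$shown"
```

For a download, `wget` also has its own `-o logfile` option to write messages to a file and `-nv` to reduce verbosity, either of which should keep the cell output small.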

#### Line count

In [6]:
%%bash
wc -l /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

1197134 /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb


#### In order to preserve some of the information in the original notebook before I strip the output, we'll look at the file in a bit more depth...

#### How long did the wget command for the Ostrea lurida files take?

#### First, let's find the line that has the output of the ```time``` command that I ran. The ```grep``` command includes the ```-n``` flag to identify line number(s) of search results.
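Since `grep -n` prints each match as `LINE:TEXT`, the line number can also be captured on its own for use in later commands. A minimal sketch on a throwaway file (the file contents and the `line_no` variable are made up for illustration):

```bash
# Build a small demo file with a "real" timing line on line 3.
demo_file="$(mktemp)"
printf 'alpha\nbeta\nreal\t1m2.000s\nuser\t0m0.100s\n' > "$demo_file"

# grep -n prefixes each match with "LINE:"; cut keeps only the number.
line_no="$(grep -n '^real' "$demo_file" | cut -d: -f1)"
echo "$line_no"

rm -f "$demo_file"
```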

In [7]:
%%bash
grep -n real /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

1197005: "real\t2529m32.643s\n",


#### Whoa! That's a LONG time! Let's try to pull the full time output.

#### Using ```head``` and ```tail``` to pull out a specific range of lines from the file. Making a rough guess...
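A range of lines can also be pulled out with a single `sed -n 'START,ENDp'` call, which avoids working out the `tail` count by hand. A minimal sketch comparing the two approaches on a throwaway file (the file and variable names are made up for illustration):

```bash
# Demo file with 100 numbered lines.
demo_file="$(mktemp)"
seq 1 100 > "$demo_file"

# head/tail approach: take the first 60 lines, then the last 20 of those (lines 41-60).
via_head_tail="$(head -n 60 "$demo_file" | tail -n 20)"

# sed approach: print only lines 41 through 60.
via_sed="$(sed -n '41,60p' "$demo_file")"

# Show the first and last line of the extracted range.
printf '%s\n' "$via_sed" | head -n 1
printf '%s\n' "$via_sed" | tail -n 1

rm -f "$demo_file"
```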

In [9]:
%%bash
head -1197020 /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb | tail -20

 "FINISHED --2016-12-08 12:07:18--\n",
 "Total wall clock time: 1d 18h 9m 33s\n",
 "Downloaded: 25 files, 55G in 1d 17h 53m 46s (379 KB/s)\n",
 "\n",
 "real\t2529m32.643s\n",
 "user\t0m11.190s\n",
 "sys\t40m9.630s\n"
 ]
 }
 ],
 "source": [
 "%%bash\n",
 "time wget -m ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Ostrea_lurida/"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "#### View directory structure of downloaded files\n",


#### So, to download all of the Ostrea lurida files, it took a little over 42hrs (2529 minutes)!

#### Let's see what the time frame was on the Panopea generosa files...
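As a check on that figure, the `real` value reported by `time` can be converted from minutes and seconds to hours. A minimal sketch using shell parameter expansion and `awk`; the variable names are my own for illustration:

```bash
# The "real" value from the time output above.
real_time="2529m32.643s"

# Split "2529m32.643s" into its minutes and seconds parts.
minutes="${real_time%%m*}"   # everything before the first "m"
seconds="${real_time#*m}"    # everything after the first "m"
seconds="${seconds%s}"       # drop the trailing "s"

# Convert to hours, rounded to one decimal place.
hours="$(awk -v m="$minutes" -v s="$seconds" 'BEGIN { printf "%.1f", (m + s / 60) / 60 }')"
echo "$hours hours"
```

This works out to roughly 42.2 hours, which agrees with the `Total wall clock time: 1d 18h 9m 33s` line reported by `wget`.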

In [10]:
%%bash
grep -n wget /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

115: "### Download all Ostrea lurida files from BGI using ```wget```"
142: "time wget -m ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Ostrea_lurida/ \\\n",
197: "#### Not going to waste time figuring out why the ```-P``` argument didn't work for ```wget```, so just changing to desired directory and running ```wget``` command again..."
1197013: "time wget -m ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Ostrea_lurida/"
1197049: "### Download all Panopea gererosa files from BGI using ```wget```"
1197080: "time wget -m ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Panopea_generosa"


#### Since the total number of lines in the file is 1197134, I'll just use the ```tail``` command to look at the last 100 lines (because the ```wget``` command for the Panopea generosa files is at line 1197080).
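When the starting line number is already known, `tail -n +N` is an alternative that prints from line N through the end of the file, so there's no need to subtract from the total line count. A minimal sketch on a throwaway file (the file and variable names are made up for illustration):

```bash
# Demo file with 100 numbered lines.
demo_file="$(mktemp)"
seq 1 100 > "$demo_file"

# Print everything from line 95 through the end of the file.
from_line="$(tail -n +95 "$demo_file")"
echo "$from_line"

rm -f "$demo_file"
```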

In [11]:
%%bash
tail -100 /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

 "text": [
 "bash: line 1: tree: command not found\n"
 ]
 }
 ],
 "source": [
 "%%bash\n",
 "tree /owl_web/O_lurida_genome_assemblies_BGI/20161201/"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### Download all Panopea gererosa files from BGI using ```wget```"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
 "collapsed": false
 },
 "outputs": [
 {
 "name": "stdout",
 "output_type": "stream",
 "text": [
 "/owl_web/P_generosa_genome_assemblies_BGI/20161201\n"
 ]
 }
 ],
 "source": [
 "cd /owl_web/P_generosa_genome_assemblies_BGI/20161201/"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {
 "collapsed": true
 },
 "outputs": [],
 "source": [
 "%%bash\n",
 "time wget -m ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Panopea_generosa"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "#### View directory structure of downloaded files"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": 

#### Well, what we see (and should've realized when we ran ```grep -n real``` in cell 7, which returned only a single match) is that there is no output from that ```wget``` command.

#### So, let's see if the files got downloaded or not...

In [12]:
%%bash
ls -lhr /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/

total 1.3G
-rw-rw-rw- 1 srlab staff 2.3K Dec 1 05:39 md5.txt
-rw-rw-rw- 1 srlab staff 1.7K Dec 1 09:37 md5.check
drwxrwxrwx 1 srlab staff 704 Dec 10 08:35 clean_data
-rw-rw-rw- 1 srlab staff 1.3G Dec 1 04:12 Panopea_generosa.fa
-rw-rw-rw- 1 srlab staff 432 Dec 1 04:12 N50.xls
-rw-rw-rw- 1 srlab staff 3.6K Dec 1 04:11 17mer.log
-rw-rw-rw- 1 srlab staff 7.6K Dec 1 04:11 17mer.freq


In [13]:
%%bash
ls -lhr /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/

total 71G
-rw-rw-rw- 1 srlab staff 2.3K Dec 1 03:56 lane.lst.stat.xls
-rw-rw-rw- 1 srlab staff 1.3G Dec 1 05:07 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 1.2G Dec 1 05:03 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 2.2G Dec 1 04:55 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 2.0G Dec 1 04:51 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 1.3G Dec 1 05:01 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 1.2G Dec 1 04:58 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 2.3G Dec 1 04:47 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz
-rw-rw-rw- 1 srlab staff 2.1G Dec 1 04:42 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-1

#### OK, the files got downloaded. I'm guessing the enormous output from the Ostrea lurida ```wget``` command crashed the browser, but the notebook commands still proceeded to completion.

### Stripping cell output

#### Use nbconvert to convert from "notebook" format to "notebook" format, which rewrites the notebook file. A [Jupyter Google Group post suggested using ```--ClearOutputPreprocessor.enabled=True```](https://groups.google.com/forum/#!topic/jupyter/z6ODiJ6VUzI) to strip output from cells.

In [15]:
%%bash
jupyter nbconvert \
--to notebook \
/gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb \
--ClearOutputPreprocessor.enabled=True \
--output /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

[NbConvertApp] Converting notebook /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb to notebook
[NbConvertApp] Writing 6510 bytes to /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb


#### Let's see if it worked by doing another line count on the notebook file

In [16]:
%%bash
wc -l /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb

295 /gitrepos/LabDocs/jupyter_nbs/sam/20161206_docker_BGI_genome_downloads.ipynb


#### Indeed it did! Will get that notebook (and this one) pushed to GitHub!