{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Running in Docker container on Ostrich\n", "\n", "#### Started Docker container with the following command:\n", "\n", "```docker run -p 8888:8888 -v /Users/sam/data/:/data -v /Users/sam/owl_home/:/owl_home -v /Users/sam/owl_web/:/owl_web -v /Users/sam/gitrepos:/gitrepos -it 2f0f50dc230c```\n", "\n", "The command allows access to Jupyter Notebook over port 8888 and makes my Jupyter Notebook GitHub repo and my data files on Owl/home and Owl/web accessible to the Docker container.\n", "\n", "Once the container was running, I started Jupyter Notebook with the following command inside the Docker container:\n", "\n", "```jupyter notebook```\n", "\n", "This is configured in the Docker container to launch a Jupyter Notebook without a browser on port 8888.\n", "\n", "The Docker container is running on an image created from this [Dockerfile (Git commit ac060a2)](https://github.com/sr320/LabDocs/blob/ac060a2f2e0a2f6714c2f657a30980d37253b3b0/code/dockerfiles/Dockerfile.bio)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wed Dec 14 20:55:08 UTC 2016\n" ] } ], "source": [ "%%bash\n", "date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check computer specs" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0f2bca9c664b\n" ] } ], "source": [ "%%bash\n", "hostname" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Architecture: x86_64\n", "CPU op-mode(s): 32-bit, 64-bit\n", "Byte Order: Little Endian\n", "CPU(s): 8\n", "On-line CPU(s) list: 0-7\n", "Thread(s) per core: 1\n", "Core(s) per socket: 8\n", "Socket(s): 1\n", "Vendor ID: GenuineIntel\n", "CPU family: 6\n", "Model: 26\n", 
"Model name: Intel(R) Xeon(R) CPU E5520 @ 2.27GHz\n", "Stepping: 5\n", "CPU MHz: 2260.998\n", "BogoMIPS: 4521.99\n", "Hypervisor vendor: KVM\n", "Virtualization type: full\n", "L1d cache: 32K\n", "L1i cache: 32K\n", "L2 cache: 256K\n", "L3 cache: 8192K\n" ] } ], "source": [ "%%bash\n", "lscpu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check out files provided by BGI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ostrea lurida" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/owl_web/O_lurida_genome_assemblies_BGI/20161201/\n", "`-- cdts-hk.genomics.cn\n", " `-- Ostrea_lurida\n", " |-- 17mer.freq\n", " |-- 17mer.log\n", " |-- N50.xls\n", " |-- Ostrea_lurida.fa\n", " |-- clean_data\n", " | |-- 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\n", " | |-- 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\n", " | |-- 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\n", " | |-- 151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n", " | |-- 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n", " | |-- 151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", " | |-- 
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", " | `-- lane.lst.stat.xls\n", " |-- md5.check\n", " `-- md5.txt\n", "\n", "3 directories, 25 files\n" ] } ], "source": [ "%%bash\n", "tree /owl_web/O_lurida_genome_assemblies_BGI/20161201/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Panopea generosa" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/owl_web/P_generosa_genome_assemblies_BGI/20161201/\n", "`-- cdts-hk.genomics.cn\n", " `-- Panopea_generosa\n", " |-- 17mer.freq\n", " |-- 17mer.log\n", " |-- N50.xls\n", " |-- Panopea_generosa.fa\n", " |-- clean_data\n", " | |-- 151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fq.gz.clean.dup.clean.gz\n", " | |-- 151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fq.gz.clean.dup.clean.gz\n", " | |-- 151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fq.gz.clean.dup.clean.gz\n", " | |-- 151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fq.gz.clean.dup.clean.gz\n", " | |-- 151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fq.gz.clean.dup.clean.gz\n", " | |-- 151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz\n", " | |-- 
160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz\n", " | |-- 160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz\n", " | `-- lane.lst.stat.xls\n", " |-- md5.check\n", " `-- md5.txt\n", "\n", "3 directories, 25 files\n" ] } ], "source": [ "%%bash\n", "tree /owl_web/P_generosa_genome_assemblies_BGI/20161201/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Compare md5 checksums" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### View md5 files provided by BGI" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Ostrea lurida" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz: OK\n", 
"clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz: OK\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz: OK\n", "17mer.log: OK\n", "17mer.freq: OK\n", "N50.xls: OK\n", "Ostrea_lurida.fa: OK\n" ] } ], "source": [ "%%bash\n", "cat /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.check" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "810d188468dbd8bb36b2af3bf3b9fee6 clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\n", "cf92b18e0815dc0471d61f9107142257 clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\n", "3dc2137d7df0af8d6a007516908361a3 clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\n", "8bc0d7c7a7af3954baca31a4a7fe9f2b clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n", "08cfdc6fdc5a6190cb05cdcb81fa5b9c 
clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n", "a503043167457337a65d51151ceb5dd0 clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\n", "b371a3b3588060bb2200f5caaf9a9d5c clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "dd8b0fef21fc5e330d08ae4a48c8d67b clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "50b5dbff9426738005b81efd49d66329 clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "29be6e734ad65c180aee23d1514c5c35 clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "a0c0177a7a4a4ca28c37bd3802361564 clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "74e51d608de7a409a29545a95ac3ec14 clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "9b5f7c1593f216f710814c299738493f clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "5a6392d9c23aa85d170d7f50de4f0b54 clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "7438fcf14797976de9288368236ed75d clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "26dd6bc17c1596e2881924bcc167125a clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "4221fa24ccbb1735202fcfaa80d69d95 clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "0adbb85da2610131b728417669e0922c clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "fafc80b25613cff45598f4f37dde9d8e 17mer.log\n", "e9ea19f40b0ebb212ab33c78e350a679 17mer.freq\n", "a4f92de7bd24d7ebbc02c40696a0f4e8 N50.xls\n", "f0a7772d4f1074698b50c913783c6fe2 Ostrea_lurida.fa\n" ] } ], "source": [ "%%bash\n", "cat 
/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate md5 checksums of fastq files" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t31m28.393s\n", "user\t0m7.370s\n", "sys\t6m27.230s\n" ] } ], "source": [ "%%bash\n", "\n", "#The for loop generates an md5 checksum hash value for each file\n", "#and appends the output to the checksums.md5 file.\n", "time for file in /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/*.gz\n", " do\n", " md5sum \"$file\" >> /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5\n", " done\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "410cfcdf170125f4d8cb1ac4baf0007c /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\n", "cf92b18e0815dc0471d61f9107142257 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\n", "3dc2137d7df0af8d6a007516908361a3 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\n", "8bc0d7c7a7af3954baca31a4a7fe9f2b /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n", "08cfdc6fdc5a6190cb05cdcb81fa5b9c 
/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n", "a503043167457337a65d51151ceb5dd0 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\n", "a0c0177a7a4a4ca28c37bd3802361564 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "74e51d608de7a409a29545a95ac3ec14 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "7438fcf14797976de9288368236ed75d /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "26dd6bc17c1596e2881924bcc167125a /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "9b5f7c1593f216f710814c299738493f /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "5a6392d9c23aa85d170d7f50de4f0b54 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "4221fa24ccbb1735202fcfaa80d69d95 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "0adbb85da2610131b728417669e0922c 
/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "b371a3b3588060bb2200f5caaf9a9d5c /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "dd8b0fef21fc5e330d08ae4a48c8d67b /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "50b5dbff9426738005b81efd49d66329 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "29be6e734ad65c180aee23d1514c5c35 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n" ] } ], "source": [ "%%bash\n", "cat /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate md5 checksums of log, freq, xls, and fa files" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t0m16.567s\n", "user\t0m0.090s\n", "sys\t0m4.270s\n" ] } ], "source": [ "%%bash\n", "\n", "#The for loop generates an md5 checksum hash value for each file\n", "#and appends the output to the checksums.md5 file.\n", "time for file in /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/*.[lfx]*\n", " do\n", " md5sum \"$file\" >> /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5\n", " done" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, 
"outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "e9ea19f40b0ebb212ab33c78e350a679 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/17mer.freq\n", "fafc80b25613cff45598f4f37dde9d8e /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/17mer.log\n", "a4f92de7bd24d7ebbc02c40696a0f4e8 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/N50.xls\n", "1c8c33470654e3f7993e48b4a6b4989a /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/Ostrea_lurida.fa\n" ] } ], "source": [ "%%bash\n", "cat /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "#### Compare checksums of FASTQ files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### The gist is that the output from the awk command is saved to an array. Then a for loop is run which iterates over the array and prints each position.\n", "\n", "---\n", "###### Break down the first line:\n", "\n", "```bgi_md5=()``` - This is an empty array called \"bgi_md5\".\n", "\n", "```$()``` - This is an empty command substitution. The stdout of commands within the parenthese are stored.\n", "\n", "```awk '/gz/{print $1}' md5_file``` - Awk looks for any lines from the input file (md5_file) with \"gz\" in them. If a line contains \"gz\", awk prints the first field (i.e. the first column).\n", "\n", "Summary - The output from each result printed by awk is saved in an auto-incrementing fashion in the array called \"bgi_md5\".\n", "\n", "---\n", "###### Break down the 3rd line:\n", "\n", "```count=$(())``` - A variable called \"count\". This is a combination of empty command substitution and bash arithmeetic. 
Double parentheses are required for bash arithmetic.\n", "\n", "\n", "```${#bgi_md5[@]} - 1``` - This evaluates to the number of elements in the array called \"bgi_md5\" and subtracts 1 from that number. Subtracting one is necessary because bash arrays are zero-indexed (i.e. the array starts at index 0), so the last index is one less than the number of elements.\n", "\n", "Summary - The length of the array minus one is saved to the variable called \"count\".\n", "\n", "---\n", "###### Break down the for loop:\n", "\n", "```((i=0;i<=$count;++i))``` - Sets variable \"i\" to 0. Then, the loop evaluates whether or not the value of \"i\" is less than or equal to the value in the variable \"count\". If that condition is met, the loop body runs, the value stored in \"i\" is increased by 1, and the condition is checked again.\n", "\n", "```printf \"%s\\n\" \"${bgi_md5[$i]}\"``` - Prints the value at the array index designated by the value currently stored in \"i\" (the printing is specified by the \"%s\", which means string). This is followed by printing a new line (\\n).\n", "\n", "Summary - This prints the value at each position within the array and uses printf to improve legibility of output." 
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "810d188468dbd8bb36b2af3bf3b9fee6\n", "410cfcdf170125f4d8cb1ac4baf0007c\n", "\n", "cf92b18e0815dc0471d61f9107142257\n", "cf92b18e0815dc0471d61f9107142257\n", "\n", "3dc2137d7df0af8d6a007516908361a3\n", "3dc2137d7df0af8d6a007516908361a3\n", "\n", "8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "\n", "08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "\n", "a503043167457337a65d51151ceb5dd0\n", "a503043167457337a65d51151ceb5dd0\n", "\n", "b371a3b3588060bb2200f5caaf9a9d5c\n", "a0c0177a7a4a4ca28c37bd3802361564\n", "\n", "dd8b0fef21fc5e330d08ae4a48c8d67b\n", "74e51d608de7a409a29545a95ac3ec14\n", "\n", "50b5dbff9426738005b81efd49d66329\n", "7438fcf14797976de9288368236ed75d\n", "\n", "29be6e734ad65c180aee23d1514c5c35\n", "26dd6bc17c1596e2881924bcc167125a\n", "\n", "a0c0177a7a4a4ca28c37bd3802361564\n", "9b5f7c1593f216f710814c299738493f\n", "\n", "74e51d608de7a409a29545a95ac3ec14\n", "5a6392d9c23aa85d170d7f50de4f0b54\n", "\n", "9b5f7c1593f216f710814c299738493f\n", "4221fa24ccbb1735202fcfaa80d69d95\n", "\n", "5a6392d9c23aa85d170d7f50de4f0b54\n", "0adbb85da2610131b728417669e0922c\n", "\n", "7438fcf14797976de9288368236ed75d\n", "b371a3b3588060bb2200f5caaf9a9d5c\n", "\n", "26dd6bc17c1596e2881924bcc167125a\n", "dd8b0fef21fc5e330d08ae4a48c8d67b\n", "\n", "4221fa24ccbb1735202fcfaa80d69d95\n", "50b5dbff9426738005b81efd49d66329\n", "\n", "0adbb85da2610131b728417669e0922c\n", "29be6e734ad65c180aee23d1514c5c35\n", "\n" ] } ], "source": [ "%%bash\n", "bgi_md5=($(awk '/gz/{print $1}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt))\n", "my_md5=($(awk '/gz/{print $1}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "count=$(( ${#bgi_md5[@]} - 1 ))\n", "for 
((i=0;i<=$count;++i))\n", " do\n", " printf \"%s\\n\" \"${bgi_md5[$i]}\"\n", " printf \"%s\\n\\n\" \"${my_md5[$i]}\"\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### There are problems. Most of the files do not have matching md5 checksums. Let's see if we can tweak the code from above to print file names to make identification easier...\n", "\n", "##### The code below adds a line to print the 2nd field of the checksum files and add that value to a new array. It also adds the following operation in the printf statements:\n", "\n", "```${bgi_filename[$i]##*/}``` - Like before, this prints the value of the array at each index specified by the value stored in \"i\". It also uses parameter substitution for substring removal. The ```##*/``` matches the longest pattern before, and including the last slash, and deletes that pattern. This effectively removes the full path details and leaves us with just the filename." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz 810d188468dbd8bb36b2af3bf3b9fee6\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz 410cfcdf170125f4d8cb1ac4baf0007c\n", "\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz cf92b18e0815dc0471d61f9107142257\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz cf92b18e0815dc0471d61f9107142257\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz 3dc2137d7df0af8d6a007516908361a3\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz 3dc2137d7df0af8d6a007516908361a3\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz 8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz 8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "\n", 
"151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz 08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz 08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz a503043167457337a65d51151ceb5dd0\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz a503043167457337a65d51151ceb5dd0\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz b371a3b3588060bb2200f5caaf9a9d5c\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz a0c0177a7a4a4ca28c37bd3802361564\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz dd8b0fef21fc5e330d08ae4a48c8d67b\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 74e51d608de7a409a29545a95ac3ec14\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz 50b5dbff9426738005b81efd49d66329\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 7438fcf14797976de9288368236ed75d\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz 29be6e734ad65c180aee23d1514c5c35\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 26dd6bc17c1596e2881924bcc167125a\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz a0c0177a7a4a4ca28c37bd3802361564\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz 9b5f7c1593f216f710814c299738493f\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 74e51d608de7a409a29545a95ac3ec14\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 5a6392d9c23aa85d170d7f50de4f0b54\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz 9b5f7c1593f216f710814c299738493f\n", 
"160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 4221fa24ccbb1735202fcfaa80d69d95\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 5a6392d9c23aa85d170d7f50de4f0b54\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 0adbb85da2610131b728417669e0922c\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 7438fcf14797976de9288368236ed75d\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz b371a3b3588060bb2200f5caaf9a9d5c\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 26dd6bc17c1596e2881924bcc167125a\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz dd8b0fef21fc5e330d08ae4a48c8d67b\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 4221fa24ccbb1735202fcfaa80d69d95\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz 50b5dbff9426738005b81efd49d66329\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 0adbb85da2610131b728417669e0922c\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz 29be6e734ad65c180aee23d1514c5c35\n", "\n" ] } ], "source": [ "%%bash\n", "bgi_md5=($(awk '/gz/{print $1}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt))\n", "bgi_filename=($(awk '/gz/{print $2}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt))\n", "my_md5=($(awk '/gz/{print $1}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "my_filename=($(awk '/gz/{print $2}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "count=$(( ${#bgi_md5[@]} - 1 ))\n", "for ((i=0;i<=$count;++i))\n", " do\n", " printf \"%s %s\\n\" 
\"${bgi_filename[$i]##*/}\" \"${bgi_md5[$i]}\"\n", " printf \"%s %s\\n\\n\" \"${my_filename[$i]##*/}\" \"${my_md5[$i]}\"\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Well, the output with the file names helps to partly explain why the md5 checksums didn't match in the previous command: the outputs from the BGI md5 file and my md5 aren't in the same order (why?). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Although not critical, I'd like to figure out why the sorting seems whacky in the above command..." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\n", "clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\n", "clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\n", "clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n", "clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n", "clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", 
"clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n" ] } ], "source": [ "%%bash\n", "awk '/gz/{print $2}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", 
"/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n" ] } ], "source": [ "%%bash\n", "awk '/gz/{print $2}' 
/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### I didn't really have to run the second command on my checksums.md5. The md5.txt file provided by BGI is not sorted correctly. Odd...\n", "\n", "#### Let's see if we can fix the sorting for the checksum comparison" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Test sorting of md5 file before parsing with awk." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\n", "clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\n", "clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\n", "clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\n", "clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\n", "clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\n", 
"clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\n", "clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\n" ] } ], "source": [ "%%bash\n", "sort -k2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt | awk '/gz/{print $2}'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### That works, so let's use that to populate the arrays and see if we get proper output..." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz 810d188468dbd8bb36b2af3bf3b9fee6\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz 410cfcdf170125f4d8cb1ac4baf0007c\n", "\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz cf92b18e0815dc0471d61f9107142257\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz cf92b18e0815dc0471d61f9107142257\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz 3dc2137d7df0af8d6a007516908361a3\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz 3dc2137d7df0af8d6a007516908361a3\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz 8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz 8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz 08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz 08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "\n", 
"151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz a503043167457337a65d51151ceb5dd0\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz a503043167457337a65d51151ceb5dd0\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz a0c0177a7a4a4ca28c37bd3802361564\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz a0c0177a7a4a4ca28c37bd3802361564\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 74e51d608de7a409a29545a95ac3ec14\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 74e51d608de7a409a29545a95ac3ec14\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 7438fcf14797976de9288368236ed75d\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 7438fcf14797976de9288368236ed75d\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 26dd6bc17c1596e2881924bcc167125a\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 26dd6bc17c1596e2881924bcc167125a\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz 9b5f7c1593f216f710814c299738493f\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz 9b5f7c1593f216f710814c299738493f\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 5a6392d9c23aa85d170d7f50de4f0b54\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 5a6392d9c23aa85d170d7f50de4f0b54\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 4221fa24ccbb1735202fcfaa80d69d95\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 4221fa24ccbb1735202fcfaa80d69d95\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 0adbb85da2610131b728417669e0922c\n", 
"160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 0adbb85da2610131b728417669e0922c\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz b371a3b3588060bb2200f5caaf9a9d5c\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz b371a3b3588060bb2200f5caaf9a9d5c\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz dd8b0fef21fc5e330d08ae4a48c8d67b\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz dd8b0fef21fc5e330d08ae4a48c8d67b\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz 50b5dbff9426738005b81efd49d66329\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz 50b5dbff9426738005b81efd49d66329\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz 29be6e734ad65c180aee23d1514c5c35\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz 29be6e734ad65c180aee23d1514c5c35\n", "\n" ] } ], "source": [ "%%bash\n", "bgi_md5=($(sort -k2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt | awk '/gz/{print $1}'))\n", "bgi_filename=($(sort -k2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt | awk '/gz/{print $2}'))\n", "my_md5=($(awk '/gz/{print $1}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "my_filename=($(awk '/gz/{print $2}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "count=$(( ${#bgi_md5[@]} - 1 ))\n", "for ((i=0;i<=$count;++i))\n", " do\n", " printf \"%s %s\\n\" \"${bgi_filename[$i]##*/}\" \"${bgi_md5[$i]}\"\n", " printf \"%s %s\\n\\n\" \"${my_filename[$i]##*/}\" \"${my_md5[$i]}\"\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### OK! 
Visual comparison of the md5 checksums reveals that 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz has differing checksums and will need to be re-downloaded.\n", "\n", "##### I actually already knew that because I had been monitoring file sizes as things downloaded and noticed that this file was significantly smaller than the size listed on the BGI server, but I still needed to verify the integrity of all the other files, too." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Re-download 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The ```wget``` command below utilizes two arguments:\n", "\n", "```-q``` - This is for \"quiet\" output and will not print [the thousands of lines of output that screwed up my notebook the first time around](http://onsnetwork.org/kubu4/2016/12/14/data-management-download-final-bgi-genome-assembly-files/).\n", "\n", "```-O``` - Specifies the output filename. This is required to overwrite an existing file with the same name." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t193m14.075s\n", "user\t0m0.500s\n", "sys\t4m1.740s\n" ] } ], "source": [ "%%bash\n", "time wget -q ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz \\\n", "-O /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Remove original 151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz checksum from file\n", "##### Uses the ```-v``` flag in grep to exclude any lines that match (i.e. print all other lines)." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "grep -v \\\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5 > \\\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5_temp; \\\n", "mv /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5_temp \\\n", "/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate md5 checksum" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "md5sum /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz \\\n", ">> /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Sort updated checksum file\n", "##### Sorts to a temporary file and then uses the ```mv``` command to overwrite the old, unsorted version of the file." 
] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "sort -k2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5 > tmp && mv tmp /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "810d188468dbd8bb36b2af3bf3b9fee6 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz\r\n", "cf92b18e0815dc0471d61f9107142257 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz\r\n", "3dc2137d7df0af8d6a007516908361a3 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz\r\n", "8bc0d7c7a7af3954baca31a4a7fe9f2b /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz\r\n", "08cfdc6fdc5a6190cb05cdcb81fa5b9c /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz\r\n", "a503043167457337a65d51151ceb5dd0 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz\r\n", "a0c0177a7a4a4ca28c37bd3802361564 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\r\n", "74e51d608de7a409a29545a95ac3ec14 
/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\r\n", "7438fcf14797976de9288368236ed75d /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\r\n", "26dd6bc17c1596e2881924bcc167125a /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\r\n", "9b5f7c1593f216f710814c299738493f /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz\r\n", "5a6392d9c23aa85d170d7f50de4f0b54 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz\r\n", "4221fa24ccbb1735202fcfaa80d69d95 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz\r\n", "0adbb85da2610131b728417669e0922c /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz\r\n", "b371a3b3588060bb2200f5caaf9a9d5c /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\r\n", "dd8b0fef21fc5e330d08ae4a48c8d67b /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\r\n", "50b5dbff9426738005b81efd49d66329 
/owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz\r\n", "29be6e734ad65c180aee23d1514c5c35 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz\r\n" ] } ], "source": [ "cat /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### MD5 checksum comparisons\n", "\n", "##### See my original MD5 comparisons above for an explanation of the majority of the code.\n", "\n", "##### This code adds a ```sort -k2``` command to deal with the improperly sorted md5.txt file provided by BGI. This command sorts on the second column (-k2), which is the column that contains the filenames." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz 810d188468dbd8bb36b2af3bf3b9fee6\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz.clean.dup.clean.gz 810d188468dbd8bb36b2af3bf3b9fee6\n", "\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz cf92b18e0815dc0471d61f9107142257\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz.clean.dup.clean.gz cf92b18e0815dc0471d61f9107142257\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz 3dc2137d7df0af8d6a007516908361a3\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz.clean.dup.clean.gz 3dc2137d7df0af8d6a007516908361a3\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz 8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz.clean.dup.clean.gz 8bc0d7c7a7af3954baca31a4a7fe9f2b\n", "\n", 
"151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz 08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz.clean.dup.clean.gz 08cfdc6fdc5a6190cb05cdcb81fa5b9c\n", "\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz a503043167457337a65d51151ceb5dd0\n", "151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz.clean.dup.clean.gz a503043167457337a65d51151ceb5dd0\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz a0c0177a7a4a4ca28c37bd3802361564\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz a0c0177a7a4a4ca28c37bd3802361564\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 74e51d608de7a409a29545a95ac3ec14\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 74e51d608de7a409a29545a95ac3ec14\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 7438fcf14797976de9288368236ed75d\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 7438fcf14797976de9288368236ed75d\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 26dd6bc17c1596e2881924bcc167125a\n", "160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 26dd6bc17c1596e2881924bcc167125a\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz 9b5f7c1593f216f710814c299738493f\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz.clean.dup.clean.gz 9b5f7c1593f216f710814c299738493f\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 5a6392d9c23aa85d170d7f50de4f0b54\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz.clean.dup.clean.gz 5a6392d9c23aa85d170d7f50de4f0b54\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 4221fa24ccbb1735202fcfaa80d69d95\n", 
"160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz.clean.dup.clean.gz 4221fa24ccbb1735202fcfaa80d69d95\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 0adbb85da2610131b728417669e0922c\n", "160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz.clean.dup.clean.gz 0adbb85da2610131b728417669e0922c\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz b371a3b3588060bb2200f5caaf9a9d5c\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz b371a3b3588060bb2200f5caaf9a9d5c\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz dd8b0fef21fc5e330d08ae4a48c8d67b\n", "160103_I137_FCH3V5YBBXX_L5_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz dd8b0fef21fc5e330d08ae4a48c8d67b\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz 50b5dbff9426738005b81efd49d66329\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_1.fq.gz.clean.dup.clean.gz 50b5dbff9426738005b81efd49d66329\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz 29be6e734ad65c180aee23d1514c5c35\n", "160103_I137_FCH3V5YBBXX_L6_WHOSTibkDCAADWAAPEI-74_2.fq.gz.clean.dup.clean.gz 29be6e734ad65c180aee23d1514c5c35\n", "\n" ] } ], "source": [ "%%bash\n", "bgi_md5=($(sort -k2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt | awk '/gz/{print $1}'))\n", "bgi_filename=($(sort -k2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt | awk '/gz/{print $2}'))\n", "my_md5=($(awk '/gz/{print $1}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "my_filename=($(awk '/gz/{print $2}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/clean_data/checksums.md5))\n", "count=$(( ${#bgi_md5[@]} - 1 ))\n", "for ((i=0;i<=$count;++i))\n", " do\n", " 
printf \"%s %s\\n\" \"${bgi_filename[$i]##*/}\" \"${bgi_md5[$i]}\"\n", " printf \"%s %s\\n\\n\" \"${my_filename[$i]##*/}\" \"${my_md5[$i]}\"\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare FASTA assembly checksums" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "f0a7772d4f1074698b50c913783c6fe2 Ostrea_lurida.fa\n", "1c8c33470654e3f7993e48b4a6b4989a /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/Ostrea_lurida.fa\n" ] } ], "source": [ "%%bash\n", "awk '/\\.fa/{print $0}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt\n", "awk '/\\.fa/{print $0}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The checksums do not match. Will re-download..." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Re-download Ostrea_lurida.fa" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t22m20.294s\n", "user\t0m0.050s\n", "sys\t0m26.100s\n" ] } ], "source": [ "%%bash\n", "time wget -q ftp://F15FTSUSAT0327:OSTibkD@cdts-hk.genomics.cn/Ostrea_lurida/Ostrea_lurida.fa \\\n", "-O /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/Ostrea_lurida.fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Remove original Ostrea_lurida.fa checksum from file\n", "##### Uses the ```-v``` flag in grep to exclude any lines that match (i.e. print all other lines)." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "grep -v \\\n", "Ostrea_lurida.fa /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5 > \\\n", "tmp; \\\n", "mv tmp /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate md5 checksum" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "md5sum /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/Ostrea_lurida.fa \\\n", ">> /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare checksums" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "f0a7772d4f1074698b50c913783c6fe2 Ostrea_lurida.fa\n", "f0a7772d4f1074698b50c913783c6fe2 /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/Ostrea_lurida.fa\n" ] } ], "source": [ "%%bash\n", "awk '/\\.fa/{print $0}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/md5.txt\n", "awk '/\\.fa/{print $0}' /owl_web/O_lurida_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Ostrea_lurida/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Panopea generosa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### View md5 files provided by BGI" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4e95a487fd10e60ea5af7b62287f88a0 clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fq.gz.clean.dup.clean.gz\n", 
"ce78b1856aa612473de14a9599d0c4ff clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fq.gz.clean.dup.clean.gz\n", "a21f45349c1ebae6f8723d07613189cb clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fq.gz.clean.dup.clean.gz\n", "23fcb4cda2aba74e97644a8cfac46a59 clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fq.gz.clean.dup.clean.gz\n", "4b6225d3d0280fc9bf4161c6f86b5586 clean_data/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fq.gz.clean.dup.clean.gz\n", "b9f80f96fda43362da9ee08aced56bef clean_data/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fq.gz.clean.dup.clean.gz\n", "109507129ac270d54159bce197ce0b13 clean_data/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz\n", "10acd92a3b2170367b09f45be30c7b4c clean_data/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz\n", "203c8428f6a787db0d49d43801fef0e3 clean_data/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz\n", "b9edf4fe8ea20d2b2f044837c18c2d2f clean_data/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz\n", "efcefa598dfc9230aa2f5b4d787388ee clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz\n", "0d3fabf04fc56e6249378fcddc286f06 clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz\n", "666f18e9cf0ba248e41bc089736932bb clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz\n", "e73619c9bff1dbc816b6b17ca16518b1 clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz\n", "ea41c8677502dc096a67dfabf279d556 clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz\n", "8cd443fbd790d71378cab4df87cac247 clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz\n", "712a5c2c6d7a5851101face85b234bb6 clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz\n", 
"97089726a31e080920b66caec4a7aee9 clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz\n", "873963ffcaffe6bf1ecc65bd34fedf2a 17mer.log\n", "4df56d94c5357baf764658c5f53f7609 17mer.freq\n", "c01069a6d2a6a0e6bcce7daa1e339253 N50.xls\n", "0348d8a1c5aea2c936ce47b5addcd857 Panopea_generosa.fa\n" ] } ], "source": [ "%%bash\n", "cat /owl_web/P_generosa_genome_assemblies_BGI/rosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/md5.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate md5 checksums of fastq files" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t41m43.942s\n", "user\t0m9.400s\n", "sys\t8m4.230s\n" ] } ], "source": [ "%%bash\n", "\n", "#For loop generates an md5 checksum hash value for each file\n", "#and appends the output to the checksums.md5 file.\n", "time for file in /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/*.gz\n", " do\n", " md5sum \"$file\" >> /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/checksums.md5\n", " done" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4e95a487fd10e60ea5af7b62287f88a0 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fq.gz.clean.dup.clean.gz\n", "ce78b1856aa612473de14a9599d0c4ff /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fq.gz.clean.dup.clean.gz\n", "a21f45349c1ebae6f8723d07613189cb 
/owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fq.gz.clean.dup.clean.gz\n", "23fcb4cda2aba74e97644a8cfac46a59 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fq.gz.clean.dup.clean.gz\n", "4b6225d3d0280fc9bf4161c6f86b5586 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fq.gz.clean.dup.clean.gz\n", "b9f80f96fda43362da9ee08aced56bef /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fq.gz.clean.dup.clean.gz\n", "109507129ac270d54159bce197ce0b13 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz\n", "10acd92a3b2170367b09f45be30c7b4c /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz\n", "203c8428f6a787db0d49d43801fef0e3 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz\n", "b9edf4fe8ea20d2b2f044837c18c2d2f /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz\n", "efcefa598dfc9230aa2f5b4d787388ee /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz\n", "0d3fabf04fc56e6249378fcddc286f06 
/owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz\n", "ea41c8677502dc096a67dfabf279d556 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz\n", "8cd443fbd790d71378cab4df87cac247 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz\n", "666f18e9cf0ba248e41bc089736932bb /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz\n", "e73619c9bff1dbc816b6b17ca16518b1 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz\n", "712a5c2c6d7a5851101face85b234bb6 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz\n", "97089726a31e080920b66caec4a7aee9 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz\n" ] } ], "source": [ "%%bash\n", "cat /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Generate md5 checksums of log, freq, xls, and fa files" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "\n", "real\t0m45.120s\n", "user\t0m0.120s\n", "sys\t0m8.630s\n" ] } ], "source": [ "%%bash\n", "\n", 
"#For loop generates a md5 checksum has value for each file\n", "#and appends the output to the checksums.md5 file.\n", "time for file in /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/*.[lfx]*\n", " do\n", " md5sum \"$file\" >> /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/checksums.md5\n", " done" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4df56d94c5357baf764658c5f53f7609 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/17mer.freq\n", "873963ffcaffe6bf1ecc65bd34fedf2a /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/17mer.log\n", "c01069a6d2a6a0e6bcce7daa1e339253 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/N50.xls\n", "0348d8a1c5aea2c936ce47b5addcd857 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/Panopea_generosa.fa\n" ] } ], "source": [ "%%bash\n", "cat /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare checksums of FASTQ files" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fq.gz.clean.dup.clean.gz 4e95a487fd10e60ea5af7b62287f88a0\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fq.gz.clean.dup.clean.gz 4e95a487fd10e60ea5af7b62287f88a0\n", "\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fq.gz.clean.dup.clean.gz ce78b1856aa612473de14a9599d0c4ff\n", "151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fq.gz.clean.dup.clean.gz ce78b1856aa612473de14a9599d0c4ff\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fq.gz.clean.dup.clean.gz 
a21f45349c1ebae6f8723d07613189cb\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fq.gz.clean.dup.clean.gz a21f45349c1ebae6f8723d07613189cb\n", "\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fq.gz.clean.dup.clean.gz 23fcb4cda2aba74e97644a8cfac46a59\n", "151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fq.gz.clean.dup.clean.gz 23fcb4cda2aba74e97644a8cfac46a59\n", "\n", "151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fq.gz.clean.dup.clean.gz 4b6225d3d0280fc9bf4161c6f86b5586\n", "151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fq.gz.clean.dup.clean.gz 4b6225d3d0280fc9bf4161c6f86b5586\n", "\n", "151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fq.gz.clean.dup.clean.gz b9f80f96fda43362da9ee08aced56bef\n", "151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fq.gz.clean.dup.clean.gz b9f80f96fda43362da9ee08aced56bef\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz 109507129ac270d54159bce197ce0b13\n", "160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz 109507129ac270d54159bce197ce0b13\n", "\n", "160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz 10acd92a3b2170367b09f45be30c7b4c\n", "160103_I137_FCH3V5YBBXX_L3_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz 10acd92a3b2170367b09f45be30c7b4c\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz 203c8428f6a787db0d49d43801fef0e3\n", "160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_1.fq.gz.clean.dup.clean.gz 203c8428f6a787db0d49d43801fef0e3\n", "\n", "160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz b9edf4fe8ea20d2b2f044837c18c2d2f\n", "160103_I137_FCH3V5YBBXX_L4_WHPANwalDDAADWAAPEI-101_2.fq.gz.clean.dup.clean.gz b9edf4fe8ea20d2b2f044837c18c2d2f\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz efcefa598dfc9230aa2f5b4d787388ee\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz 
efcefa598dfc9230aa2f5b4d787388ee\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz 0d3fabf04fc56e6249378fcddc286f06\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz 0d3fabf04fc56e6249378fcddc286f06\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz ea41c8677502dc096a67dfabf279d556\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz ea41c8677502dc096a67dfabf279d556\n", "\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz 8cd443fbd790d71378cab4df87cac247\n", "160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz 8cd443fbd790d71378cab4df87cac247\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz 666f18e9cf0ba248e41bc089736932bb\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz.clean.dup.clean.gz 666f18e9cf0ba248e41bc089736932bb\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz e73619c9bff1dbc816b6b17ca16518b1\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz.clean.dup.clean.gz e73619c9bff1dbc816b6b17ca16518b1\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz 712a5c2c6d7a5851101face85b234bb6\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz.clean.dup.clean.gz 712a5c2c6d7a5851101face85b234bb6\n", "\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz 97089726a31e080920b66caec4a7aee9\n", "160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz.clean.dup.clean.gz 97089726a31e080920b66caec4a7aee9\n", "\n" ] } ], "source": [ "%%bash\n", "bgi_md5=($(sort -k2 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/md5.txt | awk '/gz/{print $1}'))\n", "bgi_filename=($(sort -k2 
/owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/md5.txt | awk '/gz/{print $2}'))\n", "my_md5=($(awk '/gz/{print $1}' /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/checksums.md5))\n", "my_filename=($(awk '/gz/{print $2}' /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/clean_data/checksums.md5))\n", "#Assumes both lists are in the same filename order (md5.txt is sorted on column 2; checksums.md5 was built from a sorted glob).\n", "#Prints each BGI checksum directly above the locally generated one for visual comparison.\n", "count=$(( ${#bgi_md5[@]} - 1 ))\n", "for ((i=0;i<=$count;++i))\n", " do\n", " printf \"%s %s\\n\" \"${bgi_filename[$i]##*/}\" \"${bgi_md5[$i]}\"\n", " printf \"%s %s\\n\\n\" \"${my_filename[$i]##*/}\" \"${my_md5[$i]}\"\n", " done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Everything looks good so far!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare checksums for Panopea_generosa.fa assembly file" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0348d8a1c5aea2c936ce47b5addcd857 Panopea_generosa.fa\n", "0348d8a1c5aea2c936ce47b5addcd857 /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/Panopea_generosa.fa\n" ] } ], "source": [ "%%bash\n", "awk '/\\.fa/{print $0}' /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/md5.txt\n", "awk '/\\.fa/{print $0}' /owl_web/P_generosa_genome_assemblies_BGI/20161201/cdts-hk.genomics.cn/Panopea_generosa/checksums.md5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Great! All of the checksums now match!!! These files are ready for use. Next on the agenda: README files for each directory." 
] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.12" } }, "nbformat": 4, "nbformat_minor": 2 }