{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "### Duplicate files on Eagle (Synology server)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### System identification" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Software:\n", "\n", " System Software Overview:\n", "\n", " System Version: Mac OS X 10.7.5 (11G63)\n", " Kernel Version: Darwin 11.4.2\n", " Boot Volume: SSD2\n", " Boot Mode: Normal\n", " Computer Name: greenbird (2)\n", " User Name: Sam (Sam)\n", " Secure Virtual Memory: Enabled\n", " 64-bit Kernel and Extensions: No\n", " Time since boot: 28 days 18:18\n", "\n" ] } ], "source": [ "%%bash\n", "system_profiler SPSoftwareDataType" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### View output from fslint program" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#2 x 2,765,343,018\t(2,765,343,232)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/NGS Raw Data/Burge_Laby/10233509_709JBAAXX_s_6_sequence.gz\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/NGS Raw Data/Burge_Laby/10233509_709JBAAXX_s_6_sequence2.gz\n", "#2 x 1,337,988,079\t(1,337,988,096)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/NGS Raw Data/Friedman_Oly_broodstock/106A Female Mix/filtered_106A_Female_Mix_GATCAG_L007_R1.fastq.gz\n" ] } ], "source": [ "%%bash\n", "head -5 /Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_archive.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The output format for the fslint shows the number of duplicates for a particular file, the size (in bytes) of a single file, and then the total bytes wasted ( = number of dupes x single file size 
- single file size)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculations for total data consumption of duplicates >100MB on Eagle/web" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "713041788416 bytes wasted\n", "713 gigabytes wasted\n" ] } ], "source": [ "%%bash\n", "#This script calculates the total wasted bytes of files >100MB from the output\n", "#of the Linux program fslint.\n", "\n", "#Use grep to identify lines beginning with '#', extract field #4 (awk '{print $4}'), which contains\n", "#the bytes wasted entry, then remove the parentheses and commas so that bash can treat the value as an integer.\n", "#Save all resulting values to the variable \"dupes\"\n", "dupes=$(grep '^\\#' /Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_web.txt | awk '{print $4}' | tr -d '(),')\n", "\n", "#Initialize variable \"running_total\" to 0.\n", "running_total=0\n", "\n", "#Initialize variable \"single\" to 0.\n", "single=0\n", "\n", "#For loop to process the values in the variable \"dupes\".\n", "for number in $dupes\n", "\tdo\n", "\t\tif [ $number -ge 1000000 ] #If the value in number is greater than or equal to 1000000 bytes (1MB); a 100MB cutoff would be 100000000\n", "\t\tthen\n", "\t\t\tsingle=$number #Value of \"number\" is assigned to \"single\"\n", "\t\t\trunning_total=$((running_total+single)) #Adds \"single\" to \"running_total\" and assigns the sum to \"running_total\"\n", "\t\tfi\n", "done\n", "\n", "#Converts bytes to gigabytes (integer division by 10^9, so the result is truncated) and assigns to variable \"bytes_to_gigs\"\n", "bytes_to_gigs=$((running_total/1000000000))\n", "\n", "#Prints total of bytes wasted calculations in bytes and gigabytes.\n", "echo $running_total\" bytes wasted\"\n", "echo $bytes_to_gigs\" gigabytes wasted\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Calculations for total data consumption of duplicates >100MB on 
Eagle/archive" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "19525825024 bytes wasted\n", "19 gigabytes wasted\n" ] } ], "source": [ "%%bash\n", "#This script calculates the total wasted bytes of files >100MB from the output\n", "#of the Linux program fslint.\n", "\n", "#Use grep to identify lines beginning with '#', extract field #4 (awk '{print $4}'), which contains\n", "#the bytes wasted entry, then remove the parentheses and commas so that bash can treat the value as an integer.\n", "#Save all resulting values to the variable \"dupes\"\n", "dupes=$(grep '^\\#' /Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_archive.txt | awk '{print $4}' | tr -d '(),')\n", "\n", "#Initialize variable \"running_total\" to 0.\n", "running_total=0\n", "\n", "#Initialize variable \"single\" to 0.\n", "single=0\n", "\n", "#For loop to process the values in the variable \"dupes\".\n", "for number in $dupes\n", "\tdo\n", "\t\tif [ $number -ge 1000000 ] #If the value in number is greater than or equal to 1000000 bytes (1MB); a 100MB cutoff would be 100000000\n", "\t\tthen\n", "\t\t\tsingle=$number #Value of \"number\" is assigned to \"single\"\n", "\t\t\trunning_total=$((running_total+single)) #Adds \"single\" to \"running_total\" and assigns the sum to \"running_total\"\n", "\t\tfi\n", "done\n", "\n", "#Converts bytes to gigabytes (integer division by 10^9, so the result is truncated) and assigns to variable \"bytes_to_gigs\"\n", "bytes_to_gigs=$((running_total/1000000000))\n", "\n", "#Prints total of bytes wasted calculations in bytes and gigabytes.\n", "echo $running_total\" bytes wasted\"\n", "echo $bytes_to_gigs\" gigabytes wasted\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Clean up the fslint output files" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is how the file looks.\n", "\n", "Has ugly remote path due to running the fslint program remotely from a Linux 
machine." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#2 x 2,765,343,018\t(2,765,343,232)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/NGS Raw Data/Burge_Laby/10233509_709JBAAXX_s_6_sequence.gz\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/NGS Raw Data/Burge_Laby/10233509_709JBAAXX_s_6_sequence2.gz\n", "#2 x 1,337,988,079\t(1,337,988,096)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/NGS Raw Data/Friedman_Oly_broodstock/106A Female Mix/filtered_106A_Female_Mix_GATCAG_L007_R1.fastq.gz\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/armina/filtered_106A_Female_Mix_GATCAG_L007_R1.fastq.gz\n", "#2 x 1,034,159,435\t(1,034,159,616)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/filefish/MBD_meth_refmap_v030.sam\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=archive/site_sucker/aquacul4.fish.washington.edu/~steven/filefish/MBD_meth_refmap_v030.sam\n", "#2 x 1,011,649,653\t(1,011,650,048)\tbytes wasted\n" ] } ], "source": [ "%%bash\n", "head /Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_archive.txt" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "#Use sed to edit out extra file path info from fslint output file for Eagle/archive\n", "sed 's/\\/run\\/user\\/1000\\/gvfs\\/smb-share\\:server\\=eagle.fish.washington.edu\\,share\\=//g' \\\n", "/Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_archive.txt \\\n", "> /Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_archive_cleaned.txt" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "#2 x 2,765,343,018\t(2,765,343,232)\tbytes wasted\n", "archive/NGS Raw Data/Burge_Laby/10233509_709JBAAXX_s_6_sequence.gz\n", "archive/NGS Raw Data/Burge_Laby/10233509_709JBAAXX_s_6_sequence2.gz\n", "#2 x 1,337,988,079\t(1,337,988,096)\tbytes wasted\n", "archive/NGS Raw Data/Friedman_Oly_broodstock/106A Female Mix/filtered_106A_Female_Mix_GATCAG_L007_R1.fastq.gz\n", "archive/armina/filtered_106A_Female_Mix_GATCAG_L007_R1.fastq.gz\n", "#2 x 1,034,159,435\t(1,034,159,616)\tbytes wasted\n", "archive/filefish/MBD_meth_refmap_v030.sam\n", "archive/site_sucker/aquacul4.fish.washington.edu/~steven/filefish/MBD_meth_refmap_v030.sam\n", "#2 x 1,011,649,653\t(1,011,650,048)\tbytes wasted\n" ] } ], "source": [ "%%bash\n", "#View file after editing with sed\n", "head /Volumes/web/Arabidopsis/20151229_duplicate_files_eagle_archive_cleaned.txt" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#2 x 116,741,159,757\t(116,741,159,936)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=web/cnidarian/Geo-Trinity2/trinity_out_dir/bowtie.nameSorted.sam\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=web/cnidarian/Geo-trinity/trinity_out_dir/bowtie.nameSorted.sam\n", "#2 x 57,464,638,599\t(57,464,638,976)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=web/cnidarian/Geo-Trinity2/trinity_out_dir/both.fa\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=web/cnidarian/Geo-trinity/trinity_out_dir/both.fa\n", "#2 x 16,807,324,618\t(16,807,324,672)\tbytes wasted\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=web/Ichthyophonus/ICH_SNP/iplant_vcf_to_gff.gff\n", "/run/user/1000/gvfs/smb-share:server=eagle.fish.washington.edu,share=web/whale/fish546/module8/ICH_SNP/iplant_vcf_to_gff.gff\n", "#7 x 
2,747,235,434\t(16,483,415,040)\tbytes wasted\n" ] } ], "source": [ "%%bash\n", "head /Volumes/web/Arabidopsis/20160104_duplicate_files_eagle_web.txt" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%%bash\n", "#Use sed to edit out extra file path info from fslint output file for Eagle/web\n", "sed 's/\\/run\\/user\\/1000\\/gvfs\\/smb-share\\:server\\=eagle.fish.washington.edu\\,share\\=//g' \\\n", "/Volumes/web/Arabidopsis/20160104_duplicate_files_eagle_web.txt \\\n", "> /Volumes/web/Arabidopsis/20160104_duplicate_files_eagle_web_cleaned.txt" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "#2 x 116,741,159,757\t(116,741,159,936)\tbytes wasted\n", "web/cnidarian/Geo-Trinity2/trinity_out_dir/bowtie.nameSorted.sam\n", "web/cnidarian/Geo-trinity/trinity_out_dir/bowtie.nameSorted.sam\n", "#2 x 57,464,638,599\t(57,464,638,976)\tbytes wasted\n", "web/cnidarian/Geo-Trinity2/trinity_out_dir/both.fa\n", "web/cnidarian/Geo-trinity/trinity_out_dir/both.fa\n", "#2 x 16,807,324,618\t(16,807,324,672)\tbytes wasted\n", "web/Ichthyophonus/ICH_SNP/iplant_vcf_to_gff.gff\n", "web/whale/fish546/module8/ICH_SNP/iplant_vcf_to_gff.gff\n", "#7 x 2,747,235,434\t(16,483,415,040)\tbytes wasted\n" ] } ], "source": [ "%%bash\n", "#View file after editing with sed\n", "head /Volumes/web/Arabidopsis/20160104_duplicate_files_eagle_web_cleaned.txt" ] } ], "metadata": { "celltoolbar": "Raw Cell Format", "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }