# Repeat Analysis

Repetitive regions were long believed to be just "junk DNA", but recent genomic projects, such as the Human Genome Project, revealed that these repeat elements play crucial roles in genome organization, function, and evolution. Because of this, repeat analysis is an important step in genome annotation and involves identifying and characterizing repetitive elements within a genome. Two widely used tools for this purpose are RepeatModeler and RepeatMasker.

## What you need
You only need **a reference genome** to complete the repeat analysis.

### Installations
RepeatModeler and RepeatMasker have many dependencies, so make sure you have everything below installed and configured. Install these in `/vol/storage/software`.

```
# Install HMMER
cd /vol/storage/software
wget http://eddylab.org/software/hmmer/hmmer-3.3.2.tar.gz
tar -xzf hmmer-3.3.2.tar.gz
rm /vol/storage/software/hmmer-3.3.2.tar.gz
cd hmmer-3.3.2
./configure --prefix=/vol/storage/software/hmmer-3.3.2
make
make install

# Install RM BLAST
cd /vol/storage/software
wget http://www.repeatmasker.org/rmblast/rmblast-2.14.0+-x64-linux.tar.gz
tar -xzf rmblast-2.14.0+-x64-linux.tar.gz
rm rmblast-2.14.0+-x64-linux.tar.gz

# Install TRF
cd /vol/storage/software
wget https://github.com/Benson-Genomics-Lab/TRF/archive/refs/tags/v4.09.1.tar.gz
tar -xzf /vol/storage/software/v4.09.1.tar.gz
rm /vol/storage/software/v4.09.1.tar.gz
mkdir /vol/storage/software/TRF-4.09.1/build
cd /vol/storage/software/TRF-4.09.1/build
/vol/storage/software/TRF-4.09.1/configure --prefix=/vol/storage/software/TRF-4.09.1
make
make install

# Install RepeatMasker
cd /vol/storage/software
wget https://www.repeatmasker.org/RepeatMasker/RepeatMasker-4.1.5.tar.gz
tar -xzf RepeatMasker-4.1.5.tar.gz
rm RepeatMasker-4.1.5.tar.gz

# Configure RepeatMasker (interactive)
cd /vol/storage/software/RepeatMasker
perl ./configure
# Paths to enter when prompted:
#   /vol/storage/software/TRF-4.09.1/bin/trf
#   /vol/storage/software/rmblast-2.14.0/bin
#   /vol/storage/software/hmmer-3.3.2/bin
# Selections; the menu should look like this when you are done:
#   Add a Search Engine:
#     1. Crossmatch: [ Un-configured ]
#     2. RMBlast: [ Configured ]
#     3. HMMER3.1 & DFAM: [ Configured, Default ]
#     4. ABBlast: [ Un-configured ]
#     5. Done

# Install RECON
cd /vol/storage/software
wget http://www.repeatmasker.org/RepeatModeler/RECON-1.08.tar.gz
tar -xzf RECON-1.08.tar.gz
rm /vol/storage/software/RECON-1.08.tar.gz
cd /vol/storage/software/RECON-1.08/src
make
make install

# Install Ninja
cd /vol/storage/software
wget https://github.com/TravisWheelerLab/NINJA/archive/0.95-cluster_only.tar.gz
tar -xzf /vol/storage/software/0.95-cluster_only.tar.gz
rm /vol/storage/software/0.95-cluster_only.tar.gz
mv /vol/storage/software/NINJA-0.95-cluster_only/NINJA/Ninja_new /vol/storage/software/NINJA-0.95-cluster_only/NINJA/Ninja

# Install LTR Retriever
cd /vol/storage/software
wget https://github.com/oushujun/LTR_retriever/archive/v2.8.tar.gz
tar -xzf v2.8.tar.gz
rm v2.8.tar.gz

# Install Mafft
cd /vol/storage/software
wget https://mafft.cbrc.jp/alignment/software/mafft-7.505-with-extensions-src.tgz
tar -xzf /vol/storage/software/mafft-7.505-with-extensions-src.tgz
rm /vol/storage/software/mafft-7.505-with-extensions-src.tgz
# Build the core programs, pointing the Makefile at the local install prefix
cd /vol/storage/software/mafft-7.505-with-extensions/core
sed -i 's#PREFIX = /usr/local#PREFIX = /vol/storage/software/mafft-7.505-with-extensions#' /vol/storage/software/mafft-7.505-with-extensions/core/Makefile
sed -i 's#BINDIR = $(PREFIX)/bin#BINDIR = /vol/storage/software/mafft-7.505-with-extensions/bin#' /vol/storage/software/mafft-7.505-with-extensions/core/Makefile
make clean
make
make install
# Build the extensions the same way
cd /vol/storage/software/mafft-7.505-with-extensions/extensions
sed -i 's#PREFIX = /usr/local#PREFIX = /vol/storage/software/mafft-7.505-with-extensions#' /vol/storage/software/mafft-7.505-with-extensions/extensions/Makefile
sed -i 's#BINDIR = $(PREFIX)/bin#BINDIR = /vol/storage/software/mafft-7.505-with-extensions/bin#' /vol/storage/software/mafft-7.505-with-extensions/extensions/Makefile
make clean
make
make install

# Install CD-Hit
cd /vol/storage/software
wget https://github.com/weizhongli/cdhit/archive/refs/tags/V4.8.1.tar.gz
tar -xzf /vol/storage/software/V4.8.1.tar.gz
rm /vol/storage/software/V4.8.1.tar.gz
cd /vol/storage/software/cdhit-4.8.1
make

# Install Genometools
cd /vol/storage/software
wget http://genometools.org/pub/genometools-1.6.2.tar.gz
tar -xzf /vol/storage/software/genometools-1.6.2.tar.gz
rm /vol/storage/software/genometools-1.6.2.tar.gz
cd /vol/storage/software/genometools-1.6.2
make prefix=/vol/storage/software/genometools-1.6.2 cairo=no

# Install RepeatModeler2
cd /vol/storage/software
wget http://www.repeatmasker.org/RepeatModeler/RepeatModeler-2.0.4.tar.gz
tar -xvf /vol/storage/software/RepeatModeler-2.0.4.tar.gz
rm /vol/storage/software/RepeatModeler-2.0.4.tar.gz
cd /vol/storage/software/RepeatModeler-2.0.4

# Install RepeatScout
cd /vol/storage/software
wget http://www.repeatmasker.org/RepeatScout-1.0.6.tar.gz
tar -xzf /vol/storage/software/RepeatScout-1.0.6.tar.gz
rm /vol/storage/software/RepeatScout-1.0.6.tar.gz
cd /vol/storage/software/RepeatScout-1.0.6
make

# Install the UCSC utilities (the directory is named USCS, as used in the paths below)
rm -rf /vol/storage/software/USCS
mkdir /vol/storage/software/USCS
cd /vol/storage/software/USCS
rsync -aP hgdownload.soe.ucsc.edu::genome/admin/exe/linux.x86_64/ ./

# Install the Perl modules (answer yes to the prompts; use sudo if required)
cpan install JSON
cpan install File::Which
cpan install URI
cpan install Devel::Size
cpan install LWP::UserAgent

# Configure RepeatModeler2 (interactive)
cd /vol/storage/software/RepeatModeler-2.0.4
perl ./configure
# Paths to enter when prompted; answer yes ("y") to using LTR Retriever:
#   /vol/storage/software/RepeatMasker
#   /vol/storage/software/RECON-1.08/bin
#   /vol/storage/software/RepeatScout-1.0.6
#   /vol/storage/software/TRF-4.09.1/bin
#   /vol/storage/software/cdhit-4.8.1
#   /vol/storage/software/USCS
#   /vol/storage/software/rmblast-2.14.0/bin
#   yes
#   /vol/storage/software/genometools-1.6.2/bin
#   /vol/storage/software/LTR_retriever-2.8
#   /vol/storage/software/mafft-7.505-with-extensions/bin
#   /vol/storage/software/NINJA-0.95-cluster_only/NINJA
```

## Running repeat_annotation.bash
You will need to download and edit the [**params.txt**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Parameters/params.txt) file, and download [**repeat_annotation.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/repeat_annotation.bash).

You may change the following parameters in the params.txt file:
```
# ============================
# REPEAT ANNOTATION
# ============================
RM_THREADS=$THREADS
RM_GENOME_SAMPLE=243000000
RM2_DIR="${WORKING_DIR}/RepeatModeler2"
RM_DIR="${WORKING_DIR}/RepeatMasker"
```

Remember to `chmod +x repeat_annotation.bash`. Then run it with nohup and & so that it keeps running in the background, as the analyses can take around 48 h:
```
nohup ./repeat_annotation.bash "species" "path/to/reference_genome.fna.gz" "params.txt" &
```
The results will be copied into the results directory within the species directory.

## Running repeat analysis without params.txt
The easiest way to do the repeat analysis is to run the script [**RepeatAnnotator.bash**](https://github.com/AureKylmanen/Swarmgenomics/blob/main/Scripts/RepeatAnnotator.bash), which you will first need to edit to give the paths to the reference genome, working directory, RepeatModeler2 and RepeatMasker. The beginning of the script should then look something like this:
```
REFERENCE="/vol/storage/swarmGenomics/giant_panda/acreference.fna.gz"
wDIR="/vol/storage/swarmGenomics/giant_panda"
REPEATMODELER2="/vol/storage/software/RepeatModeler-2.0.4"
REPEATMASKER="/vol/storage/software/RepeatMasker"
```
Depending on how many threads you have available, you might need to change the thread numbers, which are currently set to 13.

Before running the script, make sure your reference genome is gzip-compressed. If it is not, you can compress it with gzip:
```
gzip acreference.fna
```
Once you have added the paths to RepeatAnnotator.bash and compressed the reference genome, you can run the script.
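Since the most common failures are mistyped paths and an uncompressed reference, a quick pre-flight check before launching can save a long wait. The sketch below is not part of RepeatAnnotator.bash; the function names `is_valid_gzip` and `check_dirs` are hypothetical helpers introduced here for illustration, and the variables in the usage note are assumed to hold the same values you entered at the top of the script.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; not part of RepeatAnnotator.bash.

# Return 0 if the file exists and passes gzip's built-in integrity test.
is_valid_gzip() {
    local file="$1"
    [ -f "$file" ] && gzip -t "$file" 2>/dev/null
}

# Print each missing directory to stderr; return non-zero if any is missing.
check_dirs() {
    local status=0 dir
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "Missing directory: $dir" >&2
            status=1
        fi
    done
    return "$status"
}
```

After defining `REFERENCE`, `wDIR`, `REPEATMODELER2` and `REPEATMASKER` in your shell, you could run `is_valid_gzip "$REFERENCE" && check_dirs "$wDIR" "$REPEATMODELER2" "$REPEATMASKER" && echo "ready"`. Nothing is printed on success apart from `ready`; any missing directory is named on stderr.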
The script usually takes around 24-48 h to run, depending on the genome size and the number of threads.
```
# Use nohup and & to allow the script to run even when you close the terminal
nohup ./RepeatAnnotator.bash &
```
If the script fails, read the nohup.out file, which will tell you which step failed. The usual causes are typos in the paths or problems with the installations.

## Results
The final results are in an HTML file (e.g. acreference.repeat_landscape.html), which you will need to copy to your computer for viewing. The end result is a repeat landscape, with the Kimura substitution level on the x-axis and the percentage of the genome occupied by each repeat family on the y-axis. Lower Kimura values indicate fewer mutations and thus more recent insertions, while higher values indicate more mutations, suggesting older insertions. The shape and number of peaks can indicate the activity patterns of different repeat families over time. You may want to look up more information on the different repeat families and on how to analyse these results.
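As background for the x-axis, the textbook Kimura two-parameter distance between a repeat copy and its consensus can be written as below. This is a sketch of the general idea only; the landscape scripts may apply further adjustments (for example for CpG sites), so do not assume it is the exact computation behind the plot. Here $P$ is the proportion of aligned sites that differ by a transition and $Q$ the proportion that differ by a transversion.

$$
K = -\frac{1}{2}\,\ln\!\left[(1 - 2P - Q)\,\sqrt{1 - 2Q}\right]
$$

For small values, $K \approx P + Q$, so the distance is close to the raw fraction of mismatches. As a worked example with illustrative numbers, $P = 0.05$ and $Q = 0.02$ give $K = -\tfrac{1}{2}\left[\ln(0.88) + \tfrac{1}{2}\ln(0.96)\right] \approx 0.074$, slightly above the raw 0.07 because the formula corrects for sites that have mutated more than once.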