{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "c43fa0d5", "metadata": {}, "source": [ "# High Performance Computing at Dartmouth\n", "*Written by Courtney Jiminez & Luke Chang*" ] }, { "cell_type": "markdown", "id": "88da1c35", "metadata": {}, "source": [ "## What is High Performance Computing?" ] }, { "cell_type": "markdown", "id": "8fe3c712", "metadata": {}, "source": [ "https://www.youtube.com/watch?v=nIBu1EFYmBU 00:13 - 01:21" ] }, { "cell_type": "markdown", "id": "71f089d0", "metadata": {}, "source": [ "## HPCs at Dartmouth" ] }, { "cell_type": "markdown", "id": "8be827ba", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "4a9e688c", "metadata": {}, "source": [ "https://rc.dartmouth.edu/index.php/discoveryhpc/" ] }, { "cell_type": "markdown", "id": "0c63befd", "metadata": {}, "source": [ "### Discovery Cluster\n", "\n", "Where is Discovery located? In the basement of Baker Berry! In a huge room full of servers. Racks of computers. Each computer is a node - newer nodes have 32 cores with 8GB per core. Used by all Departments (PBS, CS, Physics & Astronomy, etc.)" ] }, { "cell_type": "markdown", "id": "f7591677", "metadata": {}, "source": [ "### System Layout\n", "\n", "Head Node: what you log-on to when you SSH in with your NetID. Can edit code, submit jobs to scheduler from head node - but DON'T want to run jobs from head node. If the head node computer resources get tied up in running jobs, it can keep others from logging in/opening or copying files, etc. \n", "\n", "CPU Node: can't directly log into these - can schedule jobs from head node that get executed on CPU nodes by the scheduler, but can't directly log on (unless interactive node). \n", "\n", "Storage?" ] }, { "cell_type": "markdown", "id": "b50c0259", "metadata": {}, "source": [ "### Research Computing Team\n", "\n", "Centralized high-performance computing system group. 
\n", "\n", "services.dartmouth.edu - where help/request forms are for research computing (try searching for Discovery, SLURM, etc.) Locally written documentation for things that people have problems with. \n", "\n", "Simple, basic examples of what you can do with the scheduler. \n", "\n", "If you have a Discovery question, go to services.dartmouth.edu, search for a keyword, and see if it's already written up. If you have that question, likely a bunch of others did as well, and RC has tried to answer it within the Dartmouth context.\n", "\n", "Email research.computing@dartmouth.edu with questions.\n", "\n", "https://services.dartmouth.edu" ] }, { "cell_type": "markdown", "id": "862d50f8", "metadata": {}, "source": [ "### DBIC Resources on Discovery\n", "\n", "DBIC owns 15 nodes (16 cores?) on Discovery. Access to ~1200 CPUs (Luke, what is this x5 multiplier? How do we go from 15x16=240 to 1200 CPUs?)\n", "\n", "DBIC Node (Ndoli): high-RAM node (1.5 TB), 20-24 CPUs. Can only access it by directly logging in. UserID@ndoli.dartmouth.edu (not available on scheduler). " ] }, { "cell_type": "markdown", "id": "665ba5cf", "metadata": {}, "source": [ "### PBS HPCs\n", "\n", "Dr. Zeus? LOL. Still available - 2-3 nodes, 64 cores, 5-12 GB RAM per core (1.5 TB). Not *really* maintained. 30-40 TB storage (likely completely full?) Primarily used by Social Area. \n", "\n", "Hydra - Haxby Lab." ] }, { "cell_type": "markdown", "id": "61264e91", "metadata": {}, "source": [ "## How do I access Discovery? " ] }, { "cell_type": "markdown", "id": "65095b6a", "metadata": {}, "source": [ "### Need to be on Campus Network\n", "\n", "Eduroam\n", "OpenVPN" ] }, { "cell_type": "markdown", "id": "285223c9", "metadata": {}, "source": [ "### SSH From Terminal\n", "\n", "Can't use Key Pairs." ] }, { "cell_type": "markdown", "id": "8e2bacd7", "metadata": {}, "source": [ "## How do I navigate Discovery? 
" ] }, { "cell_type": "markdown", "id": "6d684a41", "metadata": {}, "source": [ "### Linux\n" ] }, { "cell_type": "markdown", "id": "90f05eb0", "metadata": {}, "source": [ "### BASH\n" ] }, { "cell_type": "markdown", "id": "c4903f70", "metadata": {}, "source": [ "### Text Editors\n", "VIM/Nano" ] }, { "cell_type": "markdown", "id": "8c13ea5f", "metadata": {}, "source": [ "### .bashrc .bash_profile\n", "\n", "when you login.\n", "\n", "don't load modules automatically. " ] }, { "cell_type": "markdown", "id": "03a1c6fc", "metadata": {}, "source": [ "### ACL Permissions\n", "\n", "ls -al\n", "\n", "horribly confusing and annoying. 14 different permissions.\n", "\n", "listacl -v foldername\n", "\n", "POSIX permissions don't apply :( \n", "\n", "see AskPBS\n", "\n", "know what they are when you run into a problem and google it LOL\n", "\n", "can also just ask research computing to update your permissions\n", "\n", "https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=88459" ] }, { "cell_type": "markdown", "id": "78d675bb", "metadata": {}, "source": [ "## Where is my data stored on Discovery?" ] }, { "cell_type": "markdown", "id": "71867866", "metadata": {}, "source": [ "### Dart-FS File System\n", "\n", "DBIC has some storage. \n", "\n", "Lab Storage: For the most part, each lab is responsible for buying their own storage on Discovery (~$120/TB/year)\n", "\n", "Personal Storage: 50GB in Home Directory" ] }, { "cell_type": "markdown", "id": "8ff77459", "metadata": {}, "source": [ "### Scratch Space\n", "\n", "LOCAL SCRATCH\n", "local scratch space on every node of discovery (1 TB), much faster than network storage so if your job needs to keep any amount of state information/files it'd help if you wrote it on /scratch. can then copy things from /sratch to dart-fs at the end of your script - clean up any temp files in local scratch at end of script as well. Can make a big difference in some sorts of jobs. 
Worth noting that local scratch isn't managed - may not be a lot of room on it, but everything gets purged after ~45 days. Usually there's a good amount of space (not too many people use it), but a chance you could end up running a job on a node where the local scratch is a little full. \n", "\n", "GLOBAL SCRATCH \n", "We also have a GLOBAL SCRATCH dart-fs-hpc/scratch but it is not any faster than just accessing lab share on dart-fs etc. " ] }, { "cell_type": "markdown", "id": "55483719", "metadata": {}, "source": [ "### Transferring Data to Discovery\n", "\n", "Moving Data from the Command Line:\n", "\n", "scp, rsync - don't do this from the head (login) node! run interactive job\n", "\n", "datalad? dbic handbook" ] }, { "cell_type": "markdown", "id": "a43d4366", "metadata": {}, "source": [ "## How do I manage software on Discovery? " ] }, { "cell_type": "markdown", "id": "9a2e2535", "metadata": {}, "source": [ "### Modules\n", "\n", "Modules are software that are already installed on Discovery (Python, R, FSL). Module commands allow you to modify your environment on Discovery. \n", "\n", "If you want more control over the software you are using, can always build your own modules in your home/lab directory.\n", "\n", "You can load these software modules when you want to use these programs. Can do manually (module load command). \n", "\n", "Can see a list of modules by using 'module avail' command. \n", "\n", "module load\n", "module avail\n", "module list\n", "module unload\n", "module purge\n", "\n", "weird* - if you've loaded a module, then submit a job, the batch job inherits the environment from which it was submitted - can avoid this by: \n", "\n", "using the #!/bin/bash -l flag in shebang line of sbatch job script. 
this sets up a log-in environment (reads bashrc profile)\n", "\n", "putting module purge at the start of each job script and then loading the specific modules you need " ] }, { "cell_type": "markdown", "id": "6fe0304b", "metadata": {}, "source": [ "### Conda\n", "\n", "A way to install additional Python/R packages to your computing environment on the cluster. Don't need any privileges - installs dependencies, creates separate environments to avoid conflicts. \n", "\n", "Configured inside .bashrc profile - so if using with an SBATCH script, shebang line needs -l flag (#!/bin/bash -l). This results in the SBATCH script reading your .bashrc and .bash_profile scripts and behaving like your log-in shell. Without it, .bashrc doesn't get read and Conda isn't available. SBATCH script inherits current environment." ] }, { "cell_type": "markdown", "id": "7281c483", "metadata": {}, "source": [ "### Containers\n", "\n", "A reproducible environment in which you control what software is in it. Can load container onto the cluster, run scripts that use software in that container. \n", "\n", "Docker is the most popular container tool, but requires ROOT ACCESS to the computer it is being run on. Therefore, can't use Docker on Discovery.\n", "\n", "Singularity - can be run on the cluster (doesn't require ROOT ACCESS). Can convert a Docker container to a Singularity container, etc. \n", "\n", "Can build them in home/lab directories.\n", "\n", "YCRC Workshop" ] }, { "cell_type": "markdown", "id": "20b8a131", "metadata": {}, "source": [ "## How do I run jobs on Discovery?" ] }, { "cell_type": "markdown", "id": "0040e9a6", "metadata": {}, "source": [ "### SLURM\n", "\n", "SLURM is an open source, free and very flexible scheduler.\n", "\n", "SLURM is the SCHEDULER for Discovery - it is in control of a whole bunch of computers. When we submit jobs, we request a certain number of CPUs, a certain amount of RAM, and specify how long the job will take. The scheduler decides which computer to run that job on. 
Potentially 1000s of people submitting jobs, and scheduler is trying to optimize resource allocation. " ] }, { "cell_type": "markdown", "id": "b0e6a7b7", "metadata": {}, "source": [ "### Common SLURM Commands\n", "\n", "SBATCH - submit a batch job script to the scheduler. \n", "\n", "SRUN - run a command (job step) on allocated resources; also used to start interactive jobs. \n", "\n", "SQUEUE - SLURM function that will list everything in the queue for your SLURM account (DBIC group). SQUEUE will show what is running, who owns it, and what is queued. Will also show WHY it's queued (e.g. group limit reached). \n", "\n", "SCANCEL - cancel a queued or running job. \n", "\n", "SACCTMGR - show associations, to find out what your scheduling limits are. \n", "\n", "SSTAT - show resource usage for a job that is currently running. \n", "\n", "SINFO - show the state of nodes and partitions. \n", "\n", "SACCT - show accounting information for running and finished jobs. \n", "\n", "SEFF - summarize CPU and memory efficiency of a finished job; doesn't really work well for array jobs. " ] }, { "cell_type": "markdown", "id": "5f682c9e", "metadata": {}, "source": [ "### Requesting Resources\n", "\n", "How do I know how many CPUs, how much RAM, and how much time I'll need for my job? \n", "\n", "If not enough CPUs, will take longer. If not enough RAM, job will error out. If you request TOO MANY resources, system will give you what you ask for (and you'll be charged $$) which may limit where your job can run and keep others' jobs from running. \n", "\n", "NEED TO SPECIFY: CPU count, Memory, Walltime. \n", "If you don't, the scheduler allocates 1 CPU, 4 GB RAM, and 1 hour. \n", "\n", "Limits for a given Resource Pool: \n", "CPU:\n", "WALLTIME:\n", "\n", "DBIC pool: allowed to use a certain number of CPUs concurrently, submit jobs for a number of total days of run time. If limit is hit, nobody in group can submit jobs until something finishes. \n", "\n", "Intended to keep one group from occupying entire cluster. \n", "\n", "If you submit a job (and haven't reached a limit cap), scheduler should take about 30 seconds to decide where to put your job. If it doesn't start immediately, job goes into a queue. " ] }, { "cell_type": "markdown", "id": "1e8eb669", "metadata": {}, "source": [ "### Running Batch Jobs\n", "\n", "Most jobs on clusters are batch jobs. 
Text files that get run as a script. \n", "Submit a script.\n", "\n", "You submit a job by preparing a small shell script that has SBATCH directives. Script can be written in any shell you like but most examples use BASH - whatever you specify in shebang line (#!/bin/bash). SBATCH stops parsing directives at the first non-comment line, so put all of them at the top of the script.\n", "\n", "NODES (never need > 1 node) -\n", "\n", "TASKS/NODES (PROCESSES/NODE) - if your job is a call to one program (even if multi-threaded), it just does one task - only need 1 task here (which is the default). Never need more than 1 unless running an MPI job that knows how to run in parallel. \n", "\n", "CPUs/TASK (CORES/PROCESS) - can request more than 1 for multi-threaded/multi-process jobs. (e.g. fMRIprep can run multi-threaded, can specify 16 cpus/threads)\n", "\n", "RAM - hardest thing to estimate. Ideally you run a similar job in an unrestricted system and use top to see how much memory it allocates for the job, then request a similar amount. \n", "\n", "Walltime (doesn't hurt too much to overestimate but don't get wild LOL - will keep others' jobs from starting). Noticed your job isn't finished but walltime is running out? Can extend walltime (for up to 30 days!) but need to ask research computing to do this. Email research.computing@dartmouth.edu. You can use the SBATCH qnotify directive to let you know a certain amount of time before your job terminates.\n", "\n", "Other SBATCH directives: check out this link. https://slurm.schedmd.com/sbatch.html" ] }, { "cell_type": "markdown", "id": "8baa5f33", "metadata": {}, "source": [ "### Arrays: Running Multiple Batch Jobs at Once\n", "\n", "Job Arrays - if you have a lot of jobs you want to run that are very similar but differ on some parameter given to a different program or perhaps a data input file, then you can submit one job script with a job array that specifies the different variable for each job. 
Will launch the amount of jobs specified by the number of different variables listed in the array. (SID example). Can throttle - say no more than 10 at a time (nice to others :), will get to them as earlier jobs finish)." ] }, { "cell_type": "markdown", "id": "1ed260ad", "metadata": {}, "source": [ "### Running Interactive Jobs\n", "\n", "Like a remote session - useful for development, debugging, or interactive coding environments (R, Python, Matlab). " ] }, { "cell_type": "markdown", "id": "4f9425c5", "metadata": {}, "source": [ "### Monitoring Jobs \n", "\n", "while it's running: \n", "launch job - use squeue to see where it's running - then ssh to that node and use top or htop to see resource use. \n", "scontrol\n", "\n", "after it's finished: \n", "seff\n", "sacct" ] }, { "cell_type": "markdown", "id": "74b87b3d", "metadata": {}, "source": [ "### Scheduler Etiquette\n", "\n", "How can I figure out who's running jobs on Discovery? SLURM scheduling command - squeue: lists all active jobs on the cluster.\n", "\n", "User ID - dartmouth lookup" ] }, { "cell_type": "markdown", "id": "fb7966fc", "metadata": {}, "source": [ "## Can I use Jupyter Notebook on Discovery?" ] }, { "cell_type": "markdown", "id": "db953099", "metadata": {}, "source": [ "AskPBS.org link" ] }, { "cell_type": "markdown", "id": "fc1b3179", "metadata": {}, "source": [ "## Have Additional Discovery Questions?" 
] }, { "cell_type": "markdown", "id": "d2049409", "metadata": {}, "source": [ "### AskPBS\n", "\n", "\n", "### DBIC Handbook\n", "https://dbic-handbook.readthedocs.io/en/latest/discovery.html" ] }, { "cell_type": "markdown", "id": "989fc587", "metadata": {}, "source": [ "## References" ] }, { "cell_type": "markdown", "id": "d7b697b0", "metadata": {}, "source": [ "Knowledge Base\n", "\n", "DBIC Handbook\n", "\n", "YCRC YouTube" ] }, { "cell_type": "markdown", "id": "bf6209d5", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.13 (main, May 24 2022, 21:13:51) \n[Clang 13.1.6 (clang-1316.0.21.2)]" }, "vscode": { "interpreter": { "hash": "b0fa6594d8f4cbf19f97940f81e996739fb7646882a419484c72d19e05852a7e" } } }, "nbformat": 4, "nbformat_minor": 5 }