BurrMill
========

**_BurrMill_ is a software suite for cost-efficient number grinding on Google Cloud Platform.** It has been designed for running [Kaldi experiments](http://kaldi-asr.org/), but is equally suitable for workloads with a similar pattern.

Why cloud, and why GCP?
-----------------------

With our Kaldi work, we quickly ran out of capacity of a single machine with a 16-core Skylake-X processor (i9-7960X) and two 1080Ti GPUs. An average trial experiment took a couple of days to train, a relatively large model more than a week. At this point, the choice everyone faces is to either expand in-house computing capacity, or rent cloud resources from a provider, of which there are currently many. After some investigation and experimenting, we found that Google Cloud Platform is the best choice. The two main benefits that put it above other contenders were, first, very quick virtual machine staging (on average 40 seconds from a prepared disk image to a responsive CPU-only compute node, 80 seconds for a GPU node), and, second, the availability of _preemptible_ VMs, including GPUs, at a fixed and very attractive cost.

If you are considering a high-performance computation that can be expressed as a succession of **well-parallelizable, relatively short, independently sharded jobs**, you can put this suite to its best use. It was designed and is tuned primarily for Kaldi experiments, but you can adapt it to any computationally intensive load with a similar pattern. Read on.

### You need to know:

* The very basics of IP networking, what SSH is, and how it is used to work with remote machines. Advanced configuration topics will be covered, but you must have used it already.
* Basics of the Un*x shell command line. Assuming you have experience running Kaldi, you likely know enough.

### You need to have:

1. Started [the _BurrMill 101_ crash course](https://100d.space/burrmill-101). The first post is an introduction, and posts 2 through 7 are walkthroughs.
2. A Google account, protected with [2-factor authentication](https://support.google.com/accounts/answer/185839).
3. A credit card or other means of setting up a billing account. When you first use GCP and enable billing, Google gives you (as of this moment) a credit of $300, good for 12 months. This is all covered in _BurrMill 101_.
4. A Web browser. GCP provides a small Debian-based virtual machine, free of charge, which you can use from the browser directly, [for up to 50 hours a week](https://cloud.google.com/shell/docs/limitations). This is the most secure and recommended way to run administrative tools such as BurrMill. Besides, you need the GCP Web interface for the initial setup of GCP and billing accounts, for monitoring the performance of your cluster, and for one-off tasks. But I'm sure you have one.
5. A computer from which you will connect to the cluster to run experiments, with the [Google Cloud SDK](https://cloud.google.com/sdk/install#installation_options) and an SSH terminal emulator. It can be any OS matching the SDK requirements; Cloud Shell is not suitable for daily work, and has a usage limit. Linux obviously qualifies. For Windows, WSL with a Debian distribution works (but is a bit sluggish when running Cloud SDK tools), or use a native Windows build of the Cloud SDK with a Windows-based SSH client: there are free clients, [PuTTY](https://www.chiark.greenend.org.uk/~sgtatham/putty/) and [Windows Terminal](https://www.microsoft.com/p/wt/9n0dx20hk701), and probably others, as well as a few commercial ones. This may provide a better experience than WSL.
   Mac... I do not know anything about them. Chime in and tell me (or, even better, send a PR to this file).
6. Optionally, your own Linux computer, if you are extending the BurrMill-packaged software inventory and need to debug its build process. You can also start a VM in the cloud for that. Regular BurrMill builds are performed entirely in the cloud, by the Cloud Build service.
7. Optionally, if you want to run BurrMill commands locally, a Unix-like system with Bash 4.2+ and a few other tools. For a Mac, this means you need [Homebrew](https://formulae.brew.sh/formula/bash#default). On Windows, WSL works, if a bit sluggishly when running Cloud SDK commands.

Design goals and constraints
----------------------------

### Lowest possible cost

A new compute node in one of the configurations that you define is created on demand when a new batch is submitted, no current nodes are able to accommodate the load, and the configured maximum number of nodes is not exceeded. The node is deleted entirely when it is no longer required. You can tune some variables, such as the size of the shared NFSv4 filer machine and the size/performance of the working disk, depending on the size of your experiments.

The power control utility switches the cluster between "low" and "full" power (and cost): you run a computation at full power, but you can debug your setup, launch a small number (1-3 at a time) of test jobs, and prepare experiments and analyze results in the low-power mode just fine. The cost of running the two machines, the login and NFS server nodes, in the "low power" mode is $30-$50 a month even if you leave them on around the clock (you probably won't); the cluster is billed per second of runtime only while it is "powered on," except disks, which are billed for as long as they exist.

The same tool controls the cluster's power on/off state. The transition takes 2-3 minutes, like booting a desktop computer would. You may also put the cluster into a _hibernation state_, keeping only a snapshot of its main disk in stowage, but ready to wake up. This is the recommended mode if you use _BurrMill_ only to run some of the larger experiments, and get by with your in-house setup otherwise. The hibernation state incurs some storage charges, but they are quite small. The transition may take 5 to 20 minutes, depending on the amount of data on the NFS disk.

### Little or no security inside the firewall

_BurrMill_ has been designed under an assumption of *no role-based or account-based access control whatsoever,* to achieve the highest possible throughput. The reasons are explained in the documentation, and have to do with the way Munge security works in Slurm. All files are stored on the central file server under the same UID/GID, and are readable and writable by anyone.

This is not a crippling restriction: BurrMill clusters are designed for private use by an individual or a tight team of coworkers; they are built around a single NFS server, which has a limited (although quite formidable) throughput. If you really want to isolate the workspaces of more than one team, nothing prevents you from running multiple clusters, or even multiple projects. Given that GCP is very flexible (although a bit daunting) in its security controls, you may share computing images and data buckets between teams, and have a separate cluster for each of them, with independent power controls.
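
As a rough sketch of such an arrangement (the project IDs, names and the billing account ID below are made up for illustration; BurrMill does not require this particular layout), each team could get its own project and its own `gcloud` configuration:

```bash
# One GCP project per team; IDs and names here are placeholders.
gcloud projects create team-asr-burrmill --name="ASR team BurrMill"
gcloud projects create team-tts-burrmill --name="TTS team BurrMill"

# A project must be linked to a billing account before it can run VMs
# (repeat for the second project).
gcloud beta billing projects link team-asr-burrmill \
    --billing-account=XXXXXX-XXXXXX-XXXXXX

# A named gcloud configuration per team keeps the setups from interfering;
# 'configurations create' also activates the new configuration.
gcloud config configurations create team-asr
gcloud config set project team-asr-burrmill
```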
If you are after managing multiple projects with different access rights, check out [GCP Cloud Identity and Organizations](https://cloud.google.com/resource-manager/docs/quickstart-organizations): the organization is free of charge, you only need to own a domain (you can get one from the very same Google for $8 a year).

Uh, interesting, but what do all these words mean?
--------------------------------------------------

### Kaldi: workload pattern

[Kaldi](https://kaldi-asr.org/) is a leading toolkit in ASR research. If you have to ask, then probably what you _really_ want to know is its workload pattern, and whether adopting it is feasible for you. If you are familiar with Slurm: Kaldi under _BurrMill_ launches batch jobs, most often arrays, with (our patched) `sbatch --wait` command, and sends a script to its stdin. Kaldi experiment scripts shard the workload into units of work, mere shell commands (simple commands or pipelines), launch them in the cluster in parallel, then wait for their completion. Note that *Kaldi does not use MPI;* every machine churns through its own shard of data. This is important, because any work unit can be restarted independently of the others, and that is where the much cheaper preemptible VM instances really shine.

Another difference from a typical HPC load is that jobs come and go very quickly, so it is important to waste as little time as possible between jobs. It is also more efficient to plan for jobs that execute quickly (from seconds to 30 minutes), to take full advantage of the preemptible VM discounts, at the very least for computations on GPUs, which are expensive.

**Note** that you do not have to use preemptible instances, for example if your budget allows it, or if your jobs require that machines stay alive for a while (e.g., longer-running payloads depending on MPI). You may still use Slurm as usual, and let _BurrMill_ bring the nodes up and down as required by your workload.

### GCP, or Google Cloud Platform

I bet you have heard about it. In a nutshell, you rent virtual machines in your desired configuration; it is as simple as that. CPU, GPU and RAM are quoted per unit×hour of use but charged at one-second granularity; provisioned disks per GB×hour; storage buckets and snapshots also per GB×hour, but at a lower rate. It is important that you understand how to optimize your cost by correctly sizing some items in your setup, and how to secure your computing rig, lest it be hijacked by cryptocurrency miners. We have [a crash course](https://100d.space/burrmill-101) to get you up and running. GCP is overwhelming for a beginner. _BurrMill_ was designed to help with this complexity, partially by hiding it, partially by driving you to understand the platform.

### Preemptible VMs

GCP has two types of VM runtime policy: permanent and [preemptible](https://cloud.google.com/compute/docs/instances/preemptible). In either case you are billed for the total uptime of the machine, but at a very different rate: preemptible VMs are charged at 30% (±0.5%) of the cost of an equivalent permanent VM, including CPU, GPU and RAM. This is a huge advantage for restartable batch loads! Permanent machines are not stopped by GCP on their own once booted; GCP can even migrate a VM during hardware maintenance, and the move takes less than a second, so you will likely not even notice.
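
For illustration, this is roughly how the two kinds of VM are requested with the Cloud SDK (a sketch only; the machine type, zone and image below are placeholders, not what BurrMill actually provisions). The only difference is the `--preemptible` flag:

```bash
# A regular (permanent) VM: runs until you stop or delete it.
gcloud compute instances create node-std \
    --zone=us-central1-a --machine-type=n1-standard-8 \
    --image-family=debian-11 --image-project=debian-cloud

# The same machine as a preemptible VM: roughly 30% of the price,
# but GCP may stop it at any time with 30 seconds of notice.
gcloud compute instances create node-pre \
    --zone=us-central1-a --machine-type=n1-standard-8 \
    --image-family=debian-11 --image-project=debian-cloud \
    --preemptible
```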
A preemptible machine, on the other hand, can be stopped by GCP at any time with only 30 seconds of advance notice: when hardware maintenance is needed or, more often, when the resources are requested by other users. The loss rate is in fact very low.

### Slurm

[Slurm](https://github.com/SchedMD/slurm), formerly known as SLURM, is a resource manager and job scheduler widely used in the high-performance computing (HPC) world to control physical supercomputers. Slurm has enough cloud-oriented features to make efficient use of the cloud environment. Slurm is open source and GPL-licensed, and is actively maintained and developed by [SchedMD](https://schedmd.com/); they also provide paid support for supercomputer operators. Full [Slurm documentation](https://slurm.schedmd.com/) is available on their site. SchedMD also provides a [set of scripts](https://github.com/SchedMD/slurm-gcp) to run Slurm entirely on GCP, but it is more a demonstration of the possibilities than a way to run day-to-day experiments cost-efficiently.

Ballpark costs (Kaldi)
----------------------

In our experience, the largest contributing factor to the total expense was the charge for the GPUs. As an estimate, a very large TDNN-F model that took 48 hours to train on GPUs (ramping up its Tesla P100 GPU count from 3 to 18), and 52 hours total with feature and i-vector extraction, shuffling and decoding, cost about $240. Of this, $190 was charged for the use of 438 GPU-hours, or about $0.43 per GPU-hour. This is ≈80% of the total, and the rest splits roughly evenly between compute-node CPU, RAM and disk, and the (non-preemptible) CPU, RAM and disk of the 3 control machines. Another factor is that the Tesla P100 is at least 20% faster under a Kaldi training load than the stock GTX-1080Ti GPU (they share the GPU chip, but the P100 has very different memory), everything else being equal. Given that these factors nearly cancel out, a training run will cost close to $0.46 per your estimated 1080Ti GPU-hour, lock, stock and barrel. Put another way, the 20% GPU speedup nearly absorbs the extra 25% expense of "everything else" but the GPU. YMMV; you will correct the cost estimation factor after your first sizable run. Start with 1.05.

As a reference point for the preemptible node loss rate, there were 34 preemption events altogether during these 52 hours. Given that at times there were 60+ active nodes, this is quite low. Currently, we are not using the 30-second warning at all; there is a potential to improve job rescheduling time by listening for it, which we plan to do eventually.

We are experimenting with training on the T4 GPUs, offered at an obscenely low price of $0.11/hour. This may be the best choice if you are learning on your own and paying out of your own pocket. A comparison will be linked from here soon; but being 4 times cheaper, they are certainly far from being 4 times slower!

Documentation
-------------

...is in progress. If you have little or no experience with GCP, you should read [the _BurrMill 101_ crash course](https://100d.space/burrmill-101) before even attempting to set up your GCP account (yes, there are more and less convoluted ways!). I am writing more in the same blog, with the idea that it will eventually develop into the documentation in this repository. If you know enough of GCP and are eager to dig deeper, feel free to join the [BurrMill Q&A forum](https://groups.google.com/forum/#!forum/burrmill-users), so that the communication will be in the open and available to other early adventurers.
And, needless to say, help with writing the documentation would be awesome! Collaboration on any part is more than welcome!

---

_BurrMill_ is licensed under the [Apache License, Version 2.0](LICENSE)

Copyright 2020 Kirill 'kkm' Katsnelson
Copyright 2020 BurrMill Contributors