---
aliases:
- /2018/05/shared-dask-kubernetes-jetstream
categories:
- jupyterhub
- jetstream
- dask
date: 2018-05-04 18:00
layout: post
slug: shared-dask-kubernetes-jetstream
title: Launch a shared dask cluster in Kubernetes alongside JupyterHub on Jetstream
---

Let's assume we already have a Kubernetes deployment with JupyterHub installed, see for example my [previous tutorial on Jetstream](https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html).

Now that users can log in and access a Jupyter Notebook, we would also like to provide them with more computing power for their interactive data exploration. The easiest way is through [`dask`](https://dask.pydata.org): we can launch a scheduler and any number of workers as containers inside Kubernetes, so that users can leverage the computing power of many Jetstream instances at once.

There are two main strategies: we can give each user their own dask cluster with exclusive access, which is more performant but causes quick spikes in the usage of the Kubernetes cluster, or we can launch a single shared cluster and give all users access to it. In this tutorial we cover the second scenario; we'll cover the first in a following tutorial.

We will first deploy JupyterHub through the Zero-to-JupyterHub guide, then launch a fixed-size dask cluster via Helm and show how users can connect, submit distributed Python jobs and monitor their execution on the dashboard.

The configuration files mentioned in the tutorial are available in the GitHub repository [zonca/jupyterhub-deploy-kubernetes-jetstream](https://github.com/zonca/jupyterhub-deploy-kubernetes-jetstream).

## Deploy JupyterHub

First we start from the JupyterHub on Jetstream with Kubernetes deployment described in the previous tutorial linked above. Optionally, for testing purposes, we can simplify the deployment by skipping permanent storage; if this is an option for you, see the relevant section below.

We want to install JupyterHub in the `pangeo` namespace with the release name `jupyter`, so replace the `helm install` line in the tutorial with:

```
sudo helm install --name jupyter jupyterhub/jupyterhub -f config_jupyterhub_pangeo_helm.yaml --namespace pangeo
```

The `pangeo` configuration file uses a different single-user image which has the right version of `dask` for this tutorial.

## (Optional) Simplify deployment using ephemeral storage

Instead of installing and configuring rook, we can temporarily disable permanent storage to make the setup quicker and easier to maintain. In the JupyterHub configuration `yaml` set:

```
hub:
  db:
    type: sqlite-memory

singleuser:
  storage:
    type: none
```

Now every time a user container is killed and restarted all its data are gone; this is good enough for testing purposes.

## Configure GitHub authentication

Follow the instructions in the Zero-to-JupyterHub documentation; at the end you should have in the YAML:

```
auth:
  type: github
  admin:
    access: true
    users: [zonca, otherusername]
  github:
    clientId: "xxxxxxxxxxxxxxxxxxxx"
    clientSecret: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    callbackUrl: "https://js-xxx-xxx.jetstream-cloud.org/hub/oauth_callback"
```

## Test JupyterHub

Connect to the master node with your browser at:

`https://js-xxx-xxx.jetstream-cloud.org`

Log in with your GitHub credentials and you should get a Jupyter Notebook. You can also check that your pod is running:

```
sudo kubectl get pods -n pangeo
NAME            READY     STATUS    RESTARTS   AGE
jupyter-zonca   1/1       Running   0          2m
......other pods
```

## Install Dask

We want to deploy a single dask cluster that all the users can submit jobs to.
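The dask Helm chart is configured through a values file where we set the number of workers and their resource limits. As a rough, hypothetical sketch of what such a values file can look like (the exact keys depend on the chart version; `dask_shared/dask_config.yaml` in the repository is the file actually used in this tutorial):

```
# Hypothetical sketch of a values file for the stable/dask Helm chart:
# number of workers and per-worker resource limits for a small test cluster.
# The file actually used in this tutorial is dask_shared/dask_config.yaml
# in the repository; prefer that one.
worker:
  replicas: 3
  resources:
    limits:
      cpu: 1
      memory: 1G
    requests:
      cpu: 1
      memory: 1G
```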
Customize the `dask_shared/dask_config.yaml` file available in the repository: for testing purposes I set limits of just 1 GB of RAM and 1 CPU on each of 3 workers. We can change the workers' `replicas` to add more. Then install the chart:

```
sudo helm install stable/dask --name=dask --namespace=pangeo -f dask_config.yaml
```

Then check that the `dask` instances are running:

```
$ sudo kubectl get pods --namespace pangeo
NAME                              READY     STATUS    RESTARTS   AGE
dask-jupyter-647bdc8c6d-mqhr4     1/1       Running   0          22m
dask-scheduler-5d98cbf54c-4rtdr   1/1       Running   0          22m
dask-worker-6457975f74-dqhsh      1/1       Running   0          22m
dask-worker-6457975f74-lpvk4      1/1       Running   0          22m
dask-worker-6457975f74-xzcmc      1/1       Running   0          22m
hub-7f75b59fc5-8c2pg              1/1       Running   0          6d
jupyter-zonca                     1/1       Running   0          10m
proxy-6bbf67f6bd-swt7f            2/2       Running   0          6d
```

### Access the scheduler and launch a distributed job

`kube-dns` gives a name to each service and automatically propagates it to each pod, so from a Jupyter Notebook we can connect to the scheduler by name:

```
from dask.distributed import Client
client = Client("dask-scheduler:8786")
client
```

Now we can access the 3 workers that we launched before:

```
Client
  Scheduler: tcp://dask-scheduler:8786
  Dashboard: http://dask-scheduler:8787/status

Cluster
  Workers: 3
  Cores: 6
  Memory: 12.43 GB
```

We can run an example computation with dask array:

```
import dask.array as da
x = da.random.random((20000, 20000), chunks=(2000, 2000)).persist()
x.sum().compute()
```

### Access the Dask dashboard for monitoring job execution

We need to set up an ingress so that a path points to the Dask dashboard instead of JupyterHub. Check out the file `dask_shared/dask_webui_ingress.yaml` in the repository: it routes the path `/dask` to the `dask-scheduler` service (a rough sketch of such an ingress is shown at the end of this section). Create the ingress resource with:

```
sudo kubectl create -n pangeo -f dask_webui_ingress.yaml
```

All users can now access the dashboard at `https://js-xxx-xxx.jetstream-cloud.org/dask/status/`:

* Make sure to use `/dask/status/` and not only `/dask`.

Currently this is not authenticated, so this address is publicly available. A simple way to hide it is to choose a custom name instead of `/dask` and edit the ingress accordingly with:

```
sudo kubectl edit ingress dask -n pangeo
```
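For reference, here is a minimal, hypothetical sketch of what an ingress like `dask_webui_ingress.yaml` might look like. It assumes the `extensions/v1beta1` ingress API and the nginx ingress controller, and the backend service name and port are guesses based on the scheduler service used above; double check against the file in the repository and against `sudo kubectl get svc -n pangeo`:

```
# Hypothetical sketch of dask_webui_ingress.yaml: route /dask to the Dask
# scheduler's dashboard. Service name and port are assumptions; verify them
# with "sudo kubectl get svc -n pangeo" and compare with the repository file.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: dask
  namespace: pangeo
  annotations:
    kubernetes.io/ingress.class: "nginx"
spec:
  rules:
  - http:
      paths:
      - path: /dask
        backend:
          serviceName: dask-scheduler
          servicePort: 8787
```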