---
aliases:
- /2018/09/jetstream_kubernetes_kubespray
- /2018/09/jetstream-kubernetes-kubespray
categories:
- kubernetes
- jetstream
date: 2018-09-23 18:00
layout: post
slug: kubernetes-jetstream-kubespray
title: Deploy Kubernetes on Jetstream with Kubespray 1/3
---

**Note**: Jetstream 1 has been retired. For current Kubernetes deployments, see the [Jetstream 2 documentation](https://docs.jetstream-cloud.org/).

**This tutorial is obsolete, check the [updated version of the tutorial](https://zonca.dev/2020/06/kubernetes-jetstream-kubespray.html)**

The purpose of this tutorial series is to deploy JupyterHub on top of Kubernetes on Jetstream.

This material was presented as a tutorial at the Gateways 2018 conference, see also [the slides on Figshare](https://figshare.com/articles/Hands-on_Tutorial_Deploying_Kubernetes_and_JupyterHub_on_Jetstream/7137884).

Compared to my [initial tutorial](https://zonca.github.io/2017/12/scalable-jupyterhub-kubernetes-jetstream.html), I focused on improving automation. Instead of creating Jetstream instances via the Atmosphere web interface and then SSHing into the instances to run `kubeadm`-based commands to set up Docker and Kubernetes, we will:

* Use the Terraform recipe that is part of the `kubespray` project to interface with the Jetstream API and create a cluster of virtual machines
* Run the `kubespray` Ansible recipe to set up a production-ready Kubernetes deployment, optionally with High Availability features like redundant master nodes and much more, see [kubespray.io](http://kubespray.io)

## Create Jetstream Virtual Machines with Terraform

`kubespray` is able to deploy production-ready Kubernetes clusters and initially targeted only commercial cloud platforms. Support for OpenStack was recently added via a Terraform recipe, available in [their GitHub repository](https://github.com/kubernetes-incubator/kubespray/tree/master/contrib/terraform/openstack).

Terraform executes recipes that describe a set of OpenStack resources and their relationships. In the context of this tutorial, we do not need to learn much about Terraform: we will configure and execute the recipe provided by `kubespray`.

### Requirements

On Ubuntu 18.04, install `python3-openstackclient` with APT; any other platform works as well. Also install `terraform` by downloading the binary for your platform and copying it to `/usr/local/bin/`. The current version of the recipe requires Terraform `0.11.x`, **not the newest 0.12**.

### Request API access

**Note**: Jetstream 1 documentation and services are no longer available.

In order to make sure your XSEDE account can access the Jetstream API, you need to contact the Helpdesk (the Jetstream 1 wiki is no longer available).

Log in to the TACC Horizon panel (the Jetstream 1 dashboard is no longer available); this is basically the low-level web interface to OpenStack, a lot more complex and powerful than Atmosphere.

First choose the project you would like to charge to in the top dropdown menu (see the XSEDE website if you don't recognize the grant code).

Click on Compute / API Access and download the OpenRC V3 authentication file to your machine. Source it by typing:

    source XX-XXXXXXXX-openrc.sh

It should ask for your TACC password. This configures all the environment variables needed by the `openstack` command line tool to interface with the OpenStack API. Test with:

    openstack flavor list

This should return the list of available "sizes" of the Virtual Machines.
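Before moving on, it can be useful to double-check that the credentials are actually loaded and that the CLI can reach the API. A minimal sanity check, assuming the standard `OS_*` variable names set by the OpenRC V3 file:

```
# confirm the OpenStack credentials are loaded in the environment
env | grep OS_AUTH_URL
env | grep OS_PROJECT

# a couple more read-only calls that exercise the API
openstack image list | head
openstack network list
```

If these commands hang or return authentication errors, re-source the OpenRC file before continuing.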
### Clone kubespray

I had to make a few modifications to `kubespray` to adapt it to Jetstream and to backport bug fixes not merged yet, so for now it is better to use my fork of `kubespray`:

    git clone https://github.com/zonca/jetstream_kubespray

See an [overview of my changes compared to the standard `kubespray` release 2.6.0](https://github.com/zonca/jetstream_kubespray/pull/2).

### Run Terraform

Inside `jetstream_kubespray`, copy from my template:

    export CLUSTER=$USER
    cp -LRp inventory/zonca_kubespray inventory/$CLUSTER
    cd inventory/$CLUSTER

Open and modify `cluster.tf`: choose your image and the number of nodes. Make sure to change the network name to something unique, like the expanded form of `$CLUSTER_network`. You can find suitable images (they need to be JS-API-Featured; you cannot use the same images used in Atmosphere) with:

    openstack image list | grep "JS-API"

I already preconfigured the network UUID for both IU and TACC, but you can cross-check by looking for the `public` network in:

    openstack network list

Initialize Terraform:

    bash terraform_init.sh

Create the resources:

    bash terraform_apply.sh

The last output of Terraform should contain the IP of the master node, `k8s_master_fips`. Wait for it to boot, then SSH in with:

    ssh ubuntu@$IP

or `centos@$IP` for CentOS images.

Inspect the resources created through OpenStack:

    openstack server list
    openstack network list

You can clean up the virtual machines and all other OpenStack resources (all data is lost) with `bash terraform_destroy.sh`.

## Install Kubernetes with `kubespray`

Change folder back to the root of the `jetstream_kubespray` repository. First make sure you have a recent version of `ansible` installed; you also need additional modules, so first run:

    pip install -r requirements.txt

It is useful to create a `virtualenv` and install the packages inside it. This will also install `ansible`; it is important to install `ansible` with `pip` so that the path to its modules is correct, so remove any pre-installed `ansible`.

Then, following the [`kubespray` documentation](https://github.com/kubernetes-incubator/kubespray/blob/master/contrib/terraform/openstack/README.md#ansible), we set up `ssh-agent` so that `ansible` can SSH from the machine with a public IP to the others:

    eval $(ssh-agent -s)
    ssh-add ~/.ssh/id_rsa

Test the connection through Ansible:

    ansible -i inventory/$CLUSTER/hosts -m ping all

If a server is not answering to ping, first try to reboot it:

    openstack server reboot $CLUSTER-k8s-node-nf-1

or delete it and run `terraform_apply.sh` to create it again.

Check `inventory/$CLUSTER/group_vars/all.yml`, in particular `bootstrap_os`: I set it to `ubuntu`, change it to `centos` if you used the CentOS 7 base image.

Due to a bug in the recipe, run (see details in the Troubleshooting notes below):

    export OS_TENANT_ID=$OS_PROJECT_ID

Finally run the full playbook; it is going to take a good 10 minutes:

    ansible-playbook --become -i inventory/$CLUSTER/hosts cluster.yml

If the playbook fails with "cannot lock the administrative directory", the Virtual Machine is automatically updating packages and has locked the APT directory. Just wait a minute and launch it again; it is always safe to run `ansible` multiple times.

If the playbook gives any other error, retry the command above: sometimes tasks fail temporarily, and Ansible is designed to be executed multiple times with consistent results.
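If you come back to the deployment in a new shell session, the environment setup above is easy to lose. Purely as a convenience, the same steps can be collected into a small re-runnable script (a sketch; the virtualenv path `~/venv-kubespray` is an assumption, and the OpenRC filename is the placeholder from your download):

```
#!/bin/bash
# re-create the environment needed for the kubespray Ansible run
source ~/venv-kubespray/bin/activate   # virtualenv with ansible + requirements.txt
source ~/XX-XXXXXXXX-openrc.sh         # OpenStack credentials (asks for the TACC password)
export OS_TENANT_ID=$OS_PROJECT_ID     # workaround, see Troubleshooting notes below
export CLUSTER=$USER

# load the SSH key so Ansible can hop from the public-IP node to the others
eval $(ssh-agent -s)
ssh-add ~/.ssh/id_rsa

# check connectivity first, then run the full playbook
ansible -i inventory/$CLUSTER/hosts -m ping all && \
  ansible-playbook --become -i inventory/$CLUSTER/hosts cluster.yml
```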
You should now have a Kubernetes cluster running, test it:

```
$ ssh ubuntu@$IP
$ kubectl get pods --all-namespaces
NAMESPACE       NAME                                                    READY     STATUS    RESTARTS   AGE
cert-manager    cert-manager-78fb746bc7-w9r94                           1/1       Running   0          2h
ingress-nginx   default-backend-v1.4-7795cd847d-g25d8                   1/1       Running   0          2h
ingress-nginx   ingress-nginx-controller-bdjq7                          1/1       Running   0          2h
kube-system     kube-apiserver-zonca-kubespray-k8s-master-1             1/1       Running   0          2h
kube-system     kube-controller-manager-zonca-kubespray-k8s-master-1    1/1       Running   0          2h
kube-system     kube-dns-69f4c8fc58-6vhhs                               3/3       Running   0          2h
kube-system     kube-dns-69f4c8fc58-9jn25                               3/3       Running   0          2h
kube-system     kube-flannel-7hd24                                      2/2       Running   0          2h
kube-system     kube-flannel-lhsvx                                      2/2       Running   0          2h
kube-system     kube-proxy-zonca-kubespray-k8s-master-1                 1/1       Running   0          2h
kube-system     kube-proxy-zonca-kubespray-k8s-node-nf-1                1/1       Running   0          2h
kube-system     kube-scheduler-zonca-kubespray-k8s-master-1             1/1       Running   0          2h
kube-system     kubedns-autoscaler-565b49bbc6-7wttm                     1/1       Running   0          2h
kube-system     kubernetes-dashboard-6d4dfd56cb-24f98                   1/1       Running   0          2h
kube-system     nginx-proxy-zonca-kubespray-k8s-node-nf-1               1/1       Running   0          2h
kube-system     tiller-deploy-5c688d5f9b-fpfpg                          1/1       Running   0          2h
```

Check that all of those services are running in your cluster as well.

We have also configured NGINX to proxy any service that we will later deploy on Kubernetes, test it with:

```
$ wget localhost
--2018-09-24 03:01:14--  http://localhost/
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2018-09-24 03:01:14 ERROR 404: Not Found.
```

Error 404 is a good sign: the service is up and serving requests, it just has nothing to deliver yet.

Finally, test that the routing through the Jetstream instance is working correctly by opening your browser and checking that accessing `js-XX-XXX.jetstream-cloud.org` also returns a `default backend - 404` message. If any of these tests hangs or cannot connect, there is probably a networking issue.

## Next

Next you can [explore the Kubernetes deployment to learn more about how you deploy resources in the second part of my tutorial](https://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-explore.html), or skip it and proceed directly to the [third and final part of the tutorial to deploy JupyterHub and configure it with HTTPS](http://zonca.github.io/2018/09/kubernetes-jetstream-kubespray-jupyterhub.html).

### Troubleshooting notes

For future reference; you can disregard this section.

Failing Ansible task `openstack_tenant_id is missing`: worked around with `export OS_TENANT_ID=$OS_PROJECT_ID`. This should become unnecessary once the upstream fix is merged; in any case it is not blocking.

Failing task `Write cacert file`: I had to cherry-pick a commit into my fork; this will be unnecessary once it is fixed upstream.

## (Optional) Setup kubectl locally

We also set `kubectl_localhost: true` and `kubeconfig_localhost: true`, so that `kubectl` is installed on your local machine and `admin.conf` is copied to:

    inventory/$CLUSTER/artifacts

Now copy that file to `~/.kube/config`. There is one issue: it contains the internal IP of the Jetstream master node. We cannot simply replace it with the public floating IP, because the certificate is not valid for that address. The best workaround is to replace it with `127.0.0.1` at the `server:` key inside `~/.kube/config`.
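For example, the copy and the `server:` edit could look like this (a sketch; the `sed` expression assumes the `server:` line has the form `https://<internal-ip>:6443`, check your own file first):

```
mkdir -p ~/.kube
cp inventory/$CLUSTER/artifacts/admin.conf ~/.kube/config
# point kubectl at the SSH tunnel endpoint instead of the internal IP
sed -i 's|server: https://.*:6443|server: https://127.0.0.1:6443|' ~/.kube/config
```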
Then make an SSH tunnel:

    ssh ubuntu@$IP -f -L 6443:localhost:6443 sleep 3h

* `-f` sends the process to the background
* executing `sleep` for 3 hours makes the tunnel close automatically after 3 hours; alternatively, `-N` would keep the tunnel open permanently

## (Optional) Setup helm locally

SSH into the master node and check the Helm version with:

    helm version

Download the same binary version from [the release page on GitHub](https://github.com/helm/helm/releases) and copy the binary to `/usr/local/bin`. Then verify that the local client can reach the cluster (through the SSH tunnel above):

    helm ls
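As a concrete example, if the master node reports Helm `v2.9.1`, the local install could look like the sketch below (the version number and download URL are assumptions; use whatever `helm version` reports and the matching link from the GitHub release page):

```
HELM_VERSION=v2.9.1   # match the version running on the cluster
curl -LO https://storage.googleapis.com/kubernetes-helm/helm-${HELM_VERSION}-linux-amd64.tar.gz
tar xzf helm-${HELM_VERSION}-linux-amd64.tar.gz
sudo cp linux-amd64/helm /usr/local/bin/

# client and server versions should match; `helm ls` needs the
# ~/.kube/config and SSH tunnel configured in the previous section
helm version
helm ls
```

An empty list from `helm ls` is expected on a fresh cluster; an error about reaching Tiller usually means the tunnel or the kubeconfig is not set up correctly.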