{ "cells": [ { "cell_type": "markdown", "id": "2248a2d8-bd70-4ac3-8fa8-7beeb4a2a8ae", "metadata": { "collapsed": true, "editable": true, "jupyter": { "outputs_hidden": true }, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "(kuberay-raycluster-quickstart)=\n", "\n", "# RayCluster Quickstart\n", "\n", "This guide shows you how to manage and interact with Ray clusters on Kubernetes.\n", "\n", "## Preparation\n", "\n", "* Install [kubectl](https://kubernetes.io/docs/tasks/tools/#kubectl) (>= 1.23), [Helm](https://helm.sh/docs/intro/install/) (>= v3.4) if needed, [Kind](https://kind.sigs.k8s.io/docs/user/quick-start/#installation), and [Docker](https://docs.docker.com/engine/install/).\n", "* Make sure your Kubernetes cluster has at least 4 CPU and 4 GB RAM.\n", "\n", "## Step 1: Create a Kubernetes cluster\n", "\n", "This step creates a local Kubernetes cluster using [Kind](https://kind.sigs.k8s.io/). If you already have a Kubernetes cluster, you can skip this step." ] }, { "cell_type": "code", "execution_count": 1, "id": "752cb733-ab8f-4fe0-b7a4-3aede9eee30d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-ignore-output", "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Creating cluster \"kind\" ...\n", " \u001b[32mโœ“\u001b[0m Ensuring node image (kindest/node:v1.26.0) ๐Ÿ–ผ\n", " \u001b[32mโœ“\u001b[0m Preparing nodes ๐Ÿ“ฆ 7l\n", " \u001b[32mโœ“\u001b[0m Writing configuration ๐Ÿ“œ7l\n", " \u001b[32mโœ“\u001b[0m Starting control-plane ๐Ÿ•น๏ธ7l\n", " \u001b[32mโœ“\u001b[0m Installing CNI ๐Ÿ”Œ7l\n", " \u001b[32mโœ“\u001b[0m Installing StorageClass ๐Ÿ’พ7l\n", "Set kubectl context to \"kind-kind\"\n", "You can now use your cluster with:\n", "\n", "kubectl cluster-info --context kind-kind\n", "\n", "Have a nice day! ๐Ÿ‘‹\n" ] } ], "source": [ "kind create cluster --image=kindest/node:v1.26.0" ] }, { "cell_type": "markdown", "id": "9ef5b3ae-3404-4bb5-8894-f2719f39f4d6", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Step 2: Deploy a KubeRay operator\n", "\n", "Follow [this document](kuberay-operator-deploy) to install the latest stable KubeRay operator from the Helm repository." ] }, { "cell_type": "code", "execution_count": 2, "id": "648041da-62ac-45b3-bf9e-8bd6433f94d7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell", "nbval-ignore-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NAME: kuberay-operator\n", "LAST DEPLOYED: Tue Mar 18 16:26:18 2025\n", "NAMESPACE: default\n", "STATUS: deployed\n", "REVISION: 1\n", "TEST SUITE: None\n", "deployment.apps/kuberay-operator condition met\n" ] } ], "source": [ "../scripts/doctest-utils.sh install_kuberay_operator" ] }, { "cell_type": "markdown", "id": "f475edda-b74f-4a36-b9cf-69f9b3c5ff74", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "(raycluster-deploy)=\n", "## Step 3: Deploy a RayCluster custom resource\n", "\n", "Once the KubeRay operator is running, you're ready to deploy a RayCluster. Create a RayCluster Custom Resource (CR) in the `default` namespace." ] }, { "cell_type": "code", "execution_count": 3, "id": "58098e44-1405-45af-9e49-551f5f521c9d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-ignore-output", "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NAME: raycluster\n", "LAST DEPLOYED: Tue Mar 18 16:26:52 2025\n", "NAMESPACE: default\n", "STATUS: deployed\n", "REVISION: 1\n", "TEST SUITE: None\n" ] } ], "source": [ "helm install raycluster kuberay/ray-cluster --version 1.3.0" ] }, { "cell_type": "markdown", "id": "e2bb64fd-13b8-4369-b8d5-d83fbc3803d2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Use `helm install raycluster kuberay/ray-cluster --version 1.3.0 --set 'image.tag=2.41.0-aarch64'` instead if are using ARM64 (Apple Silicon) machines." ] }, { "cell_type": "code", "execution_count": 4, "id": "97f09ebc-8806-4375-87fb-3479a5c91faf", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "remove-cell", "nbval-ignore-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "raycluster.ray.io/raycluster-kuberay condition met\n" ] } ], "source": [ "kubectl wait --for=condition=RayClusterProvisioned raycluster/raycluster-kuberay --timeout=500s" ] }, { "cell_type": "markdown", "id": "1319599c-9203-476f-b9ad-1fe9b72e1549", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Once the RayCluster CR has been created, you can view it by running:" ] }, { "cell_type": "code", "execution_count": 5, "id": "44a213ca-6134-4d17-ba61-b09229580eda", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NAME DESIRED WORKERS AVAILABLE WORKERS CPUS MEMORY GPUS STATUS AGE\n", "raycluster-kuberay 1 1 2 3G 0 ready 55s\n" ] } ], "source": [ "kubectl get rayclusters" ] }, { "cell_type": "markdown", "id": "d29106be-e955-4238-8bfa-deed0c7d4904", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The KubeRay operator detects the RayCluster object and starts your Ray cluster by creating head and worker pods. To view Ray clusterโ€™s pods, run the following command:" ] }, { "cell_type": "code", "execution_count": 6, "id": "52832255-1dbf-44d2-adbb-a04f591eec4f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NAME READY STATUS RESTARTS AGE\n", "raycluster-kuberay-head-k7rlq 1/1 Running 0 56s\n", "raycluster-kuberay-workergroup-worker-65zl8 1/1 Running 0 56s\n" ] } ], "source": [ "# View the pods in the RayCluster named \"raycluster-kuberay\"\n", "kubectl get pods --selector=ray.io/cluster=raycluster-kuberay" ] }, { "cell_type": "markdown", "id": "710b90c3-fd6c-479e-9d1b-eac59fe2dd83", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Wait for the pods to reach `Running` state. This may take a few minutes, downloading the Ray images takes most of this time. If your pods stick in the `Pending` state, you can check for errors using `kubectl describe pod raycluster-kuberay-xxxx-xxxxx` and ensure your Docker resource limits meet the requirements.\n", "\n", "## Step 4: Run an application on a RayCluster\n", "\n", "Now, interact with the RayCluster deployed.\n", "\n", "### Method 1: Execute a Ray job in the head Pod\n", "\n", "The most straightforward way to experiment with your RayCluster is to exec directly into the head pod.\n", "First, identify your RayCluster's head pod:" ] }, { "cell_type": "code", "execution_count": 7, "id": "1711be7f-89d2-4c31-8bc6-a96a841e3239", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "raycluster-kuberay-head-k7rlq\n" ] } ], "source": [ "export HEAD_POD=$(kubectl get pods --selector=ray.io/node-type=head -o custom-columns=POD:metadata.name --no-headers)\n", "echo $HEAD_POD" ] }, { "cell_type": "code", "execution_count": 8, "id": "e9abeae9-5300-4381-89f0-ec44fa7a0113", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2025-03-18 01:27:48,692\tINFO worker.py:1514 -- Using address 127.0.0.1:6379 set in the environment variable RAY_ADDRESS\n", "2025-03-18 01:27:48,692\tINFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...\n", "2025-03-18 01:27:48,699\tINFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32m10.244.0.6:8265 \u001b[39m\u001b[22m\n", "{'CPU': 2.0,\n", " 'memory': 3000000000.0,\n", " 'node:10.244.0.6': 1.0,\n", " 'node:10.244.0.7': 1.0,\n", " 'node:__internal_head__': 1.0,\n", " 'object_store_memory': 749467238.0}\n" ] } ], "source": [ "# Print the cluster resources.\n", "kubectl exec -it $HEAD_POD -- python -c \"import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)\"" ] }, { "cell_type": "markdown", "id": "4b46912a-e9cb-42b0-8b7c-7f30afb41c9f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Method 2: Submit a Ray job to the RayCluster using [ray job submission SDK](jobs-quickstart)\n", "\n", "Unlike Method 1, this method doesn't require you to execute commands in the Ray head pod.\n", "Instead, you can use the [Ray job submission SDK](jobs-quickstart) to submit Ray jobs to the RayCluster through the Ray Dashboard port where Ray listens for Job requests.\n", "The KubeRay operator configures a [Kubernetes service](https://kubernetes.io/docs/concepts/services-networking/service/) targeting the Ray head Pod." ] }, { "cell_type": "code", "execution_count": 9, "id": "6d69dddf-4b0d-4e0b-89f1-f361e047cfe1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE\n", "raycluster-kuberay-head-svc ClusterIP None 10001/TCP,8265/TCP,6379/TCP,8080/TCP,8000/TCP 57s\n" ] } ], "source": [ "kubectl get service raycluster-kuberay-head-svc" ] }, { "cell_type": "markdown", "id": "c15d50f2-8315-4b44-bf31-f1c4c711d9d3", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Now that the service name is available, use port-forwarding to access the Ray Dashboard port which is 8265 by default." ] }, { "cell_type": "code", "execution_count": 10, "id": "6bb117ad-1362-455c-a580-be8e99226957", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-ignore-output", "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1] 290315\n" ] } ], "source": [ "# Execute this in a separate shell.\n", "kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265 > /dev/null &" ] }, { "cell_type": "markdown", "id": "cbff0e1b-bf0e-4fe9-84cc-f2081aa04e5c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "Now that the Dashboard port is accessible, submit jobs to the RayCluster:" ] }, { "cell_type": "code", "execution_count": 11, "id": "a08556bb-c9c0-4120-8790-c1eade78f813", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[37mJob submission server address\u001b[39m: \u001b[1mhttp://localhost:8265\u001b[22m\n", "\n", "\u001b[32m-------------------------------------------------------\u001b[39m\n", "\u001b[32mJob 'raysubmit_8vJ7dKqYrWKbd17i' submitted successfully\u001b[39m\n", "\u001b[32m-------------------------------------------------------\u001b[39m\n", "\n", "\u001b[36mNext steps\u001b[39m\n", " Query the logs of the job:\n", " \u001b[1mray job logs raysubmit_8vJ7dKqYrWKbd17i\u001b[22m\n", " Query the status of the job:\n", " \u001b[1mray job status raysubmit_8vJ7dKqYrWKbd17i\u001b[22m\n", " Request the job to be stopped:\n", " \u001b[1mray job stop raysubmit_8vJ7dKqYrWKbd17i\u001b[22m\n", "\n", "Tailing logs until the job exits (disable with --no-wait):\n", "2025-03-18 01:27:51,014\tINFO job_manager.py:530 -- Runtime env is setting up.\n", "2025-03-18 01:27:51,744\tINFO worker.py:1514 -- Using address 10.244.0.6:6379 set in the environment variable RAY_ADDRESS\n", "2025-03-18 01:27:51,744\tINFO worker.py:1654 -- Connecting to existing Ray cluster at address: 10.244.0.6:6379...\n", "2025-03-18 01:27:51,750\tINFO worker.py:1832 -- Connected to Ray cluster. View the dashboard at \u001b[1m\u001b[32m10.244.0.6:8265 \u001b[39m\u001b[22m\n", "{'CPU': 2.0,\n", " 'memory': 3000000000.0,\n", " 'node:10.244.0.6': 1.0,\n", " 'node:10.244.0.7': 1.0,\n", " 'node:__internal_head__': 1.0,\n", " 'object_store_memory': 749467238.0}\n", "\n", "\u001b[32m------------------------------------------\u001b[39m\n", "\u001b[32mJob 'raysubmit_8vJ7dKqYrWKbd17i' succeeded\u001b[39m\n", "\u001b[32m------------------------------------------\u001b[39m\n", "\n", "\u001b[0m\n" ] } ], "source": [ "# The following job's logs will show the Ray cluster's total resource capacity, including 2 CPUs.\n", "ray job submit --address http://localhost:8265 -- python -c \"import pprint; import ray; ray.init(); pprint.pprint(ray.cluster_resources(), sort_dicts=True)\"" ] }, { "cell_type": "markdown", "id": "7a106033-76f3-490d-9c73-328218c035c6", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Step 5: Access the Ray Dashboard\n", "\n", "Visit `${YOUR_IP}:8265` in your browser for the Dashboard. For example, `127.0.0.1:8265`.\n", "See the job you submitted in Step 4 in the **Recent jobs** pane as shown below.\n", "\n", "![Ray Dashboard](../images/ray-dashboard.png)\n", "\n", "## Step 6: Cleanup" ] }, { "cell_type": "code", "execution_count": 12, "id": "911f29ab-7148-4b86-b686-0176ac1c06e4", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [ "nbval-ignore-output", "remove-output" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Deleting cluster \"kind\" ...\n", "Deleted nodes: [\"kind-control-plane\"]\n", "[1]+ Terminated kubectl port-forward service/raycluster-kuberay-head-svc 8265:8265 > /dev/null\n" ] } ], "source": [ "# Kill the `kubectl port-forward` background job in the earlier step\n", "killall kubectl\n", "kind delete cluster" ] } ], "metadata": { "kernelspec": { "display_name": "Bash", "language": "bash", "name": "bash" }, "language_info": { "codemirror_mode": "shell", "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" } }, "nbformat": 4, "nbformat_minor": 5 }