--- name: flash description: runpod-flash SDK and CLI for deploying AI workloads on Runpod serverless GPUs/CPUs. user-invocable: true --- # Runpod Flash Write code locally, test with `flash run` (dev server at localhost:8888), and flash automatically provisions and deploys to remote GPUs/CPUs in the cloud. `Endpoint` handles everything. ## Setup ```bash pip install runpod-flash # requires Python >=3.10 # auth option 1: browser-based login (saves token locally) flash login # auth option 2: API key via environment variable export RUNPOD_API_KEY=your_key flash init my-project # scaffold a new project in ./my-project ``` ## CLI ```bash flash run # start local dev server at localhost:8888 flash run --auto-provision # same, but pre-provision endpoints (no cold start) flash build # package artifact for deployment (500MB limit) flash build --exclude pkg1,pkg2 # exclude packages from build flash deploy # build + deploy (auto-selects env if only one) flash deploy --env staging # build + deploy to "staging" environment flash deploy --app my-app --env prod # deploy a specific app to an environment flash deploy --preview # build + launch local preview in Docker flash env list # list deployment environments flash env create staging # create "staging" environment flash env get staging # show environment details + resources flash env delete staging # delete environment + tear down resources flash undeploy list # list all active endpoints flash undeploy my-endpoint # remove a specific endpoint ``` ## Endpoint: Three Modes ### Mode 1: Your Code (Queue-Based Decorator) One function = one endpoint with its own workers. ```python from runpod_flash import Endpoint, GpuGroup @Endpoint(name="my-worker", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"]) async def compute(data): import torch # MUST import inside function (cloudpickle) return {"sum": torch.tensor(data, device="cuda").sum().item()} result = await compute([1, 2, 3]) ``` ### Mode 2: Your Code (Load-Balanced Routes) Multiple HTTP routes share one pool of workers. ```python from runpod_flash import Endpoint, GpuGroup api = Endpoint(name="my-api", gpu=GpuGroup.ADA_24, workers=(1, 5), dependencies=["torch"]) @api.post("/predict") async def predict(data: list[float]): import torch return {"result": torch.tensor(data, device="cuda").sum().item()} @api.get("/health") async def health(): return {"status": "ok"} ``` ### Mode 3: External Image (Client) Deploy a pre-built Docker image and call it via HTTP. ```python from runpod_flash import Endpoint, GpuGroup, PodTemplate server = Endpoint( name="my-server", image="my-org/my-image:latest", gpu=GpuGroup.AMPERE_80, workers=1, env={"HF_TOKEN": "xxx"}, template=PodTemplate(containerDiskInGb=100), ) # LB-style result = await server.post("/v1/completions", {"prompt": "hello"}) models = await server.get("/v1/models") # QB-style job = await server.run({"prompt": "hello"}) await job.wait() print(job.output) ``` Connect to an existing endpoint by ID (no provisioning): ```python ep = Endpoint(id="abc123") job = await ep.runsync({"input": "hello"}) print(job.output) ``` ## How Mode Is Determined | Parameters | Mode | |-----------|------| | `name=` only | Decorator (your code) | | `image=` set | Client (deploys image, then HTTP calls) | | `id=` set | Client (connects to existing, no provisioning) | ## Endpoint Constructor ```python Endpoint( name="endpoint-name", # required (unless id= set) id=None, # connect to existing endpoint gpu=GpuGroup.AMPERE_80, # single GPU type (default: ANY) gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80], # or list for auto-select by supply cpu=CpuInstanceType.CPU5C_4_8, # CPU type (mutually exclusive with gpu) workers=5, # shorthand for (0, 5) workers=(1, 5), # explicit (min, max) idle_timeout=60, # seconds before scale-down (default: 60) dependencies=["torch"], # pip packages for remote exec system_dependencies=["ffmpeg"], # apt-get packages image="org/image:tag", # pre-built Docker image (client mode) env={"KEY": "val"}, # environment variables volume=NetworkVolume(...), # persistent storage gpu_count=1, # GPUs per worker template=PodTemplate(containerDiskInGb=100), flashboot=True, # fast cold starts execution_timeout_ms=0, # max execution time (0 = unlimited) ) ``` - `gpu=` and `cpu=` are mutually exclusive - `workers=5` means `(0, 5)`. Default is `(0, 1)` - `idle_timeout` default is **60 seconds** - `flashboot=True` (default) -- enables fast cold starts via snapshot restore - `gpu_count` -- GPUs per worker (default 1), use >1 for multi-GPU models ### NetworkVolume ```python NetworkVolume(name="my-vol", size=100) # size in GB, default 100 ``` ### PodTemplate ```python PodTemplate( containerDiskInGb=64, # container disk size (default 64) dockerArgs="", # extra docker arguments ports="", # exposed ports startScript="", # script to run on start ) ``` ## EndpointJob Returned by `ep.run()` and `ep.runsync()` in client mode. ```python job = await ep.run({"data": [1, 2, 3]}) await job.wait(timeout=120) # poll until done print(job.id, job.output, job.error, job.done) await job.cancel() ``` ## GPU Types (GpuGroup) | Enum | GPU | VRAM | |------|-----|------| | `ANY` | any | varies | | `AMPERE_16` | RTX A4000 | 16GB | | `AMPERE_24` | RTX A5000/L4 | 24GB | | `AMPERE_48` | A40/A6000 | 48GB | | `AMPERE_80` | A100 | 80GB | | `ADA_24` | RTX 4090 | 24GB | | `ADA_32_PRO` | RTX 5090 | 32GB | | `ADA_48_PRO` | RTX 6000 Ada | 48GB | | `ADA_80_PRO` | H100 PCIe (80GB) / H100 HBM3 (80GB) / H100 NVL (94GB) | 80GB+ | | `HOPPER_141` | H200 | 141GB | ## CPU Types (CpuInstanceType) | Enum | vCPU | RAM | Max Disk | Type | |------|------|-----|----------|------| | `CPU3G_1_4` | 1 | 4GB | 10GB | General | | `CPU3G_2_8` | 2 | 8GB | 20GB | General | | `CPU3G_4_16` | 4 | 16GB | 40GB | General | | `CPU3G_8_32` | 8 | 32GB | 80GB | General | | `CPU3C_1_2` | 1 | 2GB | 10GB | Compute | | `CPU3C_2_4` | 2 | 4GB | 20GB | Compute | | `CPU3C_4_8` | 4 | 8GB | 40GB | Compute | | `CPU3C_8_16` | 8 | 16GB | 80GB | Compute | | `CPU5C_1_2` | 1 | 2GB | 15GB | Compute (5th gen) | | `CPU5C_2_4` | 2 | 4GB | 30GB | Compute (5th gen) | | `CPU5C_4_8` | 4 | 8GB | 60GB | Compute (5th gen) | | `CPU5C_8_16` | 8 | 16GB | 120GB | Compute (5th gen) | ```python from runpod_flash import Endpoint, CpuInstanceType @Endpoint(name="cpu-work", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"]) async def process(data): import pandas as pd return pd.DataFrame(data).describe().to_dict() ``` ## Common Patterns ### CPU + GPU Pipeline ```python from runpod_flash import Endpoint, GpuGroup, CpuInstanceType @Endpoint(name="preprocess", cpu=CpuInstanceType.CPU5C_4_8, workers=5, dependencies=["pandas"]) async def preprocess(raw): import pandas as pd return pd.DataFrame(raw).to_dict("records") @Endpoint(name="infer", gpu=GpuGroup.AMPERE_80, workers=5, dependencies=["torch"]) async def infer(clean): import torch t = torch.tensor([[v for v in r.values()] for r in clean], device="cuda") return {"predictions": t.mean(dim=1).tolist()} async def pipeline(data): return await infer(await preprocess(data)) ``` ### Parallel Execution ```python import asyncio results = await asyncio.gather(compute(a), compute(b), compute(c)) ``` ## Gotchas 1. **Imports outside function** -- most common error. Everything inside the decorated function. 2. **Forgetting await** -- all decorated functions and client methods need `await`. 3. **Missing dependencies** -- must list in `dependencies=[]`. 4. **gpu/cpu are exclusive** -- pick one per Endpoint. 5. **idle_timeout is seconds** -- default 60s, not minutes. 6. **10MB payload limit** -- pass URLs, not large objects. 7. **Client vs decorator** -- `image=`/`id=` = client. Otherwise = decorator. 8. **Auto GPU switching requires workers >= 5** -- pass a list of GPU types (e.g. `gpu=[GpuGroup.ADA_24, GpuGroup.AMPERE_80]`) and set `workers=5` or higher. The platform only auto-switches GPU types based on supply when max workers is at least 5. 9. **`runsync` timeout is 60s** -- cold starts can exceed 60s. Use `ep.runsync(data, timeout=120)` for first requests or use `ep.run()` + `job.wait()` instead.