---
title: Customizing this pattern
weight: 20
aliases: /rag-quickstart/customizing/
---

:toc:
:imagesdir: /images
:_content-type: ASSEMBLY
include::modules/comm-attributes.adoc[]

[id="customizing-rag-quickstart"]
== Customizing the RAG AI Quickstart pattern

Without any changes, this pattern runs a CPU-backed LLM and does not require a GPU. This limits both the models you can use and the inference speed, so you might want to use a GPU instead.

[id="enabling-gpu"]
=== Enabling GPU support

To enable GPU support, set `global.device` to `gpu` in `values-global.yaml` and push your changes to GitHub. This adds the Node Feature Discovery (NFD) Operator and the NVIDIA GPU Operator to the pattern installation, which lets the models run on an NVIDIA accelerator.

[NOTE]
====
If you are running this pattern on an OpenShift cluster on AWS, setting `global.device` to `gpu` automatically creates a GPU (`g6.2xlarge`) machine and adds it as a worker node to your cluster.
====

[id="changing-models"]
=== Changing models

To update the models, edit `overrides/values-cpu.yaml` (if `global.device` is set to `cpu`) or `overrides/values-gpu.yaml` (if it is set to `gpu`). The default CPU-based model is defined as follows:

[source,yaml]
----
global:
  models:
    llama-3-2-3b-instruct-cpu:
      id: meta-llama/Llama-3.2-3B-Instruct
      enabled: true
      resources:
        limits:
          cpu: "6"
          memory: 48Gi
        requests:
          cpu: "2"
          memory: 24Gi
      args:
        - --enable-auto-tool-choice
        - --chat-template
        - /chat-templates/tool_chat_template_llama3.2_json.jinja
        - --tool-call-parser
        - llama3_json
        - --dtype
        - auto
        - --max-model-len
        - "16384"
        - --max-num-seqs
        - "1"
----

You can change this to any vLLM-compatible model, provided that you have accepted its terms and conditions with the account that owns your Hugging Face API token. You can also adjust the resource parameters as needed for your environment.

The runtime image defaults to `vllm/vllm-openai:v0.11.1`.
If you need a later version, you can override the image:

[source,yaml]
----
llm-service:
  deviceConfigs:
    gpu:
      image: vllm/vllm-openai:nightly
----

[NOTE]
====
The example above sets a GPU-specific container image. To override the CPU-based image instead, use the key `llm-service.deviceConfigs.cpu.image`.
====

[id="multiple-models"]
=== Defining multiple models

You can define multiple LLM models to be served simultaneously. For example:

[source,yaml]
----
global:
  models:
    deepseek-r1:
      id: Valdemardi/DeepSeek-R1-Distill-Llama-70B-AWQ
      enabled: true
      resources:
        limits:
          cpu: "32"
          memory: 200Gi
        requests:
          cpu: "24"
          memory: 150Gi
      args:
        - --reasoning-parser
        - deepseek_r1
        - --tool-call-parser
        - llama3_json
        - --enable-auto-tool-choice
        - --quantization
        - awq_marlin
        - --dtype
        - float16
        - --max-model-len
        - "65536"
    gpt-oss-120b:
      id: openai/gpt-oss-120b
      enabled: true
      resources:
        limits:
          cpu: "32"
          memory: 200Gi
        requests:
          cpu: "24"
          memory: 150Gi
      args:
        - --tool-call-parser
        - openai
        - --enable-auto-tool-choice
----

For a complete list of customizable values, see the link:https://github.com/rh-ai-quickstart/ai-architecture-charts[AI Architecture charts] repository.
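
Model definitions placed in `overrides/values-gpu.yaml` are used only when `global.device` is set to `gpu`, as described in the section on enabling GPU support. As a minimal sketch, assuming that `device` is nested directly under the top-level `global` key (as the dotted name `global.device` suggests) and that every other setting in the file stays unchanged, the relevant fragment of `values-global.yaml` might look like this:

[source,yaml]
----
# values-global.yaml (fragment; other existing keys under "global" are omitted here)
global:
  # Assumed nesting for the `global.device` setting; use "cpu" to stay on the CPU-backed default
  device: gpu
----

After you edit the file, push the change to GitHub so that the pattern picks it up.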