--- title: RAG LLM Chatbot on CPU date: 2025-10-24 tier: sandbox summary: This patterns deploys a CPU-based LLM, your choice of several RAG DB providers, and a simple chatbot UI which exposes the configuration and results of the RAG queries. rh_products: - Red Hat OpenShift Container Platform - Red Hat OpenShift GitOps - Red Hat OpenShift AI partners: - Microsoft - IBM Fusion industries: - General aliases: /rag-llm-cpu/ links: github: https://github.com/validatedpatterns-sandbox/rag-llm-cpu install: getting-started bugs: https://github.com/validatedpatterns-sandbox/rag-llm-cpu/issues feedback: https://docs.google.com/forms/d/e/1FAIpQLScI76b6tD1WyPu2-d_9CCVDr3Fu5jYERthqLKJDUGwqBg7Vcg/viewform --- # **CPU-based RAG LLM chatbot** ## **Introduction** The CPU-based RAG LLM chatbot Validated Pattern deploys a retrieval-augmented generation (RAG) chatbot on Red Hat OpenShift by using Red Hat OpenShift AI. The pattern runs entirely on CPU nodes without requiring GPU hardware, which provides a cost-effective and accessible solution for environments where GPU resources are limited or unavailable. This pattern provides a secure, flexible, and production-ready starting point for building and deploying on-premise generative AI applications. ## **Target audience** This pattern is intended for the following users: - **Developers & Data Scientists** who want to build and experiment with RAG-based large language model (LLM) applications. - **MLOps & DevOps Engineers** who are responsible for deploying and managing AI/ML workloads on OpenShift. - **Architects** who evaluate cost-effective methods for delivering generative AI capabilities on-premise. ## **Why Use This Pattern?** - **Cost-Effective**: The pattern runs entirely on CPU nodes, which removes the need for expensive and scarce GPU resources. - **Flexible**: The pattern supports multiple vector database backends, such as Elasticsearch, PGVector, and Microsoft SQL Server, to integrate with existing data infrastructure. - **Transparent**: The Gradio frontend exposes the internals of the RAG query and LLM prompts, which provides insight into the generation process. - **Extensible**: The pattern uses open-source standards, such as KServe and OpenAI-compatible APIs, to serve as a foundation for complex applications. ## **Architecture Overview** At a high level, the components work together in the following sequence: 1. A user enters a query into the **Gradio UI**. 2. The backend application, using **LangChain**, queries a configured **vector database** to retrieve relevant documents. 3. These documents are combined with the original query from the user into a prompt. 4. The prompt is sent to the **KServe-deployed LLM**, which runs via **llama.cpp** on a CPU node. 5. The LLM generates a response, which is streamed back to the **Gradio UI**. 6. **Vault** provides the necessary credentials for the vector database and HuggingFace token at runtime. ![Overview](/images/rag-llm-cpu/rag-augmented-query.png) _Figure 1. Overview of RAG Query from User's perspective._ ## **Prerequisites** Before you begin, ensure that you have access to the following resources: - A Red Hat OpenShift cluster version 4.x. (The recommended size is at least two `m5.4xlarge` nodes.) - A HuggingFace API token. - The `Podman` command-line tool. ## **What This Pattern Provides** - A [KServe](https://github.com/kserve/kserve)-based LLM deployed to [Red Hat OpenShift AI](https://www.redhat.com/en/products/ai/openshift-ai) that runs entirely on a CPU-node with a [llama.cpp](https://github.com/ggml-org/llama.cpp) runtime. - A choice of one or more vector database providers to serve as a RAG backend with configurable web-based or Git repository-based sources. Vector embedding and document retrieval are implemented with [LangChain](https://docs.langchain.com/oss/python/langchain/overview). - [Vault](https://developer.hashiCorp.com/vault)-based secret management for a HuggingFace API token and credentials for supported databases, such as ([Elasticsearch](https://www.elastic.co/docs/solutions/search/vector), [PGVector](https://github.com/pgvector/pgvector), [Microsoft SQL Server](https://learn.microsoft.com/en-us/sql/sql-server/ai/vectors?view=sql-server-ver17)). - A [gradio](https://www.gradio.app/)-based frontend for connecting to multiple [OpenAI API-compatible](https://github.com/openai/openai-openapi) LLMs. This frontend exposes the internals of the RAG query and LLM prompts so that users have insight into the running processes.