---
title: "nvidia nemotron 3 agents rag voice safety"
source_url: https://developer.nvidia.com/blog/building-nvidia-nemotron-3-agents-for-reasoning-multimodal-rag-voice-and-safety/
tags: [nvidia, inference]
source: rss
source_feed: NVIDIA Developer Blog
source_published: 
ingested: 2026-05-08
review_value: 8
review_confidence: 7
review_recommendation: strong
review_stars: 5
sha256: 335a1d3b8710749c
type: raw
created: 2026-05-10
updated: 2026-05-10
---
# Building NVIDIA Nemotron 3 Agents for Reasoning, Multimodal RAG, Voice, and Safety | NVIDIA Technical Blog

Mar 24, 2026

By Chintan Patel, Maryam Motamedi, Chris Alexiuk, Moon Chung, and Isabel Hulseman

AI-generated summary: At GTC 2026, NVIDIA introduced the Nemotron 3 family, a unified stack of specialized models including Nemotron 3 Super for long-context reasoning, Nemotron 3 Content Safety for multimodal moderation, VoiceChat for real-time speech interaction, and Nano Omni (upcoming) for enterprise-grade multimodal understanding, all designed for scalable agentic AI systems. Nemotron 3 Super employs a hybrid Mamba-Transformer MoE architecture with NVFP4 precision on Blackwell GPUs, achieving high throughput and efficiency for multi-agent tasks, while Nemotron 3 Content Safety delivers low-latency, accurate safety moderation across multiple languages and modalities. NVIDIA NeMo tools, such as the NeMo Evaluator and Agent Toolkit, enable robust benchmarking and end-to-end optimization of agentic AI systems, allowing developers to build, evaluate, and deploy scalable, trustworthy digital assistants with open models and recipes. AI-generated content may summarize information incompletely. Verify important information.

Agentic AI is an ecosystem where specialized models work together to handle planning, reasoning, retrieval, and safety guardrailing. As these systems scale, developers need models that can understand real-world multimodal data, converse naturally with users globally, and operate safely across languages and modalities.

At GTC 2026, NVIDIA introduced a new generation of NVIDIA Nemotron models designed to work together as a unified agentic stack:

- NVIDIA Nemotron 3 Super for long-context reasoning and agentic tasks
- NVIDIA Nemotron 3 Ultra (coming soon) for the highest reasoning accuracy and efficiency among open frontier models
- NVIDIA Nemotron 3 Content Safety for multimodal, multilingual content moderation
- NVIDIA Nemotron 3 VoiceChat (in early access) for low-latency, natural, full-duplex voice interactions
- NVIDIA Nemotron 3 Nano Omni (coming soon) for enterprise-grade multimodal understanding
- NVIDIA Nemotron RAG for generating embeddings for image and text modalities with NVIDIA Llama Nemotron Embed VL, and for reordering image-or-text candidates when relevance depends on visual content with NVIDIA Llama Nemotron Rerank VL

Together with open data, training recipes, and NVIDIA NeMo tools, the Nemotron family of models provides an end-to-end toolkit to build, evaluate, and optimize production-grade agentic AI systems. This blog explores the latest Nemotron 3 models, their performance, and how developers can use them to build scalable, multimodal, and real-time AI agents.

## Power multi-agent systems with NVIDIA Nemotron 3 Super

Multi-agent systems suffer from "context explosion," with massive token histories 15x that of standard chat, and a "thinking tax," with chain-of-thought reasoning for every decision. NVIDIA Nemotron 3 Super is an open hybrid mixture-of-experts (MoE) model that activates just 12B parameters per pass, delivering high accuracy and efficiency for a fraction of the compute. A hybrid architecture with Mamba and Transformer layers, multi-token prediction, and NVFP4 precision on NVIDIA Blackwell GPUs delivers up to 5x higher throughput than the previous generation while reducing memory footprint and cost.
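
To make the idea of activating only a subset of parameters per pass concrete, here is a minimal sketch of generic top-k expert routing in plain Python. It is not Nemotron 3 Super's actual router or architecture, which this post does not detail. The toy experts, router scores, and function names are all hypothetical and chosen only for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_scores, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    probs = softmax(router_scores)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, experts, router_scores, k=2):
    """Run only the selected experts and mix their outputs; the others are skipped."""
    selected = route_top_k(router_scores, k)
    output = sum(w * experts[i](x) for i, w in selected)
    return output, [i for i, _ in selected]

# Hypothetical toy experts: each one just scales its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
output, used = moe_forward(1.0, experts, router_scores=[0.1, 2.0, 0.3, 1.5], k=2)
active_fraction = len(used) / len(experts)  # share of experts that actually ran
```

In this toy setup only two of the four experts run for the input, which is the general mechanism that lets an MoE model keep per-token compute well below its total parameter count.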

A configurable thinking budget lets developers bound chain of thought to keep latency and spend predictable, even for continuous agent workloads.

With a 1M-token context window and reinforcement learning across 10+ environments, Nemotron 3 Super excels at coding, math, instruction following, and function calling, making it ideal for multi-agent applications, with significantly higher throughput on Blackwell when running in NVFP4.

Figure 1. Nemotron 3 Super delivers top-tier intelligence while leading in throughput per GPU in the most attractive efficiency quadrant from Artificial Analysis.

Nemotron 3 Super uses latent MoE to call four expert specialists for the inference cost of only one, compressing tokens before they reach the experts. External evaluations back this up. On the Artificial Analysis Intelligence Index for open-weight models under 250B parameters, Nemotron 3 Super NVFP4 ranks among the top models, matching the highest intelligence scores from leading alternatives.

Figure 2. Nemotron 3 Super ranks among the top open-weight models under 250B parameters on the Artificial Analysis Intelligence Index.

In the intelligence versus efficiency plot, Nemotron 3 Super lands in the most attractive upper-right quadrant, combining strong task performance with high output throughput per GPU, making it a compelling choice for cost-sensitive production agents.

Nemotron 3 Super, with open weights, open training data, and open development recipes, is ideal for software development, deep research, cybersecurity, and the financial services industry.

## Keep agents safe with Nemotron 3 Content Safety

As agents expand from text-only to multimodal workflows, safety guardrails must evolve across inputs, retrieval, and outputs.
They must also apply to use cases like enterprise copilots and user-generated content (think dating apps or social media), and detect prompt injection in agentic systems in domains such as healthcare, where self-harm is a concern.

Nemotron 3 Content Safety is a compact 4B-parameter multimodal safety model that detects unsafe or sensitive content across text and images. Built on the Gemma 3 4B backbone with an adapter-based classification head, it delivers high-accuracy safety classification at low latency, which is ideal for production agentic pipelines. It fuses visual and language features to produce a simple safe/unsafe decision, with optional granular category labels. A quick keyword toggle lets developers choose between fast binary classification and full taxonomy reporting, supporting both low-latency paths and deeper inspection.

On a suite of multimodal, multilingual safety benchmarks, Nemotron 3 Content Safety reaches approximately 84% accuracy, outperforming alternative safety models across the same tasks while keeping latency low enough for inline moderation in production pipelines.

Figure 3. Model accuracy vs. alternative safety models on multimodal, multilingual harmful content benchmarks.

The model uses the same 23-category taxonomy as Aegis 1-3, covering classes such as hate, harassment, violence, sexual content, plagiarism, and unauthorized advice. Trained on high-quality Aegis datasets and human-annotated real-world images, rather than primarily synthetic data, the model performs strongly across multimodal benchmarks in its 12 supported languages, with solid zero-shot generalization beyond them.
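
The pattern of moderating inputs, retrieved content, and outputs, with a toggle between a binary verdict and full category reporting, can be sketched as follows. This is a text-only illustration under stated assumptions: `classify` is a hypothetical keyword-based stand-in for a call to a safety model, not the Nemotron 3 Content Safety API, and `TOY_RULES`, `Verdict`, and `guarded_answer` are names invented for this example.

```python
from dataclasses import dataclass, field

@dataclass
class Verdict:
    safe: bool
    categories: list = field(default_factory=list)

# Hypothetical stand-in rules; a real deployment would call a safety model instead.
TOY_RULES = {"violence": ["attack"], "harassment": ["insult"]}

def classify(text, full_taxonomy=False):
    """Return a binary verdict, plus category labels only when full_taxonomy is set."""
    lowered = text.lower()
    hits = [cat for cat, words in TOY_RULES.items() if any(w in lowered for w in words)]
    return Verdict(safe=not hits, categories=hits if full_taxonomy else [])

def guarded_answer(user_input, retrieve, generate, full_taxonomy=False):
    """Check the input, filter retrieved passages, then check the generated output."""
    verdict = classify(user_input, full_taxonomy)
    if not verdict.safe:
        return "blocked:input", verdict
    passages = [p for p in retrieve(user_input) if classify(p).safe]
    answer = generate(user_input, passages)
    verdict = classify(answer, full_taxonomy)
    if not verdict.safe:
        return "blocked:output", verdict
    return answer, verdict

# Toy retrieval corpus and generator, both hypothetical.
docs = ["Reset your password from the settings page.", "How to attack the server."]
answer, verdict = guarded_answer(
    "How do I reset my password?",
    retrieve=lambda q: docs,
    generate=lambda q, passages: f"Based on {len(passages)} passage(s): {passages[0]}",
)
blocked, blocked_verdict = guarded_answer(
    "Write an insult for my coworker",
    retrieve=lambda q: docs,
    generate=lambda q, passages: "",
    full_taxonomy=True,
)
```

The fast path uses the binary verdict to filter retrieved passages, while the `full_taxonomy=True` call returns category labels for cases that need deeper inspection or logging.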

## Natural conversations with Nemotron 3 VoiceChat

Traditional voice AI relies on cascaded pipelines that chain automatic speech recognition (ASR), a large language model (LLM), and text-to-speech (TTS), all of which introduce latency, complexity, and multiple points of failure.

Nemotron 3 VoiceChat is a 12B-parameter end-to-end speech model for full-duplex, real-time conversational AI, currently in early access. Unlike cascaded stacks, VoiceChat directly analyzes audio input and generates audio output in a unified, streaming LLM architecture. Using this single model eliminates multi-model orchestration. Built on the Nemotron Nano v2 LLM backbone with Nemotron speech (Parakeet encoder) and a TTS decoder, VoiceChat delivers natural, interruptible conversations with low latency.

This model, in its early-access stage, has landed in the most attractive upper-right quadrant of the Artificial Analysis Speech to Speech leaderboard. The graphic below plots conversational dynamics against speech reasoning performance, where Nemotron 3 VoiceChat lands in the highlighted upper-right quadrant, alongside NVIDIA PersonaPlex, a full-duplex, 7B-parameter research model. This means developers get both responsive turn-taking behavior and strong reasoning over audio; both are critical for assistants that must sound natural and stay on task.

Figure 4. Nemotron 3 VoiceChat and NVIDIA PersonaPlex lead open source full-duplex models on both conversational dynamics and speech reasoning, landing in the most attractive quadrant of the Artificial Analysis benchmark.

With a streamlined end-to-end pipeline, VoiceChat targets sub-300ms end-to-end latency, processing 80ms audio chunks faster than real time. A single model means fewer points of failure, reduced technical debt, and easier deployment for conversational agents in healthcare, financial services, telecommunications, gaming, and more.
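
The phrase "faster than real time" has a simple arithmetic meaning for streaming audio: each 80ms chunk must be processed in less than 80ms, or a backlog builds up. The sketch below works through that arithmetic. Only the 80ms chunk size and 300ms latency target come from this post; the per-chunk processing times are hypothetical values for illustration, not measurements of VoiceChat.

```python
CHUNK_MS = 80            # audio chunk duration stated in the post
LATENCY_TARGET_MS = 300  # end-to-end latency target stated in the post

def real_time_factor(processing_ms, chunk_ms=CHUNK_MS):
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    return processing_ms / chunk_ms

def backlog_after_each_chunk(processing_times_ms, chunk_ms=CHUNK_MS):
    """Simulate a stream where a new chunk arrives every chunk_ms.

    Returns the accumulated processing backlog (ms) after each chunk.
    """
    backlog, history = 0.0, []
    for t in processing_times_ms:
        backlog = max(0.0, backlog + t - chunk_ms)
        history.append(backlog)
    return history

rtf = real_time_factor(60)                       # hypothetical 60ms per chunk
chunks_in_budget = LATENCY_TARGET_MS / CHUNK_MS  # chunk durations that fit in the target
history = backlog_after_each_chunk([60, 60, 100, 100, 60])  # hypothetical timings
```

With these hypothetical timings, the two slow chunks (100ms each) create a backlog that only drains once processing drops back under the chunk duration, which is why a sustained real-time factor below 1.0 matters for keeping conversational latency bounded.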

## Understand the world with NVIDIA Nemotron 3 Omni

Agentic systems increasingly need to understand real-world data in different formats (video, audio, documents, and UI screens) and reason across modalities. Existing solutions are either closed source or face compliance challenges for global enterprise deployment.

NVIDIA Nemotron 3 Nano Omni is the first open, production-ready native omni-understanding foundation model delivering high-context video reasoning enhanced through audio transcription. Nano Omni is powered by NVIDIA Nemotron speech (Parakeet encoder), state-of-the-art optical character recognition (OCR) reasoning with a Nemotron 3 Nano language backbone, and NVIDIA's first GUI-trained system for real agentic applications.

The architecture uses 3D convolution layers (Conv3D) for efficient handling of temporal-spatial data in video, and efficient video sampling (EVS) enables processing of longer videos at the same computational cost by identifying and pruning temporally static patches. Stay tuned for release updates about this model.

## Improve multimodal search with Llama Nemotron Embed VL and Rerank VL

Agentic RAG pipelines rely on retrieval to ground generation on evidence, not just prompts. But enterprise data lives in PDFs with charts, scanned contracts, tables, and slide decks, formats that text-only retrieval misses entirely.

Llama Nemotron Embed VL and Llama Nemotron Rerank VL are compact multimodal models that enable accurate visual document retrieval while remaining compatible with standard vector databases. On the ViDoRe V3/MTEB Pareto curve, which plots retrieval accuracy versus tokens processed per second on a single NVIDIA H100 GPU, Llama Nemotron Embed VL occupies the Pareto frontier. It delivers competitive or better accuracy at high throughput relative to both open and commercial alternatives.

Figure 5. Pareto curve for model accuracy vs. performance for open and commercial embedding models.
Benchmarked on one H100 by the MTEB leaderboard on the ViDoRe V3 benchmark.

Llama Nemotron Embed VL is a 1.7B-parameter dense embedding model that encodes page images and text into a single embedding vector, with support for Matryoshka embeddings. Built on NVIDIA Eagle, a frontier vision-language model with a Llama 3.2 1B backbone and SigLip2 400M vision encoder, it uses contrastive learning for query-document similarity and enables millisecond-latency search with standard vector databases.

Llama Nemotron Rerank VL is a 1.7B-parameter cross-encoder reranker that scores query-page relevance. When paired with the Llama Nemotron Embed VL model, it further increases accuracy by reranking retrieved text chunks and images.

## Evaluate and optimize with NVIDIA NeMo

Building production agents requires not only strong models but also robust tools for evaluation and optimization. NVIDIA NeMo provides tools to evaluate, compare, and tune agentic systems:

- NVIDIA NeMo Evaluator enables robust, reproducible benchmarking with support for agentic evaluation. With standardized evaluation setups, developers can benchmark performance, validate outputs, and compare models under consistent conditions.
- NVIDIA NeMo Agent Toolkit is an open source framework for profiling and optimizing agentic systems end-to-end. Bring agents from LangChain, AutoGen, AWS Strands, or other frameworks without code changes and get visibility into latency bottlenecks, token costs, and orchestration overhead to ship performant agents at scale.

## Start building with Nemotron

Agentic AI is a shift from systems that respond to systems that act. It is a coordinated stack of models, tools, memory, and guardrails that can plan, execute, critique, and adapt. If it's just a bigger model in the same chat window, it's not agentic. The Nemotron family of models, released under the NVIDIA permissive open model licenses, is built for this multi-model reality.
Nemotron 3 Super anchors long-context reasoning and planning. Nemotron 3 Content Safety watches every step, moderating multimodal inputs, retrieved content, and outputs. Nemotron 3 VoiceChat turns that intelligence into full-duplex, real-time conversations. Nemotron 3 Nano Omni (coming soon) gives agents eyes and ears across video, audio, documents, charts, and GUIs. Around them, NeMo tools add retrieval, tool calling, evaluation, and judge models so agents can score their own work and improve.

Efficiency is the hidden requirement that makes production viable. Real agents make dozens or hundreds of model calls per task, so Nemotron models are right-sized and optimized for throughput, latency, and cost. And because they're open and customizable, teams can tune behaviors, align to their own data, and deploy where their security and compliance teams need them.

With Nemotron and NVIDIA NeMo, you're getting the building blocks for trustworthy, repeatable, and scalable digital assistants for your production agentic systems.

Get started today:

- Download the Nemotron models and datasets from Hugging Face.
- Preview and access Nemotron Super.
- Access Nemotron 3 Content Safety.
- Preview and apply for early access to Nemotron 3 VoiceChat.
- Evaluate with NVIDIA NeMo Evaluator.
- Optimize with NeMo Agent Toolkit.
- Evaluate NVIDIA-hosted API endpoints on build.nvidia.com and OpenRouter.

Stay up-to-date on NVIDIA Nemotron by subscribing to NVIDIA news and following NVIDIA AI on LinkedIn, X, Discord, and YouTube. Visit the Nemotron developer page for resources to get started. Explore open Nemotron models and datasets on Hugging Face and Blueprints on build.nvidia.com. Engage with Nemotron livestreams, tutorials, and the developer community on the NVIDIA forum and Discord.
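
For readers who want a concrete picture of the retrieve-then-rerank flow described in the Llama Nemotron Embed VL and Rerank VL section, here is a minimal sketch in plain Python. It shows generic Matryoshka-style truncation (keep a prefix of the embedding, then renormalize), cosine-similarity retrieval, and a second-stage rerank. The vectors and reranker scores are hypothetical toy values, and none of the function names correspond to an NVIDIA API.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def truncate(v, dim):
    """Matryoshka-style shortening: keep the first dim components, then renormalize."""
    return normalize(v[:dim])

def cosine(a, b):
    """Dot product of two unit-length vectors equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

def retrieve(query_vec, page_vecs, dim, top_k):
    """First stage: rank pages by cosine similarity on truncated embeddings."""
    q = truncate(query_vec, dim)
    scored = [(cosine(q, truncate(p, dim)), i) for i, p in enumerate(page_vecs)]
    return [i for _, i in sorted(scored, reverse=True)[:top_k]]

def rerank(candidates, score_pair):
    """Second stage: reorder candidates with a slower, more precise pair scorer."""
    return sorted(candidates, key=score_pair, reverse=True)

# Hypothetical 4-dimensional embeddings for one query and three pages.
query = [1.0, 0.0, 0.2, 0.0]
pages = [[0.9, 0.1, 0.0, 0.4], [0.8, 0.2, 0.5, 0.1], [0.0, 1.0, 0.0, 0.0]]

candidates = retrieve(query, pages, dim=2, top_k=2)  # cheap search on 2 dimensions
full_dim_best = retrieve(query, pages, dim=4, top_k=1)

# Hypothetical cross-encoder relevance scores keyed by page index.
toy_scores = {0: 0.2, 1: 0.9}
final_order = rerank(candidates, score_pair=lambda i: toy_scores[i])
```

In this toy data, the truncated search ranks page 0 first while the full-dimension search prefers page 1, and the rerank stage promotes page 1 back to the top. That is the general trade-off: shorter embeddings make first-stage search cheaper, and a reranker can recover precision on the short candidate list.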

## About the Authors

Chintan Patel is a senior product manager at NVIDIA focused on bringing GPU-accelerated solutions to the HPC community. He leads the management and offering of the HPC application containers on the NVIDIA GPU Cloud registry. Prior to NVIDIA, he held product management, marketing, and engineering positions at Micrel, Inc. He holds an MBA from Santa Clara University and a bachelor's degree in electrical engineering and computer science from UC Berkeley.

Maryam Motamedi is a product marketing lead for AI software at NVIDIA. She brings decades of cross-industry experience in media/AdTech, streaming, retail, and telecom. Maryam specializes in translating cutting-edge technology into real-world solutions, helping developers and enterprises build AI-powered applications that redefine how we connect, work, and interact.

Chris Alexiuk is a deep learning developer advocate at NVIDIA, working on creating technical assets that help developers use the incredible suite of AI tools available at NVIDIA. Chris comes from a machine learning and data science background, and he is obsessed with everything and anything about large language models.

Moon Chung is a senior product marketing manager at NVIDIA specializing in Enterprise AI. She has previously worked for Meta and Adobe, focusing on product strategy, product development, and go-to-market strategy. Moon holds an MBA degree from Duke University's Fuqua School of Business.

Isabel Hulseman is a product marketing manager for enterprise AI software at NVIDIA. With over 9 years of marketing experience (3+ at NVIDIA), and an MBA in marketing, her goal is to provide developers with the tools they need to build custom generative AI applications and enable enterprises to develop and scale their solutions to serve their customers better.