---
title: "nvidia edge first llms av robotics"
source_url: https://developer.nvidia.com/blog/build-next-gen-physical-ai-with-edge%e2%80%91first-llms-for-autonomous-vehicles-and-robotics/
tags: [nvidia, inference]
source: rss
source_feed: NVIDIA Developer Blog
source_published: 
ingested: 2026-05-08
review_value: 8
review_confidence: 7
review_recommendation: strong
review_stars: 5
sha256: efd8451bf4418ed4
type: raw
created: 2026-05-10
updated: 2026-05-10
---
# Build Next&#x2d;Gen Physical AI with Edge‑First LLMs for Autonomous Vehicles and Robotics | NVIDIA Technical Blog
Build Next&#x2d;Gen Physical AI with Edge First LLMs for Autonomous Vehicles and Robotics | NVIDIA Technical Blog DEVELOPER Home Blog Forums Docs Downloads Training Join Technical Blog Subscribe Related Resources Developer Tools &amp; Techniques Build Next-Gen Physical AI with Edge First LLMs for Autonomous Vehicles and Robotics NVIDIA TensorRT Edge LLM introduces support for MoEs, Cosmos Reason 2, and Qwen3-TTS/ASR on NVIDIA Jetson and NVIDIA DRIVE Mar 12, 2026 By Lin Chai , Luxiao Zheng , Fan Shi , Maximilien Breughe and Michael Ferry Like Discuss (0) L T F R E AI-Generated Summary Like Dislike The latest release of NVIDIA TensorRT Edge-LLM introduces advanced support for mixture of experts (MoE), hybrid reasoning architectures, and the NVIDIA Nemotron family on embedded platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor, enabling high-fidelity, low-latency autonomous machine intelligence within strict power constraints. Native multimodal interaction is achieved through optimized Qwen3-TTS and Qwen3-ASR models, allowing end-to-end, low-latency voice dialogue with a Thinker-Talker framework, and Cosmos Reason 2 enables advanced spatio-temporal reasoning, 3D localization, and long-context processing for humanoid robotics and embodied agents at the edge. NVIDIA Alpamayo integration supports end-to-end trajectory planning in autonomous vehicles, employing flow matching trajectory decoding, explainable decision-making with multicamera context, and FP8-accelerated Vision Transformers, marking a shift from modular stacks to production-ready, reasoning-based VLA models. AI-generated content may summarize information incompletely. Verify important information. Learn more Physical AI is rapidly evolving, from next-generation software-defined autonomous vehicles (AVs) to humanoid robots. The challenge is no longer how to run a large language model (LLM), but how to enable high-fidelity reasoning, real-time multimodal interaction, and trajectory planning within strict power and latency envelopes. NVIDIA TensorRT Edge-LLM , a high-performance C++ inference runtime for LLMs and vision language models (VLMs) on embedded platforms, is designed to overcome these challenges.&nbsp; As explained in this post, the latest TensorRT Edge-LLM release delivers a significant expansion in fundamental capabilities for NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor platforms. It introduces advanced edge architectures, including mixture of experts (MoE) , the NVIDIA Cosmos Reason 2 open planning model for physical AI, and Qwen3-TTS and Qwen-ASR models for embedded speech processing. Building on these foundational pillars, the release also offers optimized support for the NVIDIA Nemotron family of open models. This provides developers with the essential runtime to build the next generation of autonomous machines. Efficient reasoning at scale Running massive models on embedded hardware requires a rethink of compute efficiency. The latest release of TensorRT Edge-LLM fully enables MoE support at the edge, specifically optimizing models like Qwen3 MoE. By activating only a subset of expert parameters per token, MoE architectures enable edge devices to access the reasoning capabilities of a massive model while maintaining the inference latency and active compute footprint of a much smaller one.&nbsp; This architectural shift is critical for deploying high-fidelity reasoning on edge platforms like NVIDIA DRIVE AGX Thor and NVIDIA Jetson Thor. As a developer, you can drastically scale up the intelligence of your autonomous systems without exceeding the strict power and latency limits required for real-time, mission-critical operations. Unlock hybrid reasoning at the edge&nbsp; TensorRT Edge-LLM is a specialized runtime to fully support NVIDIA Nemotron 2 Nano . This enables a new class of System 2 reasoning directly on embedded chipsets, including NVIDIA DRIVE Thor and Jetson Thor. For developers building advanced in-cabin AI assistants or robotic dialogue agents, deploying highly capable language models at the edge presents a significant memory and latency challenge. Nemotron 2 Nano addresses this challenge fundamentally by utilizing a novel Hybrid Mamba-2-Transformer architecture. This significantly reduces the memory footprint from KV cache storage with Mamba State Space architectures while maintaining high-fidelity precision from attention layers.&nbsp; TensorRT Edge-LLM bridges the deployment gap by providing optimized kernels that accelerate these specific hybrid layers. This enables developers to use the model&#8217;s massive context window for complex edge retrieval-augmented generation (RAG) pipelines or agentic workflows while maintaining a strict, production-viable device memory footprint. By enabling dynamic thinking at the edge with TensorRT Edge-LLM, developers can leverage a model s ability to shift seamlessly between deep reasoning and immediate conversational action. This is a critical capability for advanced in-cabin assistants and robotic agents that must reason through complex user queries one moment and provide conversational responses the next. Deep reasoning mode ( /think ) : TensorRT Edge-LLM efficiently handles the expanded token generation required for chain of thought (CoT) processing. By using the /think system prompt, the runtime enables the model to think through complex logic, achieving a remarkable 97.8% on MATH500 before outputting a decision. Conversational reflex mode ( /no_think ) : For latency-critical voice interactions where the user expects an immediate reply, developers can issue a /no_think command. TensorRT Edge-LLM optimizes this path to bypass reasoning traces, delivering immediate, intelligent responsiveness required for seamless conversational AI and agile on-device agents. By supporting this hybrid architecture, TensorRT Edge-LLM enables compact, production-ready VLMs and LLMs to serve as both reasoned assistants and low-latency conversational agents, significantly reducing the memory constraints of physical AI. Real-time multimodal interaction at the edge TensorRT Edge-LLM now offers support for Qwen3-TTS and Qwen3-ASR , a native multimodal model with Thinker-Talker architecture capable of voice interaction. Unlike traditional pipelines that cascade ASR, LLM, and TTS models, adding latency at every hop, Qwen3-TTS/ASR&nbsp; handles end-to-end speech processing. By optimizing both the Thinker and Talker components, TensorRT Edge-LLM enables low-latency, natural voice synthesis directly on the chip: Thinker : TensorRT Edge-LLM accelerates the reasoning core, allowing the model to process complex driver queries and environment context to generate intelligent, reasoned responses. Talker : TensorRT Edge-LLM complements the reasoning engine by delivering low latency, natural voice synthesis (TTS) directly on the chip. In the case of AVs, this allows for seamless, interruptible conversations between the driver and the vehicle. Equipping humanoid robotics with physical common sense&nbsp; For humanoid robots and advanced vision agents, understanding the real world requires more than just identifying objects; it requires an intuitive grasp of physics and time. To meet this need, TensorRT Edge-LLM now supports Cosmos Reason 2 , an open, customizable reasoning VLM purpose-built for physical AI and robotics.&nbsp; Cosmos Reason 2 empowers embodied agents to reason like humans by using prior knowledge, physical common sense, and chain-of-thought capabilities to understand world dynamics without human annotations. With TensorRT Edge-LLM optimized, low-latency runtime, robots at the edge can efficiently leverage Cosmos Reason 2 as a primary planning model to reason through their next steps.&nbsp; Key capabilities of Cosmos Reason 2 accelerated by TensorRT Edge-LLM include: Advanced spatio-temporal reasoning : Enhanced physical AI reasoning with improved timestamp precision and a deep understanding of space, time, and fundamental physics. 3D localization and explanation : The ability to not only detect objects but also provide 2D and 3D point localization, bounding-box coordinates, and contextual reasoning explanations for its labels. Massive context processing : Support for an improved long-context window of up to 256K input tokens, allowing edge agents to ingest extensive environmental and historical data. By supporting Cosmos Reason 2, TensorRT Edge-LLM ensures that next-generation robots can continuously evaluate complex, long-tail physical scenarios and safely plan their actions in real time. Advancing autonomous driving with end-to-end trajectory planning Among the most significant shifts in autonomous production is the move from traditional modular stacks to end-to-end VLA models. NVIDIA Alpamayo is a family of open AI models, simulation frameworks, and physical AI datasets designed to accelerate the development of safe, transparent, and reasoning-based AVs.&nbsp; Stay tuned for the forthcoming Alpamayo 1 workflow, a distillation recipe that brings System 2 rational thinking to the edge. Alpamayo 1 represents a leap forward from standard VLMs. It is not just describing a scene; it is planning a precise trajectory through it. The architecture utilizes a Cosmos Reason Backbone (distilled) to generate a chain of causation (reasoning trace) before outputting actions.&nbsp; Key features of the Alpamayo integration in TensorRT Edge-LLM include: Flow matching trajectory decoding : Moving beyond simple regression, flow matching is used to generate diverse, high-fidelity future trajectories. History and context : The model tokenizes two-second historical trajectories and multicamera inputs, processing them through a Qwen3-VL backbone to output explainable driving decisions. For example, &#8220;Nudge to the left to increase clearance. Performance : On DRIVE Thor, Alpamayo 1 achieves production-viable latencies, using FP8 acceleration for the Vision Transformer (ViT) components. Figure 1. The most significant shift in autonomous vehicle production is the transition from traditional modular stacks to end-to-end VLA models Get started with TensorRT Edge-LLM for physical AI TensorRT Edge-LLM serves as the go-to-open-source, pure C++ inference runtime designed specifically for the mission-critical needs of automotive and robotics. It eliminates Python dependencies for deployment, ensuring predictable memory footprints. From deploying the efficient expert routing of Qwen3 MoE today, to preparing for the future distilled reasoning of Alpamayo 1, NVIDIA provides the essential runtime to build the next generation of autonomous machines. To get started, explore the new features, including the Alpamayo and MoE examples, in the updated TensorRT Edge-LLM GitHub repo or through the latest NVIDIA DriveOS releases. Discuss (0) Like Tags Developer Tools &amp; Techniques | Edge Computing | Robotics | Automotive / Transportation | Cosmos | DRIVE | Jetson | Nemotron | TensorRT | TensorRT-LLM | Intermediate Technical | Deep dive | AI Inference | autonomous vehicles | GTC 2026 | IoT | LLMs | Mixture of Experts (MoE) | Physical AI | Retrieval Augmented Generation (RAG) | Thor | VLMs About the Authors About Lin Chai Lin Chai is a senior product manager at NVIDIA, leading TensorRT and TensorRT Edge-LLM, NVIDIA s AI inference platforms for deep learning across datacenter and embedded platforms. Drawing on her background in autonomous driving and automotive OEMs, she is inspired to build production-grade inference systems that deliver best-in-class performance for deep learning workloads across data center, edge, and physical AI applications enabling systems that perceive, reason, and act in the real world. View all posts by Lin Chai About Luxiao Zheng Luxiao Zheng is a senior systems software engineer at NVIDIA. He works on the TensorRT general performance team with a specialization in Large Language Model inference workflow. He works on end-to-end LLM software development, performance measurements, analysis and improvements for x86_64 and aarch64 platforms. Luxiao holds a M.S. in Computer Science, a B.S. in Computer Science and a B.S. in Chemical Engineering from Washington University in St. Louis. View all posts by Luxiao Zheng About Fan Shi Fan Shi is a senior system software engineer on the NVIDIA TensorRT team, specializing in the efficient deployment of advanced AI models on edge platforms. His work focuses on optimizing performance and usability in deep learning inference. Fan holds an M.S. in computational data science from Carnegie Mellon University and a B.S. in statistics and computer science from the University of Illinois. View all posts by Fan Shi About Maximilien Breughe Maximilien Breughe is an engineering leader and software engineer at NVIDIA, where he works on AI inference systems and edge AI technologies. He has a background in deep learning libraries and performance engineering, and holds a PhD in Computer Architecture focused on performance simulation techniques. Maximilien is especially interested in building practical, high-performance AI systems that bridge research and real-world deployment. View all posts by Maximilien Breughe About Michael Ferry Michael Ferry is a software engineering manager on the NVIDIA TensorRT team, where he leads the TensorRT Edge-LLM, Automotive Safety, and New Platforms teams. His work centers on optimized, reliable AI inference for safety-critical robotics and automotive edge systems. Before joining NVIDIA in 2018, Michael created and led several floating-point-focused verification tools at Intel. He holds a PhD in Mathematics, specializing in numerical optimization, from the University of California, San Diego. View all posts by Michael Ferry Comments Related posts Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM Accelerating LLM and VLM Inference for Automotive and Robotics with NVIDIA TensorRT Edge-LLM NVIDIA Accelerates OpenAI gpt-oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72 NVIDIA Accelerates OpenAI gpt-oss Models Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72 Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available Optimizing Inference on Large Language Models with NVIDIA TensorRT-LLM, Now Publicly Available Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI Setting New Records in MLPerf Inference v3.0 with Full-Stack Optimizations for AI Getting the Best Performance on MLPerf Inference 2.0 Getting the Best Performance on MLPerf Inference 2.0 Related posts Federated Learning Without the Refactoring Overhead Using NVIDIA FLARE Federated Learning Without the Refactoring Overhead Using NVIDIA FLARE Mitigating Indirect AGENTS.md Injection Attacks in Agentic Environments Mitigating Indirect AGENTS.md Injection Attacks in Agentic Environments Build a More Secure, Always-On Local AI Agent with OpenClaw and NVIDIA NemoClaw Build a More Secure, Always-On Local AI Agent with OpenClaw and NVIDIA NemoClaw Bringing AI Closer to the Edge and On-Device with Gemma 4 Bringing AI Closer to the Edge and On-Device with Gemma 4 Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere Building the AI Grid with NVIDIA: Orchestrating Intelligence Everywhere L T F R E