registryVersion: 1.12.0 models: - name: Llama 3.3 70B Instruct displayName: Llama 3.3 70B Instruct modelHubID: llama-3.3-instruct category: Text Generation type: NGC description: The Llama 3.3 70B-Instruct NIM simplifies the deployment of the Llama 3.3 70B instruction-tuned model, which is optimized for language understanding, reasoning, and text generation use cases, and outperforms many of the available open-source chat models on common industry benchmarks. requireLicense: true licenseAgreements: - label: Use Policy url: https://www.llama.com/llama3_3/use-policy/ - label: License Agreement url: https://www.llama.com/llama3_3/license/ modelVariants: - variantId: Llama 3.3 70B Instruct modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-01-08T04:54:23.525Z",
    "description": "# **Llama-3.3-70B-Instruct Overview**\n\n## **Description:**\n\n**Llama-3.3-70B-Instruct** is an auto-regressive language model that uses an optimized transformer architecture. It is designed for text-based tasks such as multilingual chat, coding assistance, and synthetic data generation, and is particularly optimized for dialogue-based use cases. With 70 billion parameters, it provides strong performance that is comparable to larger models but with lower hardware requirements, and it does not process images or audio.\n\nThis model is ready for commercial/non-commercial use.\n\nThis version introduces support for GB200 NVL72, GH200 NVL2, B200 and NVFP4. CUDA updated to version 12.9. For detailed information, refer to Release [Notes for NVIDIA NIM for LLMs LLM 1.12](https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html). \n\n## **Third-Party Community Consideration**\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA\\[meta-llama/Llama-3.3-70B-Instruct\\]  \n([https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)). \n\n## **License/Terms of Use:**\n\n**GOVERNING TERMS:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/). \n\n\n**ADDITIONAL INFORMATION**: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.\n\n## **Get Help**\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.\n\n## **Deployment Geography:**\n\nGlobal \n\n## **Use Case:**\n\nThis model is intended for developers, researchers, and enterprises. They would integrate it into applications and workflows for a variety of advanced text-based tasks.\n\n* For Conversational AI: Building sophisticated and natural-sounding chatbots for customer service, multilingual virtual assistants, and interactive dialogue systems.  \n* For Software Development: Engineers might use the model as a powerful coding assistant for generating code, debugging, explaining complex algorithms, and writing documentation.  
\n* For Content Creation and Analysis: Businesses and content creators might use the model  to draft emails, generate marketing copy, summarize long documents, and create synthetic text data to train other machine learning models.\n\n## **Release Date:**\n\nBuild.Nvidia.com 12/17/2024 via  \n[llama-3.3-70b-instruct Model by Meta | NVIDIA NIM](https://build.nvidia.com/meta/llama-3_3-70b-instruct)\n\nGithub 12/13/2024 via   \n[https://github.blog/changelog/2024-12-13-llama-3-3-70b-instruct-is-now-available-on-github-models-ga/](https://github.blog/changelog/2024-12-13-llama-3-3-70b-instruct-is-now-available-on-github-models-ga/)\n\nHuggingface 12/06/2024 via   \n[https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) \n\n**Reference(s):** \n\n[https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct)\n\n## **Model Architecture:** \n\nArchitecture Type: Transformer  \nNetwork Architecture: Llama-3.3-70B\n\nThis model was developed based on Meta-Llama-3.3-70B  \n[https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct).\n\nNumber of model parameters: 7.06*10^10\n\n## **Input:**\n\nInput Type(s): Text \n\nInput Format(s): String \n\nInput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Input: The model processes input as tokens. The maximum context length is 8,192 tokens. Input text strings must be pre-processed by the model's specific Tiktoken tokenizer before being fed into the model.\n\n## **Output:**\n\nOutput Type(s): Text \n\nOutput Format(s): String\n\nOutput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Output: The model generates text as a sequence of tokens. The length of the generated output can be controlled by inference parameters. The raw token output requires post-processing (de-tokenization) to be converted into a human-readable string.\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n## **Software Integration:**\n\nRuntime Engine: vLLM, TensorRT\n\nSupported Hardware Microarchitecture Compatibility:\n\nNVIDIA Ampere  \nNVIDIA Blackwell  \nNVIDIA Hopper  \nNVIDIA Lovelace \n\nPreferred Operating System(s):\n\nLinux   \nWindows\n\nThe integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. 
Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n## **Model Version(s):**\n\nLlama-3.3-70B-Instruct\n\n## **Usage**\n\n### **Use with transformers**\n\nStarting with transformers \\>= 4.45.0 onward, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.\n\nMake sure to update your transformers installation via pip install \\--upgrade transformers.\n\nSee the snippet below for usage with Transformers:\n\n```\nimport transformers\nimport torch\n\nmodel_id = \"meta-llama/Llama-3.3-70B-Instruct\"\n\npipeline = transformers.pipeline(\n    \"text-generation\",\n    model=model_id,\n    model_kwargs={\"torch_dtype\": torch.bfloat16},\n    device_map=\"auto\",\n)\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n    {\"role\": \"user\", \"content\": \"Who are you?\"},\n]\n\noutputs = pipeline(\n    messages,\n    max_new_tokens=256,\n)\nprint(outputs[0][\"generated_text\"][-1])\n```\n\n### **Tool use with transformers**\n\nLLaMA-3.3 supports multiple tool use formats. You can see a full guide to prompt formatting [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/).\n\nTool use is also supported through [chat templates](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in Transformers. Here is a quick example showing a single simple tool:\n\n```\n# First, define a tool\ndef get_current_temperature(location: str) -> float:\n    \"\"\"\n    Get the current temperature at a location.\n    \n    Args:\n        location: The location to get the temperature for, in the format \"City, Country\"\n    Returns:\n        The current temperature at the specified location in the specified units, as a float.\n    \"\"\"\n    return 22.  # A real function should probably actually get the temperature!\n\n# Next, create a chat and apply the chat template\nmessages = [\n  {\"role\": \"system\", \"content\": \"You are a bot that responds to weather queries.\"},\n  {\"role\": \"user\", \"content\": \"Hey, what's the temperature in Paris right now?\"}\n]\n\ninputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)\n```\n\nYou can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:\n\n```\ntool_call = {\"name\": \"get_current_temperature\", \"arguments\": {\"location\": \"Paris, France\"}}\nmessages.append({\"role\": \"assistant\", \"tool_calls\": [{\"type\": \"function\", \"function\": tool_call}]})\n```\n\nand then call the tool and append the result, with the tool role, like so:\n\n```\nmessages.append({\"role\": \"tool\", \"name\": \"get_current_temperature\", \"content\": \"22.0\"})\n```\n\nAfter that, you can generate() again to let the model use the tool result in the chat. 
Note that this was a very brief introduction to tool calling \\- for more information, see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).\n\n### **Use with bitsandbytes**\n\nThe model checkpoints can be used in 8-bit and 4-bit for further memory optimisations using bitsandbytes and transformers.\n\nSee the snippet below for usage:\n\n```\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n\nmodel_id = \"meta-llama/Llama-3.3-70B-Instruct\"\nquantization_config = BitsAndBytesConfig(load_in_8bit=True)\n\nquantized_model = AutoModelForCausalLM.from_pretrained(\n    model_id, device_map=\"auto\", torch_dtype=torch.bfloat16, quantization_config=quantization_config)\n\ntokenizer = AutoTokenizer.from_pretrained(model_id)\ninput_text = \"What are we having for dinner?\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutput = quantized_model.generate(**input_ids, max_new_tokens=10)\n\nprint(tokenizer.decode(output[0], skip_special_tokens=True))\n```\n\nTo load in 4-bit, simply pass load\\_in\\_4bit=True.\n\n### **Use with llama**\n\nPlease follow the instructions in the [repository](https://github.com/meta-llama/llama).\n\nTo download the original checkpoints, see the example command below leveraging huggingface-cli:\n\n```\nhuggingface-cli download meta-llama/Llama-3.3-70B-Instruct --include \"original/*\" --local-dir Llama-3.3-70B-Instruct\n```\n\n## **Training, Testing, and Evaluation Datasets:**\n\n### **Training Dataset**\n\n**Data Modality:** Text \n\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method:** Hybrid: Human, Synthetic\n\n**Properties:** \n\nThe pre-training dataset contains over 15 trillion (15T) tokens from a diverse mix of publicly available online sources. The fine-tuning dataset consists of prompts and preference-ranked responses designed to improve helpfulness and safety.\n\n### **Testing Dataset**\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method:** Hybrid: Human, Automated\n\n**Properties:** \n\nThe public datasets cover a wide range of tasks including massive multitask language understanding (MMLU), problem-solving (GSM8K), and code generation (HumanEval). Meta's internal evaluation set contains over 2,000 prompts designed to test for safety and helpfulness across various potentially risky categories. \n\n### **Evaluation Dataset**\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Automated, Human, Synthetic\n\n**Labeling Method:** Hybrid: Human, Automated\n\n**Properties:** \n\nThe public datasets are industry-standard benchmarks designed to evaluate diverse capabilities like general knowledge, reasoning, coding, and math. For example, MMLU tests multitask knowledge, HumanEval tests code generation, and GSM8K tests grade-school math word problems. 
Meta's private evaluation set contains over 2,000 prompts for assessing safety and helpfulness.\n\n**Detailed Performance:**\n\n| Category | Benchmark | \\# Shots | Metric | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama-3.3 70B Instruct | Llama 3.1 405B Instruct |\n| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |\n|  | MMLU (CoT) | 0 | macro\\_avg/acc | 73.0 | 86.0 | 86.0 | 88.6 |\n|  | MMLU Pro (CoT) | 5 | macro\\_avg/acc | 48.3 | 66.4 | 68.9 | 73.3 |\n| Steerability | IFEval |  |  | 80.4 | 87.5 | 92.1 | 88.6 |\n| Reasoning | GPQA Diamond (CoT) | 0 | acc | 31.8 | 48.0 | 50.5 | 49.0 |\n| Code | HumanEval | 0 | pass@1 | 72.6 | 80.5 | 88.4 | 89.0 |\n|  | MBPP EvalPlus (base) | 0 | pass@1 | 72.8 | 86.0 | 87.6 | 88.6 |\n| Math | MATH (CoT) | 0 | sympy\\_intersection\\_score | 51.9 | 68.0 | 77.0 | 73.8 |\n| Tool Use | BFCL v2 | 0 | overall\\_ast\\_summary/macro\\_avg/valid | 65.4 | 77.5 | 77.3 | 81.1 |\n| Multilingual | MGSM | 0 | em | 68.9 | 86.9 | 91.1 | 91.6 |\n\n## **Technical Limitations** \n\nTesting conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, the model's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying this model in any applications, developers should perform safety testing and tuning tailored to their specific applications. Please refer to available resources including the [Responsible Use Guide](https://llama.meta.com/responsible-use-guide), [Trust and Safety](https://llama.meta.com/trust-and-safety/) solutions, and other [resources](https://llama.meta.com/docs/get-started/) to learn more about responsible development. \n\n## **Inference:**\n\n**Acceleration Engine:** vLLM, TensorRT \n\n**Test Hardware:** \n   \n* B200 SXM   \n* H200 SXM  \n* H100 SXM  \n* A100 SXM 80GB  \n* A100 SXM 40GB  \n* L40S PCIe  \n* A10G  \n* H100 NVL  \n* H200 NVL  \n* GH200 96GB    \n* GB200 NVL72\n* GH200 NVL2\n* RTX 5090  \n* RTX 4090  \n* RTX 6000 Ada\n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.",
    "displayName": "Llama-3.3-70B-Instruct",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-L26U-IFIN",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx4-throughput-bf16-jdijd32qrq",
    "latestVersionSizeInBytes": 148279696110,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama-3_3-70b-instruct.jpg",
    "modelFormat": "N/A",
    "name": "llama-3.3-70b-instruct",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "Meta",
    "safetyAndSecurity": "",
    "shortDescription": "The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out).",
    "teamName": "meta",
    "updatedDate": "2025-10-28T20:37:46.241Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/meta/containers/llama-3.3-70b-instruct optimizationProfiles: - profileId: nim/meta/llama-3.3-70b-instruct:h200x1-throughput-fp8-r-6bjqwx5a framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200x1 FP8 Throughput ngcMetadata: 02f132ac03fb2ab51b82d88abce83b64feb565c93ad1d54f3b2ab04b7c86b21f: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 70c427b55c83a3c54340d828ce94b546ad566be2ec930f0bd760a00927b4b180 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:b200x1-throughput-nvfp4-1bf1rojpxw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct B200x1 NVFP4 Throughput ngcMetadata: 09fcf7a392fe17c95e87d390742222a4a904b540f79f7b3b3d414bf3a092660b: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 33af5b8bea236de1a255b3108bcbb55e0dad3135b676d809d8ce339956cf67d4 number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:a100x2-throughput-bf16-5hyfmddv4a framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A100x2 BF16 Throughput ngcMetadata: 12c295e09aa3a3bac95522db7c0af51e27d6a4283b0402298c98691fc121a8ae: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 517f622a203fd1bdf58c0ba179d9be37fa1917c1f49e0a5aa85c7f5d3b8731b3 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 135GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:gb200x2-throughput-bf16-neynbhcsra framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GB200x2 BF16 Throughput ngcMetadata: 12f9ae91afef2d29f5ef4c312f0922ac8ed5aa877c8c49416b0dfaf9dcb902e0: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: fda9c04c123bfcdfae4f8f81847d3aee5eb51698a39dd5905d6581d780b90209 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:b200x4-latency-bf16-h4d-jgziqw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct B200x4 BF16 Latency ngcMetadata: 
135406168c0a2540196ed6f8003e35f8326cda374c6beb6f92b8b6f4883fbf0d: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 76e7be2bffcb7a930d207aa6f616ba863f1884672c64680ec0fca11bfc88304b number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 4 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 138GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:a10gx8-latency-bf16-of3qbtqvsg framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A10Gx8 BF16 Latency ngcMetadata: 168f348ad80045c0a730210c796a66ccf83768df25543f8b0567c1e186be9ad6: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 00a7b1bb5360bc13540061f29a07344a7aa2feefc46f7f7ff355131ba9d4690d number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 150GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200_NVLx2 BF16 Throughput ngcMetadata: 195071914f36a70a2b4306853667c37e6dd145c4ed787d099a0be9e75d84c58d: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 651dca39d0943930cc8b7bc0b5cd116294a25601c9f43deab2e362e3c96fde11 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A100_SXM4_40GBx8 BF16 Throughput ngcMetadata: 230323019f91e55e7e5ef0f472984bfe38672edc42d5d8f301887842e303e866: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 165e61398618addb727e82b8809cea1215b044020a8568597a31d7bee23b05e8 number_of_gpus: '8' pp: '1' precision: bf16 profile: throughput tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100_NVLx2 BF16 Throughput ngcMetadata: 252cb13923588a782037650b182dcc87562a58a6a1dc48a31519f9964dee57bd: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
adda34b085d63494164d063e4a82677e59bcde4543da432d4a550a84185434e0 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:b200x2-latency-fp8-n6ww5ulixq framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct B200x2 FP8 Latency ngcMetadata: 2d46c8f638e9000b9892b517219356e3b980aabd33f027e7c858386688febd52: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 00103c1174a5863d30a1429e5aba6b251aa676ec57460280f28c6cb61f117d98 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_144GBx2 BF16 Throughput ngcMetadata: 30809103f16d80f0f834cfec8d3a48617ac311a1e13150534e01d1c34b0a5db7: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 70078c4e36e97245d7fe026a7cac6258820c5d8df77bf2d10432c6a35007e7e2 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h200x4-latency-bf16-cp-xxbkpta framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200x4 BF16 Latency ngcMetadata: 41bc6ff1de6d3dcfe33b8070b32a89946b55b5770c92c82ffb8bb87b8e3fc9d7: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7c9d81be68e9ba750798e8c48585ebbce4d271d36981e30f09019e011d8e389a number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 4 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 139GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:a100x8-latency-bf16-qfohcfr1iq framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A100x8 BF16 Latency ngcMetadata: 443b4edfa5128abcbc85f57ca43e02053730a3fc22929e4b7864422cf5b12d16: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d8669419688b2ce0f64218aeaa11f4840e272e2d1f5d11fc5e0d1b3f53476e2d number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 8 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 
1.14.0 - key: DOWNLOAD SIZE value: 147GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_144GBx2 FP8 Latency ngcMetadata: 44edc112b59ec6736bc9fc172d7219b9999f4398e5b61a7ca692a2053e1f4fc0: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d6f8ec1ae3a5910ae26ff689e3416a0947cf1c1c1bcc7dfc8d3186e490bcb36c number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_144GBx2 FP8 Throughput ngcMetadata: 4a4dc27109678a256cf4ae5209280f044a6562ade8c2e5bca3025a096a41c551: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e075a3b9652dae46b800108afda2a6f7c0f6301a35a1db3671dc8af30f1fd5a2 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:gb200x1-throughput-nvfp4-ujocyfzf6a framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GB200x1 NVFP4 Throughput ngcMetadata: 4bf0e1bc784ba2c8b1ae399bb1042d1546bac30df98d02663d4e1db60744aabc: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: cf417dfa6ac83cc20cfb5404ec0b2eae321d174bc4808d6007df8562ffee63d8 number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:l40sx4-latency-fp8-vl02sw2m-g framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct L40Sx4 FP8 Latency ngcMetadata: 52050fe50397b0b158fafe24a0c1e74efad0d04351274757337c86fc99968dd9: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a9d77447899d9eb9de5254bf262250c7321a6522f30d485abd4072fc1de36dcc number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_144GBx1 FP8 Throughput ngcMetadata: 
5b6330f563a4c3f73c9b02dc126295dd85d954c0623581981d1c6179155d9f7b: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0f31befe5670c8fd4ae2429ceaa76edcfcbcdfb96db0375e2671df999e4038c7 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_480GBx1 FP8 Latency ngcMetadata: 5e578516ea42fae60c4f314736e8d3e506c497894e059a6af96bd4c2c84edf23: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 68e492b05ff7304cf14489a9a313eb7049ab552fb27b106d1dc61af71a5b7c29 number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:a10gx8-throughput-bf16-rl2yes9ktw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A10Gx8 BF16 Throughput ngcMetadata: 613255b124f05cbf875c142c5ea7c2e3ebb7754a8a5473ad828d2bb07e2eaa88: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c0f0a6abdd6734299ec6f65611fa66490fd46303035100f9c865dd5d3c1dfb19 number_of_gpus: '8' pp: '1' precision: bf16 profile: throughput tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 150GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h200x2-throughput-bf16-lciwvjwxkw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200x2 BF16 Throughput ngcMetadata: 64878d614ca9a859228cd55d140af0865823c2f3524e43c7be53c01c039481b6: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 824477061f9c69bd79fd248a136a273e8d861d092fb853ede5e06e12510d8188 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct RTX6000_BLACKWELL_SVx8 BF16 Latency ngcMetadata: 706687e8d19dccfb16a39808c18e54b9e55f7a5d6c2384df2c805453445ee4bb: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
095fa13f893ed4235a19615963e6b18bdb3e599ad5631c493007ae59dfe73f46 number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 8 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:a100x4-throughput-bf16-lyvveim8va framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A100x4 BF16 Throughput ngcMetadata: 76e28450af746bb7626af7e5e2db4b57b56f11f5b6632a120eefabba925c2b15: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f14e1bad1a0e78da150aeedfee7919ab3ef21def09825caffef460b93fdde9b7 number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 4 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 140GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:l40sx4-throughput-bf16-rhzeshgk8w framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct L40Sx4 BF16 Throughput ngcMetadata: 7cb838de5dad2c42066f0616756d0ad2708939c450b95416d41098e9931470c1: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c419c6ba54c118a6deb6ed9918e9c72e7f151698116c1d3c2bc32042a94d6bbb number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 140GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h100x2-throughput-bf16-m9pz-s1ymq framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100x2 BF16 Throughput ngcMetadata: 7ed84ed093e8c5e8d237966262d640c6c2f160a8606df22e869e6f7a5a83cc96: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 5006533ca6b5151e94f18d8e518c68965918f248d0680b23e9fc0e4553e0d9ef number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h100x2-throughput-fp8-wbna-gqhxw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100x2 FP8 Throughput ngcMetadata: 7f4107d806d19c2c2beb2e870bf01217de37a247f27ee168985fc42a9576c641: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0013e870ea929584ec13dad6948450024cdc6c2f03a865f1b050fb08b9f64312 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM 
VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct A100_SXM4_40GBx8 BF16 Latency ngcMetadata: 814d03ce098b7de458602c7bce320c3d06fe898759577c849a34193653a70bbb: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7a207406eaa12a8bb549ea578116338e4e204d3b38ed0ffb6a9d9d789f2cd994 number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200_NVLx1 FP8 Throughput ngcMetadata: 829d3e1c28ffd52afed2d35e9374cfc7b605eda5a630bc9b33fbea5500da8fb3: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 34a1cf18c4d7501df008280668fe6df7de1f91ff29daee8d5c80291dd6e51b0e number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:b200x2-latency-nvfp4-prgjwnsudw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct B200x2 NVFP4 Latency ngcMetadata: 8abcb1c5fc3e57d712a311f08f9b33b59b383196b95f7d7f66c758de85d56567: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 807294ccda05820ac7bbb9cf0471df7494e947226acff080c0782bda0c7d4394 number_of_gpus: '2' pp: '1' precision: nvfp4 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h200x2-latency-fp8-ozazyo6fjw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200x2 FP8 Latency ngcMetadata: 92e9707e66c742310e9a7a6d38e162b2578375c8fe0844939c499a00116a994e: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 138ef4644a3d6477c3deaf2cd22f548d3396925db62f4752fb73b52b7b8a4a29 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:gb200x1-latency-nvfp4-gbqmrrkwrw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GB200x1 NVFP4 Latency ngcMetadata: 
a1366af9ab8c32f147d10d0fcc2a43d55b20f2c79178b4a291caa5dec55f966c: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 19ca51edfcfaecd4c68b0950ff57be89e59def4ad003dbcfae4352b43d152223 number_of_gpus: '1' pp: '1' precision: nvfp4 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h100x4-latency-fp8-mg52y2fpwq framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100x4 FP8 Latency ngcMetadata: a2003c7b2b19b79aefb52cd9daa58fb20f0520dd9759037ff34e67110f384218: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 8a5f27c50cf45f7d1a1e504bcd33820eefa80539b94a68bbf015c3f4f4cb2c3f number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct RTX6000_BLACKWELL_SVx4 BF16 Throughput ngcMetadata: a403f6513a44565063a70541681355465810849c0f537c825cd6575c960c2c14: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4896c9f159be7403ca983e4da47959b87841d5fe0034304ab473baf61f3132a1 number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:gb200x1-latency-fp8-uepcd7pd4a framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GB200x1 FP8 Latency ngcMetadata: a425a0f4eef147092d6d41acbd7c9c3408614205b8135699274b02f2363b707c: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0cedb0518e3995aa41d37920a83b151ad05bdf2a43beedbff21b709cf696e350 number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_144GBx2 BF16 Latency ngcMetadata: a6f328cf048298b737a05799b74a3f81b4a215f125d71088054c6c32f3446801: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
7e9757ebb03d4334fd350490505620d2af6b5329aa8a28df931e0a22e46d55cd number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct RTX6000_BLACKWELL_SVx2 FP8 Throughput ngcMetadata: a9f34dd0f8e4fd295b0d04067aa0ecce24aa3707b26305e9ab084d430546975c: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 77ab630b949b0a58ad580a22ea055bc392a30fbf57357d6398814e00775aab8c number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:gb200x1-throughput-fp8-ybdaheki0g framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GB200x1 FP8 Throughput ngcMetadata: af09a13bcaa3650952df251a0dfd03dabaf7700a6d00b6f2264b2c9ef757fbb6: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b6dc07bb5bf5be874355bbe6288ca066c605a43c23d6c537ac9d4929c22d2cdd number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GH200_480GBx1 FP8 Throughput ngcMetadata: c7fc979432a42458118ab456c33302cbde984c5d8a0035e9d2c1d07b5f3dc0d9: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 1dea2d2f10ec64c74ca127f73b52bf5253dfdc91c5cd5da07cb742e166e8a795 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:l40sx4-throughput-fp8-sx6as-ue-a framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct L40Sx4 FP8 Throughput ngcMetadata: cf120c3ecf2025e6a170cb224802ca6a02cbeec3ad74944a69263b3193a64fa2: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b118ae4fb04a6bbcf439004b94edd4815d2c965a0c692c2b98a790580c9c3f7b number_of_gpus: '4' pp: '1' precision: fp8 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S 
- key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H200_NVLx2 FP8 Latency ngcMetadata: cf5787bfa25e0f21603c8aa6458d2ae062691d0fa81e684dc219082ba39fb1d9: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 552ad035a0898be37be03c9d539efbda5a7d2f214b2c5950e14bb694ad8329a9 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct RTX6000_BLACKWELL_SVx2 NVFP4 Throughput ngcMetadata: d650534ce98fea4bfc9924d77c91fbd8dca227321c35557e924297ab6b9008cb: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 19aeb73125023f25e273ed14ccc69b935b2ce5131d4d91d1b78f3e8bdc0366b7 number_of_gpus: '2' pp: '1' precision: nvfp4 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:gb200x4-latency-bf16-vuvdg5jkzq framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct GB200x4 BF16 Latency ngcMetadata: d7ff5f88620f7fbe0538931af334663b43d15cb2c969e7fc96375ac60108906f: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: be07050242f7ce67689c0d81de40bb1de6967dd251a881bcb784193fb92d8183 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 4 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 138GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:b200x2-throughput-bf16-omzr8lu67g framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct B200x2 BF16 Throughput ngcMetadata: e0ac049ec460cc8dfe59feaec6d12ae55807dac2b0bd62396c36679f2674e330: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6d1452af26f860b53df112c90f6b92f22a41156c09dafa2582c2c1194e56a673 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: 
TensorRT-LLM displayName: Llama 3.3 70B Instruct H100_NVLx2 FP8 Throughput ngcMetadata: e518c22e6d4135300fc5c10bd0c4d195c51ac596e8950172e303bcce84794732: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3035d73242fb579040fb3f341adc36a7073f780419e73dd97edb7ce35cb0f550 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:l40sx4-latency-bf16-rasfmhw4uw framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct L40Sx4 BF16 Latency ngcMetadata: e6d1855d3f24e439b904cf1fd47d3e136bec4af9134c039558c61f9ae34593af: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 8747b7e093d3b26e808e8bbebdb50c3ac0a0f82402c58b3430a8760ff96e406e number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 138GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct RTX6000_BLACKWELL_SVx4 FP8 Latency ngcMetadata: ee0b992fafa65ffe00e8df84f80f9e417a400ec40b60b6769db81498482610d7: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3140e28251686b824ea3fd4d45a86cef01b156d1737ada0b6783b612ac3b6e92 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100_NVLx4 BF16 Latency ngcMetadata: efcb2762954af78c9b84774917daf706fd8d663df3d54c298a1fb9d2fb86a119: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3d2f50e0423aa98250617f6a0dad719bed6892994a47c60e092ce494d93e9bce number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:b200x1-throughput-fp8-xk4doibibg framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct B200x1 FP8 Throughput ngcMetadata: f215c1f1608a7818f6c465646f8f8cb412a58b39c99e4a15857466fb9a970aef: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: 
tensorrt_llm nim_workspace_hash_v1: 6979353282e6f8421f9ffd76c33eb1e675f796fc7ed036c6038b99a21d649f18 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct RTX6000_BLACKWELL_SVx4 NVFP4 Latency ngcMetadata: f9c5befd972751383a8dfa7b38fb77fd4c69af4e015136f0a194b7db0176ce59: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4f697999cecdc5afc7ff8f588b71a5b7683117aa866f34ab76886db2dbe86dcc number_of_gpus: '4' pp: '1' precision: nvfp4 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:h100x4-throughput-bf16-bpwvcpvnsq framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100x4 BF16 Throughput ngcMetadata: fbec99d055ebc70d1261d9520f1f6f854fb0a84771bdadde30668dca1f081c7d: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2eb1d578e4e069c384bf617e5354889d043a1c72b77f432c07e06ffb1b8be36b number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 139GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.3-70b-instruct:6f6073b423013f6a7d4d9f39144961bfbfbc386b framework: TensorRT-LLM displayName: Llama 3.3 70B Instruct H100_NVLx4 FP8 Latency ngcMetadata: ff1a26a9837e3e3122a70f91d46181b15f22ba8276c47b0d852cabde8a6a5460: model: meta/llama-3.3-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9b6105c7bf6521bd8eb6fa1badcd239636f35c06317166bf78d58a8cc239411f number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM labels: - Llama - Meta - Chat - Text Generation - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Llama 3.3 Nemotron Super 49B displayName: Llama 3.3 Nemotron Super 49B modelHubID: llama-3.3-nemotron-super-49b category: Chatbots type: NGC description: Llama-3.3-Nemotron-Super-49B v1 and v1.5 are language models that can follow instructions, complete requests, and generate creative text formats. The Llama-3.3-Nemotron-Super-49B v1 series of Large Language Models (LLMs) are instruction-tuned versions of the Llama-Nemotron. 
requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: Llama 3.3 Nemotron Super 49B V1 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "|Field:|Response:|\n|:---:|:---:|\n|Participation considerations from adversely impacted groups (protected classes) in model design and testing:|None|\n|Measures taken to mitigate against unwanted bias:|None|",
    "canGuestDownload": false,
    "createdDate": "2025-03-16T22:25:08.619Z",
    "description": "# Llama-3.3-Nemotron-Super-49B-v1\n\n\n## Model Overview \n\nLlama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.3-70B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. The model supports a context length of 128K tokens.\n\nLlama-3.3-Nemotron-Super-49B-v1 is a model which offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model\u2019s memory footprint, enabling larger workloads, as well as fitting the model on a single GPU at high workloads (H200). This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff.\n\nThe model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using REINFORCE (RLOO) and Online Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and Online RPO checkpoints. For more details on how the model was trained, please see [this blog](https://developer.nvidia.com/blog/build-enterprise-ai-agents-with-advanced-open-nvidia-llama-nemotron-reasoning-models/).\n\n![Architecture Diagram](https://assets.ngc.nvidia.com/products/api-catalog/llama-3_3-nemotron-super-49b-v1/diagram.jpg)\n\nThis model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: \n[Llama-3_1-Nemotron-Nano-8B-v1](https://build.nvidia.com/nvidia/llama-3_1-nemotron-nano-8b-v1)\n\nThis model is ready for commercial use. \n\n## License/Terms of Use\n\nGOVERNING TERMS: Your use of this model is governed by the [NVIDIA Open Model License.](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n**Model Developer:** NVIDIA\n\n**Model Dates:** Trained between November 2024 and February 2025\n\n**Data Freshness:**  The pretraining data has a cutoff of 2023 per Meta Llama 3.3 70B\n\n### Use Case: <br>\nDevelopers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. Also suitable for typical instruction-following tasks. <br>\n\n### Release Date:  <br>\n3/18/2025 <br>\n\n## References\n* [2502.00203] [Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)\n\n## Model Architecture\n**Architecture Type:** Dense decoder-only Transformer model  \n**Network Architecture:** Llama 3.3 70B Instruct, customized through Neural Architecture Search (NAS)\n\nThe model is a derivative of Meta\u2019s Llama-3.3-70B-Instruct, using Neural Architecture Search (NAS). The NAS algorithm results in non-standard and non-repetitive blocks. This includes the following: \nSkip attention: In some blocks, the attention is skipped entirely, or replaced with a single linear layer.\nVariable FFN: The expansion/compression ratio in the FFN layer is different between blocks. 
\n\nWe utilize a block-wise distillation of the reference model, where for each block we create multiple variants providing different tradeoffs of quality vs. computational complexity, discussed in more depth below. We then search over the blocks to create a model which meets the required throughput and memory (optimized for a single H100-80GB GPU) while minimizing the quality degradation. The model then undergoes knowledge distillation (KD), with a focus on English single and multi-turn chat use-cases. The KD step included 40 billion tokens consisting of a mixture of 3 datasets - FineWeb, Buzz-V1.2 and Dolma.\n\n## Intended use\n\nLlama-3.3-Nemotron-Super-49B-v1 is a general purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported. \n\n## Input\n- **Input Type:** Text\n- **Input Format:** String\n- **Input Parameters:** One-Dimensional (1D)\n- **Other Properties Related to Input:** Context length up to 131,072 tokens\n\n## Output\n- **Output Type:** Text\n- **Output Format:** String\n- **Output Parameters:** One-Dimensional (1D)\n- **Other Properties Related to Output:** Context length up to 131,072 tokens\n\n## Model Version\n1.0 (3/18/2025)\n\n## Software Integration\n- **Runtime Engine:** Transformers\n- **Recommended Hardware Microarchitecture Compatibility:** \n   - NVIDIA Hopper\n   - NVIDIA Ampere\n\n## Quick Start and Usage Recommendations:\n\n1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the example below. All instructions should be contained within the user prompt\n2. We recommend setting temperature to `0.6`, and Top P to `0.95` for Reasoning ON mode\n3. We recommend using greedy decoding for Reasoning OFF mode\n4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required\n\nYou can try this model out through the preview API, using this link: [Llama-3_3-Nemotron-Super-49B-v1](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1).\n\n## Inference:\n**Engine:**\nTransformers  \n**Test Hardware:**\n- FP8: 1x NVIDIA H100-80GB GPU (Coming Soon)\n- BF16: \n   - 2x NVIDIA H100-80GB GPUs\n   - 2x NVIDIA A100-80GB GPUs\n      \n**[Preferred/Supported] Operating System(s):** Linux <br>\n\n## Training Datasets\n\nA large variety of training data was used for the knowledge distillation phase before post-training pipeline, 3 of which included: FineWeb, Buzz-V1.2, and Dolma.\n\nThe data for the multi-stage post-training phases for improvements in Code, Math, and Reasoning is a compilation of SFT and RL data that supports improvements of math, code, general reasoning, and instruction following capabilities of the original Llama instruct model. \n\nIn conjunction with this model release, NVIDIA has released 30M samples of post-training data, as public and permissive. [Llama-Nemotron-Post-Training-Dataset-v1](https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset-v1)\n\nDistribution of the domains is as follows:\n\n| Category | Value     |\n|----------|-----------|\n| math     | 19,840,970|\n| code     | 9,612,677 |\n| science     | 708,920    |\n| instruction following       | 56,339    |\n| chat     | 39,792    |\n| safety   | 31,426    |\n\nPrompts have been sourced from either public and open corpus or synthetically generated. 
Responses were synthetically generated by a variety of models, with some prompts containing responses for both reasoning on and off modes, to train the model to distinguish between two modes. \n\nModels that were used in the creation of this dataset:\n- Llama-3.3-70B-Instruct\n- Llama-3.1-Nemotron-70B-Instruct\n- Llama-3.3-Nemotron-70B-Feedback/Edit/Select\n- Mixtral-8x22B-Instruct-v0.1\n- DeepSeek-R1\n- Qwen-2.5-Math-7B-Instruct\n- Qwen-2.5-Coder-32B-Instruct\n- Qwen-2.5-72B-Instruct\n- Qwen-2.5-32B-Instruct\n\n**Data Collection for Training Datasets:**\nHybrid: Automated, Human, Synthetic\n\n**Data Labeling for Training Datasets:**\nHybrid: Automated, Human, Synthetic\n\n## Evaluation Datasets \n\nWe used the datasets listed below to evaluate Llama-3.3-Nemotron-Super-49B-v1. \n\n**Data Collection for Evaluation Datasets:**\nHybrid: Human/Synthetic\n\n**Data Labeling for Evaluation Datasets:**\nHybrid: Human/Synthetic/Automatic\n\n## Evaluation Results\nThese results contain both Reasoning On, and Reasoning Off. We recommend using temperature=`0.6`, top_p=`0.95` for Reasoning On mode, and greedy decoding for Reasoning Off mode. All evaluations are done with 32k sequence length. We run the benchmarks up to 16 times and average the scores to be more accurate.\n\n> NOTE: Where applicable, a Prompt Template will be provided. While completing benchmarks, please ensure that you are parsing for the correct output format as per the provided prompt in order to reproduce the benchmarks seen below. \n\n### Arena-Hard\n\n| Reasoning Mode | Score |\n|--------------|------------|\n| Reasoning Off | 88.3 | \n\n\n### MATH500\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 74.0 | \n| Reasoning On | 96.6  |\n\nUser Prompt Template: \n```\n\"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \\boxed{}.\\nQuestion: {question}\"\n```\n\n### AIME25\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 13.33 | \n| Reasoning On | 58.4 |\n\nUser Prompt Template: \n```\n\"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \\boxed{}.\\nQuestion: {question}\"\n```\n\n### GPQA\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 50 | \n| Reasoning On | 66.67 |\n\nUser Prompt Template: \n```\n\"What is the correct answer to this question: {question}\\nChoices:\\nA. {option_A}\\nB. {option_B}\\nC. {option_C}\\nD. {option_D}\\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \\boxed{}\"\n```\n\n### IFEval\n\n| Reasoning Mode | Strict:Instruction |\n|--------------|------------|\n| Reasoning Off | 89.21 | \n\n### BFCL V2 Live\n\n| Reasoning Mode | Score |\n|--------------|------------|\n| Reasoning Off | 73.7 | \n\nUser Prompt Template:\n```\nYou are an expert in composing functions. You are given a question and a set of possible functions. \nBased on the question, you will need to make one or more function/tool calls to achieve the purpose. \nIf none of the function can be used, point it out. If the given question lacks the parameters required by the function,\nalso point it out. 
You should only return the function call in tools call sections.\n\nIf you decide to invoke any of the function(s), you MUST put it in the format of <TOOLCALL>[func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]</TOOLCALL>\n\nYou SHOULD NOT include any other text in the response.\nHere is a list of functions in JSON format that you can invoke.\n\n<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>\n\n{user_prompt}\n```\n\n### MBPP 0-shot\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 84.9| \n| Reasoning On | 91.3 |\n\nUser Prompt Template:\n````\nYou are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.\n\n@@ Instruction\nHere is the given problem and test examples:\n{prompt}\nPlease use the python programming language to solve this problem.\nPlease make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.\nPlease return all completed codes in one code block.\nThis code block should be in the following format:\n```python\n# Your codes here\n```\n````\n\n### MT-Bench\n\n| Reasoning Mode | Score |\n|--------------|------------|\n| Reasoning Off | 9.17 |\n\n## Ethical Considerations:\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. \n\nFor more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards, which you can find by clicking the [ModelCard++ tab](https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/models/llama-3.3-nemotron-super-49b-v1/bias) above, next to Overview.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).",
    "displayName": "Llama-3.3-Nemotron-Super-49B-v1",
    "explainability": "|Field:|Response:|\n|:---:|:---:|\n|Intended Application(s) & Domain(s):| Text generation, summarization, question answering. Focused on users and customers who want to get good accuracy-efficiency (price) tradeoff.\n|Model Type:|Text-to-text transformer|\n|Intended Users:|This model is intended for developers, researchers, and customers building/utilizing LLMs, while balancing accuracy and efficiency|\n|Output:|Text String(s)|\n|Describe how the model works:|Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers|\n|Technical Limitations:The Model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text. |\n|Verified to have met prescribed quality standards?|Yes|\n|Performance Metrics:|Accuracy, Throughput, and user-side throughput|\n|Potential Known Risks:|The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.|\n|End User License Agreement:| Your use of this model is governed by the [NVIDIA Open Model License.](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-B2P7-A6NC",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "hf-1a2cb80-nim-0613-tool-use-v2",
    "latestVersionSizeInBytes": 99752363705,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama-3_3-nemotron-49b-instruct.jpg",
    "modelFormat": "SavedModel",
    "name": "llama-3.3-nemotron-super-49b-v1",
    "orgName": "nim",
    "precision": "OTHER",
    "privacy": "|Field:|Response:|\n|:---:|:---:|\n|Generatable or Reverse engineerable personally-identifiable information?|None|\n|Was consent obtained for any personal data used?|None Known|\n|Personal data used to create this model?|None Known|\n|How often is dataset reviewed?|Before Release|\n|Is there provenance for all datasets used in training?|Yes|\n|Does data labeling (annotation, metadata) comply with privacy laws?|Not Applicable|\n|Is data compliant with data subject requests for data correction or removal, if such a request was made?|Not Applicable|\n|Applicable NVIDIA Privacy Policy|https://www.nvidia.com/en-us/about-nvidia/privacy-policy/|",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "|Field:|Response:|\n|:---:|:---:|\n|Model Application(s):|Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning|\n|Describe life critical application (if present):|None Known|\n|Use Case Restrictions:| Abide by the [NVIDIA Open Model License.](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.|\n|Model and Dataset Restrictions:|The Principle of least privilege (PoLP) is applied limiting access for dataset generation.  Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face and NGC, and may become available on cloud providers' model catalog.|",
    "shortDescription": "Llama-3.3-Nemotron-Super-49B-v1 is a large language model (LLM) which is a derivative of Meta\u2019s Llama-3.3-70B-Instruct (AKA the reference model).",
    "teamName": "nvidia",
    "updatedDate": "2025-07-16T16:58:34.843Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama-3.3-nemotron-super-49b-v1 optimizationProfiles: - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:a100x2-throughput-bf16-ozhgcnodhw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 A100x2 BF16 Throughput ngcMetadata: 0db3b5e8468c9debf30bcf41cbfea084adc59000885efd6fdcb3bbb902651bd6: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h100x2-throughput-bf16-aie7yrqp4q framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100x2 BF16 Throughput ngcMetadata: 1617d074ce252f66e96d5f0e331fa5c6cc0a0330519e56b5c66c60eb7d7bf4f9: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:b200x2-throughput-fp8-hq3hflct5a framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 B200x2 FP8 Throughput ngcMetadata: 26bd84b107a99415b474267bec4cbcf932fbb28e45d7fb4e4db2971506825888: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100_NVLx4 BF16 Latency ngcMetadata: 28552abdb2c491d46065d52ca1dc1265b99ba95a5bf8daaee4c5de12511a3b4f: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h200x1-throughput-bf16-2mmw837ykw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H200x1 BF16 Throughput ngcMetadata: 434e8d336fa23cbe151748d32b71e196d69f20d319ee8b59852a1ca31a48d311: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' 
pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 94GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:b200x2-latency-fp8-1m3h4ytjug framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 B200x2 FP8 Latency ngcMetadata: 4950d30811e1e426e97cda69e6c03a8a4819db8aa4abf34722ced4542a1f6b52: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100_NVLx1 FP8 Throughput ngcMetadata: 5811750e70b7e9f340f4d670c72fcbd5282e254aeb31f62fd4f937cfb9361007: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h200x2-latency-bf16-19zfbhbq3g framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H200x2 BF16 Latency ngcMetadata: 6832a9395f54086162fd7b1c6cfaae17c7d1e535a60e2b7675504c9fc7b57689: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 A100_SXM4_40GBx4 BF16 Throughput ngcMetadata: 6c29727e6e3d48a900c348c1fab181dc40bc926be07b06ca5b8eae42a6bc9901: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 4 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h100x2-latency-fp8-3wbe0ygpmg framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100x2 FP8 Latency ngcMetadata: 
6c3f01dd2b2a56e3e83f70522e4195d3f2add70b28680082204bbb9d6150eb04: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h100x4-latency-bf16-9sreahcbuq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100x4 BF16 Latency ngcMetadata: 73f41fabbb60beb5b05ab21c8dcce5c277d99bcabec31abf46a0194d0dd18d04: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h100x1-throughput-fp8-mhpv-tjmtq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100x1 FP8 Throughput ngcMetadata: 7b508014e846234db3cabe5c9f38568b4ee96694b60600a0b71c621dc70cacf3: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 A100_SXM4_40GBx8 BF16 Latency ngcMetadata: 8a446393aaeb0065ee584748c7c03522389921a11ff2bd8cb5800e06a8644eb0: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100_NVLx2 FP8 Latency ngcMetadata: a00ce1e782317cd19ed192dcb0ce26ab8b0c1da8928c33de8893897888ff7580: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - 
profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:b200x1-throughput-bf16-e8quw21o2g framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 B200x1 BF16 Throughput ngcMetadata: a4c63a91bccf635b570ddb6d14eeb6e7d0acb2389712892b08d21fad2ceaee38: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H100_NVLx2 BF16 Throughput ngcMetadata: acd73fcee9d91ada305118080138fb3ca4d255adee3312acda38c4487daae476: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:l40sx4-latency-fp8-dm0yeik1qq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 L40Sx4 FP8 Latency ngcMetadata: bdd0d3cd53fad1130259beea81ab5711fb98f2f1a020b5b26c3c82fd7d43c5af: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 50GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h200x2-throughput-fp8-pn0bsx2fww framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H200x2 FP8 Throughput ngcMetadata: c91a755246cb08dd9aa6905bc40b7db552071d141a850be5a791b06eb4fb2ef8: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:a100x4-latency-bf16-htlclkizog framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 A100x4 BF16 Latency ngcMetadata: d73b7cf2f719d720329fc65fc255ae901bc3beebdc59be9815ede1a07948c1f7: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU 
value: A100 - key: COUNT value: 4 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:h200x2-latency-fp8-v0ho-fvz0g framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 H200x2 FP8 Latency ngcMetadata: e4f217a5fb016b570e34b8a8eb06051ccfef9534ba43da973bb7f678242eaa5f: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:b200x2-latency-bf16-moifcs7ehq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 B200x2 BF16 Latency ngcMetadata: f44768c625db71a327cf17e750d5e1a8e60171a8d8ef6b4c1c4b57fe74c9bf46: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 Generic NVIDIA GPUx8 BF16 ngcMetadata: 1d7b604f835f74791e6bfd843047fc00a5aef0f72954ca48ce963811fb6f3f09: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '8' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 8 - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 Generic NVIDIA GPUx2 BF16 ngcMetadata: 375dc0ff86133c2a423fbe9ef46d8fdf12d6403b3caa3b8e70d7851a89fc90dd: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '2' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 2 - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1:hf-1a2cb80-nim-0613-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1 Generic NVIDIA GPUx4 BF16 ngcMetadata: 54946b08b79ecf9e7f2d5c000234bf2cce19c8fee21b243c1a084b03897e8c95: model: nvidia/llama-3.3-nemotron-super-49b-v1 release: 1.10.1 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '4' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 4 - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: 
TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - variantId: Llama 3.3 Nemotron Super 49B V1.5 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "### **Bias**\n\n|Field:|Response:|\n|:---:|:---:|\n|Participation considerations from adversely impacted groups (protected classes) in model design and testing:|None|\n|Measures taken to mitigate against unwanted bias:|None|",
    "canGuestDownload": false,
    "createdDate": "2025-09-16T20:19:24.618Z",
    "description": "## **Llama-3.3-Nemotron-Super-49B-v1.5 Overview**\n\n## **Description:**\n\n**Llama-3.3-Nemotron-Super-49B-v1.5** is a significantly upgraded version of Llama-3.3-Nemotron-Super-49B-v1 and is a large language model (LLM) which is a derivative of Meta Llama-3.3-70B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and agentic tasks, such as Retrieval-Augmented Generation (RAG) and tool calling. The model supports a context length of 128K tokens.\n\nLlama-3.3-Nemotron-Super-49B-v1.5 is a model which offers a great tradeoff between model accuracy and efficiency. Efficiency (throughput) directly translates to savings. Using a novel Neural Architecture Search (NAS) approach, we greatly reduce the model\u2019s memory footprint, enabling larger workloads, as well as fitting the model on a single GPU at high workloads (H200). This NAS approach enables the selection of a desired point in the accuracy-efficiency tradeoff. For more information on the NAS approach, please refer to [this paper](https://arxiv.org/abs/2411.19146).\n\nThe model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Science, and Tool Calling. Additionally, the model went through multiple stages of Reinforcement Learning (RL) including Reward-aware Preference Optimization (RPO) for chat, Reinforcement Learning with Verifiable Rewards (RLVR) for reasoning, and iterative Direct Preference Optimization (DPO) for Tool Calling capability enhancements. The final checkpoint was achieved after merging several RL and DPO checkpoints.\n\nThis model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here:\n\n* [Llama-3.1-Nemotron-Nano-4B-v1.1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1)  \n* [Llama-3.1-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)\n\nThis model is ready for commercial use.\n\n## **License/Terms of Use**\n\n**GOVERNING TERMS:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/).\n\n**Additional Information:** [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.\n\n##**Get Help**\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n**You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.**\n\n## **Deployment Geography:**\n\nGlobal\n\n## **Use Case:**\n\nLlama-3.3-Nemotron-Super-49B-v1.5 is a general purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported. Developers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. 
Also suitable for typical instruction-following tasks.\n\n## **Release Date:**\n\n* Hugging Face 7/25/2025 via [Llama-3\\_3-Nemotron-Super-49B-v1\\_5](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)  \n* Build.NVIDIA.com 7/25/2025 [Llama-3\\_3-Nemotron-Super-49B-v1\\_5](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5)\n\n## **References**\n\n* [\\[2505.00949\\] Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)  \n* [\\[2502.00203\\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)  \n* [\\[2411.19146\\]Puzzle: Distillation-Based NAS for Inference-Optimized LLMs](https://arxiv.org/abs/2411.19146)\n\n## **Model Architecture**\n\n**Architecture Type:** Dense decoder-only Transformer model\n\n**Network Architecture:** Llama 3.3 70B Instruct, customized through Neural Architecture Search (NAS)\n\nThe model is a derivative of Meta\u2019s Llama-3.3-70B-Instruct, using Neural Architecture Search (NAS). The NAS algorithm results in non-standard and non-repetitive blocks. This includes the following:\n\nSkip attention: In some blocks, the attention is skipped entirely, or replaced with a single linear layer. Variable FFN: The expansion/compression ratio in the FFN layer is different between blocks.\n\nWe utilize a block-wise distillation of the reference model, where for each block we create multiple variants providing different tradeoffs of quality vs. computational complexity, discussed in more depth below. We then search over the blocks to create a model which meets the required throughput and memory (optimized for a single H100-80GB GPU) while minimizing the quality degradation. The model then undergoes knowledge distillation (KD), with a focus on English single and multi-turn chat use cases. The KD step included 40 billion tokens consisting of a mixture of 3 datasets \\- FineWeb, Buzz-V1.2, and Dolma.\n\n**Intended Use**  \nLlama-3.3-Nemotron-Super-49B-v1.5 is a general purpose reasoning and chat model intended to be used in English and coding languages. Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported.\n\n## **Input**\n\n* **Input Type(s):** Text  \n* **Input Format:** String  \n* **Input Parameters:** One-Dimensional (1D)  \n* **Other Properties Related to Input:** Context length up to 131,072 tokens\n\n## **Output**\n\n* **Output Type(s):** Text  \n* **Output Format:** String  \n* **Output Parameters:** One-Dimensional (1D)  \n* **Other Properties Related to Output:** Context length up to 131,072 tokens\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA\u2019s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n## **Software Integration:**\n\n* **Runtime Engine:** vLLM, TensorRT  \n* **Supported Hardware Microarchitecture Compatibility:**  \n  * NVIDIA Ampere  \n  * NVIDIA Hopper  \n* **Preferred Operating System(s):** Linux\n\nThe integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. 
Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n## **Model Version:**\n\nLlama-3.3-Nemotron-Super-49B-v1.5-1.12.0 <br>\nLlama-3.3-Nemotron-Super-49B-v1.5-1.13.1 <br>\nLlama-3.3-Nemotron-Super-49B-v1.5-1.14.0 <br>\n\n## **Quick Start and Usage Recommendations:**\n\n* By default (empty system prompt), the model will respond in reasoning ON mode. Setting /no\\_think in the system prompt will enable reasoning OFF mode.  \n* We recommend setting temperature to 0.6 and Top P to 0.95 for Reasoning ON mode.  \n* We recommend using greedy decoding for Reasoning OFF mode.\n\nYou can try this model out through the preview API, using this link: [Llama-3\\_3-Nemotron-Super-49B-v1\\_5](https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1_5).\n\n**Use It with vLLM**\n\npip install vllm==0.9.2\n\nAn example on how to serve with vLLM:\n\n```\n$ python3 -m vllm.entrypoints.openai.api_server \\\n  --model \"nvidia/Llama-3_3-Nemotron-Super-49B-v1_5\" \\\n  --trust-remote-code \\\n  --seed=1 \\\n  --host=\"0.0.0.0\" \\\n  --port=5000 \\\n  --served-model-name \"Llama-3_3-Nemotron-Super-49B-v1_5\" \\\n  --tensor-parallel-size=8 \\\n  --max-model-len=65536 \\\n  --gpu-memory-utilization 0.95 \\\n  --enforce-eager\n```\n\n**Running a vLLM Server with Tool-call Support**\n\nTo enable tool calling usage with this model, we provide a tool parser in the repository. Here is an example on how to use it:\n\n```\n$ git clone https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5\n\n$ conda create -n vllm python=3.12 -y\n$ conda activate vllm\n$ pip install vllm==0.9.2\n\n$ python3 -m vllm.entrypoints.openai.api_server \\\n  --model Llama-3_3-Nemotron-Super-49B-v1_5 \\\n  --trust-remote-code \\\n  --seed=1 \\\n  --host=\"0.0.0.0\" \\\n  --port=5000 \\\n  --served-model-name \"Llama-3_3-Nemotron-Super-49B-v1_5\" \\\n  --tensor-parallel-size=8 \\\n  --max-model-len=65536 \\\n  --gpu-memory-utilization 0.95 \\\n  --enforce-eager \\\n  --enable-auto-tool-choice \\\n  --tool-parser-plugin \"Llama-3_3-Nemotron-Super-49B-v1_5/llama_nemotron_toolcall_parser_no_streaming.py\" \\\n  --tool-call-parser \"llama_nemotron_json\"\n```\n\nAfter launching a vLLM server, you can call the server with tool-call support using a Python script like below.\n\n```py\nfrom openai import OpenAI\nclient = OpenAI(\n    base_url=\"http://0.0.0.0:5000/v1\",\n    api_key=\"dummy\",\n)\ncompletion = client.chat.completions.create(\n    model=\"Llama-3_3-Nemotron-Super-49B-v1_5\",\n    messages=[\n        {\"role\": \"system\", \"content\": \"\"},\n        {\"role\": \"user\", \"content\": \"My bill is $100. 
What will be the amount for 18% tip?\"}\n    ],\n    tools=[\n        {\n            \"type\": \"function\",\n            \"function\": {\n                \"name\": \"calculate_tip\",\n                \"parameters\": {\n                    \"type\": \"object\",\n                    \"properties\": {\n                        \"bill_total\": {\n                            \"type\": \"integer\",\n                            \"description\": \"The total amount of the bill\"\n                        },\n                        \"tip_percentage\": {\n                            \"type\": \"integer\",\n                            \"description\": \"The percentage of tip to be applied\"\n                        }\n                    },\n                    \"required\": [\"bill_total\", \"tip_percentage\"]\n                }\n            }\n        },\n        {\n            \"type\": \"function\",\n            \"function\": {\n                \"name\": \"convert_currency\",\n                \"parameters\": {\n                    \"type\": \"object\",\n                    \"properties\": {\n                        \"amount\": {\n                            \"type\": \"integer\",\n                            \"description\": \"The amount to be converted\"\n                        },\n                        \"from_currency\": {\n                            \"type\": \"string\",\n                            \"description\": \"The currency code to convert from\"\n                        },\n                        \"to_currency\": {\n                            \"type\": \"string\",\n                            \"description\": \"The currency code to convert to\"\n                        }\n                    },\n                    \"required\": [\"from_currency\", \"amount\", \"to_currency\"]\n                }\n            }\n        }\n    ],\n    temperature=0.6,\n    top_p=0.95,\n    max_tokens=32768,\n    stream=False\n)\nprint(completion.choices[0].message.content)\n'''\n<think>\nOkay, let's see. The user has a bill of $100 and wants to know the amount for an 18% tip. Hmm, I need to calculate the tip based on the bill total and the percentage. The tools provided include calculate_tip, which takes bill_total and tip_percentage as parameters. So the bill_total here is 100, and the tip_percentage is 18. I should call the calculate_tip function with these values. Wait, do I need to check if the parameters are integers? The bill is $100, which is an integer, and 18% is also an integer. So that fits the function's requirements. I don't need to convert any currency here because the user is asking about a tip in the same currency. 
So the correct tool to use is calculate_tip with those parameters.\n</think>\n'''\nprint(completion.choices[0].message.tool_calls)\n'''\n[ChatCompletionMessageToolCall(id='chatcmpl-tool-e341c6954d2c48c2a0e9071c7bdefd8b', function=Function(arguments='{\"bill_total\": 100, \"tip_percentage\": 18}', name='calculate_tip'), type='function')]\n'''\n```\n\n## **Training, Testing, and Evaluation Datasets:**\n\n### **Training Dataset:**\n\nA large variety of training data was used for the knowledge distillation phase before the post-training pipeline, 3 of which included: FineWeb, Buzz-V1.2, and Dolma.\n\nThe data for the multi-stage post-training phases for improvements in Code, Math, and Reasoning is a compilation of SFT and RL data that supports improvements of math, code, general reasoning, and instruction following capabilities of the original Llama instruct model.\n\nPrompts have been sourced from either public and open corpus or synthetically generated. Responses were synthetically generated by a variety of models, with some prompts containing responses for both reasoning on and off modes, to train the model to distinguish between two modes.\n\nNVIDIA will be releasing the post-training dataset in the coming weeks.\n\n**Data Modality:** Text \n\n**Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic\n\n**Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic\n\n**Properties (Quantity, Dataset Descriptions, Sensor(s)):** \n\nQuantity: The knowledge distillation phase used 40 billion tokens. The underlying Dolma dataset contains approximately 3 trillion tokens, and FineWeb contains over 15 trillion tokens.\n\nDataset Descriptions: The training data is a composite of large-scale web text, synthetically generated conversational data for coding and instruction-following, and a diverse corpus of text from academic, literary, and encyclopedic sources. The model supports a context length of 128,000 tokens.\n\n### \n\n### **Testing Dataset:**\n\n**Data Collection Method by dataset:** Hybrid: Human, Synthetic\n\n**Labeling Method by dataset:** Hybrid: Human, Automated\n\n**Properties (Quantity, Dataset Descriptions, Sensor(s)):**\n\nQuantity: The benchmarks contain thousands of test items, including:\n\n* MMLU: \\~15,900 multiple-choice questions.  \n* GPQA: 448 difficult multiple-choice questions.  \n* HumanEval: 164 programming problems.  \n* GSM-8K: \\~8,500 grade-school math problems.  \n* MATH: 12,500 competition math problems.  \n* IF-Eval: Over 400 instruction-following prompts.\n\nDataset Descriptions: The evaluation suite is diverse, covering 57 academic and professional subjects (MMLU), expert-level reasoning (GPQA), Python code generation (HumanEval), mathematical problem-solving (GSM-8K, MATH), and the ability to follow precise instructions (IF-Eval).\n\n### **Evaluation Dataset:**\n\nWe used the datasets listed below to evaluate Llama-3.3-Nemotron-Super-49B-v1.5.\n\n**Data Collection Method by dataset:** Hybrid: Human. Synthetic\n\n**Labeling Method by dataset:** Hybrid: Human, Synthetic, Automatic\n\n**Evaluation Results**  \nWe evaluate the model using temperature=0.6, top\\_p=0.95, and 64k sequence length. We run the benchmarks up to 16 times and average the scores to be more accurate.\n\n### MATH500\n\n| Reasoning Mode | pass@1 (avg. over 4 runs) |\n| ----- | ----- |\n| Reasoning On | 97.4 |\n\n### AIME 2024\n\n| Reasoning Mode | pass@1 (avg. over 16 runs) |\n| ----- | ----- |\n| Reasoning On | 87.5 |\n\n### AIME 2025\n\n| Reasoning Mode | pass@1 (avg. 
over 16 runs) |\n| ----- | ----- |\n| Reasoning On | 82.71 |\n\n### GPQA\n\n| Reasoning Mode | pass@1 (avg. over 4 runs) |\n| ----- | ----- |\n| Reasoning On | 71.97 |\n\n### LiveCodeBench 24.10-25.02\n\n| Reasoning Mode | pass@1 (avg. over 4 runs) |\n| ----- | ----- |\n| Reasoning On | 73.58 |\n\n### BFCL v3\n\n| Reasoning Mode | pass@1 (avg. over 2 runs) |\n| ----- | ----- |\n| Reasoning On | 71.75 |\n\n### IFEval\n\n| Reasoning Mode | Strict:Instruction |\n| ----- | ----- |\n| Reasoning On | 88.61 |\n\n### ArenaHard\n\n| Reasoning Mode | pass@1 (avg. over 1 runs) |\n| ----- | ----- |\n| Reasoning On | 92.0 |\n\n### Humanity's Last Exam (Text-Only Subset)\n\n| Reasoning Mode | pass@1 (avg. over 1 runs) |\n| ----- | ----- |\n| Reasoning On | 7.64 |\n\n### MMLU Pro (CoT)\n\n| Reasoning Mode | pass@1 (avg. over 1 runs) |\n| ----- | ----- |\n| Reasoning On | 79.53 |\n\nAll evaluations were done using the [NeMo-Skills](https://github.com/NVIDIA/NeMo-Skills) repository.\n\n## **Inference:**\n\n**Acceleration Engine:**\n\n* vLLM, TensorRT\n\n**Test Hardware:**\n\n* 2x NVIDIA H100-80GB  \n* 2x NVIDIA A100-80GB GPUs\n* B200 SXM    \n* H200 SXM   \n* H100 SXM   \n* A100 SXM 80GB   \n* A100 SXM 40GB   \n* L40S PCIe   \n* A10G   \n* H100 NVL   \n* H200 NVL   \n* GH200 96GB   \n* GB200 \n* RTX 5090   \n* RTX 4090   \n* RTX 6000 Ada \n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\n\nFor more detailed information on ethical considerations for this model, please see the Model Card++ [Bias](http://./bias.md), [Explainability](http://./explainability.md), [Safety & Security](http://./safety.md), and [Privacy](http://./privacy.md) Subcards.\n\nPlease report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.\n\n## **Citation**\n\n```py\n@misc{bercovich2025llamanemotronefficientreasoningmodels,\n      title={Llama-Nemotron: Efficient Reasoning Models}, \n      author={Akhiad Bercovich and Itay Levy and Izik Golan and Mohammad Dabbah and Ran El-Yaniv and Omri Puny and Ido Galil and Zach Moshe and Tomer Ronen and Najeeb Nabwani and Ido Shahaf and Oren Tropp and Ehud Karpas and Ran Zilberstein and Jiaqi Zeng and Soumye Singhal and Alexander Bukharin and Yian Zhang and Tugrul Konuk and Gerald Shen and Ameya Sunil Mahabaleshwarkar and Bilal Kartal and Yoshi Suhara and Olivier Delalleau and Zijia Chen and Zhilin Wang and David Mosallanezhad and Adi Renduchintala and Haifeng Qian and Dima Rekesh and Fei Jia and Somshubra Majumdar and Vahid Noroozi and Wasi Uddin Ahmad and Sean Narenthiran and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Siddhartha Jain and Igor Gitman and Ivan Moshkov and Wei Du and Shubham Toshniwal and George Armstrong and Branislav Kisacanin and Matvei Novikov and Daria Gitman and Evelina Bakhturina and Jane Polak Scowcroft and John Kamalu and Dan Su and Kezhi Kong and Markus Kliegl and Rabeeh Karimi and Ying Lin and Sanjeev Satheesh and Jupinder Parmar and Pritam Gundecha and Brandon Norick and 
Joseph Jennings and Shrimai Prabhumoye and Syeda Nahida Akter and Mostofa Patwary and Abhinav Khattar and Deepak Narayanan and Roger Waleffe and Jimmy Zhang and Bor-Yiing Su and Guyue Huang and Terry Kong and Parth Chadha and Sahil Jain and Christine Harvey and Elad Segal and Jining Huang and Sergey Kashirsky and Robert McQueen and Izzy Putterman and George Lam and Arun Venkatesan and Sherry Wu and Vinh Nguyen and Manoj Kilaru and Andrew Wang and Anna Warno and Abhilash Somasamudramath and Sandip Bhaskar and Maka Dong and Nave Assaf and Shahar Mor and Omer Ullman Argov and Scot Junkin and Oleksandr Romanenko and Pedro Larroy and Monika Katariya and Marco Rovinelli and Viji Balas and Nicholas Edelman and Anahita Bhiwandiwalla and Muthu Subramaniam and Smita Ithape and Karthik Ramamoorthy and Yuting Wu and Suguna Varshini Velury and Omri Almog and Joyjit Daw and Denys Fridman and Erick Galinkin and Michael Evans and Katherine Luna and Leon Derczynski and Nikki Pope and Eileen Long and Seth Schneider and Guillermo Siman and Tomasz Grzegorzek and Pablo Ribalta and Monika Katariya and Joey Conway and Trisha Saar and Ann Guan and Krzysztof Pawelec and Shyamala Prayaga and Oleksii Kuchaiev and Boris Ginsburg and Oluwatobi Olabiyi and Kari Briski and Jonathan Cohen and Bryan Catanzaro and Jonah Alben and Yonatan Geifman and Eric Chung and Chris Alexiuk},\n      year={2025},\n      eprint={2505.00949},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https://arxiv.org/abs/2505.00949}, \n}\n```",
    "displayName": "Llama-3.3-nemotron-super-49b-v1.5",
    "explainability": "### **Explainability**\n\n| Field: | Response: |\n| :---- | :---- |\n| Intended Application(s) & Domain(s): | Text generation, reasoning, summarization, and question answering.  |\n| Model Type: | Text-to-text transformer |\n| Intended Users: | This model is intended for developers, researchers, and customers building/utilizing LLMs, while balancing accuracy and efficiency. |\n| Output: | Text String(s) |\n| Describe how the model works: | Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers. |\n| Technical Limitations: | The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. The model demonstrates weakness to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs. The Model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text. |\n| Verified to have met prescribed quality standards? | Yes |\n| Performance Metrics: | Accuracy, Throughput, and user-side throughput |\n| Potential Known Risks: | The model was optimized explicitly for instruction following and as such is susceptible to prompt injection and jailbreaking in various forms as a result of its instruction tuning. The model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources -- either directly or indirectly by retrieval (e.g. via visiting a website) -- as they may yield outputs that can lead to harmful, system-level outcomes up to and including remote code execution in agentic systems when effective security controls including guardrails are not in place.The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. Model output should be appropriately escaped before viewing or other processing.|\n| End User License Agreement: | Your use of this model is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama. |",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-8J48-AV9D",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise",
        "technology:model:soln_nvidia_ai"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx4-latency-bf16-ymougp5ifg",
    "latestVersionSizeInBytes": 107695648037,
    "logo": "https://catalog.ngc.nvidia.com/_next/image?url=https%3A%2F%2Fassets.ngc.nvidia.com%2Fproducts%2Fapi-catalog%2Fimages%2Fllama-3_3-nemotron-49b-instruct.jpg&w=640&q=90",
    "modelFormat": "SavedModel",
    "name": "llama-3.3-nemotron-super-49b-v1.5",
    "orgName": "nim",
    "precision": "OTHER",
    "privacy": "### Privacy\n\n|Field:|Response:|\n|:---:|:---:|\n|Generatable or Reverse engineerable personally-identifiable information?|None|\n|Was consent obtained for any personal data used?|None Known|\n|Personal data used to create this model?|None Known|\n|How often is dataset reviewed?|Before Release|\n|Is there provenance for all datasets used in training?|Yes|\n|Does data labeling (annotation, metadata) comply with privacy laws?|Yes|\n|Applicable NVIDIA Privacy Policy|https://www.nvidia.com/en-us/about-nvidia/privacy-policy/|",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "### Safety & Security\n\n|Field:|Response:|\n|:---:|:---:|\n|Model Application(s):|Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning|\n|Describe life critical application (if present):|None Known (please see referenced Known Risks in the Explainability subcard).|\n|Use Case Restrictions:|Your use of this model is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama. |",
    "shortDescription": "Llama-3.3-Nemotron-Super-49B-v1.5 is a significantly upgraded version of Llama-3.3-Nemotron-Super-49B-v1 and is a large language model (LLM) which is a derivative of Meta Llama-3.3-70B-Instruct",
    "teamName": "nvidia",
    "updatedDate": "2025-10-17T18:04:03.572Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama-3.3-nemotron-super-49b-v1.5 optimizationProfiles: - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:a100x2-throughput-bf16-wcsztflslq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 A100x2 BF16 Throughput ngcMetadata: 03fdf4e63960724f08647e43122aab89748cf69f8e180c64fab6370abee11c41: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ae4d6417367534d6b999876248c3591165a546df619a27fe6460b92aa44e7f88 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx4 BF16 Latency ngcMetadata: 0634edcf356b10f286d7a9ff5b5a0798a2616208e5c3b891aed4394fc504b0a1: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4e1be52ab36b863d4abb3e4e549f1f8150d8fe59bf3021012b9eddbb124bf1a8 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:gb200x1-throughput-nvfp4-yqo6gpzgtw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GB200x1 NVFP4 Throughput ngcMetadata: 097e7abb70716b35f220ddfa9f1beafc1872b83d2faae76087c5981875c172d7: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e1bde03fd878742322841e5871f1182069b936b6c4517c9b2d07c94d8c7e8ebf number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:l40sx4-throughput-bf16-lb51ks7uxa framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 L40Sx4 BF16 Throughput ngcMetadata: 0d0c380f456551cf0c7d94cba5df94a6679bf13bfcec35518dd4700277c45d6d: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 303af23d9df7615161cd22feb968e97571f32f341e3567ec57a5405fc513e452 number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: 
TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:b200x2-latency-bf16-srzmz11lsq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 B200x2 BF16 Latency ngcMetadata: 0de4288607eed4d3b8fc4437cc7b7660d927d0ba9265f95f4c49191a69701446: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f96591a267ca466ff8d50fe13273091238ce4066f7da0533206572ac09da1eff number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:l40sx4-latency-fp8-csbsvltszw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 L40Sx4 FP8 Latency ngcMetadata: 1080a2945bccdf2773330d1ff5041b953088cb90e76c0a29ccddde3edb10fa48: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 76c65c985bc5acb124613d1c2854d8ca1908efc80cb6bdda6ebecf814f6f9932 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 50GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h100x2-latency-fp8-sowqqe--5a framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100x2 FP8 Latency ngcMetadata: 1f01cd4066c857f8982fcd8f7e7d7e4920c1e77ef50c8e0e9451815ec3d6590c: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6418fea1651154d4141be9df22ee889d55ee1e07eb23327386cfd21ed7e48917 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:a10gx8-throughput-bf16-4fcffqprja framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 A10Gx8 BF16 Throughput ngcMetadata: 282c3be83f0772b985e007af291125cc8ecd4befc2833a96feefecfe49a6a116: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d50bef030c2c86675a058f6a7b4132438d8558afead641fa45c70cb431631be3 number_of_gpus: '8' pp: '2' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:gb200x2-latency-nvfp4-3sifj870lq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GB200x2 NVFP4 Latency ngcMetadata: 
2a721971fe1905d88e8281b2804ffd900bbd20704482e07eb1dee03ca7ee1f26: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 21113f03af084c245046a894ef0cce875ebb362781b1e2d70774b919dfc08b7b number_of_gpus: '2' pp: '1' precision: nvfp4 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 A100_SXM4_40GBx4 BF16 Throughput ngcMetadata: 2da0154c6a5ddf2d67aae37fd8a276f7fb54d69ebf9c2fe631c9cb9721912c10: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a5b791bd084d0d196d8eab3a5ee30584c4b5154b68e27d4ae27240c572aaa0c3 number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 4 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx1 FP8 Throughput ngcMetadata: 317edefe0e4f3253972892af7f1f8bb0787c39eaac22e54947bbd21c64c105de: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 93ae1647a06301ebae5535fc2a127f5149c5ffe3f63f99443eac45c342b36bf9 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200_NVLx1 FP8 Throughput ngcMetadata: 37aa8cad01613034db7185edd866ca513104bf5b87447a9ea373ddc475141a38: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7640e168db96daafc2278c529e4ea7e93a9751774a361437b064f6542fa8400a number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:gb200x1-throughput-bf16-b7mc0n5vsq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GB200x1 BF16 Throughput ngcMetadata: 3a1966db19d49667baa129a4838553168a4c66202dae42b3da82d34e0254dda9: model: 
nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 56e98fd149cfe53aa5e62c155e6903b2254d7084850bfb5ea65bcd05e9fb416c number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 94GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx2 NVFP4 Throughput ngcMetadata: 3e02aabd0df7fb43fd55db667ddc61b9c1c6b2962aa3f2bfd0aa8d2206aa5ccd: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b22fdd19cd522ea34c068f776b07b38474bd419dd4bed6cb6ba3cb56376437fa number_of_gpus: '2' pp: '1' precision: nvfp4 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h100x2-throughput-bf16-sfp5psfsoa framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100x2 BF16 Throughput ngcMetadata: 3f000887cbabfb954b87cbdafed85aeec51c82e5c941801c67a0fbb6bdbfbed5: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 037bea81c2da8987458de8a0e326c12c574c09048e1d80a027a73b6f6b553e06 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100_NVLx1 FP8 Throughput ngcMetadata: 4138603595d590ef014e6b18a034c8d6b6f7addc09e83ce2f97fe3d6b5502658: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3557dd611c5250fc7498c009ad75ec1ffdd75dd591e69de8efdfd0ad379871b5 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GH200_144GBx1 BF16 Throughput ngcMetadata: 439a0279d35d96d6b3c8be1f22f92a94d7e874e2476a58dc868b600861c84428: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 
2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ab0c91b93211e813ba1ef7fc61abd40d74c1babafe9d323e3dcddc74008f4cf3 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx1 NVFP4 Throughput ngcMetadata: 496a3bcf32f7c7e81e59b1c17395d49b6c412dcb9e94d1bd4675c7ab61ed4b8c: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c93c0eb2422047add4d8c0141d90bab8840448b965f75976ac669b97d7934cca number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100_NVLx2 BF16 Throughput ngcMetadata: 4eb1789fe7a9ba85b6915c1f6ab6423be03ad2b7660fd17ccadfca11a9cea20e: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 61175440372bb9eb41cd7d5f3de3cb8aa05ab8ab84483ab3bb2580cdf9edb50f number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:a100x4-latency-bf16-qfav5fnhta framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 A100x4 BF16 Latency ngcMetadata: 4ecfbc0680c47e40c54811d7d056bd8c2cb17410671da1d5a9d94f37e0e9ddd9: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4b36e4728a079ec71833e5b851ba55816893bae8c6b3e028c7092783d51380b6 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 4 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:gb200x2-latency-fp8-hh-qiitbsq framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GB200x2 FP8 Latency ngcMetadata: 5104fad3c90f0e82d48218e2f295ecc76413a75f7828902f6201cdce8e11f119: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: eb379d36e7d269b0f9dedded8c2295fe06009c4792c2faebeee87520382b4a79 
number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h200x4-latency-bf16-gwrdidufkg framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200x4 BF16 Latency ngcMetadata: 5375ff8c01b5f03cc5226403b75091b280f9ab3b4901e4ddf08effd37f3be185: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 66667c721d5d8a4380827673282e501ca94da891407d35e1c4212606fc217cd4 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 4 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx2 NVFP4 Latency ngcMetadata: 556dcaf16db7138e6cadd8e2a194caed98ad4d5be6c5d2f2638b4517f6d8a2f2: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7f399c329c773521d095cda5f5da78429a9a12497dc5e0a107b30250e7af3c9e number_of_gpus: '2' pp: '1' precision: nvfp4 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100_NVLx2 FP8 Latency ngcMetadata: 557fec5cda76abb3bda2a196e908b91a4f97b18c0bab1fbc1e32927e131722f0: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0eb8ad98ede42e79a11f7439000dbc224363a640c2c03a7c0415d9d84852109c number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:b200x1-throughput-bf16-ypw69-37kw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 B200x1 BF16 Throughput ngcMetadata: 5c181a5c2c72785a8c062e4f8b197d404caa754117731b76dfc612c4751392a8: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f9bff6c55a835edfed0cf54e1d92d121be400ddbe8bca9a14cae0406b129a700 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: 
BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 94GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx2 FP8 Latency ngcMetadata: 610f006b15f3adbdb072da0b4155d8a772332cf1768fb7389ef92a83c31c26dc: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c791be936f92f5cb480fbae429dff3fac2d0e7f1a3d3396e78196029b9a0d395 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GH200_480GBx1 FP8 Throughput ngcMetadata: 67fefe5a60111523b327e93282aabb0bee010780482d12aed22032fae947e6db: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f677672ff3da6e403ae52655aa3a37c53289547b58bc22c64c39364d755ac363 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:gb200x2-latency-bf16-ql0dncdzug framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GB200x2 BF16 Latency ngcMetadata: 6988d6b50d4c8c0d12579128b9ceb6dfd239d91ecc1b0dcaf6b2d5235a785f41: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3ba2484ae0c038d20cd5f4add3a088742233193c5f9c7d16c55384ddd5ad1f78 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 96GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:gb200x1-throughput-fp8-eqspsgvc4g framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GB200x1 FP8 Throughput ngcMetadata: 721782adb8e04decd419a5d5fd5138ea578840ce23ae878cd66b5ade58b64860: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6dd43f5bf34d55e342b12dceabe739dc65b70b1c652b32e33158dda0176938b6 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD 
SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:b200x1-throughput-nvfp4-lzk6scakha framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 B200x1 NVFP4 Throughput ngcMetadata: 878da3cd983e1c204b447eaf6c2b1fbe15df8e3f8606dba0276dc5db6f1b2ea3: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ed7366f0b3c56148342e9281e92f74ebd0117e45028b50f4b5080474ca08579f number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:l40sx4-latency-bf16-fcnx7qsagw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 L40Sx4 BF16 Latency ngcMetadata: 8b3a0a14508070667a00aa2bec26a373d078db79903603585127a2a33a437dcd: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3eebbad6ddd7f86f084cafbaa2774c9b68814be840c5117b1e8f42bd4609a154 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200_NVLx2 BF16 Latency ngcMetadata: 901ae99dec61c02334df6c00217c665e620b03efaccd7ecbfedee9dd5b919e2c: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0f967f12c738cdeba2e1f6f507f6519b0e70fe691a3d85cc6f275bfa6cebaab0 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h100x4-latency-bf16-c7ags8vtqa framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100x4 BF16 Latency ngcMetadata: a14acaea6216232b3dd9ff678dd04b239a48f8ef7eec367c8ba121aa93bc3699: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 79c4e912efae2b6e2670be80c6b16cf7d5a8d41658e3bdf6f0d7b72dc5d58634 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h200x1-throughput-fp8-tnrs6lwhqg framework: TensorRT-LLM displayName: Llama 3.3 
Nemotron Super 49B V1.5 H200x1 FP8 Throughput ngcMetadata: a55bb618b06a37fd61e99000a8ba38375801c0879c67a7a1b5a66cd497e09817: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c8b0bc8703bb921bcde98b731a73ba8a5223b4cc331b1f6ec52c97c0fb7eb334 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200_NVLx2 FP8 Latency ngcMetadata: a5b40bd2025de323418db8d8577d91ad1c4c1b2143219fd9661c679e317af0fe: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 77041070f5f0dbe334cd51ed68930e5768eb93f59aff9f374f380e732fb3b078 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:b200x1-throughput-fp8-ueeogrvolw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 B200x1 FP8 Throughput ngcMetadata: b29ee8752b78d7d6a588e68487d9dd9f8ceaa2a01964f09c49ae9d7512a0e425: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0c5dc1a2f374f41aba6887d42b8f2497e43e32931426f33220897c481b121300 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200_NVLx1 BF16 Throughput ngcMetadata: b3c7e84a0d005d532b307e36b9956be0b169a791283fd5362fa7326c1d442516: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 74e8a2780520f4deb4b75fe91dfb53dc33ab212294b755cb654dfcfbd720bea0 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:b200x2-latency-fp8-kwktcc65kw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 B200x2 FP8 Latency ngcMetadata: b9cd24c06efe599256f1cbc69e32686bf837e634d6d72754124a3d2db6a69415: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 
1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 91ac13ed5c5bcacd46af55350326af80781385cbf9ee70b426583319f1972bcb number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GH200_144GBx2 FP8 Latency ngcMetadata: bee87c5b924821f18e4f18f9b63509e00d105053e9c0ed00440235219cc4c355: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4c34e6ff5408a3c795f72d84ad93d221975854695c77a2c49a03c94e288ebd7e number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100_NVLx4 BF16 Latency ngcMetadata: cdc6d143f3c8ae40bef086616fe918badef8b5d8c2f7ed7bf35c46efc664f1d2: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7acb071a5e1043fb94515e5b7a4209955aa7b1a7f9311e43a96d525f36124582 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 A100_SXM4_40GBx8 BF16 Latency ngcMetadata: d1a6703d5e49f81f492115bcd2fbb3d8f654fb74c7871374a0aba53e268f7eb7: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a491d0c85fd9304f23dc12098011ac5a7f06323c7f1bf5a03930512eab8bc661 number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 RTX6000_BLACKWELL_SVx2 BF16 Throughput ngcMetadata: d74a0a1011908274b71ca777cadb98fb50eb4d1b03f293c4961dfffd685fa77c: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
a43faa67d15febf569c0cc2520243017faa2f85e955b0da68f0819b562b0f746 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:l40sx4-throughput-fp8-grh3fk4vxa framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 L40Sx4 FP8 Throughput ngcMetadata: dd6e06ce56d8c23034792ecfafa9cad84e89381646f1a6b3f61e64d5c7151cca: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c218e34971f85acf3ff928121f744f142c144926bface213142ca6abe6d08527 number_of_gpus: '4' pp: '1' precision: fp8 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 50GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h200x2-latency-fp8-qionglmjjw framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200x2 FP8 Latency ngcMetadata: e76d9a6e681f5047d58bb835cd1144df8a4c07cbd5e11340d9e841a34639c6ac: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7fd4e15ecac33d4ebf1f8b32433b45177d6701b159d35a262c363fd67aab00ca number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GH200_144GBx2 BF16 Latency ngcMetadata: ecad5bd2fe50b96e275be4aede45e63ccacb4943719a70a76183ac78cb7b2602: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0863fbe7a8909618c897d216a1a1df5e66eb44030a0421496f854e0bc52bb041 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h200x1-throughput-bf16-jvvivxsong framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H200x1 BF16 Throughput ngcMetadata: ee058f1abbfe0cc174b16c966b91cfe886c7bb247fc691d8c7c8fee3dc9c8f41: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 493e3f3ab5031da3fd826eb8ef23ea20e87af83cae500214e295fbeca9003e55 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: 
PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 94GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:a10gx8-latency-bf16-ejd7ve2qag framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 A10Gx8 BF16 Latency ngcMetadata: ef9c1ac2f14b38895123c608a25f0104c42557f617c91e8ef6e151bc601822de: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7c853ac2150b03fff0d81df21fed77d789c5254f9f440e32844c93d073f5e43f number_of_gpus: '8' pp: '2' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 100GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:h100x1-throughput-fp8-zmf7sc5wtg framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 H100x1 FP8 Throughput ngcMetadata: f5e04275ea0d3bd001a2262e85e47de206406d9593d74525074a25475dc47a22: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e96f0bba12e9e5e054f193562c51161b61804d52ba2bff7a49bf8aa267a1c2b2 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 49GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama-3.3-nemotron-super-49b-v1.5:hf-f091ea1-fix-chat-template-jet framework: TensorRT-LLM displayName: Llama 3.3 Nemotron Super 49B V1.5 GH200_144GBx1 FP8 Throughput ngcMetadata: f97be1c404cf80299a3d85359c32a28710a3251c6ed6eec2dd3f3bdce2ca2903: model: nvidia/llama-3.3-nemotron-super-49b-v1.5 release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 911352dda31a6b8811db6e5dc7c573094dadfc2354bee8cd0f41e784e79fc6f6 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 93GB - key: LLM ENGINE value: TENSORRT_LLM labels: - Llama - Chatbots - Virtual Assistants - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Llama 3.1 Nemotron Nano displayName: Llama 3.1 Nemotron Nano modelHubID: llama-3.1-nemotron-nano category: Chatbots type: NGC description: Llama 3.1 Nemotron Nano 8B or 4B is a language model that can follow instructions, complete requests, and generate creative text formats. requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3/license/ modelVariants: - variantId: Llama 3.1 Nemotron Nano 4b V1.1 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "|Field:|Response:|\n|:---:|:---:|\n|Participation considerations from adversely impacted groups (protected classes) in model design and testing:|None|\n|Measures taken to mitigate against unwanted bias:|None|",
    "canGuestDownload": false,
    "createdDate": "2025-06-05T17:26:28.211Z",
    "description": "## Model Overview \n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. It  is a derivative of [nvidia/Llama-3.1-Minitron-4B-Width-Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base), which is created from Llama 3.1 8B using [our LLM compression technique](https://arxiv.org/abs/2408.11796) and offers improvements in model accuracy and efficiency. \n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a model which offers a great tradeoff between model accuracy and efficiency. The model fits on a single RTX GPU and can be used locally. The model supports a context length of 128K.\n\nThis model underwent a multi-phase post-training process to enhance both its reasoning and non-reasoning capabilities. This includes a supervised fine-tuning stage for Math, Code, Reasoning, and Tool Calling as well as multiple reinforcement learning (RL) stages using Reward-aware Preference Optimization (RPO) algorithms for both chat and instruction-following. The final model checkpoint is obtained after merging the final SFT and RPO checkpoints\n\nThis model is part of the Llama Nemotron Collection. You can find the other model(s) in this family here: \n- [Llama-3.3-Nemotron-Ultra-253B-v1](https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1)\n- [Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3.3-Nemotron-Super-49B-v1)\n- [Llama-3.1-Nemotron-Nano-8B-v1](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-8B-v1)\n\nThis model is ready for commercial use.\n\n## License/Terms of Use\n\nThe NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products). Your use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license). Built with Llama.\n\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n**Model Developer:** NVIDIA\n\n**Model Dates:** Trained between August 2024 and April 2025\n\n**Data Freshness:** The pretraining data has a cutoff of 2023 per Meta Llama 3.1 8B\n\n\n## Use Case: \n\nDevelopers designing AI Agent systems, chatbots, RAG systems, and other AI-powered applications. Also suitable for typical instruction-following tasks. Balance of model accuracy and compute efficiency (the model fits on a single RTX GPU and can be used locally).\n\n## Release Date: <br>\nNGC: May 2025 <br>\n\n## References\n\n- [\\[2502.00203\\] Reward-aware Preference Optimization: A Unified Mathematical Framework for Model Alignment](https://arxiv.org/abs/2502.00203)\n\n\n## Model Architecture\n\n**Architecture Type:** Dense decoder-only Transformer model\n\n**Network Architecture:** [Llama 3.1 Minitron Width 4B Base](https://huggingface.co/nvidia/Llama-3.1-Minitron-4B-Width-Base)\n\n## Intended use\n\nLlama-3.1-Nemotron-Nano-4B-v1.1 is a general purpose reasoning and chat model intended to be used in English and coding languages. 
Other non-English languages (German, French, Italian, Portuguese, Hindi, Spanish, and Thai) are also supported. \n\n## Input:\n- **Input Type:** Text\n- **Input Format:** String\n- **Input Parameters:** One-Dimensional (1D)\n- **Other Properties Related to Input:** Context length up to 131,072 tokens\n\n## Output:\n- **Output Type:** Text\n- **Output Format:** String\n- **Output Parameters:** One-Dimensional (1D)\n- **Other Properties Related to Output:** Context length up to 131,072 tokens\n\n## Model Version:\n1.0 (April 2025)\n\n## Software Integration\n- **Runtime Engine:** NeMo 24.12 <br>\n  \n- **Recommended Hardware Microarchitecture Compatibility:**\n    - NVIDIA Ampere\n    - NVIDIA Hopper\n- **Preferred/Supported Operating System:** Linux <br>\n\n## Quick Start and Usage Recommendations:\n\n1. Reasoning mode (ON/OFF) is controlled via the system prompt, which must be set as shown in the example below. All instructions should be contained within the user prompt\n2. We recommend setting temperature to `0.6`, and Top P to `0.95` for Reasoning ON mode\n3. We recommend using greedy decoding for Reasoning OFF mode\n4. We have provided a list of prompts to use for evaluation for each benchmark where a specific template is required\n\nSee the snippet below for usage with the Hugging Face Transformers library. Reasoning mode (ON/OFF) is controlled via system prompt. Please see the example below.\nOur code requires the transformers package version to be `4.44.2` or higher.\n\n\n### Example of \u201cReasoning On:\u201d\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   temperature=0.6,\n   top_p=0.95,\n   **model_kwargs\n)\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"on\"\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}]))\n```\n\n\n### Example of \u201cReasoning Off:\u201d\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   do_sample=False,\n   **model_kwargs\n)\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"off\"\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}]))\n```\n\nFor some prompts, even though thinking is disabled, the model emergently prefers to think before responding.
But if desired, the users can prevent it by pre-filling the assistant response.\n\n```python\nimport torch\nimport transformers\n\nmodel_id = \"nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1\"\nmodel_kwargs = {\"torch_dtype\": torch.bfloat16, \"device_map\": \"auto\"}\ntokenizer = transformers.AutoTokenizer.from_pretrained(model_id)\ntokenizer.pad_token_id = tokenizer.eos_token_id\n\n# Thinking can be \"on\" or \"off\"\nthinking = \"off\"\n\npipeline = transformers.pipeline(\n   \"text-generation\",\n   model=model_id,\n   tokenizer=tokenizer,\n   max_new_tokens=32768,\n   do_sample=False,\n   **model_kwargs\n)\n\nprint(pipeline([{\"role\": \"system\", \"content\": f\"detailed thinking {thinking}\"}, {\"role\": \"user\", \"content\": \"Solve x*(sin(x)+2)=0\"}, {\"role\":\"assistant\", \"content\":\"<think>\\n</think>\"}]))\n```\n\n## Inference:\n**Test Hardware:**\n\n- BF16:\n    - 1x RTX 50 Series GPUs\n    - 1x RTX 40 Series GPUs\n    - 1x RTX 30 Series GPUs\n    - 1x H100-80GB GPU\n    - 1x A100-80GB GPU\n\n# Training and Evaluation Datasets \n\n## Training Datasets\n\nA large variety of training data was used for the post-training pipeline, including manually annotated data and synthetic data.\n\nThe data for the multi-stage post-training phases for improvements in Code, Math, and Reasoning is a compilation of SFT and RL data that supports improvements of math, code, general reasoning, and instruction following capabilities of the original Llama instruct model. \n\nPrompts have been sourced from either public and open corpus or synthetically generated. Responses were synthetically generated by a variety of models, with some prompts containing responses for both Reasoning On and Off modes, to train the model to distinguish between two modes. \n\n**Data Collection for Training Datasets:** <br>\n* Hybrid: Automated, Human, Synthetic <br>\n\n**Data Labeling for Training Datasets:** <br>\n* Not Applicable (N/A) <br>\n\n## Evaluation Datasets\n\nWe used the datasets listed below to evaluate Llama-3.1-Nemotron-Nano-4B-v1.1. \n\n**Data Collection for Evaluation Datasets:** Hybrid: Human/Synthetic\n\n**Data Labeling for Evaluation Datasets:** Hybrid: Human/Synthetic/Automatic\n\n## Evaluation Results\n\nThese results contain both \u201cReasoning On\u201d, and \u201cReasoning Off\u201d. We recommend using temperature=`0.6`, top_p=`0.95` for \u201cReasoning On\u201d mode, and greedy decoding for \u201cReasoning Off\u201d mode. All evaluations are done with 32k sequence length. We run the benchmarks up to 16 times and average the scores to be more accurate.\n\n> NOTE: Where applicable, a Prompt Template will be provided. While completing benchmarks, please ensure that you are parsing for the correct output format as per the provided prompt in order to reproduce the benchmarks seen below. \n\n### MT-Bench\n\n| Reasoning Mode | Score |\n|--------------|------------|\n| Reasoning Off | 7.6 |\n| Reasoning On | 8.1 |\n\n\n### MATH500\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 72.0% | \n| Reasoning On | 95.1%  |\n\nUser Prompt Template: \n\n```\n\"Below is a math question. I want you to reason through the steps and then give a final answer. Your final answer should be in \\boxed{}.\\nQuestion: {question}\"\n```\n\n\n### AIME25\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 13.3% | \n| Reasoning On | 46.7% |\n\nUser Prompt Template: \n\n```\n\"Below is a math question. I want you to reason through the steps and then give a final answer. 
Your final answer should be in \\boxed{}.\\nQuestion: {question}\"\n```\n\n\n### GPQA-D\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 31.8% | \n| Reasoning On | 55.8% |\n\nUser Prompt Template: \n\n\n```\n\"What is the correct answer to this question: {question}\\nChoices:\\nA. {option_A}\\nB. {option_B}\\nC. {option_C}\\nD. {option_D}\\nLet's think step by step, and put the final answer (should be a single letter A, B, C, or D) into a \\boxed{}\"\n```\n\n\n### IFEval\n\n| Reasoning Mode | Strict:Prompt | Strict:Instruction |\n|--------------|------------|------------|\n| Reasoning Off | 73.6% | 80.8% |\n| Reasoning On | 75.4% | 82.6% |\n\n### BFCL v2 Live\n\n| Reasoning Mode | Score |\n|--------------|------------|\n| Reasoning Off | 57.1% | \n| Reasoning On | 64.2% | \n\nUser Prompt Template:\n\n\n```\n<AVAILABLE_TOOLS>{functions}</AVAILABLE_TOOLS>\n\n{user_prompt}\n```\n\n\n### MBPP 0-shot\n\n| Reasoning Mode | pass@1 |\n|--------------|------------|\n| Reasoning Off | 66.4% | \n| Reasoning On | 86.0% |\n\nUser Prompt Template:\n\n\n````\nYou are an exceptionally intelligent coding assistant that consistently delivers accurate and reliable responses to user instructions.\n\n@@ Instruction\nHere is the given problem and test examples:\n{prompt}\nPlease use the python programming language to solve this problem.\nPlease make sure that your code includes the functions from the test samples and that the input and output formats of these functions match the test samples.\nPlease return all completed codes in one code block.\nThis code block should be in the following format:\n```python\n# Your codes here\n```\n````\n\n\n## Ethical Considerations:\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. \n\nFor more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety.md), and [Privacy](privacy.md) Subcards.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).",
    "displayName": "Llama-3.1-Nemotron-Nano-4B-v1.1",
    "explainability": "|Field:|Response:|\n|:---|:---|\n|Intended Application(s) & Domain(s):|Text generation, reasoning, summarization, and question answering.|\n|Model Type: |Text-to-text transformer |\n|Intended Users:|This model is intended for developers, researchers, and customers building/utilizing LLMs, while balancing accuracy and efficiency.|\n|Output:|Text String(s)|\n|Describe how the model works:|Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers.|\n|Technical Limitations:|The model was trained on data that contains toxic language, unsafe content, and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.<br><br>The model demonstrates weakness to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs.<br>|\n|Verified to have met prescribed quality standards?|Yes|\n|Performance Metrics:|Accuracy, Throughput, and user-side throughput|\n|Potential Known Risks:|The model was optimized explicitly for instruction following and as such is more susceptible to prompt injection and jailbreaking in various forms as a result of its instruction tuning. This means that the model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources  -- either directly or indirectly by retrieval (e.g. via visiting a website)  -- as they may yield outputs that can lead to harmful, system-level outcomes up to and including remote code execution in agentic systems when effective security controls including guardrails are not in place.<br><br>The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.|\n|End User License Agreement:|The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products). Your use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license). Built with Llama.",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-WSPX-LGNV",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "l40sx1-throughput-lora-fp8-jt4xfye5rg",
    "latestVersionSizeInBytes": 5496337766,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/nemotron-mini-4b-instruct.jpg",
    "modelFormat": "N/A",
    "name": "llama3.1-nemotron-nano-4b-v1.1",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "|Field:|Response:|\n|:---|:---|\n|Generatable or Reverse engineerable personal data?|None|\n|Was consent obtained for any personal data used?|None Known|\n|Personal data used to create this model?|None Known|\n|How often is dataset reviewed?|Before Release|\n|Is there provenance for all datasets used in training?|Yes|\n|Does data labeling (annotation, metadata) comply with privacy laws?|Yes|\n|Applicable NVIDIA Privacy Policy|https://www.nvidia.com/en-us/about-nvidia/privacy-policy/|",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "|Field:|Response:|\n|:---|:---|\n|Model Application(s):|Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning|\n|Describe life critical application (if present):|None Known (please see referenced Known Risks in the Explainability subcard).|\n|Use Case Restrictions:|The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products). Your use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/). Additional Information: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license). Built with Llama.\n|Model and Dataset Restrictions:|The Principle of least privilege (PoLP) is applied limiting access for dataset generation.  Restrictions enforce dataset access during training, and dataset license constraints adhered to. Model checkpoints are made available on Hugging Face and NGC, and may become available on cloud providers' model catalog.|",
    "shortDescription": "Llama-3.1-Nemotron-Nano-4B-v1.1 is a large language model (LLM) reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling.",
    "teamName": "nvidia",
    "updatedDate": "2025-06-18T18:05:57.835Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama3.1-nemotron-nano-4b-v1.1 optimizationProfiles: - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:a100x1-throughput-bf16-a-zgkhv-7a framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 A100x1 BF16 Throughput ngcMetadata: 222d1729a785201e8a021b226d74d227d01418c41b556283ee1bdbf0a818bd94: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:hf-9f834a8-fix-checksum framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 H100_NVLx1 BF16 Throughput ngcMetadata: 25b5e251d366671a4011eaada9872ad1d02b48acc33aa0637853a3e3c3caa516: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:h200x1-throughput-bf16-6ej0hxqqug framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 H200x1 BF16 Throughput ngcMetadata: 434e8d336fa23cbe151748d32b71e196d69f20d319ee8b59852a1ca31a48d311: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:a10gx1-throughput-bf16-kf8s30cw4q framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 A10Gx1 BF16 Throughput ngcMetadata: 74bfd8b2df5eafe452a9887637eef4820779fb4e1edb72a4a7a2a1a2d1e6480b: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 1 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:l40sx1-throughput-bf16-ji5fmrct-w framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 L40Sx1 BF16 Throughput ngcMetadata: ac5071bbd91efcc71dc486fcd5210779570868b3b8328b4abf7a408a58b5e57c: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' 
precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:l40sx1-throughput-fp8-y0vtnnyy0q framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 L40Sx1 FP8 Throughput ngcMetadata: ad17776f4619854fccd50354f31132a558a1ca619930698fd184d6ccf5fe3c99: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 6GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:hf-9f834a8-fix-checksum framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 A100_SXM4_40GBx1 BF16 Throughput ngcMetadata: c6821c013c559912c37e61d7b954c5ca8fe07dda76d8bea0f4a52320e0a54427: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 1 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:h100x1-throughput-bf16-n6thxsck2g framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 H100x1 BF16 Throughput ngcMetadata: e7dbd9a8ce6270d2ec649a0fecbcae9b5336566113525f20aee3809ba5e63856: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:hf-9f834a8-fix-checksum framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 GH200_480GBx1 BF16 Throughput ngcMetadata: f7f74ecd523cd63065a50016a8786a893b9b1efe0d313bc5bcc54682f56e55fe: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:hf-9f834a8-fix-checksum framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 Generic NVIDIA GPUx2 BF16 ngcMetadata: 
375dc0ff86133c2a423fbe9ef46d8fdf12d6403b3caa3b8e70d7851a89fc90dd: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '2' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 2 - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - profileId: nim/nvidia/llama3.1-nemotron-nano-4b-v1.1:hf-9f834a8-fix-checksum framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 4B V1.1 Generic NVIDIA GPUx1 BF16 ngcMetadata: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2: model: nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 release: 1.8.5 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '1' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 1 - key: NIM VERSION value: 1.8.5 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - variantId: Llama 3.1 Nemotron Nano 8b V1 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "canGuestDownload": false,
    "createdDate": "2025-03-18T04:59:31.861Z",
    "description": "# Model Overview\n\n## Description:\n\n[//]: # ([Provide additional details about the algorithm/model; include supporting image/video and/or reference blog/article, if available.] [This model is ready for commercial/non-commercial use.] OR [This model is for research and development only.] OR [This model is for demonstration purposes and not for production usage.] <br>)\n\nLlama-3.1-Nemotron-Nano-8B-v1 is a model for generating responses for roleplaying, retrieval augmented generation, and function calling.  It is a small language model (SLM) optimized through distillation, pruning and quantization for speed and on-device deployment. VRAM usage has been minimized to approximately 2 GB, providing significantly faster Time-to-First-Token compared to LLMs.\n\nThis model is ready for commercial use.\n\n### License/Terms of Use: \n[NVIDIA AI Foundation Models Community License Agreement](https://developer.nvidia.com/downloads/nv-ai-foundation-models-license)\n\n\n## References\n\nPlease refer to the [User Guide]() to use the model and use a suggested guideline for prompts.\n\n## Model Architecture:\n**Architecture Type:** Transformer <br>\n**Network Architecture:** Llama-3.1 <br>\n\n## Limitations\nThe model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. This issue could be exacerbated without the use of the recommended prompt template. \n\n## Input: \n**Input Type(s):**  Text (Prompt) <br>\n**Input Format(s):** String <br>\n**Input Parameters:** One Dimensional (1D) <br>\n**Other Properties Related to Input:** The model has a maximum of 4096 input tokens. <br>\n \n## Output: \n**Output Type(s):** Text (Response) <br>\n**Output Format:** String <br>\n**Output Parameters:** 1D <br>\n**Other Properties Related to Output:**  The model has a maximum of 4096 output tokens. Maximum output for both versions can be set apart from input.<br>\n\n\n## Prompt Format:\n\nWe recommend using the following prompt template, which was used to fine-tune the model. The model may not perform optimally without it.\n\n**Single Turn**\n\n```\n<extra_id_0>System\n{system prompt}\n\n<extra_id_1>User\n{prompt}\n<extra_id_1>Assistant\\n\n```\n\n**Tool use**\n\n```\n<extra_id_0>System\n{system prompt}\n\n<tool> ... </tool>\n<context> ... </context>\n\n<extra_id_1>User\n{prompt}\n<extra_id_1>Assistant\n<toolcall> ... 
</toolcall>\n<extra_id_1>Tool\n{tool response}\n<extra_id_1>Assistant\\n\n```\n\n\n## Software Integration: (On-Device)\n**Runtime(s):** AI Inference Manager (NVAIM) Version 1.0.0 <br>\n**Toolkit:**  NVAIM <br>\nSee [this document]() for details on how to integrate the model into NVAIM.\n\n**Supported Hardware Platform(s):** GPU supporting DirectX 11/12 and Vulkan 1.2 or higher <br>\n\n**[Preferred/Supported] Operating System(s):** <br>\n* Windows <br>\n\n## Software Integration: (Cloud)\n**Toolkit:** NVIDIA NIM <br>\nSee [this document]() for details on how to integrate the model into NVAIM.\n\n**[Preferred/Supported] Operating System(s):** <br>\n* Linux <br>\n\n### Model Version(s)\nLlama-3.1-Nemotron-Nano-8B-v1\n\n# Training & Evaluation: \n\n## Training Dataset:\n\n** Data Collection Method by dataset <br>\n* Hybrid: Automated, Human <br>\n\n** Labeling Method by dataset <br>\n* Hybrid: Automated, Human <br>\n\n**Properties:** <br>\n\nTrained on approximately 10000 Game/Non-Playable Character (NPC) dialog turns from domain chat data.\n\n## Evaluation Dataset:\n\n** Data Collection Method by dataset <br>\n* Hybrid: Automated, Human <br>\n\n** Labeling Method by dataset <br>\n* Human <br>\n\n**Properties:** <br>\n\nEvaluated on approximately Game/NPC 10000 dialog turns from domain chat data.  <br>\n\n## Inference:\n**Engine:** TRT-LLM <br>\n**Test Hardware:** <br>\n* A100 <br>\n* A10g <br>\n* H100  <br>\n* L40s  <br>\n\n**Supported Hardware Platform(s):** L40s, A10g, A100, H100<br>\n\n## Ethical Considerations:\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.  For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.  Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**",
    "displayName": "Llama-3.1-Nemotron-Nano-8B-v1",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-KZF6-2OL5",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "hf-25.03.17-0508-tool-use-v2",
    "latestVersionSizeInBytes": 16077860262,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/nemotron-mini-4b-instruct.jpg",
    "modelFormat": "SavedModel",
    "name": "llama-3.1-nemotron-nano-8b-v1",
    "orgName": "nim",
    "precision": "OTHER",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "shortDescription": "Llama-3.1-Nemotron-Nano-8B-v1 is a model for generating responses for roleplaying, retrieval augmented generation, and function calling. It is a small language model (SLM) optimized through distillation, pruning and quantization.",
    "teamName": "nvidia",
    "updatedDate": "2025-05-29T19:46:00.234Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama-3.1-nemotron-nano-8b-v1 optimizationProfiles: - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:a100x2-latency-bf16-zxsnn7zu2g framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 A100x2 BF16 Latency ngcMetadata: 2146fcf18ea0412d564c6ed21d2f727281b95361fd78ccfa3d0570ec1716e8db: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:a100x1-throughput-bf16-jfn07bk9ua framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 A100x1 BF16 Throughput ngcMetadata: 222d1729a785201e8a021b226d74d227d01418c41b556283ee1bdbf0a818bd94: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100_NVLx1 BF16 Throughput ngcMetadata: 25b5e251d366671a4011eaada9872ad1d02b48acc33aa0637853a3e3c3caa516: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h200x1-throughput-bf16-hqyhv2wimw framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H200x1 BF16 Throughput ngcMetadata: 434e8d336fa23cbe151748d32b71e196d69f20d319ee8b59852a1ca31a48d311: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100_NVLx1 FP8 Throughput ngcMetadata: 5811750e70b7e9f340f4d670c72fcbd5282e254aeb31f62fd4f937cfb9361007: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - 
key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h200x2-latency-bf16-q6opgs6yja framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H200x2 BF16 Latency ngcMetadata: 6832a9395f54086162fd7b1c6cfaae17c7d1e535a60e2b7675504c9fc7b57689: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h100x2-latency-fp8-zsiywmloya framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100x2 FP8 Latency ngcMetadata: 6c3f01dd2b2a56e3e83f70522e4195d3f2add70b28680082204bbb9d6150eb04: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h100x1-throughput-fp8-5tn9pkgdbq framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100x1 FP8 Throughput ngcMetadata: 7b508014e846234db3cabe5c9f38568b4ee96694b60600a0b71c621dc70cacf3: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:l40sx4-latency-bf16-k3y094rsxq framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 L40Sx4 BF16 Latency ngcMetadata: 844ebe2b42df8de8ce66cbb6ecf43f90858ea7efc14ddf020cf1ae7450ae0c33: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 19GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:a10gx2-throughput-bf16-htgj9vhmiw framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 A10Gx2 BF16 Throughput ngcMetadata: 8a62b002be0b7f82c407e5ed45c50dabe654deca052b521a920682f918323d0d: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 2 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD 
SIZE value: 17GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:l40sx2-throughput-bf16-qivaletdla framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 L40Sx2 BF16 Throughput ngcMetadata: 973a6bfbfc5d13fc5eb18f5011fab777a5bd257d5807e97f842a3364e82160dc: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100_NVLx2 FP8 Latency ngcMetadata: a00ce1e782317cd19ed192dcb0ce26ab8b0c1da8928c33de8893897888ff7580: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:l40sx1-throughput-bf16-anodjae0ya framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 L40Sx1 BF16 Throughput ngcMetadata: ac5071bbd91efcc71dc486fcd5210779570868b3b8328b4abf7a408a58b5e57c: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:l40sx1-throughput-fp8-dbamkqep8q framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 L40Sx1 FP8 Throughput ngcMetadata: ad17776f4619854fccd50354f31132a558a1ca619930698fd184d6ccf5fe3c99: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h200x1-throughput-fp8-mafkx9-zmq framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H200x1 FP8 Throughput ngcMetadata: af876a179190d1832143f8b4f4a71f640f3df07b0503259cedee3e3a8363aa96: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 9GB - profileId: 
nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h100x2-latency-bf16-iq2eo5lxgw framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100x2 BF16 Latency ngcMetadata: b3d535c0a7eaaea089b087ae645417c0b32fd01e7e9d638217cc032e51e74fd0: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100_NVLx2 BF16 Latency ngcMetadata: b7fad3b35b07d623fac6549078305b71d0e6e1d228a86fa0f7cfe4dbeca9151a: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:l40sx2-latency-fp8-hkd8uidneq framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 L40Sx2 FP8 Latency ngcMetadata: c4ff823a8202af4b523274fb8c6cdd73fa8ee5af16391a6d36b17f714a3c71a0: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h200x2-latency-fp8-a3-t7tca3g framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H200x2 FP8 Latency ngcMetadata: e4f217a5fb016b570e34b8a8eb06051ccfef9534ba43da973bb7f678242eaa5f: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:h100x1-throughput-bf16-iugafozvdq framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 H100x1 BF16 Throughput ngcMetadata: e7dbd9a8ce6270d2ec649a0fecbcae9b5336566113525f20aee3809ba5e63856: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:l40sx2-latency-bf16-z1ujefobmq framework: 
TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 L40Sx2 BF16 Latency ngcMetadata: fa36c3502e92c50f78a1906242f929864955e702b7dbfbdb19758fb7ee9aa811: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 Generic NVIDIA GPUx2 BF16 ngcMetadata: 375dc0ff86133c2a423fbe9ef46d8fdf12d6403b3caa3b8e70d7851a89fc90dd: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '2' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 2 - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 Generic NVIDIA GPUx4 BF16 ngcMetadata: 54946b08b79ecf9e7f2d5c000234bf2cce19c8fee21b243c1a084b03897e8c95: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '4' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 4 - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB - profileId: nim/nvidia/llama-3.1-nemotron-nano-8b-v1:hf-25.03.17-0508-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.1 Nemotron Nano 8B V1 Generic NVIDIA GPUx1 BF16 ngcMetadata: ac34857f8dcbd174ad524974248f2faf271bd2a0355643b2cf1490d0fe7787c2: model: nvidia/llama-3.1-nemotron-nano-8b-v1 release: 1.8.4 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '1' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 1 - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 15GB labels: - Llama - Meta - Text Generation - Large Language Model - NVIDIA Validated - Nemo config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Llama 3.1 Instruct displayName: Llama 3.1 Instruct modelHubID: llama-3.1-instruct category: Text Generation type: NGC description: The Llama 3.1 70B-Instruct, 8B instruct and 8B base NIM simplifies the deployment of the Llama 3.1 70B-Instruct, 8B instruct and 8B base tuned models which is optimized for language understanding, reasoning, and text generation use cases, and outperforms many of the available open source chat models on common industry benchmarks. requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3/license/ modelVariants: - variantId: Llama 3.1 8B Instruct modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-02-03T23:30:51.257Z",
    "description": "# **Llama-3.1-8B-Instruct Overview**\n\n## **Description:**\n\n**Llama-3.1-8B-Instruct** is an 8 billion parameter, instruction-tuned large language model created by Meta. This model is part of the Llama 3.1 family of open-access models and is specifically optimized for dialogue and conversational use cases, making it highly capable of following user instructions to perform a wide variety of natural language processing tasks.\n\nThis model is ready for commercial/non-commercial use.\n\nThis version introduces support for GB200 NVL72, GH200 NVL2, B200 and NVFP4. CUDA updated to version 12.9. For detailed information, refer to Release [Notes for NVIDIA NIM for LLMs LLM 1.12](https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html). \n\n## **Third-Party Community Consideration**\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA \\[meta-llama/Llama-3.1-8B-Instruct\\]  \n([https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)). \n\n## **License/Terms of Use:**\n\n**GOVERNING TERMS:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). \n\n**ADDITIONAL INFORMATION**: [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_1/license/). Built with Llama.\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.\n\n## **Deployment Geography:**\n\nGlobal \n\n## **Use Case:**\n\nThis model is primarily used by developers, researchers, and businesses to build and experiment with a wide range of generative AI applications. Its combination of strong performance, efficiency, and an open-access license makes it highly versatile.\n\n* Developers and Businesses would use this model to create production-ready applications such as:  \n  * AI Chatbots and customer service agents.  \n  * Content creation tools for writing emails, marketing copy, and articles.  \n  * Summarization and question-answering systems for internal documents.  \n  * Code generation assistants to help programmers write and debug code.  \n* Researchers would use it to study large language model behavior, explore AI safety and alignment, and benchmark new training or fine-tuning techniques on a capable, open model.  
\n* AI Hobbyists would use the model for personal projects, running it on consumer-grade hardware to experiment with creating their own AI assistants or exploring the frontiers of generative AI.\n\n## **Release Date:**\n\nBuild.Nvida.com 07/23/2024 via  \n(https://build.nvidia.com/meta/llama-3_1-8b-instruct)\n\nHuggingface 07/23/2024 via   \n(https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)\n\n## **Reference(s):** \n\n[https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) \n\n## **Model Architecture:** \n\nArchitecture Type: Transformer  \nNetwork Architecture: Llama-3.1-8B\n\nThis model was developed based on Meta Llama-3.1-8B  \n[https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) \n\nNumber of model parameters: 8.03*10^9\n## **Input:**\n\nInput Type(s): Text \n\nInput Format(s): String \n\nInput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Input: The primary input limit is the model's maximum context length, which is 128,000 tokens. All input text must be pre-processed using the model's specific tokenizer to convert the string into a sequence of token IDs. For conversational use, inputs should be formatted using the model's designated chat template.\n\n## **Output:**\n\nOutput Type(s): Text \n\nOutput Format(s): String\n\nOutput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Output: The output consists of a variable-length sequence of tokens from the model's vocabulary. The total length of the input and output cannot exceed the 128,000-token context window. Post-processing is required to detokenize the raw token ID sequence into a human-readable string.\n\n## **Software Integration:**\n\nRuntime Engine: vLLM, TensorRT\n\nSupported Hardware Microarchitecture Compatibility:\n\nNVIDIA Ampere  \nNVIDIA Blackwell  \nNVIDIA Hopper  \nNVIDIA Lovelace \n\nPreferred Operating System(s):\n\nLinux   \nWindows  \nmacOS\n\nThe integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. 
Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n## **Model Version(s):**\n\n*Llama-3.1-8B-Instruct-1.10.1 \n*Llama-3.1-8B-Instruct-1.12.0\n*Llama-3.1-8B-Instruct-1.13.1\n\n## **Usage**\n\n### **Use with transformers**\n\nStarting with transformers \\>= 4.43.0 onward, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.\n\nMake sure to update your transformers installation via pip install \\--upgrade transformers.\n\n```\nimport transformers\nimport torch\n\nmodel_id = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\n\npipeline = transformers.pipeline(\n    \"text-generation\",\n    model=model_id,\n    model_kwargs={\"torch_dtype\": torch.bfloat16},\n    device_map=\"auto\",\n)\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n    {\"role\": \"user\", \"content\": \"Who are you?\"},\n]\n\noutputs = pipeline(\n    messages,\n    max_new_tokens=256,\n)\nprint(outputs[0][\"generated_text\"][-1])\n```\n\nNote: You can also find detailed recipes on how to use the model locally, with torch.compile(), assisted generations, quantised and more at [huggingface-llama-recipes](https://github.com/huggingface/huggingface-llama-recipes)\n\n### **Tool use with transformers**\n\nLLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/).\n\nTool use is also supported through [chat templates](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in Transformers. Here is a quick example showing a single simple tool:\n\n```\n# First, define a tool\ndef get_current_temperature(location: str) -> float:\n    \"\"\"\n    Get the current temperature at a location.\n    \n    Args:\n        location: The location to get the temperature for, in the format \"City, Country\"\n    Returns:\n        The current temperature at the specified location in the specified units, as a float.\n    \"\"\"\n    return 22.  # A real function should probably actually get the temperature!\n\n# Next, create a chat and apply the chat template\nmessages = [\n  {\"role\": \"system\", \"content\": \"You are a bot that responds to weather queries.\"},\n  {\"role\": \"user\", \"content\": \"Hey, what's the temperature in Paris right now?\"}\n]\n\ninputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)\n```\n\nYou can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:\n\n```\ntool_call = {\"name\": \"get_current_temperature\", \"arguments\": {\"location\": \"Paris, France\"}}\nmessages.append({\"role\": \"assistant\", \"tool_calls\": [{\"type\": \"function\", \"function\": tool_call}]})\n```\n\nand then call the tool and append the result, with the tool role, like so:\n\n```\nmessages.append({\"role\": \"tool\", \"name\": \"get_current_temperature\", \"content\": \"22.0\"})\n```\n\nAfter that, you can generate() again to let the model use the tool result in the chat. 
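For illustration, a minimal sketch of that final step (assuming the tokenizer and model are loaded locally with AutoTokenizer and AutoModelForCausalLM, and reusing the messages list and get_current_temperature tool from the snippets above):\n\n```\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_id = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id, torch_dtype=torch.bfloat16, device_map=\"auto\"\n)\n\n# Render the conversation (including the appended tool call and tool result)\n# with the chat template and move it to the model's device\ninput_ids = tokenizer.apply_chat_template(\n    messages,\n    tools=[get_current_temperature],\n    add_generation_prompt=True,\n    return_tensors=\"pt\",\n).to(model.device)\n\n# Let the model produce a reply that incorporates the tool result\noutputs = model.generate(input_ids, max_new_tokens=256)\nprint(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))\n```\n\n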
Note that this was a very brief introduction to tool calling \\- for more information, see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).\n\n## **Training, Testing, and Evaluation Datasets:**\n\n### **Training Dataset**\n\n**Data Modality:** Text \n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method:** Hybrid: Human, Automated\n\n**Properties:** \n\nThe model was pre-trained on a dataset of over 15 trillion tokens. This dataset is a high-quality mix of publicly available data, heavily filtered for safety and quality. The instruction fine-tuning dataset is smaller and consists of high-quality prompts, responses, and preference rankings curated by humans.\n\n### **Testing Dataset**\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Automated\n\n**Labeling Method:** Human\n\n**Properties:** \n\nThe testing datasets comprise thousands of individual problems designed to measure model capabilities in specific areas:\n\n* **General Knowledge & Reasoning:** MMLU, DROP, AGIEval, BIG-Bench Hard  \n* **Mathematics:** GSM8K, MATH  \n* **Coding:** HumanEval\n\n### **Evaluation Dataset**\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Automated\n\n**Labeling Method:** Human\n\n**Properties:** \n\nThe evaluation datasets consist of thousands of questions and problems designed to test the model's capabilities in areas like general knowledge, reasoning, mathematics, and coding.\n\n**Base pretrained models**\n\n| Category | Benchmark | \\# Shots | Metric | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |\n| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |\n| General | MMLU | 5 | macro\\_avg/acc\\_char | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |\n|  | MMLU-Pro (CoT) | 5 | macro\\_avg/acc\\_char | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |\n|  | AGIEval English | 3-5 | average/acc\\_char | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |\n|  | CommonSenseQA | 7 | acc\\_char | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |\n|  | Winogrande | 5 | acc\\_char | \\- | 60.5 | \\- | 83.3 | 86.7 |\n|  | BIG-Bench Hard (CoT) | 3 | average/em | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |\n|  | ARC-Challenge | 25 | acc\\_char | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |\n| Knowledge reasoning | TriviaQA-Wiki | 5 | em | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |\n| Reading comprehension | SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |\n|  | QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |\n|  | BoolQ | 0 | acc\\_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |\n|  | DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |\n\n**Instruction tuned models**\n\n| Category | Benchmark | \\# Shots | Metric | Llama 3 8B Instruct | Llama 3.1 8B Instruct | Llama 3 70B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |\n| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |\n| General | MMLU | 5 | macro\\_avg/acc | 68.5 | 69.4 | 82.0 | 83.6 | 87.3 |\n|  | MMLU (CoT) | 0 | macro\\_avg/acc | 65.3 | 73.0 | 80.9 | 86.0 | 88.6 |\n|  | MMLU-Pro (CoT) | 5 | micro\\_avg/acc\\_char | 45.5 | 48.3 | 63.4 | 66.4 | 73.3 |\n|  | IFEval |  |  | 76.8 | 80.4 | 82.9 | 87.5 | 88.6 |\n| Reasoning | ARC-C | 0 | acc | 82.4 | 83.4 | 94.4 | 94.8 | 96.9 |\n|  | GPQA | 0 | em | 34.6 | 30.4 | 39.5 | 46.7 | 50.7 |\n| Code | HumanEval | 0 | 
pass@1 | 60.4 | 72.6 | 81.7 | 80.5 | 89.0 |\n|  | MBPP \\++ base version | 0 | pass@1 | 70.6 | 72.8 | 82.5 | 86.0 | 88.6 |\n|  | Multipl-E HumanEval | 0 | pass@1 | \\- | 50.8 | \\- | 65.5 | 75.2 |\n|  | Multipl-E MBPP | 0 | pass@1 | \\- | 52.4 | \\- | 62.0 | 65.7 |\n| Math | GSM-8K (CoT) | 8 | em\\_maj1@1 | 80.6 | 84.5 | 93.0 | 95.1 | 96.8 |\n|  | MATH (CoT) | 0 | final\\_em | 29.1 | 51.9 | 51.0 | 68.0 | 73.8 |\n| Tool Use | API-Bank | 0 | acc | 48.3 | 82.6 | 85.1 | 90.0 | 92.0 |\n|  | BFCL | 0 | acc | 60.3 | 76.1 | 83.0 | 84.8 | 88.5 |\n|  | Gorilla Benchmark API Bench | 0 | acc | 1.7 | 8.2 | 14.7 | 29.7 | 35.3 |\n|  | Nexus (0-shot) | 0 | macro\\_avg/acc | 18.1 | 38.5 | 47.8 | 56.7 | 58.7 |\n| Multilingual | Multilingual MGSM (CoT) | 0 | em | \\- | 68.9 | \\- | 86.9 | 91.6 |\n\n**Multilingual benchmarks**\n\n| Category | Benchmark | Language | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B |\n| :---- | :---- | :---- | :---- | :---- | :---- |\n| General | MMLU (5-shot, macro\\_avg/acc) | Portuguese | 62.12 | 80.13 | 84.95 |\n|  |  | Spanish | 62.45 | 80.05 | 85.08 |\n|  |  | Italian | 61.63 | 80.4 | 85.04 |\n|  |  | German | 60.59 | 79.27 | 84.36 |\n|  |  | French | 62.34 | 79.82 | 84.66 |\n|  |  | Hindi | 50.88 | 74.52 | 80.31 |\n|  |  | Thai | 50.32 | 72.95 | 78.21 |\n\n## **Technical Limitations** \n\n Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, the model's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying this model in any applications, developers should perform safety testing and tuning tailored to their specific applications. Please refer to available resources including the [Responsible Use Guide](https://llama.meta.com/responsible-use-guide), [Trust and Safety](https://llama.meta.com/trust-and-safety/) solutions, and other [resources](https://llama.meta.com/docs/get-started/) to learn more about responsible development. \n\n## **Inference:**\n\n**Acceleration Engine:** vLLM, TensorRT \n\n**Test Hardware:** \n\n* B200 SXM  \n* H200 SXM  \n* H100 SXM  \n* A100 SXM 80GB  \n* A100 SXM 40GB  \n* L40S PCIe  \n* A10G  \n* H100 NVL  \n* H200 NVL  \n* GH200 96GB\n* GB200 NVL72\n* GH200 NVL2\n* RTX 5090  \n* RTX 4090  \n* RTX 6000 Ada\n\n## **Get Help**\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n**You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.**",
    "displayName": "Llama-3.1-8b-instruct",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "Llama3.1",
        "NIM",
        "NSPECT-0DQP-LNLV",
        "llama-3.1-8b-instruct",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx2-latency-bf16-qyait9sohq",
    "latestVersionSizeInBytes": 17635976938,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama-3_1-8b-instruct.jpg",
    "modelFormat": "N/A",
    "name": "llama-3.1-8b-instruct",
    "orgName": "nim",
    "precision": "OTHER",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "Meta",
    "safetyAndSecurity": "",
    "shortDescription": "The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out).",
    "teamName": "meta",
    "updatedDate": "2025-10-21T17:47:59.613Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3_1-8b-instruct-nemo optimizationProfiles: - profileId: nim/meta/llama-3.1-8b-instruct:a10gx4-latency-bf16-r3bmpcovtw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct A10Gx4 BF16 Latency ngcMetadata: 09fec372bdcfaee0662140bc5ed522900bb0b0da7cc37ceba6209731dc55a689: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c0232176c2e5374758e3d88ea13e70aa0edca0862c923428f54b85da208960a9 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 4 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 19GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100_NVLx2 FP8 Latency ngcMetadata: 0c87e2871cd7a6ea205a137109c3afde0134ba22c6fe8e978a752287cf561643: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: cb6ac7eedef673edc08e85f4f3e7525c31f499e5c5f376cbffc05cb8eefe197a number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:gb200x1-throughput-fp8-8imkyjutxw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GB200x1 FP8 Throughput ngcMetadata: 0cf8ac8bfbf183d8a891e9023d6aa7a1d93f6720e5bd78e578711e3d5b822c52: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2feaef51b8c016e5c678f39202dfe542c11eb5fc2443749e6c2330f3474aaffe number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h100x2-latency-fp8-cvpqroehhq framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100x2 FP8 Latency ngcMetadata: 0e0a9fb28e4df4f8a2dcaafbcb03ce1e0b9d27a4e00ec273f27bcc47e7572225: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4217e8fa6ba7ac9609ee76470bec904253dadbe7fc33a52f715e08791073c501 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:gb200x2-latency-fp8-i4razlnzqw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GB200x2 FP8 Latency ngcMetadata: 192d34f8204aa5c44b08406f8d98c86c606363ff8a2ca5f608b87a2516313b55: 
model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0222120a0b05a944b22ed6b0d7376bbe89abef1c05fa6ecd7967199500398864 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200_NVLx1 BF16 Throughput ngcMetadata: 1d7b8b2d964254990181ba7a6e93687275c3372b689d66b6494ad5f788a108a6: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2660946198ebbb837e487b333ef86b2ac4cbc37b907151de45f291596625f919 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_ADAx1 INT4_AWQ Throughput ngcMetadata: 245a4f27515a6291ac239b37f209847384dbadaa5ad155c45d17bcc524594371: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_ADA gpu_device: 26b1:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4d714aa3567eb6e2d72aa08be91bb5fc632e7bbaa645c265104ea1d65eb28efa number_of_gpus: '1' pp: '1' precision: int4_awq profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: INT4_AWQ - key: GPU value: RTX6000_ADA - key: COUNT value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:b200x1-throughput-bf16-zsf8rdhqtw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct B200x1 BF16 Throughput ngcMetadata: 2465a2b2fc773ea207e312352258ca9a54650fc9ec9740ae96646528556a0916: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 61348d392451059a37d2218d940f4aaf266562d0d6fa156e211f266022d5d26e number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 16GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_480GBx1 BF16 Throughput ngcMetadata: 28e1523b3569391509a8e976f17c0b04e21faee7095225076a99636cbb1da858: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 828408fdd397e49bb4256a997d3f85d90c3d9a3e756531b8895cd78a83574aa6 number_of_gpus: '1' pp: '1' precision: bf16 
profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct A100_SXM4_40GBx1 BF16 Throughput ngcMetadata: 40d4f2dcb13710bf7fcf1d9d41dfeb1b0ff22ba2d266bc2997a81a000fa5d031: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 72732266ad2f3d3b824f413f11716f81b87ccb602c3cdda972c7341c0d1e60b5 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 1 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h100x1-throughput-fp8-rmqqnk9ima framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100x1 FP8 Throughput ngcMetadata: 4411bf23579e41275d6a994cd768d9dc2ebbd523253e2844115f24644a5e86b1: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3950ee02bc0277147b77079c0cc5bc954b9189f7866fbbebc37ece4ec31283f6 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:b200x2-latency-bf16-px49bz6jka framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct B200x2 BF16 Latency ngcMetadata: 4a7b681f1dc1dcbc0b98f4c4eaa6bdac6557af058dd878039624c68683e2dee3: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 467e29e38751b085aa13fdc92f6eaf1a08a8c360ef19718a92f64bd507221fd8 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 17GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:gb200x2-latency-bf16--fupfm1fjg framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GB200x2 BF16 Latency ngcMetadata: 4b344f09436a75385ad7c78aa224f685d1f92980ccf7ea52f29a52c1ca646b70: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0a3a8da158191386b506caa79c0bd9787f45009a7e52113b82fcdde0511001d7 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 17GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: 
nim/meta/llama-3.1-8b-instruct:l40sx2-latency-bf16-aauqggrlkw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct L40Sx2 BF16 Latency ngcMetadata: 55df9113a4cd134e4ddaeeae43cd33089be30b74380a9bc29d677ed9784a3492: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9b1feaf6e923581317ff4291ca09856eb403efd7acdeee1c8e787d988ced56ce number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 17GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h200x2-latency-bf16-zwyr4clzla framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200x2 BF16 Latency ngcMetadata: 588fa4150abaa001f1357112de2ca65c85c1c86322b3f7d0ca9f1451f40baee5: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7b377dc3e02bbcef7ec1c0ccee4afc1d99d2409dc7ab6576f1f386ebbedeabc6 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 17GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:b200x2-latency-nvfp4-urjebtmqkg framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct B200x2 NVFP4 Latency ngcMetadata: 5eaaf502f6dab9ce29e7d034182bb56eeeb3e349633f4561018f27b3069189b3: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2693061ea8698de078f95517349178ce8894a51a97db468183032cba22ab04ae number_of_gpus: '2' pp: '1' precision: nvfp4 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 6GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100_NVLx1 FP8 Throughput ngcMetadata: 785d7d60df3f153a36413f29a16ac14bc5cfba73004bc7feee2bca9d78b10e6f: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a0de60706fabf3ee071fef41f0c14225a3d88799bb9728af810e57a7499f038f number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:b200x1-throughput-nvfp4-6zvdbhtdna framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct B200x1 NVFP4 Throughput ngcMetadata: 78fdabce8c3eae38cea72ca3f28aaca02e3cc475c17913d6e8d4e554cba2aaa9: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: 
tensorrt_llm nim_workspace_hash_v1: d07936f83cf22e053e9fe3339050f2e05459ebcde766c94fdb7a6ac90aeb1fda number_of_gpus: '2' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 6GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_480GBx1 FP8 Latency ngcMetadata: 7c71f0d6db2e0d52a3fbc34dabd0584ed7a27ef63a49e21aaa394d8746eeb189: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 1fa738c9de9d4b25a298b3cd021b05beab57c2c9ab5a930a3d1efcf7204fc463 number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h200x1-throughput-fp8-mqkoo4u9fa framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200x1 FP8 Throughput ngcMetadata: 83fa1ce989c823d1fba445823ac58beb734bb31383a33af261a8b0808495678a: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d7e3e88abee403b365d238441ca9c1172e71745fe43cd7dce511e7d95309d237 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_BLACKWELL_SVx1 NVFP4 Throughput ngcMetadata: 882e2041a947f6e0793a600a4470fbbd41e7a3f3363bb4956a2c63aaa7cf51ec: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0852d2d4b6d54d6bc12acf922890bfa19e801e74276ddddcff98034ba0dc4c0f number_of_gpus: '2' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_BLACKWELL_SVx1 FP8 Latency ngcMetadata: 8855de19ef9d0f55c0213a8786591091cc5965a2c862562cc7b492c712ef09e3: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a11707b8479a7230d31a451c07c5650f0e8ff58948507a983a2e89b846929ecf number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: 
PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200_NVLx2 BF16 Latency ngcMetadata: 885fb853c59fc5ea3a61554797670d6f61e4b2db23f1acbc69f7e8e98846ce21: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 362023b28455913264302e9d87593459bc9930c544859af88689346e92085fea number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct A100_SXM4_40GBx2 BF16 Latency ngcMetadata: 88b3c4d52c48162915703053126fe2d2ec64632b4508fb05dd0984904cc4b313: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 344bc1f6c75518604e27015bf9131a6dc8c5257396806f35575516bb14234706 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 2 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:gb200x1-throughput-bf16-qsrhtlj33g framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GB200x1 BF16 Throughput ngcMetadata: 8a33858f5392a45aa85acaab0a81601e9831cfd99507249536c63be228f09918: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3cb77524c717d774efbda1f850840b59abc39d9bd46fc2983ebc3dc1f4931ff6 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 16GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200_NVLx2 FP8 Latency ngcMetadata: 8b0cd9578c1bf872d35c8da2dc72ed6f2161623840923884a8f50725ec11a4ec: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3a251730a9c214eea8101d8192f6c4c35b1d321aad615edc1e0a942521b828b0 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: 
nim/meta/llama-3.1-8b-instruct:h100x2-latency-bf16-xhazfvu8og framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100x2 BF16 Latency ngcMetadata: 8ecf55cfb8e611fb1e1579b57089060c76270bcafb322a872c751cb59ba840bc: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 782d96856a10dac93438804c42286ba3e7d0d7445d7fbd8f8497dd3c80238564 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 17GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h100x1-throughput-bf16-tgzhmf3syg framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100x1 BF16 Throughput ngcMetadata: 90061152a480ade6c471a982258bf4e42dc51cf29ad9f6642120547c33bdf51f: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 04c9939245eef94e92510642382d6ad26f65a25cf1687b76ccc4e66aba70da39 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 16GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_144GBx1 BF16 Throughput ngcMetadata: 9020f539c475f53d364474485cd83728454b7a340c0f1ee2d3cf505ccdcc1189: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ca7f6e7d29a1f514a9d7f4f3d731b0a0d286a9358a96287b12a1045ac9ca590b number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:b200x2-latency-fp8-hrzafixo7g framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct B200x2 FP8 Latency ngcMetadata: 94420d0c4e672e70e91c15d5a6e23c447fa3b43f1632936eebf9cdd0c845d036: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b1e67d29794a75bf923a7224dd297dbc4aacd4d97273a7fd66dda7e8371a6da8 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_BLACKWELL_SVx1 NVFP4 Latency ngcMetadata: 95f587f27ab8c1467d93d12ffb7db8f3920888b4211c2ab82ef4f8de2fca61f5: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: 
RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9cd258950837575782b24640b90c1cd969334d691131036208cd3c8b0735f927 number_of_gpus: '2' pp: '1' precision: nvfp4 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200_NVLx1 FP8 Throughput ngcMetadata: 9b0e99f6e9afa6fa529d47662d85b1e6d16b3abadcc2a5e72c10486eb7c87201: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 93297aebf65f337e982d4ebd8e79f380bd9ad05346cf2e18908c6365de2b2307 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:gb200x2-latency-nvfp4-sgvjjrbeuw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GB200x2 NVFP4 Latency ngcMetadata: 9f558e6681791166fdc01cacf06f2d869b67c26f1d573738f92e5e227f820270: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: efc59c490dab28006a89a5d53a36d4ef5c5d3b0927c7d118f02014d4eb0c29e8 number_of_gpus: '4' pp: '1' precision: nvfp4 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 6GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:l40sx1-throughput-fp8-xad-wr2scw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct L40Sx1 FP8 Throughput ngcMetadata: a3e90cba8e03efc80877da3902607362c851c36e8c45cd92aada9e7cac900765: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7ebba124e3f58d0563b96377f7c85432ef3e2f393efb9e158fd308a8738abcfb number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h200x2-latency-fp8-cftgwz2fda framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200x2 FP8 Latency ngcMetadata: a826d9d8199abbe4e4084a2f64d3658ef6749b1697ecf21fd0615d1e138e368d: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6bc839be18669cae90a69af4e02503965be03b8c68b1b7ac2cf6b612033abeb8 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: 
H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:a100x2-latency-bf16-oxfjg8md-a framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct A100x2 BF16 Latency ngcMetadata: ad582d87e490e749edcaf041d763e6c3f492962ccdbbe83e9204b48d6cfe7641: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0b1583a74d6516dd30c0bfdce8972835384f10fb4f617df4e0260fdf5092b059 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 17GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:gb200x1-throughput-nvfp4-ihgvv-o6wg framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GB200x1 NVFP4 Throughput ngcMetadata: adbc8a19059852df0c2ac75173b80f123b0901926e524a2c050ce60aa3ae5ca1: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a381f8cc9d090bc49cb320a47cdefe01b0555dc9409312516194cb19437436d0 number_of_gpus: '2' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 6GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100_NVLx2 BF16 Latency ngcMetadata: afc6d2a8f5c1affe8524a39c78d6f083cd56ac678f9cc9f89df33b0e0e530ec5: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d0040f5636dd8d4baa9c36337ccb4157b58e47ba7118b2d54b36e2ad96061ed0 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_144GBx1 FP8 Throughput ngcMetadata: b0c4bcf92286ad2f689805bf411e44a617df5a5455c703ddd8053f354d40b5cb: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0ec33e185e2d8e2e4bd97118f54276dbece4b6744371f895bd4c86e8e4dceedb number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:h200x1-throughput-bf16-6ylo-i-bbw framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H200x1 
BF16 Throughput ngcMetadata: b795f66a018d1278aaded769cb88a79b5565d2fe6497739b03d8f1bad88e75d1: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 673a0a6cdd37cabec7d8dcc8f05f787884c72f2b56fcaad416429dda38238c0b number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 16GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_480GBx1 BF16 Latency ngcMetadata: c11d003373b87576201557974186967205684e4045905b5140a3d92f274cbf5f: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b41e403000a7d5221cdd4f00ab4d8a2ef58aa3470a65db51abd523b245c63ea6 number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:l40sx2-latency-fp8-szl6-yje2g framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct L40Sx2 FP8 Latency ngcMetadata: c512ff489822b14e13879c4b1cbb849e5a45d453beeb2d9abfe52f029c0639d2: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 570d33681085a88ae5ea6bd28342996817acb2d9b0a5e8486e197bba77b832a1 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_ADAx1 INT4_AWQ Latency ngcMetadata: c95bbf72a36cc53dd0750074c0307cbc16ef98a8634cd89f94046e226c892ac9: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_ADA gpu_device: 26b1:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 737909c201e9ccea9bfb138401ed768b71985f2aad636ed91b7ca0712e02cb43 number_of_gpus: '1' pp: '1' precision: int4_awq profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: INT4_AWQ - key: GPU value: RTX6000_ADA - key: COUNT value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_BLACKWELL_SVx1 BF16 Latency ngcMetadata: ccfeded811dbe0f17d70c25f83c247d1317114349b5df99ba1044c1fcb79b8ef: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
689cc0a0026ff1fff0e5818d26ffe369c59e7b16f96d9239248504fbab23c28c number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_BLACKWELL_SVx1 BF16 Throughput ngcMetadata: d3ab627cccb5910fbce6396c9d205c84792abee634eb9f334c47086cf5d01b12: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: fa369969c2e20bb29b6b004cde3f63ba17a65056818bd8ad63528141ecb41527 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:a100x1-throughput-bf16-lwcrbwztpq framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct A100x1 BF16 Throughput ngcMetadata: d67b7f59a9a2851e98bce877ee3702e82a3166322418dbf900a6a15e46643472: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2c735fe09841c686ba2f7ace400337d0f11be549d41cdeb9ba1c82688d0688fa number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 16GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_144GBx1 BF16 Latency ngcMetadata: db8a6f9d6f65eaec69ec78ea131cb34ec66bc63df975d23f8d2ccb031806dcc8: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c9d911dd01c8c520784ca0fec350f83855a8ca1ea1a3fed5f707fa642945a3e3 number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:b200x1-throughput-fp8-i5rbiys4jq framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct B200x1 FP8 Throughput ngcMetadata: e0b3ee6ce141beca50c67daccbebb1ce7417c14acd08c81346986898042733b6: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9d7c4bc4201757dcd3f1147712dfb1c83a8f8535405a59e1fa547f17b4a5869b number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: 
COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 9GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct H100_NVLx1 BF16 Throughput ngcMetadata: e6c81e90a8ff3f2cf1b1bffbf760b05c7cf12d18c6486a4690b8ab81b6de436d: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b75eb29f401f555ea3d19a6df30861ef874d10760590ee8856ff0542d7ea1e7f number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_480GBx1 FP8 Throughput ngcMetadata: ed144c17645499c4cd983b4a2e4bdc23f0f03cc55e19073c357e8eb0ff982dc6: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: aab7bbbbcf3e6f7fb544ee77017dd2ee59aa0b81dc314dec4bb46317def34714 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct RTX6000_BLACKWELL_SVx1 FP8 Throughput ngcMetadata: ee928087f01a5df571cf5e62c96f66fedccaf180524ce1e43cb4b5a23295deb8: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 60b75ec923fe8a417deb0273d9a21b4fcd4e3e0f8f9ed6e9527a66433cf6030c number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:l40sx1-throughput-bf16-lh60z9g-aq framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct L40Sx1 BF16 Throughput ngcMetadata: eece8dae913d9055ed8060b6ae1764cefecd6d158dd314851e1ecd15b5d9126d: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ded6e827825103665f5fea2381f04196a88c95525aea31c073556f0204ad9c8e number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 16GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:8c22764a7e3675c50d4c7c9a4edb474456022b16 
framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct GH200_144GBx1 FP8 Latency ngcMetadata: ef8d429a394978d394a8d15ddbdd6666bde4dc68e40f8cb399b188f5b7e59db5: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 02d2b0459a15ab0a39ec4335caf74e3105e66f51ae577b7bf8a1b64cfcb5c472 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 30GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-8b-instruct:a10gx4-throughput-bf16-g04kznyzwa framework: TensorRT-LLM displayName: Llama 3.1 8B Instruct A10Gx4 BF16 Throughput ngcMetadata: f02876a90b3197bcf046fee9ab2beb6f7482b8b35e3ff9ff545d03ba9ba7bb23: model: meta/llama-3.1-8b-instruct release: 1.13.1 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: fda7bc4c0bf1f5eaeecbf29fbdd078ecbf587ac57c62610180d3d5fb90ffcfda number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 4 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.13.1 - key: DOWNLOAD SIZE value: 19GB - key: LLM ENGINE value: TENSORRT_LLM - variantId: Llama 3.1 70B Instruct modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-05-20T18:17:09.107Z",
    "description": "# **Llama-3.1-70B-Instruct Overview**\n\n## **Description:**\n\n**Llama-3.1-70B-Instruct** is a multilingual large language model from the Meta Llama 3.1 collection of pretrained and instruction-tuned generative models. This model is optimized for multilingual dialogue use cases and outperforms many available open-source and closed-source chat models on common industry benchmarks. Llama 3.1 is an auto-regressive language model that uses an optimized transformer architecture; the tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.\n\nThis model is ready for commercial/non-commercial use.\n\nThis version introduces support for GB200 NVL72, GH200 NVL2, B200 and NVFP4. CUDA updated to version 12.9. For detailed information, refer to Release [Notes for NVIDIA NIM for LLMs LLM 1.12](https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html). \n\n## **Third-Party Community Consideration**\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA\\[meta-llama/Llama-3.1-70B-Instruct\\]  \n([https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)). \n\n## **License/Terms of Use:**\n\n**GOVERNING TERMS:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of the model is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/.).\n\n**ADDITIONAL INFORMATION:** [Llama 3.1 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.\n\n## Get Help\n\n### Enterprise Support\n\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n**You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.**\n\n## **Deployment Geography:**\n\nGlobal \n\n## **Use Case:**\n\nDevelopers, AI researchers, and businesses would be expected to use this system to build and power a wide range of applications that require advanced reasoning, instruction-following, and multilingual dialogue capabilities. 
Specific applications include creating sophisticated chatbots and virtual assistants, developing powerful content creation and summarization tools, building complex question-answering systems, and powering multilingual customer support platforms.\n\n## **Release Date:**\n\nBuild.Nvidia.com 07/23/2024 via  \n[llama-3.1-70b-instruct Model by Meta | NVIDIA NIM](https://build.nvidia.com/meta/llama-3_1-70b-instruct)\n\nGithub 07/23/2024 via   \n[https://github.com/meta-llama/llama-models/blob/main/models/llama3\\_1/LICENSE](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)\n\nHuggingface 07/23/2024 via   \n[https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)\n\n**Reference(s):** \n\n[https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)\n\n## **Model Architecture:** \n\nArchitecture Type: Transformer  \nNetwork Architecture: Llama-3.1-70B\n\nThis model was developed based on Meta-Llama-3.1-70B  \n[https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct)\n\nNumber of model parameters: 7.06*10^10\n\n## **Input:**\n\nInput Type(s): Text \n\nInput Format(s): String \n\nInput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Input: The model accepts a string of text which is converted into tokens using the model's specific tokenizer. The total length of the input prompt and the generated output cannot exceed the model's context window of 128,000 tokens.\n\n## **Output:**\n\nOutput Type(s): Text \n\nOutput Format(s): String\n\nOutput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Output: The model generates a string of text, produced token by token. The maximum length of the output is limited by the model's 128,000-token context window, less the length of the input prompt. Post-processing is required to decode the model's token-based output into a readable string.\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n## **Software Integration:**\n\nRuntime Engine: vLLM, TensorRT\n\nSupported Hardware Microarchitecture Compatibility:\n\nNVIDIA Ampere  \nNVIDIA Blackwell  \nNVIDIA Hopper  \nNVIDIA Lovelace \n\nPreferred Operating System(s):\n\nLinux   \nWindows\n\nThe integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. 
Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n## **Model Version(s):**\n\nLlama-3.1-70B-Instruct\n\n## **Usage**\n\n**Use with transformers**\n\nStarting with transformers \\>= 4.45.0, you can run conversational inference using the Transformers pipeline abstraction or by leveraging the Auto classes with the generate() function.\n\nMake sure to update your transformers installation via pip install \\--upgrade transformers.\n\nSee the snippet below for usage with Transformers:\n\n```\nimport transformers\nimport torch\n\nmodel_id = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\n\npipeline = transformers.pipeline(\n    \"text-generation\",\n    model=model_id,\n    model_kwargs={\"torch_dtype\": torch.bfloat16},\n    device_map=\"auto\",\n)\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n    {\"role\": \"user\", \"content\": \"Who are you?\"},\n]\n\noutputs = pipeline(\n    messages,\n    max_new_tokens=256,\n)\nprint(outputs[0][\"generated_text\"][-1])\n```\n\n**Tool use with transformers**\n\nLlama-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/).\n\nTool use is also supported through [chat templates](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in Transformers. Here is a quick example showing a single simple tool:\n\n```\nfrom transformers import AutoTokenizer\n\n# Load the tokenizer so the chat template can be applied\ntokenizer = AutoTokenizer.from_pretrained(\"meta-llama/Meta-Llama-3.1-70B-Instruct\")\n\n# First, define a tool\ndef get_current_temperature(location: str) -> float:\n    \"\"\"\n    Get the current temperature at a location.\n    \n    Args:\n        location: The location to get the temperature for, in the format \"City, Country\"\n    Returns:\n        The current temperature at the specified location in the specified units, as a float.\n    \"\"\"\n    return 22.  # A real function should probably actually get the temperature!\n\n# Next, create a chat and apply the chat template\nmessages = [\n  {\"role\": \"system\", \"content\": \"You are a bot that responds to weather queries.\"},\n  {\"role\": \"user\", \"content\": \"Hey, what's the temperature in Paris right now?\"}\n]\n\ninputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)\n```\n\nYou can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:\n\n```\ntool_call = {\"name\": \"get_current_temperature\", \"arguments\": {\"location\": \"Paris, France\"}}\nmessages.append({\"role\": \"assistant\", \"tool_calls\": [{\"type\": \"function\", \"function\": tool_call}]})\n```\n\nand then call the tool and append the result, with the tool role, like so:\n\n```\nmessages.append({\"role\": \"tool\", \"name\": \"get_current_temperature\", \"content\": \"22.0\"})\n```\n\nAfter that, you can generate() again to let the model use the tool result in the chat. 
Note that this was a very brief introduction to tool calling \\- for more information, see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).\n\n**Use with bitsandbytes**\n\nThe model checkpoints can be used in 8-bit and 4-bit for further memory optimisations using bitsandbytes and transformers.\n\nSee the snippet below for usage:\n\n```\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig\n\nmodel_id = \"meta-llama/Meta-Llama-3.1-70B-Instruct\"\nquantization_config = BitsAndBytesConfig(load_in_8bit=True)\n\nquantized_model = AutoModelForCausalLM.from_pretrained(\n    model_id, device_map=\"auto\", torch_dtype=torch.bfloat16, quantization_config=quantization_config)\n\ntokenizer = AutoTokenizer.from_pretrained(model_id)\ninput_text = \"What are we having for dinner?\"\ninput_ids = tokenizer(input_text, return_tensors=\"pt\").to(\"cuda\")\n\noutput = quantized_model.generate(**input_ids, max_new_tokens=10)\n\nprint(tokenizer.decode(output[0], skip_special_tokens=True))\n```\n\nTo load in 4-bit, simply pass load\\_in\\_4bit=True to BitsAndBytesConfig instead.\n\n**Use with llama**\n\nPlease follow the instructions in the [repository](https://github.com/meta-llama/llama).\n\nTo download the original checkpoints, see the example command below leveraging huggingface-cli:\n\n```\nhuggingface-cli download meta-llama/Meta-Llama-3.1-70B-Instruct --include \"original/*\" --local-dir Meta-Llama-3.1-70B-Instruct\n```\n\n## **Training, Testing, and Evaluation Datasets:**\n\n### **Training Dataset**\n\n**Data Modality:** Text \n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method:** Hybrid: Automated, Human\n\n**Properties:** \n\nThe pretraining dataset contains over 15 trillion tokens. It is a multilingual dataset covering over 30 languages and was filtered heavily for quality using various techniques, including heuristic filters, NSFW filters, and text classifiers. The model's knowledge was trained on data with a cutoff of December 2023\\. \n\n### **Testing Dataset**\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method:** Hybrid: Automated, Human\n\n**Properties:** \n\n**Description:**\n\nThe model was tested on a diverse set of evaluation data.\n\n* Public Benchmarks: These test a wide range of capabilities, from general knowledge and reasoning (MMLU, HellaSwag) to expert-level problem-solving (GPQA) and programming (HumanEval, which contains 164 programming problems).  \n* Internal Evaluation Set: Meta created a new high-quality test set of 2,000 prompts covering 12 key use cases (e.g., coding, reasoning, creative writing, instruction following). This set is used for human evaluation to assess performance on real-world, nuanced tasks. \n\n### **Evaluation Dataset**\n\n**Link:** Undisclosed\n\n**Data Collection Method:** Hybrid: Automated, Human, Synthetic\n\n**Labeling Method:** Hybrid: Human, Automated\n\n**Properties:** \n\nThe evaluation datasets are diverse and test a wide spectrum of capabilities. MMLU measures broad multitask knowledge. GPQA assesses advanced reasoning with difficult, expert-level questions. HumanEval and MATH specifically test code generation and mathematical reasoning abilities, respectively. 
Meta also utilizes a large, private, human-annotated evaluation set designed to assess model performance in real-world, nuanced scenarios. \n\n**Base pretrained models**\n\n| Category | Benchmark | \\# Shots | Metric | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |\n| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |\n| General | MMLU | 5 | macro\\_avg/acc\\_char | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |\n|  | MMLU-Pro (CoT) | 5 | macro\\_avg/acc\\_char | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |\n|  | AGIEval English | 3-5 | average/acc\\_char | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |\n|  | CommonSenseQA | 7 | acc\\_char | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |\n|  | Winogrande | 5 | acc\\_char | \\- | 60.5 | \\- | 83.3 | 86.7 |\n|  | BIG-Bench Hard (CoT) | 3 | average/em | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |\n|  | ARC-Challenge | 25 | acc\\_char | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |\n| Knowledge reasoning | TriviaQA-Wiki | 5 | em | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |\n| Reading comprehension | SQuAD | 1 | em | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |\n|  | QuAC (F1) | 1 | f1 | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |\n|  | BoolQ | 0 | acc\\_char | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |\n|  | DROP (F1) | 3 | f1 | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |\n\n**Instruction tuned models**\n\n| Category | Benchmark | \\# Shots | Metric | Llama 3 8B Instruct | Llama 3.1 8B Instruct | Llama 3 70B Instruct | Llama 3.1 70B Instruct | Llama 3.1 405B Instruct |\n| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |\n| General | MMLU | 5 | macro\\_avg/acc | 68.5 | 69.4 | 82.0 | 83.6 | 87.3 |\n|  | MMLU (CoT) | 0 | macro\\_avg/acc | 65.3 | 73.0 | 80.9 | 86.0 | 88.6 |\n|  | MMLU-Pro (CoT) | 5 | micro\\_avg/acc\\_char | 45.5 | 48.3 | 63.4 | 66.4 | 73.3 |\n|  | IFEval |  |  | 76.8 | 80.4 | 82.9 | 87.5 | 88.6 |\n| Reasoning | ARC-C | 0 | acc | 82.4 | 83.4 | 94.4 | 94.8 | 96.9 |\n|  | GPQA | 0 | em | 34.6 | 30.4 | 39.5 | 46.7 | 50.7 |\n| Code | HumanEval | 0 | pass@1 | 60.4 | 72.6 | 81.7 | 80.5 | 89.0 |\n|  | MBPP \\++ base version | 0 | pass@1 | 70.6 | 72.8 | 82.5 | 86.0 | 88.6 |\n|  | Multipl-E HumanEval | 0 | pass@1 | \\- | 50.8 | \\- | 65.5 | 75.2 |\n|  | Multipl-E MBPP | 0 | pass@1 | \\- | 52.4 | \\- | 62.0 | 65.7 |\n| Math | GSM-8K (CoT) | 8 | em\\_maj1@1 | 80.6 | 84.5 | 93.0 | 95.1 | 96.8 |\n|  | MATH (CoT) | 0 | final\\_em | 29.1 | 51.9 | 51.0 | 68.0 | 73.8 |\n| Tool Use | API-Bank | 0 | acc | 48.3 | 82.6 | 85.1 | 90.0 | 92.0 |\n|  | BFCL | 0 | acc | 60.3 | 76.1 | 83.0 | 84.8 | 88.5 |\n|  | Gorilla Benchmark API Bench | 0 | acc | 1.7 | 8.2 | 14.7 | 29.7 | 35.3 |\n|  | Nexus (0-shot) | 0 | macro\\_avg/acc | 18.1 | 38.5 | 47.8 | 56.7 | 58.7 |\n| Multilingual | Multilingual MGSM (CoT) | 0 | em | \\- | 68.9 | \\- | 86.9 | 91.6 |\n\n**Multilingual benchmarks**\n\n| Category | Benchmark | Language | Llama 3.1 8B | Llama 3.1 70B | Llama 3.1 405B |\n| :---- | :---- | :---- | :---- | :---- | :---- |\n| General | MMLU (5-shot, macro\\_avg/acc) | Portuguese | 62.12 | 80.13 | 84.95 |\n|  |  | Spanish | 62.45 | 80.05 | 85.08 |\n|  |  | Italian | 61.63 | 80.4 | 85.04 |\n|  |  | German | 60.59 | 79.27 | 84.36 |\n|  |  | French | 62.34 | 79.82 | 84.66 |\n|  |  | Hindi | 50.88 | 74.52 | 80.31 |\n|  |  | Thai | 50.32 | 72.95 | 78.21 |\n\n## **Technical Limitations** \n\nTesting conducted to date has not covered, nor could it cover, all scenarios. 
For these reasons, as with all LLMs, the model's potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying this model in any applications, developers should perform safety testing and tuning tailored to their specific applications. Please refer to available resources including the [Responsible Use Guide](https://llama.meta.com/responsible-use-guide), [Trust and Safety](https://llama.meta.com/trust-and-safety/) solutions, and other [resources](https://llama.meta.com/docs/get-started/) to learn more about responsible development. \n\n## **Inference:**\n\n**Acceleration Engine:** vLLM, TensorRT \n\n**Test Hardware:** \n\n  B200 SXM   \n  H200 SXM  \n  H100 SXM  \n  A100 SXM 80GB  \n  A100 SXM 40GB  \n  L40S PCIe  \n  A10G  \n  H100 NVL  \n  H200 NVL  \n  GH200 96GB  \n  GB200 NVL72   \n  GH200 NVL2     \n  RTX 5090  \n  RTX 4090  \n  RTX 6000 Ada\n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.",
    "displayName": "Llama-3.1-70b-instruct",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "Llama3.1",
        "Llama3.1-70b-instruct",
        "NIM",
        "NSPECT-7S3F-QFG8",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx8-latency-bf16-bxiagh4jgg",
    "latestVersionSizeInBytes": 157357267927,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama-3_1-70b-instruct.jpg",
    "modelFormat": "SavedModel",
    "name": "llama-3.1-70b-instruct",
    "orgName": "nim",
    "precision": "OTHER",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "Meta",
    "safetyAndSecurity": "",
    "shortDescription": "The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out).",
    "teamName": "meta",
    "updatedDate": "2025-10-15T17:49:15.605Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3_1-70b-instruct-nemo optimizationProfiles: - profileId: nim/meta/llama-3.1-70b-instruct:b200x1-throughput-nvfp4-lissxvpltg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct B200x1 NVFP4 Throughput ngcMetadata: 1b7ebc7f2cd12aa502b3f2bc17fa55a91f304abd992b287c535a59b6536d3e05: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b476c975e5339b67e01a1a9aee137aa1dd80c1d520b62ba160b64e426c8e2e6e number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct RTX6000_BLACKWELL_SVx8 BF16 Latency ngcMetadata: 266a5944d595ad57b186c01686b30ba7d1fc10f22a5b4fa17ef8d5cd54faf0f8: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 62c6e2eeff50dd4b71f6a31817eed7685778f8d1415340f402e269add0ca102b number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 8 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h200x4-latency-bf16-csp1xgtxoq framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200x4 BF16 Latency ngcMetadata: 2a56d7a6042e02c5b469f5128c76379973e255caf5b1adc1cde6e03230159077: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 91f8367eac71f0e5731988bac7b8b9ae66747619ed7cea336ff1ad2609b07945 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 4 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 139GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct A100_SXM4_40GBx8 BF16 Throughput ngcMetadata: 2fdeceaf1b64acf3ab1c2a22b8e23f6c25d639d6a5d7006c51c80b613fb2699b: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ce61710173c15471c4031430bc8de32b94fb1859a9d4d4cced5c09664b9658c3 number_of_gpus: '8' pp: '1' precision: bf16 profile: throughput tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:l40sx4-throughput-fp8-ulen5raong framework: TensorRT-LLM displayName: Llama 
3.1 70B Instruct L40Sx4 FP8 Throughput ngcMetadata: 3013dcf9b905cbd2f5e23f804fd5d66d183ddc71a8735631d3cad277f7c23897: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ce87554d33c9d66d40c52c15b3a90b5c802ef4b7d05781dc74fd18485a20e15d number_of_gpus: '4' pp: '1' precision: fp8 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:gb200x1-latency-nvfp4-aiiz15cu0w framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GB200x1 NVFP4 Latency ngcMetadata: 344979e57f70e669d35378bc48ef7d14a13dc6aa0467ce9cb29166b8a8371bcb: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b668500698d48c5aee9f5b591c4383cb62053acb59539cf0b511b8b2d2ae864f number_of_gpus: '1' pp: '1' precision: nvfp4 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct RTX6000_BLACKWELL_SVx2 FP8 Throughput ngcMetadata: 3526ceaf332ec21d4317c0939a99a3862b19593527fa942ffd5a1df2dade47ce: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2a535c9c9ddfb8e328abc28f3b4d9564ecdf9886fa177096f7b38dee7af754ab number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100_NVLx2 BF16 Throughput ngcMetadata: 3684471ad5d007fa1f72bbc672a794107de7b0e8df88214dc1563a24aa99c8b7: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 22af385c5fd7064b011e826d0d78c210b7ac1fe7a9e29eef15e6a5e433b9db9d number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:b200x2-latency-nvfp4-hrt0sgzswa framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct B200x2 NVFP4 Latency ngcMetadata: 377c705c5682293482c5094b946b8e74ccba5302c324b5ce41f952e9cac29890: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm 
nim_workspace_hash_v1: 289f1c679cc71fbadaa8139366458b0c3fc39d49ba067efdb7db9fbf3801ac1c number_of_gpus: '2' pp: '1' precision: nvfp4 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:l40sx4-latency-fp8-ctp-cvrc0w framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct L40Sx4 FP8 Latency ngcMetadata: 43160a1132063bf60ef6d7fe17a9b271f03dedbdb3bd1584a2e53707c8faa9ce: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f1a3de57c511586b258f58e7457103c919f8fa4db289d37961cad2468596ee6c number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100_NVLx2 FP8 Throughput ngcMetadata: 44d44ef91639f0c76a1ef4be0022651ed8d42b485c26de00ba99aee570d1768d: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 1d049541d40b0b0407983f0438189a5d21af6652866d6640437e0323c7878361 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:gb200x2-throughput-bf16-lo9t8i-qua framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GB200x2 BF16 Throughput ngcMetadata: 45c52f130d8d467fa6e91f4ffee683fff5601e16df41388d4047e63e294e1165: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 706ae15947d58ed243812620f46199e223e7288c6624ccd33d9e9393a7bfb96a number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct RTX6000_BLACKWELL_SVx2 NVFP4 Throughput ngcMetadata: 4b9618100e94fc85d674a89eae960e18d8192163abe5db2a0d2be891d32ea06a: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 54ad694d948a6bd8d413341c0d9476b3756a5553aa8ce8ba5479d3b3cf289e9d number_of_gpus: '2' pp: '1' precision: nvfp4 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: 
RTX6000_BLACKWELL_SV - key: COUNT value: 2 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h100x4-throughput-bf16-wf01-bcefa framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100x4 BF16 Throughput ngcMetadata: 4d9f79288ba78fd61b3cc445c6f9da30362a132ea371798a8ec3dff7bddc3a20: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 22f873ff61f22bd360dc173f0f4a068d4d950c02ea6045570eb7f50ec8f83e93 number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 139GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h100x2-throughput-fp8-vjxy5bkroq framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100x2 FP8 Throughput ngcMetadata: 4fbe63c3f6f9b928dac05fe81a278ac1ad45ccf329850f66bd6cbc0c2f2c044c: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: bb98ff9885aa439391b057063cce3555833a27f88d982b2c210fd4b752390475 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GH200_144GBx2 BF16 Latency ngcMetadata: 50fe6d2879cabe91e1e0b96314d40695e7ffc9e83a02d63629b9cabfae496dbe: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 21c7fedc20e7b94738606f7f4f8ebb346dc3f087f082dd32b713e4b8e6ed0a06 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200_NVLx2 FP8 Latency ngcMetadata: 592714cb05c8f25c0445fb7467d096956db9bfbee0958eb713c02a5410867bff: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 90dae25f9f80d311b10f61f5772c37bac723422cce689c138396be49db0b82f4 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM 
displayName: Llama 3.1 70B Instruct GH200_144GBx1 FP8 Throughput ngcMetadata: 5fbbcbeb676751bfdc9b65cca39334f82fbe543070ea66b4756f71de6cfe2b59: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b2d2411595259e3d02add53ce15aaae59cf5bb02731910aecbd8b5b7a3f75adc number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:a100x4-throughput-bf16-ftwaepe7oq framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct A100x4 BF16 Throughput ngcMetadata: 68aff19a2e5198624143bf25060662c863ecf21039b6f2d4ef3fe7965a8bab96: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 8f849c4baf82033a3bbfba75bd2a6fc379c92079e81f6fca99f978c9d1c04ad1 number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 4 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 140GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h200x2-latency-fp8-msyzoyixrw framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200x2 FP8 Latency ngcMetadata: 6c27932dc47820a7130505d6bceca05a3ec27628a8416b4603b9b9c8367f161d: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 070c645e1731e5dd9875c800a22cadcd32f9008bf79b768884a331afc9c96e25 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:l40sx4-throughput-bf16-gfrr6smxia framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct L40Sx4 BF16 Throughput ngcMetadata: 6c9c0490830921741f09a61b59d32ff645681d80194b3af37214824d65f05e7e: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e4710ee55a988086fb6dd511b81e989b78d523d951f0da1719cf0328d750a71e number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 140GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:b200x1-throughput-fp8-ktlniezpyw framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct B200x1 FP8 Throughput ngcMetadata: 741556aa43f38761800674e07ff79f5d61136c8301687b3f914f61c78f72ce46: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
d8ffe1683a73563951753b8b23c0854020887590f3a2112230e0ad947fe1ae99 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GH200_480GBx1 FP8 Throughput ngcMetadata: 75e60d670c274c13e9647548bc1c21549d28871432524a0c86becc2b9c73392e: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6a43b054452f258a2315e308da5d8813a4d5c7672764a4f28218373855853197 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct RTX6000_BLACKWELL_SVx4 BF16 Throughput ngcMetadata: 76296a7f2a589f543337824f321c38801835885f3a85d9efc3c5b820d7db5228: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4f131424adabbfc81a877f09da0ca3bb31989fc0bed618b8d0c5969faa01f7fe number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:a100x8-latency-bf16-zb8ixw2ong framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct A100x8 BF16 Latency ngcMetadata: 7cfa94d868fb7d979659d8418cbf37496cefd480d3b3b3ea06877b08e2868827: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 698871a2af3710aa48027caa8536573c057658a99a89b8d9652e15f19f0c2e12 number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 8 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 147GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GH200_480GBx1 FP8 Latency ngcMetadata: 88a14e8523e8747165e8574a84cee8c4a580af03ced367e74017bf4046835dd2: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e85188dc62e0c517a93bc24a32bee7b7f27b66fa0c6e3184813a4873369e413f number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: 
GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:b200x4-latency-bf16-gqr-l-hprg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct B200x4 BF16 Latency ngcMetadata: 8cc3eeb4f2ae763b36bf76a67ed42daea7b533852a65090b522440956de4f327: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3a0889dd10acf4050ccd4fbb878eb5c982c420e3a63df2a2fafaa9fc6c8cf861 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 4 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 138GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h100x4-latency-fp8-oxqturnvsg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100x4 FP8 Latency ngcMetadata: 9078b6b41878fcfd7e5e9dca2ea0b5c5560d85d31e2cbb9e0e9801d2bb192bfe: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: cc434557f087f390d162cf972e61958ba9e9f09b6112e174e9824b7bcd92e6f4 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200_NVLx1 FP8 Throughput ngcMetadata: 92adc0b1a36388246d3f037e68df053c83b4bfe4d23e1fae59f711e6e451b944: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0e974238d656d94cb79d39fcc0064f619e6606c678fcba61651358275c693e75 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:gb200x1-throughput-nvfp4-w75uvvawyq framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GB200x1 NVFP4 Throughput ngcMetadata: a92446d9168e5b10aabe4d31889c68b90503f2ee9bbefafa7b406ef1f2f2b92b: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c593287a9ac875b3a649fb7725a9f1a1e6816129291594a3783a556296bd8808 number_of_gpus: '1' pp: '1' precision: nvfp4 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: NVFP4 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 41GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:b200x2-latency-fp8-zkeshhnnug framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct B200x2 FP8 Latency 
ngcMetadata: b483bd59b245ec47d9b700691316ed76163f1500d17dcd1fd1fc13ef4fa34dbd: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: daec3db3d904aa6cba1cecd2404867f81d44cf10bb16cc9c9f0ee9a19085bb68 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 69GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100_NVLx4 FP8 Latency ngcMetadata: b8a18b250c3bd00464dd5194016ecc81756f0121ebc070081bdb2de6dd715a91: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e741999618e5a4d94b595ff13d17028e5f11db1d5ed50644fd584d34d553198b number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GH200_144GBx2 BF16 Throughput ngcMetadata: bab01e4b4d692d4d879a405cac30bc3830fb4bfed76deaff130bc989bbf70008: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 943e58ca6ffb929366337f69cfc2a49f55a062cb721159a094ae6ade370d2302 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:a10gx8-throughput-bf16-c6h2bujzqq framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct A10Gx8 BF16 Throughput ngcMetadata: bb0acd8d341492a58388d49010ebfd53ccf30e9ba61961e68853b7812bdd57d5: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 91af47c6a8cd2c59b5187b47bcb6c3feaef274f685067c0ba391f035ccf265eb number_of_gpus: '8' pp: '1' precision: bf16 profile: throughput tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 150GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GH200_144GBx2 FP8 Latency ngcMetadata: c02d69cc0542152ece147e75cb33487d9058a83ac94866680455b56c075cede4: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
50fd91ee630703f0af954360b638c997fa7e69b60ed949c129e3ba042ed47b66 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h200x1-throughput-fp8-j-xwy-p6zg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200x1 FP8 Throughput ngcMetadata: cb42798192666f9b621fb9a5aeecb342ae389bb6c8992183804aeed016fc1862: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0a504db21e8269006446b7b777218b5fc904fb8308dd5fbba24de96b577d289f number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GH200_144GBx2 FP8 Throughput ngcMetadata: d40298dbc0f90c12808e7e5becb22e47c284f05012c73dabb9818f03f461cd10: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ceebc0caeb9bdff847a148e0915219590cc463095bfb9545eb265978e4b8eb81 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 2 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct A100_SXM4_40GBx8 BF16 Latency ngcMetadata: d847fc6b060db9381d38e6cb59ff183f29c6bb457c402d24c27971e59bad9bf7: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 35500bcc312a974be001478a0b1e2466fea9dac13d7bd087146f70d4c9e854c6 number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:gb200x1-latency-fp8-f3rjvlafrw framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GB200x1 FP8 Latency ngcMetadata: dbcdb5f1412398520d7330cb890aa57f1792596f7dc885cc65a1dc20d390cc9d: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a5a7e9799d92d3e8e2cb39c45acf73dee122f9f65c1bbeaf4eaf7d669745a89c number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU 
DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:l40sx4-latency-bf16-jhyf9rlszq framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct L40Sx4 BF16 Latency ngcMetadata: e2b3ba60e795d306cf487ea71c8a5d128769f452eeffb54fada6697f031b556c: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3d65171e66e53a0c6b9d1253a09c393d917cb611ea6de85f7c7abadf7ea934b6 number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 138GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H100_NVLx4 BF16 Latency ngcMetadata: e6fcaba4b0c11392cd4ce8e0eddc261fac45e994c7735c3d79734245aac1a68d: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: dbc58664a141b59a060de53a9e4f2d25b4ec41f75fcb791a0f046981b0634fad number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:gb200x1-throughput-fp8-trr9koy1vg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GB200x1 FP8 Throughput ngcMetadata: e8e4c9317e3e32c8e50d3f4c54019b40df5608a319359ad9f8257d23f2348c2b: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: cb02e92359903bcfa274d5176c8a7a840e58a3df1587ca0f0c734049d4f1d5c8 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 68GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:b200x2-throughput-bf16-gbt9zmjfla framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct B200x2 BF16 Throughput ngcMetadata: f078cbf33438ea9b68c5d4eba7bec671246d44730cbb0af6d94dfa1517bf3036: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4db4480cbfa2cd40a70ed784e3bb40e3e7e4d6693b7d275fb0b19b328ea1da0a number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:a10gx8-latency-bf16-mqqdeavnfg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct A10Gx8 BF16 Latency ngcMetadata: 
f49e065d985faa3a766163f386395cc53c64429754c58cf9edf553ac0ec96244: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9c2fce4ab72d829d5e3b008b9f6b64a608194a97ee6fc18af7863cf922226107 number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 150GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct RTX6000_BLACKWELL_SVx4 NVFP4 Latency ngcMetadata: f66f34808a8fe25ee8a3666427569f3e7119b1af54cd64e31d082282d5d47210: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d7ec938fb438f8d87c58707eb5a60af4e1fb6a9b7ac42eef6738dd9b0d2ff671 number_of_gpus: '4' pp: '1' precision: nvfp4 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: NVFP4 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:h200x2-throughput-bf16-9iwul7vevg framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200x2 BF16 Throughput ngcMetadata: f6d98b286dd43d8a6e677a9a0f218e76928154490a908a2d9f76cbfd2cd043bf: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 583b7cb36984c4bff9e31b3f10937d34d488b2e4e32ac4b0ac5b44e02ef4779b number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 134GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct RTX6000_BLACKWELL_SVx4 FP8 Latency ngcMetadata: f7397c4ee54cefd8fc3cc3b947406ba51947215d77ff58f35aeaa298605db13a: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ead7f13a4122bc78f9624682198bcccbc485345c9a58742601b0bf0e8ed59760 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 4 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:1d54af340dc8906a2d21146191a9c184c35e47bd framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct H200_NVLx2 BF16 Throughput ngcMetadata: fa4fbf5af52b66775f63d40cbc3db263304d7844095d1a677d799b8e90bf141b: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
d2ae506c5cddf13a3d2f0139dcb6edafd042794999162e7d9623bd4ceabb1b70 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 263GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.1-70b-instruct:gb200x4-latency-bf16-3uozpudciw framework: TensorRT-LLM displayName: Llama 3.1 70B Instruct GB200x4 BF16 Latency ngcMetadata: fe20ab9158c65c3e7765e50c5c72ece46ee34a9e184dcdb13eda9bbef78ab300: model: meta/llama-3.1-70b-instruct release: 1.14.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e6a18cf6935b817ca2968d06e1abd91444d73634b80ff54f3680d152cadc209e number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 4 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.14.0 - key: DOWNLOAD SIZE value: 138GB - key: LLM ENGINE value: TENSORRT_LLM labels: - Llama - Meta - Text Generation - Large Language Model - TensorRT-LLM - Language Generation - NeMo - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: StarCoder2-7B displayName: StarCoder2-7B modelHubID: starcoder-2 category: Language Model type: NGC description: StarCoder2-7B generates source code from natural language instructions and code prompts across a wide range of programming languages, supporting code completion, synthesis, and infilling. requireLicense: true licenseAgreements: - label: License Agreement url: https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement modelVariants: - variantId: StarCoder2-7B modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-03-08T17:48:51.559Z",
    "description": "# **Starcoder2-7B Overview**\n\n## **Description:**\n\n**StarCoder2-7B** generates source code from natural language instructions and code prompts across a wide range of programming languages. This 7-billion parameter model is part of the next generation of open-source large language models for code, developed by the BigCode collaboration, and was trained on The Stack v2, a massive, permissively licensed dataset covering over 600 programming languages. It is specifically designed to assist with tasks like code completion, code synthesis, and infilling (filling in missing code within a file).\n\nThis model is ready for commercial/non-commercial use.\n\nThis version introduces support for GB200 NVL72, GH200 NVL2, B200 and NVFP4. CUDA updated to version 12.9. For detailed information, refer to Release [Notes for NVIDIA NIM for LLMs LLM 1.12](https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html). \n\n## **Third-Party Community Consideration**\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA \\[bigcode/starcoder2-7b\\]  \n([bigcode/starcoder2-7b \u00b7 Hugging Face](https://huggingface.co/bigcode/starcoder2-7b##license)). \n\n## **License/Terms of Use:**\n\n**GOVERNING TERMS:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); except for the model which is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/.).\n\n**ADDITIONAL INFORMATION:** [BigCode Open RAIL-M v1 License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).\n\n## **Get Help**\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.\n\n## **Deployment Geography:**\n\nGlobal \n\n## **Use Case:**\n\nThe expected users of StarCoder2-7B are software developers, data scientists, programmers, and students. 
They would use this model as a sophisticated coding assistant integrated into their development environment (like VS Code) or as a standalone tool for the following purposes:\n\n* Accelerating Development: To significantly speed up the coding process using intelligent, context-aware code completion and infilling.\n\n* Code Synthesis: To generate entire functions, classes, or scripts from a natural language description (e.g., \"Write a JavaScript function that validates an email address using regex\").\n\n* Rapid Prototyping: To quickly create boilerplate code or functional prototypes for new applications or features.\n\n* Learning & Debugging: To understand a new programming language or framework by generating example code, or to get suggestions on how to refactor or fix a piece of buggy code.\n\n* Automation: To automate the creation of repetitive code, such as unit tests, data processing scripts, or configuration files.\n\n## **Release Date:**\n\nBuild.NVIDIA.com 03/18/2024 via  \n[starcoder2-7b Model by BigCode | NVIDIA NIM](https://build.nvidia.com/bigcode/starcoder2-7b)\n\nGithub 02/28/2024 via   \n[GitHub \\- bigcode-project/starcoder2: Home of StarCoder2\\!](https://github.com/bigcode-project/starcoder2)\n\nHuggingface 02/28/2024 via   \n[bigcode/starcoder2-7b \u00b7 Hugging Face](https://huggingface.co/bigcode/starcoder2-7b##license)\n\n## **Reference(s):** \n\n[bigcode/starcoder2-7b \u00b7 Hugging Face](https://huggingface.co/bigcode/starcoder2-7b##license)\n\n## **Model Architecture:** \n\nArchitecture Type: Transformer  \nNetwork Architecture: StarCoder2\n\nThis model was developed based on The Stack V2  \n[https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids) \n\nNumber of model parameters: 0.717*10^10\n\n## **Input:**\n\nInput Type(s): Text \n\nInput Format: String \n\nInput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Input:\n\n* Tokens: The model accepts a sequence of tokens with a maximum context window of 16,384 tokens. This limit encompasses both the input prompt and the tokens generated by the model.  \n* Characters: Input strings can contain a wide range of Unicode characters, consistent with the diverse set of programming languages and natural languages found in the training data.  \n* Pre-Processing Needed: Raw input strings must be converted into a sequence of integer token IDs using the specific tokenizer associated with the StarCoder2-7B  model before being fed into the network.\n\n \n\n## **Output:**\n\nOutput Type(s): Text \n\nOutput Format: String\n\nOutput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Output: \n\n* Tokens: The model generates a sequence of tokens. The length of the output is variable and is ultimately constrained by the model's maximum context window of 16,384 tokens (the sum of input and output tokens cannot exceed this limit).  \n* Characters (Including Restrictions): The output consists of a wide range of Unicode characters from the model's vocabulary, designed to produce syntactically correct source code and coherent natural language.  \n* Post-Processing Needed: The raw token IDs generated by the model must be decoded using the model's specific tokenizer to be converted into a human-readable string. Special control tokens (e.g., end-of-sequence tokens) may also need to be filtered from the final output.\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. 
By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n## **Software Integration:**\n\nRuntime Engine: vLLM, TensorRT\n\nSupported Hardware Microarchitecture Compatibility: NVIDIA Hopper  \n\nPreferred Operating System(s):\n\nLinux   \nWindows\n\nThe integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n## **Model Version(s):**\n\nStarCoder2-7B-1.10.1 <br>\nStarCoder2-7B-1.12.0 <br>\nStarCoder2-7B-1.13.1 <br>\nStarCoder2-7B-1.14.0 <br>\nStarCoder2-7B-1.15.0 <br>\n\n## **Usage**\n\n**Running the model on CPU/GPU/multi GPU**\n\n* Using full precision:\n\n```\n# pip install git+https://github.com/huggingface/transformers.git # TODO: merge PR to main\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\ncheckpoint = \"bigcode/starcoder2-7b\"\ndevice = \"cuda\" # for GPU usage or \"cpu\" for CPU usage\n\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n# for multiple GPUs install accelerate and do `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map=\"auto\")`\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(device)\noutputs = model.generate(inputs)\nprint(tokenizer.decode(outputs[0]))\n```\n\n* Using torch.bfloat16:\n\n```\n# pip install accelerate\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\ncheckpoint = \"bigcode/starcoder2-7b\"\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\n\n# for fp16 use `torch_dtype=torch.float16` instead\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint, device_map=\"auto\", torch_dtype=torch.bfloat16)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(\"cuda\")\noutputs = model.generate(inputs)\nprint(tokenizer.decode(outputs[0]))\n```\n\n```\nprint(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\n```\n\n**Quantized Versions through bitsandbytes**\n\n* Using 8-bit precision (int8):\n\n```\n# pip install bitsandbytes accelerate\nfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n\n# to use 4bit use `load_in_4bit=True` instead\nquantization_config = BitsAndBytesConfig(load_in_8bit=True)\n\ncheckpoint = \"bigcode/starcoder2-7b\"\ntokenizer = AutoTokenizer.from_pretrained(checkpoint)\nmodel = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=quantization_config)\n\ninputs = tokenizer.encode(\"def print_hello_world():\", return_tensors=\"pt\").to(\"cuda\")\noutputs = model.generate(inputs)\nprint(tokenizer.decode(outputs[0]))\n```\n\n```\n\n>>> print(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\n# load_in_8bit\nMemory footprint: 7670.52 MB\n# load_in_4bit\n>>> print(f\"Memory footprint: {model.get_memory_footprint() / 1e6:.2f} MB\")\nMemory footprint: 4197.64 MB\n```\n\n## **Training, Testing, and Evaluation Datasets:**\n\n### **Training Dataset:**\n\n**Data Modality:** Text \n\n**Link:** Undisclosed\n\n**Data Collection Method by dataset:** Hybrid: Human, Synthetic, 
Automated\n\n**Labeling Method by dataset:** Hybrid: Human, Synthetic, Automated\n\n**Properties:** \n\n* Quantity: StarCoder2-7B was trained on a processed subset of the dataset totaling approximately 3.3 trillion tokens. The complete, pre-processed Stack v2 dataset contains over 67TB of data.  \n* Dataset Descriptions: The Stack v2 is a very large dataset of source code from over 600 programming languages, supplemented with text from GitHub Issues, Pull Requests, and Kaggle Notebooks. A key feature of the dataset is that it has been filtered to only include content from repositories with permissive licenses. It also underwent an extensive PII (Personally Identifiable Information) redaction process to remove sensitive data.\n\n### **Testing Dataset:**\n\n**Link:** Undisclosed\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated\n\n**Labeling Method by dataset:** Hybrid: Human, Automated\n\n**Properties:** \n\nQuantity: The benchmarks consist of a varied number of problems (e.g., HumanEval: 164 problems; MBPP: \\~400 test problems; DS-1000: 1,000 problems).\n\nDataset Descriptions: This is a collection of evaluation suites, not a single dataset. They are designed to measure the model's performance on specific downstream tasks:\n\n* HumanEval/MBPP: Test Python code generation from natural language docstrings.  \n* DS-1000: Tests code generation for data science tasks (e.g., using Pandas, NumPy).  \n* CruxEval: Tests the model's ability to predict the output of a given code snippet. The primary goal is to assess the functional correctness and logical reasoning capabilities of the model.\n\n### **Evaluation Dataset:**\n\n**Link:** Undisclosed\n\n**Data Collection Method by dataset:** Hybrid: Automated, Human, Synthetic\n\n**Labeling Method by dataset:** Hybrid: Human, Automated\n\n**Properties:** \n\nQuantity: The benchmarks consist of a varied number of problems (e.g., HumanEval: 164 problems; MBPP: \\~400 test problems; DS-1000: 1,000 problems).\n\nDataset Descriptions: This is a collection of evaluation suites designed to measure the model's performance on specific downstream tasks:\n\n* HumanEval/MBPP: Test Python code generation from natural language docstrings.  \n* DS-1000: Tests code generation for data science tasks (e.g., using Pandas, NumPy).  \n* CruxEval: Tests the model's ability to predict the output of a given code snippet. The primary goal is to assess the functional correctness and logical reasoning capabilities of the model. \n\n## **Technical Limitations** \n\nThe model has been trained on source code from 17 programming languages. The predominant language in source is English although other languages are also present. As such the model is capable of generating code snippets provided some context but the generated code is not guaranteed to work as intended. It can be inefficient and contain bugs or exploits. See [the paper](https://huggingface.co/papers/2402.19173) for an in-depth discussion of the model limitations. \n\n## **Inference:**\n\n**Acceleration Engine:** vLLM, TensorRT \n\n**Test Hardware:** \n  \n  H100 SXM <br> \n  H200 SXM (BF16 TP2) <br> \n  \n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. 
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.",
    "displayName": "StarCoder2 7B",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-GLGZ-U7GA",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "h200x2-throughput-bf16-2yu77crgzg",
    "latestVersionSizeInBytes": 15477496973,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama2-70b.jpg",
    "modelFormat": "N/A",
    "name": "starcoder2-7b",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "BigCode",
    "safetyAndSecurity": "",
    "shortDescription": "StarCoder2-7B model is a 7B parameter model trained on 17 programming languages from The Stack v2, with opt-out requests excluded. The model uses Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens...",
    "teamName": "bigcode",
    "updatedDate": "2025-11-17T18:19:21.047Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/bigcode/containers/starcoder2-7b optimizationProfiles: - profileId: nim/bigcode/starcoder2-7b:h100x1-throughput-bf16-wqtdmrjeda framework: TensorRT-LLM displayName: Starcoder2 7B H100x1 BF16 Throughput ngcMetadata: 0f40318708a05837c5517a80f06974ff2c353c11bc6e04eb10baabe4436a7522: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7ffe6beb932b6f649191382b440a2ae6a18a3a7c1a883a5c47cb0de89b812266 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/bigcode/starcoder2-7b:h100x2-throughput-bf16-pag8ayfq7a framework: TensorRT-LLM displayName: Starcoder2 7B H100x2 BF16 Throughput ngcMetadata: 4753e2649bd3f25d4742969ccea5bb7e6ac2e469ebe811d194565decbb7c91d7: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6de32383a825dc6ff1128099c218896c3fa72b897e3ea0cd559ef2133b20d6c2 number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 15GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/bigcode/starcoder2-7b:h100x2-latency-bf16-bcq-c0ggmw framework: TensorRT-LLM displayName: Starcoder2 7B H100x2 BF16 Latency ngcMetadata: 57280e7f84736bfd89a5fc38bc51f5ef6c0d92ed77ad66c60d897ccd7165ac98: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4f122733d53833b661fddc7ab39bb1b0c188779f2dd945241a7868b1e968dd1d number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 15GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/bigcode/starcoder2-7b:h200x1-throughput-bf16-yir1bzdhja framework: TensorRT-LLM displayName: Starcoder2 7B H200x1 BF16 Throughput ngcMetadata: 6f7097713b9a9c9e8553347ee7cf28f0c4c7c2bd913166dcb36c666ecc48dad1: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d1b221b1fd69a8f1b0dcd11b964dbf589518034edacf353953316bf9548f5de3 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/bigcode/starcoder2-7b:h200x2-latency-bf16-a8shrirgma framework: TensorRT-LLM displayName: Starcoder2 7B H200x2 BF16 Latency ngcMetadata: 70d88d7152538c95bc0dc059470e9f00656d8431c56ed11743d267c1dfccd433: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' gpu: H200 
gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2758237c2bfa7e1f02183efb7432a82b677b8a29331fe65b97d8e09c2075b51e number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 15GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/bigcode/starcoder2-7b:h200x2-throughput-bf16-lspznbyceg framework: TensorRT-LLM displayName: Starcoder2 7B H200x2 BF16 Throughput ngcMetadata: e496963dfd535acf3104a4040e5d0b4a73ab564f0f1c583d3ad153a28200f266: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3e06b5f5b48d0b152e1a73bea3a346274e602f8d4ed0d3151ea215c3f9cbf8fc number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 15GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/bigcode/starcoder2-7b:hf-bb9afde framework: TensorRT-LLM displayName: Starcoder2 7B Generic NVIDIA GPUx1 ngcMetadata: 1d76bac39d2ca5f44c35735c615e4758bee6f6964c6db099577cb9000ecb6447: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' llm_engine: tensorrt_llm nim_workspace_hash_v1: 43ab4fbb30e1beedda3de8df4244c9fc44fb8fbd8ba0ac23b028abf822bbf637 pp: '1' tp: '1' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: COUNT value: 1 - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - profileId: nim/bigcode/starcoder2-7b:hf-bb9afde framework: TensorRT-LLM displayName: Starcoder2 7B Generic NVIDIA GPUx2 ngcMetadata: ef596a550ec0d61b427e3f7ff26fd21d49b5b1caff72b5bc8f3a5affc2a1d7b9: model: bigcode/starcoder2-7b release: 1.14.1 tags: feat_lora: 'false' llm_engine: tensorrt_llm nim_workspace_hash_v1: 43ab4fbb30e1beedda3de8df4244c9fc44fb8fbd8ba0ac23b028abf822bbf637 pp: '1' tp: '2' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: COUNT value: 2 - key: NIM VERSION value: 1.14.1 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' labels: - bigCode - StarCoder - "Code Generation" - "Text Generation" - "Multilingual support" - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Mistral Instruct displayName: Mistral Instruct modelHubID: mistral-instruct category: Text Generation type: NGC description: Mistral Instruct is a language model that can follow instructions, complete requests, and generate creative text formats. The Mistral Instruct Large Language Model (LLM) is an instruct fine-tuned version of the Mistral base model. modelVariants: - variantId: Mistral 7B Instruct modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2024-06-18T23:25:12.525Z",
    "description": "# **Mistral-7B-Instruct-v0.3 Overview**\n\n## **Description:**\n\n**Mistral-7B-Instruct-v0.3** is a large language model (LLM) that has been fine-tuned for instruction-based tasks. It is an improved version of the Mistral-7B-v0.3 model and is designed to be easily fine-tuned to achieve compelling performance. \n\nThis model is ready for commercial/non-commercial use.\n\nThis version introduces support for GB200 NVL72, GH200 NVL2, B200 and NVFP4. CUDA updated to version 12.9. For detailed information, refer to Release [Notes for NVIDIA NIM for LLMs LLM 1.12](https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html). \n\n## **Third-Party Community Consideration**\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party's requirements for this application and use case; see link to Non-NVIDIA \\[mistralai/Mistral-7B-Instruct-v0.3\\]  \n([https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)). \n\n## **License/Terms of Use:**\n\n**GOVERNING TERMS:** The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of the model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).\n\n**ADDITIONAL INFORMATION:** Apache 2.0 License.\n\nYou are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws.\n\n## **Deployment Geography:**\n\nGlobal \n\n## **Use Case:**\n\nThis model is primarily intended for AI developers, researchers, and businesses seeking a powerful yet efficient foundational language model.\n\n### Expected uses include:\n\n* Fine-tuning for Custom Applications: Developers can use Mistral-7B-Instruct-v0.3 as a base to train specialized models for tasks like creating customer service chatbots, content summarization tools, code generation assistants, and sentiment analysis systems.\n\n* Research and Experimentation: Researchers can leverage this open-source model to study language model behavior, explore new training techniques, or establish performance benchmarks on various natural language processing tasks.\n\n* Prototyping AI Solutions: This model's excellent balance of high performance and relatively low computational cost makes it ideal for startups and individual developers to rapidly build and test proof-of-concept AI features before deploying larger-scale solutions.\n\n## **Release Date:**\n\nHuggingface 05/22/2024 via   \n[https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) \n\n## **Reference(s):** \n\n[https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) \n\n## **Model Architecture:** \n\nArchitecture Type: Transformer  \nNetwork Architecture: Mistral-7B-v0.3\n\nThis model was developed based on Mistral-7B-v0.3  \n[https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3)  \n\nNumber of model parameters: 7.25*10^9\n\n## **Input:**\n\nInput Type(s): Text \n\nInput Format: String \n\nInput 
Parameters: One-Dimensional (1D)\n\nOther Properties Related to Input:\n\n* Tokens: The model processes text as a sequence of tokens. The input string is converted into a sequence of integer token IDs by a tokenizer. The vocabulary size is 32,768 tokens.  \n    \n* Size and Length Limits: The maximum input sequence length supported by the model is 32,768 tokens.  \n    \n* Pre-Processing Needed: Yes. Raw input text must be tokenized. For chat or instruction-following tasks, the input must be formatted according to the model's specific chat template, which typically involves wrapping user prompts in \\[INST\\] and \\[/INST\\] tags.\n\n \n\n## **Output:**\n\nOutput Type(s): Text \n\nOutput Format: String\n\nOutput Parameters: One-Dimensional (1D)\n\nOther Properties Related to Output: \n\n* Tokens: The model generates a sequence of token IDs from its vocabulary of 32,768 tokens.  \n    \n* Post-Processing Needed: Yes. The generated sequence of token IDs must be decoded by the tokenizer to be converted into a human-readable string.  \n* Length: The output length is variable and is controlled by generation parameters. Generation stops when an end-of-sequence (EOS) token is produced or the maximum length is reached.\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n## **Software Integration:**\n\nRuntime Engine: vLLM, TensorRT\n\nSupported Hardware Microarchitecture Compatibility:\n\nNVIDIA Ampere  \nNVIDIA Blackwell  \nNVIDIA Hopper  \nNVIDIA Lovelace \n\nPreferred Operating System(s):\n\nLinux   \nWindows\n\nThe integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. 
Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n## **Model Version(s):**\n\nMistral-7B-Instruct-v0.3\n\n## **Usage**\n\n**Instruct following**\n\n```\nfrom mistral_inference.transformer import Transformer\nfrom mistral_inference.generate import generate\n\nfrom mistral_common.tokens.tokenizers.mistral import MistralTokenizer\nfrom mistral_common.protocol.instruct.messages import UserMessage\nfrom mistral_common.protocol.instruct.request import ChatCompletionRequest\n\n\ntokenizer = MistralTokenizer.from_file(f\"{mistral_models_path}/tokenizer.model.v3\")\nmodel = Transformer.from_folder(mistral_models_path)\n\ncompletion_request = ChatCompletionRequest(messages=[UserMessage(content=\"Explain Machine Learning to me in a nutshell.\")])\n\ntokens = tokenizer.encode_chat_completion(completion_request).tokens\n\nout_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)\nresult = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])\n\nprint(result)\n```\n\n**Function calling**\n\n```\nfrom mistral_common.protocol.instruct.tool_calls import Function, Tool\nfrom mistral_inference.transformer import Transformer\nfrom mistral_inference.generate import generate\n\nfrom mistral_common.tokens.tokenizers.mistral import MistralTokenizer\nfrom mistral_common.protocol.instruct.messages import UserMessage\nfrom mistral_common.protocol.instruct.request import ChatCompletionRequest\n\n\ntokenizer = MistralTokenizer.from_file(f\"{mistral_models_path}/tokenizer.model.v3\")\nmodel = Transformer.from_folder(mistral_models_path)\n\ncompletion_request = ChatCompletionRequest(\n    tools=[\n        Tool(\n            function=Function(\n                name=\"get_current_weather\",\n                description=\"Get the current weather\",\n                parameters={\n                    \"type\": \"object\",\n                    \"properties\": {\n                        \"location\": {\n                            \"type\": \"string\",\n                            \"description\": \"The city and state, e.g. San Francisco, CA\",\n                        },\n                        \"format\": {\n                            \"type\": \"string\",\n                            \"enum\": [\"celsius\", \"fahrenheit\"],\n                            \"description\": \"The temperature unit to use. 
Infer this from the users location.\",\n                        },\n                    },\n                    \"required\": [\"location\", \"format\"],\n                },\n            )\n        )\n    ],\n    messages=[\n        UserMessage(content=\"What's the weather like today in Paris?\"),\n        ],\n)\n\ntokens = tokenizer.encode_chat_completion(completion_request).tokens\n\nout_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)\nresult = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])\n\nprint(result)\n```\n\n**Generate with transformers**\n\n```\nfrom transformers import pipeline\n\nmessages = [\n    {\"role\": \"system\", \"content\": \"You are a pirate chatbot who always responds in pirate speak!\"},\n    {\"role\": \"user\", \"content\": \"Who are you?\"},\n]\nchatbot = pipeline(\"text-generation\", model=\"mistralai/Mistral-7B-Instruct-v0.3\")\nchatbot(messages)\n```\n\n**Function calling with transformers**\n\nTo use this example, you'll need transformers version 4.42.0 or higher. Please see the [function calling guide](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in the transformers docs for more information.\n\n```\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\n\nmodel_id = \"mistralai/Mistral-7B-Instruct-v0.3\"\ntokenizer = AutoTokenizer.from_pretrained(model_id)\n\ndef get_current_weather(location: str, format: str):\n    \"\"\"\n    Get the current weather\n\n    Args:\n        location: The city and state, e.g. San Francisco, CA\n        format: The temperature unit to use. Infer this from the users location. (choices: [\"celsius\", \"fahrenheit\"])\n    \"\"\"\n    pass\n\nconversation = [{\"role\": \"user\", \"content\": \"What's the weather like in Paris?\"}]\ntools = [get_current_weather]\n\n\n# format and tokenize the tool use prompt \ninputs = tokenizer.apply_chat_template(\n            conversation,\n            tools=tools,\n            add_generation_prompt=True,\n            return_dict=True,\n            return_tensors=\"pt\",\n)\n\nmodel = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map=\"auto\")\n\ninputs.to(model.device)\noutputs = model.generate(**inputs, max_new_tokens=1000)\nprint(tokenizer.decode(outputs[0], skip_special_tokens=True))\n```\n\n## **Training, Testing, and Evaluation Datasets:**\n\n### **Training Dataset:**\n\n**Data Modality:** Text \n\n**Link:** Undisclosed\n\n**Data Collection Method by dataset:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method by dataset:** Hybrid: Automated, Synthetic, Human\n\n**Properties:** The pre-training data is a diverse mix of text and code from the public web. The fine-tuning data consists of high-quality instruction-response pairs.\n\n### **Testing Dataset:**\n\n**Link:** Undisclosed\n\n**Data Collection Method by dataset:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method by dataset:** Hybrid: Human, Automated\n\n**Properties:** \n\n* Quantity: The number of data items varies significantly per benchmark (e.g., MMLU has \\~15.9k questions, HellaSwag has \\~10k sentences).  
\n* Dataset Descriptions: The model is tested against benchmarks designed to evaluate a wide range of capabilities, including: general knowledge and reasoning (MMLU, ARC), commonsense inference (HellaSwag, Winogrande), truthfulness (TruthfulQA), conversational ability (MT-Bench), and code generation (HumanEval, MBPP). \n\n### **Evaluation Dataset:**\n\n**Link:** Undisclosed\n\n**Data Collection Method by dataset:** Hybrid: Human, Synthetic, Automated\n\n**Labeling Method by dataset:** Hybrid: Human, Automated\n\n**Properties:** The model is evaluated against benchmarks designed to measure a wide range of capabilities, including: general knowledge and reasoning (MMLU), truthfulness (TruthfulQA), conversational ability (MT-Bench), and code generation (HumanEval). The quantity of data varies significantly per benchmark.\n\n## **Technical Limitations** \n\nThe Mistral-7B-Instruct model is a quick demonstration that the base model can be easily fine-tuned to achieve compelling performance. It does not have any moderation mechanisms. We're looking forward to engaging with the community on ways to leverage guardrails, allowing for deployment in environments requiring moderated outputs. \n\n## **Inference:**\n\n**Acceleration Engine:** vLLM, TensorRT \n\n**Test Hardware:** \n \n* B200 SXM   \n* H200 SXM  \n* H100 SXM  \n* A100 SXM 80GB  \n* A100 SXM 40GB  \n* L40S PCIe  \n* A10G  \n* H100 NVL  \n* H200 NVL  \n* GH200 96GB\n* GB200 NVL72\n* GH200 NVL2   \n* RTX 5090  \n* RTX 4090  \n* RTX 6000 Ada\n\n## **Deployment Details:**\n\nVisit the [NIM Container LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) page for release documentation, deployment guides, and more.\n\n## Get Help\n\n## **Enterprise Support**\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA\u2019s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**",
    "displayName": "Mistral-7B-Instruct-v0.3",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "Bulk Build",
        "Mistral",
        "NIM",
        "NSPECT-YDAW-FMDD",
        "mistral-7b-instruct-v0-3",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "l40sx2-latency-bf16--dooucx8xw",
    "latestVersionSizeInBytes": 14861937693,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/mistral-7b-instruct.jpg",
    "modelFormat": "N/A",
    "name": "mistral-7b-instruct-v0-3",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "Mistral AI",
    "safetyAndSecurity": "",
    "shortDescription": "Mistral-7B-Instruct-v0.3 is a language model that can follow instructions, complete requests, and generate creative text formats",
    "teamName": "mistralai",
    "updatedDate": "2025-08-29T13:34:39.863Z"
} displayName: Mistral 7B Instruct source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/mistralai/containers/mistral-7b-instruct-v0.3 optimizationProfiles: - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200_NVLx1 BF16 Throughput ngcMetadata: 090aed9ae0f4312f525a15003626f36dd30aded5cabb5bfd580cfb88510f7175: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 713cc9e4be3d70c5a99ac45e46c5cd2cb271b5c16228561f1daf992f8feac8ff number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:gb200x2-latency-fp8-6ewdxacbyg framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GB200x2 FP8 Latency ngcMetadata: 0c430e8114b7d75876e040eb9e57f94f8780339c06d2afd7d52b8b520fe7d002: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 783069b3e78233c4649dfbd4031e5e32109f7eda1e54e623f39afa6771f69812 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:l40sx2-latency-fp8-werjmjtilg framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 L40Sx2 FP8 Latency ngcMetadata: 0dd2f4179304094d417e6326812f71bf5583853f655c71d2992893485208b6c4: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f8c9fdd309012b470e249367853869bb7c38c8adf82866e2e1f02b6ffabc6429 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200_NVLx2 BF16 Latency ngcMetadata: 1c8d8380b88e5e0b9dfa9bd7d9808e7b44533516e015e085121a17bec9f2803a: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9a8267f26d4614a846205cc91d2282acd830762361507abc23e58dfe78cd412c number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h200x2-latency-fp8-nnktn87ayw framework: TensorRT-LLM displayName: 
Mistral 7B Instruct V0.3 H200x2 FP8 Latency ngcMetadata: 1d71ecae305c2a01b823eac3cac374e0cf882795349b7c9aa045363c82f331a3: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6edfcc1e5567c57e834b36011decd5d2efb559be351273b2ad7c8768bae66e39 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 RTX6000_BLACKWELL_SVx1 BF16 Latency ngcMetadata: 21d116fd5db9b38ae613c9ec2117e796e0aeec6d8f24e92928ca1171b5f0db8a: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c78a84fee0362ed61ba62a668240cb6619c459e8c105ba274e13c186512f846d number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h200x1-throughput-bf16-fii9d12dng framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200x1 BF16 Throughput ngcMetadata: 240f8bb29f20bd7b6f3a76367e82c2182d6f626ac3ac800daac124d6cdce1be6: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9190e11b1ecbdeb46a92c026ca480103519885a8f07638f6d4a21d27a9118141 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 A100_SXM4_40GBx2 BF16 Latency ngcMetadata: 31f3565f323b25fb739b4319a054db75e52ad21e1e3adb93de0b7f932de6e954: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b84730aefe89d65c679b93ea096271ba583373eca30f081c680bed7dcff8f7c1 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 2 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h100x1-throughput-bf16-p8aaj1ui3a framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100x1 BF16 Throughput ngcMetadata: 37fe5e59120c01002604cd395d38f91f3c71808c9a76060d772ee8625db8a9aa: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100 
gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 43b0efb1c318ab00e8740ac496ec18fd55db1d3f50a9c9470408fd5b647ccb0e number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_144GBx1 BF16 Throughput ngcMetadata: 3822e73f5b87aab36b8fa7f67f06b027ee79f259d4a6f6149d6e6e8834e15694: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 99355664e85deab8cef6bcda9b64b62eae3e6cb4fb72654992701318971d9cab number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_144GBx1 BF16 Latency ngcMetadata: 3929f189d9fe09ced84378d555047f329e9b5d22ea05a84dfc27bf4e423ab2ab: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a01faf2c27412adcdcc908764085b24d58398d230a42ecc86b4d0241886d55b6 number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_144GBx1 FP8 Throughput ngcMetadata: 3d2e4a566abec5fdab7016cf1ee4b9f0a28c0bbaf1ab5ba1fbcb739b9fbbfab6: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: dcbbb88340365f3a3ce5006101441492476ef83235d14fb9d2eaf245868f6e51 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:l40sx1-throughput-bf16-wgortubrfa framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 L40Sx1 BF16 Throughput ngcMetadata: 3d97e245329239baeda4299616d3379ddbea98f0f30a9fb6a0f97ff3bb593593: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 59b129434dc4805de4e69fc1e786327abcad095b67e099df18a0741700fc3562 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: 
trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:b200x2-latency-bf16-k0zyvwltfq framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 B200x2 BF16 Latency ngcMetadata: 4369fb715f1d8a01cec62d750cbea038af0b5f5f032a372236fe6cb7ecaad891: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b41d026216b5cc343dc386982bc593595899581dd03e2e608bc13a7d7fe1fb71 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100_NVLx1 FP8 Throughput ngcMetadata: 4c9a845c4a8037390a5d87a8e2db4a9ef8c7cc5c5e613960b771adce3e548deb: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 00ae490b209e488c85835da5990d77b81c03bd111838c6ab188f5b3c1084f5f0 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:gb200x2-latency-bf16-wkgtx84w0q framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GB200x2 BF16 Latency ngcMetadata: 514f55258d4cffc08411144a8709add1d6dfda7563894feee671bad11a2be79c: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 054261eb6e99db9b73dd8ff0146b8748e4676c27344512bb070ffd471c54002d number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100_NVLx2 FP8 Latency ngcMetadata: 5165a3e3a29e9d6717869d3960133151f93437c8c34a297c2b6080763bfdcf32: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 20bd7371c735a556fea27dd89285a4bc2bb8cc5630c912da7ba9bdfd4fe0f149 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: 
nim/mistralai/mistral-7b-instruct-v0-3:b200x1-throughput-bf16-bzgxd7omcg framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 B200x1 BF16 Throughput ngcMetadata: 5a9eef29a1519f40178baec3444a0133990c5ed49ebd6e0ee5de5391f403c25f: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 99677dd60c3849da7d5daab13e7094c8f15c1ed0b324691dcfeb07c340748155 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:b200x1-throughput-fp8-1kujysl0bw framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 B200x1 FP8 Throughput ngcMetadata: 61a13f62e79b27cd7c69c32477b58857196ccc914a1b9f526fb9715238ceaddd: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f9ef3850321d85a46a3267f358c73bc62479d5fd3bd077bfbaa54968926284ab number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_144GBx1 FP8 Latency ngcMetadata: 6bdad363d842f0dc7b89c0bcdbfce49ab7ff3c77a7f1aaf741e10b5a8cfc7b65: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 05684b1b6aeb72f02bd8422d6f4bd2c193e2409465575f3b01fd7fe0ca606056 number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_480GBx1 BF16 Latency ngcMetadata: 6e083975f86a7bd245f26d412ca4c0d99bed4a05e85eb516c7bf23d4a8dbc635: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 01c793b12ef404fe8d3d97f17a5851e07af0ce3e1ed7028940203e38020cff5b number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 RTX6000_BLACKWELL_SVx1 FP8 Throughput ngcMetadata: 6f9107642dc7198eb0084ba462da560fe696b28a74dc9bd5747f58b2229c0833: model: 
mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 60a40aa9dddaddacb1f4f7ebbb536b33494d4a252b563fb321152125c6e5be1d number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:a10gx1-throughput-bf16-dbarntjrxg framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 A10Gx1 BF16 Throughput ngcMetadata: 8aff3c4c1c985c1ab4bd362202c194479652465f22c634e79fead9f935d9f308: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e936f5f6f910166d5a160a10cd1a565e15bac32bd77cdf527f23a97e9a792622 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 1 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:a100x1-throughput-bf16-ug2ytdn9rq framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 A100x1 BF16 Throughput ngcMetadata: 8d844ff4c978b716fb1b3044d4663305081e95fe9e3281ccbc60ebd243137939: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 52c07121ddb1886588c745606458ef1289c479b94c2e1d77b8260c89cb5e40ce number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200_NVLx1 FP8 Throughput ngcMetadata: 8e60a02116ccc9ed3587946a5b2ad0c431826d89e36be714dfa92044e84579de: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e395395c3d17874ecbc16978f25f851e3a8e4b08e6f107f8d1e00dc1f485aad2 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100_NVLx1 BF16 Throughput ngcMetadata: 8f1c5ed6338e2517b1db9987ca4a9cab78ad17acca14801eb301cd52fb60a4e2: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7b2c3bc5c67da0408b365dcde48b64e5ac4c4b316e7061e4da026d8ebfa23c60 
number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_480GBx1 BF16 Throughput ngcMetadata: a049a483413cc9fcf09502fb199582de0948a105c7d58ed41023f74fcf46a84a: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6fd41902191734a0e6c6105beef1c9eaec18a21786868ec1f5cf9876f924709f number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h200x2-latency-bf16-utfzkvbx7q framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200x2 BF16 Latency ngcMetadata: a2bd430ddfc5a8063daef926241911c1db8503f6b24034483213620ea0d6534c: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: fb34a1eb2c580f06657b13b17ed3fdcebd849fce79f29c090d58c8798bf18df0 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h100x1-throughput-fp8-3zk3rahgzq framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100x1 FP8 Throughput ngcMetadata: a91be3d64f314c006fa8f85baacd7a54eda8be911ac8ab66cca6b02044de16a8: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6f2d35f77b6e720ee0f8c950a138b31d739efd3c69094c0d830c7a6e1657575b number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:l40sx1-throughput-fp8-kegtu7-f8w framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 L40Sx1 FP8 Throughput ngcMetadata: a9776f67cf10b8456cb9c6a3cc657310cf6b402b482f75294f89a832813e585d: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a002dbee023af5c15de8bd132e0ad771081f285acdb4196fc77df8e2d5025bf7 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION 
value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:a10gx2-latency-bf16-ultpn-z0fa framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 A10Gx2 BF16 Latency ngcMetadata: b75e85db64643ec2a6fe296828296b6fb2971998445181b3e92bb4edb7028f9c: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2d647a36531389075e6d579d0e030fcaeb428ec3056fdf12e56ad673c99c3e9d number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 2 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 A100_SXM4_40GBx1 BF16 Throughput ngcMetadata: c30c03b6c527e5bbc13c9024db7888bd3ef1f81f5437dd5eeb639963e7f956c0: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 1b85347f49cd97c7a43491a4065364128a860973559d1fbb418436cf1d22071f number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 1 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_480GBx1 FP8 Latency ngcMetadata: c5440e61e502d2bde9d8182735ec4b3cb5dc07386768b22e3fd11afb8f9123e1: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c1481c7a0c16fafd6097a18868e47f96bdb3e9bfef14223ed63d4c58b0b41046 number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 RTX6000_BLACKWELL_SVx1 FP8 Latency ngcMetadata: c6298bc319a27d0b8ce68b6ff2ee478f4503a532f85281db4e58e09bdc1e828d: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: bb8fc0519d6e093698164f6283010687e2acf7b28c240faeebcb9f6d04f3bdfd number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:a100x2-latency-bf16-eao98qqyaq framework: TensorRT-LLM displayName: Mistral 7B 
Instruct V0.3 A100x2 BF16 Latency ngcMetadata: cffacd84594ea7bf5f03f8eaaa107d0e6078dadf84c2b8919c53ecc44ec0a418: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 31efbfeecc0071b15073a663036ed7854fe34edad643248b0b33f1648d396528 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:gb200x1-throughput-bf16-b7lzpmozrg framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GB200x1 BF16 Throughput ngcMetadata: d3e771cdfdccfc6685afb1749f581cbd723b3f4a7180453c366e49fe02201035: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 36332f865d9e636409ef1d780e1e129a7958c56612e251e991090198a71ecca2 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h100x2-latency-fp8-ubwp9icmag framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100x2 FP8 Latency ngcMetadata: d4d4737e0a2b76a1409c12d66a70924643f1bfca7448ca7bb9deb7a25f449470: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 5a148e8d8559629ab65b636a3344bd0a6c9d5ea1e7974bcdc5eb9273bc0c3aec number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:b200x2-latency-fp8-dmol-yoeqa framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 B200x2 FP8 Latency ngcMetadata: d772d81834faf0d0f3021ff92ea6893131a28c7b1eacedff9586b5304c768e16: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d21056b60d5e01083565a8671683b54e86879aa36a36475318838deec4613666 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GH200_480GBx1 FP8 Throughput ngcMetadata: e2392b25430a3f1fcec000a51661f633d36450438db8b09765dffaced7fac7e7: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 
5ae57ff46795ac07b5bb351fafc5f767a4c62e864a2c283d982e92c8d82273e8 number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 RTX6000_BLACKWELL_SVx1 BF16 Throughput ngcMetadata: e2698f06fe02b17f4ac157fb39bd9ae75282d96e0a618c96ca9472a04bd3679b: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e8b647653354c51429e61400d7e38ea59531989d0808b570fa5755b7d6bfe130 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200_NVLx2 FP8 Latency ngcMetadata: e7014f30a669dca5d114ee37df6385a5453b02a23614832ef7bdb0e7f1622d11: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 43c16ff7a8e64039006342b0ba5d0067d7b3e49f7a913f06e67128585266aa67 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:l40sx2-latency-bf16--dooucx8xw framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 L40Sx2 BF16 Latency ngcMetadata: e93d09fc66b8e716a8370d431143b6d5efa1ab47a3aed691907d3c3d8d85bd4d: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ee18ec1b20cc3bbaff5d95b5e157c2d8ab9dc856df258de26b73ff6ec8e9f98d number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h100x2-latency-bf16-y5oxwbaufw framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100x2 BF16 Latency ngcMetadata: f01bf801094b032f87027566ae9036ac0489547a3a650a93bf7af4e12d7975c8: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 87eccccb3d5001d66e004f0e0bd2e26e245774007035b6d3c9a01f120c5d5f02 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: 
H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 14GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:gb200x1-throughput-fp8-ndgj2enqyq framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 GB200x1 FP8 Throughput ngcMetadata: f76535fba1856c95ce15f11a3228fcb0469242f4921644d94fa682220c518c3f: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c0ef6863d68dbd3c3fdb7f3a0e8f2da9daf77c94757f228736a335de4a1ae628 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:hf-0d4b76e-tool_calling-bf16 framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H100_NVLx2 BF16 Latency ngcMetadata: f97a97adc7e9ca1ae225535ac8949af0f306846b429f079946fcc43cf0346f20: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 620affc99411e713b97069819f83de5bb626bf16dcb1d064492ad50dfc9641b7 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 28GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mistral-7b-instruct-v0-3:h200x1-throughput-fp8-ez1figc57w framework: TensorRT-LLM displayName: Mistral 7B Instruct V0.3 H200x1 FP8 Throughput ngcMetadata: fd130b01c59445a3967c804d57077fd77a3b2f76048603516402320f809f88ad: model: mistralai/mistral-7b-instruct-v0.3 release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f830f542f5a36acc42ddf3d7f7823662ed361c4e8e0ee0007cfe70da3df2fb9f number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM labels: - Mistral - Instruct - Large Language Model - TensorRT-LLM - Language Generation - NeMo - NVIDIA Validated config: architectures: - Other modelType: Mistral license: NVIDIA AI Foundation Models Community License - name: Mixtral Instruct displayName: Mixtral Instruct modelHubID: mixtral-instruct category: Text Generation type: NGC description: The Mixtral Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts model. Mixtral Instruct is a language model that can follow instructions, complete requests, and generate creative text formats. The Mixtral Instruct Large Language Model (LLM) is an instruct fine-tuned version of the Mixtral. 
modelVariants: - variantId: Mixtral 8x7B Instruct modelCard: ewogICAgImFjY2Vzc1R5cGUiOiAiTk9UX0xJU1RFRCIsCiAgICAiYXBwbGljYXRpb24iOiAiT3RoZXIiLAogICAgImJpYXMiOiAiIiwKICAgICJjYW5HdWVzdERvd25sb2FkIjogZmFsc2UsCiAgICAiY3JlYXRlZERhdGUiOiAiMjAyNC0wNy0xM1QxODo1ODo0MC4wOTRaIiwKICAgICJkZXNjcmlwdGlvbiI6ICIjIyBNb2RlbCBPdmVydmlld1xuXG4jIyMgRGVzY3JpcHRpb25cblxuTWl4dHJhbCA4eDdCIEluc3RydWN0IGlzIGEgbGFuZ3VhZ2UgbW9kZWwgdGhhdCBjYW4gZm9sbG93IGluc3RydWN0aW9ucywgY29tcGxldGUgcmVxdWVzdHMsIGFuZCBnZW5lcmF0ZSBjcmVhdGl2ZSB0ZXh0IGZvcm1hdHMuIE1peHRyYWwgOHg3QiBhIGhpZ2gtcXVhbGl0eSBzcGFyc2UgbWl4dHVyZSBvZiBleHBlcnRzIG1vZGVsIChTTW9FKSB3aXRoIG9wZW4gd2VpZ2h0cy48YnI+XG5UaGlzIG1vZGVsIGhhcyBiZWVuIG9wdGltaXplZCB0aHJvdWdoIHN1cGVydmlzZWQgZmluZS10dW5pbmcgYW5kIGRpcmVjdCBwcmVmZXJlbmNlIG9wdGltaXphdGlvbiAoRFBPKSBmb3IgY2FyZWZ1bCBpbnN0cnVjdGlvbiBmb2xsb3dpbmcuIE9uIE1ULUJlbmNoLCBpdCByZWFjaGVzIGEgc2NvcmUgb2YgOC4zMCwgIHdpdGggYSBwZXJmb3JtYW5jZSBjb21wYXJhYmxlIHRvIEdQVDMuNS48YnI+XG5cbkxpY2Vuc2VkIHVuZGVyIEFwYWNoZSAyLjAuIE1peHRyYWwgb3V0cGVyZm9ybXMgTGxhbWEgMiA3MEIgb24gbW9zdCBiZW5jaG1hcmtzIHdpdGggNnggZmFzdGVyIGluZmVyZW5jZS4gSW4gcGFydGljdWxhciwgaXQgbWF0Y2hlcyBvciBvdXRwZXJmb3JtcyBHUFQzLjUgb24gbW9zdCBzdGFuZGFyZCBiZW5jaG1hcmtzLjxicj5cbk1peHRyYWwgaGFzIHRoZSBmb2xsb3dpbmcgY2FwYWJpbGl0aWVzLlxuKiBJdCBncmFjZWZ1bGx5IGhhbmRsZXMgYSBjb250ZXh0IG9mIDMyayB0b2tlbnMuXG4qIEl0IGhhbmRsZXMgRW5nbGlzaCwgRnJlbmNoLCBJdGFsaWFuLCBHZXJtYW4gYW5kIFNwYW5pc2guXG4qIEl0IHNob3dzIHN0cm9uZyBwZXJmb3JtYW5jZSBpbiBjb2RlIGdlbmVyYXRpb24uXG4qIEl0IGNhbiBiZSBmaW5ldHVuZWQgaW50byBhbiBpbnN0cnVjdGlvbi1mb2xsb3dpbmcgbW9kZWwgdGhhdCBhY2hpZXZlcyBhIHNjb3JlIG9mIDguMyBvbiBNVC1CZW5jaC5cblxuVGhpcyBtb2RlbCBpcyByZWFkeSBjb21tZXJpY2lhbCB1c2UuIFxuXG4jIyMgVGhpcmQtUGFydHkgQ29tbXVuaXR5IENvbnNpZGVyYXRpb24gXG5UaGlzIG1vZGVsIGlzIG5vdCBvd25lZCBvciBkZXZlbG9wZWQgYnkgTlZJRElBLiBUaGlzIG1vZGVsIGhhcyBiZWVuIGRldmVsb3BlZCBhbmQgYnVpbHQgdG8gYSB0aGlyZC1wYXJ0eVx1MjAxOXMgcmVxdWlyZW1lbnRzIGZvciB0aGlzIGFwcGxpY2F0aW9uIGFuZCB1c2UgY2FzZTsgc2VlIGxpbmsgdG8gdGhlIFtNaXh0cmFsIDh4N0ItSW5zdHJ1Y3QtdjAuMSBNb2RlbCBDYXJkXShodHRwczovL2h1Z2dpbmdmYWNlLmNvL21pc3RyYWxhaS9NaXh0cmFsLTh4N0ItSW5zdHJ1Y3QtdjAuMSkuXG5cbiMjIyBHb3Zlcm5pbmcgVGVybXNcbl9Ob3RlOiBUaGUgTklNIGNvbnRhaW5lciBpcyBnb3Zlcm5lZCBieSB0aGUgW05WSURJQSBTb2Z0d2FyZSBMaWNlbnNlIEFncmVlbWVudF0oaHR0cHM6Ly93d3cubnZpZGlhLmNvbS9lbi11cy9hZ3JlZW1lbnRzL2VudGVycHJpc2Utc29mdHdhcmUvbnZpZGlhLXNvZnR3YXJlLWxpY2Vuc2UtYWdyZWVtZW50LykgYW5kIHRoZSBbUHJvZHVjdC1TcGVjaWZpYyBUZXJtcyBmb3IgTlZJRElBIEFJIFByb2R1Y3RzXShodHRwczovL3d3dy5udmlkaWEuY29tL2VuLXVzL2FncmVlbWVudHMvZW50ZXJwcmlzZS1zb2Z0d2FyZS9wcm9kdWN0LXNwZWNpZmljLXRlcm1zLWZvci1haS1wcm9kdWN0cy8pOyBhbmQgdGhlIHVzZSBvZiB0aGlzIG1vZGVsIGlzIGdvdmVybmVkIGJ5IHRoZSBbTlZJRElBIENvbW11bml0eSBNb2RlbCBMaWNlbnNlIEFncmVlbWVudF0oaHR0cHM6Ly93d3cubnZpZGlhLmNvbS9lbi11cy9hZ3JlZW1lbnRzL2VudGVycHJpc2Utc29mdHdhcmUvbnZpZGlhLWNvbW11bml0eS1tb2RlbHMtbGljZW5zZS8pLiBBRERJVElPTkFMIElORk9STUFUSU9OOiBbQXBhY2hlIExpY2Vuc2UgMi4wXShodHRwczovL2Nob29zZWFsaWNlbnNlLmNvbS9saWNlbnNlcy9hcGFjaGUtMi4wLykuXG5cbioqWW91IGFyZSByZXNwb25zaWJsZSBmb3IgZW5zdXJpbmcgdGhhdCB5b3VyIHVzZSBvZiBOVklESUEgQUkgRm91bmRhdGlvbiBNb2RlbHMgY29tcGxpZXMgd2l0aCBhbGwgYXBwbGljYWJsZSBsYXdzLioqXG5cbiMjIyBHZXR0aW5nIFN0YXJ0ZWRcbltRdWljayBTdGFydCBHdWlkZV0oaHR0cHM6Ly9kb2NzLm52aWRpYS5jb20vbmltL2xhcmdlLWxhbmd1YWdlLW1vZGVscy9sYXRlc3QvZ2V0dGluZy1zdGFydGVkLmh0bWwgKVxuXG4jIyMgUmVmZXJlbmNlcyhzKTpcblxuKiBbTWl4dHJhbCBvZiBleHBlcnRzIHwgTWlzdHJhbCBBSSB8IE9wZW4gc291cmNlIG1vZGVsc10oaHR0cHM6Ly9taXN0cmFsLmFpL25ld3MvbWl4dHJhbC1vZi1leHBlcnRzLykgPGJyPlxuXG4jIyBNb2RlbCBBcmNoaXRlY3R1cmVcblxuKipBcmNoaXRlY3R1cmUgVHlwZToqKiBUcmFuc2Zvcm1lciA8YnI+XG4qKk5ldHdvcmsgQXJjaGl0ZWN0dXJlOioqIF
NwYXJzZSBNaXh0dXJlIG9mIEdQVC1iYXNlZCBleHBlcnRzIDxicj5cbioqTW9kZWwgVmVyc2lvbjoqKiAwLjEgPGJyPlxuXG4jIyMgSW5wdXRcbi0gKipJbnB1dCBUeXBlOioqIFRleHRcbi0gKipJbnB1dCBGb3JtYXQ6KiogU3RyaW5nXG4tICoqSW5wdXQgUGFyYW1ldGVyczoqKiBJbnB1dCBTZXF1ZW5jZSBMZW5ndGggKFRva2VucykgPSA4MTkyIFxuXG4jIyMgT3V0cHV0XG4tICoqT3V0cHV0IFR5cGU6KiogVGV4dFxuLSAqKk91dHB1dCBGb3JtYXQ6KiogU3RyaW5nXG4tICoqT3V0cHV0IFBhcmFtZXRlcnM6KiogT3V0cHV0IFNlcXVlbmNlIExlbmd0aCAoVG9rZW5zKSA9IDgxOTIgXG5cbiMjIyBTb2Z0d2FyZSBJbnRlZ3JhdGlvbjpcblxuKipTdXBwb3J0ZWQgSGFyZHdhcmUgUGxhdGZvcm0ocyk6KiogTlZJRElBIEFtcGVyZSwgTlZJRElBIEhvcHBlciwgTlZJRElBIExvdmVsYWNlLCBOVklESUEgVHVyaW5nIDxicj5cbioqU3VwcG9ydGVkIE9wZXJhdGluZyBTeXN0ZW0ocyk6KiogTGludXggPGJyPlxuXG4jIyBUcmFpbmluZyBEYXRhXG5cblRyYWluaW5nIGFuZCB0dW5pbmcgZGF0YSBoYXMgbm90IGJlZW4gZGlzY2xvc2VkIGZvciBNaXh0cmFsLTh4N0ItSW5zdHJ1Y3QtdjAuMSBtb2RlbC5cblxuIyMgRXZhbHVhdGlvbiBEYXRhXG5PZmZpY2lhbCBldmFsdWF0aW9uIGRhdGEgaGFzIG5vdCBiZWVuIHB1Ymxpc2hlZCBmb3IgTWl4dHJhbC04eDdCLUluc3RydWN0LXYwLjEgbW9kZWwuXG5cbiMjIEluZmVyZW5jZVxuXG4qKkVuZ2luZToqKiBbVHJpdG9uXShodHRwczovL2RldmVsb3Blci5udmlkaWEuY29tL3RyaXRvbi1pbmZlcmVuY2Utc2VydmVyKSA8YnI+XG4qKlRlc3QgSGFyZHdhcmU6KiogSDEwMCwgQTEwMCA4MEdCIDxicj5cblxuIyMgR2V0IEhlbHA6IEVudGVycHJpc2UgU3VwcG9ydFxuR2V0IGFjY2VzcyB0byBrbm93bGVkZ2UgYmFzZSBhcnRpY2xlcyBhbmQgc3VwcG9ydCBjYXNlcyBvciBbc3VibWl0IGEgdGlja2V0XShodHRwczovL3d3dy5udmlkaWEuY29tL2VuLXVzL2RhdGEtY2VudGVyL3Byb2R1Y3RzL2FpLWVudGVycHJpc2Utc3VpdGUvc3VwcG9ydC8pLlxuXG4jIyBOVklESUEgQUkgRW50ZXJwcmlzZSBEb2N1bWVudGF0aW9uXG5WaXNpdCB0aGUgW05WSURJQSBBSSBFbnRlcnByaXNlIERvY3VtZW50YXRpb24gSHViXShodHRwczovL2RvY3MubnZpZGlhLmNvbS9haS1lbnRlcnByaXNlLykgZm9yIHJlbGVhc2UgZG9jdW1lbnRhdGlvbiwgZGVwbG95bWVudCBndWlkZXMgYW5kIG1vcmUuIiwKICAgICJkaXNwbGF5TmFtZSI6ICJNaXh0cmFsLTh4N0ItSW5zdHJ1Y3QtdjAuMSIsCiAgICAiZXhwbGFpbmFiaWxpdHkiOiAiIiwKICAgICJmcmFtZXdvcmsiOiAiT3RoZXIiLAogICAgImhhc1BsYXlncm91bmQiOiBmYWxzZSwKICAgICJoYXNTaWduZWRWZXJzaW9uIjogdHJ1ZSwKICAgICJpc1BsYXlncm91bmRFbmFibGVkIjogZmFsc2UsCiAgICAiaXNQdWJsaWMiOiBmYWxzZSwKICAgICJpc1JlYWRPbmx5IjogdHJ1ZSwKICAgICJsYWJlbHMiOiBbCiAgICAgICAgIkJ1bGsgQnVpbGQiLAogICAgICAgICJOU1BFQ1QtNFM4MC1OMUE3IiwKICAgICAgICAiTlNQRUNULUJFMDgtMTRNViIsCiAgICAgICAgIm1peHRyYWwtOHg3Yi1pbnN0cnVjdC12MC0xIiwKICAgICAgICAibnZhaWU6bW9kZWw6bnZhaWVfc3VwcG9ydGVkIiwKICAgICAgICAibnZpZGlhX25pbTptb2RlbDpuaW1tY3JvX252aWRpYV9uaW0iLAogICAgICAgICJwcm9kdWN0TmFtZXM6bmltLWRldiIsCiAgICAgICAgInByb2R1Y3ROYW1lczpudi1haS1lbnRlcnByaXNlIgogICAgXSwKICAgICJsYXRlc3RWZXJzaW9uSWRTdHIiOiAiaGYtYTYwODMyYy0wNTA4LXRvb2wtdXNlLXYyIiwKICAgICJsYXRlc3RWZXJzaW9uU2l6ZUluQnl0ZXMiOiA5MzQwODExNDI2OCwKICAgICJsb2dvIjogImh0dHBzOi8vYXNzZXRzLm5nYy5udmlkaWEuY29tL3Byb2R1Y3RzL2FwaS1jYXRhbG9nL2ltYWdlcy9taXh0cmFsLTh4N2ItaW5zdHJ1Y3QuanBnIiwKICAgICJtb2RlbEZvcm1hdCI6ICJOL0EiLAogICAgIm5hbWUiOiAibWl4dHJhbC04eDdiLWluc3RydWN0LXYwMSIsCiAgICAib3JnTmFtZSI6ICJuaW0iLAogICAgInByZWNpc2lvbiI6ICJOL0EiLAogICAgInByaXZhY3kiOiAiIiwKICAgICJwcm9kdWN0TmFtZXMiOiBbCiAgICAgICAgIm5pbS1kZXYiLAogICAgICAgICJudi1haS1lbnRlcnByaXNlIgogICAgXSwKICAgICJwdWJsaWNEYXRhc2V0VXNlZCI6IHt9LAogICAgInB1Ymxpc2hlciI6ICJNaXN0cmFsIEFJIiwKICAgICJzYWZldHlBbmRTZWN1cml0eSI6ICIiLAogICAgInNob3J0RGVzY3JpcHRpb24iOiAiTWl4dHJhbCA4eDdCIEluc3RydWN0IGlzIGEgbGFuZ3VhZ2UgbW9kZWwgdGhhdCBjYW4gZm9sbG93IGluc3RydWN0aW9ucywgY29tcGxldGUgcmVxdWVzdHMsIGFuZCBnZW5lcmF0ZSBjcmVhdGl2ZSB0ZXh0IGZvcm1hdHMuIiwKICAgICJ0ZWFtTmFtZSI6ICJtaXN0cmFsYWkiLAogICAgInVwZGF0ZWREYXRlIjogIjIwMjUtMDYtMDNUMTc6MzM6MzMuMDMyWiIKfQ== displayName: Mixtral 8x7B Instruct source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/mistralai/containers/mixtral-8x7b-instruct-v01 optimizationProfiles: - profileId: 
nim/mistralai/mixtral-8x7b-instruct-v01:a100x2-throughput-bf16-s69xvudfza framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 A100x2 BF16 Throughput ngcMetadata: 0db3b5e8468c9debf30bcf41cbfea084adc59000885efd6fdcb3bbb902651bd6: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h100x2-throughput-bf16-zwhl2fsi5a framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100x2 BF16 Throughput ngcMetadata: 1617d074ce252f66e96d5f0e331fa5c6cc0a0330519e56b5c66c60eb7d7bf4f9: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100_NVLx4 BF16 Latency ngcMetadata: 28552abdb2c491d46065d52ca1dc1265b99ba95a5bf8daaee4c5de12511a3b4f: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 4 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:l40sx4-throughput-fp8-hbavqk65yw framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 L40Sx4 FP8 Throughput ngcMetadata: 3d0e5989f2fbc23e7d4504cd69269c9636deb61d0efc12225d3d59d54afea297: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: fp8 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 45GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h200x1-throughput-bf16-00qqbltmrg framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H200x1 BF16 Throughput ngcMetadata: 434e8d336fa23cbe151748d32b71e196d69f20d319ee8b59852a1ca31a48d311: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE 
value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:b200x2-latency-fp8-pwnesuqgxg framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 B200x2 FP8 Latency ngcMetadata: 4950d30811e1e426e97cda69e6c03a8a4819db8aa4abf34722ced4542a1f6b52: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 44GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100_NVLx1 FP8 Throughput ngcMetadata: 5811750e70b7e9f340f4d670c72fcbd5282e254aeb31f62fd4f937cfb9361007: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h200x2-latency-bf16-uh6awyzpta framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H200x2 BF16 Latency ngcMetadata: 6832a9395f54086162fd7b1c6cfaae17c7d1e535a60e2b7675504c9fc7b57689: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 A100_SXM4_40GBx4 BF16 Throughput ngcMetadata: 6c29727e6e3d48a900c348c1fab181dc40bc926be07b06ca5b8eae42a6bc9901: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 4 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h100x2-latency-fp8-4-l0a-rlkq framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100x2 FP8 Latency ngcMetadata: 6c3f01dd2b2a56e3e83f70522e4195d3f2add70b28680082204bbb9d6150eb04: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - 
key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 44GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h100x4-latency-bf16-axe5ogfgvq framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100x4 BF16 Latency ngcMetadata: 73f41fabbb60beb5b05ab21c8dcce5c277d99bcabec31abf46a0194d0dd18d04: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h100x1-throughput-fp8-j1x74k--ng framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100x1 FP8 Throughput ngcMetadata: 7b508014e846234db3cabe5c9f38568b4ee96694b60600a0b71c621dc70cacf3: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 44GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:l40sx4-latency-bf16-qavtgypi5w framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 L40Sx4 BF16 Latency ngcMetadata: 844ebe2b42df8de8ce66cbb6ecf43f90858ea7efc14ddf020cf1ae7450ae0c33: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 A100_SXM4_40GBx8 BF16 Latency ngcMetadata: 8a446393aaeb0065ee584748c7c03522389921a11ff2bd8cb5800e06a8644eb0: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 8 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:b200x1-throughput-fp8-ult1akfaqa framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 B200x1 FP8 Throughput ngcMetadata: 8b87146e39b0305ae1d73bc053564d1b4b4c565f81aa5abe3e84385544ca9b60: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de 
llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 44GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:a10gx8-throughput-bf16-jxekvgjfha framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 A10Gx8 BF16 Throughput ngcMetadata: 935ec3ac922bf54106311dfc6b3214a1651a26033b4f5007b6351fffb4058b7a: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '8' pp: '1' precision: bf16 profile: throughput tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 8 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 90GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100_NVLx2 FP8 Latency ngcMetadata: a00ce1e782317cd19ed192dcb0ce26ab8b0c1da8928c33de8893897888ff7580: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:b200x1-throughput-bf16-ftwmzofxbq framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 B200x1 BF16 Throughput ngcMetadata: a4c63a91bccf635b570ddb6d14eeb6e7d0acb2389712892b08d21fad2ceaee38: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:l40sx4-throughput-bf16-d9jierrahq framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 L40Sx4 BF16 Throughput ngcMetadata: ab8f2faec3bcafc32efaf05acada4df4d8a171a759b4fb5c44d2d9d43a348764: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H100_NVLx2 BF16 Throughput ngcMetadata: 
acd73fcee9d91ada305118080138fb3ca4d255adee3312acda38c4487daae476: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h200x1-throughput-fp8-skjppy5-iw framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H200x1 FP8 Throughput ngcMetadata: af876a179190d1832143f8b4f4a71f640f3df07b0503259cedee3e3a8363aa96: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 44GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:l40sx4-latency-fp8-vmkcxgu3fw framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 L40Sx4 FP8 Latency ngcMetadata: bdd0d3cd53fad1130259beea81ab5711fb98f2f1a020b5b26c3c82fd7d43c5af: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 45GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:a100x4-latency-bf16-yphkz2bivw framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 A100x4 BF16 Latency ngcMetadata: d73b7cf2f719d720329fc65fc255ae901bc3beebdc59be9815ede1a07948c1f7: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 4 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:h200x2-latency-fp8-skwo6uxqkq framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 H200x2 FP8 Latency ngcMetadata: e4f217a5fb016b570e34b8a8eb06051ccfef9534ba43da973bb7f678242eaa5f: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 44GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:b200x2-latency-bf16-qkpte3pb7w 
framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 B200x2 BF16 Latency ngcMetadata: f44768c625db71a327cf17e750d5e1a8e60171a8d8ef6b4c1c4b57fe74c9bf46: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 88GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 Generic NVIDIA GPUx8 BF16 ngcMetadata: 1d7b604f835f74791e6bfd843047fc00a5aef0f72954ca48ce963811fb6f3f09: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '8' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 8 - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 Generic NVIDIA GPUx2 BF16 ngcMetadata: 375dc0ff86133c2a423fbe9ef46d8fdf12d6403b3caa3b8e70d7851a89fc90dd: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '2' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 2 - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' - profileId: nim/mistralai/mixtral-8x7b-instruct-v01:hf-a60832c-0508-tool-use-v2 framework: TensorRT-LLM displayName: Mixtral 8x7b Instruct V0.1 Generic NVIDIA GPUx4 BF16 ngcMetadata: 54946b08b79ecf9e7f2d5c000234bf2cce19c8fee21b243c1a084b03897e8c95: model: mistralai/mixtral-8x7b-instruct-v0.1 release: 1.8.4 tags: feat_lora: 'false' llm_engine: tensorrt_llm pp: '1' precision: bf16 tp: '4' trtllm_buildable: 'true' modelFormat: trt-llm spec: - key: PRECISION value: BF16 - key: COUNT value: 4 - key: NIM VERSION value: 1.8.4 - key: DOWNLOAD SIZE value: 87GB - key: LLM ENGINE value: TENSORRT_LLM - key: TRTLLM BUILDABLE value: 'TRUE' labels: - Mistral - Instruct - Large Language Model - TensorRT-LLM - Language Generation - NeMo - NVIDIA Validated config: architectures: - Other modelType: mistral license: NVIDIA AI Foundation Models Community License - name: Deepseek R1 Distill Llama displayName: Deepseek R1 Distill Llama modelHubID: deepseek-r1-distill-llama category: Chat Assistant type: NGC description: The DeepSeek-R1-Distill-Llama-70B NIM simplifies the deployment of a distilled version of the DeepSeek-R1 series, built upon the Llama3.3-70B-Instruct architecture. This model is designed to deliver efficient performance for reasoning, math, and code tasks while maintaining high accuracy. By distilling knowledge from the larger DeepSeek-R1 model, it provides state-of-the-art performance with reduced computational requirements. 
requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3/license/ modelVariants: - variantId: Deepseek R1 Distill Llama 70b modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-02-07T00:32:58.270Z",
    "description": "**Model Overview**\n\n## Description:\n\nDeepSeek-R1-Distill-Llama-70B is a distilled version of the DeepSeek-R1 series, built upon the Llama3.3-70B-Instruct architecture. This model is designed to deliver efficient performance for reasoning, math, and code tasks while maintaining high accuracy. By distilling knowledge from the larger DeepSeek-R1 model, it provides state-of-the-art performance with reduced computational requirements.\n\nThis model is ready for both research and commercial use.\nFor more details, visit the [DeepSeek website](https://www.deepseek.com/).\n\n## Third-Party Community Consideration\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party\u2019s requirements for this application and use case; see link to Non-NVIDIA [DeepSeek-R1 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B).\n\n### License/Terms of Use\n\nGOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Additional Information: [MIT License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md); [Meta Llama 3.3 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE). Built with Llama.\n\n## References:\n\n- [DeepSeek GitHub Repository](https://github.com/deepseek-ai/DeepSeek-V3)\n- [DeepSeek-R1 Paper](https://arxiv.org/abs/2501.12948)\n- [Hugging Face Model Card for DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)\n\n## Model Architecture:\n\n**Architecture Type:** Distilled version of Mixture of Experts (MoE) <br>\n**Base Model:** Llama3.3-70B-Instruct\n\n## Input:\n\n**Input Type(s):** Text <br>\n**Input Format(s):** String <br>\n**Input Parameters:** (1D) <br>\n**Other Properties Related to Input:** <br>\nDeepSeek recommends adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:\n\n1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.\n2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**\n3. For mathematical problems, it is advisable to include a directive in your prompt such as: \"Please reason step by step, and put your final answer within \\boxed{}.\"\n4. 
When evaluating model performance, it is recommended to conduct multiple tests and average the results.\n\n## Output:\n\n**Output Type(s):** Text <br>\n**Output Format:** String <br>\n**Output Parameters:** (1D) <br>\n\n## Software Integration:\n\n**Runtime Engine(s):** TensorRT-LLM <br>\n**Supported Hardware Microarchitecture Compatibility:** NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, and NVIDIA Volta architectures <br>\n**[Preferred/Supported] Operating System(s):** Linux\n\n## Model Version(s):\n\nDeepSeek-R1-Distill-Llama-70B\n\n# Training, Testing, and Evaluation Datasets:\n\n## Training Dataset:\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated <br>\n**Labeling Method by dataset:** Hybrid: Human, Automated <br>\n\n## Testing Dataset:\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated <br>\n**Labeling Method by dataset:** Hybrid: Human, Automated <br>\n\n## Evaluation Dataset:\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated <br>\n**Labeling Method by dataset:** Hybrid: Human, Automated <br>\n\n## Inference:\n\n**Engine:** TensorRT-LLM <br>\n**Test Hardware:** NVIDIA Hopper\n\n## Ethical Considerations:\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n## Model Limitations:\nThe base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**",
    "displayName": "DeepSeek-R1-Distill-Llama-70B",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-RAHW-5L0X",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "l40sx4-throughput-fp8-46u3lvp6ja",
    "latestVersionSizeInBytes": 73623864095,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/deepseek-r1-distill-llama-70b.jpg",
    "modelFormat": "N/A",
    "name": "deepseek-r1-distill-llama-70b",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "DeepSeek-AI",
    "safetyAndSecurity": "",
    "shortDescription": "DeepSeek-R1-Distill-Llama-70B is a distilled version of the DeepSeek-R1 series, built upon the Llama3.3-70B-Instruct architecture. This model is designed to deliver efficient performance for reasoning, math, and code tasks while maintaining high accuracy",
    "teamName": "deepseek-ai",
    "updatedDate": "2025-02-28T02:15:12.044Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/deepseek-ai/containers/deepseek-r1-distill-llama-70b optimizationProfiles: - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-70b:l40sx4-throughput-fp8-46u3lvp6ja framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 70B L40Sx4 FP8 Throughput ngcMetadata: 23c28e4a1ad4d963c1504f1a33b45afb65bf61b64b20be1a8ea2c8816ea0fc36: model: deepseek-r1-distill-llama-70b release: 1.5.2 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: fp8 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 4 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 69GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-70b:h100x4-latency-fp8-k5tlofelyw framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 70B H100x4 FP8 Latency ngcMetadata: 4696d5c5b44b13bb5e864affcdcfa30ad229390285476315d9921fd0828bda5b: model: deepseek-r1-distill-llama-70b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 69GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-70b:h100x8-latency-fp8-xz3eymtuzq framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 70B H100x8 FP8 Latency ngcMetadata: 91f2b7c9e719c0c380ba6c1d6c3e5cad61aaf807730de88fa3b6233a39edeeaa: model: deepseek-r1-distill-llama-70b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '8' pp: '1' precision: fp8 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 8 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 70GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-70b:h100x2-throughput-fp8-8cx2penaia framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 70B H100x2 FP8 Throughput ngcMetadata: da94a5c34cf665e85813fa49f321f1e87ca12317722b5e65628cf3ed0371897b: model: deepseek-r1-distill-llama-70b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 69GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-70b:h100x4-throughput-bf16-g31fj2uvrw framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 70B H100x4 BF16 Throughput ngcMetadata: e6b8fb8c4c76343b05b9051974593e5bd9110a868770d52e8eb0fe5a3b46dd67: model: deepseek-r1-distill-llama-70b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: throughput tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 4 - key: GPU DEVICE value: 2330:10DE - key: NIM 
VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 138GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-70b:h100x8-latency-bf16-v8q6jmcd9g framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 70B H100x8 BF16 Latency ngcMetadata: f87605b6d8cfc0ca39fad21b4ec580219f3a3be42884d2c7caad9b8ae4b3c1c7: model: deepseek-r1-distill-llama-70b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '8' pp: '1' precision: bf16 profile: latency tp: '8' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 8 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 147GB - variantId: Deepseek R1 Distill Llama 8b modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-02-07T00:36:21.255Z",
    "description": "**Model Overview**\n\n## Description:\n\nDeepSeek-R1-Distill-Llama-8B is a distilled version of the DeepSeek-R1 series, built upon the Llama3.1-8B-Instruct architecture. This model is designed to deliver efficient performance for reasoning, math, and code tasks while maintaining high accuracy. By distilling knowledge from the larger DeepSeek-R1 model, it provides state-of-the-art performance with reduced computational requirements.\n\nThis model is ready for both research and commercial use.\nFor more details, visit the [DeepSeek website](https://www.deepseek.com/).\n\n## Third-Party Community Consideration\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party\u2019s requirements for this application and use case; see link to Non-NVIDIA [DeepSeek-R1 Model Card](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B).\n\n### License/Terms of Use\n\nGOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Additional Information: [MIT License](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/mit.md); [Meta Llama 3.3 Community License Agreement](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct/blob/main/LICENSE). Built with Llama.\n\n## References:\n\n- [DeepSeek GitHub Repository](https://github.com/deepseek-ai/DeepSeek-V3)\n- [DeepSeek-R1 Paper](https://arxiv.org/abs/2501.12948)\n- [Hugging Face Model Card for DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)\n\n## Model Architecture:\n\n**Architecture Type:** Distilled version of Mixture of Experts (MoE) <br>\n**Base Model:** Llama3.1-8B-Instruct\n\n## Input:\n\n**Input Type(s):** Text <br>\n**Input Format(s):** String <br>\n**Input Parameters:** (1D) <br>\n**Other Properties Related to Input:** <br>\nDeepSeek recommends adhering to the following configurations when utilizing the DeepSeek-R1 series models, including benchmarking, to achieve the expected performance:\n\n1. Set the temperature within the range of 0.5-0.7 (0.6 is recommended) to prevent endless repetitions or incoherent outputs.\n2. **Avoid adding a system prompt; all instructions should be contained within the user prompt.**\n3. For mathematical problems, it is advisable to include a directive in your prompt such as: \"Please reason step by step, and put your final answer within \\boxed{}.\"\n4. 
When evaluating model performance, it is recommended to conduct multiple tests and average the results.\n\n## Output:\n\n**Output Type(s):** Text <br>\n**Output Format:** String <br>\n**Output Parameters:** (1D) <br>\n\n## Software Integration:\n\n**Runtime Engine(s):** TensorRT-LLM <br>\n**Supported Hardware Microarchitecture Compatibility:** NVIDIA Ampere, NVIDIA Blackwell, NVIDIA Jetson, NVIDIA Hopper, NVIDIA Lovelace, NVIDIA Pascal, NVIDIA Turing, and NVIDIA Volta architectures <br>\n**[Preferred/Supported] Operating System(s):** Linux\n\n## Model Version(s):\n\nDeepSeek-R1-Distill-Llama-8B\n\n# Training, Testing, and Evaluation Datasets:\n\n## Training Dataset:\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated <br>\n**Labeling Method by dataset:** Hybrid: Human, Automated <br>\n\n## Testing Dataset:\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated <br>\n**Labeling Method by dataset:** Hybrid: Human, Automated <br>\n\n## Evaluation Dataset:\n\n**Data Collection Method by dataset:** Hybrid: Human, Automated <br>\n**Labeling Method by dataset:** Hybrid: Human, Automated <br>\n\n## Inference:\n\n**Engine:** TensorRT-LLM <br>\n**Test Hardware:** NVIDIA Hopper\n\n## Ethical Considerations:\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n## Model Limitations:\nThe base model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The model may generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**",
    "displayName": "DeepSeek-R1-Distill-Llama-8B",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-H31C-POBP",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "runtime_params_rtx_MSL16384",
    "latestVersionSizeInBytes": 176,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/deepseek-r1-distill-llama-70b.jpg",
    "modelFormat": "N/A",
    "name": "deepseek-r1-distill-llama-8b",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "DeepSeek-AI",
    "safetyAndSecurity": "",
    "shortDescription": "DeepSeek-R1-Distill-Llama-8B is a distilled version of the DeepSeek-R1 series, built upon the Llama3.1-8B-Instruct architecture. This model is designed to deliver efficient performance for reasoning, math, and code tasks while maintaining high accuracy",
    "teamName": "deepseek-ai",
    "updatedDate": "2025-02-26T20:50:06.949Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/deepseek-ai/containers/deepseek-r1-distill-llama-8b optimizationProfiles: - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:l40sx1-throughput-fp8-vbqc0btoqg framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B L40Sx1 FP8 Throughput ngcMetadata: d968c663c710e56275088096bc0dcf823560aaf7dca910bfcb41f5056063ab02: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:h100x1-throughput-fp8-d9grrq-lka framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B H100x1 FP8 Throughput ngcMetadata: 0bdec027404c16d6ca96e159079082f9630a24a277ff519d0c8fea71007222ec: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:h100x2-latency-bf16-7ztok5r0dg framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B H100x2 BF16 Latency ngcMetadata: 0ce355335e6c3aec54e49ab53822e628fa1227091d0326da962bcc4f95b5f602: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:a10gx4-latency-bf16-aiejrysrlw framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B A10Gx4 BF16 Latency ngcMetadata: 1dfac8e12042573dc93536a393902478e1a6a46d1cd742cf0a4251c11f77e253: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '4' pp: '1' precision: bf16 profile: latency tp: '4' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 4 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 19GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:l40sx2-latency-fp8-fmuoxfbb0q framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B L40Sx2 FP8 Latency ngcMetadata: c2d4efce2d553c3aa78109b6d5dff0fd34b86bbb3b765aa8afdf12e9d13e8e83: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU 
DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:h100x1-throughput-bf16-4jcstzx27q framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B H100x1 BF16 Throughput ngcMetadata: 4f6dba657c08280bdb419cbc1c60d265e82731b807ee2ae3c111cb9a91571aa1: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:h100x2-latency-fp8-q8xwzp22aa framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B H100x2 FP8 Latency ngcMetadata: 518edac01f731b63676743a1860fe21861d1399b19cb2e584de3d9a6a3ea6d8e: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 9GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:l40sx1-throughput-bf16-yvbnwvfzew framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B L40Sx1 BF16 Throughput ngcMetadata: 9bc8e8aa12847674fa2840b9c03cbdb0246d7f144a5257510fd53eacc2a9d62f: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:a100x1-throughput-bf16-iq9maz9nkw framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B A100x1 BF16 Throughput ngcMetadata: c959aa89b69ad9295ccc99a34546819d16bb0e2566a6cfed0985eecf37bcc14b: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 16GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:l40sx2-latency-bf16-tlmx3sgrdw framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B L40Sx2 BF16 Latency ngcMetadata: 20d6bb61a1ee5160c0baed3721f8b580525a0aaaaa3b1333e9a882d4c61b1ed7: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - 
key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 17GB - profileId: nim/deepseek-ai/deepseek-r1-distill-llama-8b:a10gx2-throughput-bf16-uv8ptkf8-g framework: TensorRT-LLM displayName: Deepseek R1 Distill Llama 8B A10Gx2 BF16 Throughput ngcMetadata: edbb37d3ef94a5cc38919ab86694b835307c0668ca6d41ea746796b34ced78f1: model: deepseek-ai/deepseek-r1-distill-llama-8b release: 1.5.2 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: throughput tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 2 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.5.2 - key: DOWNLOAD SIZE value: 17GB labels: - Deepseek - Distill - Llama - Meta - Chat - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Llama 3.2 Instruct displayName: Llama 3.2 Instruct modelHubID: llama-3.2-instruct category: Commercial and Research type: NGC description: The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3_2/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3_2/license/ modelVariants: - variantId: Llama 3.2 1B Instruct modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-02-25T17:55:56.040Z",
    "description": "## Model Information\n\nThe Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. \n\nLlama 3.2 models are ready for commercial use.\n\nModels are accelerated by [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a library for optimizing Large Language Model (LLM) inference on NVIDIA GPUs.\n\n**Models in this Collection:**  \n- Llama-3.2-1B\n- Llama-3.2-1B-Instruct\n- Llama-3.2-3B\n- Llama-3.2-3B-Instruct\n\n**Model Developer:** Meta\n\n**Model Version:** 3.2\n\n**Model Release Date:** September 25, 2024\n\n**Third-Party Community Consideration:**\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party\u2019s requirements for this application and use case; see link to Non-NVIDIA [Llama 3.2 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md).\n\n**License:** Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).\n\n**Model Architecture:** Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety. \n\n|  | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |\n| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |\n| Llama 3.2 (text only)  | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code  | 128k | Yes | Yes | Up to 9T tokens | December 2023 |\n|  |  | 3B (3.21B) | Multilingual Text | Multilingual Text and code  |  |  |  |  |  |\n\n**Supported Languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\n**Llama 3.2 Model Family:** Token counts refer to pre-training data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\n**Status:** This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. \n\n**Feedback:** Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama-models/tree/main/models/llama3_2). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https://github.com/meta-llama/llama-recipes). 
\n\n## Intended Use\n\n**Intended Use Cases:** Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pre-trained models can be adapted for a variety of additional natural language generation tasks. \n\n**Out of Scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Software Integration\n\n**Supported Hardware Microarchitecture Compatibility:**\n- NVIDIA Ampere\n- NVIDIA Hopper\n- NVIDIA Lovelace\n- NVIDIA Jetson\n\n**Supported Operating System(s):**\n- Linux \n- Windows\n\n## Hardware and Software\n\n**Training Factors:** We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pre-training. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.\n\n**Training Energy Use:** Training utilized a cumulative of **916k** GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. \n\n## \n\n**Training Greenhouse Gas Emissions:** Estimated total location-based greenhouse gas emissions were **240** tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.\n\n|  | Training Time (GPU hours) | Logit Generation Time (GPU Hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |\n| :---- | :---: | ----- | :---: | :---: | :---: |\n| Llama-3.2-1B | 370k | \\- | 700 | 107 | 0 |\n| Llama-3.2-3B | 460k | \\- | 700 | 133 | 0 |\n| Total | 830k |         86k |  | 240 | 0 |\n\nThe methodology used to determine training energy use and greenhouse gas emissions can be found [here](https://arxiv.org/pdf/2204.05149). Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.\n\n## Training Data\n\n**Data Collection Method:** Unknown  \n**Labeling Method:** Unknown\n\n**Overview:** Llama 3.2 was pre-trained on up to 9 trillion tokens of data from publicly available sources. For the 1B and 3B Llama 3.2 models, we incorporated logits from the Llama 3.1 8B and 70B models into the pre-training stage of the model development, where outputs (logits) from these larger models were used as token-level targets. Knowledge distillation was used after pruning to recover performance. In post-training we used a similar recipe as Llama 3.1 and produced final chat models by doing several rounds of alignment on top of the pre-trained model. 
Each round involved Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO).\n\n**Data Freshness:** The pre-training data has a cutoff of December 2023.\n\n## Benchmarks \\- English Text\n\nIn this section, we report the results for Llama 3.2 models on standard automatic benchmarks. For all these evaluations, we used our internal evaluations library. \n\n### Base Pre-trained Models \n\n| Category | Benchmark | \\# Shots | Metric | Llama-3.2-1B | Llama-3.2-3B | Llama-3.1-8B |\n| ----- | ----- | :---: | :---: | :---: | :---: | :---: |\n| General | MMLU | 5 | macro\\_avg/acc\\_char | 32.2 | 58 | 66.7 |\n|  | AGIEval English | 3-5 | average/acc\\_char | 23.3 | 39.2 | 47.8 |\n|  | ARC-Challenge | 25 | acc\\_char | 32.8 | 69.1 | 79.7 |\n| Reading comprehension | SQuAD | 1 | em | 49.2 | 67.7 | 77 |\n|  | QuAC (F1) | 1 | f1 | 37.9 | 42.9 | 44.9 |\n|  | DROP (F1) | 3 | f1 | 28.0 | 45.2 | 59.5 |\n| Long Context | Needle in Haystack | 0 | em | 96.8 | 1 | 1 |\n\n### Instruction-Tuned Models\n\n| Capability |  | Benchmark | \\# Shots | Metric | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Llama-3.1-8B-Instruct |\n| :---: | ----- | :---: | :---: | :---: | :---: | :---: | :---: |\n| General |  | MMLU | 5 | macro\\_avg/acc | 49.3 | 63.4 | 69.4 |\n| Re-writing |  | Open-rewrite eval | 0 | micro\\_avg/rougeL | 41.6 | 40.1 | 40.9 |\n| Summarization |  | TLDR9+ (test) | 1 | rougeL | 16.8 | 19.0 | 17.2 |\n| Instruction following |  | IFEval | 0 | Avg(Prompt/Instruction acc Loose/Strict) | 59.5 | 77.4 | 80.4 |\n| Math |  | GSM8K (CoT) | 8 | em\\_maj1@1 | 44.4 | 77.7 | 84.5 |\n|  |  | MATH (CoT) | 0 | final\\_em | 30.6 | 48.0 | 51.9 |\n| Reasoning |  | ARC-C | 0 | acc | 59.4 | 78.6 | 83.4 |\n|  |  | GPQA | 0 | acc | 27.2 | 32.8 | 32.8 |\n|  |  | Hellaswag | 0 | acc | 41.2 | 69.8 | 78.7 |\n| Tool Use |  | BFCL V2 | 0 | acc | 25.7 | 67.0 | 67.1 |\n|  |  | Nexus | 0 | macro\\_avg/acc | 13.5 | 34.3 | 38.5 |\n| Long Context |  | InfiniteBench/En.QA | 0 | longbook\\_qa/f1 | 20.3 | 19.8 | 27.3 |\n|  |  | InfiniteBench/En.MC | 0 | longbook\\_choice/acc | 38.0 | 63.3 | 72.2 |\n|  |  | NIH/Multi-needle | 0 | recall | 75.0 | 84.7 | 98.8 |\n| Multilingual |  | MGSM (CoT) | 0 | em | 24.5 | 58.2 | 68.9 |\n\n### Multilingual Benchmarks\n\n| Category | Benchmark | Language | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Llama-3.1-8B-Instruct |\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| General | MMLU (5-shot, macro\\_avg/acc) | Portuguese | 39.82 | 54.48 | 62.12 |\n|  |  | Spanish | 41.52 | 55.09 | 62.45 |\n|  |  | Italian | 39.79 | 53.77 | 61.63 |\n|  |  | German | 39.20 | 53.29 | 60.59 |\n|  |  | French | 40.47 | 54.59 | 62.34 |\n|  |  | Hindi | 33.51 | 43.31 | 50.88 |\n|  |  | Thai | 34.67 | 44.54 | 50.32 |\n\n\n## Responsibility & Safety\n\nAs part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks:\n\n1. Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama   \n2. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm  \n3. Provide protections for the community to help prevent the misuse of our models\n\n### Responsible Deployment \n\n**Approach:** Llama is a foundational technology designed to be used in a variety of use cases. 
Examples on how Meta\u2019s Llama models have been responsibly deployed can be found in our [Community Stories webpage](https://llama.meta.com/community-stories/). Our approach is to build the most helpful models, enabling the world to benefit from the technology power, by aligning our model safety for generic use cases and addressing a standard set of harms. Developers are then in the driver\u2019s seat to tailor safety for their use cases, defining their own policies and deploying the models with the necessary safeguards in their Llama systems. Llama 3.2 was developed following the best practices outlined in our [Responsible Use Guide](https://llama.meta.com/responsible-use-guide/). \n\n#### Llama 3.2 Instruct \n\n**Objective:** Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. We implemented the same set of safety mitigations as in Llama 3, and you can learn more about these in the Llama 3 [paper](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/). \n\n**Fine-Tuning Data:** We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We\u2019ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. \n\n**Refusals and Tone:** Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. \n\n#### Llama 3.2 Systems\n\n**Safety as a System:** Large language models, including Llama 3.2, **are not designed to be deployed in isolation** but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with [safeguards](https://llama.meta.com/trust-and-safety/) that developers should deploy with Llama models or other LLMs, including Llama Guard, Prompt Guard and Code Shield. All our [reference implementations](https://github.com/meta-llama/llama-agentic-system) demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. \n\n### New Capabilities and Use Cases\n\n**Technological Advancement:** Llama releases usually introduce new capabilities that require specific considerations in addition to the best practices that generally apply across all Generative AI use cases. For prior release capabilities also supported by Llama 3.2, see [Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md), as the same considerations apply here as well.\n\n**Constrained Environments:** Llama 3.2 1B and 3B models are expected to be deployed in highly constrained environments, such as mobile devices. 
LLM Systems using smaller models will have a different alignment profile and safety/helpfulness tradeoff than more complex, larger systems. Developers should ensure the safety of their system meets the requirements of their use case. We recommend using lighter system safeguards for such use cases, like Llama Guard 3-1B or its mobile-optimized version. \n\n### Evaluations\n\n**Scaled Evaluations:** We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Purple Llama safeguards to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case.\n\n**Red Teaming:** We conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets.\n\n### Critical Risks \n\nIn addition to our safety work above, we took extra care on measuring and/or mitigating the following critical risk areas:\n\n**1\\. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive Weapons):** Llama 3.2 1B and 3B models are smaller and less capable derivatives of Llama 3.1. For Llama 3.1 70B and 405B, to assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons and have determined that such testing also applies to the smaller 1B and 3B models. \n\n**2\\. Child Safety:** Child Safety risk assessments were conducted using a team of experts, to assess the model\u2019s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences. \n\n**3\\. Cyber Attacks:** Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. 
The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.\n\n### Community \n\n**Industry Partnerships:** Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our [Github repository](https://github.com/meta-llama/PurpleLlama).\n\n**Grants:** We also set up the [Llama Impact Grants](https://llama.meta.com/llama-impact-grants/) program to identify and support the most compelling applications of Meta\u2019s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found [here](https://llama.meta.com/llama-impact-grants/#finalists). \n\n**Reporting:** Finally, we put in place a set of resources including an [output reporting mechanism](https://developers.facebook.com/llama_output_feedback) and [bug bounty program](https://www.facebook.com/whitehat) to continuously improve the Llama technology with the help of the community.\n\n## Ethical Considerations and Limitations\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n**Values:** The core values of Llama 3.2 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.2 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress. \n\n**Testing:** Llama 3.2 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.2\u2019s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.2 models, developers should perform safety testing and tuning tailored to their specific applications of the model. 
Please refer to available resources including our [Responsible Use Guide](https://llama.meta.com/responsible-use-guide), [Trust and Safety](https://llama.meta.com/trust-and-safety/) solutions, and other [resources](https://llama.meta.com/docs/get-started/) to learn more about responsible development.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**",
    "displayName": "Llama-3.2-1B-Instruct",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-GLV0-62BM",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx1-throughput-lora-fp8-u8hhoocl0a",
    "latestVersionSizeInBytes": 2119603737,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama-3_2-1b-instruct.jpg",
    "modelFormat": "N/A",
    "name": "llama-3.2-1b-instruct",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "Meta",
    "safetyAndSecurity": "",
    "shortDescription": "The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).",
    "teamName": "meta",
    "updatedDate": "2025-08-18T16:49:35.284Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3_2-1b-instruct optimizationProfiles: - profileId: nim/meta/llama-3.2-1b-instruct:b200x1-throughput-bf16-olbx5u2wza framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct B200x1 BF16 Throughput ngcMetadata: 00974a79b608dd9dc2e302879e71708692c9c6304f5905eb4da7d661dadd6ec2: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3190e2309f20a42f1888add452d98f204147634d83e7e5a7bbb401f9e898de2e number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h100x1-throughput-bf16-coy0mruniw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100x1 BF16 Throughput ngcMetadata: 023958aa70e985eb0a0d25c60d7a03732ad5ee7d4f9ac2ebcce17397b172b58c: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 4ba0b194b524b5d78bfa90c76ad9789b54069996b45beca9ce05762a295d871a number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h200x1-throughput-fp8-q4ene2avnw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200x1 FP8 Throughput ngcMetadata: 0b900e8d26b11d548f74a903739434bf00fc990439a9245042e344d253481719: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: b771a0a1bf21ee92364a0f1c9db64628d74919517edf09f47b079aab90af963e number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 2GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:a10gx1-throughput-bf16-wmuh1shq9q framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct A10Gx1 BF16 Throughput ngcMetadata: 0f7eb9e9a9b4470a7b5b6e93b806ad27ff49b1a94c30aa2986ffaf281f6e8d1f: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e28f4aa923af93efab6e6c14dceae117980f3f805e47f871464af69ea1457946 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 1 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct RTX6000_BLACKWELL_SVx1 BF16 Latency ngcMetadata: 
0f87e0f30087419b3a4a74d7902753a6daee998e59c0676d412fefe141f62ffe: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 6582d9bebeaaa8b1cc21b6a10cdde0daf92a198e7f9950b21908a77a90d47c3e number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100_NVLx1 BF16 Throughput ngcMetadata: 0f94ccdaf02fa00a986ba3b2b8ff0351ffa73fe262176e89830445ad81b6bfbc: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 92f394e8f167dca76b9c8eb40b8a09edd896b6fd6ec126ba5609a9c90cc21f59 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GH200_144GBx1 FP8 Latency ngcMetadata: 129db5959331b4c24cae55957a8bef7cce73fcc7571001fe18556c9b691db5d8: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 11d5467289919d18be03dbdd3236e1d2b1fdf81681b52167fadc2af453e8f6ea number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:l40sx1-throughput-bf16-mr-zfjdk9w framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct L40Sx1 BF16 Throughput ngcMetadata: 13d24e5a873aea5df261998c94710c6d00b59074f8389143d94a370762569bf8: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 72a4b35de823e2ba47bc9bac68b3704d0a9eae3db2037458d70d813809c6af78 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct RTX6000_BLACKWELL_SVx1 FP8 Latency ngcMetadata: 14389e34e76649cff246559bc0374718143cb5ac1286f7a53f6e0314c70b004b: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 2493505e4182b1596bf60600e22bae9fd94056b3988e591bceace38117523d26 
number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct RTX6000_BLACKWELL_SVx1 BF16 Throughput ngcMetadata: 1d76561dbe108226813651f3fd70416295040612f0cf3c36fd330fc388d9ef60: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 8d4e43ad080af609d17f7c559a838f1e46da4a990bfbee068d540601847951e5 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200_NVLx1 FP8 Throughput ngcMetadata: 280bfbbbf4ea6e6744b706d25032054ad18289814406f251d9f862b044c51c67: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 50ad558f72f96fb5171036046bfaab28fc9eb1157e31488c5da2c3ab0134c020 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200_NVLx1 BF16 Throughput ngcMetadata: 2d1a186f55c204c95b4abd9df2056e3095b148700cd8fdda115ffe7bea3bed60: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 940e6e9a959cdd8829c9cba449c1c8bc83ef2522ce1f263cbc7e1920399fe465 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100_NVLx1 FP8 Throughput ngcMetadata: 2e2ecec7b2d03c998a8bae64e150a5f88bfde56917d372dc91ffc08f94c9d07f: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9fb78395ea4e775bdbf7c8df874ee89b4e084d35fd3ee6f8b105ec6061e8d887 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE 
value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200_NVLx2 FP8 Latency ngcMetadata: 2e92e2be673e48b2312076393db8caff10c7dae24bf90cd1637b197fe2dda0f2: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: ba433d54af67e3a0d72db4a896acb2e92ba9caa41d731086c34d9e77df019c7d number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:a100x2-latency-bf16-ezdh3qtgsw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct A100x2 BF16 Latency ngcMetadata: 345837de17bc4e103174352bc07a86112cef00318470e5477afb24908d09abb6: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f517db32f12098356f9ef902992f57d5362a4e58a8d185c993cc93657f18a3cb number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:a10gx2-latency-bf16-sdcegxqefa framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct A10Gx2 BF16 Latency ngcMetadata: 41eb6cd432c8d498926942101511e4da1e913d0d22adbc96ed547a8042d2b7ce: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 071daef17e33c140ef8b82b89004ed3d5412e3eca9b4bedb8b57824dc05e975c number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 2 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct A100_SXM4_40GBx2 BF16 Latency ngcMetadata: 43a51be16bada864c0ab6acc3e267e333fe69150a05287d5386e7cb39c7c61bb: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 04ee232eb94574469d111ff8001c42836ee881f67fbf9040fbdc582e6b1b1c42 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 2 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GH200_144GBx1 BF16 Latency ngcMetadata: 4843f0f1c0b0b410cbc37dbb748396f2793b0eb5ed8ad9f215e06da1e82b98e8: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: 
GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 736c2ad1750f0504768fb31d97587a07cc473517c232e8454f098e63c0f5de5c number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct A100_SXM4_40GBx1 BF16 Throughput ngcMetadata: 4acbbc32a700f17dd483e6a53914ec62688d029120fb0c216420e8481983d0e7: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 10849901781866ddf59b10df9f42464a7c089c5f0a61f41f6b25862f19195a7b number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 1 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:l40sx2-latency-fp8-uvobdo54ig framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct L40Sx2 FP8 Latency ngcMetadata: 55be74aa57225eb45db56bda45a2e1ad7a02f8f30d5f8eef9877df8adacc0550: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 0669a60d949ab85b17b5f2a73d7e9f6b131797740da6ed43e29e5f41066d571c number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h200x2-latency-fp8--yymwnqgka framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200x2 FP8 Latency ngcMetadata: 58db8acdee23b42a43f731e5e6e7d123ff889d70318dd876d8325bfdd9d52023: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d9bafa9974769a2f7539affd5e55acef006ce62c15bc088dc3a55afd818ee124 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:gb200x1-throughput-bf16-zyj-crhkzq framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GB200x1 BF16 Throughput ngcMetadata: 59619a192c8ef4c65e8363642f722508401f1392f64fd007337abb01ecbe7d19: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 35decffe402ab43b965bedff55c8cef9addc05e917187701562af2f4fe213de9 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT 
value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h200x1-throughput-bf16-tsa8sfptpw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200x1 BF16 Throughput ngcMetadata: 61555d7be4e6de25c9219d7a0bb106d40ce887b31f29ea8df4fee6110e2b853b: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 9b5fee25e36a210bec2865d4ebd5c974a8cf4e4002efacd9ec516642370cbd9c number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:b200x1-throughput-fp8-ys-xbyv-sg framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct B200x1 FP8 Throughput ngcMetadata: 6d0ee85cf622a72848fc5daa170614ce7fcec7167fe53d718878f08eb24cb965: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 5c56717955f737f260b918572f35036dee10c1b54530f4096fa66af19b17ffd5 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 2GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GH200_480GBx1 FP8 Throughput ngcMetadata: 700827bec7ac9724fe295b4bdde657eff97c34de54f3ad504fabfe32e12e3e18: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 8e95c5073de6fd99ffce7014cc733fe55fa894162cfec8c938756de81fa8ecac number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200_NVLx2 BF16 Latency ngcMetadata: 70089c0e01ba82698bed7ab932bafebc141455e40bd15567f7e37496ac7bcf1e: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200_NVL gpu_device: 233b:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 3575e95e2064a522faa33fa6dbf9a6c3eddbee6bcf286d0c44315142c402b089 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 233B:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h200x2-latency-bf16-itm2i3hlig framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200x2 BF16 Latency ngcMetadata: 
7c29442049d0390525e51aaf5d3d3ac7c676bf7222707b7ed29442e2a95227c5: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: f987c70ae752902a0fb500d8f378afd65fc5b47b5eeb88627004b0edb210bcd8 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct RTX6000_BLACKWELL_SVx1 FP8 Throughput ngcMetadata: 85eeb431dec2e7ce1aff645c1e1e08d0a42a644f64874c68a65fd4e07189b902: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: RTX6000_BLACKWELL_SV gpu_device: 2bb5:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: d76814b2d2442f8ad709563f653ed4b80e39f5ad1acbd0b84822235fe4e3d1a4 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_BLACKWELL_SV - key: COUNT value: 1 - key: GPU DEVICE value: 2BB5:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:l40sx1-throughput-fp8-wocwu5pweq framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct L40Sx1 FP8 Throughput ngcMetadata: 894221e5032dfb82de8567266ea22114b8597aabc85b93e92ad290508ecd33bf: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 13f1a9e953c8a5320111dc8a580cd3855291326abe1d1e5b5c7dfced9cb6f6ea number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 2GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:a100x1-throughput-bf16-dohmk4psfa framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct A100x1 BF16 Throughput ngcMetadata: 8b22a466a5ef2f848151ea4679201cf4f7fe7ebd7094671cfa3df7a25836b4ff: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: c83a64076a4cecd2c6d8d55db86ba5d0b31395c28ab806962828aa291c192b33 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:gb200x2-latency-bf16-2w1oa3-9bw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GB200x2 BF16 Latency ngcMetadata: 930b33fdac9c955b3149d675d262d286f7e4db61503e9c9de17aa18dfe092238: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: e8d3089f177a1641f3b610799a19a3a4c752e1795eca7be92a129e9bae5cbc39 number_of_gpus: '2' pp: '1' 
precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GH200_480GBx1 BF16 Throughput ngcMetadata: a416c249ee8f78d2790919c1d5e6f3afa3be9d85f3e77cd635e65335caae4ddb: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: cafc299ec54dd11d527965baac566347feccd9dad56d5a89cfb4e710e56b8b2c number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h200x1-throughput-fp8-b62hkfmx2a framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200x1 FP8 Throughput ngcMetadata: a6780e332aeb2c67ff491b2b1d13f04c58c238bc07bad39af5b0c552d6e3dfae: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7ccea195195ee0642c6785c8d5aa7ab8976737fc4f80f2966f5dcf7c8333391d number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 2GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:gb200x1-throughput-fp8-nupr5gs2dw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GB200x1 FP8 Throughput ngcMetadata: a8d5512071d8c48e62ac709edc231cbf158aacc5faae00040379c8c3bc4f2bf8: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: bc56bf15cfcf13e2c7834d0ce1767a27719c1de7291cee654f6232c35375bf45 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 2GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h100x2-latency-fp8-5cyndvc2za framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100x2 FP8 Latency ngcMetadata: af99cd31d06f9fb19ffe3dfce1e5c053ffbd43f3c7e671d9c4550eccb8dee31e: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 44b31027ddef82c6ddcd49ecfc68f20067975ad0d9b62755ca3667e055f48ab5 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: 
nim/meta/llama-3.2-1b-instruct:l40sx2-latency-bf16-wimz1alj0q framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct L40Sx2 BF16 Latency ngcMetadata: b0833059fa3270d15e7a2ccbd1228fe5a3681b5801395d5c2c306fac3a386534: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a4315018cc1c0f75d76920b73d557daea0e2f4dfbd8b626515adb36e90ebba12 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GH200_144GBx1 FP8 Throughput ngcMetadata: b4375abbe106a961f61ebfc40ecb73490ca64fadba7b06c156173aeec81ed2fa: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 94f9fdc707b068711db2447004698e4a90be58c6d3329463ba57c85714bff488 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:b200x2-latency-bf16-jvvom3lafg framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct B200x2 BF16 Latency ngcMetadata: bb8774e429cd06145c3af972a557193da4859bbb406bdd6ab4eba1111b757ee4: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 417053eecdaf868c0db37d64920e90eb29881638168c9781c33309dabd9852c8 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100_NVLx2 FP8 Latency ngcMetadata: bbc09967e87df528eafcdee9c95946cbc528a004bade8c30e4d655902b4a1eda: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 7ad4d81f5c1ff5bd7171ee75a51e45af792644f3cc45282999e3405adfbec78e number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h100x1-throughput-fp8-tfwwzbhdca framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100x1 FP8 Throughput ngcMetadata: cad5ff155623a7ed9e6e400347be5d2b1772324a4389585f372d99ab2b18310b: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm 
nim_workspace_hash_v1: fa0733fe89758e6911ee12b330d3bc39d1e933a9cffcd13f80f54f00e44d5808 number_of_gpus: '2' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 2GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GH200_144GBx1 BF16 Throughput ngcMetadata: cc2a610402f4d5fc1b580b09af8e0f35c4a6214bc5a3da2de66aa1c9eaf00703: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GH200_144GB gpu_device: 2348:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 64b5a893869e9c17ddd2a31f28ebe867079c41899f1e00c503cce640fa487128 number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_144GB - key: COUNT value: 1 - key: GPU DEVICE value: 2348:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:gb200x2-latency-fp8-mj9h4xjlpw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct GB200x2 FP8 Latency ngcMetadata: d30b476874b44f6d697a81c37a6a5df7747b95d77ec196644b1b288cfa4ebb99: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: GB200 gpu_device: 2941:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a79685d119872d802f1fd849c1636665049ff58a7354fb536df1451d42952a75 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: GB200 - key: COUNT value: 2 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h100x2-latency-bf16-kdm6hypmza framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100x2 BF16 Latency ngcMetadata: d35b0a4879ffc8a4ee620979a9b0306c1d35265c0d0d4917b1079da1bd44c830: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: adb5fe983b732e233035db031b01461e698505aa8db330f00da89ed240b244b3 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:h200x2-latency-fp8-qdlgs44zrw framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H200x2 FP8 Latency ngcMetadata: ece478a8ed72c4ff1b85bf758b105a30b724f5da2320278a50d7328ae661eaed: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: a14e294573c64a9a49641e82012cb4fa3776f6fe25d9be84092b4ec2476006e1 number_of_gpus: '4' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.12.0 - key: 
DOWNLOAD SIZE value: 3GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-1b-instruct:hf-9213176-tool_calling framework: TensorRT-LLM displayName: Llama 3.2 1B Instruct H100_NVLx2 BF16 Latency ngcMetadata: ee50ced41e24f36d4ad7c0bb3504688562be53a94b7f76fa99a867ff8b5d06ca: model: meta/llama-3.2-1b-instruct release: 1.12.0 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm nim_workspace_hash_v1: 77f4dfe035e443e59bbaf204c6e17b9c52cb16cbb638e426a5eedb0a8b6b2177 number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.12.0 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - variantId: Llama 3.2 3B Instruct modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-02-19T20:43:20.538Z",
    "description": "## Model Information\n\nThe Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 1B and 3B sizes (text in/text out). The Llama 3.2 instruction-tuned text only models are optimized for multilingual dialogue use cases, including agentic retrieval and summarization tasks. They outperform many of the available open source and closed chat models on common industry benchmarks. \n\nLlama 3.2 models are ready for commercial use.\n\nModels are accelerated by [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a library for optimizing Large Language Model (LLM) inference on NVIDIA GPUs.\n\n**Models in this Collection:**  \n- Llama-3.2-1B\n- Llama-3.2-1B-Instruct\n- Llama-3.2-3B\n- Llama-3.2-3B-Instruct\n\n**Model Developer:** Meta\n\n**Model Version:** 3.2\n\n**Model Release Date:** September 25, 2024\n\n**Third-Party Community Consideration:**\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party\u2019s requirements for this application and use case; see link to Non-NVIDIA [Llama 3.2 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/MODEL_CARD.md).\n\n**License:** Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).\n\n**Model Architecture:** Llama 3.2 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.\n\n|  | Training Data | Params | Input modalities | Output modalities | Context Length | GQA | Shared Embeddings | Token count | Knowledge cutoff |\n| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |\n| Llama 3.2 (text only)  | A new mix of publicly available online data. | 1B (1.23B) | Multilingual Text | Multilingual Text and code  | 128k | Yes | Yes | Up to 9T tokens | December 2023 |\n|  |  | 3B (3.21B) | Multilingual Text | Multilingual Text and code  |  |  |  |  |  |\n\n**Supported Languages:** English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai are officially supported. Llama 3.2 has been trained on a broader collection of languages than these 8 supported languages. Developers may fine-tune Llama 3.2 models for languages beyond these supported languages, provided they comply with the Llama 3.2 Community License and the Acceptable Use Policy. Developers are always expected to ensure that their deployments, including those that involve additional languages, are completed safely and responsibly.\n\n**Llama 3.2 Model Family:** Token counts refer to pre-training data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability.\n\n**Status:** This is a static model trained on an offline dataset. Future versions may be released that improve model capabilities and safety. \n\n**Feedback:** Where to send questions or comments about the model Instructions on how to provide feedback or comments on the model can be found in the model [README](https://github.com/meta-llama/llama-models/tree/main/models/llama3_2). For more technical information about generation parameters and recipes for how to use Llama 3.2 in applications, please go [here](https://github.com/meta-llama/llama-recipes). 
\n\n## Intended Use\n\n**Intended Use Cases:** Llama 3.2 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat and agentic applications like knowledge retrieval and summarization, mobile AI powered writing assistants and query and prompt rewriting. Pre-trained models can be adapted for a variety of additional natural language generation tasks. \n\n**Out of Scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.2 Community License. Use in languages beyond those explicitly referenced as supported in this model card.\n\n## Software Integration\n\n**Supported Hardware Microarchitecture Compatibility:**\n- NVIDIA Ampere\n- NVIDIA Hopper\n- NVIDIA Lovelace\n- NVIDIA Jetson\n\n**Supported Operating System(s):**\n- Linux \n- Windows\n\n## Hardware and Software\n\n**Training Factors:** We used custom training libraries, Meta's custom built GPU cluster, and production infrastructure for pre-training. Fine-tuning, annotation, and evaluation were also performed on production infrastructure.\n\n**Training Energy Use:** Training utilized a cumulative of **916k** GPU hours of computation on H100-80GB (TDP of 700W) type hardware, per the table below. Training time is the total GPU time required for training each model and power consumption is the peak power capacity per GPU device used, adjusted for power usage efficiency. \n\n## \n\n**Training Greenhouse Gas Emissions:** Estimated total location-based greenhouse gas emissions were **240** tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy; therefore, the total market-based greenhouse gas emissions for training were 0 tons CO2eq.\n\n|  | Training Time (GPU hours) | Logit Generation Time (GPU Hours) | Training Power Consumption (W) | Training Location-Based Greenhouse Gas Emissions (tons CO2eq) | Training Market-Based Greenhouse Gas Emissions (tons CO2eq) |\n| :---- | :---: | ----- | :---: | :---: | :---: |\n| Llama-3.2-1B | 370k | \\- | 700 | 107 | 0 |\n| Llama-3.2-3B | 460k | \\- | 700 | 133 | 0 |\n| Total | 830k |         86k |  | 240 | 0 |\n\nThe methodology used to determine training energy use and greenhouse gas emissions can be found [here](https://arxiv.org/pdf/2204.05149). Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.\n\n## Training Data\n\n**Data Collection Method:** Unknown  \n**Labeling Method:** Unknown\n\n**Overview:** Llama 3.2 was pre-trained on up to 9 trillion tokens of data from publicly available sources. For the 1B and 3B Llama 3.2 models, we incorporated logits from the Llama 3.1 8B and 70B models into the pre-training stage of the model development, where outputs (logits) from these larger models were used as token-level targets. Knowledge distillation was used after pruning to recover performance. In post-training we used a similar recipe as Llama 3.1 and produced final chat models by doing several rounds of alignment on top of the pre-trained model. 
Each round involved Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO).\n\n**Data Freshness:** The pre-training data has a cutoff of December 2023.\n\n## Benchmarks \\- English Text\n\nIn this section, we report the results for Llama 3.2 models on standard automatic benchmarks. For all these evaluations, we used our internal evaluations library. \n\n### Base Pre-trained Models \n\n| Category | Benchmark | \\# Shots | Metric | Llama-3.2-1B | Llama-3.2-3B | Llama-3.1-8B |\n| ----- | ----- | :---: | :---: | :---: | :---: | :---: |\n| General | MMLU | 5 | macro\\_avg/acc\\_char | 32.2 | 58 | 66.7 |\n|  | AGIEval English | 3-5 | average/acc\\_char | 23.3 | 39.2 | 47.8 |\n|  | ARC-Challenge | 25 | acc\\_char | 32.8 | 69.1 | 79.7 |\n| Reading comprehension | SQuAD | 1 | em | 49.2 | 67.7 | 77 |\n|  | QuAC (F1) | 1 | f1 | 37.9 | 42.9 | 44.9 |\n|  | DROP (F1) | 3 | f1 | 28.0 | 45.2 | 59.5 |\n| Long Context | Needle in Haystack | 0 | em | 96.8 | 1 | 1 |\n\n### Instruction-Tuned Models\n\n| Capability |  | Benchmark | \\# Shots | Metric | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Llama-3.1-8B-Instruct |\n| :---: | ----- | :---: | :---: | :---: | :---: | :---: | :---: |\n| General |  | MMLU | 5 | macro\\_avg/acc | 49.3 | 63.4 | 69.4 |\n| Re-writing |  | Open-rewrite eval | 0 | micro\\_avg/rougeL | 41.6 | 40.1 | 40.9 |\n| Summarization |  | TLDR9+ (test) | 1 | rougeL | 16.8 | 19.0 | 17.2 |\n| Instruction following |  | IFEval | 0 | Avg(Prompt/Instruction acc Loose/Strict) | 59.5 | 77.4 | 80.4 |\n| Math |  | GSM8K (CoT) | 8 | em\\_maj1@1 | 44.4 | 77.7 | 84.5 |\n|  |  | MATH (CoT) | 0 | final\\_em | 30.6 | 48.0 | 51.9 |\n| Reasoning |  | ARC-C | 0 | acc | 59.4 | 78.6 | 83.4 |\n|  |  | GPQA | 0 | acc | 27.2 | 32.8 | 32.8 |\n|  |  | Hellaswag | 0 | acc | 41.2 | 69.8 | 78.7 |\n| Tool Use |  | BFCL V2 | 0 | acc | 25.7 | 67.0 | 67.1 |\n|  |  | Nexus | 0 | macro\\_avg/acc | 13.5 | 34.3 | 38.5 |\n| Long Context |  | InfiniteBench/En.QA | 0 | longbook\\_qa/f1 | 20.3 | 19.8 | 27.3 |\n|  |  | InfiniteBench/En.MC | 0 | longbook\\_choice/acc | 38.0 | 63.3 | 72.2 |\n|  |  | NIH/Multi-needle | 0 | recall | 75.0 | 84.7 | 98.8 |\n| Multilingual |  | MGSM (CoT) | 0 | em | 24.5 | 58.2 | 68.9 |\n\n### Multilingual Benchmarks\n\n| Category | Benchmark | Language | Llama-3.2-1B-Instruct | Llama-3.2-3B-Instruct | Llama-3.1-8B-Instruct |\n| :---: | :---: | :---: | :---: | :---: | :---: |\n| General | MMLU (5-shot, macro\\_avg/acc) | Portuguese | 39.82 | 54.48 | 62.12 |\n|  |  | Spanish | 41.52 | 55.09 | 62.45 |\n|  |  | Italian | 39.79 | 53.77 | 61.63 |\n|  |  | German | 39.20 | 53.29 | 60.59 |\n|  |  | French | 40.47 | 54.59 | 62.34 |\n|  |  | Hindi | 33.51 | 43.31 | 50.88 |\n|  |  | Thai | 34.67 | 44.54 | 50.32 |\n\n## Responsibility & Safety\n\nAs part of our Responsible release approach, we followed a three-pronged strategy to managing trust & safety risks:\n\n1. Enable developers to deploy helpful, safe and flexible experiences for their target audience and for the use cases supported by Llama   \n2. Protect developers against adversarial users aiming to exploit Llama capabilities to potentially cause harm  \n3. Provide protections for the community to help prevent the misuse of our models\n\n### Responsible Deployment \n\n**Approach:** Llama is a foundational technology designed to be used in a variety of use cases. 
Examples on how Meta\u2019s Llama models have been responsibly deployed can be found in our [Community Stories webpage](https://llama.meta.com/community-stories/). Our approach is to build the most helpful models, enabling the world to benefit from the technology power, by aligning our model safety for generic use cases and addressing a standard set of harms. Developers are then in the driver\u2019s seat to tailor safety for their use cases, defining their own policies and deploying the models with the necessary safeguards in their Llama systems. Llama 3.2 was developed following the best practices outlined in our [Responsible Use Guide](https://llama.meta.com/responsible-use-guide/). \n\n#### Llama 3.2 Instruct \n\n**Objective:** Our main objectives for conducting safety fine-tuning are to provide the research community with a valuable resource for studying the robustness of safety fine-tuning, as well as to offer developers a readily available, safe, and powerful model for various applications to reduce the developer workload to deploy safe AI systems. We implemented the same set of safety mitigations as in Llama 3, and you can learn more about these in the Llama 3 [paper](https://ai.meta.com/research/publications/the-llama-3-herd-of-models/). \n\n**Fine-Tuning Data:** We employ a multi-faceted approach to data collection, combining human-generated data from our vendors with synthetic data to mitigate potential safety risks. We\u2019ve developed many large language model (LLM)-based classifiers that enable us to thoughtfully select high-quality prompts and responses, enhancing data quality control. \n\n**Refusals and Tone:** Building on the work we started with Llama 3, we put a great emphasis on model refusals to benign prompts as well as refusal tone. We included both borderline and adversarial prompts in our safety data strategy, and modified our safety data responses to follow tone guidelines. \n\n#### Llama 3.2 Systems\n\n**Safety as a System:** Large language models, including Llama 3.2, **are not designed to be deployed in isolation** but instead should be deployed as part of an overall AI system with additional safety guardrails as required. Developers are expected to deploy system safeguards when building agentic systems. Safeguards are key to achieve the right helpfulness-safety alignment as well as mitigating safety and security risks inherent to the system and any integration of the model or system with external tools. As part of our responsible release approach, we provide the community with [safeguards](https://llama.meta.com/trust-and-safety/) that developers should deploy with Llama models or other LLMs, including Llama Guard, Prompt Guard and Code Shield. All our [reference implementations](https://github.com/meta-llama/llama-agentic-system) demos contain these safeguards by default so developers can benefit from system-level safety out-of-the-box. \n\n### New Capabilities and Use Cases\n\n**Technological Advancement:** Llama releases usually introduce new capabilities that require specific considerations in addition to the best practices that generally apply across all Generative AI use cases. For prior release capabilities also supported by Llama 3.2, see [Llama 3.1 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md), as the same considerations apply here as well.\n\n**Constrained Environments:** Llama 3.2 1B and 3B models are expected to be deployed in highly constrained environments, such as mobile devices. 
LLM Systems using smaller models will have a different alignment profile and safety/helpfulness tradeoff than more complex, larger systems. Developers should ensure the safety of their system meets the requirements of their use case. We recommend using lighter system safeguards for such use cases, like Llama Guard 3-1B or its mobile-optimized version. \n\n### Evaluations\n\n**Scaled Evaluations:** We built dedicated, adversarial evaluation datasets and evaluated systems composed of Llama models and Purple Llama safeguards to filter input prompt and output response. It is important to evaluate applications in context, and we recommend building dedicated evaluation dataset for your use case.\n\n**Red Teaming:** We conducted recurring red teaming exercises with the goal of discovering risks via adversarial prompting and we used the learnings to improve our benchmarks and safety tuning datasets. We partnered early with subject-matter experts in critical risk areas to understand the nature of these real-world harms and how such models may lead to unintended harm for society. Based on these conversations, we derived a set of adversarial goals for the red team to attempt to achieve, such as extracting harmful information or reprogramming the model to act in a potentially harmful capacity. The red team consisted of experts in cybersecurity, adversarial machine learning, responsible AI, and integrity in addition to multilingual content specialists with background in integrity issues in specific geographic markets.\n\n### Critical Risks \n\nIn addition to our safety work above, we took extra care on measuring and/or mitigating the following critical risk areas:\n\n**1\\. CBRNE (Chemical, Biological, Radiological, Nuclear, and Explosive Weapons):** Llama 3.2 1B and 3B models are smaller and less capable derivatives of Llama 3.1. For Llama 3.1 70B and 405B, to assess risks related to proliferation of chemical and biological weapons, we performed uplift testing designed to assess whether use of Llama 3.1 models could meaningfully increase the capabilities of malicious actors to plan or carry out attacks using these types of weapons and have determined that such testing also applies to the smaller 1B and 3B models. \n\n**2\\. Child Safety:** Child Safety risk assessments were conducted using a team of experts, to assess the model\u2019s capability to produce outputs that could result in Child Safety risks and inform on any necessary and appropriate risk mitigations via fine tuning. We leveraged those expert red teaming sessions to expand the coverage of our evaluation benchmarks through Llama 3 model development. For Llama 3, we conducted new in-depth sessions using objective based methodologies to assess the model risks along multiple attack vectors including the additional languages Llama 3 is trained on. We also partnered with content specialists to perform red teaming exercises assessing potentially violating content while taking account of market specific nuances or experiences. \n\n**3\\. Cyber Attacks:** Our cyber attack uplift study investigated whether LLMs can enhance human capabilities in hacking tasks, both in terms of skill level and speed. Our attack automation study focused on evaluating the capabilities of LLMs when used as autonomous agents in cyber offensive operations, specifically in the context of ransomware attacks. This evaluation was distinct from previous studies that considered LLMs as interactive assistants. 
The primary objective was to assess whether these models could effectively function as independent agents in executing complex cyber-attacks without human intervention.\n\n### Community \n\n**Industry Partnerships:** Generative AI safety requires expertise and tooling, and we believe in the strength of the open community to accelerate its progress. We are active members of open consortiums, including the AI Alliance, Partnership on AI and MLCommons, actively contributing to safety standardization and transparency. We encourage the community to adopt taxonomies like the MLCommons Proof of Concept evaluation to facilitate collaboration and transparency on safety and content evaluations. Our Purple Llama tools are open sourced for the community to use and widely distributed across ecosystem partners including cloud service providers. We encourage community contributions to our [Github repository](https://github.com/meta-llama/PurpleLlama).\n\n**Grants:** We also set up the [Llama Impact Grants](https://llama.meta.com/llama-impact-grants/) program to identify and support the most compelling applications of Meta\u2019s Llama model for societal benefit across three categories: education, climate and open innovation. The 20 finalists from the hundreds of applications can be found [here](https://llama.meta.com/llama-impact-grants/#finalists). \n\n**Reporting:** Finally, we put in place a set of resources including an [output reporting mechanism](https://developers.facebook.com/llama_output_feedback) and [bug bounty program](https://www.facebook.com/whitehat) to continuously improve the Llama technology with the help of the community.\n\n## Ethical Considerations and Limitations\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n**Values:** The core values of Llama 3.2 are openness, inclusivity and helpfulness. It is meant to serve everyone, and to work for a wide range of use cases. It is thus designed to be accessible to people across many different backgrounds, experiences and perspectives. Llama 3.2 addresses users and their needs as they are, without inserting unnecessary judgment or normativity, while reflecting the understanding that even content that may appear problematic in some cases can serve valuable purposes in others. It respects the dignity and autonomy of all users, especially in terms of the values of free thought and expression that power innovation and progress. \n\n**Testing:** Llama 3.2 is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, Llama 3.2\u2019s potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of Llama 3.2 models, developers should perform safety testing and tuning tailored to their specific applications of the model. 
Please refer to available resources including our [Responsible Use Guide](https://llama.meta.com/responsible-use-guide), [Trust and Safety](https://llama.meta.com/trust-and-safety/) solutions, and other [resources](https://llama.meta.com/docs/get-started/) to learn more about responsible development.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**",
    "displayName": "Llama-3.2-3B-Instruct",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "Bulk Build",
        "NSPECT-11Z8-118P",
        "llama-3.2-3b-instruct",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx4090x1-throughput-lora-fp8-0jughsmrpw",
    "latestVersionSizeInBytes": 3776725124,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/llama-3_2-3b-instruct.jpg",
    "modelFormat": "N/A",
    "name": "llama-3.2-3b-instruct",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "Meta",
    "safetyAndSecurity": "",
    "shortDescription": "The Meta Llama 3.2 collection of multilingual large language models (LLMs) is a collection of pre-trained and instruction-tuned generative models in 1B and 3B sizes (text in/text out).",
    "teamName": "meta",
    "updatedDate": "2025-07-24T22:31:03.452Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/llama-3_2-3b-instruct optimizationProfiles: - profileId: nim/meta/llama-3.2-3b-instruct:rtx4090x1-throughput-fp8-jtgu5wt2yg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX4090x1 FP8 Throughput ngcMetadata: 08cb5b3735b6331f07212bd488639ad1a049dbcf3e96375acbbb83ca861f9ec9: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX4090 gpu_device: 2684:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX4090 - key: COUNT value: 1 - key: GPU DEVICE value: 2684:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 4GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h20x1-throughput-bf16-hnixelsq-q framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H20x1 BF16 Throughput ngcMetadata: 1cdb4d3f28059cb1aacb005776112ce4f7060a20d25d072932ce60bbe993fabc: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H20 gpu_device: 2329:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H20 - key: COUNT value: 1 - key: GPU DEVICE value: 2329:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:a100x2-latency-bf16-dbue0mkzcw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct A100x2 BF16 Latency ngcMetadata: 2146fcf18ea0412d564c6ed21d2f727281b95361fd78ccfa3d0570ec1716e8db: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 2 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:a100x1-throughput-bf16-lblsxfeipq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct A100x1 BF16 Throughput ngcMetadata: 222d1729a785201e8a021b226d74d227d01418c41b556283ee1bdbf0a818bd94: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100_NVLx1 BF16 Throughput ngcMetadata: 25b5e251d366671a4011eaada9872ad1d02b48acc33aa0637853a3e3c3caa516: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 1 - 
key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct A100_SXM4_40GBx2 BF16 Latency ngcMetadata: 30316e5488489e3c0c2b0e7eee9e4bf5e82655b2a31b66d2e2c5dfa2b4e99bb2: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 2 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:rtx4090x1-latency-fp8-cq5x62ffbg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX4090x1 FP8 Latency ngcMetadata: 33ca5a99fa9b89117df4b610b3f37fdf3462bc2e84a5b96bcf7685e5d839f7f5: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX4090 gpu_device: 2684:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX4090 - key: COUNT value: 1 - key: GPU DEVICE value: 2684:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h20x2-latency-bf16-u-xo-smuuq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H20x2 BF16 Latency ngcMetadata: 362bd1de84adb8cc5be888391810dd9cc02ce3f25ad0b70fd500be54f93b9d4c: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H20 gpu_device: 2329:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H20 - key: COUNT value: 2 - key: GPU DEVICE value: 2329:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h200x1-throughput-bf16-frc0n1b7nw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H200x1 BF16 Throughput ngcMetadata: 434e8d336fa23cbe151748d32b71e196d69f20d319ee8b59852a1ca31a48d311: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:b200x2-latency-fp8-04qswl5yla framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct B200x2 FP8 Latency ngcMetadata: 4950d30811e1e426e97cda69e6c03a8a4819db8aa4abf34722ced4542a1f6b52: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 2 - 
key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX6000_ADAx1 BF16 Latency ngcMetadata: 566962048d4b01afd12f466ae697cf071eed5a46be33d66f3733e978ce99d1e7: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX6000_ADA gpu_device: 26b1:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX6000_ADA - key: COUNT value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100_NVLx1 FP8 Throughput ngcMetadata: 5811750e70b7e9f340f4d670c72fcbd5282e254aeb31f62fd4f937cfb9361007: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 1 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h200x2-latency-bf16--b69z90dgg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H200x2 BF16 Latency ngcMetadata: 6832a9395f54086162fd7b1c6cfaae17c7d1e535a60e2b7675504c9fc7b57689: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h100x2-latency-fp8-r2-4vhtqrq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100x2 FP8 Latency ngcMetadata: 6c3f01dd2b2a56e3e83f70522e4195d3f2add70b28680082204bbb9d6150eb04: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:a10gx1-throughput-bf16-r9bno-v4fw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct A10Gx1 BF16 Throughput ngcMetadata: 74bfd8b2df5eafe452a9887637eef4820779fb4e1edb72a4a7a2a1a2d1e6480b: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 1 - key: 
GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h100x1-throughput-fp8-kc5b4ag-cg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100x1 FP8 Throughput ngcMetadata: 7b508014e846234db3cabe5c9f38568b4ee96694b60600a0b71c621dc70cacf3: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h20x2-latency-fp8-icgplntjww framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H20x2 FP8 Latency ngcMetadata: 7fba1c034f3ace0a31d7cc345ec44482735555168c62c332bb38121c26345bbd: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H20 gpu_device: 2329:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H20 - key: COUNT value: 2 - key: GPU DEVICE value: 2329:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX6000_ADAx1 FP8 Throughput ngcMetadata: 8620431f3069e2f17f1cf712639ba06d67290d74c4c2e9a0d6e606952de91a88: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX6000_ADA gpu_device: 26b1:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: RTX6000_ADA - key: COUNT value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h20x1-throughput-fp8-bmppgnfoeq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H20x1 FP8 Throughput ngcMetadata: 86be215b815363c818c00883dd403bd1f4ce5c610037637529a2a7e039973de6: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H20 gpu_device: 2329:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H20 - key: COUNT value: 1 - key: GPU DEVICE value: 2329:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:b200x1-throughput-fp8-pysymm95jq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct B200x1 FP8 Throughput ngcMetadata: 8b87146e39b0305ae1d73bc053564d1b4b4c565f81aa5abe3e84385544ca9b60: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU 
DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l20x1-throughput-bf16-rpqq5ggd-q framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L20x1 BF16 Throughput ngcMetadata: 91c52b108cd75967df6ed98f3d1d73a34cb0899d625f6f86499089f545ebe458: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L20 gpu_device: 26ba:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L20 - key: COUNT value: 1 - key: GPU DEVICE value: 26BA:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100_NVLx2 FP8 Latency ngcMetadata: a00ce1e782317cd19ed192dcb0ce26ab8b0c1da8928c33de8893897888ff7580: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H100_NVL - key: COUNT value: 2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX6000_ADAx1 FP8 Latency ngcMetadata: a1ebfd69da7c3b97aa566387a4f086e563ff848cc5bab442147badc55f63364a: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX6000_ADA gpu_device: 26b1:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: RTX6000_ADA - key: COUNT value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:b200x1-throughput-bf16-iwdccsjltw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct B200x1 BF16 Throughput ngcMetadata: a4c63a91bccf635b570ddb6d14eeb6e7d0acb2389712892b08d21fad2ceaee38: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX6000_ADAx1 BF16 Throughput ngcMetadata: a7b900f860f8770ecf1a982e79395659729010beeec832b522d96e8243b2439a: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX6000_ADA gpu_device: 26b1:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX6000_ADA - key: COUNT 
value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l40sx1-throughput-bf16-i09pxvzjbg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L40Sx1 BF16 Throughput ngcMetadata: ac5071bbd91efcc71dc486fcd5210779570868b3b8328b4abf7a408a58b5e57c: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l40sx1-throughput-fp8-jnzgjqaxuw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L40Sx1 FP8 Throughput ngcMetadata: ad17776f4619854fccd50354f31132a558a1ca619930698fd184d6ccf5fe3c99: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h200x1-throughput-fp8-r0-6osqtng framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H200x1 FP8 Throughput ngcMetadata: af876a179190d1832143f8b4f4a71f640f3df07b0503259cedee3e3a8363aa96: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 1 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h100x2-latency-bf16-0i4agi9azq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100x2 BF16 Latency ngcMetadata: b3d535c0a7eaaea089b087ae645417c0b32fd01e7e9d638217cc032e51e74fd0: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 2 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100_NVLx2 BF16 Latency ngcMetadata: b7fad3b35b07d623fac6549078305b71d0e6e1d228a86fa0f7cfe4dbeca9151a: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100_NVL gpu_device: 2321:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: H100_NVL - key: COUNT value: 
2 - key: GPU DEVICE value: 2321:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l20x2-latency-bf16-acj72sjf5a framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L20x2 BF16 Latency ngcMetadata: c1c471464263781f56805d7768a50f70c830dbc68d795d641bd5bef18455b6f4: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L20 gpu_device: 26ba:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L20 - key: COUNT value: 2 - key: GPU DEVICE value: 26BA:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l40sx2-latency-fp8-44i4vvrorq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L40Sx2 FP8 Latency ngcMetadata: c4ff823a8202af4b523274fb8c6cdd73fa8ee5af16391a6d36b17f714a3c71a0: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l20x2-latency-fp8-f2nzfrgyia framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L20x2 FP8 Latency ngcMetadata: c610b690036f0e8ac96ea3ed1e584ef5c4f8a4cf1253664b8dc08df5b404de48: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L20 gpu_device: 26ba:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: L20 - key: COUNT value: 2 - key: GPU DEVICE value: 26BA:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct A100_SXM4_40GBx1 BF16 Throughput ngcMetadata: c6821c013c559912c37e61d7b954c5ca8fe07dda76d8bea0f4a52320e0a54427: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: A100_SXM4_40GB gpu_device: 20b0:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: A100_SXM4_40GB - key: COUNT value: 1 - key: GPU DEVICE value: 20B0:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:rtx4090x1-throughput-bf16-8m--uis3tg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX4090x1 BF16 Throughput ngcMetadata: c78670b98ba7d5bc4105cbf723eb1cb514e3cb159dacd3d8b997b20c9ceeb1ea: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX4090 gpu_device: 2684:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: RTX4090 - key: COUNT 
value: 1 - key: GPU DEVICE value: 2684:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l20x1-throughput-fp8-fqsk6q2inq framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L20x1 FP8 Throughput ngcMetadata: d2f14fb35f10d3ffef37a9f198d3c39f37a1452f65a1b523ec0135868fb23ba7: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L20 gpu_device: 26ba:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: L20 - key: COUNT value: 1 - key: GPU DEVICE value: 26BA:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h200x2-latency-fp8-zzxu8dlxcw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H200x2 FP8 Latency ngcMetadata: e4f217a5fb016b570e34b8a8eb06051ccfef9534ba43da973bb7f678242eaa5f: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H200 gpu_device: 2335:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: fp8 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: FP8 - key: GPU value: H200 - key: COUNT value: 2 - key: GPU DEVICE value: 2335:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 5GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:h100x1-throughput-bf16--lfg89p-ew framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct H100x1 BF16 Throughput ngcMetadata: e7dbd9a8ce6270d2ec649a0fecbcae9b5336566113525f20aee3809ba5e63856: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:a10gx2-latency-bf16-0ksvrbt0ww framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct A10Gx2 BF16 Latency ngcMetadata: ee94491ed7167340de93fe9d1c87f10ba424da6f497eeabf83b4edcbeb69364c: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: A10G gpu_device: 2237:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: A10G - key: COUNT value: 2 - key: GPU DEVICE value: 2237:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:b200x2-latency-bf16-f-dquqynva framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct B200x2 BF16 Latency ngcMetadata: f44768c625db71a327cf17e750d5e1a8e60171a8d8ef6b4c1c4b57fe74c9bf46: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: B200 gpu_device: 2901:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: B200 - key: COUNT value: 2 - key: GPU DEVICE 
value: 2901:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct GH200_480GBx1 FP8 Throughput ngcMetadata: f49b49f3d90159a594def51efd8595f1d618e288bca2721fe08e786a1ac67d04: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: fp8 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: FP8 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:rtx4090x1-latency-bf16-b25uxqlekg framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct RTX4090x1 BF16 Latency ngcMetadata: f5e266ce2a4692b37b80e0cb6ab2dea59a54d26b80396f1a521921384bd79ffe: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: RTX4090 gpu_device: 2684:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: latency tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: RTX4090 - key: COUNT value: 1 - key: GPU DEVICE value: 2684:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 7GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:hf-392a143-tool-use-v2 framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct GH200_480GBx1 BF16 Throughput ngcMetadata: f7f74ecd523cd63065a50016a8786a893b9b1efe0d313bc5bcc54682f56e55fe: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: GH200_480GB gpu_device: 2342:10de llm_engine: tensorrt_llm number_of_gpus: '1' pp: '1' precision: bf16 profile: throughput tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: THROUGHPUT - key: PRECISION value: BF16 - key: GPU value: GH200_480GB - key: COUNT value: 1 - key: GPU DEVICE value: 2342:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 12GB - key: LLM ENGINE value: TENSORRT_LLM - profileId: nim/meta/llama-3.2-3b-instruct:l40sx2-latency-bf16-pb6dhqvrgw framework: TensorRT-LLM displayName: Llama 3.2 3B Instruct L40Sx2 BF16 Latency ngcMetadata: fa36c3502e92c50f78a1906242f929864955e702b7dbfbdb19758fb7ee9aa811: model: meta/llama-3.2-3b-instruct release: 1.10.1 tags: feat_lora: 'false' gpu: L40S gpu_device: 26b9:10de llm_engine: tensorrt_llm number_of_gpus: '2' pp: '1' precision: bf16 profile: latency tp: '2' modelFormat: trt-llm spec: - key: PROFILE value: LATENCY - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 2 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.10.1 - key: DOWNLOAD SIZE value: 8GB - key: LLM ENGINE value: TENSORRT_LLM labels: - Llama - Meta - Multilingual Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: NeMo Retriever-Parse displayName: NeMo Retriever-Parse modelHubID: nemoretriever-parse category: Text Extraction type: NGC description: Nemoretriever-parse is a general purpose text-extraction model, specifically designed to handle documents. Given an image, nemoretriever-parse is able to extract formatted-text, with bounding-boxes and the corresponding semantic class. 
This has downstream benefits for several tasks such as increasing the availability of training-data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document understanding pipelines. requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3/license/ modelVariants: - variantId: nemoretriever-parse:1.2.0 modelCard: ewogICAgImFjY2Vzc1R5cGUiOiAiTk9UX0xJU1RFRCIsCiAgICAiYXBwbGljYXRpb24iOiAiT3RoZXIiLAogICAgImJpYXMiOiAiIiwKICAgICJjYW5HdWVzdERvd25sb2FkIjogZmFsc2UsCiAgICAiY3JlYXRlZERhdGUiOiAiMjAyNS0wMy0xM1QwMDoyODozNC4zNTRaIiwKICAgICJkZXNjcmlwdGlvbiI6ICIjIG5lbW9yZXRyaWV2ZXItcGFyc2UgXG5cbm5lbW9yZXRyaWV2ZXItcGFyc2UgaXMgYSBnZW5lcmFsIHB1cnBvc2UgdGV4dC1leHRyYWN0aW9uIG1vZGVsLCBzcGVjaWZpY2FsbHkgZGVzaWduZWQgdG8gaGFuZGxlIGRvY3VtZW50cy4gR2l2ZW4gYW4gaW1hZ2UsIG5lbW9yZXRyaWV2ZXItcGFyc2UgaXMgYWJsZSB0byBleHRyYWN0IGZvcm1hdHRlZC10ZXh0LCB3aXRoIGJvdW5kaW5nLWJveGVzIGFuZCB0aGUgY29ycmVzcG9uZGluZyBzZW1hbnRpYyBjbGFzcy4gVGhpcyBoYXMgZG93bnN0cmVhbSBiZW5lZml0cyBmb3Igc2V2ZXJhbCB0YXNrcyBzdWNoIGFzIGluY3JlYXNpbmcgdGhlIGF2YWlsYWJpbGl0eSBvZiB0cmFpbmluZy1kYXRhIGZvciBMYXJnZSBMYW5ndWFnZSBNb2RlbHMgKExMTXMpLCBpbXByb3ZpbmcgdGhlIGFjY3VyYWN5IG9mIHJldHJpZXZlciBzeXN0ZW1zLCBhbmQgZW5oYW5jaW5nIGRvY3VtZW50IHVuZGVyc3RhbmRpbmcgcGlwZWxpbmVzLlxuXG4jIyBMaWNlbnNlIFxuR09WRVJOSU5HIFRFUk1TOiBUaGUgTklNIGNvbnRhaW5lciBpcyBnb3Zlcm5lZCBieSB0aGUgW05WSURJQSBTb2Z0d2FyZSBMaWNlbnNlIEFncmVlbWVudF0oaHR0cHM6Ly93d3cubnZpZGlhLmNvbS9lbi11cy9hZ3JlZW1lbnRzL2VudGVycHJpc2Utc29mdHdhcmUvbnZpZGlhLXNvZnR3YXJlLWxpY2Vuc2UtYWdyZWVtZW50LylhbmQgW1Byb2R1Y3QtU3BlY2lmaWMgVGVybXMgZm9yIE5WSURJQSBBSSBQcm9kdWN0c10oaHR0cHM6Ly93d3cubnZpZGlhLmNvbS9lbi11cy9hZ3JlZW1lbnRzL2VudGVycHJpc2Utc29mdHdhcmUvcHJvZHVjdC1zcGVjaWZpYy10ZXJtcy1mb3ItYWktcHJvZHVjdHMvKS4gVXNlIG9mIHRoaXMgbW9kZWwgaXMgZ292ZXJuZWQgYnkgdGhlIFtOVklESUEgQ29tbXVuaXR5IE1vZGVsIExpY2Vuc2VdKGh0dHBzOi8vd3d3Lm52aWRpYS5jb20vZW4tdXMvYWdyZWVtZW50cy9lbnRlcnByaXNlLXNvZnR3YXJlL252aWRpYS1jb21tdW5pdHktbW9kZWxzLWxpY2Vuc2UvKS5cblxuIyMgUmVmZXJlbmNlcyBcblxuWzFdIGh0dHBzOi8vaHVnZ2luZ2ZhY2UuY28vZG9jcy90cmFuc2Zvcm1lcnMvZW4vbW9kZWxfZG9jL21iYXJ0XG5cbiMjIE1vZGVsIEFyY2hpdGVjdHVyZSBcblxuIyMjIEFyY2hpdGVjdHVyZSBUeXBlIDogXG5UcmFuc2Zvcm1lci1iYXNlZCB2aXNpb24tZW5jb2Rlci1kZWNvZGVyIG1vZGVsXG5cbiMjIyBOZXR3b3JrIEFyY2hpdGVjdHVyZSBcblxuVmlzaW9uIEVuY29kZXI6IFZpVC1IIG1vZGVsIChodHRwczovL2h1Z2dpbmdmYWNlLmNvL252aWRpYS9DLVJBRElPKVxuQWRhcHRlciBMYXllcjogMUQgY29udm9sdXRpb25zICYgbm9ybXMgdG8gY29tcHJlc3MgZGltZW5zaW9uYWxpdHkgYW5kIHNlcXVlbmNlIGxlbmd0aCBvZiB0aGUgbGF0ZW50IHNwYWNlICgxMjgwIHRva2VucyB0byAzMjAgdG9rZW5zKVxuRGVjb2RlcjogbUJhcnQgWzFdIDEwIGJsb2Nrc1xuVG9rZW5pemVyOiBHYWxhY3RpY2EgKGh0dHBzOi8vYXJ4aXYub3JnL2Ficy8yMjExLjA5MDg1KTsgc2FtZSBhcyBOb3VnYXQgdG9rZW5pemVyXG5cbiMjIyBJbnB1dCBcblxuSW5wdXQgVHlwZTogSW1hZ2UsIFRleHRcblxuSW5wdXQgVHlwZShzKTogUmVkLCBHcmVlbiwgQmx1ZSAoUkdCKSArIFByb21wdCAoU3RyaW5nKVxuXG5JbnB1dCBQYXJhbWV0ZXJzOiAyRCwgMURcblxuT3RoZXIgUHJvcGVydGllcyBSZWxhdGVkIHRvIElucHV0OlxuXG5NYXggSW5wdXQgUmVzb2x1dGlvbiAoV2lkdGgsIEhlaWdodCk6IDE2NDgsIDIwNDhcblxuTWluIElucHV0IFJlc29sdXRpb24gKFdpZHRoLCBIZWlnaHQpOiAxMDI0LCAxMjgwXG5cbkNoYW5uZWwgQ291bnQ6IDNcblxuIyMjIE91dHB1dCBcblxuT3V0cHV0IFR5cGU6IFRleHRcblxuT3V0cHV0IEZvcm1hdDogU3RyaW5nXG5cbk91dHB1dCBQYXJhbWV0ZXJzOiAxRFxuXG5PdGhlciBQcm9wZXJ0aWVzIFJlbGF0ZWQgdG8gT3V0cHV0OiBuZW1vcmV0cmlldmVyLXBhcnNlIG91dHB1dCBmb3JtYXQgaXMgYSBzdHJpbmcgd2hpY2ggZW5jb2RlcyB0ZXh0IGNvbnRlbnQgKGZvcm1hdHRlZCBvciBub3QpIGFzIHdlbGwgYXMgYm91bmRpbmcgYm94ZXMgYW5kIGNsYXNzIGF0dHJpYnV0ZXMuXG5cbiMjIFNvZnR3YXJlIEludGVncmF0aW9uXG5cblJ1bnRpbWUgR
W5naW5lKHMpOiBQeVRvcmNoXG5cblN1cHBvcnRlZCBIYXJkd2FyZSBQbGF0Zm9ybShzKTogTlZJRElBIEhvcHBlci9OVklESUEgQW1wZXJlL05WSURJQSBUdXJpbmdcblxuU3VwcG9ydGVkIE9wZXJhdGluZyBTeXN0ZW0ocyk6IExpbnV4XG5cbiMjIE1vZGVsIFZlcnNpb25cblxubmVtb3JldHJpZXZlci1wYXJzZTogQXMgcGFydCBvZiB0aGlzIGZpcnN0IHJlbGVhc2UsIHdlIHNoYXJlIHRoZSBzZXQgb2Ygd2VpZ2h0cyBuYW1lZCBvdmVyam95ZWQtYWRkZXIuXG5cbiMjIFRyYWluaW5nIERhdGFzZXQgXG5cblxubmVtb3JldHJpZXZlci1wYXJzZSBpcyBmaXJzdCBwcmUtdHJhaW5lZCBvbiBvdXIgaW50ZXJuYWwgZGF0YXNldHM6IGh1bWFuLCBzeW50aGV0aWMgYW5kIGF1dG9tYXRlZFxuXG5JbmZlcmVuY2VcbiMjIEluZmVyZW5jZSBcblxuUnVudGltZSBFbmdpbmUocyk6IFB5VG9yY2hcblxuVGVzdCBIYXJkd2FyZTogTlZJRElBIEgxMDAjIFN5bmNocm9uaXphdGlvblxuXG4jIyBFdGhpY2FsIENvbnNpZGVyYXRpb25zXG5cbk5WSURJQSBiZWxpZXZlcyBUcnVzdHdvcnRoeSBBSSBpcyBhIHNoYXJlZCByZXNwb25zaWJpbGl0eSBhbmQgd2UgaGF2ZSBlc3RhYmxpc2hlZCBwb2xpY2llcyBhbmQgcHJhY3RpY2VzIHRvIGVuYWJsZSBkZXZlbG9wbWVudCBmb3IgYSB3aWRlIGFycmF5IG9mIEFJIGFwcGxpY2F0aW9ucy4gV2hlbiBkb3dubG9hZGVkIG9yIHVzZWQgaW4gYWNjb3JkYW5jZSB3aXRoIG91ciB0ZXJtcyBvZiBzZXJ2aWNlLCBkZXZlbG9wZXJzIHNob3VsZCB3b3JrIHdpdGggdGhlaXIgc3VwcG9ydGluZyBtb2RlbCB0ZWFtIHRvIGVuc3VyZSB0aGlzIG1vZGVsIG1lZXRzIHJlcXVpcmVtZW50cyBmb3IgdGhlIHJlbGV2YW50IGluZHVzdHJ5IGFuZCB1c2UgY2FzZSBhbmQgYWRkcmVzc2VzIHVuZm9yZXNlZW4gcHJvZHVjdCBtaXN1c2UuXG5cblBsZWFzZSByZXBvcnQgc2VjdXJpdHkgdnVsbmVyYWJpbGl0aWVzIG9yIE5WSURJQSBBSSBDb25jZXJucyBoZXJlLlxuXG4qKllvdSBhcmUgcmVzcG9uc2libGUgZm9yIGVuc3VyaW5nIHRoYXQgeW91ciB1c2Ugb2YgTlZJRElBIEFJIEZvdW5kYXRpb24gTW9kZWxzIGNvbXBsaWVzIHdpdGggYWxsIGFwcGxpY2FibGUgbGF3cy4qKiIsCiAgICAiZGlzcGxheU5hbWUiOiAibmVtb3JldHJpZXZlci1wYXJzZSIsCiAgICAiZXhwbGFpbmFiaWxpdHkiOiAiIiwKICAgICJmcmFtZXdvcmsiOiAiT3RoZXIiLAogICAgImhhc1BsYXlncm91bmQiOiBmYWxzZSwKICAgICJoYXNTaWduZWRWZXJzaW9uIjogdHJ1ZSwKICAgICJpc1BsYXlncm91bmRFbmFibGVkIjogZmFsc2UsCiAgICAiaXNQdWJsaWMiOiBmYWxzZSwKICAgICJpc1JlYWRPbmx5IjogdHJ1ZSwKICAgICJsYWJlbHMiOiBbCiAgICAgICAgIk5TUEVDVC1VVDRVLTVCTkIiLAogICAgICAgICJhcHBsaWNhdGlvbjptb2RlbDp1c2NzX29iamVjdF9kZXRlY3Rpb24iLAogICAgICAgICJudmFpZTptb2RlbDpudmFpZV9zdXBwb3J0ZWQiLAogICAgICAgICJudmlkaWFfbmltOm1vZGVsOm5pbW1jcm9fbnZpZGlhX25pbSIsCiAgICAgICAgInByb2R1Y3ROYW1lczpuaW0tZGV2IiwKICAgICAgICAicHJvZHVjdE5hbWVzOm52LWFpLWVudGVycHJpc2UiLAogICAgICAgICJ0ZWNobm9sb2d5Om1vZGVsOnNvbG5fYXBwbGljYXRpb25fZGV2ZWxvcG1lbnQiLAogICAgICAgICJ0ZWNobm9sb2d5Om1vZGVsOnNvbG5faW5mZXJlbmNlIgogICAgXSwKICAgICJsYXRlc3RWZXJzaW9uSWRTdHIiOiAiY29udmVydGVkX2NoZWNrcG9pbnRfdjIiLAogICAgImxhdGVzdFZlcnNpb25TaXplSW5CeXRlcyI6IDU1ODY4NDEzOCwKICAgICJsb2dvIjogImh0dHBzOi8vYXNzZXRzLm5nYy5udmlkaWEuY29tL3Byb2R1Y3RzL2FwaS1jYXRhbG9nL2ltYWdlcy9uZW1vcmV0cmlldmVyLXBhcnNlLmpwZyIsCiAgICAibW9kZWxGb3JtYXQiOiAiZnAzMiIsCiAgICAibmFtZSI6ICJuZW1vcmV0cmlldmVyLXBhcnNlIiwKICAgICJvcmdOYW1lIjogIm5pbSIsCiAgICAicHJlY2lzaW9uIjogIk4vQSIsCiAgICAicHJpdmFjeSI6ICIiLAogICAgInByb2R1Y3ROYW1lcyI6IFsKICAgICAgICAibmltLWRldiIsCiAgICAgICAgIm52LWFpLWVudGVycHJpc2UiCiAgICBdLAogICAgInB1YmxpY0RhdGFzZXRVc2VkIjoge30sCiAgICAicHVibGlzaGVyIjogIk5WSURJQSIsCiAgICAic2FmZXR5QW5kU2VjdXJpdHkiOiAiIiwKICAgICJzaG9ydERlc2NyaXB0aW9uIjogIm5lbW9yZXRyaWV2ZXItcGFyc2UgaXMgYSB0aW55IGF1dG9yZWdyZXNzaXZlIFZpc2lvbiBMYW5ndWFnZSBNb2RlbCAoVkxNKSBkZXNpZ25lZCBmb3IgZG9jdW1lbnQgdHJhbnNjcmlwdGlvbiBmcm9tIGltYWdlcy4gSXQgb3V0cHV0cyB0ZXh0IGluIHJlYWRpbmcgb3JkZXIuIiwKICAgICJ0ZWFtTmFtZSI6ICJudmlkaWEiLAogICAgInVwZGF0ZWREYXRlIjogIjIwMjUtMDMtMTNUMDA6Mjk6MzAuNTMwWiIKfQ== source: URL: https://build.nvidia.com/nvidia/nemoretriever-parse optimizationProfiles: - profileId: nim/nvidia/nemoretriever-parse:a100x1-throughput-bf16-e9wjao-enw framework: TensorRT-LLM displayName: nemoretriever-parse A100 BF16 Throughput ngcMetadata: 
19c68819d9428cfa494e977f4d2be6378215a8f610cce9bdfc0aa3cdd7d66aa9: model: nvidia/nemoretriever-parse release: 1.2.0 tags: gpu: A100 gpu_device: 20b2:10de llm_engine: tensorrt_llm pp: '1' profile: throughput precision: bf16 tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: Throughput - key: PRECISION value: BF16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20b2:10de - key: NIM VERSION value: 1.2.0 - key: DOWNLOAD SIZE value: 600MB - profileId: nim/nvidia/nemoretriever-parse:h100x1-throughput-bf16-2apiazbpma framework: TensorRT-LLM displayName: nemoretriever-parse H100 BF16 Throughput ngcMetadata: 8db6dcd816ca1ce8d07e72d8b9c4682120b3c50799422361e35b4ab87820efd6: model: nvidia/nemoretriever-parse release: 1.2.0 tags: gpu: H100 gpu_device: 2330:10de llm_engine: tensorrt_llm pp: '1' profile: throughput precision: bf16 tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: Throughput - key: PRECISION value: BF16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10de - key: NIM VERSION value: 1.2.0 - key: DOWNLOAD SIZE value: 600MB - profileId: nim/nvidia/nemoretriever-parse:l40sx1-throughput-bf16-r98ogb1a1a framework: TensorRT-LLM displayName: nemoretriever-parse L40S BF16 Throughput ngcMetadata: 00c8a43783e7acf3d59a0d773cd78d3d29eaa71fa4412af7af2fbaf20e196a8b: model: nvidia/nemoretriever-parse release: 1.2.0 tags: gpu: L40S gpu_device: 26b5:10de llm_engine: tensorrt_llm pp: '1' profile: throughput precision: bf16 tp: '1' modelFormat: trt-llm spec: - key: PROFILE value: Throughput - key: PRECISION value: BF16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26b5:10de - key: NIM VERSION value: 1.2.0 - key: DOWNLOAD SIZE value: 600MB labels: - NeMo - Text Extraction - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Nemoretriever Graphic Elements V1 displayName: Nemoretriever Graphic Elements V1 modelHubID: nemoretriever-graphic-elements-v1 category: Object Detection type: NGC description: NVIDIA NeMo Retriever NIM for graphic elements v1 is a fine-tuned object detection model, trained specifically for detecting the elements of charts and tables in documents requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: Nemoretriever Graphic Elements V1 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "| Field | Response |\n| ----- | ----- |\n| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |\n| Measures taken to mitigate against unwanted bias | None |",
    "builtBy": "NVIDIA",
    "canGuestDownload": false,
    "createdDate": "2025-03-14T00:29:08.674Z",
    "description": "## **Model Overview**\n\n### **Description**\n\nThe **NeMo Retriever Graphic Elements v1** model is a specialized object detection system designed to identify and extract key elements from charts and graphs. Based on YOLOX, an anchor-free version of YOLO (You Only Look Once), this model combines a simpler architecture with enhanced performance. While the underlying technology builds upon work from [Megvii Technology](https://github.com/Megvii-BaseDetection/YOLOX), we developed our own base model through complete retraining rather than using pre-trained weights.\n\nThe model excels at detecting and localizing various graphic elements within chart images, including titles, axis labels, legends, and data point annotations. This capability makes it particularly valuable for document understanding tasks and automated data extraction from visual content.\n\nThis model is ready for commercial use and is a part of the NVIDIA NeMo Retriever family of NIM microservices specifically for object detection and multimodal extraction of enterprise documents.\n\nThis model supersedes the [CACHED](https://build.nvidia.com/university-at-buffalo/cached) model.\n\n### **License/Terms of use**\n\nUse of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/).\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n**Deployment Geography**: Global\n\n**Use Case**: <br>\n\nThis model is designed for automating extraction of graphic elements of charts in enterprise documents. Key applications include:\n- Enterprise document extraction, embedding and indexing\n- Augmenting Retrieval Augmented Generation (RAG) workflows with multimodal retrieval\n- Data extraction from legacy documents and reports\n\n**Release Date**: 2025-03-17\n\n### **Model Architecture**\n\n**Architecture type:** YOLOX <br>\n**Network architecture:** DarkNet53 Backbone \\+ FPN Decoupled head (one 1x1 convolution \\+ 2 parallel 3x3 convolutions (one for the classification and one for the bounding box prediction)\n\nYOLOX is a single-stage object detector that improves on Yolo-v3. The model is fine-tuned to detect 10 classes of objects in documents:\n1. Chart title\n1. X-axis title\n1. Y-axis title\n1. X-axis label(s)\n1. Y-axis label(s)\n1. Legend label(s)\n1. Legend title\n1. Markings and values labels\n1. 
Miscellaneous other texts on the chart\n\n## **Input**\n\n**Input type(s):** Image <br>\n**Input format(s):** Red, Green, Blue (RGB) <br>\n**Input parameters:** Two Dimensional (2D) <br>\n**Other properties related to input:** Expected input is a `np.ndarray` image of shape `[Channel, Width, Height]`, or an `np.ndarray` batch of image of shape `[Batch, Channel, Width, Height]`.\n\n## **Output**\n\n**Output type(s):** Text associated to each of the following classes : <br>\n* `[\"chart_title\", \"x_title\", \"y_title\", \"xlabel\", \"ylabel\", \"other\", \"legend_label\", \"legend_title\", \"mark_label\", \"value_label\"]`\n\n**Output format:** Dict of String <br>\n**Output parameters:** 1D <br>\n**Other properties related to output:** None\n\n### Software Integration\n\n**Runtime Engine**: **NeMo Retriever Graphic Elements v1** NIM <br>\n**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace <br>\n**Supported Operating System(s)**: Linux\n\n## **Model Version(s):**\n\n* `nemoretriever-graphic-elements-v1`\n\n## **Training Dataset:**\n\n* PubMed Central (PMC) Chart Dataset\n\n  * **Link:** [https://chartinfo.github.io/index\\_2022.html](https://chartinfo.github.io/index_2022.html)\n  * **Data collection method:** Automated, Human\n  * **Labeling method**: Human\n  * **Description:** A real-world dataset collected from PubMed Central Documents and manually annotated, released in the ICPR 2022 CHART-Infographic competition. There are 5,614 images for chart element detection, 4,293 images for final plot detection and data extraction, and 22,924 images for chart classification.\n\n* DeepRule dataset\n\n  * **Link:** [https://github.com/soap117/DeepRule](https://github.com/soap117/DeepRule)\n  * **Data collection method:** Automated, Human\n  * **Labeling method**: Distillation by the CACHED model\n  * **Description:** The original dataset consists of 386,966 chart images obtained by crawling public Excel sheets from the web with texts overwritten to protect privacy. The CACHED model is used to pseudo-label the relevant classes. We used a subsample of 9,091 charts where a title was detected for training alongside with the 5,614 PMC training images.\n\n## **Evaluation Results**\n\nResults were evaluated using the **PMC Chart dataset**. 
The **Mean Average Precision (mAP)** was used as the evaluation metric to measure the model's ability to correctly identify and localize objects across different confidence thresholds.\n\n### **Data Collection & Labeling**\n- **Data collection method:** **Hybrid (Automated & Human)**\n- **Labeling method:** **Hybrid (Automated & Human)**\n- **Properties:** The validation dataset is the same as the **PMC Chart dataset**.\n\n### **Dataset Overview**\n\n**Number of bounding boxes and images per class:**\n\n| **Label**       | **Images** | **Boxes** |\n|----------------|----------:|---------:|\n| **chart_title**   | 38  | 38  |\n| **legend_label**  | 318 | 1077 |\n| **legend_title**  | 17  | 19  |\n| **mark_label**    | 42  | 219  |\n| **other**         | 113 | 464  |\n| **value_label**   | 52  | 726  |\n| **x_title**       | 404 | 437  |\n| **xlabel**        | 553 | 4091 |\n| **y_title**       | 502 | 505  |\n| **ylabel**        | 534 | 3944 |\n| **Total**         | 560 | **11,520** |\n\n### **Per-Class Performance Metrics**\n\n#### **Average Precision (AP)**\n| **Class**       | **AP**   | **Class**       | **AP**   | **Class**      | **AP**   |\n|----------------|---------:|----------------|---------:|---------------|---------:|\n| **chart_title**  | 82.38  | **x_title**     | 88.77  | **y_title**    | 89.48  |\n| **xlabel**      | 85.04  | **ylabel**      | 86.22  | **other**      | 55.14  |\n| **legend_label** | 84.09  | **legend_title** | 60.61  | **mark_label** | 49.31  |\n| **value_label**  | 62.66  |                |        |               |        |\n\n#### **Average Recall (AR)**\n| **Class**       | **AR**   | **Class**       | **AR**   | **Class**      | **AR**   |\n|----------------|---------:|----------------|---------:|---------------|---------:|\n| **chart_title**  | 93.16  | **x_title**     | 92.31  | **y_title**    | 92.32  |\n| **xlabel**      | 88.93  | **ylabel**      | 89.40  | **other**      | 79.48  |\n| **legend_label** | 88.07  | **legend_title** | 68.42  | **mark_label** | 73.61  |\n| **value_label**  | 68.32  |                |        |               |        |\n\n\n## **Inference:**\n\n**Engine:** Tensor(RT) <br>\n**Test hardware:** Tested on all supported hardware listed in compatibility section\n\n## **Ethical Considerations:**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\n\n**For more detailed information on ethical considerations for this model**, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety-security.md), and [Privacy](privacy.md) Subcards.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n## Get Help\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or  submit a ticket at the [NVIDIA AI Enterprise Support Services page.](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n### NVIDIA NIM Documentation\nVisit the [NeMo Retriever docs page](https://docs.nvidia.com/nemo/retriever/index.html) for release documentation, deployment guides and more.",
    "displayName": "NeMo Retriever Graphic Elements v1",
    "explainability": "| Field | Response |\n| ----- | ----- |\n| Intended Application & Domain: | Object Detection |\n| Model Type: | YOLOX-architecture for detection of graphic elements within images of charts. |\n| Intended User: | Enterprise developers, data scientists, and other technical users who need to extract textual elements from charts and graphs. |\n| Output: | A List of dictionaries containing lists of dictionaries of floating point numbers (representing bounding box information). <br> **Example**: `{\"data\": [{\"index\": 0,\"bounding_boxes\": {\"table\": [{\"x_min\": 0.6503,\"y_min\": 0.2161,\"x_max\": 0.7835,\"y_max\": 0.3236,\"confidence\": 0.9306}]}}]}` |\n| Describe how the model works: | Finds and identifies objects in images by first dividing the image into a grid. For each section of the grid, the model uses a series of neural networks to extract visual features and simultaneously predict what objects are present (in this case \"chart title\" or \"axis label\" etc.) and exactly where they are located in that section, all in a single pass through the image. |\n| Performance Metrics: | Accuracy, Throughput, and Latency |\n| Potential Known Risks: | This model will not always detect all graphic elements in an image, especially for uncommon elements or lower quality images. |\n| Licensing & Terms of Use: | Use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/) and the [Apache 2.0 License](https://github.com/Megvii-BaseDetection/YOLOX/blob/main/LICENSE). |\n| Technical Limitations | The model may correctly detect graphic elements of charts, espectially on uncommon chart styles or lower quality images. |\n| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |\n| Verified to have met prescribed NVIDIA quality standards: | Yes |",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-3A0Q-P34G",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx1-trt-fp16-1ted4cchma",
    "latestVersionSizeInBytes": 281012246,
    "logo": "https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/nemo-retriever-graphic.png",
    "modelFormat": "Triton",
    "name": "nemoretriever-graphic-elements-v1",
    "orgName": "nim",
    "precision": "FP32",
    "privacy": "| Field | Response |\n| ----- | ----- |\n| Generatable or reverse engineerable personal data? | No |\n| Personal data used to create this model? | None |\n| How often is the dataset reviewed? | Before Every Release |\n| Is a mechanism in place to honor data subject right of access or deletion of personal data? | No |\n| Is there provenance for all datasets used in training? | Yes |\n| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "| Field | Response |\n| ----- | ----- |\n| Model Application(s): | Object Detection for Retrieval, focused on Enterprise |\n| Describe the physical safety impact (if present). | Not Applicable |\n| Use Case Restrictions: | Abide by [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |\n| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. |",
    "shortDescription": "NVIDIA NeMo\u2122 Retriever NIM for graphic elements v1 is a fine-tuned object detection model, trained specifically for detecting the elements of charts and tables in documents",
    "teamName": "nvidia",
    "updatedDate": "2025-10-15T18:43:31.513Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nemoretriever-graphic-elements-v1 optimizationProfiles: - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:a10gx1-trt-fp16-nwnqycg0xg framework: TensorRT-LLM displayName: Nemoretriever Graphic Elements V1 A10Gx1 FP16 ngcMetadata: 09231248dff89cf8859d9206931342e468fbddfe469df56334fbe00df7fda1da: model: nvidia/nemoretriever-graphic-elements-v1 release: 1.6.0 tags: backend: triton compute_capability: '8.6' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A10G - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:b200x1-trt-fp16-jtagdygnhq framework: TensorRT-LLM displayName: Nemoretriever Graphic Elements V1 B200x1 FP16 ngcMetadata: 0f3b150544da8a053048c1e2a37a282b2c43f09a99253882578d12bc1f2cfca6: model: nvidia/nemoretriever-graphic-elements-v1 release: 1.6.0 tags: backend: triton compute_capability: '10.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: B200 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:h100x1-trt-fp16-jswckvqtmq framework: TensorRT-LLM displayName: Nemoretriever Graphic Elements V1 H100x1 FP16 ngcMetadata: 58adeef41afa742e753314ae51818e9f017f2c92ba0bfdc01befe6234703a54c: model: nvidia/nemoretriever-graphic-elements-v1 release: 1.6.0 tags: backend: triton compute_capability: '9.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:rtx6000-blackwell-svx1-trt-fp16-1ted4cchma framework: TensorRT-LLM displayName: Nemoretriever Graphic Elements V1 RTX6000x1 FP16 ngcMetadata: b4cc2f8b3d2dcf1afdcafbee8ea694c53aeee642d76f709ed0e79477b68a8dde: model: nvidia/nemoretriever-graphic-elements-v1 release: 1.6.0 tags: backend: triton compute_capability: '12.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: RTX6000 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:l40sx1-trt-fp16-cwnvuqbbna framework: TensorRT-LLM displayName: Nemoretriever Graphic Elements V1 L40Sx1 FP16 ngcMetadata: bc1487bf0ec3430f17595fff029c1bc50668344c7a30f9e5d64ee061c6e2d5fa: model: nvidia/nemoretriever-graphic-elements-v1 release: 1.6.0 tags: backend: triton compute_capability: '8.9' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L40S - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:2_ONNX_FP16_1024 framework: ONNX displayName: Nemoretriever Graphic Elements V1 ONNX FP16 ngcMetadata: edc693c6fccd68d266622eace04225421e353d7ce31e3b207afc5ff35124127b: model: nvidia/nemoretriever-graphic-elements-v1 
release: 1.6.0 tags: backend: triton model_type: onnx precision: fp16 modelFormat: onnx spec: - key: PRECISION value: FP16 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: ONNX - profileId: nim/nvidia/nemoretriever-graphic-elements-v1:a100x1-trt-fp16-qpwy-4niaa framework: TensorRT-LLM displayName: Nemoretriever Graphic Elements V1 A100x1 FP16 ngcMetadata: f0fb2f72a66230096c40fc3307872ebb9bce69816cbfc6e2918695ca824bd284: model: nvidia/nemoretriever-graphic-elements-v1 release: 1.6.0 tags: backend: triton compute_capability: '8.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT labels: - signed images - NVIDIA AI Enterprise Supported - NVIDIA NIM - NSPECT-7OBP-T77C config: architectures: - Other modelType: NGC license: NVIDIA AI Foundation Models Community License - name: Nemoretriever Page Elements V2 displayName: Nemoretriever Page Elements V2 modelHubID: nemoretriever-page-elements-v2 category: Object Detection type: NGC description: NVIDIA NeMo Retriever NIM for page elements v2 is a fine-tuned object detection model, trained specifically for detecting charts, tables, infographics, and titles on a document page. requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: Nemoretriever Page Elements V2 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "| Field | Response |\n| ----- | ----- |\n| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |\n| Measures taken to mitigate against unwanted bias | None |",
    "builtBy": "NVIDIA",
    "canGuestDownload": false,
    "createdDate": "2025-03-14T00:47:01.563Z",
    "description": "## Model Overview\n\n### Description\n\nThe **NeMo Retriever Page Elements v2** model is a specialized object detection model designed to identify and extract key elements from charts and graphs. While the underlying technology builds upon work from [Megvii Technology](https://github.com/Megvii-BaseDetection/YOLOX), we developed our own base model through complete retraining rather than using pre-trained weights. YOLOX is an anchor-free version of YOLO (You Only Look Once), this model combines a simpler architecture with enhanced performance. The model is trained to detect **tables**, **charts**, **infographics**, and **titles** in documents.\n\nThis model supersedes the [nv-yolox-page-elements](https://build.nvidia.com/nvidia/nv-yolox-page-elements-v1) model.\n\nThis model is ready for commercial use and is a part of the NVIDIA NeMo Retriever family of NIM microservices specifically for object detection and multimodal extraction of enterprise documents.\n\n### License/Terms of use\n\nThe use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n### Model Architecture\n\n**Architecture Type**: YOLOX <br>\n**Network Architecture**: DarkNet53 Backbone \\+ FPN Decoupled head (one 1x1 convolution \\+ 2 parallel 3x3 convolutions (one for the classification and one for the bounding box prediction). YOLOX is a single-stage object detector that improves on Yolo-v3. <br>\n**Deployment Geography**: Global\n\n**Use Case**: <br>\nThis model is designed for automating extraction of charts, tables, infographics, and titles in enterprise documents. Key applications include:\n- Enterprise document extraction, embedding and indexing\n- Augmenting Retrieval Augmented Generation (RAG) workflows with multimodal retrieval\n- Data extraction from legacy documents and reports\n\n**Release Date**: 2025-03-17\n\n### Intended use\n\nThe **NeMo Retriever Page Elements v2** model is suitable for users who want to extract, and ultimately retrieve, tables, charts and infographics. It can be used for document analysis, understanding and processing.\n\n## Technical Details\n\n### Input\n\n**Input Type(s)**: Image <br>\n**Input Format(s)**: Red, Green, Blue (RGB) <br>\n**Input Parameters**: Two Dimensional (2D)<br>\n**Other Properties Related to Input**: Image size resized to `(1024, 1024)`\n\n### Output\n\n**Output Type(s)**: Array <br>\n**Output Format**: A dictionary of dictionaries containing `np.ndarray`. The outer dictionary contains each sample (page). Inner dictionary contains list of dictionaries with bounding boxes, class, and confidence for that page <br>\n**Output Parameters**: 1D <br>\n**Other Properties Related to Output**: Output contains Bounding box, detection confidence and object class (chart, table, infographic, title). 
Thresholds used for non-maximum suppression `conf_thresh = 0.01`; `iou_thresh = 0.5` <br>\n**Output Classes**: <br>\n  * Table\n    * Data structured in rows and columns\n  * Chart\n    * Specifically bar charts, line charts, or pie charts\n  * Infographic\n    * Visual representations of information that is more complex than a chart, including diagrams and flowcharts\n    * Maps are _not_ considered infographics\n  * Title\n    * Titles can be page titles, section titles, or table/chart/infographic titles\n\n### Software Integration\n\n**Runtime**: **NeMo Retriever Page Elements v2** NIM <br>\n**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace <br>\n**Supported Operating System(s)**: Linux <br>\n\n## Model Version(s):\n\n* `nemoretriever-page-elements-v2`\n\n## Training Dataset & Evaluation\n\n### Training Dataset\n\n**Data collection method by dataset**: Automated <br>\n**Labeling method by dataset**: Hybrid: Automated, Human <br>\n**Pretraining (by NVIDIA)**: 118,287 images of the [COCO train2017](https://cocodataset.org/#download) dataset <br>\n**Finetuning (by NVIDIA)**: 36,093 images from [Digital Corpora dataset](https://digitalcorpora.org/), with annotations from [Azure AI Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence) and data annotation team <br>\n**Number of bounding boxes per class**: 35,328 tables, 44,178 titles, 11,313 charts and 6,500 infographics. The layout model of Document Intelligence was used with `2024-02-29-preview` API version.\n\n### Evaluation Results\n\nThe primary evaluation set is a cut of the Azure labels and digital corpora images. Number of bounding boxes per class: 1,483 tables, 1,965 titles, 404 charts and 500 infographics. Mean Average Precision (mAP) was used as an evaluation metric, which measures the model's ability to correctly identify and localize objects across different confidence thresholds.\n\n**Data collection method by dataset**: Hybrid: Automated, Human <br>\n**Labeling method by dataset**: Hybrid: Automated, Human <br>\n**Properties**: We evaluated with Azure labels from manually selected pages, as well as manual inspection on public PDFs and powerpoint slides.\n\n**Per-class Performance Metrics**:\n| Class       | AP (%) | AR (%) |\n|:------------|:-------|:-------|\n| table       | 45.619 | 69.814 |\n| chart       | 53.419 | 75.755 |\n| title       | 45.116 | 65.245 |\n| infographic | 96.591 | 97.400 |\n\n\n## Inference:\n\n**Engine**: TensorRT <br>\n**Test hardware**: See Support Matrix from NIM documentation\n\n## Ethical Considerations\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. 
When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\n\n**For more detailed information on ethical considerations for this model**, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety-security.md), and [Privacy](privacy.md) Subcards.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n## Get Help\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or  submit a ticket at the [NVIDIA AI Enterprise Support Services page.](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n### NVIDIA NIM Documentation\nVisit the [NeMo Retriever docs page](https://docs.nvidia.com/nemo/retriever/index.html) for release documentation, deployment guides and more.",
    "displayName": "NeMo Retriever Page Elements v2",
    "explainability": "| Field | Response |\n| ----- | ----- |\n| Intended Application & Domain: | Document Understanding |\n| Model Type: | YOLOX Object Detection for Charts, Tables, Infographics, and Titles |\n| Intended User: | Enterprise developers, data scientists, and other technical users who need to extract structural elements from documents. |\n| Output: | A List of dictionaries containing lists of dictionaries of floating point numbers (representing bounding box information). <br> **Example**: `{\"data\": [{\"index\": 0,\"bounding_boxes\": {\"table\": [{\"x_min\": 0.6503,\"y_min\": 0.2161,\"x_max\": 0.7835,\"y_max\": 0.3236,\"confidence\": 0.9306}]}}]}` |\n| Describe how the model works: | Finds and identifies objects in images by first dividing the image into a grid. For each section of the grid, the model uses a series of neural networks to extract visual features and simultaneously predict what objects are present (in this case \"chart\" or \"table\" etc.) and exactly where they are located in that section, all in a single pass through the image. |\n| Potential Known Risks: | This model may not always detect all elements in a document. |\n| Licensing & Terms of Use: | Use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/) and the [Apache 2.0 License](https://github.com/Megvii-BaseDetection/YOLOX/blob/main/LICENSE). |\n| Technical Limitations | The model may not generalize to unknown document types/formats not commonly found on the web. |\n| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |\n| Verified to have met prescribed NVIDIA quality standards: | Yes |",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-AY6A-LXVV",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx1-trt-fp16-xd1wfged5w",
    "latestVersionSizeInBytes": 277309078,
    "logo": "https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/nemo-retriever-graphic.png",
    "modelFormat": "Triton",
    "name": "nemoretriever-page-elements-v2",
    "orgName": "nim",
    "precision": "FP16",
    "privacy": "| Field | Response |\n| ----- | ----- |\n| Generatable or reverse engineerable personal data? | No |\n| Personal data used to create this model? | None |\n| How often is the dataset reviewed? | Before Every Release |\n| Is a mechanism in place to honor data subject right of access or deletion of personal data? | No |\n| Is there provenance for all datasets used in training? | Yes |\n| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "| Field | Response |\n| ----- | ----- |\n| Model Application(s): | Object Detection for Retrieval, focused on Enterprise |\n| Describe the physical safety impact (if present). | Not Applicable |\n| Use Case Restrictions: | Abide by [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |\n| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. |",
    "shortDescription": "NVIDIA NeMo\u2122 Retriever NIM for page elements v2 is a fine-tuned object detection model, trained specifically for detecting charts, tables, infographics, and titles on a document page.",
    "teamName": "nvidia",
    "updatedDate": "2025-10-15T18:43:49.378Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nemoretriever-page-elements-v2 optimizationProfiles: - profileId: nim/nvidia/nemoretriever-page-elements-v2:a10gx1-trt-fp16-toixhuroha framework: TensorRT-LLM displayName: Nemoretriever Page Elements V2 A10Gx1 FP16 ngcMetadata: 09231248dff89cf8859d9206931342e468fbddfe469df56334fbe00df7fda1da: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton compute_capability: '8.6' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A10G - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-page-elements-v2:b200x1-trt-fp16-ayukzqdapq framework: TensorRT-LLM displayName: Nemoretriever Page Elements V2 B200x1 FP16 ngcMetadata: 0f3b150544da8a053048c1e2a37a282b2c43f09a99253882578d12bc1f2cfca6: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton compute_capability: '10.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: B200 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-page-elements-v2:h100x1-trt-fp16-nuq3ijukrw framework: TensorRT-LLM displayName: Nemoretriever Page Elements V2 H100x1 FP16 ngcMetadata: 58adeef41afa742e753314ae51818e9f017f2c92ba0bfdc01befe6234703a54c: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton compute_capability: '9.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-page-elements-v2:rtx6000-blackwell-svx1-trt-fp16-xd1wfged5w framework: TensorRT-LLM displayName: Nemoretriever Page Elements V2 RTX6000x1 FP16 ngcMetadata: b4cc2f8b3d2dcf1afdcafbee8ea694c53aeee642d76f709ed0e79477b68a8dde: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton compute_capability: '12.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: RTX6000 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-page-elements-v2:l40sx1-trt-fp16-qnrq36wfcw framework: TensorRT-LLM displayName: Nemoretriever Page Elements V2 L40Sx1 FP16 ngcMetadata: bc1487bf0ec3430f17595fff029c1bc50668344c7a30f9e5d64ee061c6e2d5fa: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton compute_capability: '8.9' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L40S - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-page-elements-v2:a100x1-onnx-fp16-wagmq6-x1q framework: ONNX displayName: Nemoretriever Page Elements V2 ONNX FP16 ngcMetadata: edc693c6fccd68d266622eace04225421e353d7ce31e3b207afc5ff35124127b: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton model_type: 
onnx precision: fp16 modelFormat: onnx spec: - key: PRECISION value: FP16 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: ONNX - profileId: nim/nvidia/nemoretriever-page-elements-v2:a100x1-trt-fp16-yukvwcfl5q framework: TensorRT-LLM displayName: Nemoretriever Page Elements V2 A100x1 FP16 ngcMetadata: f0fb2f72a66230096c40fc3307872ebb9bce69816cbfc6e2918695ca824bd284: model: nvidia/nemoretriever-page-elements-v2 release: 1.6.0 tags: backend: triton compute_capability: '8.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT labels: - signed images - NSPECT-7OBP-T77C - NVIDIA AI Enterprise Supported - NVIDIA NIM config: architectures: - Other modelType: NIM license: NVIDIA AI Foundation Models Community License - name: Nemoretriever Table Structure V1 displayName: Nemoretriever Table Structure V1 modelHubID: nemoretriever-table-structure-v1 category: Object Detection type: NGC description: NVIDIA NeMo Retriever NIM for table structure v1 is a fine-tuned object detection model, trained specifically for detecting the structure of complex tables. requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: Nemoretriever Table Structure V1 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "| Field | Response |\n| ----- | ----- |\n| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |\n| Measures taken to mitigate against unwanted bias | None |",
    "builtBy": "NVIDIA",
    "canGuestDownload": false,
    "createdDate": "2025-03-14T00:35:30.879Z",
    "description": "## Model Overview\n\n### Description\n\nThe **NeMo Retriever Table Structure v1** model is a specialized object detection model designed to identify and extract the structure of tables in images. Based on YOLOX, an anchor-free version of YOLO (You Only Look Once), this model combines a simpler architecture with enhanced performance. While the underlying technology builds upon work from [Megvii Technology](https://github.com/Megvii-BaseDetection/YOLOX), we developed our own base model through complete retraining rather than using pre-trained weights.\n\nThe model excels at detecting and localizing the fundamental structural elements within tables. Through careful fine-tuning, it can accurately identify and delineate three key components within tables:\n\n1. Individual cells (including merged cells)\n2. Rows\n3. Columns\n\nThis specialized focus on table structure enables precise decomposition of complex tables into their constituent parts, forming the foundation for downstream retrieval tasks. This model helps convert tables into the markdown format which can improve retrieval accuracy.\n\nThis model is ready for commercial use and is a part of the NVIDIA NeMo Retriever family of NIM microservices specifically for object detection and multimodal extraction of enterprise documents.\n\n### License/Terms of use\n\nThe use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n### Model Architecture\n\n**Architecture Type**: YOLOX <br>\n**Network Architecture**: DarkNet53 Backbone \\+ FPN Decoupled head (one 1x1 convolution \\+ 2 parallel 3x3 convolutions (one for the classification and one for the bounding box prediction). The YOLOX architecture is a single-stage object detector that improves on Yolo-v3. <br>\n**Deployment Geography**: Global <br>\n\n**Use Case**: <br>\nThis model specializes in analyzing images containing tables by:\n- Detecting and extracting table structure elements (rows, columns, and cells)\n- Providing precise location information for each detected element\n- Supporting downstream tasks like table analysis and data extraction\n\nThe model is designed to work in conjunction with OCR (Optical Character Recognition) systems to:\n1. Identify the structural layout of tables\n2. Preserve the relationships between table elements\n3. Enable accurate extraction of tabular data from images\n\nIdeal for:\n- Document processing systems\n- Automated data extraction pipelines\n- Digital content management solutions\n- Business intelligence applications\n\n**Release Date**: 2025-03-17\n\n## Technical Details\n\n### Input\n\n**Input type(s)**: Image <br>\n**Input format(s)**: Red, Green, Blue (RGB) <br>\n**Input parameters**: Two Dimensional (2D) <br>\n**Other properties related to input**: Image size resized to `(1024, 1024)`\n\n### Output\n\n**Output Type(s)**: Array <br>\n**Output Format**: A dictionary of dictionaries containing `np.ndarray` objects. The outer dictionary contains each sample (table). Inner dictionary contains list of dictionaries with bounding boxes, class, and confidence for that table <br>\n**Output Parameters**: 1D <br>\n**Other Properties Related to Output**: Output contains Bounding box, detection confidence and object class (cell, row, column). 
Thresholds used for non-maximum suppression `conf_thresh = 0.01`; `iou_thresh = 0.25`\n\n### Software Integration\n\n**Runtime**: **NeMo Retriever Table Structure v1** NIM <br>\n**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace <br>\n**Supported Operating System(s)**: Linux\n\n## Model Version(s):\n\n* `nemoretriever-table-structure-v1`\n\n## Training Dataset & Evaluation\n\n### Training Dataset\n\n**Data collection method by dataset**: Automated <br>\n**Labeling method by dataset**: Automated <br>\n**Pretraining**: [COCO train2017](https://cocodataset.org/#download)\n**Finetuning (by NVIDIA)**: 23,977 images from [Digital Corpora dataset](https://digitalcorpora.org/), with annotations from [Azure AI Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/ai-document-intelligence).\nNumber of bounding boxes per class: 1,828,978 cells, 134,089 columns and 316,901 rows. The layout model of Document Intelligence was used with `2024-02-29-preview` API version.\n\n### Evaluation Results\n\n**The primary evaluation set**: 2,459 digital corpora images with Azure labels. Number of bounding boxes per class: 200,840 cells, 13,670 columns and 34,575 rows. mAP was used as an evaluation metric. <br>\n**Data collection method by dataset**: Hybrid: Automated, Human <br>\n**Labeling method by dataset**: Hybrid: Automated, Human <br>\n**Properties**: We evaluated with Azure labels from manually selected pages, as well as manual inspection on public PDFs and powerpoint slides.\n\n**Per-class Performance Metrics**:\n| Class  | Average Precision (%) | Average Recall (%) |\n|:-------|:----------------------|:------------------|\n| cell   | 58.365                | 60.647            |\n| row    | 76.992                | 81.115            |\n| column | 85.293                | 87.434            |\n\n## Inference:\n\n**Engine**: TensorRT. <br>\n**Test hardware**: See Support Matrix from NIM documentation.\n\n## Ethical Considerations\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\n\n**For more detailed information on ethical considerations for this model**, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety-security.md), and [Privacy](privacy.md) Subcards.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n## Get Help\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or  submit a ticket at the [NVIDIA AI Enterprise Support Services page.](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n### NVIDIA NIM Documentation\nVisit the [NeMo Retriever docs page](https://docs.nvidia.com/nemo/retriever/index.html) for release documentation, deployment guides and more.",
    "displayName": "NeMo Retriever Table Structure v1",
    "explainability": "| Field | Response |\n| ----- | ----- |\n| Intended Application & Domain: | Object Detection |\n| Model Type: | YOLOX-architecture for detection of table structure within images of tables. |\n| Intended User: | Enterprise developers, data scientists, and other technical users who need to extract table structure from images. |\n| Output: | A List of dictionaries containing lists of dictionaries of floating point numbers (representing bounding box information). <br> **Example**: `{\"data\": [{\"index\": 0,\"bounding_boxes\": {\"table\": [{\"x_min\": 0.6503,\"y_min\": 0.2161,\"x_max\": 0.7835,\"y_max\": 0.3236,\"confidence\": 0.9306}]}}]}` |\n| Describe how the model works: | Finds and identifies objects in images by first dividing the image into a grid. For each section of the grid, the model uses a series of neural networks to extract visual features and simultaneously predict what objects are present (in this case \"cell\", \"row\", or \"column\") and exactly where they are located in that section, all in a single pass through the image. |\n| Potential Known Risks: | This model does not always guarantee to retrieve the correct table structure for a given image. |\n| Licensing & Terms of Use: | Use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/). |\n| Technical Limitations | The model may correctly detect table elements, espectially on uncommon table styles or lower quality images. |\n| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |\n| Verified to have met prescribed NVIDIA quality standards: | Yes |",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-K056-3HWE",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx6000-blackwell-svx1-trt-fp16-unyitj7ofa",
    "latestVersionSizeInBytes": 277219598,
    "logo": "https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/nemo-retriever-graphic.png",
    "modelFormat": "Triton",
    "name": "nemoretriever-table-structure-v1",
    "orgName": "nim",
    "precision": "FP32",
    "privacy": "| Field | Response |\n| ----- | ----- |\n| Generatable or reverse engineerable personal data? | No |\n| Personal data used to create this model? | None |\n| How often is the dataset reviewed? | Before Every Release |\n| Is a mechanism in place to honor data subject right of access or deletion of personal data? | No |\n| Is there provenance for all datasets used in training? | Yes |\n| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "| Field | Response |\n| ----- | ----- |\n| Model Application(s): | Object Detection for Retrieval, focused on Enterprise |\n| Describe the physical safety impact (if present). | Not Applicable |\n| Use Case Restrictions: | Abide by [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |\n| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. |",
    "shortDescription": "NVIDIA NeMo\u2122 Retriever NIM for table structure v1 is a fine-tuned object detection model, trained specifically for detecting the structure of complex tables.",
    "teamName": "nvidia",
    "updatedDate": "2025-10-15T18:44:05.845Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/nemoretriever-table-structure-v1 optimizationProfiles: - profileId: nim/nvidia/nemoretriever-table-structure-v1:a10gx1-trt-fp16-ncblfgrrew framework: TensorRT-LLM displayName: Nemoretriever Table Structure V1 A10Gx1 FP16 ngcMetadata: 09231248dff89cf8859d9206931342e468fbddfe469df56334fbe00df7fda1da: model: nvidia/nemoretriever-table-structure-v1 release: 1.6.0 tags: backend: triton compute_capability: '8.6' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A10G - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-table-structure-v1:b200x1-trt-fp16--ce2boy2vw framework: TensorRT-LLM displayName: Nemoretriever Table Structure V1 B200x1 FP16 ngcMetadata: 0f3b150544da8a053048c1e2a37a282b2c43f09a99253882578d12bc1f2cfca6: model: nvidia/nemoretriever-table-structure-v1 release: 1.6.0 tags: backend: triton compute_capability: '10.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: B200 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-table-structure-v1:h100x1-trt-fp16-lnq0nmbo3g framework: TensorRT-LLM displayName: Nemoretriever Table Structure V1 H100x1 FP16 ngcMetadata: 58adeef41afa742e753314ae51818e9f017f2c92ba0bfdc01befe6234703a54c: model: nvidia/nemoretriever-table-structure-v1 release: 1.6.0 tags: backend: triton compute_capability: '9.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-table-structure-v1:rtx6000-blackwell-svx1-trt-fp16-unyitj7ofa framework: TensorRT-LLM displayName: Nemoretriever Table Structure V1 RTX6000x1 FP16 ngcMetadata: b4cc2f8b3d2dcf1afdcafbee8ea694c53aeee642d76f709ed0e79477b68a8dde: model: nvidia/nemoretriever-table-structure-v1 release: 1.6.0 tags: backend: triton compute_capability: '12.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: RTX6000 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-table-structure-v1:l40sx1-trt-fp16-ddoabmkana framework: TensorRT-LLM displayName: Nemoretriever Table Structure V1 L40Sx1 FP16 ngcMetadata: bc1487bf0ec3430f17595fff029c1bc50668344c7a30f9e5d64ee061c6e2d5fa: model: nvidia/nemoretriever-table-structure-v1 release: 1.6.0 tags: backend: triton compute_capability: '8.9' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L40S - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/nemoretriever-table-structure-v1:a100x1-onnx-fp16-l8hnwsbr3g framework: ONNX displayName: Nemoretriever Table Structure V1 ONNX FP16 ngcMetadata: edc693c6fccd68d266622eace04225421e353d7ce31e3b207afc5ff35124127b: model: nvidia/nemoretriever-table-structure-v1 release: 
1.6.0 tags: backend: triton model_type: onnx precision: fp16 modelFormat: onnx spec: - key: PRECISION value: FP16 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: ONNX - profileId: nim/nvidia/nemoretriever-table-structure-v1:a100x1-trt-fp16-jvvssvik-q framework: TensorRT-LLM displayName: Nemoretriever Table Structure V1 A100x1 FP16 ngcMetadata: f0fb2f72a66230096c40fc3307872ebb9bce69816cbfc6e2918695ca824bd284: model: nvidia/nemoretriever-table-structure-v1 release: 1.6.0 tags: backend: triton compute_capability: '8.0' model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100 - key: COUNT value: 1 - key: NIM VERSION value: 1.6.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT labels: - signed images - NSPECT-7OBP-T77C - NVIDIA AI Enterprise Supported - NVIDIA NIM config: architectures: - Other modelType: NIM license: NVIDIA AI Foundation Models Community License - name: PaddleOCR displayName: PaddleOCR modelHubID: paddleocr category: Optical Character Recognition type: NGC description: PaddleOCR is an ultra lightweight Optical Character Recognition (OCR) system by Baidu. PaddleOCR supports a variety of cutting-edge algorithms related to OCR. requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: PaddleOCR modelCard: ewogICAgImFjY2Vzc1R5cGUiOiAiTk9UX0xJU1RFRCIsCiAgICAiYXBwbGljYXRpb24iOiAiT3RoZXIiLAogICAgImNhbkd1ZXN0RG93bmxvYWQiOiBmYWxzZSwKICAgICJjcmVhdGVkRGF0ZSI6ICIyMDI1LTAyLTE5VDE2OjEwOjE3LjA3NVoiLAogICAgImRlc2NyaXB0aW9uIjogIiMjIE1vZGVsIE92ZXJ2aWV3XG5cbiMjIyBEZXNjcmlwdGlvblxuXG5QYWRkbGVPQ1IgaXMgYW4gdWx0cmEtbGlnaHR3ZWlnaHQgT3B0aWNhbCBDaGFyYWN0ZXIgUmVjb2duaXRpb24gKE9DUikgc3lzdGVtIGRldmVsb3BlZCBieSBCYWlkdS4gSXQgc3VwcG9ydHMgYSB2YXJpZXR5IG9mIGN1dHRpbmctZWRnZSBPQ1IgYWxnb3JpdGhtcyBhbmQgcHJvdmlkZXMgdmFsdWUgYXQgZXZlcnkgc3RhZ2Ugb2YgdGhlIEFJIHBpcGVsaW5lLCBpbmNsdWRpbmcgZGF0YSBnZW5lcmF0aW9uLCBtb2RlbCB0cmFpbmluZywgYW5kIGluZmVyZW5jZS5cblxuVGhpcyBtb2RlbCBpcyByZWFkeSBmb3IgY29tbWVyY2lhbCB1c2UuXG5cbiMjIFRoaXJkLVBhcnR5IENvbW11bml0eSBDb25zaWRlcmF0aW9uXG5cblRoaXMgbW9kZWwgaXMgbm90IG93bmVkIG9yIGRldmVsb3BlZCBieSBOVklESUEuIFRoaXMgbW9kZWwgaGFzIGJlZW4gZGV2ZWxvcGVkIGFuZCBidWlsdCB0byBhIHRoaXJkLXBhcnR5XHUyMDE5cyByZXF1aXJlbWVudHMgZm9yIHRoaXMgYXBwbGljYXRpb24gYW5kIHVzZSBjYXNlOyBzZWUgbGluayB0byBOb24tTlZJRElBIFtQYWRkbGVPQ1IgVG9vbGtpdF0oaHR0cHM6Ly9naXRodWIuY29tL1BhZGRsZVBhZGRsZS9QYWRkbGVPQ1IpLlxuXG4jIyMgVGVybXMgb2YgdXNlXG5UaGUgdXNlIG9mIHRoaXMgbW9kZWwgaXMgZ292ZXJuZWQgYnkgdGhlIFtOVklESUEgQUkgRm91bmRhdGlvbiBNb2RlbHMgQ29tbXVuaXR5IExpY2Vuc2UgQWdyZWVtZW50XShodHRwczovL3d3dy5udmlkaWEuY29tL2VuLXVzL2FncmVlbWVudHMvZW50ZXJwcmlzZS1zb2Z0d2FyZS9udmlkaWEtY29tbXVuaXR5LW1vZGVscy1saWNlbnNlLykuIFBhZGRsZU9DUiBpcyBsaWNlbnNlZCB1bmRlciAgW0FwYWNoZS0yXShodHRwczovL3d3dy5hcGFjaGUub3JnL2xpY2Vuc2VzL0xJQ0VOU0UtMi4wKS5cblxuKipZb3UgYXJlIHJlc3BvbnNpYmxlIGZvciBlbnN1cmluZyB0aGF0IHlvdXIgdXNlIG9mIG1vZGVscyBjb21wbGllcyB3aXRoIGFsbCBhcHBsaWNhYmxlIGxhd3MuKipcblxuIyMjIFJlZmVyZW5jZXNcbltHaXRodWJdKGh0dHBzOi8vZ2l0aHViLmNvbS9QYWRkbGVQYWRkbGUvUGFkZGxlT0NSL2Jsb2IvbWFpbi9SRUFETUVfZW4ubWQpXG5bQXJ4aXZdKGh0dHBzOi8vYXJ4aXYub3JnL2Ficy8yMjA2LjAzMDAxKVxuXG5cbiMjIE1vZGVsIEFyY2hpdGVjdHVyZVxuKipBcmNoaXRlY3R1cmUgVHlwZSBmb3IgVGV4dCBEZXRlY3Rvcjo
qKiBDTk4gPGJyPlxuKipOZXR3b3JrIEFyY2hpdGVjdHVyZSBmb3IgVGV4dCBEZXRlY3RvcjoqKiAgTEstUEFOXG5cbioqQXJjaGl0ZWN0dXJlIFR5cGUgZm9yIFRleHQgUmVjb2duaXRpb246KiogSHlicmlkIFRyYW5zZm9ybWVyIENOTiAgPGJyPlxuKipOZXR3b3JrIEFyY2hpdGVjdHVyZSBmb3IgVGV4dCBSZWNvZ25pdGlvbjoqKiBTVlRSLUxDTmV0IChOUlRSIEhlYWQgYW5kIENUQ0xvc3MgaGVhZCkgPGJyPlxuXG4jIyBJbnB1dFxuKipJbnB1dCBUeXBlKHMpOioqIEltYWdlIDxicj5cbioqSW5wdXQgRm9ybWF0KHMpOioqIFJlZCwgR3JlZW4sIEJsdWUgKFJHQikgPGJyPlxuKipJbnB1dCBQYXJhbWV0ZXJzOioqIFR3byBEaW1lbnNpb25hbCAoMkQpIDxicj5cbioqU3VwcG9ydGVkIExhbmd1YWdlczoqKiBFbmdsaXNoIDxicj5cbioqTWluaW11bSBpbnB1dCBpbWFnZSBkaW1lbnNpb25zOioqICgzMiwgMzIpIDxicj5cbioqTWF4aW11bSBpbnB1dCBpbWFnZSBkaW1lbnNpb25zOioqIE5vIGxpbWl0YXRpb24gPGJyPlxuKipPdGhlciBQcm9wZXJ0aWVzIFJlbGF0ZWQgdG8gSW5wdXQ6KiogbmQgYXJyYXksIG9yIGJhdGNoIG9mIG5kIGFycmF5cyBhcmUgcGFzc2VkIGluIHdpdGggc2hhcGUgW0JhdGNoLCBDaGFubmVsLCBXaWR0aCwgSGVpZ2h0XS4gUGFkZGxlT0NSIGRvZXMgc29tZSBpbnRlcm5hbCB0aHJlc2hvbGRpbmcsIGJ1dCBub25lIHdhcyBpbXBsZW1lbnRlZCBmcm9tIG91ciBzaWRlLiA8YnI+XG5cbiMjIE91dHB1dFxuKipPdXRwdXQgVHlwZShzKToqKiBUZXh0IDxicj5cbioqT3V0cHV0IEZvcm1hdDoqKiAgU3RyaW5nIDxicj5cbioqT3V0cHV0IFBhcmFtZXRlcnM6KiogMUQgPGJyPlxuKipPdGhlciBQcm9wZXJ0aWVzIFJlbGF0ZWQgdG8gT3V0cHV0OioqIEJhdGNoIG9mIHRleHQgc3RyaW5ncy4gPGJyPlxuXG4qKlN1cHBvcnRlZCBIYXJkd2FyZSBNaWNyb2FyY2hpdGVjdHVyZSBDb21wYXRpYmlsaXR5OioqIE5WSURJQSBBbXBlcmUsIE5WSURJQSBIb3BwZXIsIE5WSURJQSBMb3ZlbGFjZTxicj5cblxuIyMgU3VwcG9ydGVkIE9wZXJhdGluZyBTeXN0ZW0ocyk6XG4qIExpbnV4IDxicj5cblxuIyMgTW9kZWwgVmVyc2lvbihzKTpcbiogYmFpZHUvcGFkZGxlb2NyICA8YnI+XG5cbiMjIFRyYWluaW5nIERhdGFzZXQ6XG5cbioqTGluazoqKiAgPGJyPlxuXG5UZXh0IGRldGVjdGlvbiBkYXRhc2V0cyBpbmNsdWRlIExTVlQgKFN1biBldCBhbC4gMjAxOSksIFJDVFctMTcgKFNoaWV0IGFsLiAyMDE3KSwgTVRXSSAyMDE4IChIZSBhbmQgWWFuZyAyMDE4KSwgQ0FTSUEtMTBLIChIZSBldCBhbC4gMjAxOCksIFNST0lFIChIdWFuZyBldCBhbC4gMjAxOSksIE1MVCAyMDE5IChOYXllZiBldCBhbC4gMjAxOSksIEJESSAoS2FyYXR6YXMgZXQgYWwuIDIwMTEpLCBNU1JBVEQ1MDAgKFlhbyBldCBhbC4gMjAxMikgYW5kIENDUEQgMjAxOSAoWHUgZXQgYWwuIDIwMTgpLlxuXG5UaGVzZSBhcmUgdHdvIG9mIHRoZSBkYXRhc2V0cyAoYW1vbmcgb3RoZXJzKSB3aGljaCBhcmUgdXNlZCBmb3IgdGV4dCByZWNvZ25pdGlvbjpcbltPcGVuSW1hZ2VzXShodHRwczovL2dpdGh1Yi5jb20vb3BlbmltYWdlcy9kYXRhc2V0KSA8YnI+XG5bSW52b2ljZURhdGFzZXRzXShodHRwczovL2dpdGh1Yi5jb20vRnV4aUppYS9JbnZvaWNlRGF0YXNldHMpXG5cbioqRGF0YSBDb2xsZWN0aW9uIE1ldGhvZCBieSBkYXRhc2V0OioqIFVua25vd24gPGJyPlxuKipMYWJlbGluZyBNZXRob2QgYnkgZGF0YXNldCoqIFVua25vd24gPGJyPlxuXG5UZXh0IERldGVjdGlvbjogMTI3ayB0cmFpbmluZyBpbWFnZXMgKDY4SyByZWFsIHNjZW5lIGltYWdlcyBmcm9tIEJhaWR1IGltYWdlIHNlYXJjaCBhbmQgcHVibGljIGRhdGFzZXRzIGFuZCA1OUsgc3ludGhldGljIGltYWdlcylcblxuVGV4dCBSZWNvZ25pdGlvbjogMTguNU0gdHJhaW5pbmcgaW1hZ2VzICg3TSByZWFsIHNjZW5lIGltYWdlcyBmcm9tIEJhaWR1IGltYWdlIHNlYXJjaCBhbmQgcHVibGljIGRhdGFzZXRzIGFuZCAxMS41TSBzeW50aGV0aWMgaW1hZ2VzKVxuXG4jIyBFdmFsdWF0aW9uOlxuXG5UaGUgbW9kZWwgaGFzIGJlZW4gcHJpbWFyaWx5IGV2YWx1YXRlZCBvbiBzdGFuZGFyZCBkb2N1bWVudCBsYXlvdXRzLiBQZXJmb3JtYW5jZSBvbiBjb21wbGV4IGxheW91dHMgc3VjaCBhcyBkZW5zZSB0YWJsZXMsIG11bHRpLWNvbHVtbiBkb2N1bWVudHMsIGFuZCBtaXhlZCBoYW5kd3JpdGluZy9wcmludGVkIHRleHQgbWF5IHZhcnkuIFVzZXJzIHNob3VsZCBjb25kdWN0IHRoZWlyIG93biB0ZXN0aW5nIHdoZW4gd29ya2luZyB3aXRoIHBhcnRpY3VsYXJseSBjaGFsbGVuZ2luZyBkb2N1bWVudCBzdHJ1Y3R1cmVzLlxuXG5QbGVhc2Ugc2VlIFBhZGRsZU9DUidzIGluZm9ybWF0aW9uIG9uIFt0aGUgbW9kZWxdKGh0dHBzOi8vcGFkZGxlcGFkZGxlLmdpdGh1Yi5pby9QYWRkbGVPQ1IvbGF0ZXN0L2VuL3Bwb2NyL292ZXJ2aWV3Lmh0bWwjcHAtb2NydjMtZW5nbGlzaC1tb2RlbCkgZm9yIG1vcmUgZGV0YWlscy5cblxuIyMgSW5mZXJlbmNlOlxuKipFbmdpbmU6KiogVGVuc29yKFJUKSA8YnI+XG4qKlRlc3QgSGFyZHdhcmU6KiogVGVzdGVkIG9uIGFsbCBzdXBwb3J0ZWQgaGFyZHdhcmUgbGlzdGVkIGluIGNvbXBhdGliaWxpdHkgc2VjdGlvbiA8YnI+XG5cbiMjIEV0aGljYWwgQ2
9uc2lkZXJhdGlvbnM6XG5OVklESUEgYmVsaWV2ZXMgVHJ1c3R3b3J0aHkgQUkgaXMgYSBzaGFyZWQgcmVzcG9uc2liaWxpdHkgYW5kIHdlIGhhdmUgZXN0YWJsaXNoZWQgcG9saWNpZXMgYW5kIHByYWN0aWNlcyB0byBlbmFibGUgZGV2ZWxvcG1lbnQgZm9yIGEgd2lkZSBhcnJheSBvZiBBSSBhcHBsaWNhdGlvbnMuICBXaGVuIGRvd25sb2FkZWQgb3IgdXNlZCBpbiBhY2NvcmRhbmNlIHdpdGggb3VyIHRlcm1zIG9mIHNlcnZpY2UsIGRldmVsb3BlcnMgc2hvdWxkIHdvcmsgd2l0aCB0aGVpciBpbnRlcm5hbCBtb2RlbCB0ZWFtIHRvIGVuc3VyZSB0aGlzIG1vZGVsIG1lZXRzIHJlcXVpcmVtZW50cyBmb3IgdGhlIHJlbGV2YW50IGluZHVzdHJ5IGFuZCB1c2UgY2FzZSBhbmQgYWRkcmVzc2VzIHVuZm9yZXNlZW4gcHJvZHVjdCBtaXN1c2UuXG5cblBsZWFzZSByZXBvcnQgc2VjdXJpdHkgdnVsbmVyYWJpbGl0aWVzIG9yIE5WSURJQSBBSSBDb25jZXJucyBbaGVyZV0oaHR0cHM6Ly93d3cubnZpZGlhLmNvbS9lbi11cy9zdXBwb3J0L3N1Ym1pdC1zZWN1cml0eS12dWxuZXJhYmlsaXR5LykuIiwKICAgICJkaXNwbGF5TmFtZSI6ICJQYWRkbGVPQ1IiLAogICAgImZyYW1ld29yayI6ICJPdGhlciIsCiAgICAiaGFzUGxheWdyb3VuZCI6IGZhbHNlLAogICAgImhhc1NpZ25lZFZlcnNpb24iOiB0cnVlLAogICAgImlzUGxheWdyb3VuZEVuYWJsZWQiOiBmYWxzZSwKICAgICJpc1B1YmxpYyI6IGZhbHNlLAogICAgImlzUmVhZE9ubHkiOiB0cnVlLAogICAgImxhYmVscyI6IFsKICAgICAgICAiTlNQRUNULTJJWUktQkExRyIsCiAgICAgICAgIk5WSURJQSBOSU0iLAogICAgICAgICJudmFpZTptb2RlbDpudmFpZV9zdXBwb3J0ZWQiLAogICAgICAgICJudmlkaWFfbmltOm1vZGVsOm5pbW1jcm9fbnZpZGlhX25pbSIsCiAgICAgICAgInByb2R1Y3ROYW1lczpuaW0tZGV2IiwKICAgICAgICAicHJvZHVjdE5hbWVzOm52LWFpLWVudGVycHJpc2UiCiAgICBdLAogICAgImxhdGVzdFZlcnNpb25JZFN0ciI6ICJsNDBzeDEtdHJ0LWZwMTYtazd2bmMteW1jZyIsCiAgICAibGF0ZXN0VmVyc2lvblNpemVJbkJ5dGVzIjogMTI4NTMwNjYxLAogICAgImxvZ28iOiAiaHR0cHM6Ly9kZXZlbG9wZXItYmxvZ3MubnZpZGlhLmNvbS93cC1jb250ZW50L3VwbG9hZHMvMjAyNC8wMy9uZW1vLXJldHJpZXZlci1ncmFwaGljLnBuZyIsCiAgICAibW9kZWxGb3JtYXQiOiAiTi9BIiwKICAgICJuYW1lIjogInBhZGRsZW9jciIsCiAgICAib3JnTmFtZSI6ICJuaW0iLAogICAgInByZWNpc2lvbiI6ICJOL0EiLAogICAgInByb2R1Y3ROYW1lcyI6IFsKICAgICAgICAibmltLWRldiIsCiAgICAgICAgIm52LWFpLWVudGVycHJpc2UiCiAgICBdLAogICAgInB1YmxpY0RhdGFzZXRVc2VkIjoge30sCiAgICAicHVibGlzaGVyIjogIk5WSURJQSIsCiAgICAic2hvcnREZXNjcmlwdGlvbiI6ICJQYWRkbGVPQ1IgaXMgYW4gdWx0cmEgbGlnaHR3ZWlnaHQgT3B0aWNhbCBDaGFyYWN0ZXIgUmVjb2duaXRpb24gKE9DUikgc3lzdGVtIGJ5IEJhaWR1LiBQYWRkbGVPQ1Igc3VwcG9ydHMgYSB2YXJpZXR5IG9mIGN1dHRpbmctZWRnZSBhbGdvcml0aG1zIHJlbGF0ZWQgdG8gT0NSLiIsCiAgICAidGVhbU5hbWUiOiAiYmFpZHUiLAogICAgInVwZGF0ZWREYXRlIjogIjIwMjUtMDctMTZUMDA6MzI6MDQuMjk4WiIKfQ== source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/baidu/containers/paddleocr optimizationProfiles: - profileId: nim/baidu/paddleocr:l4x1-trt-fp16-fswetrkejq framework: TensorRT-LLM displayName: Paddleocr NVIDIA L4x1 FP16 ngcMetadata: 49049986fc9bf66bc3674dd5ff7953472d7ec6ae82a64b74b8d33d3e8c077391: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 27b8:10de gpu: NVIDIA L4 gpu_key: l4 model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L4 - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:2_TRT_python_2 framework: TensorRT-LLM displayName: Paddleocr NVIDIA A100-SXM4-80GBx1 FP16 ngcMetadata: 495980e0b97395173bd2ddce9f7dec2851c654643e3bdb91c4d8fc24047c4d6a: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 20b2:10de gpu: NVIDIA A100-SXM4-80GB gpu_key: a100-sxm4-80gb model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100-SXM4-80GB - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE 
value: TENSORRT - profileId: nim/baidu/paddleocr:2_TRT_python_2 framework: TensorRT-LLM displayName: Paddleocr NVIDIA H100 NVLx1 FP16 ngcMetadata: 5c2af3e6451d4087fa274ab38bca77845fbb8e0577c176407511154869d2fe26: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 2321:10de gpu: NVIDIA H100 NVL gpu_key: h100-nvl model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100-NVL - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:l40sx1-trt-fp16-evboykuf0g framework: TensorRT-LLM displayName: Paddleocr NVIDIA L40Sx1 FP16 ngcMetadata: 631c6b6c76996d8cc04cf7cfde63d15d1b5f57cb323dc129f2a838b35703f1d9: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 26b9:10de gpu: NVIDIA L40S gpu_key: l40s model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L40S - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:2_TRT_python_2 framework: TensorRT-LLM displayName: Paddleocr NVIDIA A100-SXM4-40GBx1 FP16 ngcMetadata: 93868053f6713346c8c4f6602b6a981b18d95d6680510bf249fd5b83477bbc52: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 20b0:10de gpu: NVIDIA A100-SXM4-40GB gpu_key: a100-sxm4-40gb model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100-SXM4-40GB - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:a10gx1-trt-fp16-ijpjeptpna framework: TensorRT-LLM displayName: Paddleocr NVIDIA A10Gx1 FP16 ngcMetadata: acba2841622c4da2050811e8c7c4bae4c16996ab61b67d68b089176524d70383: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 2237:10de gpu: NVIDIA A10G gpu_key: a10g model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A10G - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:b200x1-trt-fp16-itomrwaucq framework: TensorRT-LLM displayName: Paddleocr NVIDIA B200x1 FP16 ngcMetadata: b6c8b6aef874d014b535faf2742d759bce6670c32b777ea8762d6264f7d30737: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 2901:10de gpu: NVIDIA B200 gpu_key: b200 model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: B200 - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:2_TRT_python_2 framework: TensorRT-LLM displayName: Paddleocr NVIDIA H100 80GB HBM3x1 FP16 ngcMetadata: eaad888e841d6944998862e9ea19050e530701214aa6caa164047bc0fb800a69: model: baidu/paddleocr release: 1.5.0 tags: backend: triton batch_size: '32' device_id: 2330:10de gpu: NVIDIA H100 80GB HBM3 gpu_key: h100-hbm3-80gb model_type: tensorrt precision: fp16 modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100-HBM3-80GB - key: COUNT value: 1 - key: NIM 
VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: TENSORRT - profileId: nim/baidu/paddleocr:4_ONNX_python_2 framework: ONNX displayName: Paddleocr ONNX FP16 ngcMetadata: edc693c6fccd68d266622eace04225421e353d7ce31e3b207afc5ff35124127b: model: baidu/paddleocr release: 1.5.0 tags: backend: triton model_type: onnx precision: fp16 modelFormat: onnx spec: - key: PRECISION value: FP16 - key: COUNT value: 1 - key: NIM VERSION value: 1.5.0 - key: DOWNLOAD SIZE value: 1GB - key: BACKEND value: TRITON - key: MODEL TYPE value: ONNX labels: - signed images - NSPECT-LDAL-INWI - NVIDIA AI Enterprise Supported - NVIDIA NIM config: architectures: - Other modelType: NIM license: NVIDIA AI Foundation Models Community License - name: Llama 3.2 NV EmbedQA 1b V2 displayName: Llama 3.2 NV EmbedQA 1b V2 modelHubID: llama-3.2-nv-embedqa-v2 category: Text Embedding type: NGC description: The NVIDIA Retrieval QA Llama3.2 1b Embedding NIM is an embedding NIM optimized for multilingual and crosslingual text question-answering retrieval. requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3/license/ modelVariants: - variantId: Llama 3.2 NV EmbedQA 1b V2 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "| Field | Response |\n| ----- | ----- |\n| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |\n| Measures taken to mitigate against unwanted bias | None |",
    "canGuestDownload": false,
    "createdDate": "2025-02-19T16:11:27.907Z",
    "description": "## **Model Overview**\n\n### **Description**\n\nThe Llama 3.2 NeMo Retriever Embedding 1B model is optimized for **multilingual and cross-lingual** text question-answering retrieval with **support for long documents (up to 8192 tokens) and dynamic embedding size (Matryoshka Embeddings)**. This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.\n\nIn addition to enabling multilingual and cross-lingual question-answering retrieval, this model reduces the data storage footprint by 35x through dynamic embedding sizing and support for longer token length, making it feasible to handle large-scale datasets efficiently.\n\nAn embedding model is a crucial component of a text retrieval system, as it transforms textual information into dense vector representations. They are typically transformer encoders that process tokens of input text (for example: question, passage) to output an embedding.\n\nThis model is ready for commercial use.\n\nThe Llama 3.2 NeMo Retriever Embedding 1B model is a part of the NVIDIA NeMo Retriever collection of NIM, which provide state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for domain-specific use cases, such as information technology, human resource help assistants, and research & development research assistants.\n\n### **Intended use**\n\nThe Llama 3.2 NeMo Retriever Embedding 1B model is most suitable for users who want to build a multilingual question-and-answer application over a large text corpus, leveraging the latest dense retrieval technologies.\n\n### **License/Terms of use**\n\nThe use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/) and Llama 3.2 is licensed under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n### **Release Date:  \n\nBuild.Nvidia.com: March 17, 2025 via build.nvidia.com/nvidia/llama-3_2-nv-embedqa-1b-v2 <br>\nOriginal NGC Release: March 17, 2025 <br>\nLast Updated NGC Release: July 11, 2025 <br>\n\n### **Deployment Geography**\nGlobal <br>\n\n\n### **Model Architecture**\n\n**Architecture Type:** Transformer<br>\n**Network Architecture:** Fine-tuned Llama3.2 1B Retriever<br>\n\nThis NeMo Retriever embedding model is a transformer encoder - a fine-tuned version of Llama3.2 1b, with 16 layers and an embedding size of 2048, which is trained on public datasets. The AdamW optimizer is employed incorporating 100 warm up steps and 5e-6 learning rate with WarmupDecayLR scheduler. Embedding models for text retrieval are typically trained using a bi-encoder architecture. This involves encoding a pair of sentences (for example, query and chunked passages) independently using the embedding model. 
Contrastive learning is used to maximize the similarity between the query and the passage that contains the answer, while minimizing the similarity between the query and sampled negative passages not useful to answer the question.\n\n### **Input**\n\n**Input Type:** Text<br>\n**Input Format:** List of strings<br>\n**Input Parameter:** 1D<br>\n**Other Properties Related to Input:** The model's maximum context length is 8192 tokens. Texts longer than the maximum length must either be chunked or truncated.<br>\n\n### **Output**\n\n**Output Type:** Floats<br>\n**Output Format:** List of float arrays<br>\n**Output:** Model outputs embedding vectors of maximum dimension 2048 for each text string (can be configured to 384, 512, 768, 1024, or 2048).<br>\n**Other Properties Related to Output:** N/A<br>\n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA\u2019s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.\n\n### **Software Integration**\n\n**Runtime Engine:** NeMo Retriever embedding NIM<br>\n**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace<br>\n**Supported Operating System(s):** Linux<br>\n\n### **Model Version(s)**\n\nLlama 3.2 NeMo Retriever Embedding 1B v2<br>\nShort Name: llama-3.2-nv-embedqa-1b-v2<br>\n\n## **Training Dataset & Evaluation**\n\n### **Training Dataset**\n\nThe development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset named MS MARCO restricts \u200ccommercial licensing, limiting the use of these models in commercial settings. To address this, NVIDIA created its own training dataset blend based on public QA datasets, which each have a license for commercial applications.\n\n**Data Collection Method by dataset**: Hybrid: Automated, Human, Synthetic\n\n\n**Labeling Method by dataset**: Hybrid: Automated, Human, Synthetic\n\n\n**Properties:** Semi-supervised pre-training on 12M samples from public datasets and fine-tuning on 1M samples from public datasets.\n\n\n### **Evaluation Results**\n\nProperties: We evaluated the NeMo Retriever embedding model in comparison to literature open & commercial retriever models on academic benchmarks for question-answering - [NQ](https://huggingface.co/datasets/BeIR/nq), [HotpotQA](https://huggingface.co/datasets/hotpot_qa) and [FiQA (Finance Q\&A)](https://huggingface.co/datasets/BeIR/fiqa) from the BeIR benchmark and the TechQA dataset. Note that the model was evaluated offline on A100 GPUs using the model's PyTorch checkpoint. In this benchmark, the metric used was Recall@5.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on NQ, HotpotQA, FiQA, TechQA dataset |\n| ----- | ----- |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 68.60% |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 64.48% |\n| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 68.97% |\n| nv-embedqa-mistral-7b-v2 | 72.97% |\n| nv-embedqa-mistral-7B-v1 | 64.93% |\n| nv-embedqa-e5-v5 | 62.07% |\n| nv-embedqa-e5-v4 | 57.65% |\n| e5-large-unsupervised | 48.03% |\n| BM25 | 44.67% |\n\nWe evaluated the multilingual capabilities on the academic benchmark [MIRACL](https://github.com/project-miracl/miracl) across 15 languages and translated the English and Spanish versions of MIRACL into 11 additional languages. 
The reported scores are based on an internal version of MIRACL by selecting hard negatives for each query to reduce the corpus size.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on MIRACL multilingual datasets |\n| ----- | ----- |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 60.75% |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 58.62% |\n| llama-3.2-nv-embedqa-1b-v1 | 60.07% |\n| nv-embedqa-mistral-7b-v2 | 50.42% |\n| BM25 | 26.51% |\n\nWe evaluated the cross-lingual capabilities on the academic benchmark [MLQA](https://github.com/facebookresearch/MLQA/) based on 7 languages (Arabic, Chinese, English, German, Hindi, Spanish, Vietnamese). We consider only evaluation datasets when the query and documents are in different languages. We calculate the average Recall@5 across the 42 different language pairs.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on MLQA dataset with different languages |\n| ----- | ----- |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 79.86% |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 71.61% |\n| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 78.77% |\n| nv-embedqa-mistral-7b-v2 | 68.38% |\n| BM25 | 13.01% |\n\nWe evaluated the support of long documents on the academic benchmark [Multilingual Long-Document Retrieval (MLDR)](https://huggingface.co/datasets/Shitao/MLDR) built on Wikipedia and mC4, covering 12 typologically diverse languages. The English version has a median length of 2399 tokens and 90th percentile of 7483 tokens using the llama 3.2 tokenizer. The MLDR dataset is based on questions synthetically generated with an LLM, which tends to create questions with keywords similar to the positive document that might not be representative of real user queries. This characteristic of the dataset benefits sparse embeddings like BM25.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on MLDR |\n| ----- | ----- |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 2048) | 59.55% |\n| llama-3.2-nv-embedqa-1b-v2 (embedding dim 384) | 54.77% |\n| llama-3.2-nv-embedqa-1b-v1 (embedding dim 2048) | 60.49% |\n| nv-embedqa-mistral-7b-v2 | 43.24% |\n| BM25 | 71.39% |\n\n**Data Collection Method by dataset**: Hybrid: Automated, Human, Synthetic\n\n**Labeling Method by dataset:** Hybrid: Automated, Human, Synthetic\n\n**Properties:** The evaluation datasets are based on [MTEB/BEIR](https://github.com/beir-cellar/beir), TextQA, TechQA, [MIRACL](https://github.com/project-miracl/miracl), [MLQA](https://github.com/facebookresearch/MLQA), and [MLDR](https://huggingface.co/datasets/Shitao/MLDR). The sizes range from the 10,000s up to 5M depending on the dataset.\n\n**Inference**<br>\n**Engine:** TensorRT<br>\n**Test Hardware:** H100 PCIe/SXM, A100 PCIe/SXM, L40s, L4, and A10G<br>\n\n## **Ethical Considerations**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\n\nFor more detailed information on ethical considerations for this model, please see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).",
    "displayName": "Llama 3.2 NeMo Retriever Embedding 1B",
    "explainability": "| Field | Response |\n| ----- | ----- |\n| Intended Application & Domain: | Passage and query embedding for question and answer retrieval |\n| Model Type: | Transformer encoder |\n| Intended User: | Generative AI creators working with conversational AI models - users who want to build a multilingual question and answer application over a large text corpus, leveraging the latest dense retrieval technologies. |\n| Output: | Array of float numbers (Dense Vector Representation for the input text) |\n| Describe how the model works: | Model transforms the tokenized input text into a dense vector representation. |\n| Performance Metrics: | Accuracy, Throughput, and Latency |\n| Potential Known Risks: | This model does not always guarantee to retrieve the correct passage(s) for a given query. |\n| Licensing & Terms of Use: | The use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/) and Llama 3.2 is licensed under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved. |\n| Technical Limitations | The model\u2019s max sequence length is 8192. Therefore, the longer text inputs should be truncated.   |\n| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | N/A |\n| Verified to have met prescribed NVIDIA quality standards: | Yes |",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-31UJ-S8X4",
        "NVIDIA AI Enterprise Supported",
        "NVIDIA NIM",
        "llama-3-2-nv-embedqa-1b-v2",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "rtx5090x1-trt-fp8-b-g-qaa9ra",
    "latestVersionSizeInBytes": 1648513544,
    "logo": "https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/nemo-retriever-graphic.png",
    "modelFormat": "N/A",
    "name": "llama-3.2-nv-embedqa-1b-v2",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "| Field | Response |\n| ----- | ----- |\n| Generatable or reverse engineerable personal data? | None |\n| Personal data used to create this model? | None |\n| How often is dataset reviewed? | Dataset is initially reviewed upon addition, and subsequent reviews are conducted as needed or upon request for changes. |\n| Is there provenance for all datasets used in training? | Yes |\n| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |\n| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |\n| Applicable Privacy Policy | https://www.nvidia.com/en-us/about-nvidia/privacy-policy/ |",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "| Field | Response |\n| ----- | ----- |\n| Model Application(s): | Text Embedding for Retrieval |\n| Use Case Restrictions: | Abide by [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |\n| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. |\n| Describe the life critical impact (if present): | Not applicable. |",
    "shortDescription": "World-class multilingual and cross-lingual question-answering retrieval.",
    "teamName": "nvidia",
    "updatedDate": "2025-08-29T18:01:57.279Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama-3.2-nv-embedqa-1b-v2 optimizationProfiles: - profileId: nim/nvidia/llama-3.2-nv-embedqa-1b-v2:onnx-precision.fp16-7c7a1c17 framework: ONNX displayName: Llama 3.2 NV Embedqa 1B V2 ONNX FP16 ngcMetadata: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f: model: nvidia/llama-3.2-nv-embedqa-1b-v2 release: 1.10.0 tags: backend: onnx model_type: onnx precision: fp16 tp: '1' modelFormat: onnx spec: - key: PRECISION value: FP16 - key: COUNT value: 1 - key: NIM VERSION value: 1.10.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: ONNX - key: MODEL TYPE value: ONNX - key: MAX TOKENS value: 8192 - key: TOTAL PARAMETERS value: 1236 - key: Embedding Dimension value: 2048 labels: - Llama - Meta - Chat - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Llama 3.2 NV RerankQA 1b V2 displayName: Llama 3.2 NV RerankQA 1b V2 modelHubID: llama-3.2-nv-rerankqa-v2 category: Text Embedding type: NGC description: The NVIDIA Retrieval QA Llama 1B Reranking NIM is a NIM optimized for providing a logit score that represents how relevant a document(s) is to a given query, fine-tuned for multilingual and cross-lingual text question-answering retrieval. requireLicense: true licenseAgreements: - label: Use Policy url: https://llama.meta.com/llama3/use-policy/ - label: License Agreement url: https://llama.meta.com/llama3/license/ modelVariants: - variantId: Llama 3.2 NV RerankQA 1b V2 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "| Field | Response |\n| ----- | ----- |\n| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing | None |\n| Measures taken to mitigate against unwanted bias | None |",
    "canGuestDownload": false,
    "createdDate": "2025-03-14T00:25:50.051Z",
    "description": "## **Model Overview**\n\n### **Description**\n\nThe Llama 3.2 NeMo Retriever Reranking 1B model is optimized for providing a logit score that represents how relevant a document(s) is to a given query. The model was fine-tuned for **multilingual, cross-lingual** text question-answering retrieval, with support for **long documents (up to 8192 tokens)**.  This model was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.\n\n\nThis model is a component in a text retrieval system to improve the overall accuracy. A text retrieval system often uses an embedding model (dense) or lexical search (sparse) index to return relevant text passages given the input. A reranking model can be used to rerank the potential candidate into a final order. The reranking model has the question-passage pairs as an input and therefore, can process cross attention between the words. It\u2019s not feasible to apply a Ranking model on all documents in the knowledge base, therefore, ranking models are often deployed in combination with embedding models.\n\n\nThis model is ready for commercial use.\n\n\nThe Llama 3.2 NeMo Retriever Reranking 1B model is a part of the NeMo Retriever collection of NIM, which provide state-of-the-art, commercially-ready models and microservices, optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for their domain-specific use cases, such as information technology, human resource help assistants, and research & development research assistants.\n\n\n### **License/Terms of use**\n\nThe use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/)  and Llama 3.2 is licensed under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved.\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n### **Intended use**\n\nThe Llama 3.2 NeMo Retriever Reranking 1B model is most suitable for users who want to improve their multilingual retrieval tasks by reranking a set of candidates for a given question.\n\n### **Model Architecture: Llama-3.2 1B Ranker**\n\n**Architecture Type:** Transformer\n**Network Architecture:** Fine-tuned meta-llama/Llama-3.2-1B\n\nThe Llama 3.2 NeMo Retriever Reranking 1B model is a transformer encoder fine-tuned for contrastive learning. We employ bi-directional attention when fine-tuning for higher accuracy. The last embedding output by the decoder model is used with a mean pooling strategy, and a binary classification head is fine-tuned for the ranking task.\n\nRanking models for text ranking are typically trained as a cross-encoder for sentence classification. This involves predicting the relevancy of a sentence pair (for example, question and chunked passages). 
The CrossEntropy loss is used to maximize the likelihood of passages containing information to answer the question and minimize the likelihood for (negative) passages that do not contain information to answer the question.\n\nWe trained the model on public datasets described in the Dataset and Training section.\n\n### **Input**\n\n**Input Type:** Pair of Texts\n**Input Format:** List of text pairs\n**Input Parameters:** 1D\n**Other Properties Related to Input:** The model was trained on question answering over text documents from multiple languages. It was evaluated to work successfully with up to a sequence length of 8192 tokens. Longer texts are recommended to be either chunked or truncated.\n\n**Output**\n**Output Type:** Floats\n**Output Format:** List of floats\n**Output Parameters:** 1D\n**Other Properties Related to Output:** Each output is a probability score (or raw logit). Users can decide to implement a Sigmoid activation function applied to the logits in their usage of the model.\n\n### **Software Integration**\n\n**Runtime:** Llama 3.2 NeMo Retriever Reranking 1B NIM\n**Supported Hardware Microarchitecture Compatibility**: NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace\n**Supported Operating System(s):** Linux\n\n### **Model Version(s)**\n\nLlama 3.2 NeMo Retriever Reranking 1B\nShort Name: llama-3.2-nv-rerankqa-1b-v2\n\n## **Training Dataset & Evaluation**\n\n### **Training Dataset**\n\nThe development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset named [MSMARCO](https://microsoft.github.io/msmarco/) restricts \u200ccommercial licensing, limiting the use of these models in commercial settings. To address this, NVIDIA created its own training dataset blend based on public QA datasets, which each have a license for commercial applications.\n\n**Data Collection Method by dataset**: Automated, Unknown\n\n**Labeling Method by dataset:** Automated, Unknown\n\n**Properties:** This model was trained on 800k samples from public datasets.\n\n### **Evaluation Results**\n\nWe evaluated the pipelines on a set of evaluation benchmarks. We applied the ranking model to the candidates retrieved from a retrieval embedding model.\n\nOverall, the pipeline llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 provides high BEIR+TechQA accuracy with multilingual and crosslingual support. The llama-3.2-nv-rerankqa-1B-v2 ranking model is 3.5x smaller than the nv-rerankqa-mistral-4b-v3 model.\n\nWe evaluated the NVIDIA Retrieval QA Embedding Model in comparison to literature open & commercial retriever models on academic benchmarks for question-answering \\- [NQ](https://huggingface.co/datasets/BeIR/nq), [HotpotQA](https://huggingface.co/datasets/hotpot_qa) and [FiQA (Finance Q\&A)](https://huggingface.co/datasets/BeIR/fiqa) from the BeIR benchmark and the TechQA dataset. In this benchmark, the metric used was Recall@5. 
As described, we need to apply the ranking model on the output of an embedding model.\n\n| Open & Commercial Reranker Models | Average Recall@5 on NQ, HotpotQA, FiQA, TechQA dataset |\n| ----- | ----- |\n| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 73.64% |\n| llama-3.2-nv-embedqa-1b-v2 | 68.60% |\n| nv-embedqa-e5-v5 \\+ nv-rerankQA-mistral-4b-v3 | 75.45% |\n| nv-embedqa-e5-v5 | 62.07% |\n| nv-embedqa-e5-v4 | 57.65% |\n| e5-large\\_unsupervised | 48.03% |\n| BM25 | 44.67% |\n\nWe evaluated the model\u2019s multilingual capabilities on the [MIRACL](https://github.com/project-miracl/miracl) academic benchmark \\- a multilingual retrieval dataset, across 15 languages, and on an additional 11 languages that were translated from the English and Spanish versions of MIRACL. The reported scores are based on a custom subsampled version by selecting hard negatives for each query to reduce the corpus size.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on MIRACL multilingual datasets |\n| :---- | :---- |\n| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 65.80% |\n| llama-3.2-nv-embedqa-1b-v2 | 60.75% |\n| nv-embedqa-mistral-7b-v2 | 50.42% |\n| BM25 | 26.51% |\n\nWe evaluated the cross-lingual capabilities on the academic benchmark [MLQA](https://github.com/facebookresearch/MLQA/) based on 7 languages (Arabic, Chinese, English, German, Hindi, Spanish, Vietnamese). We consider only evaluation datasets when the query and documents are in different languages. We calculate the average Recall@5 across the 42 different language pairs.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on MLQA dataset with different languages |\n| :---- | :---- |\n| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 86.83% |\n| llama-3.2-nv-embedqa-1b-v2 | 79.86% |\n| nv-embedqa-mistral-7b-v2 | 68.38% |\n| BM25 | 13.01% |\n\nWe evaluated the support of long documents on the academic benchmark [Multilingual Long-Document Retrieval (MLDR)](https://huggingface.co/datasets/Shitao/MLDR) built on Wikipedia and mC4, covering 12 typologically diverse languages . The English version has a median length of 2399 tokens and 90th percentile of 7483 tokens using the llama 3.2 tokenizer.\n\n| Open & Commercial Retrieval Models | Average Recall@5 on MLDR |\n| :---- | :---- |\n| llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 70.69% |\n| llama-3.2-nv-embedqa-1b-v2 | 59.55% |\n| nv-embedqa-mistral-7b-v2 | 43.24% |\n| BM25 | 71.39% |\n\n**Data Collection Method by dataset**:\nUnknown\n\n**Labeling Method by dataset:**\nUnknown\n\n**Properties**\nThe evaluation datasets are based on three [MTEB/BEIR](https://github.com/beir-cellar/beir) TextQA datasets, the TechQA dataset, and MIRACL multilingual retrieval datasets, which are all public datasets. The sizes range between 10,000s up to 5M depending on the dataset.\n\n**Inference**\n**Engine:** TensorRT\n**Test Hardware:**  H100 PCIe/SXM, A100 PCIe/SXM, L40s, L4, and A10G\n\n## **Ethical Considerations**\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. 
When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.\n\nFor more detailed information on ethical considerations for this model, please see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.\n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n## Get Help\n\n### Enterprise Support\nGet access to knowledge base articles and support cases or  submit a ticket at the [NVIDIA AI Enterprise Support Services page.](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n### NVIDIA NIM Documentation\nVisit the [NeMo Retriever docs page](https://docs.nvidia.com/nemo/retriever/index.html) for release documentation, deployment guides and more.",
    "displayName": "Llama 3.2 NeMo Retriever Reranking 1B",
    "explainability": "| Field | Response |\n| ----- | ----- |\n| Intended Application & Domain: | Passage ranking for question and answer retrieval. |\n| Model Type: | Transformer encoder |\n| Intended User: | Generative AI creators working with conversational AI models - most suitable for users who want to improve their multilingual retrieval tasks by reranking a set of candidates for a given question. |\n| Output: | List of Floats (Score/Logit indicating if a passage relevant to a question) |\n| Describe how the model works: | Model provides a score about the likelihood the passage contains the information to answer the question. |\n| Verified to have met prescribed quality standards: | Yes |\n| Performance Metrics: | Accuracy, Throughput, and Latency |\n| Potential Known Risks: | This model does not always guarantee to provide a meaningful ranking of passage(s) for a given question. |\n| Licensing: | The use of this model is governed by the [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/)  and Llama 3.2 is licensed under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/), Copyright \u00a9 Meta Platforms, Inc. All Rights Reserved. |\n| Technical Limitations | The model\u2019s max sequence length is 8192. Therefore, the longer text inputs should be truncated. |",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-31UJ-S8X4",
        "NSPECT-VZY2-WM4U",
        "NVIDIA AI Enterprise Supported",
        "NVIDIA NIM",
        "llama-3-2-nv-rerankqa-1b-v2",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "b200x1-trt-fp8-yikjqqyrqw",
    "latestVersionSizeInBytes": 1662033148,
    "logo": "https://developer-blogs.nvidia.com/wp-content/uploads/2024/03/nemo-retriever-graphic.png",
    "modelFormat": "N/A",
    "name": "llama-3.2-nv-rerankqa-1b-v2",
    "orgName": "nim",
    "precision": "N/A",
    "privacy": "| Field | Response |\n| ----- | ----- |\n| Generatable or reverse engineerable personally-identifiable information (PII)? | None |\n| Was consent obtained for any personal data used? | Not Applicable |\n| Personal data used to create this model? | None |\n| How often is the dataset reviewed? | Before Every Release |\n| Is a mechanism in place to honor data subject right of access or deletion of personal data? | N/A |\n| If personal data was collected for the development of the model, was it collected directly by NVIDIA? | Not Applicable |\n| If personal data was collected for the development of the model by NVIDIA, do you maintain or have access to disclosures made to data subjects? | Not Applicable |\n| If personal data collected for the development of this AI model, was it minimized to only what was required? | Not Applicable |\n| Is there provenance for all datasets used in training? | Yes |\n| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |\n| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "NVIDIA",
    "safetyAndSecurity": "| Field | Response |\n| ----- | ----- |\n| Model Application(s): | Text Reranking for Retrieval |\n| Describe the physical safety impact (if present). | Not Applicable |\n| Use Case Restrictions: | Abide by [NVIDIA AI Foundation Models Community License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).   |\n| Model and dataset restrictions: | The Principle of least privilege (PoLP) is applied limiting access for dataset generation and model development. Restrictions enforce dataset access during training, and dataset license constraints adhered to. |",
    "shortDescription": "GPU-accelerated model optimized for providing a probability score that a given passage contains the information to answer a question.",
    "teamName": "nvidia",
    "updatedDate": "2025-09-03T21:30:35.281Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/nvidia/containers/llama-3.2-nv-rerankqa-1b-v2 optimizationProfiles: - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:h100x1-trt-fp16--ckqlv3j2g framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA H100 NVLx1 FP16 ngcMetadata: 3b1e767e41d02ed0ffa5aa6b46a2edfdd1540edaec2eeda4c00278c838bba38b: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 2321:10de gpu: NVIDIA H100 NVL gpu_key: h100-nvl model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100-NVL - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 5GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:a100x1-trt-fp16-dxtbz8wstg framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA A100-SXM4-40GBx1 FP16 ngcMetadata: 477500a740ea33ea1419289866bbfd598ce51a806fe034b48dc176db32155f59: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 20b0:10de gpu: NVIDIA A100-SXM4-40GB gpu_key: a100-sxm4-40gb model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100-SXM4-40GB - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:l40sx1-trt-fp16-20qsn53gag framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA L40Sx1 FP16 ngcMetadata: 49d14b4eaebc6b1f61e48afb3d88535f4ad3758ea55036f5ab3815d1c5a927fc: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 26b9:10de gpu: NVIDIA L40S gpu_key: l40s model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L40S - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:a100x1-trt-fp16-dxtbz8wstg framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA A100-SXM4-80GBx1 FP16 ngcMetadata: 4ea4624dcc114adeeb29272322897800cddf5dfa873dac467f67d827b7dd9c4d: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 20b2:10de gpu: NVIDIA A100-SXM4-80GB gpu_key: a100-sxm4-80gb model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A100-SXM4-80GB - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:l40sx1-trt-fp8-4nwnajwq4g framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA L40Sx1 FP8 ngcMetadata: 5036ebf412fba4e54511ab4b3822ec7dfb9fd2c256c3100ad2ed9d2b4bda9f79: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 26b9:10de gpu: NVIDIA L40S gpu_key: l40s model_type: tensorrt precision: fp8 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP8 - key: GPU value: L40S - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 2GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:a10gx1-trt-fp16-fxo3knzn8w 
framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA A10Gx1 FP16 ngcMetadata: 6f21ae4169cfe3c03cc92eb194713f5a3044ac2f61526edf632d0f9a5155b538: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 2237:10de gpu: NVIDIA A10G gpu_key: a10g model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: A10G - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:b200x1-trt-fp16-jiw0-uharg framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA B200x1 FP16 ngcMetadata: 75b659320dada86548fb6af5d3adfe386df6c515969d71db4e76cd64375777e1: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 2901:10de gpu: NVIDIA B200 gpu_key: b200 model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: B200 - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 4GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:h100x1-trt-fp8-bm87q6egvq framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA H100 80GB HBM3x1 FP8 ngcMetadata: 774e4d699d318f41630b51b4280cadecb184b9b2755b707aa74232f1ea642b2c: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 2330:10de gpu: NVIDIA H100 80GB HBM3 gpu_key: h100-hbm3-80gb model_type: tensorrt precision: fp8 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP8 - key: GPU value: H100-HBM3-80GB - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 2GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:l4x1-trt-fp16-bajefiwkra framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA L4x1 FP16 ngcMetadata: 9278eac727396c9f6ab9b3d421748889b0686afd20a9cef12d1d16c39fcd6a9d: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 27b8:10de gpu: NVIDIA L4 gpu_key: l4 model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: L4 - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:l4x1-trt-fp8-vk0qdpls2w framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA L4x1 FP8 ngcMetadata: a344745c8dbe62413a4e95b4e5718a689c155dfb8743868fb5d13956a621b31e: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 27b8:10de gpu: NVIDIA L4 gpu_key: l4 model_type: tensorrt precision: fp8 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP8 - key: GPU value: L4 - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 2GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:h100x1-trt-fp8-bm87q6egvq framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA H100 NVLx1 FP8 ngcMetadata: b469c56c1a9ac1001151765527d3c7de77f590255b08eea4aa064ee1abf0ef3f: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 2321:10de gpu: NVIDIA H100 NVL 
gpu_key: h100-nvl model_type: tensorrt precision: fp8 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP8 - key: GPU value: H100-NVL - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 2GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:h100x1-trt-fp16--ckqlv3j2g framework: TensorRT-LLM displayName: Llama 3.2 NV Rerankqa 1B V2 NVIDIA H100 80GB HBM3x1 FP16 ngcMetadata: ddd9c5d1430631c0bd75c04b0c18e9b620219ad82c808a30d019be9cbcd618bd: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: tensorrt device_id: 2330:10de gpu: NVIDIA H100 80GB HBM3 gpu_key: h100-hbm3-80gb model_type: tensorrt precision: fp16 tp: '1' modelFormat: trt-llm spec: - key: PRECISION value: FP16 - key: GPU value: H100-HBM3-80GB - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 5GB - key: BACKEND value: TENSORRT - key: MODEL TYPE value: TENSORRT - profileId: nim/nvidia/llama-3.2-nv-rerankqa-1b-v2:onnx-precision.fp16-d03bf375 framework: ONNX displayName: Llama 3.2 NV Rerankqa 1B V2 ONNX FP16 ngcMetadata: f7391ddbcb95b2406853526b8e489fedf20083a2420563ca3e65358ff417b10f: model: nvidia/llama-3.2-nv-rerankqa-1b-v2 release: 1.8.0 tags: backend: onnx model_type: onnx precision: fp16 tp: '1' modelFormat: onnx spec: - key: PRECISION value: FP16 - key: COUNT value: 1 - key: NIM VERSION value: 1.8.0 - key: DOWNLOAD SIZE value: 3GB - key: BACKEND value: ONNX - key: MODEL TYPE value: ONNX labels: - Llama - Meta - Chat - NIM - Large Language Model - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI Foundation Models Community License - name: Riva ASR Whisper Large v3 displayName: Riva ASR Whisper Large v3 modelHubID: riva-asr-whisper-large-v3 category: Text-Prompt type: NGC description: This model is used to transcribe short-form audio files and is designed to be compatible with OpenAI's sequential long-form transcription algorithm. Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labeled data, Whisper models demonstrate a strong ability to generalize to many datasets and domains without the need for fine-tuning. Whisper-large-v3 is one of the 5 configurations of the model with 1550M parameters. This model version is optimized to run with NVIDIA TensorRT-LLM. This model is ready for commercial use. 
requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: Riva ASR Whisper Large v3 modelCard: ewogICAgImFjY2Vzc1R5cGUiOiAiTk9UX0xJU1RFRCIsCiAgICAiYXBwbGljYXRpb24iOiAiT3RoZXIiLAogICAgImJpYXMiOiAiIiwKICAgICJidWlsdEJ5IjogIk9wZW5BSSIsCiAgICAiY2FuR3Vlc3REb3dubG9hZCI6IGZhbHNlLAogICAgImNyZWF0ZWREYXRlIjogIjIwMjQtMTAtMjhUMTY6NDU6MjAuOTkwWiIsCiAgICAiZGVzY3JpcHRpb24iOiAiT3ZlcnZpZXdcbj09PT09PT09PT1cblxuTW9yZSBkZXRhaWxzXG49PT09PT09PT09PT09XG5cbkRvY3VtZW50YXRpb25cbj09PT09PT09PT09PT09PSIsCiAgICAiZGlzcGxheU5hbWUiOiAiV2hpc3Blci1MYXJnZS12MyIsCiAgICAiZXhwbGFpbmFiaWxpdHkiOiAiIiwKICAgICJmcmFtZXdvcmsiOiAiT3RoZXIiLAogICAgImhhc1BsYXlncm91bmQiOiBmYWxzZSwKICAgICJoYXNTaWduZWRWZXJzaW9uIjogdHJ1ZSwKICAgICJpc1BsYXlncm91bmRFbmFibGVkIjogZmFsc2UsCiAgICAiaXNQdWJsaWMiOiBmYWxzZSwKICAgICJpc1JlYWRPbmx5IjogdHJ1ZSwKICAgICJsYWJlbHMiOiBbCiAgICAgICAgIk5JTSIsCiAgICAgICAgIk5TUEVDVC1GNVBWLVNHQU8iLAogICAgICAgICJXaGlzcGVyLUxhcmdlLXYzIiwKICAgICAgICAibnZhaWU6bW9kZWw6bnZhaWVfc3VwcG9ydGVkIiwKICAgICAgICAibnZpZGlhX25pbTptb2RlbDpuaW1tY3JvX252aWRpYV9uaW0iLAogICAgICAgICJwcm9kdWN0TmFtZXM6bmltLWRldiIsCiAgICAgICAgInByb2R1Y3ROYW1lczpudi1haS1lbnRlcnByaXNlIgogICAgXSwKICAgICJsYXRlc3RWZXJzaW9uSWRTdHIiOiAiaDEwMHgxLW9mbC0yNS4wOC1mcDE2LW1uejRwbm4wcHciLAogICAgImxhdGVzdFZlcnNpb25TaXplSW5CeXRlcyI6IDE1MzI4NjU4OTIsCiAgICAibG9nbyI6ICJodHRwczovL2Fzc2V0cy5uZ2MubnZpZGlhLmNvbS9wcm9kdWN0cy9hcGktY2F0YWxvZy9pbWFnZXMvcmFkdHRzLWhpZmlnYW4tcml2YS5qcGciLAogICAgIm1vZGVsRm9ybWF0IjogIlJNSVIiLAogICAgIm5hbWUiOiAid2hpc3Blci1sYXJnZS12MyIsCiAgICAib3JnTmFtZSI6ICJuaW0iLAogICAgInByZWNpc2lvbiI6ICJBTVAiLAogICAgInByaXZhY3kiOiAiIiwKICAgICJwcm9kdWN0TmFtZXMiOiBbCiAgICAgICAgIm5pbS1kZXYiLAogICAgICAgICJudi1haS1lbnRlcnByaXNlIgogICAgXSwKICAgICJwdWJsaWNEYXRhc2V0VXNlZCI6IHt9LAogICAgInB1Ymxpc2hlciI6ICJOVklESUEiLAogICAgInNhZmV0eUFuZFNlY3VyaXR5IjogIiIsCiAgICAic2hvcnREZXNjcmlwdGlvbiI6ICJXaGlzcGVyIGxhcmdlIHYzIHRyYWluZWQgYnkgT3BlbkFJIiwKICAgICJ0ZWFtTmFtZSI6ICJudmlkaWEiLAogICAgInVwZGF0ZWREYXRlIjogIjIwMjUtMDktMDZUMDQ6MTU6MzYuNzIzWiIKfQ== source: URL: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/riva/models/whisper_large optimizationProfiles: - profileId: nim/nvidia/whisper-large-v3:ofl-rmir-25.06 framework: TensorRT-LLM displayName: Riva ASR Whisper Large v3 Generic NVIDIA GPUx1 ngcMetadata: 5e44fa6d8cd80ad46a089089157ff4565974f0a64fd37c594265c61f00418ae0: model: nvidia/riva-asr/whisper release: 1.3.1 tags: mode: ofl model_type: rmir name: whisper-large-v3 tp: '1' modelFormat: trt-llm spec: - key: COUNT value: 1 - key: NIM VERSION value: 1.3.1 - key: DOWNLOAD SIZE value: 3GB - key: MODEL TYPE value: RMIR - key: MODE value: OFL - profileId: nim/nvidia/whisper-large-v3:h100x1-ofl-25.08-fp16-mnz4pnn0pw framework: TensorRT-LLM displayName: Riva ASR Whisper Large v3 H100 FP16 ngcMetadata: 72232937075119887298deb92b5e58f4d98a0ce0948df60d424f0d97b05da55e: model: nvidia/riva-asr/whisper release: 1.3.1 tags: gpu_device: '2330' mode: ofl model_type: prebuilt name: whisper-large-v3 gpu: H100 tp: '1' modelFormat: trt-llm spec: - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330 - key: NIM VERSION value: 1.3.1 - key: DOWNLOAD SIZE value: 2GB - key: MODEL TYPE value: PREBUILT - key: MODE value: OFL labels: - Transformer - TensorRT-LLM - Audio - NVIDIA Validated config: architectures: - Other modelType: llama license: NVIDIA AI 
Foundation Models Community License - name: Boltz2 displayName: Boltz2 modelHubID: boltz2 category: Biology Foundation Model type: NGC description: Boltz-2 NIM is a next-generation structural biology foundation model that shows strong performance for both structure and affinity prediction. requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: Boltz2 modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-06-05T16:09:39.056Z",
    "description": "## Model Overview\n\n### Description:\n\nBoltz-2 NIM is a next-generation structural biology foundation model that shows strong performance for both structure and affinity prediction. Boltz-2 is the first deep learning model to approach the accuracy of free energy perturbation (FEP) methods in predicting binding affinities of small molecules and proteins\u2014achieving strong correlations on benchmarks while being nearly 1000\u00d7 more computationally efficient. Note that binding affinity is not yet available in the NIM, but will be available very soon!\n<br>\nKey Features: <br>\n **Trunk optimizations:** Mixed-precision (bfloat16) and trifast triangle attention cut runtime/memory; enables training with 768-token crops (as in AlphaFold3). <br>\n**Physical quality:** Integrates Boltz-steering at inference (Boltz-2x) to reduce steric clashes and stereochemistry errors without losing accuracy. <br>\n**Controllability:** <br>\n* **Method conditioning:** Steers predictions to resemble X-ray, NMR, or MD-style structures. <br>\n* **Template conditioning + steering:** Uses single or multimeric templates; supports strict template enforcement or soft guidance. <br>\n* **Contact/pocket conditioning:** Accepts distance constraints from experiments or expert priors. <br>\n\n**Affinity module:** PairFormer refines protein\u2013ligand and intra-ligand interactions; predicts both binding likelihood and a continuous affinity on log \u00b5M scale (trained on mixed Ki, Kd, IC50). Output is an IC50-like measure suitable for ranking. <br>\n**Key advances vs Boltz-1/1x:** Faster/more memory-efficient trunk, improved physical plausibility via integrated steering, markedly enhanced controllability, and added affinity prediction head. <br>\n\n\nThis NIM is ready for commercial use.\n<br>\n\n### Third-Party Community Consideration\n\nThis model is not owned or developed by NVIDIA. This model has been developed and built to a third-party\u2019s requirements for this application and use case.\n\n#### License / Terms of Use\n\nGOVERNING TERMS: This trial service is governed by the [NVIDIA API Trial Terms of Service](https://assets.ngc.nvidia.com/products/api-catalog/legal/NVIDIA%20API%20Trial%20Terms%20of%20Service.pdf). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Additional Information: [MIT](https://github.com/jwohlwend/boltz?tab=MIT-1-ov-file#readme).\n\n\n**You are responsible for ensuring that your use of NVIDIA AI Foundation Models complies with all applicable laws.**\n\n### Deployment Geography\nGlobal\n\n### Use Case\nBoltz-2 NIM enables researchers and commercial entities in the Drug Discovery, Life Sciences, and Digital Biology fields to predict the three-dimensional structure of biomolecular complexes and predict small-molecule binding affinities. 
Trained on millions of curated experimental datapoints with a novel training strategy tailored for noisy biochemical assay data, Boltz-2 demonstrates robust performance across hit-discovery, hit-to-lead, and lead optimization.\n\n### Release Date\nbuild.nvidia.com September 30, 2025 via [build.nvidia.com](https://build.nvidia.com/mit/boltz2)\n\nNGC September 30, 2025\n\n### References:\n\n```\n@article{wohlwend2024boltz,\n    title = {Boltz-1: Democratizing Biomolecular Interaction Modeling},\n    author = {Wohlwend, Jeremy and Corso, Gabriele and Passaro, Saro and Getz, Noah and Reveiz, Mateo and Leidal, Ken and Swiderski, Wojtek and Atkinson, Liam and Portnoi, Tally and Chinn, Itamar and Silterra, Jacob and Jaakkola, Tommi and Barzilay, Regina},\n    journal = {bioRxiv},\n    year = {2024},\n    doi = {10.1101/2024.11.19.624167},\n    language = \"en\"\n}\n```\n\n<br>\n\n### Model Architecture:\n\n**Architecture Type:** Four components \u2014 trunk, denoising module (with steering), confidence module, and a new affinity module <br>\n**Network Architecture:** PairFormer <br>\n\n**Input Type(s):** Biomolecular sequences (protein, DNA, RNA), ligand SMILES or CCD strings, molecular modifications, structural constraints, conditioning parameters, optional booleans <br>\n**Input Format(s):** Dictionary containing sequence strings, modification records, and constraint parameters <br>\n**Input Parameters:** Sequences (strings), predict_affinity(boolean), modifications (list of residue-specific changes), constraints (dictionary of structural parameters) <br>\n**Other Properties Related to Input:** Maximum sequence length of 4096 residues per chain. Maximum of 12 input polymers. Maximum of 20 input ligands. Passing boolean options such as predict_affinity will increase the runtime of the request. <br>\n**Model Parameters:**\nTables 1 and 2 record some of the hyperparameters of Boltz-2\u2019s architecture, training and inference procedures that differ from Boltz-1\u2019s and were not previously mentioned in the manuscript.  \n\n<table>\n<tr>\n<td>\n\n<b>Table 1:</b> Extra model architecture and training hyperparameters  \nthat differ from Boltz-1 and were not previously mentioned in the manuscript.  \n\n| Parameter | Value |\n|-----------|-------|\n| Max number of MSA sequences during training | 8192 |\n| Template pairwise dim | 64 |\n| Num template blocks | 2 |\n| Training diffusion multiplicity | 32 |\n| bfactor loss weight | 1 \u00d7 10\u207b\u00b3 |\n\n</td>\n<td>\n\n<b>Table 2:</b> Diffusion process hyperparameters  \nthat differ from Boltz-1, with the exception of sigma_min we opted for AlphaFold3\u2019s default hyperparameters, see Abramson et al. (2024) for more details.  \n\n| Parameter   | Value   |\n|-------------|---------|\n| sigma_min   | 0.0001  |\n| rho         | 7       |\n| gamma_0     | 0.8     |\n| gamma_min   | 1.0     |\n| noise_scale | 1.003   |\n| step_scale  | 1.5     |\n\n</td>\n</tr>\n</table>\n\n### Output:\n\n**Output Type(s):** Structure prediction in mmcif format; scores in numeric arrays; runtime metrics as a dictionary <br>\n**Output Format:** mmcif (text file); numeric arrays; scalar numeric values <br>\n**Output Parameters:** 3D atomic coordinates, predicted scores, and metadata <br>\n**Other Properties Related to Output:** All Boltz-2 scores are returned by default. Runtime metrics are optional. <br>\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. 
GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>\n\n### Software Integration:\n\n**Runtime Engine(s):**\n* PyTorch, TensorRT <br>\n\n**Supported Hardware Microarchitecture Compatibility:** <br>\n* NVIDIA Ampere, NVIDIA Hopper, NVIDIA Lovelace <br>\n\n**[Preferred/Supported] Operating System(s):** <br>\n* [Linux] <br>\n* The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.\n\n### Model Version(s):\n\nBoltz2 version 1.2 <br>\n\n## Training & Evaluation:\n\n### Training Dataset:\n** Data Modality <br>\n* [Text]\n\n\n**Link:** [Protein Data Bank as used by AlphaFold3](https://github.com/jwohlwend/boltz/blob/main/docs/training.md)  <br>\n** Data Collection Method by dataset <br>\n* Human <br>\n\n** Labeling Method by dataset <br>\n* Human <br>\n\n**Properties:**\nAll Protein Data Bank structures before 2021-09-30 with a resolution of at least 9 Angstroms, preprocessed to match each structure to its sequence. Ligands were processed similarly. All data was cleaned as described in AlphaFold3.\n\n### Evaluation Dataset:\n\n**Link:** [Boltz Evaluation Performed on 744 Structures from the Protein Data Bank](https://github.com/jwohlwend/boltz/blob/main/docs/evaluation.md)  <br>\n** Data Collection Method by dataset <br>\n* Human <br>\n\n** Labeling Method by dataset <br>\n* Hybrid: Human and Automated <br>\n\n**Properties:**\nThe test and validation datasets were generated by extensive filtering of PDB sequences deposited between 2021-09-31 and 2023-01-13. In total, 593 structures passed filters and were used for validation.\n<br>\n\n### Inference:\n\n**Acceleration Engine:** PyTorch, TensorRT <br>\n**Test Hardware:** <br>\n* NVIDIA A100 <br>\n* NVIDIA B200 <br>\n* NVIDIA L40 <br>\n* NVIDIA H100 <br>\n* NVIDIA RTX6000-Ada <br>\n* NVIDIA GB200 <br>\n\n### Ethical Considerations:\n\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Please report model quality, risk, security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).\n\n**You are responsible for ensuring for ensuring the physical properties of model-generated molecules are appropriately evaluated, and comply with applicable safety regulations and ethical standards.**\n\n# Get help\n## Enterprise Support\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).",
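# Illustrative sketch: the Boltz-2 card above describes a dictionary-style input of biomolecular
# sequences, ligand SMILES/CCD strings, and an optional predict_affinity boolean. The Python
# payload below is a hedged example of that shape only; the key names and any endpoint URL are
# assumptions, not the NIM's documented API, so consult the Boltz-2 NIM docs for the real schema.
#
#     payload = {
#         "polymers": [  # assumed key: protein/DNA/RNA sequences, one entry per chain
#             {"type": "protein", "sequence": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"},
#         ],
#         "ligands": [   # assumed key: SMILES or CCD identifiers
#             {"smiles": "CC(=O)OC1=CC=CC=C1C(=O)O"},
#         ],
#         "predict_affinity": False,  # documented boolean; enabling it increases request runtime
#     }
#     # e.g. requests.post(boltz2_nim_url, json=payload), where boltz2_nim_url points at your
#     # deployed Boltz-2 NIM; the exact path is deployment-specific.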
    "displayName": "Boltz-2",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NIM",
        "NSPECT-S9XG-L0SA",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "1.3.0-gpugb200_sm100_v1",
    "latestVersionSizeInBytes": 13089396959,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/boltz2.jpg",
    "modelFormat": "SAVED_MODEL",
    "name": "boltz2",
    "orgName": "nim",
    "precision": "FP16",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "mit",
    "safetyAndSecurity": "",
    "shortDescription": "Boltz-2 NIM is a next-generation structural biology foundation model that shows strong performance for both structure and affinity prediction.",
    "teamName": "mit",
    "updatedDate": "2025-10-30T05:12:32.317Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/mit/containers/boltz2 optimizationProfiles: - profileId: nim/mit/boltz2:1.3.0-gpuh100_sm90_v1 framework: TensorRT-LLM displayName: Boltz2 H100x1 SM90 V1 FP16 Trt ngcMetadata: 0901e344383119d8d4a5160d4d63933fd350e6aa92a56b925a77ecc32378d4a5: model: mit/boltz2 release: 1.3.0 tags: feat_lora: 'False' gpu: H100 gpu_device: 2330:10de nim_workspace_hash_v1: 41c01d6bb5cc24cc98a2e59b7a367d197951ad5684f03ed939a35c45dbafd514 number_of_gpus: '1' pp: '1' precision: fp16 profile: trt sm: '90' tp: '1' v: '1' modelFormat: trt-llm spec: - key: PROFILE value: TRT - key: PRECISION value: FP16 - key: GPU value: H100 - key: COUNT value: 1 - key: GPU DEVICE value: 2330:10DE - key: NIM VERSION value: 1.3.0 - key: DOWNLOAD SIZE value: 13GB - key: SM value: '90' - key: V value: '1' - profileId: nim/mit/boltz2:1.3.0-gpua100_sm80_v1 framework: TensorRT-LLM displayName: Boltz2 A100x1 SM80 V1 FP16 Trt ngcMetadata: 3fbc1c2eb885b24f631f8a1d0c58704cca1c7cf1cd2db4b791c7e6d7201aaa5c: model: mit/boltz2 release: 1.3.0 tags: feat_lora: 'False' gpu: A100 gpu_device: 20b2:10de nim_workspace_hash_v1: 1cbe7ff69de2ff435540d9ba81052291e9d9f6fdcd6c580b4db556b5e0adf542 number_of_gpus: '1' pp: '1' precision: fp16 profile: trt sm: '80' tp: '1' v: '1' modelFormat: trt-llm spec: - key: PROFILE value: TRT - key: PRECISION value: FP16 - key: GPU value: A100 - key: COUNT value: 1 - key: GPU DEVICE value: 20B2:10DE - key: NIM VERSION value: 1.3.0 - key: DOWNLOAD SIZE value: 13GB - key: SM value: '80' - key: V value: '1' - profileId: nim/mit/boltz2:1.3.0-gpurtx6000_ada_sm86_v1 framework: TensorRT-LLM displayName: Boltz2 RTX6000_ADAx1 SM86 V1 FP16 Trt ngcMetadata: a49907753de80032ade9659c95ef20be5e93af4166b4451531ecac61247ae4b3: model: mit/boltz2 release: 1.3.0 tags: feat_lora: 'False' gpu: RTX6000_ADA gpu_device: 26b1:10de nim_workspace_hash_v1: 8d2989817b603a609a1da17a59010e28c15dd22b51a047d48596d892ff4a1d1a number_of_gpus: '1' pp: '1' precision: fp16 profile: trt sm: '86' tp: '1' v: '1' modelFormat: trt-llm spec: - key: PROFILE value: TRT - key: PRECISION value: FP16 - key: GPU value: RTX6000_ADA - key: COUNT value: 1 - key: GPU DEVICE value: 26B1:10DE - key: NIM VERSION value: 1.3.0 - key: DOWNLOAD SIZE value: 13GB - key: SM value: '86' - key: V value: '1' - profileId: nim/mit/boltz2:1.3.0-gpul40s_sm89_v1 framework: TensorRT-LLM displayName: Boltz2 L40Sx1 SM89 V1 FP16 Trt ngcMetadata: baffed15a6e497b2a6a18437bca323a2f9c42f269d3f733a1b7fba0020eb9b02: model: mit/boltz2 release: 1.3.0 tags: feat_lora: 'False' gpu: L40S gpu_device: 26b9:10de nim_workspace_hash_v1: 41fcfb105bfe999210b6cf66cc28a8ecfaa4e513ca13b7b2616c88cd80094bf5 number_of_gpus: '1' pp: '1' precision: fp16 profile: trt sm: '89' tp: '1' v: '1' modelFormat: trt-llm spec: - key: PROFILE value: TRT - key: PRECISION value: FP16 - key: GPU value: L40S - key: COUNT value: 1 - key: GPU DEVICE value: 26B9:10DE - key: NIM VERSION value: 1.3.0 - key: DOWNLOAD SIZE value: 13GB - key: SM value: '89' - key: V value: '1' - profileId: nim/mit/boltz2:1.3.0-gpub200_sm100_v1 framework: TensorRT-LLM displayName: Boltz2 B200x1 SM100 V1 FP16 Trt ngcMetadata: ca0fa87fc9c52aea7475339eab9fbcec7637a304b261b4ba4450c085a7af4c4d: model: mit/boltz2 release: 1.3.0 tags: feat_lora: 'False' gpu: B200 gpu_device: 2901:10de nim_workspace_hash_v1: 9bf86f1e8426e435b6319e0e0d1b6e0a954aced6b59aaa4ed5075de4a7fc52d0 number_of_gpus: '1' pp: '1' precision: fp16 profile: trt sm: '100' tp: '1' v: '1' modelFormat: trt-llm spec: - key: PROFILE value: TRT - key: 
PRECISION value: FP16 - key: GPU value: B200 - key: COUNT value: 1 - key: GPU DEVICE value: 2901:10DE - key: NIM VERSION value: 1.3.0 - key: DOWNLOAD SIZE value: 13GB - key: SM value: '100' - key: V value: '1' - profileId: nim/mit/boltz2:1.3.0-gpugb200_sm100_v1 framework: TensorRT-LLM displayName: Boltz2 GB200x1 SM100 V1 FP16 Trt ngcMetadata: f27c6b6a5dc860324bcc06dcc0ae502d9840546188e69238ebf58376d26f0539: model: mit/boltz2 release: 1.3.0 tags: feat_lora: 'False' gpu: GB200 gpu_device: 2941:10de nim_workspace_hash_v1: 9b199cebb2c954abbc94f5ebf37b0052988a7aa8987617c9271605d5e2e2f0b5 number_of_gpus: '1' pp: '1' precision: fp16 profile: trt sm: '100' tp: '1' v: '1' modelFormat: trt-llm spec: - key: PROFILE value: TRT - key: PRECISION value: FP16 - key: GPU value: GB200 - key: COUNT value: 1 - key: GPU DEVICE value: 2941:10DE - key: NIM VERSION value: 1.3.0 - key: DOWNLOAD SIZE value: 13GB - key: SM value: '100' - key: V value: '1' labels: - Biology Foundation Model - signed images - NSPECT-D4IX-8I2O - NVIDIA AI Enterprise Supported - NVIDIA NIM config: architectures: - Other modelType: NIM license: NVIDIA AI Foundation Models Community License - name: GPT-OSS displayName: GPT-OSS modelHubID: gpt-oss category: Text Generation type: NGC description: The GPT-OSS NIM simplifies the deployment of the GPT-OSS-120B and GPT-OSS-20B tuned models which are optimized for language understanding, reasoning, and text generation use cases, and outperforms many of the available open source chat models on common industry benchmarks. requireLicense: true licenseAgreements: - label: Use Policy url: https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/ - label: License Agreement url: https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/ modelVariants: - variantId: GPT-OSS 120B modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-08-05T19:37:06.330Z",
    "description": "# GPT OSS 120B Overview\n\n## Description: <br>\nOpenAI releases the gpt-oss family of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. The family consists of the:\n- `gpt-oss-120b` \u2014 for production, general purpose, high reasoning use-cases that fits into a single H100 GPU (117B parameters with 5.1B active parameters)\n- `gpt-oss-20b` \u2014 for lower latency, and local or specialized use-cases (21B parameters with 3.6B active parameters).\n\nThe `gpt-oss-120b` model is architecturally designed as a Mixture-of-Experts (MoE) model. This model features SwiGLU activations and learned attention sinks within its architecture. It functions as a reasoning model, supporting capabilities such as chain-of-thought processing, adjustable reasoning effort levels, instruction following, and tool use. This model is text-only for both input and output modalities, enabling enterprises and governments to deploy it on-premises or in private cloud environments for enhanced data security and privacy.\n\nModel Highlights:  \n- **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk\u2014ideal for experimentation, customization, and commercial deployment.\n- **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.\n- **Full chain-of-thought:** Gain complete access to the model's reasoning process, facilitating easier debugging and increased trust in outputs. It's not intended to be shown to end users.\n- **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.\n- **Agentic capabilities:** Use the models' native capabilities for function calling, web browsing, python code execution, and structured outputs.\n\nThis model is ready for commercial/non-commercial use.\n\n\n## Third-Party Community Consideration <br>\nThis model is not owned or developed by NVIDIA. 
This model has been developed and built to a third-party\u2019s requirements for this application and use case; see link to Non-NVIDIA [gpt-oss-120b model card](https://huggingface.co/openai/gpt-oss-120b).\n\n\n### License and Terms of Use: <br>\nGOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).\nAdditional Information: [Apache License Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).\n\n**You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws**\n\n## Get Help\n\n### Enterprise Support\n\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n### NVIDIA NIM Documentation\n\nVisit the [NIM Container LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) page for release documentation, deployment guides and more.\n\n### Deployment Geography:\nGlobal\n\n### Use Case: <br>\nIntended for use as a reasoning model with features like chain-of-thought and adjustable reasoning effort levels. It supports instruction following and tool use, offering transparency, customization, and deployment flexibility for developers, researchers, and startups. Additionally, it enables enterprises and governments to deploy on-premises or in private clouds to ensure data security and privacy.\n\n### Release Date:  <br>\nBuild.NVIDIA.com - 08/05/2025 via [link](https://build.nvidia.com/openai/gpt-oss-120b) <br> \nHugging Face - 08/05/2025 via [link](https://huggingface.co/openai/gpt-oss-120b) <br>\n\n## Reference(s):\n- [OpenAI Cookbook](https://cookbook.openai.com/)\n- [OpenAI Cookbook -- Serving Model with TensorRT-LLM](https://cookbook.openai.com/articles/gpt-oss/run-nvidia)\n\n\n## Model Architecture: <br> \n**Architecture Type:** Transformer <br>\n**Network Architecture:** Mixture-of-Experts (MoE) <br>\n**Total Parameters:** 117B <br>\n**Active Parameters:** 5.1B <br>\n**Vocabulary Size:** 201,088 <br>\n\n\n## Input: <br>\n**Input Type(s):** Text <br>\n**Input Format(s):** String <br>\n**Input Parameters:** One Dimensional (1D) <br>\n**Other Properties Related to Input:** Uses RoPE with a 128k context length, with attention layers alternating between full context and a sliding 128-token window. Includes a learned attention sink per-head. Employs SwiGLU activations in the MoE layers, and the router performs a Top-K operation (K=4) followed by a Sigmoid function. GEMMs in the MoE include a per-expert bias. Utilizes tiktoken for tokenization. Input Context Length (ISL): 128000 <br>\n\n## Output: <br>\n**Output Type(s):** Text <br>\n**Output Format:** String <br>\n**Output Parameters:** One Dimensional (1D) <br>\n**Other Properties Related to Output:** The model is designed to be compatible with the OpenAI Responses API and supports Structured Output. <br> \n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA\u2019s hardware (e.g. 
GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>   \n\n## Software Integration: <br>\n**Runtime Engine(s):** <br>\n* NeMo Framework (based on 25.07)<br>\n\n\n**Supported Hardware Microarchitecture Compatibility:** <br>\n* NVIDIA Blackwell: B200 <br>\n* NVIDIA Hopper: H200\n\n\n**Operating System(s):** Linux \n\n## Model Version(s): \n`gpt-oss-120b` v1.0 (August 5, 2025)\n\n\n## Training, Testing, and Evaluation Datasets: <br>   \n### Training Dataset:\n\n* **Training Data Collection:** Undisclosed <br>\n* **Training Labeling:** Undisclosed <br>\n* **Training Properties:** The model has approximately 117 billion parameters. Weights for all layers are in BF16, except for MoE projection weights, which are in MXFP4. The reference implementation currently upcasts all weights to BF16. Activations are expected to be in BF16 or FP8.\n\n\n### Testing Dataset:\n* **Testing Data Collection:** Undisclosed <br>\n* **Testing Labeling:** Undisclosed <br>\n* **Testing Properties:** The model is tested against benchmarks such as MMLU and GPQA, among others including LiveCodeBench, AIME 2024, and MATH-500. \n\n### Evaluation Dataset:\n\n* **Evaluation Data Collection:** Undisclosed <br>\n* **Evaluation Labeling:** Undisclosed <br>\n* **Evaluation Benchmark Score:** \n\n| Benchmark  | gpt-oss-120b | gpt-oss-20b |\n|----------|-----------| -----------|\n| AIME 2024 (no tools) | 95.8   | 92.1 |\n| AIME 2024 (with tools) | 96.6 | 96.0 |\n| AIME 2025 (no tools) | 92.5  | 91.7 |\n| AIME 2025 (with tools) | 97.9 | 98.7 |\n| GPQA Diamond (no tools) | 80.1 | 71.5 |\n| GPQA Diamond (with tools) | 80.9 | 74.2 |\n| HLE (no tools) | 14.9 | 10.9 |\n| HLE (with tools) | 19.0 | 17.3 |\n| MMLU | 90.0 | 85.3 |\n| SWE-Bench Verified | 62.4 | 60.7 |\n| Tau-Bench Retail | 67.8 | 54.4 |\n| Tau-Bench Airline | 49.2 | 38.0 |\n| Aider Polyglot | 44.4 | 34.2 |\n| MMMLU (Average) | 81.3 | 75.6 |\n| HealthBench | 57.6 | 42.5 |\n| HealthBench Hard | 30.0 | 10.8 |\n| HealthBench Consensus | 89.9 | 82.6 |\n| Codeforces (no tools) [elo] | 2463 | 2230 |\n| Codeforces (with tools) [elo] | 2622 | 2516 |\n\nAbove scores were measured for the high reasoning level.\n\n### Safety Results:\n\nThe following evaluations check that the model does not comply with requests for content that is\ndisallowed under OpenAI\u2019s safety policies, including hateful content or illicit advice.\n\n| Category  | gpt-oss-120b | gpt-oss-20b |\n|----------|-----------| -----------|\n| hate (aggregate) | 0.996   | 0.996 |\n| self-harm/intent and selfharm/instructions | 0.995 | 0.984 |\n| personal data/semi restrictive | 0.967  | 0.947 |\n| sexual/exploitative | 1.000 | 0.980 |\n| sexual/minors | 1.000 | 0.971 |\n| illicit/non-violent | 1.000 | 0.983 |\n| illicit/violent | 1.000 | 1.000 |\n| personal data/restricted | 0.996 | 0.978 |\n\n## Inference:\n**Acceleration Engine:** vLLM <br>\n**Test Hardware:** NVIDIA Hopper: B200 <br>\n\n\n## Additional Details\nThe model is released with the native quantization support. Specifically, [MXFP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) is used for the linear projection weights in the MoE layer. It is stored the MoE tensor in two parts:\n\n- `tensor.blocks` stores the actual fp4 values. Every two values are packed in one `uint8` value.\n- `tensor.scales` stores the block scale. 
The block scaling is done along the last dimension for all MXFP4 tensors.\n\nAll other tensors are stored in BF16. It is recommended to use BF16 as the activation precision for the model.\n\n## Ethical Considerations:\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.  \n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).",
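# Illustrative sketch: the GPT-OSS card above stores MoE projection weights as MXFP4, with two
# fp4 codes packed per uint8 in `tensor.blocks` and per-block scales in `tensor.scales` applied
# along the last dimension. The NumPy snippet below shows one way to unpack that layout, assuming
# E2M1 fp4 element encoding and 32-element blocks (per the OCP MX spec) and that the first element
# of each pair sits in the low nibble; real checkpoints may differ (e.g. scales stored as E8M0
# exponents rather than floats).
#
#     import numpy as np
#
#     # The 16 E2M1 code points: sign bit plus magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6}.
#     E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
#                      -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0], dtype=np.float32)
#
#     def unpack_mxfp4(blocks, scales, block_size=32):
#         lo = E2M1[blocks & 0x0F]          # assumed: first value in the low nibble
#         hi = E2M1[(blocks >> 4) & 0x0F]
#         vals = np.stack([lo, hi], axis=-1).reshape(blocks.shape[:-1] + (-1,))
#         grouped = vals.reshape(vals.shape[:-1] + (-1, block_size))
#         return (grouped * scales[..., None]).reshape(vals.shape)  # scale along the last dim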
    "displayName": "GPT-OSS-120B",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-EEZS-7JBM",
        "Signed Models",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "hf-8b193b0-nim",
    "latestVersionSizeInBytes": 65276859875,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/gpt-oss-120b.jpg",
    "modelFormat": "SavedModel",
    "name": "gpt-oss-120b",
    "orgName": "nim",
    "precision": "OTHER",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "OpenAI",
    "safetyAndSecurity": "",
    "shortDescription": "OpenAI releases the gpt-oss family of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.",
    "teamName": "openai",
    "updatedDate": "2025-09-04T20:15:18.748Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/openai/containers/gpt-oss-120b optimizationProfiles: - profileId: nim/openai/gpt-oss-120b:hf-8b193b0-nim framework: VLLM displayName: GPT-OSS 120B Generic NVIDIA GPUx8 MXFP4 ngcMetadata: 650450b7f0c9fb164c4e7e03fca53a2e781718930eb23d23b730ffaff2056685: model: openai/gpt-oss-120b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: 8d1357e1888e26523f732140e20c1562434517e6f8e5fa12bc9a67bebf202d33 pp: '1' precision: mxfp4 tp: '8' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 8 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 61GB - key: LLM ENGINE value: VLLM - profileId: nim/openai/gpt-oss-120b:hf-8b193b0-nim framework: VLLM displayName: GPT-OSS 120B Generic NVIDIA GPUx4 MXFP4 ngcMetadata: 9af7e80ca3e26c05e61e22b2f1f88314f03964a30b1f5ebdbe103704d5e48d8f: model: openai/gpt-oss-120b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: 8d1357e1888e26523f732140e20c1562434517e6f8e5fa12bc9a67bebf202d33 pp: '1' precision: mxfp4 tp: '4' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 4 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 61GB - key: LLM ENGINE value: VLLM - profileId: nim/openai/gpt-oss-120b:hf-8b193b0-nim framework: VLLM displayName: GPT-OSS 120B Generic NVIDIA GPUx2 MXFP4 ngcMetadata: b8a95a1d502de2bd02c311f4b590ee8b645eaf4b93584c75314d80a4fd719c57: model: openai/gpt-oss-120b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: 8d1357e1888e26523f732140e20c1562434517e6f8e5fa12bc9a67bebf202d33 pp: '1' precision: mxfp4 tp: '2' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 2 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 61GB - key: LLM ENGINE value: VLLM - profileId: nim/openai/gpt-oss-120b:hf-8b193b0-nim framework: VLLM displayName: GPT-OSS 120B Generic NVIDIA GPUx1 MXFP4 ngcMetadata: fc1df044c94b466d0ebd561df47556bc23a01ac8147d68dc49f04238a6cfcd7f: model: openai/gpt-oss-120b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: 8d1357e1888e26523f732140e20c1562434517e6f8e5fa12bc9a67bebf202d33 pp: '1' precision: mxfp4 tp: '1' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 1 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 61GB - key: LLM ENGINE value: VLLM - variantId: GPT-OSS 20B modelCard: {
    "accessType": "NOT_LISTED",
    "application": "Other",
    "bias": "",
    "canGuestDownload": false,
    "createdDate": "2025-08-05T19:37:12.739Z",
    "description": "# GPT OSS 20B Overview\n\n## Description: <br>\nOpenAI releases the gpt-oss family of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. The family consists of the:\n- `gpt-oss-120b` \u2014 for production, general purpose, high reasoning use-cases that fits into a single H100 GPU (117B parameters with 5.1B active parameters)\n- `gpt-oss-20b` \u2014 for lower latency, and local or specialized use-cases (21B parameters with 3.6B active parameters).\n\nThe `gpt-oss-20b` is designed as a Mixture-of-Experts (MoE) model, structurally identical to the larger 117B variant, albeit with different hyperparameters. This model leverages SwiGLU activations and incorporates learned attention sinks within its architecture. Functionally, it serves as a robust reasoning model, supporting advanced capabilities such as chain-of-thought processing, adjustable reasoning effort levels, instruction following, and tool use. It operates strictly with text-only modalities for both input and output. A key strategic benefit is its suitability for enterprises and governments, facilitating on-premises or private cloud deployment to ensure enhanced data security and privacy.\n\nModel Highlights:  \n- **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk\u2014ideal for experimentation, customization, and commercial deployment.\n- **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.\n- **Full chain-of-thought:** Gain complete access to the model's reasoning process, facilitating easier debugging and increased trust in outputs. It's not intended to be shown to end users.\n- **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.\n- **Agentic capabilities:** Use the models' native capabilities for function calling, web browsing, python code execution, and structured outputs.\n\nThis model is ready for commercial/non-commercial use.\n\n## Third-Party Community Consideration <br>\nThis model is not owned or developed by NVIDIA. 
This model has been developed and built to a third-party\u2019s requirements for this application and use case; see link to Non-NVIDIA [gpt-oss-20b model card](https://huggingface.co/openai/gpt-oss-20b).\n\n### License and Terms of Use: <br>\n\nGOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and the [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/); and the use of this model is governed by the [NVIDIA Community Model License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/).\nAdditional Information: [Apache License Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).\n\n**You are responsible for ensuring that your use of NVIDIA provided models complies with all applicable laws**\n\n## Get Help\n\n### Enterprise Support\n\nGet access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).\n\n### NVIDIA NIM Documentation\n\nVisit the [NIM Container LLM](https://docs.nvidia.com/nim/large-language-models/latest/introduction.html) page for release documentation, deployment guides and more.\n\n\n### Deployment Geography:\nGlobal\n\n### Use Case: <br>\nIntended for use as a reasoning model, offering features like chain-of-thought and adjustable reasoning effort levels. It provides comprehensive support for instruction following and tool use, fostering transparency, customization, and deployment flexibility for developers, researchers, and startups. Crucially, it enables enterprises and governments to deploy on-premises or in private clouds, ensuring stringent data security and privacy requirements are met.\n\n### Release Date:  <br>\nBuild.NVIDIA.com - 08/05/2025 via [link](https://build.nvidia.com/openai/gpt-oss-20b) <br> \nHugging Face - 08/05/2025 via [link](https://huggingface.co/openai/gpt-oss-20b) <br>\n\n## Reference(s):\n- [OpenAI Cookbook](https://cookbook.openai.com/)\n- [OpenAI Cookbook -- Serving Model with TensorRT-LLM](https://cookbook.openai.com/articles/gpt-oss/run-nvidia)\n\n\n## Model Architecture: <br> \n**Architecture Type:** Transformer <br>\n**Network Architecture:** Mixture-of-Experts (MoE) <br>\n**Total Parameters:** 21B <br>\n**Active Parameters:** 3.6B <br>\n**Vocabulary Size:** 201,088 (Utilizes the standard tokenizer used by GPT-4o) <br>\n\n\n## Input: <br>\n**Input Type(s):** Text <br>\n**Input Format(s):** String <br>\n**Input Parameters:** One Dimensional (1D) <br>\n**Other Properties Related to Input:** Uses RoPE with a 128k context length, with attention layers alternating between full context and a sliding 128-token window. Includes a learned attention sink per-head. Employs SwiGLU activations in the MoE layers, and the router performs a Top-K operation (K=4) followed by a Sigmoid function. GEMMs in the MoE include a per-expert bias. Utilizes tiktoken for tokenization. Input Context Length (ISL): 128000 <br>\n\n## Output: <br>\n**Output Type(s):** Text <br>\n**Output Format:** String <br>\n**Output Parameters:** One Dimensional (1D) <br>\n**Other Properties Related to Output:** The model is architected to be compatible with the OpenAI Responses API and supports Structured Output, aligning with key partner expectations for advanced response formatting. 
<br> \n\nOur AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems [or name equivalent hardware preference]. By leveraging NVIDIA\u2019s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>   \n\n## Software Integration: <br>\n**Runtime Engine(s):** <br>\n* NeMo Framework (based on 25.07)<br>\n\n\n**Supported Hardware Microarchitecture Compatibility:** <br>\n* NVIDIA Blackwell: B200 <br>\n* NVIDIA Hopper: H200\n\n\n**Operating System(s):** Linux \n\n## Model Version(s): \n`gpt-oss-20b` v1.0 (August 5, 2025)\n\n\n## Training, Testing, and Evaluation Datasets: <br>   \n### Training Dataset:\n\n* **Training Data Collection:** Undisclosed <br>\n* **Training Labeling:** Undisclosed <br>\n* **Training Properties:** The gpt-oss-20b model has approximately 20 billion total parameters, with approximately 4 billion active parameters per inference. The weights for all layers are in BF16, except for the MoE projection weights, which are in MXFP4. The reference implementation, for initial accuracy validation, currently upcasts all weights to BF16. Activations are expected to be in BF16 or FP8.\n\n\n### Testing Dataset:\n* **Testing Data Collection:** Undisclosed <br>\n* **Testing Labeling:** Undisclosed <br>\n* **Testing Properties:** The model's performance is tested against recognized benchmarks such as MMLU (Massive Multitask Language Understanding) and GPQA (General Purpose Question Answering), alongside other benchmarks including LiveCodeBench, AIME 2024, and MATH-500 \n\n### Evaluation Dataset:\n\n* **Evaluation Data Collection:** Undisclosed <br>\n* **Evaluation Labeling:** Undisclosed <br>\n* **Evaluation Benchmark Score:** \n\n| Benchmark  | gpt-oss-120b | gpt-oss-20b |\n|----------|-----------| -----------|\n| AIME 2024 (no tools) | 95.8   | 92.1 |\n| AIME 2024 (with tools) | 96.6 | 96.0 |\n| AIME 2025 (no tools) | 92.5  | 91.7 |\n| AIME 2025 (with tools) | 97.9 | 98.7 |\n| GPQA Diamond (no tools) | 80.1 | 71.5 |\n| GPQA Diamond (with tools) | 80.9 | 74.2 |\n| HLE (no tools) | 14.9 | 10.9 |\n| HLE (with tools) | 19.0 | 17.3 |\n| MMLU | 90.0 | 85.3 |\n| SWE-Bench Verified | 62.4 | 60.7 |\n| Tau-Bench Retail | 67.8 | 54.4 |\n| Tau-Bench Airline | 49.2 | 38.0 |\n| Aider Polyglot | 44.4 | 34.2 |\n| MMMLU (Average) | 81.3 | 75.6 |\n| HealthBench | 57.6 | 42.5 |\n| HealthBench Hard | 30.0 | 10.8 |\n| HealthBench Consensus | 89.9 | 82.6 |\n| Codeforces (no tools) [elo] | 2463 | 2230 |\n| Codeforces (with tools) [elo] | 2622 | 2516 |\n\nAbove scores were measured for the high reasoning level.\n\n### Safety Results:\n\nThe following evaluations check that the model does not comply with requests for content that is\ndisallowed under OpenAI\u2019s safety policies, including hateful content or illicit advice.\n\n| Category  | gpt-oss-120b | gpt-oss-20b |\n|----------|-----------| -----------|\n| hate (aggregate) | 0.996   | 0.996 |\n| self-harm/intent and selfharm/instructions | 0.995 | 0.984 |\n| personal data/semi restrictive | 0.967  | 0.947 |\n| sexual/exploitative | 1.000 | 0.980 |\n| sexual/minors | 1.000 | 0.971 |\n| illicit/non-violent | 1.000 | 0.983 |\n| illicit/violent | 1.000 | 1.000 |\n| personal data/restricted | 0.996 | 0.978 |\n\n## Inference:\n**Acceleration Engine:** vLLM <br>\n**Test Hardware:** NVIDIA Hopper (H200) <br>\n\n\n## Additional Details\nThe model is released with the native quantization support. 
Specifically, [MXFP4](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) is used for the linear projection weights in the MoE layer. The MoE tensor is stored in two parts:\n\n- `tensor.blocks` stores the actual fp4 values. Every two values are packed in one `uint8` value.\n- `tensor.scales` stores the block scale. The block scaling is done along the last dimension for all MXFP4 tensors.\n\nAll other tensors are stored in BF16. It is recommended to use BF16 as the activation precision for the model.\n\n## Ethical Considerations:\nNVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.  \n\nPlease report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).",
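# Illustrative sketch: the GPT-OSS cards describe the MoE router as a Top-K selection (K=4)
# followed by a Sigmoid, rather than the usual softmax. The NumPy snippet below shows that
# gating computation for a batch of per-token router logits; whether the resulting gate
# weights are renormalised afterwards is not stated in the card, so treat this as a
# shape-level illustration only.
#
#     import numpy as np
#
#     def route_tokens(router_logits, k=4):
#         # router_logits: array of shape (..., num_experts) with one score per expert
#         topk_idx = np.argsort(router_logits, axis=-1)[..., -k:]            # k best experts
#         topk_logits = np.take_along_axis(router_logits, topk_idx, axis=-1)
#         gates = 1.0 / (1.0 + np.exp(-topk_logits))                         # sigmoid gate weights
#         return topk_idx, gates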
    "displayName": "GPT-OSS-20B",
    "explainability": "",
    "framework": "Other",
    "hasPlayground": false,
    "hasSignedVersion": true,
    "isPlaygroundEnabled": false,
    "isPublic": false,
    "isReadOnly": true,
    "labels": [
        "NSPECT-SW5U-LCYW",
        "Signed Models",
        "nvaie:model:nvaie_supported",
        "nvidia_nim:model:nimmcro_nvidia_nim",
        "productNames:nim-dev",
        "productNames:nv-ai-enterprise"
    ],
    "latestVersionIdStr": "hf-d666cf3-nim",
    "latestVersionSizeInBytes": 13789263951,
    "logo": "https://assets.ngc.nvidia.com/products/api-catalog/images/gpt-oss-20b.jpg",
    "modelFormat": "SavedModel",
    "name": "gpt-oss-20b",
    "orgName": "nim",
    "precision": "OTHER",
    "privacy": "",
    "productNames": [
        "nim-dev",
        "nv-ai-enterprise"
    ],
    "publicDatasetUsed": {},
    "publisher": "OpenAI",
    "safetyAndSecurity": "",
    "shortDescription": "OpenAI releases the gpt-oss family of open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases.",
    "teamName": "openai",
    "updatedDate": "2025-09-04T20:15:14.370Z"
} source: URL: https://catalog.ngc.nvidia.com/orgs/nim/teams/openai/containers/gpt-oss-20b optimizationProfiles: - profileId: nim/openai/gpt-oss-20b:hf-d666cf3-nim framework: VLLM displayName: GPT-OSS 20B Generic NVIDIA GPUx4 MXFP4 ngcMetadata: 653e98d21f9274306416d736519e1c0442d9dad9d8756ff1134cbededfd43323: model: openai/gpt-oss-20b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: bef4d428df8c3e67ebe56ba2050a0f50216e82c0172407b43c99c1f6befc9fc5 pp: '1' precision: mxfp4 tp: '4' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 4 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 13GB - key: LLM ENGINE value: VLLM - profileId: nim/openai/gpt-oss-20b:hf-d666cf3-nim framework: VLLM displayName: GPT-OSS 20B Generic NVIDIA GPUx8 MXFP4 ngcMetadata: 66b8ec445352535aa8c640435d6f7b00fb2cabb70f8d39fc371adb00322907df: model: openai/gpt-oss-20b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: bef4d428df8c3e67ebe56ba2050a0f50216e82c0172407b43c99c1f6befc9fc5 pp: '1' precision: mxfp4 tp: '8' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 8 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 13GB - key: LLM ENGINE value: VLLM - profileId: nim/openai/gpt-oss-20b:hf-d666cf3-nim framework: VLLM displayName: GPT-OSS 20B Generic NVIDIA GPUx1 MXFP4 ngcMetadata: 66fb3113efd2aae1b0a3bfa2a375de5fe1cc1b557abac4eb271730482a26ae8e: model: openai/gpt-oss-20b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: bef4d428df8c3e67ebe56ba2050a0f50216e82c0172407b43c99c1f6befc9fc5 pp: '1' precision: mxfp4 tp: '1' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 1 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 13GB - key: LLM ENGINE value: VLLM - profileId: nim/openai/gpt-oss-20b:hf-d666cf3-nim framework: VLLM displayName: GPT-OSS 20B Generic NVIDIA GPUx2 MXFP4 ngcMetadata: c3035169e189674226b284a07173f495b6ce13f2a06d5ea204f1e505c2fac2be: model: openai/gpt-oss-20b release: 1.12.4 tags: feat_lora: 'false' llm_engine: vllm nim_workspace_hash_v1: bef4d428df8c3e67ebe56ba2050a0f50216e82c0172407b43c99c1f6befc9fc5 pp: '1' precision: mxfp4 tp: '2' modelFormat: vllm spec: - key: PRECISION value: MXFP4 - key: COUNT value: 2 - key: NIM VERSION value: 1.12.4 - key: DOWNLOAD SIZE value: 13GB - key: LLM ENGINE value: VLLM labels: - OpenAI - signed images - NSPECT-LJGD-9W15 - NVIDIA AI Enterprise Supported - NVIDIA NIM config: architectures: - Other modelType: NIM license: NVIDIA AI Foundation Models Community License - name: Gemma 2 displayName: Gemma 2 modelHubID: gemma-2 category: Text Generation type: HF description: Gemma 2 the second generation of the Google community Gemma lineage. Gemma 2 is improved with higher performance with significant safety improvements and well-suited for a variety of text generation tasks, including question answering, summarization, and reasoning. 
modelVariants: - variantId: Gemma 2 9B displayName: Gemma 2 9B source: URL: https://huggingface.co/google/gemma-2-9b requireToken: true requireLicense: true licenseAgreements: - label: License Agreement url: https://ai.google.dev/gemma/terms - label: Use Policy url: https://ai.google.dev/gemma/prohibited_use_policy optimizationProfiles: - profileId: google/gemma-2-9b displayName: Gemma 2 9b A10G framework: vllm sha: vllm modelFormat: vllm spec: - key: GPU value: A10G - key: COUNT value: 1 - profileId: google/gemma-2-9b displayName: Gemma 2 A100 framework: vllm sha: vllm modelFormat: vllm spec: - key: GPU value: A100 - key: COUNT value: 1 - profileId: google/gemma-2-9b displayName: Gemma 2 9b L40S framework: vllm sha: vllm modelFormat: vllm spec: - key: GPU value: L40S - key: COUNT value: 1 labels: - google - Gemma - "Text Generation" - "Multilingual support" config: architectures: - Gemma2ForCausalLM modelType: Gemma2 license: gemma - name: Llama 3 SQLCoder displayName: Llama 3 SQLCoder modelHubID: llama-3-sqlcoder-8b category: Text Generation type: HF description: A capable language model for text to SQL generation for Postgres, Redshift and Snowflake that is on-par with the most capable generalist frontier models. modelVariants: - variantId: Llama 3 SQLCoder 8B displayName: Llama 3 SQLCoder 8B source: URL: https://huggingface.co/defog/llama-3-sqlcoder-8b requireToken: false requireLicense: false licenseAgreements: - label: License Agreement url: https://choosealicense.com/licenses/cc-by-sa-4.0/ optimizationProfiles: - profileId: defog/llama-3-sqlcoder-8b displayName: Llama 3 SQLCoder 8B A10G framework: vllm sha: vllm modelFormat: vllm spec: - key: GPU value: A10G - key: COUNT value: 1 - profileId: defog/llama-3-sqlcoder-8b displayName: Llama 3 SQLCoder 8B A100 framework: vllm sha: vllm modelFormat: vllm spec: - key: GPU value: A100 - key: COUNT value: 1 - profileId: defog/llama-3-sqlcoder-8b displayName: Llama 3 SQLCoder 8B L40S framework: vllm sha: vllm modelFormat: vllm spec: - key: GPU value: L40S - key: COUNT value: 1 labels: - Llama - "Text To SQL" - "Code Generation" - "Fine Tuned" config: architectures: - LlamaForCausalLM modelType: llama license: Creative Commons Attribution Share Alike 4.0
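# Illustrative sketch: each model entry in this registry lists modelVariants, and each variant
# lists optimizationProfiles whose spec is a list of {key, value} pairs (GPU, COUNT, PRECISION,
# and so on). A downstream tool could filter profiles for a given GPU as in the Python sketch
# below (PyYAML); the "registry.yaml" path is just a placeholder for wherever this file lives.
#
#     import yaml
#
#     def profiles_for_gpu(registry_path, gpu):
#         with open(registry_path) as f:
#             registry = yaml.safe_load(f)
#         matches = []
#         for model in registry.get("models", []):
#             for variant in model.get("modelVariants", []):
#                 for profile in variant.get("optimizationProfiles", []):
#                     spec = {s["key"]: s["value"] for s in profile.get("spec", [])}
#                     if spec.get("GPU") == gpu:
#                         matches.append((model["name"], profile["profileId"]))
#         return matches
#
#     # e.g. profiles_for_gpu("registry.yaml", "H100")
#     # -> [("Boltz2", "nim/mit/boltz2:1.3.0-gpuh100_sm90_v1"), ...]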