---
description: "Comprehensive system, hardware, and software requirements for deploying NeMo Curator in production environments"
categories: ["reference"]
tags: ["requirements", "system-requirements", "hardware", "software", "gpu", "storage"]
personas: ["admin-focused", "devops-focused"]
difficulty: "reference"
content_type: "reference"
modality: "universal"
---

# Production Deployment Requirements

This page details the comprehensive system, hardware, and software requirements for deploying NeMo Curator in production environments.

## System Requirements

- **Operating System**: Ubuntu 22.04/20.04 (recommended)
- **Python**: Python 3.10, 3.11, or 3.12
  - packaging >= 22.0

**Python 3.10 support will be removed in NeMo Curator 26.06.** 26.04 is the last release to support Python 3.10. Standardize production environments on a newer supported Python version (3.11+) before upgrading to 26.06. See the [26.04 release notes](/about/release-notes) for details.

## Hardware Requirements

### CPU Requirements

- Multi-core CPU with sufficient cores for parallel processing
- **Memory**: Minimum 16GB RAM recommended for text processing
  - For large datasets: 32GB+ RAM recommended
  - Memory requirements scale with dataset size and number of workers

### GPU Requirements (Optional but Recommended)

- **GPU**: NVIDIA GPU with Volta™ architecture or higher
  - Compute capability 7.0+ required
- **Memory**: Minimum 16GB VRAM for GPU-accelerated operations
  - For video processing: 21GB+ VRAM (reducible with optimization)
  - For large-scale deduplication: 32GB+ VRAM recommended
- **CUDA**: CUDA 12.0 or above with compatible drivers

## Software Dependencies

### Core Dependencies

- Python 3.10+ with required packages for distributed computing
- RAPIDS libraries (cuDF) for GPU-accelerated deduplication operations

### Container Support (Recommended)

- **Docker** or **Podman** for containerized deployment
- Access to NVIDIA NGC registry for official containers

## Network Requirements

- Reliable network connectivity between nodes
- High-bandwidth network for large dataset transfers
- InfiniBand recommended for multi-node GPU clusters

## Storage Requirements

- **Capacity**: Storage capacity should be 3-5x the size of input datasets
  - Input data storage
  - Intermediate processing files
  - Output data storage
- **Performance**: High-throughput storage system recommended
  - SSD storage preferred for frequently accessed data
  - Parallel filesystem for multi-node access

## Deployment-Specific Requirements

- Resource quotas configured for GPU and memory allocation

## Performance Considerations

### Memory Management

- Monitor memory usage across distributed workers
- Configure appropriate memory limits per worker
- Use memory-efficient data formats (e.g., Parquet)

### GPU Optimization

- Ensure CUDA drivers are compatible with RAPIDS versions
- Configure GPU memory pools (RMM) for optimal performance
- Monitor GPU utilization and memory usage

### Network Optimization

- Use high-bandwidth interconnects for multi-node deployments
- Configure appropriate network protocols (TCP vs UCX)
- Optimize data transfer patterns to minimize network overhead
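
The Python version, GPU, and storage figures on this page can be checked before deployment with a short preflight script. The following is a minimal sketch, not part of NeMo Curator: every function name is hypothetical, the 15,000 MiB VRAM tolerance is an assumption (cards sold as 16GB usually report slightly less), and the `compute_cap` query field is assumed to be available in the installed `nvidia-smi`. On hosts without `nvidia-smi`, or with a driver that rejects the query, `query_gpus` returns an empty list.

```python
import shutil
import subprocess
import sys

# Values taken from this page; the VRAM tolerance is an assumption.
SUPPORTED_PYTHON = {(3, 10), (3, 11), (3, 12)}
MIN_COMPUTE_CAPABILITY = (7, 0)  # Volta or newer
MIN_VRAM_MIB = 15_000            # assumed tolerance for cards sold as 16GB


def python_supported(version=None):
    """Return True if the (major, minor) version is listed as supported."""
    major, minor = (version or sys.version_info)[:2]
    return (major, minor) in SUPPORTED_PYTHON


def parse_compute_cap(text):
    """Turn a string such as '8.6' into the tuple (8, 6)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def gpu_meets_minimum(compute_cap, vram_mib):
    """Check one GPU against the compute capability and VRAM minimums."""
    return (
        parse_compute_cap(compute_cap) >= MIN_COMPUTE_CAPABILITY
        and vram_mib >= MIN_VRAM_MIB
    )


def storage_range_bytes(input_bytes):
    """Return the (low, high) capacity implied by the 3-5x guideline."""
    return 3 * input_bytes, 5 * input_bytes


def query_gpus():
    """Return [(name, compute_cap, vram_mib), ...], or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,compute_cap,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        # Split from the right so a comma in the GPU name does not break parsing.
        parts = [part.strip() for part in line.rsplit(",", 2)]
        if len(parts) != 3:
            continue
        name, cap, mem = parts
        try:
            parse_compute_cap(cap)  # skip rows that report no usable value
            gpus.append((name, cap, int(mem)))
        except ValueError:
            continue
    return gpus


if __name__ == "__main__":
    print("Python supported:", python_supported())
    for name, cap, vram in query_gpus():
        print(name, "meets GPU minimum:", gpu_meets_minimum(cap, vram))
    low, high = storage_range_bytes(2 * 10**12)  # example: 2 TB of input
    print(f"Storage for 2 TB input: {low / 10**12:.0f}-{high / 10**12:.0f} TB")
```

The checks cover only the baseline figures. The higher VRAM recommendations for video processing and large-scale deduplication, and driver compatibility with RAPIDS, still need to be verified against the workload.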