---
description: "Configure LLM clients for synthetic data generation with NVIDIA APIs or custom endpoints"
categories: ["how-to-guides"]
tags: ["llm-client", "openai", "nvidia-api", "configuration"]
personas: ["data-scientist-focused", "mle-focused"]
difficulty: "beginner"
content_type: "how-to"
modality: "text-only"
---

# LLM Client Configuration

NeMo Curator's synthetic data generation uses OpenAI-compatible clients to communicate with LLM inference servers. This guide covers client configuration, performance tuning, and integration with various endpoints.

## Overview

Two client types are available:

- **`AsyncOpenAIClient`**: Recommended for high-throughput batch processing with concurrent requests
- **`OpenAIClient`**: Synchronous client for simpler use cases or debugging

For most synthetic data generation (SDG) workloads, use `AsyncOpenAIClient` to maximize throughput.

## Basic Configuration

### NVIDIA API Endpoints

```python
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

client = AsyncOpenAIClient(
    api_key="your-nvidia-api-key",  # Placeholder; in practice, read it from the NVIDIA_API_KEY env var
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
)
```

### Environment Variables

Set your API key as an environment variable to avoid hardcoding credentials:

```bash
export NVIDIA_API_KEY="nvapi-..."
```

The underlying OpenAI client automatically uses the `OPENAI_API_KEY` environment variable if no `api_key` is provided.
For NVIDIA APIs, explicitly pass the key:

```python
import os

client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],
    base_url="https://integrate.api.nvidia.com/v1",
)
```

## Generation Parameters

Configure LLM generation behavior using `GenerationConfig`:

```python
from nemo_curator.models.client.llm_client import GenerationConfig

config = GenerationConfig(
    max_tokens=2048,
    temperature=0.7,
    top_p=0.95,
    seed=42,  # For reproducibility
)
```

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `max_tokens` | int | 2048 | Maximum tokens to generate per request |
| `temperature` | float | 0.0 | Sampling temperature (0.0-2.0). Higher values increase randomness |
| `top_p` | float | 0.95 | Nucleus sampling parameter (0.0-1.0) |
| `top_k` | int | None | Top-k sampling (if supported by the endpoint) |
| `seed` | int | 0 | Random seed for reproducibility |
| `stop` | str/list | None | Stop sequences to end generation |
| `stream` | bool | False | Enable streaming (not recommended for batch processing) |
| `n` | int | 1 | Number of completions to generate per request |
| `extra_kwargs` | dict | None | Additional keyword arguments passed through to the OpenAI `create()` call |

## Performance Tuning

### Concurrency vs. Parallelism

The `max_concurrent_requests` parameter controls how many API requests the client can have in-flight simultaneously.
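
As a rough illustration of what an in-flight cap means, here is a standalone sketch that uses `asyncio.Semaphore` to bound concurrent coroutines. It is not NeMo Curator's implementation; `fake_request`, `run_batch`, and the `limit` argument are hypothetical names, and the sleep stands in for network latency.

```python
import asyncio


async def fake_request(i: int, sem: asyncio.Semaphore, stats: dict) -> int:
    """Hypothetical stand-in for one API call; not part of NeMo Curator."""
    async with sem:  # wait here until one of the `limit` slots is free
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulate network latency
        stats["in_flight"] -= 1
    return i


async def run_batch(num_requests: int, limit: int):
    """Schedule every request at once, but let at most `limit` run concurrently."""
    sem = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(fake_request(i, sem, stats) for i in range(num_requests))
    )
    return list(results), stats["peak"]


results, peak = asyncio.run(run_batch(num_requests=20, limit=3))
print(f"completed={len(results)} peak_in_flight={peak}")
```

All 20 requests are scheduled up front, yet the recorded peak never exceeds `limit`. Lowering `limit` trades throughput for fewer simultaneous calls, which is the same kind of trade-off that `max_concurrent_requests` exposes.
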
This limit interacts with Ray's distributed workers:

- **Client-level concurrency**: `max_concurrent_requests` limits concurrent API calls per worker
- **Worker-level parallelism**: Ray distributes tasks across multiple workers

```python
# For NVIDIA API endpoints with rate limits
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=3,  # Conservative for cloud APIs
)
```

### Retry Configuration

The client includes automatic retry with exponential backoff for transient errors:

```python
client = AsyncOpenAIClient(
    base_url="https://integrate.api.nvidia.com/v1",
    max_retries=3,  # Number of retry attempts
    base_delay=1.0,  # Base delay in seconds
    timeout=120,  # Request timeout in seconds
)
```

The retry logic handles:

- **Rate limit errors (429)**: Automatic backoff with jitter
- **Connection errors**: Retry with exponential delay
- **Transient failures**: Configurable retry attempts

## Using Other OpenAI-Compatible Endpoints

The `AsyncOpenAIClient` works with any OpenAI-compatible API endpoint.
Set the `base_url` and `api_key` parameters for the endpoint you want to use:

```python
# OpenAI API
client = AsyncOpenAIClient(
    base_url="https://api.openai.com/v1",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
    max_concurrent_requests=5,
)

# Any OpenAI-compatible endpoint
client = AsyncOpenAIClient(
    base_url="http://your-endpoint/v1",
    api_key="your-api-key",
    max_concurrent_requests=5,
)
```

### Local Inference with InferenceServer

To serve models locally and connect them to `AsyncOpenAIClient`, use NeMo Curator's built-in [Inference Server](/curate-text/synthetic/inference-server) (Ray Serve + vLLM):

```python
from nemo_curator.core.serve import InferenceModelConfig, InferenceServer
from nemo_curator.models.client.openai_client import AsyncOpenAIClient

config = InferenceModelConfig(
    model_identifier="meta-llama/Llama-3-8B-Instruct",
    engine_kwargs={"tensor_parallel_size": 2},
)

with InferenceServer(models=[config]) as server:
    client = AsyncOpenAIClient(
        base_url=server.endpoint,
        api_key="unused",
        max_concurrent_requests=10,
    )
    # Use client in pipeline stages
```

## Complete Example

```python
import os

from nemo_curator.models.client.openai_client import AsyncOpenAIClient
from nemo_curator.models.client.llm_client import GenerationConfig
from nemo_curator.pipeline import Pipeline
from nemo_curator.stages.synthetic.qa_multilingual_synthetic import QAMultilingualSyntheticStage

# Configure client
client = AsyncOpenAIClient(
    api_key=os.environ["NVIDIA_API_KEY"],  # Raises KeyError early if the key is unset
    base_url="https://integrate.api.nvidia.com/v1",
    max_concurrent_requests=5,
    max_retries=3,
    base_delay=1.0,
)

# Configure generation
config = GenerationConfig(
    temperature=0.9,
    top_p=0.95,
    max_tokens=2048,
)

# Use in a pipeline stage
pipeline = Pipeline(name="sdg_example")
pipeline.add_stage(
    QAMultilingualSyntheticStage(
        prompt="Generate a Q&A pair about science in {language}.",
        languages=["English", "French", "German"],
        client=client,
        model_name="meta/llama-3.3-70b-instruct",
        num_samples=100,
        generation_config=config,
    )
)
```

## Troubleshooting

### Rate Limit Errors

If you encounter frequent 429 errors:

1. Reduce `max_concurrent_requests`
2. Increase `base_delay` for longer backoff
3. Consider using a local deployment for high-volume workloads

### Connection Timeouts

For slow networks or high-latency endpoints:

```python
client = AsyncOpenAIClient(
    base_url="...",
    timeout=300,  # Increase from default 120 seconds
)
```

---

## Next Steps

- [Multilingual Q&A](/curate-text/synthetic/multilingual-qa): Generate multilingual Q&A pairs
- [Nemotron-CC](/curate-text/synthetic/nemotron-cc): Advanced text transformation pipelines
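
For the rate-limit advice under Troubleshooting, it can help to estimate how long a request may wait across retries. The sketch below assumes a plain doubling schedule with optional proportional jitter; the client's actual schedule is not specified in this guide and may differ. `backoff_delays` and its `jitter` parameter are hypothetical helpers, not part of NeMo Curator.

```python
import random
from typing import List, Optional


def backoff_delays(
    max_retries: int,
    base_delay: float,
    jitter: float = 0.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Wait before each retry, assuming the delay doubles on every attempt.

    `jitter` is a fraction: each delay grows by a random amount between
    0 and jitter * delay. This is an illustrative assumption only.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)  # base, 2x base, 4x base, ...
        delay += rng.uniform(0.0, jitter * delay)  # adds 0.0 when jitter is 0
        delays.append(delay)
    return delays


print(backoff_delays(max_retries=3, base_delay=1.0))
print(sum(backoff_delays(max_retries=3, base_delay=2.0)))
```

Under these assumptions, `max_retries=3` with `base_delay=1.0` gives waits of 1, 2, and 4 seconds, so raising `base_delay` to 2.0 doubles the total wait from 7 to 14 seconds.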