--- name: coreml description: Use when deploying custom ML models on-device, converting PyTorch models, compressing models, implementing LLM inference, or optimizing CoreML performance. Covers model conversion, compression, stateful models, KV-cache, multi-function models, MLTensor. license: MIT version: 1.0.0 --- # CoreML On-Device Machine Learning ## Overview CoreML enables on-device machine learning inference across all Apple platforms. It abstracts hardware details while leveraging Apple Silicon's CPU, GPU, and Neural Engine for high-performance, private, and efficient execution. **Key principle**: Start with the simplest approach, then optimize based on profiling. Don't over-engineer compression or caching until you have real performance data. ## Decision Tree - CoreML vs Foundation Models ``` Need on-device ML? ├─ Text generation (LLM)? │ ├─ Simple prompts, structured output? → Foundation Models (ios-ai skill) │ └─ Custom model, fine-tuned, specific architecture? → CoreML ├─ Custom trained model? │ └─ Yes → CoreML ├─ Image/audio/sensor processing? │ └─ Yes → CoreML └─ Apple's built-in intelligence? └─ Yes → Foundation Models (ios-ai skill) ``` ## Red Flags Use this skill when you see: - "Convert PyTorch model to CoreML" - "Model too large for device" - "Slow inference performance" - "LLM on-device" - "KV-cache" or "stateful model" - "Model compression" or "quantization" - MLModel, MLTensor, or coremltools in context ## Pattern 1 - Basic Model Conversion The standard PyTorch → CoreML workflow. ```python import coremltools as ct import torch # Trace the model model.eval() traced_model = torch.jit.trace(model, example_input) # Convert to CoreML mlmodel = ct.convert( traced_model, inputs=[ct.TensorType(shape=example_input.shape)], minimum_deployment_target=ct.target.iOS18 ) # Save mlmodel.save("MyModel.mlpackage") ``` **Critical**: Always set `minimum_deployment_target` to enable latest optimizations. ## Pattern 2 - Model Compression (Post-Training) Three techniques, each with different tradeoffs: ### Palettization (Best for Neural Engine) Clusters weights into lookup tables. Use per-grouped-channel for better accuracy. ```python from coremltools.optimize.coreml import ( OpPalettizerConfig, OptimizationConfig, palettize_weights ) # 4-bit with grouped channels (iOS 18+) op_config = OpPalettizerConfig( mode="kmeans", nbits=4, granularity="per_grouped_channel", group_size=16 ) config = OptimizationConfig(global_config=op_config) compressed_model = palettize_weights(model, config) ``` | Bits | Compression | Accuracy Impact | |------|-------------|-----------------| | 8-bit | 2x | Minimal | | 6-bit | 2.7x | Low | | 4-bit | 4x | Moderate (use grouped channels) | | 2-bit | 8x | High (requires training-time) | ### Quantization (Best for GPU on Mac) Linear mapping to INT8/INT4. Use per-block for better accuracy. ```python from coremltools.optimize.coreml import ( OpLinearQuantizerConfig, OptimizationConfig, linear_quantize_weights ) # INT4 per-block quantization (iOS 18+) op_config = OpLinearQuantizerConfig( mode="linear", dtype="int4", granularity="per_block", block_size=32 ) config = OptimizationConfig(global_config=op_config) compressed_model = linear_quantize_weights(model, config) ``` ### Pruning (Combine with other techniques) Sets weights to zero for sparse representation. Can combine with palettization. ```python from coremltools.optimize.coreml import ( OpMagnitudePrunerConfig, OptimizationConfig, prune_weights ) op_config = OpMagnitudePrunerConfig( target_sparsity=0.4 # 40% zeros ) config = OptimizationConfig(global_config=op_config) sparse_model = prune_weights(model, config) ``` ## Pattern 3 - Training-Time Compression When post-training compression loses too much accuracy, fine-tune with compression. ```python from coremltools.optimize.torch.palettization import ( DKMPalettizerConfig, DKMPalettizer ) # Configure 4-bit palettization config = DKMPalettizerConfig(global_config={"n_bits": 4}) # Prepare model palettizer = DKMPalettizer(model, config) prepared_model = palettizer.prepare() # Fine-tune (your training loop) for epoch in range(num_epochs): train_epoch(prepared_model, data_loader) palettizer.step() # Finalize final_model = palettizer.finalize() ``` **Tradeoff**: Better accuracy than post-training, but requires training data and time. ## Pattern 4 - Calibration-Based Compression (iOS 18+) Middle ground: uses calibration data without full training. ```python from coremltools.optimize.torch.pruning import ( MagnitudePrunerConfig, LayerwiseCompressor ) # Configure config = MagnitudePrunerConfig( target_sparsity=0.4, n_samples=128 # Calibration samples ) # Create pruner pruner = LayerwiseCompressor(model, config) # Calibrate sparse_model = pruner.compress(calibration_data_loader) ``` ## Pattern 5 - Stateful Models (KV-Cache for LLMs) For transformer models, use state to avoid recomputing key/value vectors. ### PyTorch Model with State ```python class StatefulLLM(nn.Module): def __init__(self): super().__init__() # Register state buffers self.register_buffer("keyCache", torch.zeros(batch, heads, seq_len, dim)) self.register_buffer("valueCache", torch.zeros(batch, heads, seq_len, dim)) def forward(self, input_ids, causal_mask): # Update caches in-place during forward # ... attention with KV-cache ... return logits ``` ### Conversion with State ```python import coremltools as ct mlmodel = ct.convert( traced_model, inputs=[ ct.TensorType(name="input_ids", shape=(1, ct.RangeDim(1, 2048))), ct.TensorType(name="causal_mask", shape=(1, 1, ct.RangeDim(1, 2048), ct.RangeDim(1, 2048))) ], states=[ ct.StateType(name="keyCache", ...), ct.StateType(name="valueCache", ...) ], minimum_deployment_target=ct.target.iOS18 ) ``` ### Using State at Runtime ```swift // Create state from model let state = model.makeState() // Run prediction with state (updated in-place) let output = try model.prediction(from: input, using: state) ``` **Performance**: 1.6x speedup on Mistral-7B (M3 Max) compared to manual KV-cache I/O. ## Pattern 6 - Multi-Function Models (Adapters/LoRA) Deploy multiple adapters in a single model, sharing base weights. ```python from coremltools.models import MultiFunctionDescriptor from coremltools.models.utils import save_multifunction # Convert individual models sticker_model = ct.convert(sticker_adapter_model, ...) storybook_model = ct.convert(storybook_adapter_model, ...) # Save individually sticker_model.save("sticker.mlpackage") storybook_model.save("storybook.mlpackage") # Merge with shared weights desc = MultiFunctionDescriptor() desc.add_function("sticker", "sticker.mlpackage") desc.add_function("storybook", "storybook.mlpackage") save_multifunction(desc, "MultiAdapter.mlpackage") ``` ### Loading Specific Function ```swift let config = MLModelConfiguration() config.functionName = "sticker" // or "storybook" let model = try MLModel(contentsOf: modelURL, configuration: config) ``` ## Pattern 7 - MLTensor for Pipeline Stitching (iOS 18+) Simplifies computation between models (decoding, post-processing). ```swift import CoreML // Create tensors let scores = MLTensor(shape: [1, vocab_size], scalars: logits) // Operations (executed asynchronously on Apple Silicon) let topK = scores.topK(k: 10) let probs = (topK.values / temperature).softmax() // Sample from distribution let sampled = probs.multinomial(numSamples: 1) // Materialize to access data (blocks until complete) let shapedArray = await sampled.shapedArray(of: Int32.self) ``` **Key insight**: MLTensor operations are async. Call `shapedArray()` to materialize results. ## Pattern 8 - Async Prediction for Concurrency Thread-safe concurrent predictions for throughput. ```swift class ImageProcessor { let model: MLModel func processImages(_ images: [CGImage]) async throws -> [Output] { try await withThrowingTaskGroup(of: Output.self) { group in for image in images { group.addTask { // Check cancellation before expensive work try Task.checkCancellation() let input = try self.prepareInput(image) // Async prediction - thread safe! return try await self.model.prediction(from: input) } } return try await group.reduce(into: []) { $0.append($1) } } } } ``` **Warning**: Limit concurrent predictions to avoid memory pressure from multiple input/output buffers. ```swift // Limit concurrency let semaphore = AsyncSemaphore(value: 2) for image in images { group.addTask { await semaphore.wait() defer { semaphore.signal() } return try await process(image) } } ``` ## Anti-Patterns ### Don't - Load models on main thread at launch ```swift // BAD - blocks UI class AppDelegate { let model = try! MLModel(contentsOf: url) // Blocks! } // GOOD - lazy async loading class ModelManager { private var model: MLModel? func getModel() async throws -> MLModel { if let model { return model } model = try await Task.detached { try MLModel(contentsOf: url) }.value return model! } } ``` ### Don't - Reload model for each prediction ```swift // BAD - reloads every time func predict(_ input: Input) throws -> Output { let model = try MLModel(contentsOf: url) // Expensive! return try model.prediction(from: input) } // GOOD - keep model loaded class Predictor { private let model: MLModel func predict(_ input: Input) throws -> Output { try model.prediction(from: input) } } ``` ### Don't - Compress without profiling first ```swift // BAD - blind compression let compressed = palettize_weights(model, 2bit_config) // May break accuracy! // GOOD - profile, then compress iteratively // 1. Profile Float16 baseline // 2. Try 8-bit → check accuracy // 3. Try 6-bit → check accuracy // 4. Try 4-bit with grouped channels → check accuracy // 5. Only use 2-bit with training-time compression ``` ### Don't - Ignore deployment target ```python # BAD - misses optimizations mlmodel = ct.convert(traced_model, inputs=[...]) # GOOD - enables SDPA fusion, per-block quantization, etc. mlmodel = ct.convert( traced_model, inputs=[...], minimum_deployment_target=ct.target.iOS18 ) ``` ## Pressure Scenarios ### Scenario 1 - "Model is 5GB, need it under 2GB for iPhone" **Wrong approach**: Jump straight to 2-bit palettization. **Right approach**: 1. Start with 8-bit palettization → check accuracy 2. Try 6-bit → check accuracy 3. Try 4-bit with `per_grouped_channel` → check accuracy 4. If still too large, use calibration-based compression 5. If still losing accuracy, use training-time compression ### Scenario 2 - "LLM inference is too slow" **Wrong approach**: Try different compute units randomly. **Right approach**: 1. Profile with Core ML Instrument 2. Check if load is cached (look for "cached" vs "prepare and cache") 3. Enable stateful KV-cache 4. Check SDPA optimization is enabled (iOS 18+ deployment target) 5. Consider INT4 quantization for GPU on Mac ### Scenario 3 - "Need multiple LoRA adapters in one app" **Wrong approach**: Ship separate models for each adapter. **Right approach**: 1. Convert each adapter model separately 2. Use `MultiFunctionDescriptor` to merge with shared base 3. Load specific function via `config.functionName` 4. Weights are deduplicated automatically ## Checklist Before deploying a CoreML model: - [ ] Set `minimum_deployment_target` to latest supported iOS - [ ] Profile baseline Float16 performance - [ ] Check if model load is cached - [ ] Consider compression only if size/performance requires it - [ ] Test accuracy after each compression step - [ ] Use async prediction for concurrent workloads - [ ] Limit concurrent predictions to manage memory - [ ] Use state for transformer KV-cache - [ ] Use multi-function for adapter variants ## Resources **WWDC**: 2023-10047, 2023-10049, 2024-10159, 2024-10161 **Docs**: /coreml, /coreml/mlmodel, /coreml/mltensor **Skills**: coreml-ref, coreml-diag, axiom-ios-ai (Foundation Models)