---
name: mlx
description: Running and fine-tuning LLMs on Apple Silicon with MLX. Use when working with models locally on Mac, converting Hugging Face models to MLX format, fine-tuning with LoRA/QLoRA on Apple Silicon, or serving models via HTTP API.
---

# Using MLX for LLMs on Apple Silicon

MLX-LM is a Python package for running large language models on Apple Silicon, leveraging the MLX framework for optimized performance with unified memory architecture.

## Table of Contents

- [Core Concepts](#core-concepts)
- [Installation](#installation)
- [Text Generation](#text-generation)
- [Interactive Chat](#interactive-chat)
- [Model Conversion](#model-conversion)
- [Quantization](#quantization)
- [Fine-tuning with LoRA](#fine-tuning-with-lora)
- [Serving Models](#serving-models)
- [Best Practices](#best-practices)
- [References](#references)

## Core Concepts

### Why MLX

| Aspect | PyTorch on Mac | MLX |
|--------|----------------|-----|
| Memory | Separate CPU/GPU copies | Unified memory, no copies |
| Optimization | Generic Metal backend | Apple Silicon native |
| Model loading | Slower, more memory | Lazy loading, efficient |
| Quantization | Limited support | Built-in 4/8-bit |

MLX arrays live in shared memory, accessible by both CPU and GPU without data transfer overhead.

### Supported Models

MLX-LM supports most popular architectures: Llama, Mistral, Qwen, Phi, Gemma, Cohere, and many more. Check the [mlx-community](https://huggingface.co/mlx-community) on Hugging Face for pre-converted models.

## Installation

```bash
pip install mlx-lm
```

Requires macOS 13.5+ and Apple Silicon (M1/M2/M3/M4).

## Text Generation

### Python API

```python
from mlx_lm import load, generate

# Load model (from HF hub or local path)
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate text
response = generate(
    model,
    tokenizer,
    prompt="Explain quantum computing in simple terms:",
    max_tokens=256,
    temp=0.7,
)
print(response)
```

### Streaming Generation

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

prompt = "Write a haiku about programming:"

for response in stream_generate(model, tokenizer, prompt, max_tokens=100):
    print(response.text, end="", flush=True)
print()
```

### Batch Generation

```python
from mlx_lm import load, batch_generate

model, tokenizer = load("mlx-community/Qwen2.5-7B-Instruct-4bit")

prompts = [
    "What is machine learning?",
    "Explain neural networks:",
    "Define deep learning:",
]

responses = batch_generate(
    model,
    tokenizer,
    prompts,
    max_tokens=100,
)

for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n")
```

### CLI Generation

```bash
# Basic generation
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt "Explain recursion:" \
    --max-tokens 256

# With sampling parameters
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
    --prompt "Write a poem about AI:" \
    --temp 0.8 \
    --top-p 0.95
```

## Interactive Chat

### CLI Chat

```bash
# Start chat REPL (context preserved between turns)
mlx_lm.chat --model mlx-community/Llama-3.2-3B-Instruct-4bit
```

### Python Chat

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt=prompt, max_tokens=256)
print(response)
```
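The single-turn example above extends naturally to multi-turn conversations: keep appending turns to `messages` and re-apply the chat template before each generation. A minimal sketch using only the `load`, `generate`, and `apply_chat_template` calls shown above (the loop structure and hard-coded user turns are illustrative):

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

messages = [{"role": "system", "content": "You are a helpful assistant."}]

for user_input in ["What's the capital of France?", "And what is it famous for?"]:
    # Append the user turn and re-render the full history with the chat template
    messages.append({"role": "user", "content": user_input})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    reply = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    print(f"> {user_input}\n{reply}\n")
    # Keep the assistant turn so the next prompt carries full context
    messages.append({"role": "assistant", "content": reply})
```

Re-rendering the whole history recomputes the prompt on every turn; for long interactive sessions the `mlx_lm.chat` REPL above, which preserves context between turns, is the cheaper option.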
## Model Conversion

Convert Hugging Face models to MLX format:

### CLI Conversion

```bash
# Convert with 4-bit quantization
mlx_lm.convert --hf-path meta-llama/Llama-3.2-3B-Instruct \
    -q  # Quantize to 4-bit

# With specific quantization
mlx_lm.convert --hf-path mistralai/Mistral-7B-Instruct-v0.3 \
    -q \
    --q-bits 8 \
    --q-group-size 64

# Upload to Hugging Face Hub
mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct \
    -q \
    --upload-repo your-username/Llama-3.2-1B-Instruct-4bit-mlx
```

### Python Conversion

```python
from mlx_lm import convert

convert(
    hf_path="meta-llama/Llama-3.2-3B-Instruct",
    mlx_path="./llama-3.2-3b-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```

### Conversion Options

| Option | Default | Description |
|--------|---------|-------------|
| `--q-bits` | 4 | Quantization bits (4 or 8) |
| `--q-group-size` | 64 | Group size for quantization |
| `--dtype` | float16 | Data type for non-quantized weights |

## Quantization

MLX supports multiple quantization methods for different use cases:

| Method | Best For | Command |
|--------|----------|---------|
| Basic | Quick conversion | `mlx_lm.convert -q` |
| DWQ | Quality-preserving | `mlx_lm.dwq` |
| AWQ | Activation-aware | `mlx_lm.awq` |
| Dynamic | Per-layer precision | `mlx_lm.dynamic_quant` |
| GPTQ | Established method | `mlx_lm.gptq` |

### Quick Quantization

```bash
# 4-bit quantization during conversion
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q

# 8-bit for higher quality
mlx_lm.convert --hf-path mistralai/Mistral-7B-v0.3 -q --q-bits 8
```

For detailed coverage of each method, see `reference/quantization.md`.

## Fine-tuning with LoRA

MLX supports LoRA and QLoRA fine-tuning for efficient adaptation on Apple Silicon.

### Quick Start

```bash
# Prepare training data (JSONL format)
# {"text": "Your training text here"}
# or
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

# Fine-tune with LoRA
mlx_lm.lora --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 1000

# Generate with adapter
mlx_lm.generate --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --prompt "Your prompt here"
```
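The `--data` directory holds plain JSONL files in the format shown in the Quick Start comments. A minimal sketch for writing chat-style training data in that layout (the example records are placeholders, and the `train.jsonl`/`valid.jsonl` split follows the filenames `mlx_lm.lora` conventionally expects; check your mlx-lm version's docs for the exact requirements):

```python
import json
from pathlib import Path

# Placeholder examples; replace with your own dataset
examples = [
    {"messages": [
        {"role": "user", "content": "What does MLX run on?"},
        {"role": "assistant", "content": "MLX targets Apple Silicon Macs."},
    ]},
    {"messages": [
        {"role": "user", "content": "Name one benefit of unified memory."},
        {"role": "assistant", "content": "The CPU and GPU share memory, so no copies are needed."},
    ]},
]

data_dir = Path("./data")
data_dir.mkdir(exist_ok=True)

# Hold out a small slice for validation; mlx_lm.lora reads train.jsonl and valid.jsonl from --data
split = max(1, len(examples) // 10)
for name, rows in [("train.jsonl", examples[split:]), ("valid.jsonl", examples[:split])]:
    with open(data_dir / name, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```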
### Fuse Adapter into Model

```bash
# Merge LoRA weights into base model
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --save-path ./fused-model

# Or export to GGUF
mlx_lm.fuse --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./adapters \
    --export-gguf
```

For detailed LoRA configuration and training patterns, see `reference/fine-tuning.md`.

## Serving Models

### OpenAI-Compatible Server

```bash
# Start server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit --port 8080

# Test the OpenAI-compatible endpoint with curl
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "default",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 256
  }'
```

### Python Client

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain MLX in one sentence."}],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

## Best Practices

1. **Use pre-quantized models**: Download from `mlx-community` on Hugging Face for immediate use
2. **Match quantization to your hardware**: On 8 GB M1/M2 machines, use 4-bit; on M2/M3 Pro/Max machines with more memory, 8-bit preserves more quality
3. **Leverage unified memory**: Unlike CUDA, MLX models can exceed "GPU memory" by using swap (slower, but it works)
4. **Use streaming for UX**: `stream_generate` provides responsive output for interactive applications
5. **Cache prompt prefixes**: Use `mlx_lm.cache_prompt` for repeated prompts with varying suffixes (see the sketch at the end of this document)
6. **Batch similar requests**: `batch_generate` is more efficient than sequential generation
7. **Start with 4-bit quantization**: Good quality/size tradeoff; upgrade to 8-bit if quality issues appear
8. **Fuse adapters for deployment**: After fine-tuning, fuse adapters for faster inference without loading them separately
9. **Monitor memory with Activity Monitor**: Watch memory pressure to avoid swap thrashing
10. **Use chat templates**: Always apply `tokenizer.apply_chat_template()` for instruction-tuned models

## References

See `reference/` for detailed documentation:

- `quantization.md` - Detailed quantization methods and when to use each
- `fine-tuning.md` - Complete LoRA/QLoRA training guide with data formats and configuration
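As referenced in best practice #5, a common pattern is to precompute the KV cache for a long shared prefix once and reuse it across requests with different suffixes. A hedged sketch of the CLI flow, assuming the `--prompt-cache-file` option available in recent mlx-lm releases (flag names may differ by version; check `mlx_lm.cache_prompt --help`):

```bash
# Cache a long shared prefix (e.g. a document you will ask many questions about)
cat context.txt | mlx_lm.cache_prompt \
    --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --prompt - \
    --prompt-cache-file context_cache.safetensors

# Reuse the cached prefix with a different suffix each time
mlx_lm.generate \
    --prompt-cache-file context_cache.safetensors \
    --prompt "Summarize the text above in three bullet points."
```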