---
name: transformers
description: Loading and using pretrained models with Hugging Face Transformers. Use when working with pretrained models from the Hub, running inference with Pipeline API, fine-tuning models with Trainer, or handling text, vision, audio, and multimodal tasks.
---

# Using Hugging Face Transformers

Transformers is the model-definition framework for state-of-the-art machine learning across text, vision, audio, and multimodal domains. It provides unified APIs for loading pretrained models, running inference, and fine-tuning.

## Table of Contents

- [Core Concepts](#core-concepts)
- [Pipeline API](#pipeline-api)
- [Model Loading](#model-loading)
- [Inference Patterns](#inference-patterns)
- [Fine-tuning with Trainer](#fine-tuning-with-trainer)
- [Working with Modalities](#working-with-modalities)
- [Memory and Performance](#memory-and-performance)
- [Best Practices](#best-practices)

## Core Concepts

### The Three Core Classes

Every model in Transformers has three core components:

```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

# Configuration: hyperparameters and architecture settings
config = AutoConfig.from_pretrained("bert-base-uncased")

# Model: the neural network weights
model = AutoModel.from_pretrained("bert-base-uncased")

# Tokenizer/Processor: converts inputs to tensors
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
```

### The `from_pretrained` Pattern

All loading uses `from_pretrained()` which handles downloading, caching, and device placement:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # Automatic device placement
)
```

### Auto Classes

Use task-specific Auto classes for the correct model head:

```python
from transformers import (
    AutoModelForCausalLM,          # Text generation (GPT, Llama)
    AutoModelForSeq2SeqLM,         # Encoder-decoder (T5, BART)
    AutoModelForSequenceClassification,  # Classification
    AutoModelForTokenClassification,     # NER, POS tagging
    AutoModelForQuestionAnswering,       # Extractive QA
    AutoModelForMaskedLM,                # BERT-style masked LM
    AutoModelForImageClassification,     # Vision models
    AutoModelForSpeechSeq2Seq,           # Speech recognition
)
```

## Pipeline API

The `pipeline()` function provides high-level inference with minimal code:

### Text Tasks

```python
from transformers import pipeline

# Text generation
generator = pipeline("text-generation", model="Qwen/Qwen2.5-1.5B")
output = generator("The secret to success is", max_new_tokens=50)

# Text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I love this product!")
# [{'label': 'POSITIVE', 'score': 0.9998}]

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Hugging Face is based in New York City.")

# Question answering
qa = pipeline("question-answering")
answer = qa(question="What is the capital?", context="Paris is the capital of France.")

# Summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(long_text, max_length=130, min_length=30)

# Translation
translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
result = translator("Hello, how are you?")
```

### Chat/Conversational

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain quantum computing in simple terms."},
]

response = pipe(messages, max_new_tokens=256)
print(response[0]["generated_text"][-1]["content"])
```

### Vision Tasks

```python
# Image classification
classifier = pipeline("image-classification", model="google/vit-base-patch16-224")
result = classifier("path/to/image.jpg")

# Object detection
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
objects = detector("path/to/image.jpg")

# Image segmentation
segmenter = pipeline("image-segmentation", model="facebook/mask2former-swin-base-coco-panoptic")
masks = segmenter("path/to/image.jpg")
```

### Audio Tasks

```python
# Speech recognition
transcriber = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
text = transcriber("path/to/audio.mp3")

# Audio classification
classifier = pipeline("audio-classification", model="superb/wav2vec2-base-superb-ks")
result = classifier("path/to/audio.wav")
```

### Multimodal Tasks

```python
# Visual question answering
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
answer = vqa(image="image.jpg", question="What color is the car?")

# Image-to-text (captioning)
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
caption = captioner("image.jpg")

# Document question answering
doc_qa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")
answer = doc_qa(image="document.png", question="What is the total?")
```

## Model Loading

### Device Placement

```python
from transformers import AutoModelForCausalLM
import torch

# Automatic placement across available devices
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Specific device
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    device_map="cuda:0",
)

# Custom device map for model parallelism
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1": 1,
    "model.norm": 1,
    "lm_head": 1,
}
model = AutoModelForCausalLM.from_pretrained(model_name, device_map=device_map)
```

### Loading from Local Path

```python
# Save model locally
model.save_pretrained("./my_model")
tokenizer.save_pretrained("./my_model")

# Load from local path
model = AutoModelForCausalLM.from_pretrained("./my_model")
tokenizer = AutoTokenizer.from_pretrained("./my_model")
```

### Trust Remote Code

Some models require executing custom code from the Hub:

```python
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,  # Required for custom architectures
)
```

## Inference Patterns

### Text Generation

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Basic generation
inputs = tokenizer("Once upon a time", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)

# With generation config
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    repetition_penalty=1.1,
)
```

### Chat Templates

```python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Apply chat template
input_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
```

### Getting Embeddings

```python
from transformers import AutoModel, AutoTokenizer
import torch

model = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def get_embeddings(texts: list[str]) -> torch.Tensor:
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling
    attention_mask = inputs["attention_mask"]
    embeddings = outputs.last_hidden_state
    mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    sum_embeddings = (embeddings * mask_expanded).sum(1)
    sum_mask = mask_expanded.sum(1).clamp(min=1e-9)
    return sum_embeddings / sum_mask

embeddings = get_embeddings(["Hello world", "How are you?"])
```

### Classification

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

inputs = tokenizer("I love this movie!", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.softmax(outputs.logits, dim=-1)

labels = model.config.id2label
for idx, prob in enumerate(predictions[0]):
    print(f"{labels[idx]}: {prob:.4f}")
```

## Fine-tuning with Trainer

### Basic Fine-tuning

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load data and model
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased",
    num_labels=2,
)

# Tokenize dataset
def tokenize(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    logging_steps=100,
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)

trainer.train()
```

### Pushing to Hub

```python
# Login first: huggingface-cli login

# Push model and tokenizer
model.push_to_hub("my-username/my-fine-tuned-model")
tokenizer.push_to_hub("my-username/my-fine-tuned-model")

# Or use trainer
trainer.push_to_hub()
```

See `reference/fine-tuning.md` for advanced patterns including LoRA, custom data collators, and evaluation metrics.

## Working with Modalities

### Vision Models

```python
from transformers import AutoImageProcessor, AutoModelForImageClassification
from PIL import Image

processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModelForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("image.jpg")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predicted_class = outputs.logits.argmax(-1).item()
    print(model.config.id2label[predicted_class])
```

### Audio Models

```python
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
import torch

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Load audio (use librosa, soundfile, or datasets)
import librosa
audio, sr = librosa.load("audio.mp3", sr=16000)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

### Vision-Language Models

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import torch

model_name = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

image = Image.open("image.jpg")
prompt = "USER: <image>\nDescribe this image in detail.\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
response = processor.decode(outputs[0], skip_special_tokens=True)
```

## Memory and Performance

### Quantization

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    quantization_config=bnb_config,
    device_map="auto",
)

# 8-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    load_in_8bit=True,
    device_map="auto",
)
```

### Flash Attention

```python
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # Requires flash-attn package
    device_map="auto",
)
```

### torch.compile

```python
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = torch.compile(model, mode="reduce-overhead")
```

### Batched Inference

```python
texts = ["First prompt", "Second prompt", "Third prompt"]
inputs = tokenizer(texts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50)

decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

See `reference/advanced-inference.md` for streaming, KV caching, and serving patterns.

## Best Practices

1. **Use bfloat16 over float16**: Better numerical stability on modern GPUs
2. **Set pad token for generation**: `tokenizer.pad_token = tokenizer.eos_token`
3. **Use device_map="auto"**: Let Accelerate handle device placement
4. **Enable Flash Attention**: Significant speedup for long sequences
5. **Batch when possible**: Amortize fixed costs across multiple inputs
6. **Use pipeline for quick prototyping**: Switch to manual control for production
7. **Cache models locally**: Set `HF_HOME` environment variable for model cache location
8. **Check model license**: Verify usage rights before deployment

## References

See `reference/` for detailed documentation:
- `fine-tuning.md` - Advanced fine-tuning patterns with LoRA, PEFT, and custom training
- `advanced-inference.md` - Generation strategies, streaming, and serving