# Ray Data Transformations

A guide to the core data transformations in Ray Data: batch and row maps, filters, flat maps, GPU transforms, and groupby aggregations.

## Core operations

### Map batches (vectorized)

```python
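# Recommended for performance: the function receives a whole batch at once
def process_batch(batch):
    # By default, batch is a dict mapping column names to NumPy arrays;
    # pass batch_format="pandas" to map_batches to get a DataFrame instead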
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)
```

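**Performance**: typically 10-100× faster than row-by-row `map` for operations that vectorize well, because the per-row Python overhead is amortized across the batch.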

### Map (row-by-row)

```python
# Use only when vectorization is not possible; each row is a plain dict
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)
```
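
Many row-wise functions can be rewritten as batch functions with the same result. The sketch below is one way to check that equivalence locally on plain NumPy data before running on a cluster. The names `square_row`, `square_batch`, and `apply_vectorized` are illustrative, and `apply_vectorized` assumes `ds` is a Ray `Dataset` with a numeric `value` column.

```python
import numpy as np

def square_row(row):
    # Row-wise version: called once per row with a plain dict
    return {**row, "squared": row["value"] ** 2}

def square_batch(batch):
    # Batch version: one NumPy operation covers every row in the batch
    return {**batch, "squared": batch["value"] ** 2}

def apply_vectorized(ds):
    # Assumes ds is a ray.data.Dataset with a numeric "value" column
    return ds.map_batches(square_batch, batch_format="numpy")

# Local check that both versions agree, no cluster needed
values = np.array([1, 2, 3])
row_results = [square_row({"value": v})["squared"] for v in values]
batch_results = square_batch({"value": values})["squared"]
```

If the two versions agree on sample data, the batch version is usually the better choice for `map_batches`.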

### Filter

```python
# Keep only the rows for which the predicate returns True
ds = ds.filter(lambda row: row["score"] > 0.5)
```

### Flat map

```python
# One row → multiple rows
def expand_row(row):
    return [{"value": row["value"] + i} for i in range(3)]

ds = ds.flat_map(expand_row)
```
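
A `flat_map` function can also return an empty list, which drops the input row entirely. The following sketch assumes a hypothetical string column named `text` and emits one output row per word. The list comprehension at the end only mimics locally what `ds.flat_map(split_words)` would produce.

```python
def split_words(row):
    # Returns zero or more output rows for each input row
    return [{"word": w} for w in row["text"].split()]

def apply_split(ds):
    # Assumes ds is a ray.data.Dataset with a string "text" column
    return ds.flat_map(split_words)

# Local emulation of the flattening on plain dicts
rows = [{"text": "ray data"}, {"text": ""}, {"text": "transform"}]
flattened = [out for row in rows for out in split_words(row)]
```

The row with empty text contributes nothing to `flattened`, so `flat_map` can combine expansion and filtering in one pass.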

## GPU-accelerated transforms

```python
def gpu_transform(batch):
    import torch
    # Move the batch column (a NumPy array) onto the GPU
    data = torch.as_tensor(batch["data"]).cuda()
    # GPU processing (placeholder computation)
    result = data * 2
    return {"processed": result.cpu().numpy()}

ds = ds.map_batches(gpu_transform, num_gpus=1, batch_size=64)
```
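
When a transform needs expensive setup, such as loading a model onto the GPU, a callable class lets each worker do that setup once instead of once per batch. The following is a sketch under stated assumptions. `Scaler` is a hypothetical stand-in that uses NumPy instead of a real model so it runs anywhere. The `concurrency` argument is assumed to be available in your Ray version; older releases configure actor pools differently.

```python
import numpy as np

class Scaler:
    def __init__(self, factor=2.0):
        # Expensive one-time setup would go here (e.g. loading model weights);
        # a plain scale factor stands in for the model in this sketch
        self.factor = factor

    def __call__(self, batch):
        # Called once per batch; reuses the state built in __init__
        return {"processed": batch["data"] * self.factor}

def apply_scaler(ds):
    # Assumes ds is a ray.data.Dataset with a numeric "data" column.
    # Passing a class makes Ray Data run it in long-lived actors;
    # concurrency=2 requests two of them, each reserving one GPU.
    return ds.map_batches(
        Scaler,
        fn_constructor_kwargs={"factor": 2.0},
        concurrency=2,
        num_gpus=1,
        batch_size=64,
    )

# Local check of the callable itself, without Ray
out = Scaler(factor=2.0)({"data": np.array([1.0, 2.0, 3.0])})
```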

## Groupby operations

```python
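import numpy as np

# Group by column
grouped = ds.groupby("category")

# Aggregate: number of rows per category
result = grouped.count()

# Custom aggregation: the function receives one group as a batch and must
# return a batch, so scalar results are wrapped in length-1 arrays
result = grouped.map_groups(lambda group: {
    "sum": np.array([group["value"].sum()]),
    "mean": np.array([group["value"].mean()]),
})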
```
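
For common statistics, the grouped dataset also offers built-in aggregations such as `sum` and `mean`. For custom logic, a named function is easier to test than a lambda. The sketch below assumes the same `category` and `value` columns; `summarize_group` and `aggregate` are illustrative names.

```python
import numpy as np

def summarize_group(group):
    # group holds all rows of one category as a dict of NumPy arrays;
    # the result is a one-row batch, so each value is a length-1 array
    return {
        "category": group["category"][:1],
        "sum": np.array([group["value"].sum()]),
        "mean": np.array([group["value"].mean()]),
    }

def aggregate(ds):
    # Assumes ds is a ray.data.Dataset with "category" and "value" columns
    grouped = ds.groupby("category")
    totals = grouped.sum("value")      # built-in aggregation
    averages = grouped.mean("value")   # built-in aggregation
    custom = grouped.map_groups(summarize_group, batch_format="numpy")
    return totals, averages, custom

# Local check on a single group, without Ray
group = {"category": np.array(["a", "a"]), "value": np.array([1.0, 3.0])}
summary = summarize_group(group)
```

Carrying the key column through to the output keeps each summary row identifiable after the groups are recombined.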

## Best practices

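1. **Prefer `map_batches` over `map`** - vectorized batch functions are typically 10-100× faster
2. **Tune `batch_size`** - larger batches amortize per-call overhead but need more memory per task; reduce it if workers run out of memory
3. **Use GPUs for heavy compute** - for example, image or audio preprocessing
4. **Stream large datasets** - use `iter_batches` to consume data that does not fit in memory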
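
Streaming with `iter_batches` pairs naturally with incremental computation: keep only running totals rather than collecting every batch. A minimal sketch follows. `running_mean` is a hypothetical helper that accepts any iterable of batches, so it can be checked on plain NumPy data. `mean_of_dataset` assumes `ds` is a Ray `Dataset` with a numeric column.

```python
import numpy as np

def running_mean(batches, column):
    # Consumes batches one at a time and keeps only two running totals
    total, count = 0.0, 0
    for batch in batches:
        total += float(batch[column].sum())
        count += len(batch[column])
    return total / count if count else float("nan")

def mean_of_dataset(ds, column="value"):
    # Assumes ds is a ray.data.Dataset; batches are produced lazily
    return running_mean(
        ds.iter_batches(batch_size=4096, batch_format="numpy"), column
    )

# Local check with two in-memory batches
batches = [{"value": np.array([1.0, 2.0])}, {"value": np.array([3.0])}]
result = running_mean(batches, "value")
```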