---
name: casa-vl-fusion
title: "CASA: Cross-Attention via Self-Attention for Efficient VL Fusion"
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2512.19535
keywords: [vision-language, cross-attention, efficient, multi-image, streaming]
description: "Replace token insertion for vision-language fusion with efficient cross-attention that keeps text self-attention separate. Lets text tokens attend to images within local windows, preserves gist tokens from prior images, and keeps memory costs near-constant for streaming video, which makes it more practical than direct token insertion in resource-constrained applications."
---

## Overview

CASA revisits cross-attention (CA) as a practical alternative to direct token insertion for vision-language fusion. Token insertion becomes prohibitively expensive for high-resolution images and video because every image token joins the self-attention sequence. CA fuses the two modalities more cheaply, but only when designed carefully: five key design differences restore its competitiveness.
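To make the cost gap concrete, here is a rough count of attention-score entries per layer for the two approaches. The function names and example sizes are illustrative assumptions, and the count ignores heads, hidden size, causal-mask savings, and local windowing.

```python
def token_insertion_scores(n_text: int, n_image: int) -> int:
    """Self-attention over the concatenated sequence: all tokens attend to all tokens."""
    total = n_text + n_image
    return total * total


def cross_attention_scores(n_text: int, n_image: int) -> int:
    """Text self-attention plus text-to-image attention; image tokens issue no queries."""
    return n_text * n_text + n_text * n_image


# Hypothetical sizes: a short prompt and one high-resolution image
n_text, n_image = 256, 4096
ratio = token_insertion_scores(n_text, n_image) / cross_attention_scores(n_text, n_image)
print(f"token insertion computes {ratio:.1f}x more attention scores")
```

The gap widens as the image token count grows relative to the text length, which is the regime (high resolution, video) the overview describes.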

## Core Technique

The key insight is that cross-attention requires specific design choices to match or exceed token-insertion performance.

**Five Critical Design Differences:**

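```python
import torch
import torch.nn as nn


# CASA architecture components (single-block sketch; sizes are illustrative)
class CASAVisionLanguageModel(nn.Module):
    def __init__(self, hidden_dim=256, num_heads=4, num_gist=8):
        super().__init__()
        # D1: Separate parameter layers for cross-attention (not shared)
        self.text_self_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

        # D2: Joint text-image attention with local windows
        self.local_window_size = 128

        # D3: Fewer self-attention layers, with CA layers replacing some of them
        # (layer counts for a full stack; unused in this single block)
        self.num_self_attn = 16
        self.num_cross_attn = 8

        # D4: Optional image token FFN updates
        self.image_ffn = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim), nn.GELU(), nn.Linear(4 * hidden_dim, hidden_dim)
        )

        # D5: Visual history via gist tokens
        self.num_gist = num_gist

    def compute_gist(self, image_features):
        # Assumed compression: average-pool the token axis down to num_gist tokens
        pooled = nn.functional.adaptive_avg_pool1d(image_features.transpose(1, 2), self.num_gist)
        return pooled.transpose(1, 2)

    def forward(self, text_tokens, image_features, prev_gist=None):
        """
        Process text and image with CASA design principles.
        text_tokens: [batch, n_text, hidden]; image_features: [batch, n_img, hidden].
        """
        n_text = text_tokens.shape[1]
        pos = torch.arange(n_text, device=text_tokens.device)
        # Boolean masks follow the PyTorch convention: True = blocked
        future = pos[None, :] > pos[:, None]
        too_old = pos[:, None] - pos[None, :] >= self.local_window_size

        # Maintain (causal) text self-attention for robustness
        text_hidden, _ = self.text_self_attention(
            text_tokens, text_tokens, text_tokens, attn_mask=future
        )

        # Visual keys: gist tokens of earlier images (if any) plus the current image
        visual = image_features if prev_gist is None else torch.cat([prev_gist, image_features], dim=1)

        # Joint attention: text attends to image + preceding text
        # within local windows for efficiency
        keys = torch.cat([visual, text_hidden], dim=1)
        visible = torch.zeros(n_text, visual.shape[1], dtype=torch.bool, device=pos.device)
        attended, _ = self.cross_attention(
            text_hidden, keys, keys, attn_mask=torch.cat([visible, future | too_old], dim=1)
        )

        # Optional: update image embeddings via FFN (would feed the next layer)
        image_features = self.image_ffn(image_features)

        # D5: Compress current image into gist tokens for next round
        gist_tokens = self.compute_gist(image_features)

        return attended, gist_tokens
```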
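The local windowing in D2 is easiest to see as an explicit mask. The dependency-free sketch below assumes keys are laid out as image tokens followed by text tokens, and that each text query sees every image token plus the last `window` text tokens up to itself. The function name and the layout are assumptions for illustration, not the paper's exact windowing rule.

```python
def joint_window_mask(n_text: int, n_image: int, window: int) -> list:
    """
    Return mask[i][j] == True when text query i may attend to key j.
    Keys are ordered [image tokens..., text tokens...].
    Note: True means "allowed" here, the opposite of PyTorch's boolean attn_mask.
    """
    mask = []
    for i in range(n_text):
        image_part = [True] * n_image  # every image token is visible
        # Causal sliding window over text: positions i - window + 1 .. i
        text_part = [(i - window < j <= i) for j in range(n_text)]
        mask.append(image_part + text_part)
    return mask


# 4 text tokens, 2 image tokens, window of 2 text tokens
for row in joint_window_mask(n_text=4, n_image=2, window=2):
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `n_image + window` allowed entries, so the per-query cost no longer depends on the total text length.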

**Gist Tokens for Visual History:**
Preserve compressed representations of past images so that earlier frames stay available without keeping their full token sequences in memory.

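```python
import torch
import torch.nn.functional as F


def compute_gist_tokens(image_features, num_gist=8, grid=4, projection=None):
    """
    Compress image features into a small number of gist tokens
    representing essential visual information for future frames.

    image_features: [batch, height, width, hidden]
    returns:        [batch, min(num_gist, grid * grid), hidden]
    """
    # Average-pool the spatial grid into grid x grid regions. A global mean
    # would leave one vector per image, with nothing left to select from.
    x = image_features.permute(0, 3, 1, 2)          # [batch, hidden, H, W]
    pooled = F.adaptive_avg_pool2d(x, grid)         # [batch, hidden, grid, grid]
    candidates = pooled.flatten(2).transpose(1, 2)  # [batch, grid * grid, hidden]

    # Optional projection to the gist token dimension (e.g. an nn.Linear)
    if projection is not None:
        candidates = projection(candidates)

    # Importance score per candidate; the L2 norm is a placeholder assumption
    importance_scores = candidates.norm(dim=-1)     # [batch, grid * grid]

    # Take top-k tokens by importance score
    k = min(num_gist, candidates.shape[1])
    top_idx = importance_scores.topk(k, dim=1).indices
    index = top_idx.unsqueeze(-1).expand(-1, -1, candidates.shape[-1])
    gist_tokens = torch.gather(candidates, 1, index)

    return gist_tokens
```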

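**Streaming Efficiency with Near-Constant Memory:**
Unlike token insertion, the visual KV cache grows by a fixed number of gist tokens per frame rather than by the full per-frame token count, which scales with image resolution.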

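```python
import torch


def streaming_forward(model, text_query, new_frame, history_gist):
    """
    Process a new frame without storing all prior image tokens.
    Each frame contributes a fixed number of gist tokens, so the cost per
    frame is O(gist_tokens) regardless of image resolution.

    `model` is assumed to expose compute_gist(frame) -> [batch, num_gist, hidden]
    and a cross_attention module called as (query, key, value) that returns
    (output, weights), as nn.MultiheadAttention does.
    """
    # Gist tokens for the current frame
    gist_current = model.compute_gist(new_frame)

    # Append to the history; every entry has the same fixed size
    # (builds a new list, so the caller's list is not mutated)
    gist_memory = history_gist + [gist_current]

    # Cross-attention over all stored gists, concatenated along the token axis
    key_value = torch.cat(gist_memory, dim=1)
    output, _ = model.cross_attention(text_query, key_value, key_value)

    # Total memory: O(num_frames * gist_tokens)
    # vs O(num_frames * tokens_per_frame) for token insertion, where
    # tokens_per_frame grows with image resolution

    return output, gist_memory
```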
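Because the gist history still grows by a few tokens per frame, a long stream needs some cap if memory must stay strictly bounded. One option, which is an assumption here rather than something the source prescribes, is a fixed-capacity buffer that evicts the oldest frame. `GistBuffer` is a hypothetical helper, and the strings stand in for gist tensors.

```python
from collections import deque


class GistBuffer:
    """Fixed-capacity store of per-frame gist tokens (hypothetical helper)."""

    def __init__(self, max_frames: int, gist_per_frame: int):
        self.gist_per_frame = gist_per_frame
        # deque(maxlen=...) silently drops the oldest frame once full
        self._frames = deque(maxlen=max_frames)

    def push(self, frame_gist: list) -> None:
        if len(frame_gist) != self.gist_per_frame:
            raise ValueError("each frame must contribute exactly gist_per_frame tokens")
        self._frames.append(list(frame_gist))

    def tokens(self) -> list:
        """Flatten stored frames, oldest first, into one key/value sequence."""
        return [tok for frame in self._frames for tok in frame]


buffer = GistBuffer(max_frames=3, gist_per_frame=2)
for t in range(5):
    buffer.push([f"frame{t}-gist0", f"frame{t}-gist1"])
print(len(buffer.tokens()))  # never exceeds max_frames * gist_per_frame
```

Eviction trades long-range visual recall for a hard memory bound; whether that trade is acceptable depends on the application.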

## When to Use This Technique

Use CASA when:
- You process high-resolution images or video streams
- Memory bandwidth is constrained
- Conversations involve multiple images or streaming input
- The memory cost of token insertion is prohibitive

## When NOT to Use This Technique

Avoid this approach if:
- The task involves a single low-resolution image (token insertion suffices)
- The task needs fine-grained, pixel-level understanding (compression loses spatial detail)
- Only a few images or frames are involved (token-insertion memory stays manageable)
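
A back-of-the-envelope estimate can help place a workload on either side of these guidelines. The sketch below uses the standard KV-cache size formula (keys and values for every token at every layer). All sizes are hypothetical, and it assumes, for simplicity, full-width keys and values cached at every layer.

```python
def patch_tokens(height: int, width: int, patch: int) -> int:
    """Number of visual tokens for one frame split into patch x patch tiles."""
    return (height // patch) * (width // patch)


def kv_cache_bytes(num_tokens: int, num_layers: int, hidden_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Keys and values (factor 2) stored for every token at every layer."""
    return num_tokens * num_layers * 2 * hidden_dim * bytes_per_value


# Hypothetical workload: 100 frames at 448x448, 14-pixel patches, 8 gist tokens per frame
frames, gist_per_frame = 100, 8
inserted = frames * patch_tokens(448, 448, 14)
gisted = frames * gist_per_frame
for name, n in [("token insertion", inserted), ("gist tokens", gisted)]:
    gib = kv_cache_bytes(n, num_layers=32, hidden_dim=4096) / 2**30
    print(f"{name}: {n} tokens, {gib:.2f} GiB")
```

With only a handful of low-resolution images the two numbers are close, which is the case where token insertion remains the simpler choice.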

## Implementation Notes

The framework requires:
- Separate cross-attention and self-attention layer implementations
- Local windowing mechanism for joint text-image attention
- Gist token computation and compression
- Streaming inference pipeline for video
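
For the first requirement, one also has to decide where the cross-attention layers sit in the stack. The source does not specify the placement, so the sketch below simply spaces them evenly and treats the two counts as separate layers of one stack. Both choices, and the `layer_schedule` name, are assumptions.

```python
def layer_schedule(num_self_attn: int, num_cross_attn: int) -> list:
    """Return a list of "self" / "cross" labels with cross layers evenly spaced."""
    total = num_self_attn + num_cross_attn
    schedule = ["self"] * total
    if num_cross_attn == 0:
        return schedule
    stride = total / num_cross_attn  # >= 1, so chosen positions never collide
    for k in range(num_cross_attn):
        schedule[int((k + 1) * stride) - 1] = "cross"
    return schedule


# Counts taken from the architecture sketch above: 16 self-attention, 8 cross-attention
schedule = layer_schedule(16, 8)
print(schedule[:6])
```

A model builder could iterate over this list and instantiate the matching layer class at each position.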

## Key Performance

- Near-constant memory costs for streaming video
- Comparable or superior performance to token insertion
- Efficient multi-image conversation support
- Strong baseline on various VLM benchmarks

## References

- "CASA: Cross-Attention via Self-Attention for Efficient VL Fusion", https://arxiv.org/abs/2512.19535