---
name: evla-vla
description: EdgeVLA - Open-source edge vision-language-action model for robotics. Standardizes Open-X Embodiment datasets for consistent VLA training and deployment.
version: 1.0.0
category: robotics-vla
author: K-Scale Labs
source: kscalelabs/evla
license: MIT
trit: -1
trit_label: MINUS
color: "#DBA51D"
verified: false
featured: true
---

# EdgeVLA Skill

**Trit**: -1 (MINUS - analysis/verification)
**Color**: #DBA51D (Golden Yellow)
**URI**: skill://evla-vla#DBA51D

## Overview

EdgeVLA is an open-source edge vision-language-action model for robotics. It standardizes diverse robotics datasets from the Open-X Embodiment (OXE) collection for consistent training and deployment.

## Architecture

```
┌────────────────────────────────────────────────────────────────┐
│                      EdgeVLA ARCHITECTURE                      │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                Open-X Embodiment Datasets                │  │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐         │  │
│  │  │  DROID  │ │ Bridge  │ │ LIBERO  │ │  RT-X   │ + 60... │  │
│  │  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘         │  │
│  └───────┼───────────┼───────────┼───────────┼──────────────┘  │
│          │           │           │           │                 │
│          ▼           ▼           ▼           ▼                 │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │            OXE_DATASET_CONFIGS Standardization           │  │
│  │  • image_obs_keys: primary, secondary, wrist cameras     │  │
│  │  • state_encoding: POS_EULER, POS_QUAT, JOINT            │  │
│  │  • action_encoding: EEF_POS, JOINT_POS                   │  │
│  └──────────────────────────────────────────────────────────┘  │
│                               │                                │
│                               ▼                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                    Unified Data Format                   │  │
│  │  ┌─────────────────────────────────────────────────────┐ │  │
│  │  │ Images: resized, normalized, multi-view             │ │  │
│  │  │ States: 8-dim standardized proprioception           │ │  │
│  │  │ Actions: 7-dim EEF or joint actions                 │ │  │
│  │  └─────────────────────────────────────────────────────┘ │  │
│  └──────────────────────────────────────────────────────────┘  │
│                               │                                │
│                               ▼                                │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │                         VLA Model                        │  │
│  │     Vision Encoder → Language Model → Action Decoder     │  │
│  └──────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────┘
```

## Dataset Configuration

```python
from evla.config import OXE_DATASET_CONFIGS, StateEncoding, ActionEncoding

# DROID dataset configuration
droid_config = OXE_DATASET_CONFIGS["droid"]
# {
#     "image_obs_keys": {
#         "primary": "exterior_image_1_left",
#         "secondary": "exterior_image_2_left",
#         "wrist": "wrist_image_left",
#     },
#     "state_encoding": StateEncoding.POS_QUAT,
#     "action_encoding": ActionEncoding.EEF_POS,
# }

# Bridge dataset configuration
bridge_config = OXE_DATASET_CONFIGS["bridge"]
# {
#     "image_obs_keys": {
#         "primary": "image_0",
#         "wrist": "image_1",
#     },
#     "state_encoding": StateEncoding.POS_EULER,
#     "action_encoding": ActionEncoding.EEF_POS,
# }
```

## Named Mixtures

```python
from evla.config import OXE_NAMED_MIXTURES

# Comprehensive multi-dataset training
oxe_magic_soup = OXE_NAMED_MIXTURES["oxe_magic_soup"]

# RT-X reproduction
rtx_mixture = OXE_NAMED_MIXTURES["rtx"]

# Custom mixture with weights
custom_mixture = {
    "droid": 1.0,
    "bridge": 0.5,
    "libero": 0.3,
}
```

## Usage

```python
import torch

from evla import EdgeVLA, DataLoader

# Load model
model = EdgeVLA.from_pretrained("kscale/evla-base")

# Create dataloader with mixture
loader = DataLoader(
    mixture="oxe_magic_soup",
    batch_size=32,
    image_size=(224, 224),
)

# Training loop
for batch in loader:
    images = batch["images"]    # (B, V, H, W, C)
    states = batch["states"]    # (B, 8)
    actions = batch["actions"]  # (B, 7)
    loss = model.train_step(images, states, actions)

# Inference (`camera` and `robot` are stand-ins for your deployment I/O)
with torch.no_grad():
    image = camera.capture()
    state = robot.get_state()
    action = model.predict(image, state, "pick up the red block")
    robot.execute(action)
```
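Under the hood, each raw OXE sample is mapped to the unified format shown in the architecture diagram. The sketch below is a minimal illustration of that standardization step, not the library's actual transform: `standardize_sample`, the raw-sample keys, and the padding logic are hypothetical, assuming per-dataset configs shaped like the `OXE_DATASET_CONFIGS` entries above.

```python
import numpy as np

def standardize_sample(raw: dict, config: dict) -> dict:
    """Hypothetical sketch: map one raw OXE sample to the unified format."""
    # Select the configured camera views; absent views are simply skipped.
    images = {
        view: raw[key]
        for view, key in config["image_obs_keys"].items()
        if key is not None and key in raw
    }
    # Zero-pad proprioception up to the 8-dim standardized state.
    state = np.zeros(8, dtype=np.float32)
    src = np.asarray(raw["state"], dtype=np.float32)[:8]
    state[: src.shape[0]] = src
    # Actions are assumed 7-dim (e.g. EEF delta pose + gripper).
    action = np.asarray(raw["action"], dtype=np.float32)[:7]
    return {"images": images, "state": state, "action": action}

# Toy usage with a Bridge-like sample and the config shown above.
config = {"image_obs_keys": {"primary": "image_0", "wrist": "image_1"}}
raw = {
    "image_0": np.zeros((224, 224, 3), dtype=np.uint8),
    "image_1": np.zeros((224, 224, 3), dtype=np.uint8),
    "state": np.arange(6, dtype=np.float32),  # xyz + Euler angles
    "action": np.zeros(7, dtype=np.float32),
}
sample = standardize_sample(raw, config)
assert sample["state"].shape == (8,)
assert sample["action"].shape == (7,)
```

Padding proprioception to a fixed 8-dim state and clamping actions to 7 dims is what lets heterogeneous datasets share one batch layout.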
## Key Contributors

- **budzianowski**: Core architecture, dataset configs, finetuning
- **moojink**: LIBERO eval, dataset transforms
- **WT-MM**: README, integration

## GF(3) Triads

This skill participates in balanced triads:

```
evla-vla (-1) ⊗ kos-firmware (+1) ⊗ mujoco-scenes (0) = 0 ✓
ksim-rl (-1) ⊗ topos-generate (+1) ⊗ evla-vla (-1) = needs balancing
```

## Related Skills

- `kos-firmware` (+1): Robot firmware for deployment
- `ksim-rl` (-1): RL training for locomotion
- `kbot-humanoid` (-1): K-Bot configuration
- `mujoco-scenes` (0): Scene composition

## References

```bibtex
@misc{evla2024,
  title={EdgeVLA: Open-Source Edge Vision-Language-Action Model},
  author={K-Scale Labs},
  year={2024},
  url={https://github.com/kscalelabs/evla}
}

@article{openvla2024,
  title={OpenVLA: An Open-Source Vision-Language-Action Model},
  author={Kim, Moo Jin and others},
  journal={arXiv preprint arXiv:2406.09246},
  year={2024}
}
```

## SDF Interleaving

This skill connects to **Software Design for Flexibility** (Hanson & Sussman, 2021):

### Primary Chapter: 5. Evaluation

**Concepts**: eval, apply, interpreter, environment

### GF(3) Balanced Triad

```
evla-vla (−) + SDF.Ch5 (−) + [balancer] (−) = 0
```

**Skill Trit**: -1 (MINUS - verification)

### Secondary Chapters

- Ch2: Domain-Specific Languages

### Connection Pattern

Evaluation interprets expressions. This skill processes or generates evaluable forms.
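### Triad Balance Check (illustrative)

The triad arithmetic used above can be checked mechanically: trits live in GF(3), and a triad balances when its trit sum is congruent to 0 mod 3. The sketch below is illustrative only; the `TRITS` table and `balanced` helper are hypothetical, not part of the evla package.

```python
# Illustrative only: trit values mirror the skill cards in this document.
TRITS = {
    "evla-vla": -1,
    "kos-firmware": +1,
    "mujoco-scenes": 0,
    "ksim-rl": -1,
    "topos-generate": +1,
}

def balanced(*skills: str) -> bool:
    """A triad balances when its trit sum is congruent to 0 mod 3."""
    return sum(TRITS[s] for s in skills) % 3 == 0

assert balanced("evla-vla", "kos-firmware", "mujoco-scenes")  # -1 + 1 + 0 = 0 ✓
assert not balanced("ksim-rl", "topos-generate", "evla-vla")  # -1 + 1 - 1 = -1 → unbalanced
```

Note that the SDF triad above also balances: three −1 trits sum to −3 ≡ 0 (mod 3), while the `ksim-rl` triad sums to −1 and, as flagged, still needs balancing.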