---
name: speech-vlm
description: Run a multimodal Visual Language Model (VLM) with speech interaction on reComputer Jetson (AGX Orin 64G or Orin NX 16G), combining NVIDIA VLM, SenseVoice speech-to-text, and Coqui-ai TTS for voice-driven visual scene understanding.
---

# Run VLM with Speech Interaction

## Execution model

Run one phase at a time. After each phase, verify the expected result before continuing.
- If a phase succeeds → print `[OK]` and move to the next phase.
- If a phase fails → print `[STOP]`, consult the failure decision tree, and ask the user before retrying.
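
The pass/fail reporting above can be sketched as a small shell helper. The name `run_phase` and its arguments are illustrative and are not part of the speech_vlm repository. The helper only reports the result; consulting the decision tree and asking the user remain manual steps.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one phase's verification command and report the result.
# Usage: run_phase "<phase label>" <command> [args...]
run_phase() {
    local label="$1"
    shift
    if "$@"; then
        echo "[OK] ${label}"
        return 0
    fi
    echo "[STOP] ${label}"
    return 1
}

# Example (on the device):
#   run_phase "Phase 7: containers listed" sudo docker ps || exit 1
```

Returning a non-zero status on failure lets the caller halt with `|| exit 1` instead of silently moving on to the next phase.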

## Phase 1 — Verify prerequisites

Required hardware:
- reComputer Jetson AGX Orin 64G or Orin NX 16G (16GB+ memory)
- USB plug-and-play (driver-free) speakerphone with a built-in microphone
- IP camera with RTSP output (or use NVStreamer for local video)

```bash
# Check JetPack 6 and CUDA
cat /etc/nv_tegra_release
nvcc --version
# Check available memory
free -h
```

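Expected: `/etc/nv_tegra_release` reports L4T R36.x (the base of JetPack 6.x); `nvcc` prints a CUDA version; `free -h` shows roughly 16 GB or more of total memory. If `nvcc` is not found, try `/usr/local/cuda/bin/nvcc --version`.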
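
As a rough sketch, the two checks can be automated by parsing the release file and `/proc/meminfo`. The function names `l4t_major` and `mem_total_gib` are hypothetical. The sketch assumes the first line of `/etc/nv_tegra_release` starts with `# R<major> (release)` and that JetPack 6.x corresponds to L4T R36. The 14 GiB threshold in the usage comment is also an assumption: a 16 GB module typically reports somewhat less than 16 GiB of usable memory.

```bash
#!/usr/bin/env bash
# Print the L4T major release (e.g. 36) from nv_tegra_release-style text on stdin.
l4t_major() {
    sed -n 's/^# R\([0-9][0-9]*\) (release).*/\1/p' | head -n 1
}

# Print total memory in whole GiB (truncated) from /proc/meminfo-style text on stdin.
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576 }'
}

# Example (on the device):
#   [ "$(l4t_major < /etc/nv_tegra_release)" = "36" ] && echo "[OK] JetPack 6.x (L4T R36)"
#   [ "$(mem_total_gib < /proc/meminfo)" -ge 14 ] && echo "[OK] memory"
```

Both functions read from stdin, so they can be tried against saved output before running on the Jetson itself.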

## Phase 2 — Initialize system environment

```bash
# Ensure nvidia-jetpack is fully installed
sudo apt-get install nvidia-jetpack

# Install system dependencies
sudo apt-get install libportaudio2 libportaudiocpp0 portaudio19-dev

# Install Python packages (subprocess and wave ship with Python's standard library and need no install)
sudo pip3 install pyaudio playsound keyboard
sudo pip3 install --upgrade setuptools
sudo pip3 install sudachipy==0.5.2
```

Verify that the audio devices are detected and the network is reachable:

```bash
arecord -l   # List recording devices
aplay -l     # List playback devices
ping -c 2 8.8.8.8
```

Expected: Audio devices listed; network reachable.
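
Jetson boards may list built-in audio cards as well, so a non-empty `arecord -l` listing does not by itself prove that the USB speakerphone was detected. The sketch below filters the card lines by a name pattern. The function name `alsa_cards_matching` is hypothetical, and the default pattern `USB` is an assumption about how the device names itself; pass a different pattern if your device is labeled otherwise.

```bash
#!/usr/bin/env bash
# Print ALSA "card N:" lines from stdin whose text matches a pattern (case-insensitive).
# Default pattern "USB" is an assumption about the speakerphone's ALSA name.
alsa_cards_matching() {
    local pattern="${1:-USB}"
    grep '^card [0-9][0-9]*:' | grep -i -- "$pattern" || true
}

# Example (on the device):
#   [ -n "$(arecord -l 2>/dev/null | alsa_cards_matching)" ] && echo "[OK] USB capture device"
#   [ -n "$(aplay -l 2>/dev/null | alsa_cards_matching)" ] && echo "[OK] USB playback device"
```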

## Phase 3 — Install VLM

Follow the NVIDIA Jetson VLM installation guide. Ensure you can perform text-based inference with VLM before proceeding.

Reference: [Run VLM on reComputer](https://wiki.seeedstudio.com/run_vlm_on_recomputer)

## Phase 4 — Install PyTorch and Torchaudio

Install PyTorch, Torchaudio, and Torchvision matching your JetPack version.

Reference: [PyTorch installation for Jetson](https://github.com/Seeed-Projects/reComputer-Jetson-for-Beginners/blob/main/3-Basic-Tools-and-Getting-Started/3.3-Pytorch-and-Tensorflow/README.md)

```bash
# Verify PyTorch with CUDA
python3 -c "import torch; print(torch.cuda.is_available())"
```

Expected: `True`

## Phase 5 — Install Speech_vlm (SenseVoice)

```bash
cd ~/
git clone https://github.com/ZhuYaoHui1998/speech_vlm.git
cd ~/speech_vlm
sudo pip3 install -r requement.txt
```

Expected: All SenseVoice dependencies installed.

## Phase 6 — Install TTS (Coqui-ai)

```bash
cd ~/speech_vlm/TTS
sudo pip3 install '.[all]'   # quoted so the shell does not treat [all] as a glob pattern
```

Expected: TTS package installed successfully.

## Phase 7 — Start VLM service

```bash
cd ~/speech_vlm
sudo docker compose up -d
```

```bash
# Verify containers are running
sudo docker ps
```

Expected: VLM containers running.
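
Reading `docker ps` by eye makes it easy to miss a service that exited right after start. A minimal sketch, assuming Docker Compose v2, compares the services defined in the compose file with those currently running. The function name `missing_services` is hypothetical.

```bash
#!/usr/bin/env bash
# Print each service from the first list (whitespace-separated) that is not
# present as a whole line in the second list (one service per line).
missing_services() {
    local expected="$1" running="$2" svc
    for svc in $expected; do
        printf '%s\n' "$running" | grep -qxF -- "$svc" || echo "$svc"
    done
}

# Example (on the device, from ~/speech_vlm):
#   expected="$(sudo docker compose config --services)"
#   running="$(sudo docker compose ps --services --filter status=running)"
#   missing="$(missing_services "$expected" "$running")"
#   [ -z "$missing" ] && echo "[OK] all services running" || echo "[STOP] not running: $missing"
```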

## Phase 8 — Add RTSP camera stream

Edit `set_streamer_id.sh`: replace `0.0.0.0` with the Jetson's IP address and set the RTSP address of your camera stream:

```bash
cd ~/speech_vlm
# Edit the script with your Jetson IP and RTSP URL
nano set_streamer_id.sh
sudo chmod +x ./set_streamer_id.sh
./set_streamer_id.sh
```

Record the returned camera ID — it is needed for the next phase.

Expected: Camera ID returned in the response.
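
To avoid copying the ID by hand, the response can be parsed. The sketch below assumes the script prints a JSON body containing a top-level `"id"` string field, for example `{"id": "<uuid>"}`; check this against the actual output before relying on it. The function name `extract_stream_id` is hypothetical.

```bash
#!/usr/bin/env bash
# Print the value of the "id" string field from JSON text on stdin.
# Assumes a response shaped like {"id": "<value>"}; prints nothing if absent.
extract_stream_id() {
    sed -n 's/.*"id"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Example (on the device):
#   CAMERA_ID="$(./set_streamer_id.sh | extract_stream_id)"
#   [ -n "$CAMERA_ID" ] && echo "[OK] camera ID: $CAMERA_ID" || echo "[STOP] no camera ID returned"
```

An empty result maps to the "Camera ID not returned" row of the failure decision tree.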

## Phase 9 — Run speech VLM

Edit `vlm_voice.py`: replace `0.0.0.0` in `API_URL` with the Jetson's IP address and set `REQUEST_ID` to the camera ID recorded in Phase 8.

```bash
cd ~/speech_vlm
sudo python3 vlm_voice.py
```

After launch, select the audio device index when prompted. Press `1` to record, `2` to send.

Expected: Program starts, audio device selection shown, speech interaction works.

## Phase 10 — View results (optional)

Edit `view_rtsp.py`: replace `0.0.0.0` in `rtsp_url` with the Jetson's IP address.

```bash
sudo pip3 install opencv-python
cd ~/speech_vlm
sudo python3 view_rtsp.py
```

Expected: RTSP output stream displayed with VLM annotations.

## Failure decision tree

| Symptom | Likely cause | Suggested fix |
|---|---|---|
| `nvidia-jetpack` install fails | Incomplete JetPack flash | Reflash with full JetPack 6 image |
| `pyaudio` install fails | Missing portaudio dev headers | `sudo apt-get install portaudio19-dev` |
| No audio devices found | USB mic not recognized | Check `lsusb`; try different USB port |
| Docker compose fails | Docker not installed or no permissions | Install docker-ce; add user to docker group |
| Camera ID not returned | Wrong IP or RTSP URL | Verify Jetson IP and camera RTSP stream accessibility |
| VLM inference timeout | Insufficient memory | Ensure 16GB+ RAM; close other processes |
| TTS install fails | Missing build dependencies | `sudo apt-get install build-essential python3-dev` |

## Reference files

- `references/source.body.md` — Full original wiki content (reference only)