# CUTracer

CUTracer is a CUDA binary instrumentation tool built on [NVBit](https://github.com/NVlabs/NVBit). It cleanly separates lightweight data collection (instrumentation) from host-side processing (analysis). Typical workflows include per-warp instruction histograms (delimited by GPU clock reads) and kernel hang detection.

## Features

-   NVBit-powered, runtime attach via `CUDA_INJECTION64_PATH` (no app rebuild needed)
-   Multiple instrumentation modes: opcode-only, register trace, memory trace, random delay
-   Built-in analyses:
    -   Instruction Histogram (for Proton/Triton workflows)
    -   Deadlock/Hang Detection
    -   Data Race Detection
-   CUDA Graph and stream-capture aware flows
-   Deterministic kernel log file naming and CSV outputs

## Requirements

CUTracer has the same requirements as NVBit.

Additional requirements:
- **libzstd**: Required for trace compression

## Installation

1. Clone the repository:

```bash
cd ~
git clone git@github.com:facebookexperimental/CUTracer.git
cd CUTracer
```

2. Install system dependencies (libzstd static library for self-contained builds):

```bash
# Ubuntu/Debian
# On most Ubuntu/Debian systems, libzstd-dev provides both shared and static libs (libzstd.a).
# You can verify this with: dpkg -L libzstd-dev | grep 'libzstd.a'
# If your distribution does not ship the static library in libzstd-dev, you may need to
# build zstd from source or install a distro-specific static libzstd package.
sudo apt-get install libzstd-dev

# CentOS/RHEL/Fedora (static library for portable builds)
sudo dnf install libzstd-static

# If static library is not available, the build will fall back to dynamic linking
# and display a warning. The resulting binary will not be self-contained.
```

3. Download third-party dependencies:

```bash
./install_third_party.sh
```

This will download:
- NVBit (NVIDIA Binary Instrumentation Tool)
- nlohmann/json (JSON library for C++)

4. Build the tool:

```bash
make -j$(nproc)
```

## Quickstart

### 1. Install the Python CLI

```bash
cd ~/CUTracer/python
pip install .
```

### 2. Run your CUDA app with CUTracer

```bash
# Option A: Set CUTRACER_LIB_PATH once (recommended)
export CUTRACER_LIB_PATH=~/CUTracer/lib
cutracer trace -i tma_trace -- ./your_app

# Option B: Specify cutracer.so explicitly
cutracer trace -i tma_trace --cutracer-so ~/CUTracer/lib/cutracer.so -- ./your_app

# Option C: Run from the CUTracer project root (auto-discovers ./lib/cutracer.so)
cd ~/CUTracer
cutracer trace -i tma_trace -- ./your_app

# Option D: Kernel launch logger only (no instrumentation, no trace files)
cutracer trace -- ./your_app
```

### 3. Analyze the output

```bash
cutracer analyze warp-summary output.ndjson
cutracer query output.ndjson --filter "warp=24"
cutracer validate output.ndjson
```
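
For ad-hoc inspection outside the CLI, an uncompressed NDJSON trace can be read line by line with the Python standard library. The sketch below counts records per warp. The `warp` field name is an assumption borrowed from the `--filter "warp=24"` example above, and the sample records (including the `type` field) are made up for illustration, so check them against your own trace first.

```python
import json
from collections import Counter


def count_records_by_key(lines, key="warp"):
    """Count NDJSON records per value of `key`.

    Blank lines and records that lack the key (for example, metadata
    events) are skipped.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if key in record:
            counts[record[key]] += 1
    return counts


# Hypothetical records; real CUTracer records carry more fields.
sample = [
    '{"warp": 24, "pc": 16}',
    '{"warp": 24, "pc": 32}',
    '{"warp": 25, "pc": 16}',
    '{"type": "kernel_metadata"}',
]
counts = count_records_by_key(sample)
print(counts.most_common())
```

To run it on a real file, pass an open file object, for example `count_records_by_key(open("output.ndjson"))`.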

> **Note**: You can also use CUTracer without the Python CLI by setting the
> `CUDA_INJECTION64_PATH` environment variable directly:
> ```bash
> CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so ./your_app
> ```

## Configuration (env vars)

-   `CUTRACER_INSTRUMENT`: comma-separated modes: `opcode_only`, `reg_trace`, `mem_trace`, `random_delay`
-   `CUTRACER_ANALYSIS`: comma-separated analyses: `proton_instr_histogram`, `deadlock_detection`, `random_delay`
    -   Enabling `proton_instr_histogram` auto-enables `opcode_only`
    -   Enabling `deadlock_detection` auto-enables `reg_trace`
    -   Enabling `random_delay` auto-enables `random_delay` instrumentation; also requires `CUTRACER_DELAY_NS` to be set
-   `KERNEL_FILTERS`: comma-separated substrings matching unmangled or mangled kernel names
-   `INSTR_BEGIN`, `INSTR_END`: static instruction index gate during instrumentation
-   `TOOL_VERBOSE`: 0/1/2
-   `CUTRACER_TRACE_FORMAT`: trace output format. Accepts string names or numeric values (replaces the legacy `TRACE_FORMAT_NDJSON` env var, which is still accepted for backward compatibility)
    -   `ndjson` (or `2`, default): NDJSON uncompressed (`.ndjson`)
    -   `text` (or `0`): Plain text (`.log`, legacy format, verbose)
    -   `zstd` (or `1`): NDJSON+Zstd compressed (`.ndjson.zst`, ~12x compression, 92% space savings)
    -   `clp` (or `3`): CLP Archive (`.clp`)
-   `CUTRACER_ZSTD_LEVEL`: Zstd compression level (1-22, default 9)
    -   Lower values (1-3): Faster compression, slightly larger output
    -   Higher values (19-22): Maximum compression, slower but smallest output
    -   Default of 9 provides balanced compression speed and ratio
- `CUTRACER_DELAY_NS`: Max delay value in nanoseconds for `random_delay` analysis (required when `random_delay` is enabled)
- `CUTRACER_DELAY_MIN_NS`: Minimum delay in nanoseconds — floor for random mode (default: 0). Must be ≤ `CUTRACER_DELAY_NS`
- `CUTRACER_DELAY_MODE`: Delay mode: `random` (per-thread random, default), `fixed` (same for all threads), `cluster` (one CTA per cluster), `cluster_fixed` (fixed delay, one CTA per cluster)
- `CUTRACER_DELAY_DUMP_PATH`: Output path for delay config JSON file (for recording instrumentation patterns)
- `CUTRACER_DELAY_LOAD_PATH`: Input path for delay config JSON file (for replay mode - deterministic reproduction)
- `CUTRACER_DELAY_PATTERNS`: Comma-separated SASS instruction substrings for delay injection (overrides built-in patterns). Use `"*"` to match all instructions
- `CUTRACER_DELAY_ENABLE_PROB`: Per-PC enable probability (0.0-1.0, default: 0.5). Use `1.0` with warp targeting
- `CUTRACER_DELAY_WARPGROUP_ID`: Warp-targeted delay: warpgroup index (>= 0 selects warps `[4N..4N+3]`, -1 = disabled)
- `CUTRACER_DELAY_WARP_MASK`: Warp-targeted delay: hex bitmask of CTA-local warp IDs (e.g. `0xF` for warps 0-3, 0 = disabled)
- `CUTRACER_OUTPUT_DIR`: Output directory for all CUTracer files (trace files and log files). Defaults to the current directory. The directory must exist and be writable.
- `CUTRACER_CPU_CALLSTACK`: CPU call stack capture mode at each kernel launch (default: `auto`)
    - `auto` (default): Prefer PyTorch CapturedTraceback for Python frames, fallback to C++ backtrace if Python/PyTorch is unavailable
    - `pytorch`: Force PyTorch CapturedTraceback only (returns empty if unavailable)
    - `backtrace`: Force C++ backtrace only (original behavior)
    - `1`: Same as `auto` (backward compatible)
    - `0`: Disable call stack capture
    - When enabled, the `kernel_metadata` trace event includes a `cpu_callstack` array and a `cpu_callstack_source` field (`"pytorch"` or `"backtrace"`) indicating the capture method used
- `CUTRACER_KERNEL_TIMEOUT_S`: Kernel execution time limit in seconds (default: 0 = disabled)
    - Terminates the process with SIGTERM when a kernel runs longer than this value
    - Acts as a general safety valve, independent of deadlock detection (does not require `-a deadlock_detection`)
- `CUTRACER_NO_DATA_TIMEOUT_S`: No-data hang detection timeout in seconds (default: 15)
    - Terminates the process with SIGTERM when no trace data arrives for this duration
    - Acts as a general safety valve, independent of deadlock detection (does not require `-a deadlock_detection`)
    - Catches "silent" hangs where all warps are blocked on synchronization primitives with zero trace output
    - Works whether the kernel went silent after producing some data, or never produced any data at all
    - When `-a deadlock_detection` is also active, prints detailed warp status summary before termination
    - Set to 0 to disable
- `CUTRACER_TRACE_SIZE_LIMIT_MB`: Maximum trace file size in MB (default: 0 = disabled)
    - When any trace file exceeds this limit, tracing is stopped for that kernel; kernel execution continues normally
    - Useful for preventing runaway trace files from filling disk (e.g., during deadlocked kernels)

**Notes:**
- The tool sets `CUDA_MANAGED_FORCE_DEVICE_ALLOC=1` to simplify channel memory handling.
- Multiple analyses can be combined (e.g., `CUTRACER_ANALYSIS=proton_instr_histogram,deadlock_detection`). Each analysis auto-enables its required instrumentation mode.
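
When launching from a Python driver instead of a shell, the same variables can be assembled into an environment dictionary. The helper below is a hypothetical convenience, not part of CUTracer. It only encodes two rules stated above: `random_delay` needs `CUTRACER_DELAY_NS`, and `CUTRACER_DELAY_MIN_NS` must not exceed it.

```python
import os


def build_cutracer_env(lib_path, analyses, delay_ns=None, delay_min_ns=0,
                       base_env=None):
    """Return a copy of the environment with CUTracer variables set.

    `build_cutracer_env` is an illustrative helper name, not a CUTracer API.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_INJECTION64_PATH"] = lib_path
    env["CUTRACER_ANALYSIS"] = ",".join(analyses)
    if "random_delay" in analyses:
        if delay_ns is None:
            raise ValueError("random_delay requires CUTRACER_DELAY_NS")
        if delay_min_ns > delay_ns:
            raise ValueError("CUTRACER_DELAY_MIN_NS must be <= CUTRACER_DELAY_NS")
        env["CUTRACER_DELAY_NS"] = str(delay_ns)
        env["CUTRACER_DELAY_MIN_NS"] = str(delay_min_ns)
    return env


env = build_cutracer_env(
    os.path.expanduser("~/CUTracer/lib/cutracer.so"),
    ["proton_instr_histogram", "deadlock_detection"],
    base_env={},
)
print(env["CUTRACER_ANALYSIS"])
```

The result can be passed to `subprocess.run(["./your_app"], env=env, check=True)`.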

## Analyses

### Instruction Histogram (proton_instr_histogram)

-   Counts SASS instruction mnemonics per warp within regions delimited by clock reads (start/stop model; nested regions not supported)
-   Output: one CSV per kernel launch with columns `warp_id,region_id,instruction,count`

Example (Triton/Proton + IPC):

```bash
cd ~/CUTracer/tests/proton_tests

# 1) Collect histogram with CUTracer
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=proton_instr_histogram \
KERNEL_FILTERS=add_kernel \
python ./vector-add-instrumented.py

# 2) Run without CUTracer to generate a clean Chrome trace
python ./vector-add-instrumented.py

# 3) Merge and compute IPC
python ~/CUTracer/scripts/parse_instr_hist_trace.py \
  --chrome-trace ./vector.chrome_trace \
  --cutracer-trace ./kernel_*_add_kernel_hist.csv \
  --cutracer-log ./cutracer_main_*.log \
  --output vectoradd_ipc.csv
```
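
For a quick look at a histogram without the IPC merge, the per-launch CSV can be summed across warps with the standard library. This is a minimal sketch that assumes the file starts with a header row naming the columns `warp_id,region_id,instruction,count`; the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def instruction_totals(csv_file):
    """Sum `count` over all warps for each (region_id, instruction) pair."""
    totals = defaultdict(int)
    for row in csv.DictReader(csv_file):
        key = (int(row["region_id"]), row["instruction"])
        totals[key] += int(row["count"])
    return dict(totals)


# Invented sample data in the documented column layout.
sample = io.StringIO(
    "warp_id,region_id,instruction,count\n"
    "0,0,FADD,4\n"
    "1,0,FADD,4\n"
    "0,0,LDG,2\n"
    "0,1,STG,1\n"
)
totals = instruction_totals(sample)
for (region, instr), count in sorted(totals.items()):
    print(region, instr, count)
```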

### Deadlock / Hang Detection (deadlock_detection)

-   Detects hangs by identifying warps stuck in stable PC loops; logs the hang and, if it persists, sends SIGTERM followed by SIGKILL
-   Requires `reg_trace` (auto-enabled)

Example (intentional loop):

```bash
cd ~/CUTracer/tests/hang_test
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
CUTRACER_ANALYSIS=deadlock_detection \
python ./test_hang.py
```
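
To build intuition for what a "stable PC loop" means, the sketch below flags a warp whose recent program counters keep revisiting the same small set of addresses. It is an illustration only: the window size and the set-comparison rule are assumptions, not CUTracer's actual detection criteria.

```python
def looks_stuck(pc_history, window=8):
    """Return True if the last `window` PCs cover exactly the same set of
    addresses as the `window` PCs before them.

    Straight-line code keeps moving to new PCs, so the two sets differ.
    A spin loop keeps revisiting the same PCs, so the sets match.
    """
    if len(pc_history) < 2 * window:
        return False
    recent = set(pc_history[-window:])
    previous = set(pc_history[-2 * window:-window])
    return recent == previous


spinning = [0x10, 0x20, 0x30] * 8          # warp cycling through three PCs
progressing = list(range(0, 24 * 16, 16))  # warp advancing through new PCs
print(looks_stuck(spinning, window=6), looks_stuck(progressing, window=6))
```

A legitimate long-running loop also matches this check for a while, which is why a detector has to require the condition to hold for a sustained period before acting.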

### Data Race Detection (random_delay)

-   Data races depend on thread scheduling and timing — buggy code may appear correct by luck.
    This analysis exposes hidden races by injecting random delays before **synchronization-related SASS instructions** (e.g., `BAR`, `MEMBAR`, `ATOM`, `RED`), disrupting the normal timing and forcing latent races to manifest as observable failures.
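-   Each instrumentation point is randomly enabled or disabled (probability 0.5 by default; see `CUTRACER_DELAY_ENABLE_PROB`)
-   Two main delay modes (`CUTRACER_DELAY_MODE` also accepts the `cluster` and `cluster_fixed` variants listed under Configuration):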
    -   **`random` (default):** Each thread gets a random delay in `[0, CUTRACER_DELAY_NS]` using GPU-side xorshift32 PRNG seeded with `threadIdx/blockIdx/clock`. Creates per-thread timing skew that amplifies data races. **Recommended.**
    -   **`fixed`:** All threads get the same delay. Preserves relative timing between threads and often *masks* races rather than exposing them. Not recommended for race detection.
-   Requires `CUTRACER_DELAY_NS` to be set. The `random_delay` instrumentation mode is auto-enabled.
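
The per-thread delay in `random` mode can be pictured with a host-side model of xorshift32. The shift constants (13, 17, 5) are the standard Marsaglia xorshift32 parameters. How CUTracer mixes `threadIdx/blockIdx/clock` into the seed and maps the output to a delay is not specified here, so the seeding and the modulo mapping below are assumptions.

```python
MASK32 = 0xFFFFFFFF


def xorshift32(state):
    """One step of Marsaglia's xorshift32; `state` must be nonzero."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state


def thread_delay_ns(seed, max_ns, min_ns=0):
    """Map a per-thread seed to a delay in [min_ns, max_ns] (assumed mapping)."""
    if not 0 <= min_ns <= max_ns:
        raise ValueError("need 0 <= min_ns <= max_ns")
    state = xorshift32((seed & MASK32) or 1)  # avoid the all-zero state
    return min_ns + state % (max_ns - min_ns + 1)


# Hypothetical seeds standing in for a mix of threadIdx/blockIdx/clock.
delays = [thread_delay_ns(seed, max_ns=100000) for seed in range(1, 9)]
print(delays)
```

Because neighbouring seeds produce unrelated delays, threads that normally run in lockstep drift apart, which is the timing skew the `random` mode relies on.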

Example:

```bash
CUTRACER_DELAY_NS=100000 \
CUTRACER_ANALYSIS=random_delay \
CUDA_INJECTION64_PATH=~/CUTracer/lib/cutracer.so \
python3 your_kernel.py
```
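
For warp-targeted delays, the value of `CUTRACER_DELAY_WARP_MASK` can be computed rather than written by hand. The sketch assumes bit `i` of the mask selects CTA-local warp `i`, which is consistent with the documented `0xF` for warps 0-3; the helper names are hypothetical.

```python
def warp_mask(warp_ids):
    """Build a bitmask with one bit set per CTA-local warp ID."""
    mask = 0
    for warp_id in warp_ids:
        if warp_id < 0:
            raise ValueError("warp IDs must be non-negative")
        mask |= 1 << warp_id
    return mask


def warpgroup_mask(warpgroup_id):
    """Mask covering warps [4N..4N+3] of warpgroup N."""
    return warp_mask(range(4 * warpgroup_id, 4 * warpgroup_id + 4))


print(hex(warp_mask([0, 1, 2, 3])))  # 0xf
print(hex(warpgroup_mask(1)))        # 0xf0
```

As noted in the configuration list, `CUTRACER_DELAY_ENABLE_PROB=1.0` is recommended when targeting specific warps.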

#### Delay Dump and Replay

CUTracer supports dumping delay configurations to JSON for deterministic reproduction of data races:

- **Dump mode**: Set `CUTRACER_DELAY_DUMP_PATH` to save the random instrumentation pattern to a JSON file
- **Replay mode**: Set `CUTRACER_DELAY_LOAD_PATH` to load a saved config and reproduce the exact same delay pattern

**Note**: `CUTRACER_DELAY_DUMP_PATH` and `CUTRACER_DELAY_LOAD_PATH` cannot be set at the same time.

**Workflow**:
1. Run with `CUTRACER_DELAY_DUMP_PATH=/tmp/config.json` to record the delay pattern
2. When a failure occurs, save the config file
3. Replay with `CUTRACER_DELAY_LOAD_PATH=/tmp/config.json` to reproduce deterministically
4. Reduce with `cutracer reduce` to find the minimal set of delay points (see below)

#### Reduce (Delta Debugging)

The `reduce` subcommand finds the minimal set of delay injection points that trigger a data race. Two strategies:

-   **`linear`**: Tests each point one by one. O(N) test runs. Simple but slow.
-   **`bisect`**: ddmin-style bisection. Splits points in half and recursively narrows down. Typically O(log N) iterations. **Recommended for large configs.**

Use `--confidence-runs N` (odd number) for majority voting when the race is probabilistic.

```bash
# Bisection reduction (fast)
cutracer reduce -c config.json -t ./test_race.sh --strategy bisect --confidence-runs 3
```

The test script convention follows `llvm-reduce`: exit 0 = interesting (race occurred), exit 1+ = not interesting (no race).
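
The idea behind the `bisect` strategy and `--confidence-runs` can be sketched in a few lines. This is a generic ddmin-style reduction with majority voting, written for illustration; it is not CUTracer's implementation, and the function names are hypothetical. It assumes the full set of points is interesting to begin with.

```python
def majority_vote(run_once, runs=3):
    """Run a flaky check `runs` times; True if it succeeds in most runs."""
    hits = sum(1 for _ in range(runs) if run_once())
    return 2 * hits > runs


def ddmin(points, is_interesting):
    """Shrink `points` to a smaller list that is still interesting."""
    points = list(points)
    n = 2  # number of chunks to split into
    while len(points) >= 2:
        chunk = -(-len(points) // n)  # ceiling division
        starts = range(0, len(points), chunk)
        subsets = [points[s:s + chunk] for s in starts]
        complements = [points[:s] + points[s + chunk:] for s in starts]

        # Try each chunk on its own first.
        found = next((c for c in subsets if is_interesting(c)), None)
        if found is not None:
            points, n = found, 2
            continue
        # With two chunks the complements equal the chunks, so skip them.
        if len(subsets) > 2:
            found = next((c for c in complements if is_interesting(c)), None)
            if found is not None:
                points, n = found, max(n - 1, 2)
                continue
        if n >= len(points):
            break  # already at single-point granularity
        n = min(len(points), n * 2)
    return points


# Toy oracle: the "race" needs delay points 3 and 11 together.
minimal = ddmin(range(16), lambda pts: {3, 11} <= set(pts))
print(minimal)
```

For a probabilistic race, the oracle passed to `ddmin` would wrap a single test run in `majority_vote`, which is why an odd number of runs is suggested.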

## Examples

The [`examples/`](examples/) directory contains reference trace outputs for common workflows:

-   **[Proton Trace](examples/proton_trace/)** -- sample instruction histogram CSV, CUTracer log, and a README explaining the end-to-end proton instrumentation workflow for a Triton vector-add kernel

## Troubleshooting

-   No CSV/log: check `CUDA_INJECTION64_PATH`, `KERNEL_FILTERS`, and write permissions
-   Empty histogram: ensure kernels emit clock instructions (e.g., Triton `pl.scope`)
-   High overhead: prefer opcode-only; narrow filters; use `INSTR_BEGIN/INSTR_END`
-   CUDA Graph/stream capture: data is flushed at `cuGraphLaunch` exit; ensure stream sync
-   IPC merge issues: resolve warp mismatches and kernel hash ambiguity with parser flags

## License

This repository contains code under the MIT license (Meta) and the BSD-3-Clause license (NVIDIA). See [LICENSE](LICENSE) and [LICENSE-BSD](LICENSE-BSD) for details.

## Documentation

The full project documentation lives in [`docs/`](docs/) and is automatically
synced to the [GitHub Wiki](https://github.com/facebookexperimental/CUTracer/wiki)
on every push to `main` via
[`.github/workflows/sync-wiki.yml`](.github/workflows/sync-wiki.yml).

**Edit `docs/*.md`, not the wiki directly** — direct wiki edits will be
overwritten on the next sync.

Key topics: [Quickstart](docs/Quickstart.md), [Concepts](docs/Concepts.md),
[Configuration](docs/Configuration.md),
[Instrumentation Modes](docs/Instrumentation-Modes.md),
[Outputs and File Formats](docs/Outputs-and-File-Formats.md),
[API and Data Structures](docs/API-and-Data-Structures.md),
[Analyses](docs/Analyses.md),
[Post-processing: IPC Merge](docs/Post-processing-IPC-Merge.md),
[Triton/Proton Integration](docs/Triton_Proton-Integration.md),
[Architecture](docs/Architecture.md),
[Developer Guide](docs/Developer-Guide.md),
[Build, Test, and CI](docs/Build-Test-and-CI.md),
[Troubleshooting](docs/Troubleshooting.md), [FAQ](docs/FAQ.md).