--- name: profiling description: Profile code performance using callgrind and valgrind with nextest integration for analyzing instruction counts, cache behavior, and identifying bottlenecks --- # Profiling with Valgrind, Callgrind, and Nextest The facet project has pre-configured valgrind integration for debugging crashes, memory leaks, and performance profiling. ## Quick Usage ```bash # Run test under valgrind (memory errors + leaks) cargo nextest run --profile valgrind -p PACKAGE TEST_FILTER # Run test under callgrind (profiling) valgrind --tool=callgrind --callgrind-out-file=callgrind.out \ cargo nextest run --no-fail-fast -p PACKAGE TEST_FILTER # Analyze callgrind output callgrind_annotate callgrind.out # or with GUI kcachegrind callgrind.out # Linux qcachegrind callgrind.out # macOS ``` ## Nextest Valgrind Profile The project has a pre-configured valgrind profile in `.config/nextest.toml`: ### Configuration ```toml [scripts.wrapper.valgrind] # Leak checking configuration command = 'valgrind --leak-check=full --show-leak-kinds=all --errors-for-leak-kinds=definite,indirect --error-exitcode=1' [profile.valgrind] # Apply to all tests on Linux platform = 'cfg(target_os = "linux")' filter = 'all()' run-wrapper = 'valgrind' ``` **What it does:** - `--leak-check=full` - Show details for each leak - `--show-leak-kinds=all` - Show all leak types for diagnostics - `--errors-for-leak-kinds=definite,indirect` - Only fail on real leaks (not "still reachable") - `--error-exitcode=1` - Exit with code 1 if errors found ### Usage ```bash # Run specific test cargo nextest run --profile valgrind -p facet-format-json test_simple_struct # Run all tests in a file cargo nextest run --profile valgrind -p facet-format-json --test jit_deserialize # Run with filter cargo nextest run --profile valgrind -p facet-json booleans ``` **Benefits:** - ✅ Automatic configuration - no manual valgrind commands - ✅ Consistent flags across team - ✅ Integrated with nextest filtering - ✅ Clean, formatted output ## Profiling with Callgrind Callgrind is a valgrind tool for profiling instruction counts and function call graphs. ### Basic Profiling ```bash # Profile a specific test valgrind --tool=callgrind \ --callgrind-out-file=callgrind.out \ cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME # Analyze output callgrind_annotate callgrind.out ``` ### Advanced Options ```bash # Collect cache simulation data (slower but more detailed) valgrind --tool=callgrind \ --cache-sim=yes \ --branch-sim=yes \ --callgrind-out-file=callgrind.out \ cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME # Focus on specific function valgrind --tool=callgrind \ --toggle-collect=main \ --callgrind-out-file=callgrind.out \ cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME # Compress output (can get large) valgrind --tool=callgrind \ --compress-strings=yes \ --compress-pos=yes \ --callgrind-out-file=callgrind.out.gz \ cargo nextest run --no-fail-fast -p PACKAGE TEST_NAME ``` ### Analyzing Callgrind Output #### Command Line (callgrind_annotate) ```bash # Full report callgrind_annotate callgrind.out # Focus on specific functions callgrind_annotate --include='facet::' callgrind.out # Show only top functions callgrind_annotate --auto=yes --threshold=1 callgrind.out # Compare two runs callgrind_annotate --diff callgrind.old.out callgrind.new.out ``` **Reading the output:** ``` Ir # Instruction reads (total) I1mr # L1 instruction cache misses ILmr # Last-level instruction cache misses Dr # Data reads Dw # Data writes D1mr, D1mw # L1 data cache read/write misses DLmr, DLmw # Last-level data cache read/write misses -------------------------------------------------------------------------------- Ir file:function -------------------------------------------------------------------------------- 1,234,567 (45%) facet_format_json::deserialize 987,654 (35%) facet_format::parse_value ... ``` #### GUI (KCachegrind/QCachegrind) Install: ```bash # Linux sudo apt install kcachegrind # macOS brew install qcachegrind # Windows (WSL) sudo apt install kcachegrind ``` Launch: ```bash kcachegrind callgrind.out # Linux qcachegrind callgrind.out # macOS ``` **GUI features:** - Call graph visualization - Flamegraph-like views - Source code annotation (if debug symbols available) - Caller/callee relationships - Multiple metrics (instructions, cache misses, branches) ## Profiling Benchmarks The generated benchmark tests (from `benchmarks.json`) can be profiled: ### 1. As Tests (Recommended for Callgrind) ```bash # Profile a benchmark test under callgrind valgrind --tool=callgrind \ --callgrind-out-file=callgrind_simple_struct.out \ cargo nextest run --profile valgrind -p facet-json test_simple_struct # Analyze callgrind_annotate callgrind_simple_struct.out ``` **Why use tests:** - Single iteration = cleaner callgrind output - No benchmark harness overhead - Easier to focus on hot path - Faster to run ### 2. As Benchmarks (For Realistic Instruction Counts) The benchmark harness (gungraun) already uses valgrind internally: ```bash # Run gungraun benchmark (uses callgrind automatically) cargo bench --bench unified_benchmarks_gungraun --features jit simple_struct # Check output in bench-reports/gungraun-*.txt ``` **gungraun automatically collects:** - Instructions executed - Estimated cycles - L1/LL cache hits - RAM hits - Total read/write operations This data appears in `bench-reports/perf/RESULTS.md`. ## Common Profiling Workflows ### Debug a Crash ```bash # 1. Run under valgrind to find memory error cargo nextest run --profile valgrind -p PACKAGE TEST_NAME # 2. Read valgrind output for exact error location # Example: "Invalid read of size 8 at 0x123456" # 3. Fix the bug # 4. Verify fix cargo nextest run -p PACKAGE TEST_NAME ``` ### Find Performance Bottleneck ```bash # 1. Profile with callgrind valgrind --tool=callgrind \ --callgrind-out-file=profile.out \ cargo nextest run --no-fail-fast -p facet-json test_booleans # 2. Analyze callgrind_annotate --auto=yes profile.out | head -30 # 3. Identify hot functions (high instruction counts) # 4. Optimize hot functions # 5. Re-profile and compare valgrind --tool=callgrind \ --callgrind-out-file=profile_after.out \ cargo nextest run --no-fail-fast -p facet-json test_booleans callgrind_annotate --diff profile.out profile_after.out ``` ### Optimize Tier-2 JIT ```bash # 1. Check RESULTS.md for slow benchmarks grep "⚠" bench-reports/perf/RESULTS.md # 2. Profile the slow benchmark test valgrind --tool=callgrind \ --callgrind-out-file=jit_profile.out \ cargo nextest run --profile valgrind -p facet-json test_long_strings --features jit # 3. Analyze with GUI for visual call graph kcachegrind jit_profile.out # 4. Look for: # - Helper function calls in tight loops # - Redundant alignment checks # - Allocation hot spots # 5. Optimize based on findings # 6. Verify with benchmarks cargo xtask bench long_strings ``` ### Compare Before/After Optimization ```bash # Before git checkout main valgrind --tool=callgrind --callgrind-out-file=before.out \ cargo nextest run --no-fail-fast -p facet-json test_target # After git checkout my-optimization-branch valgrind --tool=callgrind --callgrind-out-file=after.out \ cargo nextest run --no-fail-fast -p facet-json test_target # Compare callgrind_annotate --diff before.out after.out ``` ## Interpreting Valgrind Output ### Memory Error Example ``` ==12345== Invalid read of size 8 ==12345== at 0x123456: facet_format_json::parse_number (parse.rs:42) ==12345== by 0x234567: facet_format_json::deserialize (lib.rs:123) ==12345== Address 0x789abc is 0 bytes after a block of size 16 alloc'd ==12345== at 0x345678: alloc (alloc.rs:88) ==12345== by 0x456789: Vec::push (vec.rs:1234) ``` **Translation:** - Reading 8 bytes from invalid address - Happened in `parse_number` at line 42 - Address is just past end of 16-byte allocation - **Fix:** Check bounds before reading, or fix off-by-one error ### Leak Example ``` ==12345== 128 bytes in 1 blocks are definitely lost in loss record 1 of 10 ==12345== at 0x123456: malloc (vg_replace_malloc.c:299) ==12345== by 0x234567: alloc (alloc.rs:88) ==12345== by 0x345678: Box::new (boxed.rs:123) ==12345== by 0x456789: setup_jit (jit.rs:456) ``` **Translation:** - 128 bytes allocated but never freed - Allocated in `setup_jit` function - **Fix:** Ensure cleanup/Drop implementation ### Cachegrind Output Example ``` Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw -------------------------------------------------------------------------------- 1,234,567 123 45 456,789 234 12 123,456 67 8 facet::deserialize 987,654 98 32 345,678 189 9 98,765 43 5 - facet::parse_value 234,567 23 10 98,765 45 2 23,456 12 1 - facet::parse_string ``` **Key metrics:** - `Ir` - Instructions executed (most important for optimization) - `D1mr/D1mw` - L1 data cache misses (indicates poor locality) - `DLmr/DLmw` - Last-level cache misses (very expensive) **Optimization targets:** 1. High `Ir` count = time-consuming function 2. High `D1mr` = poor data locality, consider restructuring 3. High `DLmr` = main memory accesses, critical to optimize ## Profiling Flags ### Valgrind (Memory Debugging) ```bash --leak-check=full # Detailed leak info --show-leak-kinds=all # Show all leak types --track-origins=yes # Track uninitialized values (slower) --verbose # More diagnostic info --log-file=valgrind.log # Save output to file ``` ### Callgrind (Profiling) ```bash --callgrind-out-file=FILE # Output file (default: callgrind.out.) --cache-sim=yes # Simulate cache behavior --branch-sim=yes # Simulate branch prediction --collect-jumps=yes # Collect jump information --dump-instr=yes # Dump instruction info --compress-strings=yes # Compress output (smaller files) ``` ### Cargo Nextest ```bash --no-fail-fast # Continue running after first failure --profile valgrind # Use valgrind profile from nextest.toml --test-threads=1 # Run single-threaded (better for profiling) ``` ## Tips and Tricks ### Speed Up Profiling 1. **Profile in release mode** (but keep debug symbols): ```bash # Add to Cargo.toml [profile.release] debug = true ``` 2. **Use `--no-fail-fast` to avoid stopping early** 3. **Filter to specific tests** - don't profile everything at once 4. **Disable address randomization** for reproducible runs: ```bash setarch $(uname -m) -R valgrind --tool=callgrind ... ``` ### Read Callgrind Data Programmatically ```python # Example: Parse callgrind output for automation def parse_callgrind(filename): import re costs = {} with open(filename) as f: for line in f: if m := re.match(r'(\d+)\s+(.+)', line): cost, func = m.groups() costs[func] = int(cost) return costs # Compare two profiles before = parse_callgrind('before.out') after = parse_callgrind('after.out') for func in before: if func in after: delta = after[func] - before[func] percent = (delta / before[func]) * 100 if abs(percent) > 5: # More than 5% change print(f"{func}: {percent:+.1f}% ({delta:+,} instructions)") ``` ## Don't Do This ❌ Run valgrind without nextest profile - inconsistent flags ❌ Profile debug builds - too slow and unrepresentative ❌ Ignore "still reachable" leaks in FFI code - sometimes OK ❌ Profile with multiple test threads - non-deterministic results ❌ Forget to clean between profiling runs - stale data ## Do This Instead ✅ Use `--profile valgrind` for memory debugging ✅ Use callgrind for performance profiling ✅ Profile release builds with debug symbols ✅ Focus on hot paths (high `Ir` counts) ✅ Compare before/after with `--diff` ✅ Use GUI tools (kcachegrind) for complex call graphs ## Files and Locations ``` .config/nextest.toml # Valgrind profile configuration callgrind.out.* # Callgrind output files (gitignored) bench-reports/gungraun-*.txt # Gungraun output (includes instruction counts) ``` ## Troubleshooting ### Valgrind complains about "unrecognized instruction" - Update valgrind: `sudo apt update && sudo apt install valgrind` - Or use `--vex-iropt-register-updates=allregs-at-mem-access` ### Callgrind output is huge - Use `--compress-strings=yes --compress-pos=yes` - Or filter to specific functions with `--toggle-collect=function_name` ### Profile doesn't match benchmark results - Ensure you're profiling the same code path - Check if JIT compilation is cached (use setup functions in gungraun) - Profile release build, not debug ### Can't open callgrind file in GUI - Check file permissions - Ensure file isn't corrupted (run `callgrind_annotate` first) - Try different viewer (kcachegrind vs qcachegrind) ## See Also - Valgrind manual: https://valgrind.org/docs/manual/manual.html - Callgrind manual: https://valgrind.org/docs/manual/cl-manual.html - Nextest wrapper scripts: https://nexte.st/docs/configuration/wrapper-scripts/ - KCachegrind handbook: https://docs.kde.org/stable5/en/kcachegrind/ - Project nextest config: `.config/nextest.toml` - Benchmark debugging: See `benchmarking.md`