# Backtesting Engine ## Purpose The backtesting module evaluates standardized strategy signals for a single-asset, delta-neutral perpetual funding-rate arbitrage prototype. It is intentionally narrower than a generic portfolio simulator: the goal is to make the course-project strategy accounting explicit, reproducible, and hard to overstate. The engine answers one practical question: > Given hourly model or rule-based signals, does a simple delta-neutral execution rule produce useful post-cost performance after fees, slippage, gas, funding effects, and holding-period constraints? ## Scope The current version supports: - one asset at a time - one open position at a time per strategy - standardized signals from the signal layer - fixed-notional delta-neutral positions - explicit perp, spot, funding, fee, gas, and friction PnL - realized-only and mark-to-market equity curves - test-split-first leaderboards by default - strategy metrics, split summaries, trade logs, plots, markdown reports, and manifests It compares each `strategy_name` independently. If one signal artifact contains several rule, baseline, or deep-learning strategies, each strategy receives its own simulated equity curve and metrics row. ## Inputs Default inputs: - signals: `data/artifacts/signals/binance/btcusdt/1h/baseline/signals.parquet` - market data: `data/processed/binance/btcusdt/1h/hourly_market_data.parquet` - config: `configs/backtests/default.yaml` Required signal fields include: - `timestamp` - `asset` - `source` - `source_subtype` - `strategy_name` - `task` - `signal_score` - `predicted_class` - `expected_return_bps` - `suggested_direction` - `confidence` - `should_trade` - `split` - `metadata_json` Optional model metadata fields such as selected threshold, calibration method, checkpoint-selection metric, and preprocessing settings are preserved when present. Required market fields include: - `timestamp` - `symbol` - `venue` - `frequency` - `perp_open`, `perp_close` - `spot_open`, `spot_close` - `funding_rate` ## Position Model The default direction is `short_perp_long_spot`. The implemented hedge mode is: - `equal_notional_hedge` For a configured `position_notional` of `10,000 USD`, each trade uses: - `10,000 USD` notional on the perpetual leg - `10,000 USD` notional on the spot hedge leg - `20,000 USD` gross exposure Other hedge modes are reserved in config for future work: - `equal_quantity_hedge` - `contract_multiplier_adjusted_hedge` They are not implemented yet because this prototype does not model exchange-specific contract multipliers or quantity-level hedge rebalancing. ## Entry And Exit Logic Entry logic: 1. observe signal at timestamp `t` 2. require `should_trade = 1` unless disabled in config 3. require `suggested_direction` to match config direction 4. optionally require minimum signal score, confidence, or expected return 5. execute after `entry_delay_bars` using `execution_price_field` With the default config, a signal observed at hour `t` executes on the next bar open. This preserves the no-lookahead convention used by the label and signal pipelines. Exit logic: - `signal_off`: close when the signal no longer satisfies entry conditions - `holding_window`: close at the configured soft holding horizon - `maximum_holding`: hard time cap - `stop_loss` / `take_profit`: optional stop conditions - `end_of_data`: close any remaining open position at the final market row Stop approximation: - `stop_observation_mode = bar_close_observed` - `stop_execution_mode = next_bar_executed` This means stops are observed on hourly bar-close marks and executed on the configured next execution bar. The engine does not model intrabar stop fills. ## PnL Logic Trade PnL is decomposed into: - perpetual leg PnL - spot hedge leg PnL - funding PnL - trading fees - gas cost - other friction - embedded slippage diagnostic Slippage is applied through adverse effective execution prices. The preferred output column is `embedded_slippage_cost_usd`, which is a diagnostic approximation of the embedded price impact and is not deducted a second time. The older `estimated_slippage_cost_usd` column is retained as a compatibility alias. ## Funding Assumptions Funding behavior is configurable: - `funding_mode = prototype_bar_sum`: original prototype behavior; sum every aligned `funding_rate` row between entry and exit. - `funding_mode = event_aware`: use explicit `funding_event` / `is_funding_event` markers if present; otherwise use UTC hour modulo `funding_interval_hours` as a Binance-style fallback. Funding notional is configurable: - `funding_notional_mode = initial_notional`: apply funding to fixed entry notional. - `funding_notional_mode = dynamic_position_value`: approximate funding using current perp close value of the original position quantity. The default stays `prototype_bar_sum` plus `initial_notional` for backward compatibility and simple auditability. Reports and manifests record the chosen mode and funding rows used. ## Equity Curves The output `equity_curve.parquet` includes both audit and risk views: - `realized_equity_usd`: changes only when trades close. - `mark_to_market_equity_usd`: marks open positions on each market bar using current close prices. - `equity_usd`: alias for the mark-to-market equity used by primary risk metrics. - `realized_drawdown`: drawdown from realized-only equity. - `mark_to_market_drawdown`: drawdown from mark-to-market equity. - `drawdown`: alias for mark-to-market drawdown. Primary drawdown, Sharpe, annualized return, and cumulative return use the mark-to-market curve by default. Realized-only columns remain available because they are easier to audit against closed trades, but they can understate intratrade risk. ## Split-Aware Evaluation The config field `reporting.primary_split` controls the main leaderboard. The default is: ```yaml reporting: primary_split: test ``` This means: - `strategy_metrics.parquet` and `leaderboard.parquet` are test-split primary outputs. - `primary_trade_log.parquet` contains only trades from the configured primary split. - `split_summary.parquet` keeps train, validation, and test trade summaries separate. - `combined_strategy_metrics.parquet` is written as a secondary diagnostic when enabled. This prevents the main leaderboard from silently mixing in-sample and out-of-sample trades. The config model also validates that `reporting.primary_split` is present in `selection.split_filter` unless the primary split is explicitly set to `combined`. ## Capital And Leverage Checks The engine reports implied gross leverage: ```text gross exposure = 2 * position_notional * max_open_positions implied gross leverage = gross exposure / initial_capital ``` Config fields: ```yaml portfolio: max_gross_leverage: 2.0 leverage_check_mode: warn ``` `leverage_check_mode` can be: - `off` - `warn` - `fail` For course-project clarity, the default warns rather than failing, but submission/demo configs should avoid unrealistic leverage. ## Metrics Primary strategy metrics include: - cumulative return - annualized return - simple annualized Sharpe - raw-period Sharpe - autocorrelation-adjusted Sharpe diagnostic - mark-to-market max drawdown - realized-only max drawdown - win rate - profit factor - average and median trade return - expectancy per trade - average and median holding hours - max consecutive losses - exposure time fraction - average and maximum gross leverage - turnover - fee, gas, other friction, funding, and embedded slippage diagnostics - funding contribution share Sharpe caveat: The main Sharpe is a simple square-root-scaled annualized Sharpe from the mark-to-market return stream. This is useful for comparison, but sparse and serially correlated strategy returns can make annualization optimistic. Treat it as a compact diagnostic rather than proof of live-trading quality. No-trade caveat: If a strategy has no executable trades on the evaluation split, the backtest now records: - `status = no_tradable_signals` when the upstream signal layer produced no tradeable signals - `status = no_executed_trades` when signals existed but execution logic never opened a position - `diagnostic_reason` and `skip_reason` In those cases, Sharpe and drawdown style risk metrics are written as `NaN`, not `0`, because `0` can falsely imply a meaningful flat-risk outcome when nothing actually traded. ## Outputs The backtester writes: - `trade_log.parquet` / `trade_log.csv` - `primary_trade_log.parquet` / `primary_trade_log.csv` - `equity_curve.parquet` / `equity_curve.csv` - `strategy_metrics.parquet` / `strategy_metrics.csv` - `combined_strategy_metrics.parquet` / `combined_strategy_metrics.csv` - `split_summary.parquet` / `split_summary.csv` - `leaderboard.parquet` / `leaderboard.csv` - `backtest_report.md` - `backtest_manifest.json` - figures under `figures/` Strategy metric and leaderboard tables now also preserve `status`, `diagnostic_reason`, `skip_reason`, and split-level signal counts so zero-trade rows remain interpretable downstream. Plots: - mark-to-market cumulative return by strategy - mark-to-market drawdown by strategy - primary-split trade-return distribution boxplot ## Command ```powershell & 'd:\MG\anaconda3\python.exe' -m src.main backtest --config configs/backtests/default.yaml ``` ## Main Files - engine: `src/funding_arb/backtest/engine.py` - metrics helpers: `src/funding_arb/evaluation/metrics.py` - config model: `src/funding_arb/config/models.py` - default config: `configs/backtests/default.yaml` - CLI wrapper: `scripts/backtests/run_backtest.py` - tests: `tests/unit/test_backtest_engine.py`, `tests/unit/test_metrics.py` ## Limitations The engine is still a research backtester, not a production execution simulator. Known simplifications: - single-asset only - one open position per strategy - no partial exits - no order book, latency, liquidation, margin, or borrow-cost model - no intrabar stop execution - event-aware funding uses cleaned dataset markers when available and a simple UTC-hour fallback otherwise - no dynamic hedge rebalancing beyond the optional dynamic funding notional approximation These limitations should be disclosed in the final presentation and interpreted as prototype boundaries, not hidden assumptions.