# Baseline Strategies and Predictive Models

## Purpose

This module is the first serious benchmark layer for the project. Its job is not to "win" the strategy problem by itself, but to answer a more disciplined question:

> Before we trust deep learning, how far can interpretable rules and lightweight supervised models go on a low-signal, post-cost funding arbitrage task?

That makes the baseline layer important for both methodology and presentation:

- it gives simple, explainable reference points
- it exposes how sparse tradable post-cost opportunities really are
- it provides prediction artifacts that later feed unified signal generation and backtesting

## What Changed in the Upgraded Baseline Pipeline

The current baseline module is no longer a simple one-shot fit with fixed thresholds.
It now includes:

- time-series-safe hyperparameter tuning on the `train` split only
- validation-driven threshold selection for rule, classification, and regression baselines
- optional classifier probability calibration
- configurable missing-data handling with safe forward-fill support and missing indicators
- stronger penalized-linear baselines
- held-out permutation importance for final interpretation
- optional expanding / rolling walk-forward prediction mode

The CLI entry points did not change:

```powershell
& 'd:\MG\anaconda3\python.exe' -m src.main train-baseline --config configs/models/baseline.yaml
& 'd:\MG\anaconda3\python.exe' -m src.main evaluate-baseline --config configs/models/baseline.yaml
```

## Inputs

The baseline pipeline consumes the supervised dataset produced by `build-labels`:

- default input: `data/processed/supervised/binance/btcusdt/1h/btcusdt_supervised_dataset.parquet`
- default classification target: `target_is_profitable_24h`
- default regression target: `target_future_net_return_bps_24h`
- default split column: `split`
- default readiness filter: `supervised_ready == 1`

The baseline layer assumes the supervised dataset already contains:

- leakage-safe engineered features
- post-cost labels
- time-series split assignment

## Baseline Families

### Rule-based baselines

Implemented examples:

1. `funding_threshold_2bps`
2. `spread_zscore_1p5`
3. `combined_funding_spread`

These rules remain intentionally interpretable:

- positive funding supports `short perp + long spot`
- positive spread z-score suggests the perp is relatively rich to spot
- the combined rule checks whether carry and basis dislocation line up together

Rules now optionally support validation-set threshold search through configurable grids.

### Predictive classification baselines

Implemented:

1. Logistic regression with L2 penalty
2. Logistic regression with L1 penalty
3. Logistic regression with elastic-net penalty
4. Optional random-forest classifier

Output:

- calibrated or uncalibrated probability of `target_is_profitable_24h == 1`
- validation-selected trade threshold

### Predictive regression baselines

Implemented:

1. Ridge regression
2. ElasticNet regression
3. Optional random-forest regressor

Output:

- predicted future net return in basis points
- validation-selected return threshold for turning forecasts into trade candidates

## Time-Series-Safe Tuning

Hyperparameter tuning is done only inside the `train` split.

The tuning workflow uses chronological inner folds, not shuffled CV. Configurable controls include:

- `tuning.n_splits`
- `tuning.gap`
- `tuning.mode`
  Supported: `expanding`, `rolling`
- `tuning.min_train_size`
- `tuning.rolling_window_size`

This matters because:

- observations are time-ordered
- labels overlap across nearby timestamps
- the project is a trading problem, not an IID tabular benchmark

The default config uses a non-zero `gap` to reduce leakage risk near split boundaries.

## Threshold Selection

The upgraded baseline layer does not rely only on fixed thresholds from config.

Instead, after the model is trained on `train`, it can search thresholds on `validation`:

- classifier probability threshold search
- regression expected-return threshold search
- rule-threshold grid search

The optimization target is configurable through `threshold_search.objective`.

Useful objectives include:

- `avg_signal_return_bps`
- `cumulative_signal_return_bps`
- `precision`
- `f1`
- `signal_hit_rate`
- `signal_sharpe_like`

For this project, `avg_signal_return_bps` is a strong default because the target is trading usefulness, not raw classification accuracy alone.

## Degenerate Threshold Search Handling

Threshold search is now guarded explicitly.

Default behavior:

- if the validation split cannot support threshold selection, the run fails
- if every threshold candidate has an invalid objective or zero usable traded signals, the run fails
- if a model produces zero signals on validation or test, the report and manifest record that status instead of silently presenting a healthy-looking zero row

The baseline artifacts now carry:

- `degenerate_experiment`
- `status`
- `reason`
- `fallback_used`
- `fallback_reason`
- `signal_count_by_split`
- `tradeable_rate_by_split`
- `profitable_rate_by_split`
- `threshold_search_summary`

If you intentionally want to keep writing diagnostic artifacts even when threshold selection degenerates, you must opt in through `threshold_search.allow_degenerate_fallback: true`.

## Probability Calibration

Classifier baselines support:

- `none`
- `sigmoid`
- `isotonic`

Calibration is fit in a time-series-safe way using chronological inner folds on the `train` split.

Why calibration matters here:

- profitable post-cost labels are sparse
- raw classifier scores can be poorly calibrated
- later signal ranking and thresholding are more trustworthy when probabilities are better behaved

The pipeline also writes calibration tables for validation/test when probabilities are available.

## Missing-Data Handling

The old one-size-fits-all median imputation approach has been upgraded.

The pipeline now supports:

- remaining-value median imputation
- optional leakage-safe chronological forward-fill for designated persistent features
- missing-indicator feature generation

Important implementation detail:

- forward-fill is only applied left-to-right in time
- forward-fill is restricted to explicitly designated columns or prefixes
- remaining imputation is still fit on the model training data

This keeps the pipeline practical without introducing backward-looking leakage.

## Prediction Modes

Two broad prediction styles are supported:

### `static`

- fit once on `train`
- score the full dataset with the trained model

This is the simplest and fastest baseline mode.

### `expanding` / `rolling`

- periodically refit in chronological order
- use `prediction.refit_every_n_periods`
- optionally restrict history with `prediction.rolling_window_size`
- optionally exclude validation history from test-period refits

This is still lighter than a full backtest, but it gives more realistic chronological prediction behavior.

## Diagnostics

The baseline module now produces several diagnostics.

### Preferred final interpretation

- held-out permutation importance on validation, or test if validation is unavailable

### Supplemental diagnostics

- linear coefficients for linear models
- impurity-based importances for tree models
- calibration tables for classifier models
- cross-validation search tables
- threshold-search tables

For tree models especially, the report should emphasize permutation importance over impurity importance.

## Strategy-Oriented Evaluation Outputs

The evaluation tables keep the usual ML metrics, but they now emphasize trading usefulness too.

Classification outputs include:

- accuracy
- precision
- recall
- F1
- ROC-AUC
- average precision
- Brier score when probabilities exist

Regression outputs include:

- MAE
- RMSE
- R-squared
- Pearson correlation
- directional accuracy

Both task types now report trading-style metrics such as:

- `signal_count`
- `signal_rate`
- `avg_signal_return_bps`
- `median_signal_return_bps`
- `cumulative_signal_return_bps`
- `signal_hit_rate`
- `precision_among_signaled`
- `signal_sharpe_like`
- `top_quantile_avg_return_bps`

These metrics are usually more meaningful for this project than raw ML accuracy alone.

## Outputs

Default artifact directory:

`data/artifacts/models/baselines/binance/btcusdt/1h/btcusdt_24h_default/`

Key outputs:

- `baseline_predictions.parquet`
  Unified row-level prediction table for rules, linear models, and optional tree baselines.
- `baseline_metrics.parquet`
  Split-level evaluation metrics.
- `baseline_leaderboard.parquet`
  Validation/test comparison summary.
- `baseline_report.md`
  Short markdown report.
- `feature_columns.json`
  Exact final model feature set, including any missing indicators.
- `models/*.joblib`
  Saved model bundles.
- `diagnostics/*`
  Cross-validation results, threshold search tables, permutation importance, coefficients, impurity importance, and calibration tables.
- `baseline_manifest.json`
  Reproducibility metadata including tuned hyperparameters, selected thresholds, calibration choices, prediction mode, and artifact paths.

The report and manifest now also distinguish a healthy model-selection path from a degenerate one, so a strategy row with `signal_count = 0` is accompanied by a machine-readable reason.

## Prediction Table Contract

Downstream signal generation still relies on these columns:

- `timestamp`
- `split`
- `model_name`
- `model_family`
- `task`
- `signal_direction`
- `signal`
- `decision_score`
- `signal_threshold`
- `signal_strength`
- `predicted_probability`
- `predicted_return_bps`
- `predicted_label`
- `actual_label`
- `actual_return_bps`

Additional metadata columns are now included for auditability, such as:

- `selected_hyperparameters_json`
- `selected_threshold_objective`
- `calibration_method`
- `feature_importance_method`
- `prediction_mode`

Those fields are no longer "baseline-only" metadata. They are now propagated into the standardized signal layer and then into backtest / robustness summaries, so later strategy comparisons can explain not just which baseline won, but which configuration decisions produced that result.

## Recommended Review Story

When presenting the baseline layer, a clean narrative is:

1. start with rule-based heuristics
2. show that fixed thresholds are not enough, so validation-driven thresholding matters
3. show that time-series-safe linear models are stronger benchmarks than naive one-shot models
4. show that calibration and held-out permutation importance make the predictions more trustworthy and interpretable
5. then compare these baselines against the later LSTM model

## Caveats

- This is still a prototype benchmark layer, not a live alpha engine.
- Walk-forward prediction here is lighter than the full execution logic in the backtest engine.
- Threshold search is validation-driven, so validation remains a model-selection surface rather than a purely final reporting surface.
- Extremely sparse post-cost positives can still make classifier metrics unstable.
- That sparsity is now surfaced directly through degenerate-experiment diagnostics rather than being hidden inside ordinary-looking zero metrics.
- Tree models remain optional because they are slower and easier to overfit than the penalized-linear baselines.