# Prediction Code

Short-horizon BTC midprice classification and long-only backtesting pipeline built on 1-second order book features.

## Overview

This project takes raw Gemini BTC event data, engineers 1-second microstructure features, labels future midprice direction, trains a LightGBM classifier with purged walk-forward validation, and converts model probabilities into a cost-aware long-only trading strategy.

The current setup is best understood as a short-horizon alpha / timing system, not true sub-second HFT. Features are sampled at 1-second resolution, and the active configuration in [config.py](./config.py) uses `PREDICTION_HORIZON = 180`, i.e. 180 one-second rows, or roughly 3 minutes ahead.

## Repository Structure

```text
prediction_code/
  input/                    Raw Gemini BTC event data
  output_minute/            Main outputs for the minute-style pipeline
  output_second/            Alternate output location
  config.py                 Central configuration
  main.py                   End-to-end pipeline entry point
  feature_engineering.py    Book reconstruction, 1s aggregation, labels
  order_book.py             Local order book helper
  modeling.py               Purged 5-fold walk-forward training/backtest
  hft_backtesting.py        Long-only execution simulation
  plotting.py               Result plots
  random_search.py          Hyperparameter and threshold search
  utils.py                  Metrics, helpers, class mappings
```

## Pipeline

### 1. Raw Data -> 1-Second Features

`feature_engineering.py` reconstructs the local book from the event stream and aggregates it into 1-second snapshots.

Core feature groups include:

- top-of-book state: `best_bid`, `best_ask`, `mid`, `spread`, `microprice`
- depth state: top-1 and top-5 bid/ask depth, depth ratio, entropy, HHI, slope
- imbalance state: `imbalance_1`, `imbalance_5`, depth pressure
- event flow: trade counts, place/cancel/fill counts, signed trade flow, passive flow, net order flow
- intrasecond path features: second open/high/low/close mid, realized variation, spread variation, quote observation count
- rolling features over multiple windows: spread stats, volatility, flow stats, imbalance means, correlation features, EWMs, z-scores
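
As an illustration of the top-of-book and imbalance groups, here is a minimal sketch that uses common textbook definitions. The function name `top_of_book_features` and its arguments are hypothetical, and the exact formulas in `feature_engineering.py` may differ.

```python
def top_of_book_features(best_bid, best_ask, bid_size, ask_size):
    """Compute basic top-of-book features from a single book snapshot.

    Hypothetical helper for illustration only; not the project's API.
    """
    mid = (best_bid + best_ask) / 2.0
    spread = best_ask - best_bid
    total = bid_size + ask_size
    if total > 0:
        # Size-weighted mid: leans toward the ask when bid size dominates.
        microprice = (best_ask * bid_size + best_bid * ask_size) / total
        # Ranges from -1 (all size on the ask) to +1 (all size on the bid).
        imbalance_1 = (bid_size - ask_size) / total
    else:
        microprice = mid
        imbalance_1 = 0.0
    return {
        "best_bid": best_bid,
        "best_ask": best_ask,
        "mid": mid,
        "spread": spread,
        "microprice": microprice,
        "imbalance_1": imbalance_1,
    }


snapshot = top_of_book_features(best_bid=100.0, best_ask=102.0, bid_size=3.0, ask_size=1.0)
```

In this example the bid side holds three times the ask size, so the microprice sits above the mid and `imbalance_1` is positive.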

Feature generation entry point:

```python
build_features_from_csv(
    csv_path=INPUT_CSV,
    sample_every_n_events=SAMPLE_EVERY_N_EVENTS,
    top_k=TOP_K_LEVELS,
    window_ms=WINDOW_MS,
)
```

### 2. Target Construction

`add_targets(...)` in `feature_engineering.py` creates:

- `target_ret`: forward midprice return over `PREDICTION_HORIZON`
- `target_class`: `negative`, `mild`, or `positive`
- `target_label`: human-readable class label

Labels are cost-aware. The return threshold that bounds the mild / neutral band is:

```text
max(MILD_RETURN_THRESHOLD, TRANSACTION_FEE_RATE * COST_AWARE_LABEL_THRESHOLD_MULTIPLIER)
```

That prevents the model from treating moves smaller than trading cost as actionable signal.
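
The labeling rule can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in `add_targets(...)`: it assumes a simple forward return, a band that is symmetric around zero, and one row per second. The helper names and the numeric values passed for `MILD_RETURN_THRESHOLD` and the multiplier are hypothetical.

```python
def label_band(mild_return_threshold, fee_rate, multiplier):
    """Return threshold for the mild band: never smaller than scaled cost."""
    return max(mild_return_threshold, fee_rate * multiplier)


def add_targets_sketch(mids, horizon, band):
    """Return (target_ret, target_class) per row.

    Rows without `horizon` rows of future data get (None, None).
    Assumes a symmetric band: |return| <= band is labeled "mild".
    """
    out = []
    for i, mid in enumerate(mids):
        j = i + horizon
        if j >= len(mids):
            out.append((None, None))  # forward window runs off the end
            continue
        ret = mids[j] / mid - 1.0  # simple forward midprice return
        if ret > band:
            cls = "positive"
        elif ret < -band:
            cls = "negative"
        else:
            cls = "mild"
        out.append((ret, cls))
    return out


# Hypothetical values: 5 bps minimum band, 10 bps fee, multiplier of 2.
band = label_band(mild_return_threshold=0.0005, fee_rate=0.001, multiplier=2.0)
targets = add_targets_sketch([100.0, 100.1, 100.5, 100.0, 99.0], horizon=2, band=band)
```

With these inputs the fee term dominates, so a move must exceed 20 bps in either direction before it is labeled as directional.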

### 3. Model Training

`modeling.py` trains a LightGBM classifier on the engineered features.

Important details:

- `TimeSeriesSplit(n_splits=5)` is used for sequential walk-forward evaluation
- each fold uses a purge gap equal to `prediction_horizon` between train and test
- the internal validation split for early stopping is also purged
- if a fold contains only one observed class, the code falls back to a constant probability model instead of raising an error
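
One way such a single-class fallback could look is sketched below. The class `ConstantProbabilityModel` is hypothetical and the fallback in `modeling.py` may assign probabilities differently; here the model simply predicts the empirical class frequencies of the training labels for every row.

```python
from collections import Counter

CLASSES = ("negative", "mild", "positive")


class ConstantProbabilityModel:
    """Hypothetical fallback: predicts the training class frequencies."""

    def __init__(self, classes=CLASSES):
        self.classes = tuple(classes)
        self.probs = None

    def fit(self, y):
        if len(y) == 0:
            raise ValueError("cannot fit on an empty label list")
        counts = Counter(y)
        # Classes never observed in training get probability 0.
        self.probs = [counts.get(c, 0) / len(y) for c in self.classes]
        return self

    def predict_proba(self, n_rows):
        # Same probability vector for every row, ordered like self.classes.
        return [list(self.probs) for _ in range(n_rows)]


train_labels = ["mild"] * 4
if len(set(train_labels)) < 2:
    # A multiclass classifier cannot be trained on one class, so fall back.
    model = ConstantProbabilityModel().fit(train_labels)
proba = model.predict_proba(2)
```

The columns follow the order `prob_negative`, `prob_mild`, `prob_positive`, so downstream signal code can consume the fallback output unchanged.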

### 4. Signal Generation

The model outputs:

- `prob_negative`
- `prob_mild`
- `prob_positive`

`hft_backtesting.py` converts these into a smoothed directional edge:

```text
signal = EWM((prob_positive - prob_negative) * (1 - prob_mild))
```

Trade thresholds are dynamic:

- buy when the current signal is above the previous rolling `SIGNAL_SCORE_BUY_QUANTILE`
- close when the current signal is below the previous rolling `SIGNAL_SCORE_CLOSE_QUANTILE`
- the rolling window length is `SIGNAL_SCORE_WINDOW_MINUTES`

This is a lagged rolling-quantile threshold, not a fixed constant threshold.
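
The signal and its lagged threshold can be sketched in plain Python. This is an illustration, not the code in `hft_backtesting.py`: the helper names are hypothetical, the smoothing is parameterized by a span (the project may use a different EWM parameterization), and the window is expressed in rows, so a 60-minute window at 1-second sampling would be 3600 rows.

```python
def ewm_mean(values, span):
    """Recursive exponentially weighted mean with alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    out, prev = [], None
    for v in values:
        prev = v if prev is None else alpha * v + (1.0 - alpha) * prev
        out.append(prev)
    return out


def signal_scores(prob_negative, prob_mild, prob_positive, span):
    """Smoothed directional edge, damped when the mild class is likely."""
    raw = [(p - n) * (1.0 - m)
           for n, m, p in zip(prob_negative, prob_mild, prob_positive)]
    return ewm_mean(raw, span)


def quantile(sorted_vals, q):
    """Quantile of a non-empty sorted list with linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def lagged_rolling_quantile(values, window, q):
    """Quantile of the previous `window` values, excluding the current one."""
    out = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]  # stops before values[i]
        out.append(quantile(sorted(history), q) if len(history) == window else None)
    return out


scores = signal_scores(
    prob_negative=[0.2, 0.1, 0.1, 0.3, 0.1],
    prob_mild=[0.5, 0.4, 0.2, 0.6, 0.2],
    prob_positive=[0.3, 0.5, 0.7, 0.1, 0.7],
    span=3,
)
buy_threshold = lagged_rolling_quantile(scores, window=3, q=0.9)
# No decision is made until a full window of history exists.
buy = [t is not None and s > t for s, t in zip(scores, buy_threshold)]
```

Because the threshold at row `i` only uses rows before `i`, the current signal never influences the quantile it is compared against.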

### 5. Backtest Logic

The execution model is deliberately simple and conservative:

- long-only
- enter at the next row's `best_ask`
- exit at the next row's `best_bid`
- transaction fees applied on entry and exit
- optional minimum hold via `MIN_HOLD_SECONDS`
- the final open position is closed at the end of the test path if `close_position=True`

This is not a queue-aware simulator. It is closer to a research backtest for directional timing than a production execution simulator.
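
A minimal sketch of these execution rules is shown below. It is an illustration rather than the simulator in `hft_backtesting.py`: the function names and the row layout are hypothetical, the minimum hold is counted in rows (equal to seconds only if there is one row per second), and the final position is assumed to be closed at the last row's `best_bid`.

```python
def net_return(entry_ask, exit_bid, fee_rate):
    """Round-trip return after paying the fee on both entry and exit."""
    return (exit_bid * (1.0 - fee_rate)) / (entry_ask * (1.0 + fee_rate)) - 1.0


def long_only_backtest(rows, buy, close, fee_rate, min_hold=0, close_position=True):
    """Return the net return of each round trip.

    rows  : list of dicts with "best_bid" and "best_ask", one per timestamp
    buy   : per-row booleans, True when the buy condition fires
    close : per-row booleans, True when the close condition fires
    """
    trades = []
    entry_price, entry_idx = None, None
    last = len(rows) - 1
    for i in range(last):
        nxt = i + 1  # signals at row i execute on the next row's quotes
        if entry_price is None and buy[i]:
            entry_price = rows[nxt]["best_ask"]  # cross the spread to enter
            entry_idx = nxt
        elif entry_price is not None and close[i] and nxt - entry_idx >= min_hold:
            trades.append(net_return(entry_price, rows[nxt]["best_bid"], fee_rate))
            entry_price = None
    if entry_price is not None and close_position:
        # Assumption: flatten any open position at the final row's bid.
        trades.append(net_return(entry_price, rows[last]["best_bid"], fee_rate))
    return trades


bids = [99.0, 100.0, 101.0, 104.0, 103.0]
asks = [100.0, 101.0, 102.0, 105.0, 104.0]
rows = [{"best_bid": b, "best_ask": a} for b, a in zip(bids, asks)]
buy = [True, False, False, False, False]
close = [False, False, True, False, False]
trades = long_only_backtest(rows, buy, close, fee_rate=0.001)
```

Entering at the ask and exiting at the bid charges the full spread on every round trip, which is why the model is conservative for a strategy that would otherwise sometimes fill passively.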

## Validation Design

The current evaluation is a purged 5-fold walk-forward backtest:

1. Split the full dataset into 5 sequential test folds with expanding training history.
2. Remove the last `prediction_horizon` rows from the training side before each test fold.
3. Split the remaining training data into fit / validation segments for early stopping.
4. Insert another purge gap between fit and validation.
5. Train on the fit segment, early-stop on validation, predict on the next unseen fold.
6. Concatenate all out-of-sample predictions and run a full OOS backtest over the stitched path.

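This is more realistic than shuffled k-fold because every model is trained only on data that precedes its test fold. The purge gaps drop training rows whose forward-looking labels would otherwise overlap the next segment, so label information cannot leak across the boundary.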
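
Steps 1 to 4 can be sketched as an index generator. This is an illustration, not the code in `modeling.py`: the function name and the `val_fraction` parameter are hypothetical, and the fold sizing copies the scheme used by scikit-learn's `TimeSeriesSplit` (whose `gap` argument provides the outer purge in recent versions).

```python
def purged_walk_forward_splits(n_rows, n_splits, purge, val_fraction=0.2):
    """Yield (fit, val, test) index ranges with purge gaps between them.

    Test folds are sequential and equally sized; training history expands.
    `val_fraction` (hypothetical) is the share of purged training rows
    reserved for early stopping.
    """
    test_size = n_rows // (n_splits + 1)
    for k in range(n_splits):
        test_start = n_rows - (n_splits - k) * test_size
        test_end = test_start + test_size
        # Outer purge: drop the last `purge` training rows before the test fold.
        train_end = max(0, test_start - purge)
        # Inner split: the most recent slice of training data is validation.
        val_size = int(train_end * val_fraction)
        val_start = train_end - val_size
        # Inner purge: drop `purge` rows between fit and validation.
        fit_end = max(0, val_start - purge)
        yield range(0, fit_end), range(val_start, train_end), range(test_start, test_end)


splits = list(purged_walk_forward_splits(n_rows=6000, n_splits=5, purge=180))
for fit, val, test in splits:
    print(f"fit=[0, {fit.stop})  val=[{val.start}, {val.stop})  test=[{test.start}, {test.stop})")
```

Each model is then trained on `fit`, early-stopped on `val`, and used to predict `test`; concatenating the `test` predictions in fold order gives the stitched out-of-sample path.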

## Configuration

Most research knobs live in [config.py](./config.py).

Current important settings:

- `PREDICTION_HORIZON = 180`
- `TRANSACTION_FEE_RATE = 0.001`
- `N_SPLITS = 5`
- `SIGNAL_SCORE_WINDOW_MINUTES = 60`
- `SIGNAL_SCORE_BUY_QUANTILE = 0.9`
- `SIGNAL_SCORE_CLOSE_QUANTILE = 0.1`
- `SIGNAL_SMOOTHING_SECONDS = max(60, PREDICTION_HORIZON // 3)`
- `MIN_HOLD_SECONDS = max(60, PREDICTION_HORIZON // 3)`

## How To Run

Install dependencies:

```bash
pip install numpy pandas scikit-learn lightgbm matplotlib tqdm
```

Run the pipeline from the directory that contains `prediction_code/`:

```bash
python prediction_code/main.py
```

### Feature Build Note

`main.py` currently loads a cached feature file from `FEATURE_CSV` and the feature-building step is commented out. If you want to rebuild features from the raw CSV, uncomment Step 1 in [main.py](./main.py).

## Hyperparameter Search

`random_search.py` runs randomized search across:

- LightGBM parameters
- signal-score quantiles
- rolling threshold window

It evaluates each trial with the same purged 5-fold backtest and saves the history to `parameter_tuning...csv`.

Run it from the directory that contains `prediction_code/`:

```bash
python prediction_code/random_search.py
```

## Outputs

The main pipeline writes to `output_minute/`:

- `dataset_with_targets...csv`
- `fold_metrics...csv`
- `overall_metrics...csv`
- `feature_importances...csv`
- `confusion_matrix...csv`
- `oos_predictions...csv`
- `oos_trades...csv`
- `oos_trade_events...csv`
- cumulative return, feature importance, signal, and position plots

Useful files:

- `oos_predictions...csv`: per-row probabilities, signal score, thresholds, position, turnover, net/gross equity curves
- `oos_trades...csv`: round-trip trade summary
- `oos_trade_events...csv`: entry / exit events
- `overall_metrics...csv`: aggregate OOS performance and classification metrics

## Research Notes

- The strategy trades on 1-second snapshots, but the forecast horizon is measured in minutes, so the code is closer to medium-frequency crypto timing than true HFT.
- The backtest assumes immediate execution at the next inside quote and does not model queue position or partial fills.
- Results should be interpreted as research-stage directional alpha tests, not production execution estimates.

## Suggested Workflow

1. Rebuild features if the raw input changed.
2. Run `main.py` to produce out-of-sample predictions and plots.
3. Inspect `overall_metrics...csv`, `oos_trades...csv`, and the signal plot.
4. Use `random_search.py` only after the base pipeline is stable.
5. Treat improvements in cost-adjusted return, turnover, and drawdown as more meaningful than raw classification accuracy alone.