# Prediction Code

Short-horizon BTC midprice classification and long-only backtesting pipeline built on 1-second order book features.

## Overview

This project takes raw Gemini BTC event data, engineers 1-second microstructure features, labels future midprice direction, trains a LightGBM classifier with purged walk-forward validation, and converts model probabilities into a cost-aware long-only trading strategy.

The current setup is best understood as a short-horizon alpha / timing system, not true sub-second HFT. Features are sampled at 1-second resolution and the active configuration in [config.py](./config.py) uses `PREDICTION_HORIZON = 180`, which corresponds to roughly 3 minutes ahead.

## Repository Structure

```text
prediction_code/
  input/                   Raw Gemini BTC event data
  output_minute/           Main outputs for the minute-style pipeline
  output_second/           Alternate output location
  config.py                Central configuration
  main.py                  End-to-end pipeline entry point
  feature_engineering.py   Book reconstruction, 1s aggregation, labels
  order_book.py            Local order book helper
  modeling.py              Purged 5-fold walk-forward training/backtest
  hft_backtesting.py       Long-only execution simulation
  plotting.py              Result plots
  random_search.py         Hyperparameter and threshold search
  utils.py                 Metrics, helpers, class mappings
```

## Pipeline

### 1. Raw Data -> 1-Second Features

`feature_engineering.py` reconstructs the local book from the event stream and aggregates it into 1-second snapshots.

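As a rough illustration of the aggregation step, the sketch below collapses event-level top-of-book rows into 1-second snapshots with pandas. It is not the code in `feature_engineering.py`: the function name `aggregate_to_1s`, the input columns `bid_size` / `ask_size`, the `quote_obs` column, and the specific microprice and imbalance formulas are assumptions made for the example, and sizes are assumed to be strictly positive.

```python
import pandas as pd


def aggregate_to_1s(quotes: pd.DataFrame) -> pd.DataFrame:
    """Collapse event-level top-of-book rows into 1-second snapshots.

    Hypothetical input layout: a DatetimeIndex plus the columns
    best_bid, best_ask, bid_size, ask_size (one row per book update).
    """
    q = quotes.sort_index().copy()
    size_sum = q["bid_size"] + q["ask_size"]

    q["mid"] = (q["best_bid"] + q["best_ask"]) / 2
    q["spread"] = q["best_ask"] - q["best_bid"]
    # Size-weighted mid: moves toward the ask when the bid queue is larger.
    q["microprice"] = (
        q["best_bid"] * q["ask_size"] + q["best_ask"] * q["bid_size"]
    ) / size_sum
    q["imbalance_1"] = (q["bid_size"] - q["ask_size"]) / size_sum

    state_cols = ["best_bid", "best_ask", "mid", "spread", "microprice", "imbalance_1"]
    # End-of-second book state.
    last = q[state_cols].resample("1s").last()
    # Intrasecond path of the mid (open/high/low/close).
    mid_ohlc = q["mid"].resample("1s").ohlc().add_prefix("mid_")
    # Number of quote observations inside each second.
    n_obs = q["mid"].resample("1s").count().rename("quote_obs")

    bars = pd.concat([last, mid_ohlc, n_obs], axis=1)
    # Seconds without any event inherit the previous book state.
    bars[state_cols] = bars[state_cols].ffill()
    return bars
```

Seconds with no events keep the last known book state but leave the intrasecond `mid_*` columns empty, so downstream rolling features can tell a quiet second apart from a flat one.
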
Core feature groups include:

- top-of-book state: `best_bid`, `best_ask`, `mid`, `spread`, `microprice`
- depth state: top-1 and top-5 bid/ask depth, depth ratio, entropy, HHI, slope
- imbalance state: `imbalance_1`, `imbalance_5`, depth pressure
- event flow: trade counts, place/cancel/fill counts, signed trade flow, passive flow, net order flow
- intrasecond path features: second open/high/low/close mid, realized variation, spread variation, quote observation count
- rolling features over multiple windows: spread stats, volatility, flow stats, imbalance means, correlation features, EWMs, z-scores

Feature generation entry point:

```python
build_features_from_csv(
    csv_path=INPUT_CSV,
    sample_every_n_events=SAMPLE_EVERY_N_EVENTS,
    top_k=TOP_K_LEVELS,
    window_ms=WINDOW_MS,
)
```

### 2. Target Construction

`add_targets(...)` in `feature_engineering.py` creates:

- `target_ret`: forward midprice return over `PREDICTION_HORIZON`
- `target_class`: `negative`, `mild`, or `positive`
- `target_label`: human-readable class label

Labels are cost-aware. The mild / neutral band is:

```text
max(MILD_RETURN_THRESHOLD, TRANSACTION_FEE_RATE * COST_AWARE_LABEL_THRESHOLD_MULTIPLIER)
```

That prevents the model from treating moves smaller than trading cost as actionable signal.

### 3. Model Training

`modeling.py` trains a LightGBM classifier on the engineered features.

Important details:

- `TimeSeriesSplit(n_splits=5)` is used for sequential walk-forward evaluation
- each fold uses a purge gap equal to `prediction_horizon` between train and test
- the internal validation split for early stopping is also purged
- if a fold contains only one observed class, the code falls back to a constant probability model instead of breaking

### 4. Signal Generation

The model outputs:

- `prob_negative`
- `prob_mild`
- `prob_positive`

`hft_backtesting.py` converts these into a smoothed directional edge:

```text
signal = EWM((prob_positive - prob_negative) * (1 - prob_mild))
```

Trade thresholds are dynamic:

- buy when the current signal is above the previous rolling `SIGNAL_SCORE_BUY_QUANTILE`
- close when the current signal is below the previous rolling `SIGNAL_SCORE_CLOSE_QUANTILE`
- the rolling window length is `SIGNAL_SCORE_WINDOW_MINUTES`

This is a lagged rolling-quantile threshold, not a fixed constant threshold.

### 5. Backtest Logic

The execution model is deliberately simple and conservative:

- long-only
- enter at the next row's `best_ask`
- exit at the next row's `best_bid`
- transaction fees applied on entry and exit
- optional minimum hold via `MIN_HOLD_SECONDS`
- the final open position is closed at the end of the test path if `close_position=True`

This is not a queue-aware simulator. It is closer to a research backtest for directional timing than a production execution simulator.

## Validation Design

The current evaluation is a purged 5-fold walk-forward backtest:

1. Split the full dataset into 5 sequential test folds with expanding training history.
2. Remove the last `prediction_horizon` rows from the training side before each test fold.
3. Split the remaining training data into fit / validation segments for early stopping.
4. Insert another purge gap between fit and validation.
5. Train on the fit segment, early-stop on validation, predict on the next unseen fold.
6. Concatenate all out-of-sample predictions and run a full OOS backtest over the stitched path.

This is more realistic than shuffled k-fold and avoids forward-label overlap leaking into the next fold.

## Configuration

Most research knobs live in [config.py](./config.py).

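Two of these knobs, `N_SPLITS` and `PREDICTION_HORIZON`, drive the purged walk-forward split described under Validation Design. The sketch below shows one way such a split could be built. It assumes the purge is expressed through the `gap` argument of scikit-learn's `TimeSeriesSplit`; `modeling.py` may trim the training rows manually instead. The helper name `purged_walk_forward` and the `val_fraction` parameter are hypothetical and do not come from the repository.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit


def purged_walk_forward(n_rows, n_splits=5, horizon=180, val_fraction=0.2):
    """Yield (fit_idx, val_idx, test_idx) with `horizon`-row purge gaps.

    val_fraction is a hypothetical knob: the share of each training block
    held out as the early-stopping validation tail.
    """
    # gap=horizon drops the last `horizon` training rows before each test fold,
    # so no training label looks forward into the test window.
    splitter = TimeSeriesSplit(n_splits=n_splits, gap=horizon)
    for train_idx, test_idx in splitter.split(np.arange(n_rows)):
        n_val = int(len(train_idx) * val_fraction)
        val_start = len(train_idx) - n_val
        # Second purge: leave `horizon` rows unused between fit and validation.
        fit_end = max(0, val_start - horizon)
        yield train_idx[:fit_end], train_idx[val_start:], test_idx


if __name__ == "__main__":
    for fit_idx, val_idx, test_idx in purged_walk_forward(6000):
        print(len(fit_idx), len(val_idx), len(test_idx))
```

Each fold's model would be fit on `fit_idx`, early-stopped on `val_idx`, and used to predict `test_idx`; stitching the test predictions together gives the out-of-sample path.
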
Current important settings:

- `PREDICTION_HORIZON = 180`
- `TRANSACTION_FEE_RATE = 0.001`
- `N_SPLITS = 5`
- `SIGNAL_SCORE_WINDOW_MINUTES = 60`
- `SIGNAL_SCORE_BUY_QUANTILE = 0.9`
- `SIGNAL_SCORE_CLOSE_QUANTILE = 0.1`
- `SIGNAL_SMOOTHING_SECONDS = max(60, PREDICTION_HORIZON // 3)`
- `MIN_HOLD_SECONDS = max(60, PREDICTION_HORIZON // 3)`

## How To Run

Install dependencies:

```bash
pip install numpy pandas scikit-learn lightgbm matplotlib tqdm
```

Run the pipeline:

```bash
python prediction_code/main.py
```

### Feature Build Note

`main.py` currently loads a cached feature file from `FEATURE_CSV` and the feature-building step is commented out. If you want to rebuild features from the raw CSV, uncomment Step 1 in [main.py](./main.py).

## Hyperparameter Search

`random_search.py` runs randomized search across:

- LightGBM parameters
- signal-score quantiles
- rolling threshold window

It evaluates each trial with the same purged 5-fold backtest and saves the history to `parameter_tuning...csv`.

Run it with:

```bash
python prediction_code/random_search.py
```

## Outputs

The main pipeline writes to `output_minute/`:

- `dataset_with_targets...csv`
- `fold_metrics...csv`
- `overall_metrics...csv`
- `feature_importances...csv`
- `confusion_matrix...csv`
- `oos_predictions...csv`
- `oos_trades...csv`
- `oos_trade_events...csv`
- cumulative return, feature importance, signal, and position plots

Useful files:

- `oos_predictions...csv`: per-row probabilities, signal score, thresholds, position, turnover, net/gross equity curves
- `oos_trades...csv`: round-trip trade summary
- `oos_trade_events...csv`: entry / exit events
- `overall_metrics...csv`: aggregate OOS performance and classification metrics

## Research Notes

- This code is closer to medium-frequency crypto timing than true HFT.
- The strategy trades on 1-second snapshots but the forecast horizon is measured in minutes.
- The backtest assumes immediate execution at the next inside quote and does not model queue position or partial fills.
- Results should be interpreted as research-stage directional alpha tests, not production execution estimates.

## Suggested Workflow

1. Rebuild features if the raw input changed.
2. Run `main.py` to produce out-of-sample predictions and plots.
3. Inspect `overall_metrics...csv`, `oos_trades...csv`, and the signal plot.
4. Use `random_search.py` only after the base pipeline is stable.
5. Treat improvements in cost-adjusted return, turnover, and drawdown as more meaningful than raw classification accuracy alone.
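
For step 5, the three quantities can be read off a net equity curve and a position series. The sketch below is only an illustration: the function name `summarize_path` and its inputs are hypothetical, the actual column names in `oos_predictions...csv` are not shown here, and the pipeline's own metrics in `utils.py` may be defined differently.

```python
import pandas as pd


def summarize_path(net_equity: pd.Series, position: pd.Series) -> dict:
    """Net return, maximum drawdown, and turnover for one out-of-sample path.

    net_equity: equity curve after fees (hypothetical input).
    position:   0/1 long-only position per row (hypothetical input).
    """
    # Drawdown is the drop from the running peak of the equity curve.
    running_peak = net_equity.cummax()
    drawdown = net_equity / running_peak - 1.0

    # Turnover counts every change in position; starting long counts as one.
    changes = position.diff().abs()
    changes.iloc[0] = abs(position.iloc[0])

    return {
        "net_return": float(net_equity.iloc[-1] / net_equity.iloc[0] - 1.0),
        "max_drawdown": float(drawdown.min()),
        "turnover": float(changes.sum()),
    }
```

Comparing these numbers across runs, rather than accuracy alone, shows whether a change in the model or thresholds actually survives trading costs.
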