# Crypto Statistical Arbitrage

Multi-venue quantitative crypto trading system — cointegration-based pair selection,
ML-enhanced signals, and walk-forward backtesting across 32 CEX/DEX venues.

License: MIT

Python 3.10+  NumPy  Pandas  scikit-learn  SciPy

Results •  Architecture •  Strategies •  Data Pipeline •  Quick Start •  Structure •  Development

---

## Key Results

### Altcoin Statistical Arbitrage

| Metric | Value |
|:-------|------:|
| **Sharpe Ratio** | **1.61** |
| **Total Return** | 6.84% |
| **Max Drawdown** | 4.64% |
| **Win Rate** | 51.18% |
| **Total Trades** | 127 |
| **Profit Factor** | 1.69 |
| **BTC Correlation** | -0.12 |

### BTC Futures Curve Trading

| Metric | Value |
|:-------|------:|
| **Sharpe Ratio** | **5.81** |
| **Total Return** | 203.70% |
| **Max Drawdown** | 0.89% |
| **Win Rate** | 95.02% |
| **Total Trades** | 44,652 |
| **Profit Factor** | 28.53 |
| **BTC Correlation** | -0.05 |

> **Walk-forward out-of-sample results.** Train: Jan 2022 – Jun 2023 | Test: Jul 2023 – Dec 2024. No leverage (1.0x). Transaction costs included.

---

## Highlights

- **32 Venues:** CEX • DEX • Hybrid
- **211 Symbols:** 16 crypto sectors
- **226K+ Lines:** 184 Python files
- **137 Dependencies:** Pinned & reproducible
- **Walk-Forward:** Out-of-sample validated
- **ML Enhanced:** GBM + Random Forest
- **No Leverage:** 1.0x only
- **61 Compliance Checks:** Automated validation

---

## Architecture

```
run_arb.py                          Master orchestrator
│
├── phase1run.py ─► run_phase1.py   Phase 1: Multi-venue data collection
│   ├── 7 CEX collectors            Binance, Bybit, OKX, Kraken, Coinbase, Deribit, CME
│   ├── 12 DEX collectors           Uniswap, Curve, GMX, SushiSwap, Jupiter, ...
│   ├── 3 Hybrid collectors         Hyperliquid, dYdX, Drift
│   └── 10 Alternative sources      On-chain, sentiment, social, analytics
│
├── phase2run.py                    Phase 2: Altcoin StatArb (5-step pipeline)
│   ├── Step 1                      Universe construction + cointegration testing
│   ├── Step 2                      Baseline z-score mean reversion strategy
│   ├── Step 3                      ML enhancement (Gradient Boosting + Random Forest)
│   ├── Step 4                      Walk-forward backtest + crisis analysis
│   └── Step 5                      Report generation
│
├── run_phase3.py ─► phase3run.py   Phase 3: BTC Futures curve trading
│   ├── Funding rate term structure
│   ├── Calendar spread signals
│   ├── Cross-venue arbitrage
│   └── Walk-forward backtest
│
├── generate_visualizations.py      34 publication-quality charts
└── Compliance validator            61 automated checks
```

---

## Strategies

### Phase 2: Altcoin Statistical Arbitrage

Identifies cointegrated cryptocurrency pairs and trades mean-reverting spreads with ML-enhanced signals.

| Parameter | CEX | DEX |
|:----------|:----|:----|
| **Universe** | 50 tokens (top by volume) | 25 tokens (DeFi-native) |
| **Entry Z-Score** | ± 2.0 | ± 2.5 |
| **Exit Z-Score** | 0.0 (mean) | \|z\| < 1.0 |
| **Stop Z-Score** | ± 3.0 | ± 3.5 |
| **Max Position** | $100,000 | $50,000 |
| **Transaction Cost** | 0.20% (4-leg round trip) | 0.50 – 1.50% all-in |
| **Max Positions** | 5 – 8 | 2 – 3 |

**ML Enhancement:** Gradient Boosting + Random Forest ensemble predicts spread direction, filtering baseline z-score signals. Features include lagged spreads, volatility regimes (HMM), momentum, and cross-pair correlations.

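
The entry, exit, and stop thresholds in the parameter table can be read as a small state machine over the spread z-score. The following is a minimal sketch under stated assumptions, not the project's implementation: the names `ZScoreRules` and `spread_positions` are hypothetical, the comparisons are inclusive, and the rule that a stopped-out pair must revert inside the entry band before re-entering is an illustrative choice that the table does not specify. The ML filter is omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ZScoreRules:
    """Hypothetical container for the thresholds in the parameter table."""
    entry: float      # open a position when |z| >= entry
    exit_band: float  # close once z has reverted to within this band of the mean
    stop: float       # force-close when z moves further against the position


# Threshold values taken from the CEX and DEX columns of the table.
CEX_RULES = ZScoreRules(entry=2.0, exit_band=0.0, stop=3.0)
DEX_RULES = ZScoreRules(entry=2.5, exit_band=1.0, stop=3.5)


def spread_positions(z_scores: Sequence[float], rules: ZScoreRules) -> List[int]:
    """Return +1 (long spread), -1 (short spread) or 0 (flat) for each bar."""
    position = 0
    armed = True  # assumption: after a stop-out, wait for |z| < entry to re-arm
    out: List[int] = []
    for z in z_scores:
        if position == 0:
            if abs(z) < rules.entry:
                armed = True
            elif armed and abs(z) < rules.stop:
                # Short the spread when it is rich (z > 0), long when cheap.
                position = -1 if z > 0 else 1
        elif position * z <= -rules.stop:
            # Spread diverged further against the position: stop out.
            position, armed = 0, False
        elif position * z >= -rules.exit_band:
            # Spread reverted to the exit band (the mean itself for CEX).
            position = 0
        out.append(position)
    return out


z = [0.5, 2.1, 1.0, -0.1, -2.3, -3.2, -2.5, -1.0, -2.2, 0.0]
print(spread_positions(z, CEX_RULES))  # [0, -1, -1, 0, 1, 0, 0, 0, 1, 0]
```

In this sketch a position is held until the spread either reverts or hits the stop, so each trade is one entry and one exit; a signal filter such as the ML ensemble described above would sit in front of the entry branch.
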
**Risk Controls:**

- Venue-based tier classification (T1: Both CEX, T2: Mixed, T3: Both DEX)
- 40% sector concentration limit, 70% max cross-pair correlation
- Kelly criterion position sizing (0.25 – 0.5x)
- 1.0x leverage only (no leverage)

### Phase 3: BTC Futures Curve Trading

Exploits the term structure of BTC perpetual funding rates across venues.

| Component | Details |
|:----------|:--------|
| **Venues** | Binance, Hyperliquid, dYdX, OKX, Bybit, GMX, Aevo |
| **Signals** | Funding rate carry, calendar spreads, cross-venue basis |
| **Frequency** | Hourly rebalancing |
| **Walk-Forward** | 6-month train / 18-month test |

---

## Data Pipeline

### Supported Venues (32)

| Type | Venues | Data |
|:-----|:-------|:-----|
| **CEX (7)** | Binance, Bybit, OKX, Kraken, Coinbase, Deribit, CME | OHLCV, funding rates, open interest, liquidations, options |
| **Hybrid (3)** | Hyperliquid, dYdX, Drift | OHLCV, hourly funding rates, open interest |
| **DEX (12)** | Uniswap, Curve, GMX, SushiSwap, Jupiter, 1inch, 0x, CoWSwap, GeckoTerminal, DexScreener, … | Pool data, swaps, TVL, liquidity |
| **On-Chain (5)** | Covalent, Bitquery, Santiment, The Graph, Nansen | Wallet flows, smart money, on-chain metrics |
| **Alternative (5+)** | DeFiLlama, Coinalyze, LunarCrush, Dune, CoinGecko, CryptoCompare, Messari | TVL, sentiment, social, fundamentals |

### Symbol Universe

**211 unique symbols** across 16 sectors with full survivorship bias tracking:

| Sector | Count | Examples |
|:-------|------:|:--------|
| L1 Blockchains | 18 | SOL, AVAX, ADA, DOT, ATOM, TON |
| DeFi DEX | 16 | UNI, SUSHI, CRV, DYDX, GMX, JUP |
| Major Altcoins | 13 | BNB, XRP, DOGE, LTC, BCH |
| Infrastructure | 13 | LINK, GRT, FIL, AR, ENS |
| DeFi Lending | 9 | AAVE, COMP, MKR, ENA |
| Liquid Staking | 9 | LDO, RPL, EIGEN, PENDLE |
| AI / ML | 8 | FET, TAO, WLD, RNDR |
| L2 Solutions | 8 | ARB, OP, MATIC, STRK |
| Meme Tokens | 8 | PEPE, SHIB, BONK, WIF |
| Gaming | 7 | AXS, SAND, MANA, APE |

---

## Backtesting

The system includes both an **event-driven backtester** and a **vectorized fast backtester**:

| Feature | Details |
|:--------|:--------|
| **Walk-Forward Validation** | Train: Jan 2022 – Jun 2023 \| Test: Jul 2023 – Dec 2024 |
| **Crisis Analysis** | UST/Luna collapse, FTX bankruptcy, Banking crisis, SEC lawsuits |
| **Capacity Analysis** | Market-impact modelling per venue |
| **Attribution** | Per-pair, per-sector, and per-regime P&L decomposition |
| **Compliance** | 61 automated checks via `run_arb.py --validate` |

---

## Quick Start

### Prerequisites

| Requirement | Minimum | Recommended |
|:------------|:--------|:------------|
| **Python** | 3.10 | 3.12 |
| **RAM** | 8 GB | 16 GB |
| **Disk** | 5 GB | 10 GB |
| **OS** | macOS / Linux / Windows (WSL) | macOS (Apple Silicon) |

### Installation

```bash
# Clone the repository
git clone https://github.com/abailey81/Crypto-Statistical-Arbitrage.git
cd Crypto-Statistical-Arbitrage

# Create virtual environment
python -m venv .venv
source .venv/bin/activate    # macOS / Linux
# .venv\Scripts\activate     # Windows

# Install dependencies
pip install -r requirements.txt

# macOS only (required by XGBoost)
brew install libomp
```

### Configuration

```bash
# Copy the API key template
cp config/api_keys_template.env config/.env

# Edit with your API keys (many venues work without keys)
nano config/.env

# Verify credentials
python config/verify_my_credentials.py
```

> **Note:** Many data sources (Binance, Bybit, OKX, Hyperliquid, dYdX, GeckoTerminal, DeFiLlama, etc.) work without API keys using public endpoints.

### Running

```bash
# Full pipeline: Data Collection + Altcoin StatArb + BTC Futures + Visualizations
python run_arb.py

# Skip data collection (use existing data)
python run_arb.py --skip-phase1

# Run specific phases
python run_arb.py --phase 2      # Altcoin StatArb only
python run_arb.py --phase 3      # BTC Futures only
python run_arb.py --phase 2 3    # Both strategies

# Cold run (clear all caches)
python run_arb.py --clean-cache

# Validate compliance (61 checks)
python run_arb.py --validate
```

**All run modes**

| Mode | Command | Description |
|:-----|:--------|:------------|
| **Full Pipeline** | `python run_arb.py` | All phases + visualizations + compliance |
| **Cold Run** | `python run_arb.py --clean-cache` | Clear caches, run from scratch |
| **Warm Run** | `python run_arb.py --skip-phase1` | Skip data collection |
| **Phase Select** | `python run_arb.py --phase 2` | Run specific phase(s) |
| **Validate** | `python run_arb.py --validate` | 61-check compliance audit |
| **Check Data** | `python run_arb.py --check-only` | Data readiness audit |
| **1-Day Test** | `python run_phase1.py --start 2026-02-08 --end 2026-02-09` | Smoke test |

---

## Project Structure

```
.
├── config/                          # Configuration
│   ├── config.yaml                  # Strategy parameters, risk limits, dates
│   ├── venues.yaml                  # 32 venue configs (endpoints, costs, capacity)
│   ├── symbols.yaml                 # 211 symbols across 16 sectors
│   └── api_keys_template.env        # API key template (copy to .env)
│
├── data_collection/                 # Phase 1: Data acquisition layer
│   ├── cex/                         # CEX collectors (Binance, Bybit, OKX, ...)
│   ├── dex/                         # DEX collectors (Uniswap, Curve, GMX, ...)
│   ├── hybrid/                      # Hybrid collectors (Hyperliquid, dYdX, Drift)
│   ├── onchain/                     # On-chain analytics (10 providers)
│   ├── options/                     # Options data (Deribit, Aevo)
│   ├── alternative/                 # Alternative data (DeFiLlama, Coinalyze, ...)
│   ├── market_data/                 # Market data aggregators
│   ├── indexers/                    # Blockchain indexers (The Graph)
│   └── utils/                       # Rate limiting, caching, validation, storage
│
├── strategies/                      # Trading strategies
│   ├── pairs_trading/               # Cointegration, Kalman filter, ML signals
│   ├── futures_curve/               # Term structure, calendar spreads
│   ├── funding_rate_arb/            # Cross-venue funding rate arbitrage
│   └── vol_surface_or_dex_arb/      # Options vol surface / DEX arbitrage
│
├── backtesting/                     # Backtesting engine
│   ├── backtest_engine.py           # Core event-driven backtester
│   ├── optimized_backtest.py        # Vectorized fast backtester
│   └── analysis/                    # Walk-forward, crisis, capacity, attribution
│
├── portfolio/                       # Portfolio construction
│   ├── optimizer.py                 # HRP, MVO, risk parity, Black-Litterman
│   └── risk_manager.py              # Drawdown stops, VaR limits, stress tests
│
├── reporting/                       # Report generation
│   ├── advanced_report_generator.py
│   └── strict_pdf_validator.py      # 61-check compliance validator
│
├── execution/                       # Execution layer
│   ├── order_manager.py             # Order routing and management
│   └── slippage_model.py            # Venue-specific slippage models
│
├── notebooks/                       # Jupyter notebooks
│   ├── 00_data_acquisition_plan.ipynb
│   ├── 01_cex_data_exploration.ipynb
│   ├── 02_dex_data_exploration.ipynb
│   ├── 03_venue_comparison.ipynb
│   ├── 04_strategy_development.ipynb
│   ├── 05_multi_venue_backtesting.ipynb
│   └── 06_portfolio_construction.ipynb
│
├── docs/                            # Documentation
│   ├── methodology.md               # Statistical methodology
│   ├── data_dictionary.md           # Data schema reference
│   ├── data_sources.md              # Venue documentation
│   ├── api_reference.md             # API reference
│   └── venue_comparison.md          # Venue comparison analysis
│
├── tests/                           # Test suite
│   ├── unit/                        # Unit tests
│   ├── integration/                 # Integration tests
│   └── performance/                 # Performance benchmarks
│
├── run_arb.py                       # Master orchestrator
├── phase1run.py                     # Phase 1 entry point
├── phase2run.py                     # Phase 2 engine (~5,700 lines)
├── run_phase1.py                    # Phase 1 engine (~5,400 lines)
├── run_phase3.py                    # Phase 3 entry point
├── generate_visualizations.py       # Chart generator (34 visualizations)
├── requirements.txt                 # Dependencies (137 packages)
├── setup.py                         # Package configuration
└── Makefile                         # Build automation
```

---

## Configuration

All strategy parameters are defined in `config/config.yaml`:

| Section | What it controls |
|:--------|:-----------------|
| `universe` | Token lists, sector mappings, venue assignments |
| `cointegration` | Half-life bounds, p-value thresholds, test window |
| `strategy` | Z-score entry/exit/stop, position sizing, Kelly fraction |
| `risk` | Drawdown limits, concentration caps, correlation thresholds |
| `backtest` | Train/test dates, transaction costs, walk-forward windows |
| `venues` | Per-venue endpoints, rate limits, fee schedules |

---

## Development

```bash
make format            # Black + isort formatting
make lint              # Flake8 linting
make type-check        # mypy type checking
make quality           # All of the above
make test              # Run all tests
make test-unit         # Unit tests only
make test-integration  # Integration tests
make test-coverage     # With coverage report
```

See [CONTRIBUTING.md](CONTRIBUTING.md) for full development guidelines.

---

## Documentation

| Document | Description |
|:---------|:------------|
| [Methodology](docs/methodology.md) | Statistical methodology: cointegration, Kalman filter, HMM |
| [Data Dictionary](docs/data_dictionary.md) | Data schema and field reference |
| [Data Sources](docs/data_sources.md) | Venue documentation and capabilities |
| [API Reference](docs/api_reference.md) | Module-level API reference |
| [Venue Comparison](docs/venue_comparison.md) | Cross-venue comparison analysis |

---

## Dependencies

137 pinned packages organized by function:

| Category | Key Packages |
|:---------|:-------------|
| **Scientific Computing** | numpy, pandas, scipy |
| **Data Collection** | ccxt, aiohttp, requests, websockets, httpx |
| **Econometrics** | statsmodels, arch, hmmlearn |
| **Machine Learning** | scikit-learn, lightgbm, xgboost |
| **GPU Acceleration** | numba, pyopencl, joblib |
| **Portfolio Optimization** | cvxpy |
| **Data Storage** | pyarrow, fastparquet, h5py |
| **Visualization** | matplotlib, seaborn, plotly, kaleido |
| **Configuration** | pydantic, python-dotenv, PyYAML |

---

## Disclaimer

This project is for **educational and research purposes only**. It is not financial advice. Cryptocurrency trading involves substantial risk of loss. Past performance does not guarantee future results. Always do your own research before making any investment decisions.

---

## License

This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.

---

Built by Tamer Atesyakar