# Crypto Stat-Arb

This repository studies market-neutral crypto statistical arbitrage with signed-graph clustering and walk-forward backtesting. It builds a residualized correlation graph after removing the market mode, clusters the graph with signed methods such as SPONGE and BNC, and trades cluster-level mean-reversion signals under explicit turnover and transaction-cost controls.

## Repository layout

- `stat_arb/`: main research package for data loading, graph construction, clustering, signals, backtests, and reporting
- `data/`: processed market, volume, ETH, and correlation datasets used by the backtests
- `pics/`: diagnostic figures for clustering quality and exploratory analysis
- `crypto_project.ipynb`: exploratory notebook used during early research
- `archived_research/`: older exploratory artifacts retained for reference
- `Crypto_Project_Report_Pre_Backtest.pdf`: written report from the earlier research stage

## Setup

```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install numpy pandas scipy scikit-learn matplotlib statsmodels
```

Run the baseline SPONGE backtest:

```bash
python stat_arb/run_phase1.py
```

Run the clustering-method sweep:

```bash
python stat_arb/run_phase2.py
```

If you want to rerun the notebook cells that call CoinMarketCap, export your credential first:

```bash
export CMC_API_KEY=your_coinmarketcap_key
```

## Methodology

The pipeline first aligns token prices, volumes, and ETH reference data, then builds a tradable universe subject to history and liquidity filters. Returns are residualized against the market mode with PCA, transformed into a signed k-nearest-neighbor correlation graph, and clustered with SPONGE, BNC, or signed spectral methods. Signals are generated from within-cluster mean reversion, normalized to target leverage, and evaluated in a walk-forward backtest with lagging, turnover controls, and transaction-cost assumptions to limit lookahead and overstatement.
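
The steps in the paragraph above can be illustrated with a minimal NumPy sketch. It is not the code in `stat_arb/`: the function names, the synthetic returns, the hand-assigned cluster labels (standing in for SPONGE or BNC output), and the parameters `k`, `lookback`, `leverage`, and `cost_bps` are all assumptions chosen for illustration.

```python
import numpy as np

def residualize_market_mode(returns, n_modes=1):
    """Remove the top principal component(s) from a (T, N) return matrix."""
    centered = returns - returns.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions in asset space; the first is
    # treated here as the market mode.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_modes]
    return centered - centered @ top.T @ top

def signed_knn_graph(residuals, k=2):
    """Per asset, keep the k correlations of largest magnitude, with sign."""
    corr = np.corrcoef(residuals, rowvar=False)
    corr = (corr + corr.T) / 2.0          # enforce exact symmetry
    np.fill_diagonal(corr, 0.0)           # no self-loops
    adj = np.zeros_like(corr)
    for i in range(corr.shape[0]):
        nbrs = np.argsort(-np.abs(corr[i]))[:k]
        adj[i, nbrs] = corr[i, nbrs]
    # Keep an edge if either endpoint selected it.
    return np.where(np.abs(adj) >= np.abs(adj.T), adj, adj.T)

def mean_reversion_weights(residuals, labels, lookback=5, leverage=1.0):
    """Fade each asset's recent residual return relative to its cluster mean."""
    recent = residuals[-lookback:].sum(axis=0)
    raw = np.zeros_like(recent)
    for c in np.unique(labels):
        members = labels == c
        raw[members] = -(recent[members] - recent[members].mean())
    gross = np.abs(raw).sum()
    # Scale so that gross exposure equals the target leverage.
    return leverage * raw / gross if gross > 0 else raw

def net_pnl(prev_weights, new_weights, next_returns, cost_bps=10.0):
    """Weights set at time t earn the t+1 return, less a linear turnover cost."""
    turnover = np.abs(new_weights - prev_weights).sum()
    return float(new_weights @ next_returns) - cost_bps * 1e-4 * turnover

# Synthetic data: one market factor plus two group factors (assumed, not real data).
rng = np.random.default_rng(0)
T, N = 250, 6
labels = np.array([0, 0, 0, 1, 1, 1])     # hypothetical cluster assignment
market = rng.normal(0.0, 0.02, size=(T, 1))
group = rng.normal(0.0, 0.01, size=(T, 2))
returns = market + group[:, labels] + rng.normal(0.0, 0.005, size=(T, N))

resid = residualize_market_mode(returns)
graph = signed_knn_graph(resid, k=2)
weights = mean_reversion_weights(resid, labels)
```

Because the signal is demeaned within each cluster, the resulting weights sum to zero, which is one simple way to obtain the market-neutral book described above. In a walk-forward setting, the residualization, graph, and weights would be fit only on data available up to the rebalance time and applied to the following period's returns.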

## Results

Primary outputs are written under `stat_arb/reporting/` and include fold-level returns, turnover series, clustering sweep summaries, leaderboards, and the final report. The intended use is comparative research across clustering methods rather than a production-ready live trading engine.

## Known limits

- Results are sensitive to crypto data quality, survivorship, and execution assumptions
- The checked-in notebook and archived artifacts reflect exploratory work and are less polished than the package backtest path
- Transaction costs and liquidity in crypto can change quickly enough to invalidate static assumptions

## License

This project is distributed under the MIT License.