---
name: data-orchestrator
description: |
  Central data strategy coordinator for AI-driven crypto trading. Use when building data pipelines, validating data quality, coordinating multi-source data aggregation, or implementing ML-ready data infrastructure. Covers: data governance, quality validation, source management, real-time pipelines, backtesting infrastructure.
tools: Read(pattern:.claude/skills/data-orchestrator/**), WebSearch, WebFetch(domain:api.llama.fi|dexscreener.com|birdeye.so|coingecko.com|coinapi.io), mcp__perplexity-ask__search, mcp__perplexity-ask__reason, TodoWrite
---

# Data Orchestrator - AI Trading Data Strategy Layer

Central nervous system for all data operations across the meme-times ecosystem. Implements a robust, AI-ready data strategy that precedes and enables all trading decisions.

## Core Principle

> Data strategy comes BEFORE AI. Clean governance, validated sources, and secure infrastructure enable useful trading insights.

## Activation Triggers

- "Build a data pipeline for [use case]"
- "Validate data quality for [source]"
- "Set up backtesting infrastructure"
- "Aggregate data from multiple sources"
- "Data governance for [trading strategy]"
- "Real-time data feed for [token/market]"
- Keywords: data pipeline, data quality, validation, aggregation, backtesting, ML training, historical data, real-time feed, data governance

## Data Strategy Pillars
### 1. Diverse Data Sources

**Price & Market Data:**

| Source | Data Type | Latency | Quality | Cost |
|--------|-----------|---------|---------|------|
| Dexscreener API | DEX prices, pools | Real-time | High | Free tier |
| Birdeye API | Solana tokens, analytics | Real-time | High | Paid |
| Jupiter Price API | Solana swap prices | Real-time | High | Free |
| CoinGecko | Cross-chain prices | 1-5 min | High | Free tier |
| CoinAPI | Historical + streaming | Configurable | Premium | Paid |

**On-Chain Data:**

| Source | Data Type | Latency | Quality | Cost |
|--------|-----------|---------|---------|------|
| Helius RPC | Solana transactions | Real-time | High | Freemium |
| Solscan API | Token info, holders | Near real-time | High | Free tier |
| Dune Analytics | SQL queries | Minutes-hours | High | Freemium |
| Flipside Crypto | Pre-built datasets | Hours | High | Free |

**Sentiment & Social:**

| Source | Data Type | Latency | Quality | Cost |
|--------|-----------|---------|---------|------|
| Twitter/X API | Social mentions | Real-time | Medium | Paid |
| LunarCrush | Social metrics | Near real-time | High | Paid |
| Telegram scraping | Community sentiment | Real-time | Low | DIY |
| Reddit API | Discussion sentiment | Minutes | Medium | Free |

**DeFi Protocol Data:**

| Source | Data Type | Latency | Quality | Cost |
|--------|-----------|---------|---------|------|
| DefiLlama API | TVL, revenue, yields | 15 min | High | Free |
| Token Terminal | Revenue, P/E ratios | Daily | High | Paid |
| DeFi Pulse | TVL rankings | Hourly | Medium | Free |
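The tables above imply a selection policy: prefer the highest-priority source whose latency fits the use case. A minimal sketch of that policy — the `DataSource` shape, latency figures, and helper name are illustrative assumptions, not part of the skill's fixed API:

```typescript
// Hypothetical source catalog mirroring the price table above.
type SourceTier = "free" | "freemium" | "paid";

interface DataSource {
  name: string;
  category: "price" | "onchain" | "sentiment" | "defi";
  latencyMs: number; // typical end-to-end latency (illustrative values)
  priority: number;  // 1 = preferred
  tier: SourceTier;
}

const priceSources: DataSource[] = [
  { name: "dexscreener", category: "price", latencyMs: 500, priority: 1, tier: "free" },
  { name: "birdeye", category: "price", latencyMs: 500, priority: 2, tier: "paid" },
  { name: "coingecko", category: "price", latencyMs: 60_000, priority: 3, tier: "freemium" },
];

// Pick the highest-priority source whose latency meets the caller's budget.
function selectSource(sources: DataSource[], maxLatencyMs: number): DataSource | undefined {
  return sources
    .filter((s) => s.latencyMs <= maxLatencyMs)
    .sort((a, b) => a.priority - b.priority)[0];
}
```

For live trading a budget of ~1 second would resolve to Dexscreener here, while a very tight budget that no source meets returns `undefined`, which the caller can treat as a "source unavailable" condition.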
### 2. Data Quality & Governance

**Quality Dimensions:**

```typescript
interface DataQualityMetrics {
  accuracy: number;     // 0-100: correctness vs ground truth
  completeness: number; // 0-100: missing values ratio
  timeliness: number;   // 0-100: freshness score
  consistency: number;  // 0-100: cross-source agreement
  validity: number;     // 0-100: schema conformance
}

interface QualityThresholds {
  trading_signals: { min: 90, critical: 'timeliness' };
  historical_analysis: { min: 85, critical: 'completeness' };
  sentiment_analysis: { min: 70, critical: 'consistency' };
  backtesting: { min: 95, critical: 'accuracy' };
}
```

**Validation Pipeline:**

```
Raw Data → Schema Validation → Anomaly Detection → Cross-Source Check → Quality Score → Accept/Reject
```

**Validation Rules:**

1. **Schema Validation**: All data must match expected types/formats
2. **Range Checks**: Prices, volumes, percentages within valid bounds
3. **Anomaly Detection**: Flag outliers > 3 standard deviations
4. **Cross-Source Verification**: Compare with 2+ sources for critical data
5. **Freshness Enforcement**: Reject stale data beyond threshold

**Automated Quality Monitoring:**

```bash
# Run continuous quality checks
npx tsx .claude/skills/data-orchestrator/scripts/quality-monitor.ts \
  --sources "dexscreener,birdeye,jupiter" \
  --interval 60 \
  --alert-threshold 80
```
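Rules 1, 2, 3, and 5 above can be sketched for a single price tick as follows. This is illustrative only: the `PriceTick` shape and function names are assumptions, and rule 4 (cross-source verification) is omitted because it requires a second feed:

```typescript
// A hypothetical tick shape; real feeds will differ per source.
interface PriceTick {
  token: string;
  price: number;
  volume: number;
  timestamp: number; // epoch ms
}

function zScore(x: number, history: number[]): number {
  const mean = history.reduce((a, b) => a + b, 0) / history.length;
  const variance = history.reduce((a, b) => a + (b - mean) ** 2, 0) / history.length;
  const std = Math.sqrt(variance);
  return std === 0 ? 0 : (x - mean) / std;
}

function validateTick(
  tick: PriceTick,
  history: number[], // recent prices, used for anomaly detection
  maxAgeMs: number,  // freshness threshold
  now: number = Date.now()
): { ok: boolean; failures: string[] } {
  const failures: string[] = [];
  // Rule 1 — schema: required numeric fields must be real numbers
  if (typeof tick.price !== "number" || Number.isNaN(tick.price)) failures.push("schema");
  // Rule 2 — range: prices positive, volume non-negative
  if (tick.price <= 0 || tick.volume < 0) failures.push("range");
  // Rule 3 — anomaly: flag outliers beyond 3 standard deviations
  if (history.length >= 2 && Math.abs(zScore(tick.price, history)) > 3) failures.push("anomaly");
  // Rule 5 — freshness: reject stale data beyond threshold
  if (now - tick.timestamp > maxAgeMs) failures.push("freshness");
  return { ok: failures.length === 0, failures };
}
```

A failing tick carries the list of violated rules, which maps naturally onto the Accept/Reject step and the per-dimension scores in `DataQualityMetrics`.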
### 3. Real-Time Data Pipeline Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                      DATA INGESTION LAYER                       │
├─────────────────────────────────────────────────────────────────┤
│  WebSocket Feeds    │  REST Polling  │  RPC Subscriptions       │
│  (Dexscreener, WS)  │  (Coingecko)   │  (Helius, Solana)        │
└──────────┬──────────┴────────┬───────┴──────────┬───────────────┘
           │                   │                  │
           ▼                   ▼                  ▼
┌─────────────────────────────────────────────────────────────────┐
│                     VALIDATION & ENRICHMENT                     │
├─────────────────────────────────────────────────────────────────┤
│  Schema Check  │  Anomaly Flag  │  Cross-Verify  │ Quality Score│
└──────────┬──────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│                       DATA STORAGE LAYER                        │
├─────────────────────────────────────────────────────────────────┤
│  Hot Store (Redis)   │  Warm Store (SQLite)  │  Cold (Parquet)  │
│  Real-time prices    │  Recent history (7d)  │  Historical data │
│  TTL: 5 minutes      │  Indexed, queryable   │  Compressed, ML  │
└──────────┬──────────────────────────────────────────────────────┘
           │
           ▼
┌─────────────────────────────────────────────────────────────────┐
│                        CONSUMPTION LAYER                        │
├─────────────────────────────────────────────────────────────────┤
│ Trading Signals │ ML Training     │ Backtesting     │ Dashboards│
│ (meme-trader)   │ (llama-analyst) │ (meme-executor) │ (Reports) │
└─────────────────────────────────────────────────────────────────┘
```

**Data Flow Configuration:**

```typescript
interface PipelineConfig {
  sources: DataSource[];
  validationRules: ValidationRule[];
  enrichmentSteps: EnrichmentFunction[];
  storageTargets: StorageTarget[];
  alertsEnabled: boolean;
  qualityThreshold: number;
}

const defaultPipeline: PipelineConfig = {
  sources: [
    { name: 'dexscreener', type: 'websocket', priority: 1 },
    { name: 'birdeye', type: 'rest', priority: 2 },
    { name: 'jupiter', type: 'rest', priority: 3 },
  ],
  validationRules: ['schema', 'range', 'anomaly', 'freshness'],
  enrichmentSteps: ['normalize', 'calculate_indicators', 'tag_quality'],
  storageTargets: ['redis:hot', 'sqlite:warm'],
  alertsEnabled: true,
  qualityThreshold: 85,
};
```

### 4. ML/AI Integration Framework

**Data Preparation for ML:**

```typescript
interface MLReadyDataset {
  features: {
    price_data: TimeSeriesFeatures;
    volume_data: TimeSeriesFeatures;
    onchain_metrics: OnChainFeatures;
    sentiment_scores: SentimentFeatures;
    technical_indicators: TechnicalFeatures;
  };
  labels: {
    price_direction: 'up' | 'down' | 'sideways';
    price_magnitude: number;
    optimal_action: 'buy' | 'sell' | 'hold';
  };
  metadata: {
    timestamp: Date;
    token: string;
    quality_score: number;
    source_count: number;
  };
}

interface TimeSeriesFeatures {
  values: number[];
  timestamps: Date[];
  normalized: number[]; // Z-score normalized
  lagged: number[][];   // [lag_1, lag_5, lag_15, lag_60]
  rolling_stats: {
    mean_5: number[];
    std_5: number[];
    mean_15: number[];
    std_15: number[];
  };
}
```

**Supported ML Techniques:**

1. **Anomaly Detection**: Isolation forests for unusual price/volume patterns
2. **NLP/Sentiment**: BERT-based sentiment from social feeds
3. **Time Series**: LSTM/Transformer for price prediction
4. **Classification**: XGBoost for buy/sell signal classification
5. **Reinforcement Learning**: DQN for optimal trade execution

**Continuous Learning Pipeline:**

```
New Data → Feature Extraction → Model Inference → Signal Generation
    ↑                                                    ↓
Model Retraining (Weekly) ◄──────────── Performance Feedback
```
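The `TimeSeriesFeatures` fields above — z-score normalization, lags, and rolling statistics — can be built with a few small helpers. A minimal sketch; the function names are illustrative and the windows/lags follow the interface comments:

```typescript
// Rolling mean over a trailing window (shorter at the start of the series).
function rollingMean(values: number[], window: number): number[] {
  return values.map((_, i) => {
    const slice = values.slice(Math.max(0, i - window + 1), i + 1);
    return slice.reduce((a, b) => a + b, 0) / slice.length;
  });
}

// Rolling population standard deviation over the same trailing window.
function rollingStd(values: number[], window: number): number[] {
  const means = rollingMean(values, window);
  return values.map((_, i) => {
    const slice = values.slice(Math.max(0, i - window + 1), i + 1);
    const v = slice.reduce((a, b) => a + (b - means[i]) ** 2, 0) / slice.length;
    return Math.sqrt(v);
  });
}

// Whole-series z-score normalization for the `normalized` field.
function zNormalize(values: number[]): number[] {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const std =
    Math.sqrt(values.reduce((a, b) => a + (b - mean) ** 2, 0) / values.length) || 1;
  return values.map((x) => (x - mean) / std);
}

// Lagged copy: the value `lag` steps back, NaN-padded at the start.
function lagSeries(values: number[], lag: number): number[] {
  return values.map((_, i) => (i >= lag ? values[i - lag] : NaN));
}
```

Calling `lagSeries` with lags 1, 5, 15, and 60 and `rollingMean`/`rollingStd` with windows 5 and 15 yields exactly the `lagged` and `rolling_stats` fields of the interface; note that whole-series z-scores leak future information, so for training a trailing-window variant is the safer choice.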
### 5. Backtesting Infrastructure

**Historical Data Requirements:**

```typescript
interface BacktestDataset {
  token: string;
  timeframe: '1m' | '5m' | '15m' | '1h' | '4h' | '1d';
  start_date: Date;
  end_date: Date;
  data_points: {
    timestamp: Date;
    open: number;
    high: number;
    low: number;
    close: number;
    volume: number;
    liquidity: number;
    holders: number;
    sentiment_score?: number;
  }[];
  quality_metrics: DataQualityMetrics;
}
```

**Backtest Execution:**

```bash
# Run backtest with historical data
npx tsx .claude/skills/data-orchestrator/scripts/backtest-runner.ts \
  --strategy "momentum" \
  --token "BONK" \
  --start "2024-01-01" \
  --end "2024-12-01" \
  --initial-capital 1000 \
  --slippage 0.01
```

**Backtest Report Output:**

```
BACKTEST REPORT: Momentum Strategy on BONK
Period: 2024-01-01 to 2024-12-01

PERFORMANCE:
- Total Return: +234.5%
- Sharpe Ratio: 1.87
- Max Drawdown: -28.3%
- Win Rate: 62.4%
- Profit Factor: 2.15

TRADES:
- Total Trades: 156
- Avg Trade Duration: 4.2 hours
- Best Trade: +45.2%
- Worst Trade: -12.8%

DATA QUALITY:
- Coverage: 99.2%
- Missing Points: 847 / 105,120
- Quality Score: 94/100

CAVEATS:
- Historical results do not guarantee future performance
- Slippage model: 1% (actual may vary)
- Does not account for: MEV, extreme volatility periods
```
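The headline metrics in the report above follow from an equity curve and a list of per-trade returns. A minimal sketch of three of them — function names are illustrative, and Sharpe ratio is omitted since it depends on a return-frequency and annualization convention the report does not state:

```typescript
// Max drawdown as a fraction of the running peak (0.283 = -28.3%).
function maxDrawdown(equityCurve: number[]): number {
  let peak = -Infinity;
  let maxDd = 0;
  for (const v of equityCurve) {
    peak = Math.max(peak, v);
    maxDd = Math.max(maxDd, (peak - v) / peak);
  }
  return maxDd;
}

// Fraction of trades with a positive return (0.624 = 62.4%).
function winRate(tradeReturns: number[]): number {
  return tradeReturns.filter((r) => r > 0).length / tradeReturns.length;
}

// Gross gains divided by gross losses; > 1 means net profitable.
function profitFactor(tradeReturns: number[]): number {
  const gross = (sign: number) =>
    tradeReturns
      .filter((r) => Math.sign(r) === sign)
      .reduce((a, b) => a + Math.abs(b), 0);
  return gross(1) / gross(-1);
}
```

Returns here are fractions (0.05 = +5%). Feeding these functions the trade log produced by `backtest-runner.ts` would reproduce the Max Drawdown, Win Rate, and Profit Factor lines of the report.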
### 6. Risk & Portfolio Data Layer

**Portfolio Metrics:**

```typescript
interface PortfolioData {
  positions: {
    token: string;
    entry_price: number;
    current_price: number;
    size: number;
    unrealized_pnl: number;
    allocation_pct: number;
    risk_score: number;
  }[];
  aggregate: {
    total_value: number;
    total_pnl: number;
    daily_var: number;           // Value at Risk (95%)
    beta_to_sol: number;
    concentration_score: number; // Herfindahl index
  };
  limits: {
    max_position_size: number;
    max_daily_loss: number;
    max_correlation: number;
    stop_loss_pct: number;
  };
}
```

**Risk Data Sources:**

- Position tracking from meme-executor
- Price volatility from historical data
- Correlation matrix from cross-asset analysis
- Liquidity depth from DEX APIs

## Implementation Scripts

### Data Pipeline Manager

```bash
# Start the data pipeline
npx tsx .claude/skills/data-orchestrator/scripts/pipeline-manager.ts \
  --mode production \
  --sources all \
  --storage sqlite,redis

# Validate specific data source
npx tsx .claude/skills/data-orchestrator/scripts/validate-source.ts \
  --source dexscreener \
  --token "BONK" \
  --verbose

# Generate ML-ready dataset
npx tsx .claude/skills/data-orchestrator/scripts/ml-dataset-builder.ts \
  --token "BONK" \
  --features "price,volume,sentiment" \
  --lookback 30 \
  --output ./datasets/bonk_ml_ready.parquet
```

## Integration with Other Skills

**Data Orchestrator provides to:**

- **meme-trader**: Validated price/volume data, quality scores
- **llama-analyst**: DeFi protocol metrics, TVL/revenue time series
- **meme-executor**: Real-time execution prices, slippage estimates
- **flow-tracker**: On-chain flow data, whale movements
- **degen-savant**: Sentiment aggregates, social momentum

**Data Orchestrator receives from:**

- **All skills**: Data quality feedback, missing data requests
- **meme-executor**: Trade execution data for backtest validation

## Quality Gates

- All data must have quality score >= 80% for trading signals
- Price data staleness: max 30 seconds for live trading
- Historical data completeness: min 95% for backtesting
- Cross-source agreement: min 2 sources for critical decisions
- Schema validation: 100% compliance required
- Anomaly flagging: auto-reject data points > 5 sigma

## Error Handling

- **Source unavailable**: Failover to backup source within 5 seconds
- **Quality below threshold**: Alert + fallback to last good data
- **Schema mismatch**: Log error, use default values, alert
- **Rate limit hit**: Exponential backoff, rotate API keys
- **Network timeout**: Retry 3x with increasing delay

## Compliance & Security

- Encrypt API keys at rest and in transit
- Log all data access for audit trails
- Implement IP whitelisting for production
- GDPR-compliant: no PII in trading data
- Rate limit all external calls (prevent API bans)
- Regular security audits of data pipeline

## Performance Targets

| Metric | Target | Measurement |
|--------|--------|-------------|
| Data latency (real-time) | < 500ms | WebSocket to storage |
| Validation throughput | > 10K records/sec | Validation pipeline |
| Quality score accuracy | > 95% | vs manual audit |
| Backtest data coverage | > 99% | Historical completeness |
| ML feature freshness | < 5 min | Feature store update |

## Resources

- references/data-sources.md - Complete API documentation
- references/quality-standards.md - Validation rule definitions
- references/ml-feature-catalog.md - Available ML features
- scripts/pipeline-manager.ts - Main orchestration script
- scripts/quality-monitor.ts - Continuous quality monitoring
- scripts/backtest-runner.ts - Historical backtesting
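As an appendix, the "rate limit hit" and "network timeout" policies in the Error Handling section can be sketched as a single retry wrapper. This is a hedged sketch: exponential backoff with jitter, with key rotation and failover left out; `withBackoff` and its parameters are illustrative names:

```typescript
// Retry a rate-limited call up to `maxRetries` times, doubling the
// delay on each attempt and adding up to 25% random jitter so that
// parallel workers do not retry in lockstep.
async function withBackoff<T>(
  fetchFn: () => Promise<T>, // placeholder for any rate-limited call
  maxRetries = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fetchFn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // exhausted: surface the error
      const delay = baseDelayMs * 2 ** attempt * (1 + Math.random() * 0.25);
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```

With the defaults this matches the "retry 3x with increasing delay" rule; a production version would also inspect the error (e.g. retry only HTTP 429/5xx) and rotate API keys between attempts.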