---
name: crypto-prediction-multimodal
description: |
  Skill for building a multimodal crypto price prediction system that combines
  price time-series data, on-chain blockchain metrics, and text sentiment
  analysis (news/social media). Use this skill whenever the user asks for help
  creating, developing, or fixing a crypto price prediction project — including
  when they mention terms such as: crypto prediction, Bitcoin/Ethereum price
  model, crypto news sentiment, on-chain data, crypto LSTM, time-series
  Transformer, FinBERT, MVRV, SOPR, exchange flow, crypto backtesting, or
  multimodal pipeline. Also use this skill if the user wants to build a crypto
  analytics dashboard, a volatility alert system, or an ML-signal-based trading
  application.
---

# Crypto Prediction Multimodal Skill

This skill guides the construction of a crypto price prediction system that combines three data modalities: **market price (time-series)**, **on-chain metrics (blockchain fundamentals)**, and **text sentiment (NLP)**. The goal is not perfect prediction, but a probabilistic signal that outperforms traditional univariate models.
---

## System Architecture (7 Layers)

```
[DATA SOURCES] → [PREPROCESSING] → [ENCODING] → [FUSION] → [OUTPUT HEADS] → [EVALUATION] → [DEPLOYMENT]
```

See the details of each layer below, or jump straight to the relevant part:

- Setup & data → [Section 1]
- Preprocessing & feature engineering → [Section 2]
- Model architecture → [Section 3]
- Training & evaluation → [Section 4]
- Deployment & dashboard → [Section 5]
- Common-mistakes checklist → [Section 6]

---

## Section 1: Setup & Data Collection

### 1.1 Project Structure

```
crypto-prediction/
├── data/
│   ├── raw/
│   │   ├── price/        # OHLCV from exchanges
│   │   ├── onchain/      # Blockchain metrics
│   │   └── text/         # Raw news & social media
│   └── processed/
│       ├── features/     # Ready-to-use features
│       └── sequences/    # Model-ready sequences
├── models/
│   ├── encoders/         # TS encoder & text encoder
│   ├── fusion/           # Cross-modal attention
│   └── heads/            # Output heads
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_modelling.ipynb
├── src/
│   ├── data/             # Data collection pipeline
│   ├── features/         # Feature engineering
│   ├── models/           # Architecture definitions
│   ├── training/         # Training loop & validation
│   └── evaluation/       # Metrics & backtesting
├── dashboard/            # Streamlit app
├── configs/              # Hyperparameter YAML
└── requirements.txt
```

### 1.2 Core Dependencies

```txt
# Core ML
torch>=2.0.0
pytorch-lightning>=2.0.0
transformers>=4.35.0
einops>=0.7.0

# Time-series
neuralforecast>=1.6.0   # TFT, PatchTST
statsmodels>=0.14.0     # GARCH, HMM
hmmlearn>=0.3.0         # GaussianHMM regime detector
ta>=0.10.0              # Technical indicators

# Data
ccxt>=4.0.0             # Exchange APIs
pandas>=2.0.0
numpy>=1.24.0
feedparser>=6.0.0       # RSS ingestion
praw>=7.7.0             # Reddit API

# NLP
vaderSentiment>=3.3.2
# pip install transformers (for FinBERT / CryptoBERT)

# Explainability
shap>=0.43.0
captum>=0.6.0           # Integrated Gradients

# Dashboard
streamlit>=1.28.0
plotly>=5.17.0

# MLOps
mlflow>=2.7.0
evidently>=0.4.0        # Drift detection
```

### 1.3 Data Sources & How to Fetch Them

#### A. Market Price Data

```python
import ccxt
import pandas as pd

def fetch_ohlcv(symbol="BTC/USDT", timeframe="1h", since_days=365):
    exchange = ccxt.binance()
    since = exchange.parse8601(
        (pd.Timestamp.now() - pd.Timedelta(days=since_days)).isoformat()
    )
    # NOTE: a single call returns at most ~1000 candles; paginate with the
    # last returned timestamp as the new `since` to cover longer histories.
    ohlcv = exchange.fetch_ohlcv(symbol, timeframe, since=since, limit=1000)
    df = pd.DataFrame(ohlcv, columns=["timestamp", "open", "high", "low", "close", "volume"])
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    df.set_index("timestamp", inplace=True)
    return df

# Extra feature: funding rate (for futures)
def fetch_funding_rate(symbol="BTC/USDT:USDT"):
    exchange = ccxt.binance()
    history = exchange.fetch_funding_rate_history(symbol, limit=500)
    return pd.DataFrame(history)
```

#### B. On-Chain Data

Priority sources (from easiest to most advanced):

1. **Kaggle datasets** — search "Bitcoin on-chain metrics"; free and ready to use
2. **CoinMetrics Community** — `https://api.coinmetrics.io/v4/timeseries/asset-metrics`
3. **Glassnode** — limited free tier; paid tier needed for SOPR/MVRV
4. **Dune Analytics** — custom SQL queries; results can be exported as CSV

Priority on-chain metrics (ordered from highest to lowest impact):

| Metric | Signal Interpretation |
|---|---|
| Exchange Net Flow | Large inflow → potential selling (bearish) |
| MVRV Ratio | >3.5 = overvalued, <1 = undervalued |
| SOPR | >1 = holders selling at a profit, <1 = selling at a loss |
| Active Addresses | Rising = growing adoption (bullish) |
| Whale Supply % | Whale accumulation = bullish medium-term |
| Hash Rate | Rising = miner confidence (bullish for PoW) |
| Exchange Reserves | Falling = bullish (holders moving to cold wallets) |

```python
import requests
import pandas as pd

def fetch_coinmetrics(asset="btc", metrics="AdrActCnt,TxCnt,FlowNetToExNtv", freq="1d"):
    url = "https://api.coinmetrics.io/v4/timeseries/asset-metrics"
    params = {"assets": asset, "metrics": metrics, "frequency": freq, "page_size": 1000}
    r = requests.get(url, params=params)
    data = r.json()["data"]
    df = pd.DataFrame(data)
    df["time"] = pd.to_datetime(df["time"])
    df.set_index("time", inplace=True)
    # Drop the non-numeric identifier column before casting metric values
    return df.drop(columns=["asset"]).astype(float)
```

#### C. Text & Sentiment Data

```python
import feedparser
import pandas as pd
from datetime import datetime, timedelta

# Priority crypto RSS feeds
FEEDS = [
    "https://cointelegraph.com/rss",
    "https://coindesk.com/arc/outboundfeeds/rss/",
    "https://cryptopanic.com/news/rss/"
]

def fetch_news_rss(feeds=FEEDS, days_back=30):
    articles = []
    cutoff = datetime.now() - timedelta(days=days_back)
    for url in feeds:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            # Skip entries without a parseable publication date
            if not getattr(entry, "published_parsed", None):
                continue
            pub = datetime(*entry.published_parsed[:6])
            if pub >= cutoff:
                articles.append({
                    "timestamp": pub,
                    "title": entry.title,
                    "summary": getattr(entry, "summary", ""),
                    "source": feed.feed.title
                })
    return pd.DataFrame(articles).sort_values("timestamp")

# Reddit via PRAW
import praw

def fetch_reddit_posts(subreddits=["Bitcoin", "CryptoCurrency"], limit=100):
    reddit = praw.Reddit(client_id="...", client_secret="...", user_agent="crypto-research")
    posts = []
    for sub in subreddits:
        for post in reddit.subreddit(sub).hot(limit=limit):
            posts.append({
                "timestamp": datetime.fromtimestamp(post.created_utc),
                "title": post.title,
                "score": post.score,
                "num_comments": post.num_comments,
                "subreddit": sub
            })
    return pd.DataFrame(posts)
```

---

## Section 2: Preprocessing & Feature Engineering

### 2.1 Timestamp Synchronization (Critical!)

All data MUST be aligned to the same timeframe.
Temporal rules that MUST be followed:

- Sentiment from a 14:00 news article MUST NOT be used for predictions made before 14:00
- Always use the `published_at` timestamp, never `scraped_at`
- For daily on-chain data, forward-fill down to the hourly timeframe

```python
def align_timestamps(df_price, df_onchain, df_sentiment, freq="1H"):
    # Price: resample to the target freq
    df_p = df_price.resample(freq).last().ffill()

    # On-chain: forward fill (daily data → hourly)
    df_o = df_onchain.resample(freq).last().ffill()

    # Sentiment: aggregate within the freq window
    df_s = df_sentiment.resample(freq).agg({
        "sentiment_score": ["mean", "std", "count"],
        "sentiment_momentum": "last"
    })
    df_s.columns = ["sent_mean", "sent_std", "sent_count", "sent_momentum"]
    df_s = df_s.ffill()

    # Join everything; make sure there is no lookahead bias
    df = df_p.join(df_o, how="left").join(df_s, how="left")
    df = df.ffill().dropna()
    return df
```

### 2.2 Time-Series Feature Engineering

```python
import numpy as np
import ta  # technical analysis library

def build_ts_features(df):
    close = df["close"]
    high = df["high"]
    low = df["low"]
    volume = df["volume"]

    # Return features (stationary)
    df["log_return_1h"] = np.log(close / close.shift(1))
    df["log_return_4h"] = np.log(close / close.shift(4))
    df["log_return_24h"] = np.log(close / close.shift(24))

    # Rolling volatility
    df["volatility_24h"] = df["log_return_1h"].rolling(24).std()
    df["volatility_7d"] = df["log_return_1h"].rolling(168).std()

    # Technical indicators
    df["rsi_14"] = ta.momentum.RSIIndicator(close, window=14).rsi()
    df["macd"] = ta.trend.MACD(close).macd_diff()
    df["bbands_pct"] = ta.volatility.BollingerBands(close).bollinger_pband()
    df["obv"] = ta.volume.OnBalanceVolumeIndicator(close, volume).on_balance_volume()

    # Market microstructure
    df["hl_spread"] = (high - low) / close   # intrabar spread
    df["vol_price_ratio"] = volume / close   # turnover proxy

    # Wavelet decomposition (optional, captures multi-scale structure)
    # import pywt
    # cA, cD = pywt.dwt(close.values, 'db4')

    return df.dropna()
```

### 2.3 On-Chain Sentiment Index

Build a composite index from on-chain metrics — this is what makes the project unique:

```python
def build_onchain_sentiment_index(df):
    """
    Combine several on-chain metrics into a single sentiment index.
    Positive = bullish signal, negative = bearish signal.
    """
    signals = pd.DataFrame(index=df.index)

    # Exchange flow: net outflow = bullish (holders moving coins to wallets)
    if "exchange_net_flow" in df.columns:
        signals["flow_signal"] = -np.sign(df["exchange_net_flow"])

    # MVRV: below 1 = undervalued (bullish), above 3.5 = overvalued (bearish)
    if "mvrv" in df.columns:
        signals["mvrv_signal"] = np.where(df["mvrv"] < 1, 1,
                                 np.where(df["mvrv"] > 3.5, -1, 0))

    # SOPR: above 1 = holder profit-taking (neutral-bearish short-term)
    if "sopr" in df.columns:
        signals["sopr_signal"] = np.where(df["sopr"] > 1.02, -0.5,
                                 np.where(df["sopr"] < 0.98, 0.5, 0))

    # Active addresses: rising momentum = bullish
    if "active_addresses" in df.columns:
        signals["addr_signal"] = np.sign(df["active_addresses"].pct_change(7))

    # Weighted average (weights can be tuned via grid search)
    weights = {"flow_signal": 0.35, "mvrv_signal": 0.25,
               "sopr_signal": 0.20, "addr_signal": 0.20}
    df["onchain_sentiment_idx"] = sum(
        signals[col] * w for col, w in weights.items() if col in signals
    )
    return df
```

### 2.4 NLP Sentiment Scoring

```python
import torch
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from transformers import pipeline

# Level 1: VADER (fast, suited to short texts)
def score_vader(texts):
    analyzer = SentimentIntensityAnalyzer()
    return [analyzer.polarity_scores(t)["compound"] for t in texts]

# Level 2: FinBERT (more accurate on financial text)
def score_finbert(texts, batch_size=32):
    classifier = pipeline(
        "sentiment-analysis",
        model="ProsusAI/finbert",
        device=0 if torch.cuda.is_available() else -1
    )
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        preds = classifier(batch, truncation=True, max_length=512)
        for p in preds:
            # Map label to a signed score; "neutral" contributes 0
            if p["label"] == "positive":
                score = p["score"]
            elif p["label"] == "negative":
                score = -p["score"]
            else:
                score = 0.0
            results.append(score)
    return results

# Level 3: CryptoBERT (fine-tuned specifically for the crypto domain)
# model="ElKulako/cryptobert" — swap the model string above

def aggregate_sentiment_hourly(df_news, score_col="sentiment_score"):
    """Aggregate sentiment per hour with extra features."""
    hourly = df_news.resample("1H", on="timestamp").agg(
        sent_mean=(score_col, "mean"),
        sent_std=(score_col, "std"),
        sent_count=(score_col, "count"),
        sent_max=(score_col, "max"),
        sent_min=(score_col, "min"),
    ).fillna(0)

    # Sentiment momentum: change versus the previous hour
    hourly["sent_momentum"] = hourly["sent_mean"].diff()

    # Weight by news volume (more articles = stronger signal)
    hourly["sent_weighted"] = hourly["sent_mean"] * np.log1p(hourly["sent_count"])
    return hourly
```

---

## Section 3: Model Architecture

### 3.1 Choose an Architecture by Complexity

| Level | Architecture | When to Use |
|---|---|---|
| Baseline | LSTM + concatenation | Proof of concept, small dataset |
| Intermediate | TFT + cross-attention | Medium dataset, interpretability needed |
| Advanced | PatchTST + CryptoBERT + gating | Large dataset, serious research |

### 3.2 Baseline: Dual-LSTM with Concatenation

```python
import torch
import torch.nn as nn

class DualLSTMFusion(nn.Module):
    def __init__(self, ts_dim, sent_dim, hidden_dim=64, output_dim=1, dropout=0.2):
        super().__init__()
        # Time-series encoder
        self.ts_encoder = nn.LSTM(
            input_size=ts_dim, hidden_size=hidden_dim,
            num_layers=2, batch_first=True, dropout=dropout
        )
        # Sentiment encoder (single layer, so inter-layer dropout is omitted)
        self.sent_encoder = nn.LSTM(
            input_size=sent_dim, hidden_size=hidden_dim // 2,
            num_layers=1, batch_first=True
        )
        # Fusion + prediction head
        fusion_dim = hidden_dim + hidden_dim // 2
        self.head = nn.Sequential(
            nn.Linear(fusion_dim, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, output_dim)
        )

    def forward(self, x_ts, x_sent):
        _, (h_ts, _) = self.ts_encoder(x_ts)
        _, (h_sent, _) = self.sent_encoder(x_sent)
        # Take the hidden state of the last layer
        h_ts = h_ts[-1]        # (batch, hidden_dim)
        h_sent = h_sent[-1]    # (batch, hidden_dim//2)
        fused = torch.cat([h_ts, h_sent], dim=-1)
        return self.head(fused)
```

### 3.3 Intermediate: Cross-Modal Attention Fusion

```python
class CrossModalAttentionFusion(nn.Module):
    """
    Time-series as query, sentiment as key/value.
    The model learns *when* the market is sensitive to sentiment.
    """
    def __init__(self, ts_dim, sent_dim, d_model=128, n_heads=4, dropout=0.1):
        super().__init__()
        # Project inputs to d_model
        self.ts_proj = nn.Linear(ts_dim, d_model)
        self.sent_proj = nn.Linear(sent_dim, d_model)

        # TS encoder (Transformer or LSTM)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dropout=dropout, batch_first=True
        )
        self.ts_encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

        # Cross-attention: TS query attends to sentiment key/value
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads, dropout=dropout, batch_first=True
        )

        # Regime gating (optional but powerful)
        self.regime_gate = nn.Sequential(
            nn.Linear(d_model * 2, d_model),
            nn.Sigmoid()  # Gate 0-1: how much sentiment to use
        )

        # Output heads (multi-task)
        self.price_head = nn.Linear(d_model, 3)       # quantiles: 10%, 50%, 90%
        self.direction_head = nn.Linear(d_model, 1)   # up/down binary
        self.volatility_head = nn.Linear(d_model, 1)  # volatility regime

    def forward(self, x_ts, x_sent):
        # Project into a common space
        ts_emb = self.ts_proj(x_ts)        # (B, T, d_model)
        sent_emb = self.sent_proj(x_sent)  # (B, T, d_model)

        # Encode the time-series
        ts_out = self.ts_encoder(ts_emb)   # (B, T, d_model)

        # Cross-attention: TS query, sentiment key/value
        fused, attn_weights = self.cross_attn(
            query=ts_out, key=sent_emb, value=sent_emb
        )

        # Regime-aware gating
        gate = self.regime_gate(torch.cat([ts_out, fused], dim=-1))
        output = gate * fused + (1 - gate) * ts_out  # Blend according to the gate

        # Take the representation of the last timestep
        z = output[:, -1, :]  # (B, d_model)
        return {
            "price_quantiles": self.price_head(z),
            "direction": self.direction_head(z).squeeze(-1),
            "volatility": self.volatility_head(z).squeeze(-1),
            "attention_weights": attn_weights  # for visualization
        }
```

### 3.4 Market Regime Detector (Add-On Component)

```python
from hmmlearn import hmm
import numpy as np

class MarketRegimeDetector:
    """
    Detect market regimes: bull / bear / sideways / high-volatility.
    The output is fed as an extra feature into the main model.
    """
    def __init__(self, n_regimes=4):
        self.model = hmm.GaussianHMM(
            n_components=n_regimes, covariance_type="full", n_iter=1000
        )
        self.n_regimes = n_regimes

    def fit(self, returns: np.ndarray):
        obs = np.column_stack([returns, np.abs(returns)])  # return + volatility
        self.model.fit(obs)
        return self

    def predict(self, returns: np.ndarray) -> np.ndarray:
        obs = np.column_stack([returns, np.abs(returns)])
        return self.model.predict(obs)

    def get_regime_probs(self, returns: np.ndarray) -> np.ndarray:
        obs = np.column_stack([returns, np.abs(returns)])
        return self.model.predict_proba(obs)  # (T, n_regimes)
```

### 3.5 Multi-Task Training Loop

```python
import pytorch_lightning as pl

class CryptoPredictorLightning(pl.LightningModule):
    def __init__(self, model, lr=1e-3, lambda_price=1.0, lambda_dir=0.5, lambda_vol=0.3):
        super().__init__()
        self.model = model
        self.lr = lr
        # Loss weights for multi-task learning
        self.lp = lambda_price
        self.ld = lambda_dir
        self.lv = lambda_vol

    def training_step(self, batch, batch_idx):
        x_ts, x_sent, y_price, y_dir, y_vol = batch
        out = self.model(x_ts, x_sent)

        # Quantile loss for price
        q_loss = self._quantile_loss(out["price_quantiles"], y_price)
        # Binary CE for direction
        d_loss = nn.BCEWithLogitsLoss()(out["direction"], y_dir.float())
        # MSE for volatility
        v_loss = nn.MSELoss()(out["volatility"], y_vol)

        # Total weighted loss
        loss = self.lp * q_loss + self.ld * d_loss + self.lv * v_loss
        self.log_dict({"train/loss": loss, "train/q_loss": q_loss,
                       "train/d_loss": d_loss, "train/v_loss": v_loss})
        return loss

    def _quantile_loss(self, preds, targets, quantiles=[0.1, 0.5, 0.9]):
        losses = []
        for i, q in enumerate(quantiles):
            errors = targets - preds[:, i]
            losses.append(torch.max((q - 1) * errors, q * errors).mean())
        return torch.stack(losses).mean()

    def configure_optimizers(self):
        opt = torch.optim.AdamW(self.model.parameters(), lr=self.lr, weight_decay=1e-4)
        sch = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=self.lr, total_steps=self.trainer.estimated_stepping_batches
        )
        return [opt], [{"scheduler": sch, "interval": "step"}]
```

---

## Section 4: Evaluation & Validation

### 4.1 Walk-Forward Validation (MANDATORY — never use a random split!)

```python
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validate(model_fn, df, n_splits=5, test_size=720):
    """
    Walk-forward validation: train on the past, test on the future.
    test_size: number of hours in the test set (720 = 30 days)
    """
    results = []
    tscv = TimeSeriesSplit(n_splits=n_splits, test_size=test_size)

    for fold, (train_idx, test_idx) in enumerate(tscv.split(df)):
        df_train = df.iloc[train_idx]
        df_test = df.iloc[test_idx]

        model = model_fn()
        model.fit(df_train)
        preds = model.predict(df_test)

        metrics = evaluate_predictions(preds, df_test["close"])
        metrics["fold"] = fold
        metrics["train_end"] = df_train.index[-1]
        metrics["test_start"] = df_test.index[0]
        results.append(metrics)
        print(f"Fold {fold}: {metrics}")

    return pd.DataFrame(results)
```

### 4.2 Complete Evaluation Metrics

```python
def evaluate_predictions(preds, actuals, prices=None):
    """Comprehensive evaluation: statistics + trading metrics."""
    metrics = {}

    # Prediction statistics
    metrics["rmse"] = np.sqrt(np.mean((preds - actuals) ** 2))
    metrics["mae"] = np.mean(np.abs(preds - actuals))
    metrics["mape"] = np.mean(np.abs((preds - actuals) / actuals)) * 100

    # Hit ratio: percentage of correctly predicted directions
    actual_dir = np.sign(np.diff(actuals))
    pred_dir = np.sign(np.diff(preds))
    metrics["hit_ratio"] = np.mean(actual_dir == pred_dir)

    # Trading metrics (if prices are provided)
    if prices is not None:
        returns = np.diff(np.log(prices))
        strategy_returns = returns * np.sign(pred_dir)

        # Sharpe Ratio (annualized, hourly data)
        metrics["sharpe"] = (strategy_returns.mean() / strategy_returns.std()) * np.sqrt(8760)

        # Sortino Ratio
        downside = strategy_returns[strategy_returns < 0].std()
        metrics["sortino"] = (strategy_returns.mean() / downside) * np.sqrt(8760)

        # Max Drawdown (numpy arrays have no .cummax(); use maximum.accumulate)
        cumulative = np.cumprod(1 + strategy_returns)
        rolling_max = np.maximum.accumulate(cumulative)
        drawdown = (cumulative - rolling_max) / rolling_max
        metrics["max_drawdown"] = drawdown.min()

    return metrics
```

### 4.3 Causal Probe: Detecting Data Leakage

```python
def causal_probe_test(model, df_test, shift_hours=24):
    """
    Test whether the model accidentally uses future data.
    Shift sentiment features by +N hours — if performance IMPROVES, there is leakage!
    """
    # Normal prediction
    normal_preds = model.predict(df_test)
    normal_metrics = evaluate_predictions(normal_preds, df_test["close"])

    # Prediction with shifted (future) sentiment
    df_shifted = df_test.copy()
    sent_cols = [c for c in df_test.columns if "sent_" in c]
    df_shifted[sent_cols] = df_shifted[sent_cols].shift(-shift_hours)
    df_shifted = df_shifted.dropna()

    shifted_preds = model.predict(df_shifted)
    shifted_metrics = evaluate_predictions(shifted_preds, df_shifted["close"])

    improvement = shifted_metrics["hit_ratio"] - normal_metrics["hit_ratio"]
    print(f"Hit ratio normal: {normal_metrics['hit_ratio']:.4f}")
    print(f"Hit ratio shifted (+{shift_hours}h): {shifted_metrics['hit_ratio']:.4f}")
    print(f"Improvement from future data: {improvement:.4f}")

    if improvement > 0.05:
        print("⚠️ WARNING: possible data leakage!")
    else:
        print("✓ No sign of significant leakage.")

    return {"normal": normal_metrics, "shifted": shifted_metrics,
            "leakage_delta": improvement}
```

### 4.4 Explainability with SHAP

```python
import shap

def explain_predictions(model, X_background, X_explain, feature_names):
    """Compute SHAP values for model interpretation."""
    # For tree-based models
    # explainer = shap.TreeExplainer(model)

    # For neural networks (model-agnostic, slower)
    explainer = shap.KernelExplainer(
        lambda x: model.predict(x),
        shap.sample(X_background, 100)  # background sample
    )
    shap_values = explainer.shap_values(X_explain)

    # Summary plot
    shap.summary_plot(shap_values, X_explain, feature_names=feature_names)

    # Mean feature importance
    importance = pd.DataFrame({
        "feature": feature_names,
        "importance": np.abs(shap_values).mean(0)
    }).sort_values("importance", ascending=False)

    return importance, shap_values
```

---

## Section 5: Deployment & Dashboard

### 5.1 Streamlit Dashboard (Minimum Viable)

```python
# dashboard/app.py
import streamlit as st
import plotly.graph_objects as go

st.set_page_config(page_title="Crypto Prediction Dashboard", layout="wide")
st.title("🔮 Crypto Price Prediction — Multimodal ML")

# Sidebar controls
with st.sidebar:
    coin = st.selectbox("Coin", ["BTC/USDT", "ETH/USDT"])
    horizon = st.selectbox("Prediction horizon", ["1 hour", "4 hours", "24 hours"])
    show_shap = st.checkbox("Show SHAP explanation", value=True)

col1, col2, col3 = st.columns(3)

# Headline metrics
with col1:
    st.metric("Predicted direction", "↑ Up", delta="Confidence 72%")
with col2:
    st.metric("Volatility regime", "Medium", delta="Stable")
with col3:
    st.metric("On-chain sentiment", "+0.42 (Bullish)", delta="+0.08 vs yesterday")

# Price chart + prediction
fig = go.Figure()
# ... add historical price traces and prediction intervals
st.plotly_chart(fig, use_container_width=True)

# Top contributors (from SHAP)
if show_shap:
    st.subheader("Prediction drivers")
    st.caption("Top contributors to the current up/down prediction:")
    # Display a bar chart of the top SHAP values
    top_factors = [
        ("Exchange outflow (+500 BTC)", 0.38, "bullish"),
        ("ETF approval news sentiment", 0.31, "bullish"),
        ("RSI overbought (74)", -0.18, "bearish"),
        ("MVRV approaching 3.5", -0.12, "bearish"),
    ]
    for factor, val, signal in top_factors:
        color = "🟢" if signal == "bullish" else "🔴"
        st.write(f"{color} **{factor}** — contribution: {val:+.2f}")
```

### 5.2 MLflow Tracking

```python
import mlflow
import mlflow.pytorch

def train_with_tracking(model, train_loader, val_loader, config):
    mlflow.set_experiment("crypto-prediction")

    with mlflow.start_run(run_name=f"fusion_{config['architecture']}"):
        # Log hyperparameters
        mlflow.log_params({
            "architecture": config["architecture"],
            "lookback_window": config["lookback"],
            "learning_rate": config["lr"],
            "batch_size": config["batch_size"],
            "sentiment_model": config["sentiment_model"],
        })

        # Training loop
        trainer = pl.Trainer(max_epochs=config["epochs"])
        trainer.fit(model, train_loader, val_loader)

        # Log metrics
        val_metrics = evaluate_model(model, val_loader)
        mlflow.log_metrics(val_metrics)

        # Save the model
        mlflow.pytorch.log_model(model, "model")

    return model
```

### 5.3 Drift Detection (Production Monitoring)

```python
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def check_feature_drift(df_reference, df_current):
    """
    Detect whether feature distributions have shifted — common after a
    market regime change.
    """
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=df_reference, current_data=df_current)

    drift_results = report.as_dict()
    drifted_features = [
        col for col, res in drift_results["metrics"][0]["result"]["drift_by_columns"].items()
        if res["drift_detected"]
    ]

    if drifted_features:
        print(f"⚠️ Drift detected in features: {drifted_features}")
        print("Consider retraining the model!")

    return drifted_features
```

---

## Section 6: Common-Mistakes Checklist

Before submitting or deploying, make sure every item below is satisfied:

### Data & Leakage
- [ ] All features use only data available BEFORE the prediction timestamp
- [ ] Sentiment uses `published_at`, not `scraped_at`
- [ ] On-chain data is forward-filled (not backward-filled)
- [ ] The causal probe test has been run with no significant leakage found
- [ ] Train/val/test split uses TimeSeriesSplit, not a random split

### Model & Training
- [ ] Log-returns are used as the target (not non-stationary absolute prices)
- [ ] Normalization is fit on the training set and only applied to the test set
- [ ] No unintended weight sharing between train and test
- [ ] Multi-task loss weights have been tuned

### Evaluation
- [ ] Walk-forward validation with at least 3 folds
- [ ] Sharpe Ratio computed, not just RMSE
- [ ] Performance compared against a naive baseline (yesterday's price = tomorrow's prediction)
- [ ] SHAP or attention visualization available

### Deployment
- [ ] Drift detection active
- [ ] MLflow tracking enabled for all experiments
- [ ] Models saved with clear versioning

---

## References & Further Resources

- **Ready-made datasets**: Kaggle — "Bitcoin Historical Data", "Crypto Sentiment Analysis"
- **Free on-chain data**: CoinMetrics Community API, Dune Analytics (SQL queries)
- **Pretrained models**:
  - `ProsusAI/finbert` — general financial sentiment
  - `ElKulako/cryptobert` — fine-tuned specifically for crypto
  - `neuralforecast` library — TFT, PatchTST out of the box
- **Reference papers**:
  - "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (Lim et al., 2021)
  - "A Transformer-based Framework for Multivariate Time Series Representation Learning" (Zerveas et al., 2021)
  - "Bitcoin Price Prediction Using Machine Learning: An Approach to Sample Dimension Engineering" — survey of on-chain metrics

---

## Recommended Development Order

Follow this sequence for the best results:

1. **Week 1**: Set up the data pipeline — price + one on-chain source + VADER sentiment
2. **Week 2**: Baseline LSTM, walk-forward evaluation, causal probe
3. **Week 3**: Add FinBERT/CryptoBERT, upgrade to cross-attention fusion
4. **Week 4**: Multi-task head, regime detector, SHAP explanation
5. **Week 5**: Streamlit dashboard, MLflow tracking
6. **Week 6**: Drift detection, polish, research documentation
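---

## Appendix: Naive Persistence Baseline

The evaluation checklist requires comparing every model against the naive baseline ("yesterday's price = tomorrow's prediction"), but no code for it appears above. Below is a minimal sketch: the helper name `persistence_baseline` is ours, and the metric definitions deliberately mirror `evaluate_predictions` from Section 4.2 so the numbers are comparable.

```python
import numpy as np

def persistence_baseline(closes: np.ndarray) -> dict:
    """Naive benchmark: forecast that the next price equals the current one.

    Any multimodal model should beat this on RMSE and hit ratio before its
    extra complexity is justified.
    """
    preds = closes[:-1]    # forecast for t+1 is simply the price at t
    actuals = closes[1:]

    rmse = float(np.sqrt(np.mean((preds - actuals) ** 2)))

    # For direction, persistence reduces to a momentum rule: the predicted
    # move at t+1 is the realized move at t (same comparison as in
    # evaluate_predictions, which diffs the prediction series).
    actual_dir = np.sign(np.diff(actuals))
    pred_dir = np.sign(np.diff(preds))
    hit_ratio = float(np.mean(actual_dir == pred_dir))

    return {"rmse": rmse, "hit_ratio": hit_ratio}
```

Run it on the same walk-forward test windows as the multimodal model and report both rows side by side; if the fusion model's hit ratio does not clear this baseline, revisit the feature pipeline before tuning the architecture.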