---
name: bio-workflows-biomarker-pipeline
description: End-to-end biomarker discovery workflow from expression data to validated biomarker panels. Covers feature selection with Boruta/LASSO, classifier training with nested CV, and SHAP interpretation. Use when building and validating diagnostic or prognostic biomarker signatures from omics data.
tool_type: python
primary_tool: sklearn
workflow: true
depends_on:
  - machine-learning/biomarker-discovery
  - machine-learning/model-validation
  - machine-learning/omics-classifiers
  - machine-learning/prediction-explanation
qc_checkpoints:
  - after_selection: "Selected features >5 and <200, stability >0.6"
  - after_cv: "Nested CV AUC >0.7, fold variance <0.1"
  - after_interpretation: "Top 20 SHAP features: >50% overlap with selected features"
---

# Biomarker Discovery Pipeline

Complete pipeline from expression data to validated biomarker panels with classifier.

## Workflow Overview

```
Expression matrix + Metadata
    |
    v
[1. Data Preparation] -----> StandardScaler, train/test split
    |
    v
[2. Feature Selection] ----> Boruta or LASSO stability selection
    |
    v
[3. Model Training] -------> RandomForest/XGBoost with nested CV
    |
    v
[4. Model Interpretation] -> SHAP values, feature importance
    |
    v
[5. Validation] -----------> Hold-out test, bootstrap CI
    |
    v
Validated biomarker panel + classifier
```

## Step 1: Data Preparation

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

expr = pd.read_csv('expression.csv', index_col=0)
meta = pd.read_csv('metadata.csv', index_col=0)

X = expr.T  # samples x genes
y = meta.loc[X.index, 'condition'].values

# test_size=0.2: Standard 80/20 split; use 0.3 for <100 samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit scaler on training only to prevent data leakage
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

**QC Checkpoint 1:** Check class balance, sample counts per group
- Minimum 10 samples per class recommended
- Classes should be reasonably balanced (ratio <3:1)

## Step 2: Feature Selection

### Option A: Boruta (All-Relevant Selection)

```python
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Pre-filter if >10k features
if X_train_scaled.shape[1] > 10000:
    selector = SelectKBest(f_classif, k=5000)
    selector.fit(X_train_scaled, y_train)
    X_train_filt = X_train_scaled[:, selector.get_support()]
    feature_mask = selector.get_support()
else:
    X_train_filt = X_train_scaled
    feature_mask = None

# max_depth=5: Shallow trees for stable importances
rf = RandomForestClassifier(n_estimators=100, max_depth=5, n_jobs=-1, random_state=42)
# max_iter=100: Usually sufficient; 200 if many tentative
boruta = BorutaPy(rf, n_estimators='auto', max_iter=100, random_state=42, verbose=0)
boruta.fit(X_train_filt, y_train)

selected_idx = boruta.support_
print(f'Selected {selected_idx.sum()} features')
```

### Option B: LASSO Stability Selection

```python
from sklearn.linear_model import LogisticRegressionCV
import numpy as np

# n_bootstrap=100: Quick; use 500 for publication
n_bootstrap = 100
stability_scores = np.zeros(X_train_scaled.shape[1])

for i in range(n_bootstrap):
    idx = np.random.choice(len(y_train), size=len(y_train), replace=True)
    # Cs=10: 10 regularization values to search
    model = LogisticRegressionCV(penalty='l1', solver='saga', Cs=10, cv=3, random_state=i, max_iter=1000)
    model.fit(X_train_scaled[idx], y_train[idx])
    stability_scores += (model.coef_[0] != 0).astype(int)

stability_scores /= n_bootstrap
# stability_threshold=0.6: Standard; 0.8 for strict
selected_idx = stability_scores > 0.6
print(f'Selected {selected_idx.sum()} features (stability >0.6)')
```

**QC Checkpoint 2:**
- Selected features: 5-200 range
- Too few (<5): lower threshold, increase iterations
- Too many (>200): increase threshold, add pre-filtering

## Step 3: Model Training with Nested CV

```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X_train_sel = X_train_scaled[:, selected_idx]
X_test_sel = X_test_scaled[:, selected_idx]

# outer_cv=5: Standard for performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# n_estimators=100: Sufficient for most omics
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
cv_scores = cross_val_score(clf, X_train_sel, y_train, cv=outer_cv, scoring='roc_auc')

print(f'Nested CV AUC: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')
```

**QC Checkpoint 3:**
- AUC >0.7 acceptable, >0.8 good
- Fold variance <0.1 (stable performance)
- Check for overfitting: train AUC should not be >>test AUC

## Step 4: Model Interpretation

```python
import shap
import matplotlib.pyplot as plt

clf.fit(X_train_sel, y_train)

# SHAP v0.47+: call explainer directly
explainer = shap.TreeExplainer(clf)
shap_values = explainer(X_train_sel)

# Beeswarm: shows importance AND direction
shap.plots.beeswarm(shap_values, max_display=20, show=False)
plt.tight_layout()
plt.savefig('shap_beeswarm.png', dpi=150, bbox_inches='tight')
plt.close()

# Extract top features
import numpy as np
mean_shap = np.abs(shap_values.values).mean(axis=0)
top_shap_idx = np.argsort(mean_shap)[-20:]
```

**QC Checkpoint 4:**
- Top 20 SHAP features should have >50% overlap with selected features
- SHAP directions should be biologically plausible

## Step 5: Final Validation

```python
from sklearn.metrics import roc_auc_score, classification_report
import numpy as np

y_prob = clf.predict_proba(X_test_sel)[:, 1]
test_auc = roc_auc_score(y_test, y_prob)
print(f'Hold-out test AUC: {test_auc:.3f}')

# Bootstrap CI for AUC
# n_bootstrap=1000: Standard for publication-quality CI
n_bootstrap = 1000
boot_aucs = []
for i in range(n_bootstrap):
    idx = np.random.choice(len(y_test), size=len(y_test), replace=True)
    boot_aucs.append(roc_auc_score(y_test[idx], y_prob[idx]))

ci_lower, ci_upper = np.percentile(boot_aucs, [2.5, 97.5])
print(f'95% CI: [{ci_lower:.3f}, {ci_upper:.3f}]')

print(classification_report(y_test, clf.predict(X_test_sel)))
```

## Parameter Recommendations

| Step | Parameter | Recommendation |
|------|-----------|----------------|
| Split | test_size | 0.2 (standard), 0.3 for small datasets |
| Boruta | max_iter | 100 (sufficient), 200 if tentative features |
| LASSO | n_bootstrap | 100 (quick), 500 for publication |
| LASSO | stability_threshold | 0.6 (standard), 0.8 for strict |
| Nested CV | outer_folds | 5 (standard), 10 for small datasets |
| Nested CV | inner_folds | 3 (sufficient for tuning) |
| RF | n_estimators | 100-500 |
| XGBoost | learning_rate | 0.1 (conservative) |

## Troubleshooting

| Issue | Likely Cause | Solution |
|-------|--------------|----------|
| No features selected | Too strict threshold | Lower stability threshold, increase iterations |
| Too many features (>200) | Noisy data | Add pre-filtering, increase regularization |
| Low CV AUC (<0.6) | No signal, low power | Check data quality, add samples |
| High variance across folds | Small sample size | Use more folds, LOOCV |
| SHAP features differ from selected | Model using different signal | Review feature correlations |

## Export Results

```python
import pandas as pd
import joblib

# Save biomarker panel
feature_names = X_train.columns[selected_idx].tolist()
pd.DataFrame({'feature': feature_names}).to_csv('biomarker_panel.csv', index=False)

# Save model and scaler for deployment
joblib.dump(clf, 'biomarker_classifier.joblib')
joblib.dump(scaler, 'feature_scaler.joblib')
```

## Related Skills

- machine-learning/biomarker-discovery - Detailed feature selection methods
- machine-learning/model-validation - Nested CV implementation details
- machine-learning/omics-classifiers - Classifier options and tuning
- machine-learning/prediction-explanation - SHAP and LIME interpretation
- differential-expression/de-results - Pre-filter with DE genes
- pathway-analysis/go-enrichment - Functional enrichment of biomarkers