--- name: tooluniverse-multi-omics-integration description: Integrate and analyze multiple omics datasets (transcriptomics, proteomics, epigenomics, genomics, metabolomics) for systems biology and precision medicine. Performs cross-omics correlation, multi-omics clustering (MOFA+, NMF), pathway-level integration, and sample matching. Coordinates ToolUniverse skills for expression data (RNA-seq), epigenomics (methylation, ChIP-seq), variants (SNVs, CNVs), protein interactions, and pathway enrichment. Use when analyzing multi-omics datasets, performing integrative analysis, discovering multi-omics biomarkers, studying disease mechanisms across molecular layers, or conducting systems biology research that requires coordinated analysis of transcriptome, genome, epigenome, proteome, and metabolome data. --- # Multi-Omics Integration Coordinate and integrate multiple omics datasets for comprehensive systems biology analysis. This skill orchestrates specialized ToolUniverse skills to perform cross-omics correlation, multi-omics clustering, pathway-level integration, and unified interpretation across molecular layers. ## When to Use This Skill **Triggers**: - User has multiple omics datasets (RNA-seq + proteomics, methylation + expression, etc.) - Requests for integrative multi-omics analysis - Cross-omics correlation queries (e.g., "How does methylation affect expression?") - Multi-omics biomarker discovery - Systems biology questions requiring multiple molecular layers - Precision medicine applications with multi-omics patient data - Questions about molecular mechanisms across omics types **Example Questions This Skill Solves**: 1. "Integrate RNA-seq and proteomics data to find genes with concordant changes" 2. "How does promoter methylation correlate with gene expression?" 3. "Perform multi-omics clustering to identify patient subtypes" 4. "Which pathways are dysregulated across transcriptome, proteome, and metabolome?" 5. "Find multi-omics biomarkers for disease classification" 6. "Correlate CNV with gene expression to identify dosage effects" 7. "Integrate GWAS variants, eQTLs, and expression data" 8. "Perform MOFA+ analysis on multi-omics cancer data" --- ## Core Capabilities | Capability | Description | |-----------|-------------| | **Data Integration** | Match samples across omics, handle missing data, normalize scales | | **Cross-Omics Correlation** | Correlate features across molecular layers (gene expression vs protein, methylation vs expression) | | **Multi-Omics Clustering** | MOFA+, NMF, joint clustering to identify omics-driven subtypes | | **Pathway Integration** | Combine omics evidence at pathway level for unified biological interpretation | | **Biomarker Discovery** | Identify multi-omics signatures with improved predictive power | | **Skill Coordination** | Orchestrate RNA-seq, epigenomics, variant-analysis, protein-interactions, gene-enrichment skills | | **Visualization** | Circos plots, integrated heatmaps, network visualizations | | **Reporting** | Unified multi-omics reports with cross-layer insights | --- ## Workflow Overview ``` Input: Multiple Omics Datasets | v Phase 1: Data Loading & QC |-- Load RNA-seq (expression matrix) |-- Load proteomics (protein abundance) |-- Load methylation (beta values or M-values) |-- Load variants (CNV, SNV from VCF) |-- Load metabolomics (metabolite abundance) |-- Quality control per omics type | v Phase 2: Sample Matching |-- Match samples across omics by ID |-- Identify common samples |-- Handle batch effects |-- Normalize sample identifiers | v Phase 3: Feature Mapping |-- Map features to common identifier space (genes, proteins, metabolites) |-- Link CpG sites to genes (promoter, gene body) |-- Map variants to genes |-- Create unified feature matrix | v Phase 4: Cross-Omics Correlation |-- Gene expression vs protein abundance (translation efficiency) |-- Promoter methylation vs expression (epigenetic regulation) |-- CNV vs expression (dosage effect) |-- eQTL variants vs expression (genetic regulation) |-- Metabolite vs enzyme expression (metabolic flux) | v Phase 5: Multi-Omics Clustering |-- MOFA+ (Multi-Omics Factor Analysis) for latent factors |-- NMF (Non-negative Matrix Factorization) for patient subtypes |-- Joint clustering across omics |-- Identify omics-specific vs shared variation | v Phase 6: Pathway-Level Integration |-- Aggregate omics to pathway level |-- Score pathway dysregulation (combined evidence) |-- Use ToolUniverse enrichment tools (Reactome, KEGG, GO) |-- Identify driver pathways across omics | v Phase 7: Biomarker Discovery |-- Feature selection across omics |-- Multi-omics signatures for classification |-- Cross-validation and performance |-- Interpretation and biological validation | v Phase 8: Generate Integrated Report |-- Summary statistics per omics |-- Cross-omics correlation results |-- Multi-omics clusters and subtypes |-- Top dysregulated pathways |-- Multi-omics biomarkers |-- Biological interpretation ``` --- ## Phase Details ### Phase 1: Data Loading & Quality Control **Objective**: Load multiple omics datasets and perform quality control. **Supported omics types**: - **Transcriptomics**: RNA-seq count matrices, microarray - **Proteomics**: Protein abundance (MS-based) - **Epigenomics**: Methylation (450K, EPIC arrays, WGBS), ChIP-seq peaks - **Genomics**: CNV, SNV, structural variants - **Metabolomics**: Metabolite abundance (targeted, untargeted) **Data formats**: - Expression: CSV/TSV matrices, HDF5, AnnData (.h5ad) - Proteomics: MaxQuant output, Spectronaut, DIA-NN - Methylation: IDAT files, beta value matrices - Variants: VCF, SEG files (CNV) - Metabolomics: Peak tables, identified metabolites **Quality control per omics**: ```python # RNA-seq QC - Filter low-count genes (mean counts < threshold) - Normalize (TPM, FPKM, or DESeq2) - Log-transform for correlation # Proteomics QC - Filter proteins with high missing values - Impute missing values (minimum, KNN) - Normalize (median, quantile) # Methylation QC - Remove failed probes - Correct for batch effects (ComBat) - Filter cross-reactive probes # Variants QC - Use variant-analysis skill for VCF QC - CNV segmentation validation ``` ### Phase 2: Sample Matching **Objective**: Identify common samples across omics datasets. **Sample ID harmonization**: ```python def match_samples_across_omics(omics_data_dict): """ Match samples across multiple omics datasets. Parameters: omics_data_dict: { 'rnaseq': DataFrame (genes x samples), 'proteomics': DataFrame (proteins x samples), 'methylation': DataFrame (CpGs x samples), 'cnv': DataFrame (genes x samples) } Returns: - common_samples: List of sample IDs present in all omics - matched_data: Dict of DataFrames with common samples only """ # Extract sample IDs from each omics sample_ids = { omics_type: set(df.columns) for omics_type, df in omics_data_dict.items() } # Find common samples (intersection) common_samples = set.intersection(*sample_ids.values()) # Subset each omics to common samples matched_data = { omics_type: df[sorted(common_samples)] for omics_type, df in omics_data_dict.items() } return sorted(common_samples), matched_data ``` **Handling missing omics**: - Pairwise integration if not all samples have all omics - Document sample availability matrix ### Phase 3: Feature Mapping **Objective**: Map features from different omics to common gene-level identifiers. **Gene-centric integration**: ```python # Map all features to genes feature_mapping = { 'rnaseq': 'gene_symbol', # Already gene-level 'proteomics': 'gene_symbol', # Map protein to gene 'methylation': 'gene_symbol', # Map CpG to gene (promoter) 'cnv': 'gene_symbol', # CNV regions to overlapping genes 'metabolomics': 'enzyme_gene' # Metabolite to enzyme gene } ``` **CpG to gene mapping**: - **Promoter methylation**: CpGs within TSS ± 2kb - **Gene body methylation**: CpGs within gene boundaries - Average methylation per gene (weighted by probe coverage) **CNV to gene mapping**: - Use variant-analysis skill to identify genes in CNV regions - Calculate copy number per gene (log2 ratio) ### Phase 4: Cross-Omics Correlation **Objective**: Correlate features across molecular layers to understand regulation. **Example analyses**: #### 4.1: Expression vs Protein (Translation Efficiency) ```python def correlate_rna_protein(rnaseq_data, proteomics_data): """ Correlate mRNA and protein levels for each gene. Expected: Positive correlation (r ~ 0.4-0.6 typical) Discordance indicates post-transcriptional regulation """ # Find common genes common_genes = set(rnaseq_data.index) & set(proteomics_data.index) correlations = {} for gene in common_genes: rna = rnaseq_data.loc[gene] protein = proteomics_data.loc[gene] # Spearman correlation (robust to outliers) r, p = spearmanr(rna, protein) correlations[gene] = {'r': r, 'p': p} # Identify discordant genes (low RNA-protein correlation) discordant = {g: v for g, v in correlations.items() if abs(v['r']) < 0.2} return correlations, discordant ``` #### 4.2: Methylation vs Expression (Epigenetic Regulation) ```python def correlate_methylation_expression(methylation_data, rnaseq_data): """ Correlate promoter methylation with gene expression. Expected: Negative correlation (increased methylation → decreased expression) """ # For each gene with promoter methylation results = {} for gene in methylation_data.index: if gene in rnaseq_data.index: meth = methylation_data.loc[gene] # Average promoter beta expr = rnaseq_data.loc[gene] r, p = spearmanr(meth, expr) results[gene] = {'r': r, 'p': p, 'direction': 'repressive' if r < 0 else 'activating'} # Identify genes with strong methylation-expression anticorrelation regulated = {g: v for g, v in results.items() if v['r'] < -0.5 and v['p'] < 0.01} return results, regulated ``` #### 4.3: CNV vs Expression (Dosage Effect) ```python def correlate_cnv_expression(cnv_data, rnaseq_data): """ Correlate copy number with gene expression. Expected: Positive correlation (gene dosage effect) """ results = {} for gene in cnv_data.index: if gene in rnaseq_data.index: cnv = cnv_data.loc[gene] # log2 ratio expr = rnaseq_data.loc[gene] r, p = pearsonr(cnv, expr) results[gene] = {'r': r, 'p': p} # Genes with dosage effect (CNV drives expression) dosage_genes = {g: v for g, v in results.items() if v['r'] > 0.5 and v['p'] < 0.01} return results, dosage_genes ``` ### Phase 5: Multi-Omics Clustering **Objective**: Identify patient subtypes using integrated omics data. **Method 1: MOFA+ (Multi-Omics Factor Analysis)** MOFA+ identifies latent factors that explain variation across omics. ```python # Conceptual workflow (uses R's MOFA2 package or Python implementation) # 1. Prepare multi-omics data as list of matrices # 2. Run MOFA+ to identify factors # 3. Inspect factor variance explained per omics # 4. Cluster samples based on factor scores # Example interpretation: # Factor 1: Explains 40% variance in RNA-seq, 30% in proteomics → Cell proliferation # Factor 2: Explains 50% variance in methylation → Epigenetic subtype # Factor 3: Explains 20% variance in CNV → Genomic instability ``` **Method 2: Joint NMF (Non-negative Matrix Factorization)** Decompose multi-omics matrices into shared latent components. ```python def joint_nmf_clustering(omics_data_dict, n_clusters=3): """ Perform joint NMF across omics for clustering. Returns patient cluster assignments based on shared factors. """ # Concatenate omics matrices (after normalization) combined_matrix = np.vstack([ omics_data_dict['rnaseq'].values, omics_data_dict['proteomics'].values, omics_data_dict['methylation'].values ]) # Run NMF from sklearn.decomposition import NMF model = NMF(n_components=n_clusters, init='nndsvd', random_state=42) W = model.fit_transform(combined_matrix) # Feature loadings H = model.components_ # Sample coefficients # Cluster samples based on H (components) from sklearn.cluster import KMeans clusters = KMeans(n_clusters=n_clusters).fit_predict(H.T) return clusters, W, H ``` **Method 3: Similarity Network Fusion (SNF)** Integrate omics through patient similarity networks. ### Phase 6: Pathway-Level Integration **Objective**: Aggregate multi-omics evidence at the pathway level. **Approach**: Score pathway dysregulation using combined evidence from multiple omics. ```python def integrate_pathway_evidence(omics_results, pathway_genes): """ Score pathway dysregulation across omics. omics_results: { 'rnaseq': {'gene': fold_change}, 'proteomics': {'gene': fold_change}, 'methylation': {'gene': methylation_diff}, 'cnv': {'gene': copy_number} } pathway_genes: List of genes in pathway """ # For each gene in pathway pathway_scores = [] for gene in pathway_genes: gene_score = 0 evidence_count = 0 # RNA-seq evidence if gene in omics_results['rnaseq']: gene_score += abs(omics_results['rnaseq'][gene]) evidence_count += 1 # Proteomics evidence if gene in omics_results['proteomics']: gene_score += abs(omics_results['proteomics'][gene]) evidence_count += 1 # Methylation evidence (negative correlation) if gene in omics_results['methylation']: gene_score += abs(omics_results['methylation'][gene]) evidence_count += 1 # CNV evidence if gene in omics_results['cnv']: gene_score += abs(omics_results['cnv'][gene]) evidence_count += 1 if evidence_count > 0: pathway_scores.append(gene_score / evidence_count) # Aggregate pathway score (mean of gene scores) pathway_score = np.mean(pathway_scores) if pathway_scores else 0 return { 'pathway_score': pathway_score, 'n_genes_with_evidence': len(pathway_scores), 'n_omics_types': evidence_count } ``` **Use ToolUniverse enrichment tools**: ```python # Get pathways for gene set from tooluniverse import ToolUniverse tu = ToolUniverse() # Enrichment for genes dysregulated in ANY omics all_dysregulated_genes = set() all_dysregulated_genes.update(rnaseq_degs) all_dysregulated_genes.update(diff_proteins) all_dysregulated_genes.update(methylation_dmgs) # Run enrichment enrichment = tu.run_one_function({ "name": "enrichr_enrich", "arguments": { "gene_list": ",".join(all_dysregulated_genes), "library": "KEGG_2021_Human" } }) # Score each pathway with multi-omics evidence for pathway in enrichment['data']['results']: pathway_genes = pathway['genes'] pathway['multi_omics_score'] = integrate_pathway_evidence( omics_results, pathway_genes ) ``` ### Phase 7: Biomarker Discovery **Objective**: Identify multi-omics signatures for disease classification. **Feature selection across omics**: ```python def select_multiomics_features(X_dict, y, n_features=50): """ Select top features across omics for classification. X_dict: { 'rnaseq': DataFrame (samples x genes), 'proteomics': DataFrame (samples x proteins), 'methylation': DataFrame (samples x CpGs) } y: Target labels (disease vs control) Returns: Selected features per omics """ from sklearn.feature_selection import SelectKBest, f_classif selected_features = {} for omics_type, X in X_dict.items(): selector = SelectKBest(f_classif, k=min(n_features, X.shape[1])) selector.fit(X, y) # Get selected feature names selected_idx = selector.get_support() selected_features[omics_type] = X.columns[selected_idx].tolist() return selected_features ``` **Multi-omics classification**: ```python def multiomics_classification(X_dict, y, selected_features): """ Train classifier using multi-omics features. """ from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import cross_val_score # Concatenate selected features from each omics X_combined = [] for omics_type, features in selected_features.items(): X_combined.append(X_dict[omics_type][features]) X_combined = pd.concat(X_combined, axis=1) # Train classifier clf = RandomForestClassifier(n_estimators=100, random_state=42) scores = cross_val_score(clf, X_combined, y, cv=5, scoring='roc_auc') return { 'mean_auc': scores.mean(), 'std_auc': scores.std(), 'n_features': X_combined.shape[1], 'features_per_omics': {k: len(v) for k, v in selected_features.items()} } ``` ### Phase 8: Integrated Reporting **Generate comprehensive multi-omics report**: ```markdown # Multi-Omics Integration Report ## Dataset Summary - **Omics Types**: RNA-seq, Proteomics, Methylation, CNV - **Common Samples**: 45 patients (30 disease, 15 control) - **Features**: 15,000 genes, 5,000 proteins, 450K CpGs, 20K CNV regions ## Cross-Omics Correlation ### RNA-Protein Correlation - **Overall correlation**: r = 0.52 (expected: 0.4-0.6) - **Highly correlated**: 3,245 genes (45%) - **Discordant genes**: 890 genes (post-transcriptional regulation) ### Methylation-Expression - **Promoter methylation**: Anticorrelation r = -0.41 - **Epigenetically regulated genes**: 1,256 genes (p < 0.01) - **Example**: BRCA1 promoter hypermethylation → 3-fold reduced expression ### CNV-Expression Dosage Effect - **Genes with dosage effect**: 445 genes (r > 0.5, p < 0.01) - **Example**: MYC amplification (3 copies) → 2.8-fold increased expression ## Multi-Omics Clustering ### MOFA+ Analysis - **Factor 1** (25% variance): Cell cycle genes (RNA + protein) - **Factor 2** (18% variance): Immune signature (RNA + methylation) - **Factor 3** (15% variance): Metabolic reprogramming (RNA + metabolites) ### Patient Subtypes - **Subtype 1** (n=18): High proliferation, MYC amplification - **Subtype 2** (n=15): Immune-enriched, hypomethylation - **Subtype 3** (n=12): Metabolic dysregulation, mitochondrial dysfunction ## Pathway Integration ### Top Dysregulated Pathways (Multi-Omics Score) 1. **Cell Cycle** (score: 8.5) - RNA (↑), Protein (↑), CNV (amplification) 2. **Immune Response** (score: 7.2) - RNA (↑), Methylation (hypo) 3. **Glycolysis** (score: 6.8) - RNA (↑), Metabolites (↑) ## Multi-Omics Biomarkers ### Classification Performance - **AUC**: 0.92 ± 0.04 (5-fold CV) - **Features**: 50 total (20 RNA, 15 protein, 10 methylation, 5 CNV) - **Top biomarkers**: - MYC expression (RNA) - CDK1 protein abundance - BRCA1 promoter methylation - TP53 CNV status ## Biological Interpretation The multi-omics analysis reveals three distinct disease subtypes driven by different molecular mechanisms: 1. **Proliferative subtype**: Characterized by MYC amplification driving coordinated upregulation of cell cycle genes at both RNA and protein levels. 2. **Immune subtype**: Hypomethylation of immune genes leading to increased expression and T-cell infiltration. 3. **Metabolic subtype**: Shift from oxidative phosphorylation to glycolysis, with concordant changes in enzyme expression and metabolite levels. These subtypes may respond differently to targeted therapies. ``` --- ## ToolUniverse Skills Coordination This skill orchestrates multiple specialized skills: | Skill | Used For | Phase | |-------|----------|-------| | `tooluniverse-rnaseq-deseq2` | Load and analyze RNA-seq data | Phase 1, 4 | | `tooluniverse-epigenomics` | Methylation analysis, ChIP-seq peaks | Phase 1, 4 | | `tooluniverse-variant-analysis` | CNV and SNV processing | Phase 1, 3, 4 | | `tooluniverse-protein-interactions` | Protein network context | Phase 6 | | `tooluniverse-gene-enrichment` | Pathway enrichment | Phase 6 | | `tooluniverse-expression-data-retrieval` | Public omics data retrieval | Phase 1 | | `tooluniverse-target-research` | Gene/protein annotation | Phase 3, 8 | --- ## Example Use Cases ### Use Case 1: Cancer Multi-Omics **Question**: "Integrate TCGA breast cancer RNA-seq, proteomics, methylation, and CNV data" **Workflow**: 1. Load 4 omics types for 500 patients 2. Match samples (450 common across all omics) 3. Correlate RNA-protein (identify translation-regulated genes) 4. Correlate methylation-expression (find epigenetically silenced genes) 5. Correlate CNV-expression (identify dosage-sensitive genes) 6. Run MOFA+ to find latent factors 7. Identify 4 subtypes with distinct multi-omics profiles 8. Perform pathway enrichment per subtype 9. Select multi-omics biomarkers (AUC=0.94) ### Use Case 2: eQTL + Expression **Question**: "How do GWAS variants affect gene expression through methylation?" **Workflow**: 1. Load genotype data (SNPs from GWAS) 2. Load expression data (RNA-seq) 3. Load methylation data (450K array) 4. For each GWAS SNP: - Test association with nearby gene expression (eQTL) - Test association with nearby CpG methylation (meQTL) - Test CpG-gene correlation 5. Identify SNP → methylation → expression regulatory chains 6. Annotate with ToolUniverse (GWAS traits, gene function) ### Use Case 3: Drug Response Multi-Omics **Question**: "Predict drug response using multi-omics profiles" **Workflow**: 1. Load baseline multi-omics (pre-treatment) 2. Load drug response data (IC50 or clinical response) 3. Correlate each omics with response 4. Select multi-omics features predictive of response 5. Train multi-omics classifier 6. Identify pathways associated with resistance/sensitivity 7. Use ToolUniverse drug-repurposing skill for alternative options --- ## Advanced Analysis Patterns ### Pattern 1: Omics-Driven Patient Stratification For precision medicine applications where patient stratification is goal. ### Pattern 2: Multi-Omics Network Analysis Build integrated networks combining PPI, co-expression, regulatory interactions. ### Pattern 3: Temporal Multi-Omics Longitudinal multi-omics data (time-series or treatment response). ### Pattern 4: Spatial Multi-Omics Spatial transcriptomics + proteomics for tissue architecture. --- ## Quantified Minimums | Component | Requirement | |-----------|-------------| | Omics types | At least 2 omics datasets | | Common samples | At least 10 samples across omics | | Cross-correlation | Pearson/Spearman correlation computed | | Clustering | At least one method (MOFA+, NMF, or SNF) | | Pathway integration | Enrichment with multi-omics evidence scores | | Report | Summary, correlations, clusters, pathways, biomarkers | --- ## Limitations - **Sample size**: Multi-omics integration requires sufficient samples (n≥20 recommended) - **Missing data**: Some patients may not have all omics types - **Batch effects**: Different omics platforms/batches require careful normalization - **Computational**: Large multi-omics datasets may require significant memory/compute - **Interpretation**: Multi-omics results require domain expertise for biological validation --- ## References **Methods**: - MOFA+: https://doi.org/10.1186/s13059-020-02015-1 - Similarity Network Fusion: https://doi.org/10.1038/nmeth.2810 - Multi-omics review: https://doi.org/10.1038/s41576-019-0093-7 **ToolUniverse Skills**: - See individual skill documentation for omics-specific methods