--- name: "arboreto-grn-inference" description: "GRN inference from expression via GRNBoost2 (gradient boosting) or GENIE3 (Random Forest). Load matrix, filter by TFs, infer TF-target-importance links, save network. Dask-parallelized to single-cell scale. Core SCENIC component." license: "BSD-3-Clause" --- # Arboreto GRN Inference ## Overview Arboreto infers gene regulatory networks (GRNs) from gene expression data using parallelized tree-based regression. For each target gene, it trains a regression model with all other genes (or a specified TF list) as features and emits TF-target-importance triplets. It provides two interchangeable algorithms -- GRNBoost2 (gradient boosting, fast) and GENIE3 (Random Forest, classic) -- sharing identical input/output formats. Computation is Dask-parallelized, scaling from laptop cores to HPC clusters. ## When to Use - Inferring transcription factor-to-target gene regulatory relationships from bulk RNA-seq expression data - Building gene regulatory networks from single-cell RNA-seq count matrices (cells as rows, genes as columns) - Generating the adjacency matrix (Step 1) of the pySCENIC regulatory analysis pipeline - Comparing regulatory network structure across experimental conditions (e.g., control vs treatment) - Producing consensus regulatory networks by running inference across multiple random seeds - Validating GRN results by comparing GRNBoost2 and GENIE3 outputs on the same dataset - For downstream regulon identification and activity scoring, use arboreto output with pySCENIC - For single-cell preprocessing (QC, normalization, clustering) before GRN inference, use **scanpy-scrna-seq** ## Prerequisites - **Python packages**: `arboreto`, `pandas`, `numpy`, `dask`, `distributed`, `scikit-learn`, `scipy` - **Data requirements**: Gene expression matrix (genes as columns, observations as rows) in TSV/CSV; optionally a TF list file (one gene name per line) - **Environment**: Python 3.8+; optional `networkx`, `matplotlib` for visualization ```bash pip install arboreto distributed networkx matplotlib ``` ## Quick Start Complete GRN inference in a single block. The `if __name__ == '__main__':` guard is required because Dask spawns worker processes via multiprocessing. ```python import pandas as pd from arboreto.algo import grnboost2 from arboreto.utils import load_tf_names if __name__ == '__main__': # Load expression matrix (observations x genes) expression_matrix = pd.read_csv('expression_data.tsv', sep='\t') tf_names = load_tf_names('tf_list.txt') # optional TF filter # Infer GRN (uses all local cores by default) network = grnboost2(expression_data=expression_matrix, tf_names=tf_names, seed=777) # Filter top links and save top_network = network[network['importance'] > 1.0] top_network.to_csv('grn_output.tsv', sep='\t', index=False, header=False) print(f"Inferred {len(network)} links, kept {len(top_network)} above threshold") # Example: Inferred 185432 links, kept 12876 above threshold ``` ## Workflow ### Step 1: Load Expression Data Arboreto accepts a pandas DataFrame (recommended) or NumPy array. Rows are observations (cells or samples), columns are genes. Gene names must be column headers. ```python import pandas as pd # From TSV (genes as columns, observations as rows) expression_matrix = pd.read_csv('expression_data.tsv', sep='\t') print(f"Shape: {expression_matrix.shape} " f"({expression_matrix.shape[0]} observations x {expression_matrix.shape[1]} genes)") # Example: Shape: (5000, 18654) (5000 observations x 18654 genes) # From AnnData (e.g., after scanpy preprocessing) import anndata as ad adata = ad.read_h5ad('preprocessed.h5ad') expression_matrix = pd.DataFrame( adata.X.toarray() if hasattr(adata.X, 'toarray') else adata.X, columns=adata.var_names.tolist() ) print(f"Converted AnnData: {expression_matrix.shape}") # Example: Converted AnnData: (5000, 18654) ``` ### Step 2: Load Transcription Factor List Providing a TF list restricts regulators to known transcription factors, reducing computation time and improving biological relevance. If omitted, all genes are treated as potential regulators. ```python from arboreto.utils import load_tf_names # From file (one TF name per line) tf_names = load_tf_names('human_tfs.txt') print(f"Loaded {len(tf_names)} transcription factors") # Example: Loaded 1639 transcription factors # Or define directly tf_names = ['MYC', 'TP53', 'SOX2', 'NANOG', 'POU5F1'] # Verify TFs exist in expression matrix columns tf_in_data = [tf for tf in tf_names if tf in expression_matrix.columns] print(f"TFs found in expression data: {len(tf_in_data)}/{len(tf_names)}") # Example: TFs found in expression data: 1583/1639 ``` ### Step 3: Configure Dask Client (Optional) By default arboreto creates an internal Dask client using all local cores. Create an explicit client for resource control, monitoring, or cluster deployment. ```python from distributed import LocalCluster, Client # Custom local client with resource limits local_cluster = LocalCluster( n_workers=8, threads_per_worker=1, # avoid GIL contention in scikit-learn memory_limit='4GB' # per worker ) client = Client(local_cluster) print(f"Dashboard: {client.dashboard_link}") # Example: Dashboard: http://127.0.0.1:8787/status ``` ### Step 4: Run GRN Inference Call `grnboost2()` (recommended) or `genie3()`. Both share the same signature and output format. ```python from arboreto.algo import grnboost2 if __name__ == '__main__': network = grnboost2( expression_data=expression_matrix, tf_names=tf_names, # 'all' if no TF list client_or_address=client, # omit to use default local scheduler seed=777, # for reproducibility verbose=True # print progress ) print(f"Inferred {len(network)} regulatory links") print(network.head()) # Example: # TF target importance # 0 MYC CDK4 3.214 # 1 MYC CCND1 2.871 # 2 TP53 CDKN1A 2.654 ``` ### Step 5: Filter Results Raw output contains links for every TF-target pair with non-zero importance. Filter to retain high-confidence regulatory interactions. ```python # Strategy 1: Importance threshold threshold = 1.0 filtered = network[network['importance'] > threshold] print(f"Threshold {threshold}: {len(filtered)} links " f"({len(filtered)/len(network)*100:.1f}% retained)") # Example: Threshold 1.0: 12876 links (6.9% retained) # Strategy 2: Top N links per target gene top_n = 10 top_per_target = (network.groupby('target') .apply(lambda g: g.nlargest(top_n, 'importance')) .reset_index(drop=True)) print(f"Top {top_n} per target: {len(top_per_target)} links") # Example: Top 10 per target: 186540 links ``` ### Step 6: Save Network Save as a TSV file with three columns: TF, target, importance. ```python # Save full network network.to_csv('full_network.tsv', sep='\t', index=False) print(f"Saved full network: {len(network)} links") # Save filtered network (without header for pySCENIC compatibility) filtered.to_csv('filtered_network.tsv', sep='\t', index=False, header=False) print(f"Saved filtered network: {len(filtered)} links") # Clean up Dask client if explicitly created client.close() local_cluster.close() ``` ### Step 7: Visualize Results (Optional) Plot the top regulatory interactions as a directed network graph. ```python import networkx as nx import matplotlib.pyplot as plt # Build directed graph from top links top_links = network.nlargest(50, 'importance') G = nx.from_pandas_edgelist( top_links, source='TF', target='target', edge_attr='importance', create_using=nx.DiGraph() ) # Draw network fig, ax = plt.subplots(figsize=(12, 10)) pos = nx.spring_layout(G, k=2, seed=42) nx.draw_networkx_nodes(G, pos, node_size=300, node_color='lightblue', ax=ax) nx.draw_networkx_labels(G, pos, font_size=7, ax=ax) edges = nx.draw_networkx_edges( G, pos, edge_color=[G[u][v]['importance'] for u, v in G.edges()], edge_cmap=plt.cm.Reds, width=1.5, arrows=True, ax=ax ) plt.colorbar(edges, ax=ax, label='Importance') ax.set_title('Top 50 Regulatory Interactions') plt.tight_layout() plt.savefig('grn_network.png', dpi=300, bbox_inches='tight') print("Saved grn_network.png") ``` ## Key Parameters | Parameter | Default | Range / Options | Effect | |-----------|---------|-----------------|--------| | `expression_data` | (required) | DataFrame or ndarray | Expression matrix, observations x genes | | `tf_names` | `'all'` | list of str or `'all'` | Restrict regulators to known TFs; `'all'` uses every gene | | `gene_names` | `None` | list of str | Required when `expression_data` is a NumPy array | | `client_or_address` | `'local'` | Client, str, or `'local'` | Dask client instance or scheduler address | | `seed` | `None` | int | Random seed for reproducibility; always set for replicable results | | `verbose` | `False` | bool | Print progress messages during inference | ## Key Concepts ### Algorithm Selection Guide Both algorithms follow the same multiple-regression strategy (train one model per target gene, extract feature importances) but differ in the underlying regressor. | Feature | GRNBoost2 | GENIE3 | |---------|-----------|--------| | **Method** | Stochastic gradient boosting with early stopping | Random Forest (or ExtraTrees) | | **Speed** | Fast -- optimized for large datasets | Slower -- higher per-model cost | | **Memory** | Lower (early stopping limits tree depth) | Higher (full forests per target) | | **Best for** | Default choice; 10k+ observations, single-cell scale | Comparison with published GENIE3 results; validation | | **Import** | `from arboreto.algo import grnboost2` | `from arboreto.algo import genie3` | | **Output format** | TF, target, importance (identical) | TF, target, importance (identical) | **Decision rule**: Start with GRNBoost2. Use GENIE3 only when reproducing published GENIE3 analyses or as an independent validation of GRNBoost2 results. ### Output Format The network DataFrame has three columns, sorted by descending importance: | Column | Type | Description | |--------|------|-------------| | `TF` | str | Transcription factor (regulator) gene name | | `target` | str | Target gene name | | `importance` | float | Regulatory importance score (higher = stronger predicted regulation) | Importance scores are not p-values. They represent feature importance from the tree-based model. Filter by absolute threshold or top-N per target for downstream analysis. ### The `if __name__ == '__main__':` Requirement Dask uses Python's `multiprocessing` module to spawn worker processes. Without the main guard, each spawned process re-executes the script, leading to infinite process spawning. Always wrap arboreto calls in `if __name__ == '__main__':` when running as a script. This is not needed inside Jupyter notebooks. ## Common Recipes ### Recipe: Distributed Computing on a Cluster When to use: datasets too large for a single machine (e.g., 50k+ cells, 20k+ genes). Requires a Dask distributed scheduler running on the cluster. ```bash # On head node: start scheduler dask-scheduler # Output: Scheduler at tcp://10.118.224.134:8786 # On each compute node: start workers dask-worker tcp://10.118.224.134:8786 --nprocs 4 --nthreads 1 --memory-limit 16GB ``` ```python from distributed import Client from arboreto.algo import grnboost2 import pandas as pd if __name__ == '__main__': client = Client('tcp://10.118.224.134:8786') print(f"Connected to cluster: {client.scheduler_info()['workers'].__len__()} workers") # Example: Connected to cluster: 16 workers expression_data = pd.read_csv('large_scrna.tsv', sep='\t') tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist() network = grnboost2( expression_data=expression_data, tf_names=tf_names, client_or_address=client, seed=42, verbose=True ) network.to_csv('cluster_grn.tsv', sep='\t', index=False) print(f"Cluster inference complete: {len(network)} links") client.close() ``` ### Recipe: Consensus Network from Multiple Seeds When to use: improve robustness by aggregating networks from multiple random seeds. Links appearing consistently across runs are more reliable. ```python import pandas as pd from distributed import LocalCluster, Client from arboreto.algo import grnboost2 if __name__ == '__main__': client = Client(LocalCluster(n_workers=8, threads_per_worker=1)) expression_data = pd.read_csv('expression_data.tsv', sep='\t') tf_names = pd.read_csv('tf_list.txt', header=None)[0].tolist() seeds = [42, 123, 456, 789, 1001] all_networks = [] for seed in seeds: net = grnboost2(expression_data=expression_data, tf_names=tf_names, client_or_address=client, seed=seed) net['seed'] = seed all_networks.append(net) print(f"Seed {seed}: {len(net)} links") combined = pd.concat(all_networks, ignore_index=True) # Consensus: mean importance across seeds, keep links found in 3+ runs consensus = (combined.groupby(['TF', 'target']) .agg(mean_importance=('importance', 'mean'), n_seeds=('seed', 'nunique')) .reset_index()) consensus = consensus[consensus['n_seeds'] >= 3] consensus = consensus.sort_values('mean_importance', ascending=False) consensus.to_csv('consensus_network.tsv', sep='\t', index=False) print(f"Consensus network: {len(consensus)} links (found in 3+ of {len(seeds)} runs)") # Example: Consensus network: 28453 links (found in 3+ of 5 runs) client.close() ``` ### Recipe: pySCENIC Integration When to use: downstream regulatory analysis after arboreto GRN inference. Arboreto produces the adjacency matrix (Step 1 of SCENIC); pySCENIC performs regulon identification (Step 2) and activity scoring (Step 3). ```python from arboreto.algo import grnboost2 from arboreto.utils import load_tf_names import pandas as pd if __name__ == '__main__': # Step 1 (arboreto): GRN inference expression_matrix = pd.read_csv('scrna_expression.tsv', sep='\t') tf_names = load_tf_names('allTFs_hg38.txt') adjacencies = grnboost2( expression_data=expression_matrix, tf_names=tf_names, seed=777 ) adjacencies.to_csv('adjacencies.tsv', sep='\t', index=False, header=False) print(f"SCENIC Step 1 complete: {len(adjacencies)} adjacencies") # Example: SCENIC Step 1 complete: 234567 adjacencies # Steps 2-3 use pySCENIC CLI (requires pyscenic installation): # pyscenic ctx adjacencies.tsv hg38_ranking.feather \ # --annotations_fname motifs-v10nr.hgnc-m0.001-o0.0.tbl \ # --output regulons.csv # # pyscenic aucell scrna_expression.tsv regulons.csv \ # --output auc_matrix.csv print("Next: run pyscenic ctx (Step 2) and pyscenic aucell (Step 3)") ``` ## Expected Outputs - `full_network.tsv` -- complete TF-target-importance table; columns: TF, target, importance - `filtered_network.tsv` -- high-confidence subset after threshold or top-N filtering - `consensus_network.tsv` -- aggregated network from multi-seed runs; columns: TF, target, mean_importance, n_seeds - `adjacencies.tsv` -- arboreto output formatted for pySCENIC input (no header) - `grn_network.png` -- directed graph visualization of top regulatory interactions ## Troubleshooting | Problem | Cause | Solution | |---------|-------|----------| | `RuntimeError: freeze_support()` or infinite process spawning | Missing `if __name__ == '__main__':` guard | Wrap all arboreto calls in main guard; not needed in Jupyter | | `MemoryError` during inference | Expression matrix too large for available RAM | Filter low-variance genes first, reduce to top 5-10k genes, or use distributed cluster | | Empty or near-empty network | TF names do not match expression matrix column names | Verify TF names overlap with `expression_matrix.columns`; check case sensitivity | | Very slow inference | Using GENIE3 on large dataset, or too few Dask workers | Switch to GRNBoost2; create explicit Dask client with more workers | | `ModuleNotFoundError: No module named 'arboreto'` | Package not installed | `pip install arboreto` | | All importance scores are 0 or identical | Expression matrix has no variance (e.g., all zeros after filtering) | Check data quality; ensure matrix contains non-zero expression values with variance | | `TypeError` with NumPy array input | Missing `gene_names` parameter | Provide `gene_names=` list matching the column count of the array | ## Bundled Resources This entry is fully self-contained with no `references/` directory. **Original reference file dispositions**: - **references/algorithms.md** (139 lines) -- Algorithm comparison, parameter details, custom regressor kwargs. Consolidated into Key Concepts (Algorithm Selection Guide table) and Key Parameters. Custom regressor kwargs (`regressor_type`, `regressor_kwargs`) omitted: advanced tuning rarely needed; users can pass scikit-learn kwargs directly per the arboreto API docs. - **references/basic_inference.md** (152 lines) -- Input formats, TF loading, basic workflow, output format. Fully consolidated into Workflow Steps 1-2, Key Concepts (Output Format table), and Quick Start. - **references/distributed_computing.md** (242 lines) -- Local client, cluster setup, monitoring, performance tips. Consolidated into Workflow Step 3 and Recipe: Distributed Computing. Omitted: Dask dashboard panel descriptions (monitoring UI detail beyond scope), detailed network/storage tuning (cluster-admin concern). - **scripts/basic_grn_inference.py** (98 lines) -- CLI wrapper with argparse. Core load-infer-save logic absorbed into Workflow Steps 1-6 and Quick Start. Argparse/CLI boilerplate stripped. **Narrative use-case dispositions** (from original Common Use Cases): - Single-cell RNA-seq analysis -- absorbed into When to Use bullets and Workflow Steps 1/4 - Bulk RNA-seq with TF filtering -- absorbed into Workflow Step 2 and Quick Start - Comparative analysis (multiple conditions) -- omitted as standalone recipe: trivial loop over conditions, pattern shown in Consensus Network recipe ## Related Skills - **scanpy-scrna-seq** -- upstream single-cell preprocessing (QC, normalization, HVG selection) before GRN inference - **scvi-tools-single-cell** -- deep generative models for batch correction and imputation prior to network inference - **pyscenic (planned)** -- downstream regulon identification and cellular activity scoring using arboreto adjacencies - **networkx-graph-analysis** -- graph-theoretic analysis of inferred regulatory networks (centrality, communities) ## References - [Arboreto GitHub repository](https://github.com/aertslab/arboreto) -- source code, README, examples - [Arboreto PyPI](https://pypi.org/project/arboreto/) -- installation and version history - Moerman et al. (2019) "GRNBoost2 and Arboreto: efficient and scalable inference of gene regulatory networks" *Bioinformatics* 35(12):2159-2161 -- algorithm paper - Aibar et al. (2017) "SCENIC: single-cell regulatory network inference and clustering" *Nature Methods* 14:1083-1086 -- SCENIC pipeline paper - [Dask distributed documentation](https://distributed.dask.org/) -- Dask client, cluster setup, dashboard