---
name: "archs4-database"
description: "Query ARCHS4 REST API for uniformly processed RNA-seq expression, tissue patterns, co-expression across 1M+ human/mouse samples. Retrieve z-scores, co-expressed genes, samples by metadata, HDF5 matrices. For variant population genetics use gnomad-database; for pathway enrichment use gget-genomic-databases (Enrichr)."
license: "CC-BY-4.0"
---

# ARCHS4 Database

## Overview

ARCHS4 (All RNA-seq and ChIP-seq Sample and Signature Search) is a resource of uniformly aligned and processed human and mouse RNA-seq data from NCBI GEO and SRA, covering 1 million+ samples. The REST API at `https://maayanlab.cloud/archs4/api/` provides gene-level expression profiles, z-score normalized tissue expression, co-expression networks, and sample metadata search — all without authentication. Large-scale bulk queries can also use the downloadable HDF5 expression matrices.

## When to Use

- Retrieving tissue-specific or cell-type-specific expression z-scores for a gene of interest across hundreds of tissue types
- Finding genes co-expressed with a query gene (co-expression network construction or guilt-by-association analysis)
- Searching for RNA-seq samples by tissue, disease, or metadata keyword to identify candidate datasets for reanalysis
- Comparing expression profiles of multiple genes across tissues to prioritize candidates for wet-lab follow-up
- Accessing uniformly processed gene expression matrices (HDF5 format) for large-scale cross-study analysis
- Validating differential expression results by checking whether a gene's expression direction matches population-level tissue profiles
- For variant-level population allele frequencies use `gnomad-database`; ARCHS4 provides expression evidence only
- For Enrichr pathway enrichment from a gene list use `gget-genomic-databases` (`gget enrichr`); ARCHS4 is for expression lookups

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`, `seaborn`
- **Data requirements**: gene symbols (HGNC format, e.g., `TP53`, `BRCA1`); sample GEO/SRA IDs for direct sample queries
- **Environment**: internet connection; no API key or account required
- **Rate limits**: ~10 requests/second; add `time.sleep(0.1)` between sequential gene queries to avoid throttling

```bash
pip install requests pandas matplotlib seaborn
```

## Quick Start

```python
import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def archs4_get(endpoint: str, params: dict = None) -> dict:
    """Send a GET request to the ARCHS4 API and return parsed JSON."""
    r = requests.get(f"{ARCHS4_BASE}/{endpoint}", params=params, timeout=30)
    r.raise_for_status()
    return r.json()

# Quick check: top tissues expressing TP53
data = archs4_get("meta/genes/TP53/zscore")
tissues = data.get("values", [])
print(f"TP53 tissue expression entries: {len(tissues)}")
top5 = sorted(tissues, key=lambda x: x.get("zscore", 0), reverse=True)[:5]
for t in top5:
    print(f"  {t['tissue']:<40}  z={t['zscore']:.2f}")
# TP53 tissue expression entries: 200
#   thymus                                   z=2.81
#   testis                                   z=2.44
```

## Core API

### Query 1: Gene Expression Z-Scores Across Tissues

Retrieve z-score normalized expression for a gene across all available tissue types. Z-scores are computed per-sample relative to the population distribution; positive values indicate above-average expression.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_gene_tissue_zscore(gene_symbol: str, species: str = "human") -> pd.DataFrame:
    """Return tissue z-score expression profile for a gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol (e.g., 'TP53').
    species : str
        'human' or 'mouse' (default: 'human').
    """
    endpoint = f"meta/genes/{gene_symbol}/zscore"
    r = requests.get(
        f"{ARCHS4_BASE}/{endpoint}",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("values", [])
    df = pd.DataFrame(records)
    return df.sort_values("zscore", ascending=False).reset_index(drop=True)

df = get_gene_tissue_zscore("MYC")
print(f"MYC tissue z-scores: {len(df)} tissue types")
print(df[["tissue", "zscore"]].head(10).to_string(index=False))
# MYC tissue z-scores: 200
#                     tissue  zscore
#                      colon    3.12
#             small intestine    2.98
#                    placenta    2.74
```

```python
# Query mouse tissues for a gene
df_mouse = get_gene_tissue_zscore("Myc", species="mouse")
print(f"Mouse Myc: top 5 tissues")
print(df_mouse[["tissue", "zscore"]].head(5).to_string(index=False))
```

### Query 2: Co-expressed Genes

Find genes whose expression is most correlated with a query gene across all ARCHS4 samples. Useful for identifying pathway partners, regulators, or candidate targets.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_coexpressed_genes(gene_symbol: str, top_n: int = 50,
                           species: str = "human") -> pd.DataFrame:
    """Return genes co-expressed with the query gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol.
    top_n : int
        Number of correlated genes to return (default: 50).
    species : str
        'human' or 'mouse' (default: 'human').
    """
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/correlations",
        params={"species": species, "limit": top_n},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("values", [])
    df = pd.DataFrame(records)
    return df.sort_values("correlation", ascending=False).reset_index(drop=True)

coexp = get_coexpressed_genes("PCNA", top_n=20)
print(f"Top co-expressed genes with PCNA (n={len(coexp)}):")
print(coexp[["gene", "correlation"]].head(10).to_string(index=False))
# Top co-expressed genes with PCNA (n=20):
#   gene  correlation
#   RFC4         0.91
#   RFC2         0.89
#   MCM6         0.87
```

```python
# Extract gene list for downstream enrichment
gene_list = coexp["gene"].tolist()
print(f"Co-expression gene list: {gene_list[:10]}")
# Pass gene_list to Enrichr or pathway analysis tools
```

### Query 3: Sample Search

Search for RNA-seq samples by metadata keyword (tissue, disease condition, cell type, treatment). Returns GEO/SRA sample identifiers with metadata fields.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def search_samples(keyword: str, species: str = "human",
                   limit: int = 100) -> pd.DataFrame:
    """Search ARCHS4 samples by metadata keyword.

    Parameters
    ----------
    keyword : str
        Search term (e.g., 'breast cancer', 'liver', 'HeLa').
    species : str
        'human' or 'mouse'.
    limit : int
        Maximum number of samples to return.
    """
    r = requests.get(
        f"{ARCHS4_BASE}/samples/search",
        params={"query": keyword, "species": species, "limit": limit},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("samples", [])
    return pd.DataFrame(records)

samples = search_samples("pancreatic cancer", limit=50)
print(f"Samples matching 'pancreatic cancer': {len(samples)}")
if len(samples) > 0:
    print(samples[["sample_id", "series_id", "title"]].head(5).to_string(index=False))
# Samples matching 'pancreatic cancer': 50
#   sample_id  series_id  title
#   GSM2345678  GSE123456  Pancreatic ductal adenocarcinoma - sample 1
```

### Query 4: Gene-Level Metadata Summary

Retrieve summary statistics and metadata for a gene including the number of samples expressing it, expression percentile, and available annotation.

```python
import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_gene_metadata(gene_symbol: str, species: str = "human") -> dict:
    """Return metadata and expression summary for a gene."""
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    return r.json()

meta = get_gene_metadata("GAPDH")
print(f"Gene: {meta.get('gene_symbol', 'N/A')}")
print(f"Species: {meta.get('species', 'N/A')}")
print(f"Ensembl ID: {meta.get('ensembl_gene_id', 'N/A')}")
print(f"Description: {meta.get('description', 'N/A')[:80]}")
```

```python
# Compare metadata for a panel of housekeeping genes
import time

housekeeping = ["GAPDH", "ACTB", "B2M", "HPRT1", "RPLP0"]
for gene in housekeeping:
    meta = get_gene_metadata(gene)
    print(f"  {gene:<8}  {meta.get('ensembl_gene_id', 'N/A')}")
    time.sleep(0.1)
```

### Query 5: Visualization — Tissue Expression Barplot

Generate a publication-ready barplot of z-score expression across the top tissues for a gene.

```python
import requests
import pandas as pd
import matplotlib.pyplot as plt

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def plot_tissue_expression(gene_symbol: str, top_n: int = 20,
                            species: str = "human",
                            output_file: str = None) -> None:
    """Plot top tissue z-score expression for a gene.

    Parameters
    ----------
    gene_symbol : str
        HGNC gene symbol.
    top_n : int
        Number of top tissues to display.
    species : str
        'human' or 'mouse'.
    output_file : str
        If provided, save figure to this path.
    """
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/zscore",
        params={"species": species},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("values", [])
    df = pd.DataFrame(records).sort_values("zscore", ascending=False).head(top_n)

    fig, ax = plt.subplots(figsize=(10, 6))
    colors = ["#D73027" if z > 0 else "#4575B4" for z in df["zscore"]]
    bars = ax.barh(df["tissue"][::-1], df["zscore"][::-1], color=colors[::-1])
    ax.axvline(0, color="black", linewidth=0.8, linestyle="--")
    ax.set_xlabel("Expression Z-Score")
    ax.set_title(f"ARCHS4 Tissue Expression: {gene_symbol} ({species})\nTop {top_n} tissues")
    ax.bar_label(bars, fmt="%.2f", padding=3, fontsize=8)
    plt.tight_layout()
    fname = output_file or f"{gene_symbol}_tissue_expression.png"
    plt.savefig(fname, dpi=150, bbox_inches="tight")
    print(f"Saved {fname}  ({len(df)} tissues plotted)")

plot_tissue_expression("BRCA1", top_n=15, output_file="BRCA1_tissue_expression.png")
```

### Query 6: HDF5 Bulk Data Access

Download or stream from ARCHS4's precomputed HDF5 expression matrices for large-scale cross-sample analysis. The HDF5 files contain gene × sample count matrices for human and mouse.

```python
import requests

# HDF5 files are available for bulk download from the ARCHS4 data portal
# URL pattern: https://maayanlab.cloud/archs4/download#expression
# Human gene-level: human_gene_v2.6.h5
# Mouse gene-level: mouse_gene_v2.6.h5

def get_h5_download_urls() -> dict:
    """Return download URLs for ARCHS4 HDF5 expression matrices."""
    base = "https://maayanlab.cloud/archs4"
    return {
        "human_gene": f"{base}/files/human_gene_v2.6.h5",
        "mouse_gene": f"{base}/files/mouse_gene_v2.6.h5",
        "human_transcript": f"{base}/files/human_transcript_v2.6.h5",
        "mouse_transcript": f"{base}/files/mouse_transcript_v2.6.h5",
    }

urls = get_h5_download_urls()
for key, url in urls.items():
    print(f"  {key:<22}  {url}")

# To work with a downloaded HDF5 file:
try:
    import h5py
    import numpy as np

    h5_path = "human_gene_v2.6.h5"   # after download

    def extract_gene_from_h5(h5_path: str, gene_symbol: str,
                              n_samples: int = 1000) -> dict:
        """Extract expression values for a gene from the HDF5 matrix."""
        with h5py.File(h5_path, "r") as f:
            genes = [g.decode() for g in f["meta"]["genes"]["gene_symbol"][:]]
            if gene_symbol not in genes:
                raise ValueError(f"{gene_symbol} not found in HDF5")
            idx = genes.index(gene_symbol)
            expr = f["data"]["expression"][idx, :n_samples]
            sample_ids = [s.decode() for s in f["meta"]["samples"]["geo_accession"][:n_samples]]
        return {"gene": gene_symbol, "expression": expr, "sample_ids": sample_ids}

    result = extract_gene_from_h5(h5_path, "TP53", n_samples=500)
    print(f"TP53 expression: mean={result['expression'].mean():.2f},"
          f" max={result['expression'].max():.2f} (n={len(result['expression'])} samples)")
except ImportError:
    print("h5py not installed. Install with: pip install h5py")
except FileNotFoundError:
    print("HDF5 file not downloaded yet. Use the URLs above to download first.")
```

## Key Concepts

### Z-Score Normalization

ARCHS4 reports gene expression as z-scores computed relative to all samples for that gene. A z-score of 0 means expression at the population mean; a z-score of 2.0 means expression 2 standard deviations above the mean. Z-scores are more interpretable across datasets than raw counts because they account for library size differences and batch effects introduced by uniform alignment across studies.

```python
# Example: Positive z-score = above-average expression for that gene
# z > 2.0 → top ~2.5% of samples for that gene
# z < -2.0 → bottom ~2.5% of samples for that gene
# Use absolute z-score thresholds consistently when comparing across genes
```

### HDF5 vs REST API

| Access method | Best for | Limitations |
|---------------|----------|-------------|
| REST API (`/zscore`, `/correlations`) | Quick single-gene queries, exploration | Aggregated profiles only, no per-sample access |
| REST API (`/samples/search`) | Discovering relevant datasets | Returns metadata, not expression values |
| HDF5 download | Bulk analysis, custom co-expression, ML | Requires 30–60 GB disk; download once |

### Species and Gene Symbol Conventions

ARCHS4 indexes human samples using HGNC gene symbols (uppercase, e.g., `TP53`) and mouse samples using MGI symbols (first letter uppercase, e.g., `Trp53`). The `species` parameter accepts `"human"` or `"mouse"`. Mixed-case or ensemble IDs will return empty results.

## Common Workflows

### Workflow 1: Multi-Gene Tissue Expression Heatmap

**Goal**: Compare tissue expression profiles of a gene panel and visualize as a heatmap to identify tissue-specific vs ubiquitous expression patterns.

```python
import requests, time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

gene_panel = ["MYC", "TP53", "BRCA1", "EGFR", "KRAS", "CDK4"]
top_n_tissues = 25

def get_tissue_zscores(gene: str) -> pd.Series:
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene}/zscore",
        params={"species": "human"},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("values", [])
    df = pd.DataFrame(records).set_index("tissue")["zscore"]
    return df

# Build expression matrix (genes × tissues)
all_data = {}
for gene in gene_panel:
    try:
        all_data[gene] = get_tissue_zscores(gene)
        print(f"  Fetched {gene}")
    except Exception as e:
        print(f"  Warning: {gene} failed — {e}")
    time.sleep(0.1)

matrix = pd.DataFrame(all_data).T   # genes × tissues
# Select top tissues by max absolute z-score
tissue_importance = matrix.abs().max(axis=0).sort_values(ascending=False)
top_tissues = tissue_importance.head(top_n_tissues).index
matrix_subset = matrix[top_tissues]

# Plot heatmap
fig, ax = plt.subplots(figsize=(14, 5))
sns.heatmap(
    matrix_subset,
    cmap="RdBu_r",
    center=0,
    vmin=-3,
    vmax=3,
    ax=ax,
    cbar_kws={"label": "Z-Score"},
    linewidths=0.5
)
ax.set_title("ARCHS4 Tissue Expression Profiles — Gene Panel")
ax.set_xlabel("Tissue")
ax.set_ylabel("Gene")
plt.xticks(rotation=45, ha="right", fontsize=8)
plt.tight_layout()
plt.savefig("archs4_panel_heatmap.png", dpi=150, bbox_inches="tight")
print(f"Saved archs4_panel_heatmap.png  ({matrix_subset.shape})")
```

### Workflow 2: Co-expression Network Seed Expansion

**Goal**: Start from a seed gene, retrieve co-expressed partners, then query their co-expressed genes in turn to build a two-hop co-expression neighborhood.

```python
import requests, time
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def get_coexp(gene: str, top_n: int = 20, species: str = "human") -> list:
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene}/correlations",
        params={"species": species, "limit": top_n},
        timeout=30
    )
    r.raise_for_status()
    return [rec["gene"] for rec in r.json().get("values", [])]

seed_gene = "PCNA"
min_correlation = 0.80

# Hop 1: direct co-expressed partners
hop1_genes = get_coexp(seed_gene, top_n=30)
print(f"Hop 1 partners of {seed_gene}: {len(hop1_genes)}")
time.sleep(0.1)

# Hop 2: co-expressed genes of each partner
edges = set()
for gene in hop1_genes[:10]:   # limit for demonstration
    partners = get_coexp(gene, top_n=20)
    for partner in partners:
        if partner != seed_gene:
            edges.add((gene, partner))
    time.sleep(0.1)

# Summarize the network
network_df = pd.DataFrame(list(edges), columns=["source", "target"])
hub_counts = network_df["source"].value_counts()
print(f"\nTwo-hop network: {len(edges)} edges")
print(f"Top hub genes:")
print(hub_counts.head(5))

network_df.to_csv(f"{seed_gene}_coexp_network.csv", index=False)
print(f"\nSaved {seed_gene}_coexp_network.csv")
```

### Workflow 3: Sample Discovery and Dataset Summary

**Goal**: Search for samples by disease keyword, summarize how many GEO series are available, and export sample metadata for downstream reanalysis selection.

```python
import requests, time
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def search_and_summarize(keyword: str, species: str = "human",
                          limit: int = 200) -> pd.DataFrame:
    """Search samples and return a tidy metadata DataFrame."""
    r = requests.get(
        f"{ARCHS4_BASE}/samples/search",
        params={"query": keyword, "species": species, "limit": limit},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("samples", [])
    return pd.DataFrame(records)

keyword = "colorectal cancer"
df = search_and_summarize(keyword, limit=150)
print(f"Samples matching '{keyword}': {len(df)}")

if len(df) > 0:
    # Summarize by GEO series
    series_counts = df["series_id"].value_counts()
    print(f"\nTop GEO series (by sample count):")
    print(series_counts.head(8).to_string())

    # Export sample list
    df.to_csv(f"{keyword.replace(' ', '_')}_samples.csv", index=False)
    print(f"\nSaved {keyword.replace(' ', '_')}_samples.csv ({len(df)} samples)")
    print(f"Unique GEO series: {df['series_id'].nunique()}")
```

## Key Parameters

| Parameter | Endpoint | Default | Range / Options | Effect |
|-----------|----------|---------|-----------------|--------|
| `species` | All gene endpoints | `"human"` | `"human"`, `"mouse"` | Selects the species-specific sample index |
| `limit` | `/correlations`, `/samples/search` | `100` | `1`–`500` | Number of results returned |
| `gene_symbol` (path) | `/meta/genes/{gene}/zscore`, `/correlations` | — | HGNC symbol (human) or MGI symbol (mouse) | Query gene; case-sensitive |
| `query` | `/samples/search` | — | free-text string | Metadata keyword search across title, tissue, source fields |
| `offset` | `/samples/search` | `0` | integer | Pagination offset for large result sets |
| `correlation` (response field) | `/correlations` | — | `-1.0`–`1.0` | Pearson correlation coefficient; filter `> 0.7` for high co-expression |
| `zscore` (response field) | `/zscore` | — | continuous float | Expression z-score; `> 2.0` = high expression |
| `page_size` (HDF5) | HDF5 slice | all | any integer | Number of samples to extract per read from HDF5 |

## Best Practices

1. **Use z-score thresholds consistently**: Because z-scores are gene-specific, a z-score of 2.0 for a ubiquitous gene (GAPDH) and a tissue-restricted gene (TTR, liver) have different interpretive meaning. Always annotate which gene you are comparing and the tissue background.

2. **Sleep between batch queries**: ARCHS4 enforces a soft rate limit of ~10 requests/second. Add `time.sleep(0.1)` between sequential gene queries to avoid `429 Too Many Requests` errors.

3. **Download HDF5 for large-scale analyses**: For queries covering 50+ genes or requiring per-sample expression values, the REST API is impractical. Download the HDF5 file once and use `h5py` slicing for fast matrix access; this avoids hitting rate limits and is 100× faster for bulk extraction.

4. **Match gene symbol conventions by species**: Human queries require HGNC uppercase symbols (e.g., `TP53`); mouse queries require MGI-style symbols (e.g., `Trp53`). Using the wrong case returns empty results without an error.

5. **Validate co-expression findings across datasets**: ARCHS4 co-expression aggregates across all tissue types. A high correlation may be driven by a single tissue or study. Cross-check with tissue-specific queries or manually inspect the top contributing GEO series.

## Common Recipes

### Recipe: Quick Tissue Specificity Check

When to use: Rapidly determine whether a gene is broadly expressed (housekeeping) or tissue-restricted before designing experiments.

```python
import requests

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def tissue_specificity_summary(gene_symbol: str) -> None:
    """Print a summary of high and low expression tissues for a gene."""
    r = requests.get(
        f"{ARCHS4_BASE}/meta/genes/{gene_symbol}/zscore",
        params={"species": "human"},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("values", [])
    zscores = [rec["zscore"] for rec in records if rec.get("zscore") is not None]
    top_high = sorted(records, key=lambda x: x.get("zscore", 0), reverse=True)[:5]
    top_low = sorted(records, key=lambda x: x.get("zscore", float("inf")))[:3]
    print(f"\n{gene_symbol} — {len(zscores)} tissues")
    print(f"  Range: [{min(zscores):.2f}, {max(zscores):.2f}]  "
          f"Mean: {sum(zscores)/len(zscores):.2f}")
    print("  High expression:")
    for t in top_high:
        print(f"    {t['tissue']:<35}  z={t['zscore']:.2f}")
    print("  Low expression:")
    for t in top_low:
        print(f"    {t['tissue']:<35}  z={t['zscore']:.2f}")

tissue_specificity_summary("TTR")   # Transthyretin — liver-specific
```

### Recipe: Batch Gene Co-Expression Table

When to use: Generate a pairwise correlation table for a gene panel from a list of differentially expressed genes.

```python
import requests, time
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

def batch_coexpr_table(gene_list: list, top_n: int = 10) -> pd.DataFrame:
    """For each gene in gene_list, return its top co-expressed genes."""
    rows = []
    for gene in gene_list:
        try:
            r = requests.get(
                f"{ARCHS4_BASE}/meta/genes/{gene}/correlations",
                params={"species": "human", "limit": top_n},
                timeout=30
            )
            r.raise_for_status()
            for rec in r.json().get("values", []):
                rows.append({
                    "query_gene": gene,
                    "coexp_gene": rec.get("gene"),
                    "correlation": rec.get("correlation"),
                })
            time.sleep(0.1)
        except Exception as e:
            print(f"Warning: {gene} skipped — {e}")
    return pd.DataFrame(rows)

deg_list = ["MYC", "CCND1", "CDK4", "RB1", "E2F1"]
coexp_table = batch_coexpr_table(deg_list, top_n=10)
print(f"Co-expression entries: {len(coexp_table)}")
print(coexp_table.groupby("query_gene")["coexp_gene"].count())
coexp_table.to_csv("deg_coexpression_table.csv", index=False)
print("Saved deg_coexpression_table.csv")
```

### Recipe: Export Sample IDs for GEO Download

When to use: Identify relevant GEO accessions to download raw count matrices for a meta-analysis.

```python
import requests
import pandas as pd

ARCHS4_BASE = "https://maayanlab.cloud/archs4/api/v1"

keyword = "glioblastoma"
r = requests.get(
    f"{ARCHS4_BASE}/samples/search",
    params={"query": keyword, "species": "human", "limit": 200},
    timeout=30
)
r.raise_for_status()
samples = pd.DataFrame(r.json().get("samples", []))
if len(samples) > 0:
    # Get unique GEO series accessions
    series = samples["series_id"].dropna().unique()
    print(f"Unique GEO series for '{keyword}': {len(series)}")
    for s in series[:10]:
        n = (samples["series_id"] == s).sum()
        print(f"  {s}  ({n} samples)")
    # Export series list for GEO download script
    pd.Series(series, name="geo_series").to_csv(
        f"{keyword}_geo_series.txt", index=False
    )
    print(f"\nSaved {keyword}_geo_series.txt")
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `HTTP 404` for gene query | Gene symbol not found in ARCHS4 index | Verify HGNC symbol spelling; check `species` parameter matches gene convention (human: uppercase, mouse: first-letter-upper) |
| `HTTP 429 Too Many Requests` | Exceeded ~10 req/s rate limit | Add `time.sleep(0.1)` between requests; for batch queries use a 0.5 s delay |
| Empty `values` list in z-score response | Gene is not expressed in any indexed tissue, or wrong species | Switch species; verify gene is protein-coding and has GEO coverage |
| Empty `samples` list from search | Keyword not matched in metadata fields | Try broader or alternative keywords (e.g., `"liver"` instead of `"hepatic"`) |
| HDF5 gene not found | Symbol mismatch between HDF5 version and query | Check available genes in `f["meta"]["genes"]["gene_symbol"][:]`; try Ensembl ID or alias |
| `requests.exceptions.Timeout` | Slow API response under load | Increase `timeout=60`; retry with exponential backoff |
| Z-scores all near zero | Gene has very low or absent expression across tissues | Check the gene's expression in raw counts; the gene may be non-coding or very lowly expressed |

## Related Skills

- `gnomad-database` — Population variant frequencies; use after ARCHS4 to identify variants in highly expressed genes
- `gget-genomic-databases` — Enrichr pathway enrichment for ARCHS4 co-expression gene lists (`gget enrichr`)
- `pydeseq2-differential-expression` — Differential expression analysis on bulk RNA-seq; ARCHS4 HDF5 matrices can serve as reference cohorts

## References

- [ARCHS4 web portal](https://maayanlab.cloud/archs4/) — Interactive expression browser and dataset download
- [ARCHS4 REST API documentation](https://maayanlab.cloud/archs4/api/) — Endpoint reference and parameters
- [Lachmann et al., Nature Communications 2018](https://doi.org/10.1038/s41467-018-03751-6) — ARCHS4 original publication describing uniform alignment pipeline
- [ARCHS4 GitHub](https://github.com/MaayanLab/archs4) — Source code and HDF5 schema documentation