---
name: "pride-database"
description: "Search PRIDE Archive REST API for proteomics datasets, peptide IDs, and MS raw files. Find experiments by organism, tissue, disease, or instrument; download RAW/mzML; retrieve peptide/PSM IDs and protein-level evidence. Use interpro-database for domains; uniprot-protein-database for sequences."
license: "Apache-2.0"
---

# PRIDE Database

## Overview

The PRIDE Archive (PRoteomics IDEntifications database) at EBI is the world's largest public repository of mass spectrometry-based proteomics data, containing 30,000+ datasets from peer-reviewed publications. The REST API v2 at `https://www.ebi.ac.uk/pride/ws/archive/v2/` provides project discovery, file listing, peptide/PSM identification retrieval, and protein-level evidence, all without authentication. Data types include RAW files, peak lists (mzML, MGF), PRIDE XML result files, and processed identification tables.

## When to Use

- Finding published proteomics datasets by organism, tissue, disease keyword, or instrument type for meta-analysis or benchmarking
- Downloading raw mass spectrometry data (RAW, mzML) or peak files (MGF) from a specific PRIDE project accession
- Retrieving peptide identification tables with sequence, modification, and confidence score for a project
- Querying protein-level evidence (PSMs, unique peptides) for a protein of interest across PRIDE projects
- Checking whether a protein has experimental proteomics evidence in a specific tissue or disease context
- Building training datasets of confident peptide-spectrum matches (PSMs) for proteomics ML applications
- For protein domain and family classification use `interpro-database`; PRIDE provides experimental identification evidence only
- For protein sequences, Swiss-Prot annotations, and ID mapping use `uniprot-protein-database`

## Prerequisites

- **Python packages**: `requests`, `pandas`, `matplotlib`
- **Data requirements**: PRIDE project accessions (e.g., `PXD000001`) or search keywords (tissue, organism, disease)
- **Environment**: internet connection; no API key or account required
- **Rate limits**: ~50 requests/minute; add `time.sleep(1.2)` between sequential project or file queries to stay within limits

```bash
pip install requests pandas matplotlib
```
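
The ~50 requests/minute budget above works out to one request every 1.2 seconds. The following is a minimal sketch of a client-side throttle that enforces that spacing. The `MinIntervalThrottle` class and its injectable `clock` and `sleep` arguments are hypothetical helpers for illustration, not part of the PRIDE API or of `requests`.

```python
import time


class MinIntervalThrottle:
    """Wait until at least `min_interval` seconds have passed since the last call.

    Hypothetical helper: `clock` and `sleep` are injectable so the logic can be
    exercised without real waiting.
    """

    def __init__(self, requests_per_minute: float = 50,
                 clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / requests_per_minute
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self) -> float:
        """Sleep if needed and return the number of seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            elapsed = now - self._last_call
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record when this call was allowed to proceed
        self._last_call = now + delay
        return delay


throttle = MinIntervalThrottle(requests_per_minute=50)
print(f"Minimum spacing: {throttle.min_interval:.1f} s")
```

Calling `throttle.wait()` immediately before each request keeps a loop within the budget without hard-coding a fixed sleep after every call.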

## Quick Start

```python
import requests

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def pride_get(endpoint: str, params: dict = None) -> dict:
    """Send a GET request to the PRIDE API and return parsed JSON."""
    r = requests.get(
        f"{PRIDE_BASE}/{endpoint}",
        params=params,
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    return r.json()

# Quick lookup: project details for a known accession
project = pride_get("projects/PXD000001")
print(f"Project: {project['accession']}")
print(f"Title:   {project['title'][:80]}")
print(f"Organisms: {[o['name'] for o in project.get('organisms', [])]}")
print(f"Submitted: {project.get('submissionDate', 'N/A')}")
# Project: PXD000001
# Title:   TMT spikes - Using R and Bioconductor for proteomics data analysis
```

## Core API

### Query 1: Project Search

Search PRIDE projects by keyword, organism, tissue, disease, or instrument. Returns paginated project summaries with accession, title, and metadata.

```python
import requests
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def search_projects(keyword: str = None, organism: str = None,
                    tissue: str = None, disease: str = None,
                    instrument: str = None,
                    page_size: int = 25, page: int = 0) -> dict:
    """Search PRIDE projects by metadata fields.

    Parameters
    ----------
    keyword : str
        Free-text keyword search across title and description.
    organism : str
        Organism name filter (e.g., 'Homo sapiens').
    tissue : str
        Tissue filter (e.g., 'liver', 'plasma').
    disease : str
        Disease filter (e.g., 'cancer', 'Alzheimer').
    instrument : str
        Instrument filter (e.g., 'Orbitrap').
    page_size : int
        Results per page (max 100).
    page : int
        Page number for pagination (0-indexed).
    """
    params = {"pageSize": page_size, "page": page}
    if keyword:
        params["keyword"] = keyword
    if organism:
        params["organisms"] = organism
    if tissue:
        params["tissues"] = tissue
    if disease:
        params["diseases"] = disease
    if instrument:
        params["instruments"] = instrument

    r = requests.get(
        f"{PRIDE_BASE}/projects",
        params=params,
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    return r.json()

result = search_projects(organism="Homo sapiens", tissue="liver", page_size=10)
projects = result.get("_embedded", {}).get("compactprojects", [])
total = result.get("page", {}).get("totalElements", 0)
print(f"Total liver proteomics projects: {total}")
print(f"First page: {len(projects)} projects")
for p in projects[:5]:
    print(f"  {p['accession']}  {p['title'][:60]}")
# Example output (illustrative; counts and accessions change as PRIDE grows):
# Total liver proteomics projects: 523
#   PXD012345  Liver proteome profiling in NAFLD
```

```python
# Paginate through all results for a disease keyword
import time

def get_all_projects(keyword: str, page_size: int = 100) -> pd.DataFrame:
    """Retrieve all matching projects across pages."""
    records, page = [], 0
    while True:
        data = search_projects(keyword=keyword, page_size=page_size, page=page)
        batch = data.get("_embedded", {}).get("compactprojects", [])
        if not batch:
            break
        for p in batch:
            records.append({
                "accession": p.get("accession"),
                "title": p.get("title", "")[:100],
                "submission_date": p.get("submissionDate"),
                "publication_date": p.get("publicationDate"),
                "n_files": p.get("filesCount", 0),
            })
        total = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total:
            break
        time.sleep(1.2)  # stay under ~50 requests/minute between pages
    return pd.DataFrame(records)

df = get_all_projects("colorectal cancer", page_size=50)
print(f"Colorectal cancer projects: {len(df)}")
df.to_csv("pride_colorectal_projects.csv", index=False)
```

### Query 2: Project Details

Retrieve complete metadata for a specific project by its PRIDE accession (PXD######).

```python
import requests

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def get_project(accession: str) -> dict:
    """Fetch full metadata for a PRIDE project.

    Parameters
    ----------
    accession : str
        PRIDE accession (e.g., 'PXD000001').
    """
    r = requests.get(
        f"{PRIDE_BASE}/projects/{accession}",
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    return r.json()

project = get_project("PXD004131")
print(f"Accession       : {project['accession']}")
print(f"Title           : {project['title'][:80]}")
print(f"Submission date : {project.get('submissionDate', 'N/A')}")
print(f"Publication date: {project.get('publicationDate', 'N/A')}")
organisms = [o["name"] for o in project.get("organisms", [])]
print(f"Organisms       : {organisms}")
tissues = [t["name"] for t in project.get("tissues", [])]
print(f"Tissues         : {tissues}")
instruments = [i["name"] for i in project.get("instruments", [])]
print(f"Instruments     : {instruments}")
ptms = [m["name"] for m in project.get("ptms", [])]
print(f"PTMs            : {ptms[:5]}")
print(f"References      : {[r.get('doi') for r in project.get('references', [])[:2]]}")
```

### Query 3: Project Files

List all files in a project with their types (RAW, PEAK, RESULT, FASTA, OTHER) and download URLs.

```python
import requests
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def get_project_files(accession: str, file_type: str = None,
                      page_size: int = 100) -> pd.DataFrame:
    """List files available in a PRIDE project.

    Parameters
    ----------
    accession : str
        PRIDE accession.
    file_type : str
        Filter by file type: 'RAW', 'PEAK', 'RESULT', 'FASTA', 'OTHER'.
    page_size : int
        Files per page.
    """
    params = {"pageSize": page_size, "page": 0}
    if file_type:
        params["fileType"] = file_type

    records, page = [], 0
    while True:
        params["page"] = page
        r = requests.get(
            f"{PRIDE_BASE}/projects/{accession}/files",
            params=params,
            headers={"Accept": "application/json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        batch = data.get("_embedded", {}).get("files", [])
        if not batch:
            break
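        for f in batch:
            # publicFileLocations may be missing or an empty list
            locations = f.get("publicFileLocations") or [{}]
            records.append({
                "file_name": f.get("fileName"),
                "file_type": f.get("fileCategory", {}).get("value", ""),
                "size_bytes": f.get("fileSize", 0),
                "download_url": locations[0].get("value", ""),
            })
        total_pages = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages:
            break

    # Fixed columns keep the frame usable when no files match the filter
    df = pd.DataFrame(
        records,
        columns=["file_name", "file_type", "size_bytes", "download_url"],
    )
    df["size_mb"] = (df["size_bytes"].astype(float) / 1e6).round(1)
    return df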

files_df = get_project_files("PXD004131")
print(f"Total files: {len(files_df)}")
print(files_df.groupby("file_type")["file_name"].count())

# RAW files only
raw_files = get_project_files("PXD004131", file_type="RAW")
print(f"\nRAW files: {len(raw_files)}")
print(f"Total size: {raw_files['size_mb'].sum():.0f} MB")
print(raw_files[["file_name", "size_mb", "download_url"]].head(5).to_string(index=False))
```

### Query 4: Peptide Identifications

Retrieve peptide identifications for a project or search by peptide sequence, modification, or protein accession.

```python
import requests
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def get_peptides(project_accession: str = None, peptide_sequence: str = None,
                 protein_accession: str = None,
                 page_size: int = 100, page: int = 0) -> pd.DataFrame:
    """Retrieve peptide identifications from PRIDE.

    Parameters
    ----------
    project_accession : str
        Filter by PRIDE project accession.
    peptide_sequence : str
        Filter by exact or partial peptide sequence.
    protein_accession : str
        Filter by UniProt protein accession.
    page_size : int
        Results per page.
    """
    params = {"pageSize": page_size, "page": page}
    if project_accession:
        params["projectAccessions"] = project_accession
    if peptide_sequence:
        params["peptideSequence"] = peptide_sequence
    if protein_accession:
        params["proteinAccession"] = protein_accession

    r = requests.get(
        f"{PRIDE_BASE}/peptides",
        params=params,
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("_embedded", {}).get("peptideevidences", [])
    rows = []
    for rec in records:
        rows.append({
            "peptide_sequence": rec.get("peptideSequence"),
            "protein_accession": rec.get("proteinAccession"),
            "project_accession": rec.get("projectAccession"),
            "ptms": str([m.get("modification") for m in rec.get("modifications", [])]),
            "num_psms": rec.get("numberPSMs", 0),
        })
    return pd.DataFrame(rows)

# Get peptides from a specific project
pep_df = get_peptides(project_accession="PXD004131", page_size=50)
print(f"Peptides retrieved: {len(pep_df)}")
if len(pep_df) > 0:
    print(pep_df[["peptide_sequence", "protein_accession", "num_psms"]].head(8).to_string(index=False))
```

```python
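# Search peptides by sequence across all PRIDE
pep_hits = get_peptides(peptide_sequence="PEPTIDER", page_size=25)
# Each row is a peptide evidence record (one peptide in one project), not a single PSM
print(f"Peptide evidence records for 'PEPTIDER': {len(pep_hits)}")
if len(pep_hits) > 0:
    print(pep_hits.groupby("project_accession")["peptide_sequence"].count())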
```

### Query 5: PSM (Peptide-Spectrum Match) Retrieval

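Retrieve individual PSMs with charge state, calculated and experimental m/z, and spectrum references. The example below keeps only these core fields; it does not parse modifications or confidence scores from the response.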

```python
import requests
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def get_psms(project_accession: str = None, peptide_sequence: str = None,
             protein_accession: str = None,
             page_size: int = 100) -> pd.DataFrame:
    """Retrieve PSMs (peptide-spectrum matches) from PRIDE.

    Parameters
    ----------
    project_accession : str
        Filter by project accession.
    peptide_sequence : str
        Filter by peptide sequence.
    protein_accession : str
        Filter by UniProt protein accession.
    """
    params = {"pageSize": page_size, "page": 0}
    if project_accession:
        params["projectAccessions"] = project_accession
    if peptide_sequence:
        params["peptideSequence"] = peptide_sequence
    if protein_accession:
        params["proteinAccession"] = protein_accession

    r = requests.get(
        f"{PRIDE_BASE}/psms",
        params=params,
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("_embedded", {}).get("psms", [])
    rows = []
    for rec in records:
        rows.append({
            "psm_id": rec.get("psmId"),
            "peptide_sequence": rec.get("peptideSequence"),
            "protein_accession": rec.get("proteinAccession"),
            "project_accession": rec.get("projectAccession"),
            "charge": rec.get("charge"),
            # These fields are mass-to-charge ratios (m/z), not neutral masses
            "calculated_mz": rec.get("calculatedMassToCharge"),
            "experimental_mz": rec.get("experimentalMassToCharge"),
            "spectrum_id": rec.get("spectrumID"),
        })
    return pd.DataFrame(rows)

psm_df = get_psms(project_accession="PXD004131", page_size=50)
print(f"PSMs retrieved: {len(psm_df)}")
if len(psm_df) > 0:
    print(psm_df[["peptide_sequence", "protein_accession", "charge", "spectrum_id"]].head(5).to_string(index=False))
```
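
Because each PSM pairs a calculated and an experimental m/z, a common sanity check is the precursor error in parts per million: `(experimental - calculated) / calculated * 1e6`. The following is a minimal sketch, assuming both values are present and refer to the same charge state. The `ppm_error` and `within_tolerance` helpers, the 10 ppm default, and the sample values are illustrative and not taken from PRIDE.

```python
def ppm_error(experimental_mz: float, calculated_mz: float) -> float:
    """Signed precursor error in ppm, relative to the calculated m/z."""
    if calculated_mz <= 0:
        raise ValueError("calculated_mz must be positive")
    return (experimental_mz - calculated_mz) / calculated_mz * 1e6


def within_tolerance(experimental_mz: float, calculated_mz: float,
                     tol_ppm: float = 10.0) -> bool:
    """True when the absolute ppm error is at most `tol_ppm` (assumed cutoff)."""
    return abs(ppm_error(experimental_mz, calculated_mz)) <= tol_ppm


# Made-up values for illustration only
calc_mz, exp_mz = 500.0, 500.002
print(f"{ppm_error(exp_mz, calc_mz):.1f} ppm")
print(within_tolerance(exp_mz, calc_mz))
```

A suitable tolerance depends on the instrument and acquisition settings, so the cutoff is best treated as a per-project parameter.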

### Query 6: Protein Evidence

Query protein-level identification data — unique peptides, PSM counts, and cross-project evidence for a specific UniProt accession.

```python
import requests
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def get_protein_evidence(protein_accession: str,
                          page_size: int = 50) -> pd.DataFrame:
    """Retrieve protein identification evidence across PRIDE projects.

    Parameters
    ----------
    protein_accession : str
        UniProt accession (e.g., 'P04637' for TP53).
    page_size : int
        Results per page.
    """
    params = {
        "proteinAccession": protein_accession,
        "pageSize": page_size,
        "page": 0
    }
    r = requests.get(
        f"{PRIDE_BASE}/proteins",
        params=params,
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("_embedded", {}).get("proteinevidences", [])
    rows = []
    for rec in records:
        rows.append({
            "protein_accession": rec.get("proteinAccession"),
            "project_accession": rec.get("projectAccession"),
            "num_peptides": rec.get("numberPeptides", 0),
            "num_psms": rec.get("numberPSMs", 0),
            "coverage": rec.get("sequenceCoverage"),
        })
    return pd.DataFrame(rows)

prot_df = get_protein_evidence("P04637")   # TP53
print(f"TP53 (P04637) evidence across PRIDE: {len(prot_df)} entries")
if len(prot_df) > 0:
    prot_df = prot_df.sort_values("num_psms", ascending=False)
    print(prot_df[["project_accession", "num_peptides", "num_psms", "coverage"]].head(10).to_string(index=False))
    print(f"\nTotal projects with TP53 evidence: {prot_df['project_accession'].nunique()}")
```

## Key Concepts

### PRIDE File Types

Each PRIDE project can contain several file categories:

| File type | Description | Format examples |
|-----------|-------------|-----------------|
| `RAW` | Unprocessed instrument output | .raw (Thermo), .d (Bruker/Agilent) |
| `PEAK` | Centroided or deconvoluted spectra | mzML, mzXML, MGF |
| `RESULT` | Identification results from search engine | mzIdentML, PRIDE XML, MaxQuant txt |
| `FASTA` | Protein sequence databases used in search | .fasta |
| `OTHER` | Supplementary files (scripts, tables) | .txt, .xlsx, .csv |

For reanalysis pipelines, start with `RESULT` files for pre-processed identifications or `PEAK` files for re-searching spectra. Use `RAW` files only when you need to reprocess the vendor data yourself, for example to redo peak picking or format conversion.
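
The table above can be turned into a rough lookup for triaging a file list by name. This is a minimal sketch: the `guess_category` helper and its extension map are assumptions built from the format examples in the table (plus `.mzid`, the usual mzIdentML extension), and the authoritative category remains whatever the API reports in `fileCategory`.

```python
from pathlib import PurePosixPath

# Extension -> category, based on the format examples in the table above.
# Ambiguous extensions (.txt, .xlsx, .csv) fall through to OTHER.
EXTENSION_TO_CATEGORY = {
    ".raw": "RAW", ".d": "RAW",
    ".mzml": "PEAK", ".mzxml": "PEAK", ".mgf": "PEAK",
    ".mzid": "RESULT",
    ".fasta": "FASTA",
}
COMPRESSION_SUFFIXES = {".gz", ".zip"}


def guess_category(file_name: str) -> str:
    """Guess a PRIDE file category from the file name alone."""
    suffixes = [s.lower() for s in PurePosixPath(file_name).suffixes]
    # Ignore trailing compression suffixes such as .gz
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        return "OTHER"
    return EXTENSION_TO_CATEGORY.get(suffixes[-1], "OTHER")


for name in ["sample_01.raw", "sample_01.mzML.gz", "search.mzid", "notes.xlsx"]:
    print(f"{name:<20} {guess_category(name)}")
```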

### Accession Formats

PRIDE project accessions use the format `PXD######` (ProteomeXchange dataset ID). The same accession is referenced in publications and indexed by ProteomeXchange partner repositories (MassIVE, jPOST, iProX). Peptide IDs, PSM IDs, and protein IDs in the API responses use project-scoped internal identifiers.
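
Validating accessions before sending requests avoids spending rate-limited calls on malformed input. This is a minimal sketch, assuming the `PXD` prefix followed by exactly six digits as described above; `is_pxd_accession` and `find_pxd_accessions` are illustrative helper names.

```python
import re

# Assumption: "PXD" followed by exactly six digits, per the format above
PXD_PATTERN = re.compile(r"PXD\d{6}")


def is_pxd_accession(value: str) -> bool:
    """True when the whole string is a PXD accession."""
    return bool(PXD_PATTERN.fullmatch(value.strip()))


def find_pxd_accessions(text: str) -> list:
    """Extract PXD accessions from free text, in order, without duplicates."""
    seen, found = set(), []
    for acc in re.findall(r"\bPXD\d{6}\b", text):
        if acc not in seen:
            seen.add(acc)
            found.append(acc)
    return found


print(is_pxd_accession("PXD000001"))
print(find_pxd_accessions("Data are in PXD004131 and PXD000001; see PXD004131."))
```

The second helper is handy for pulling accessions out of a paper's data availability statement before querying the API.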

### Pagination

All PRIDE API list endpoints are paginated. The response includes a `page` object with `totalElements`, `totalPages`, `size`, and `number` fields. Use `page` (0-indexed) and `pageSize` parameters. For large result sets, iterate until `page >= totalPages`.

```python
import requests

def paginate_pride(endpoint: str, params: dict, result_key: str) -> list:
    """Generic paginator for any PRIDE list endpoint."""
    PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"
    # Copy so the caller's dict is not mutated by the page counter
    params = dict(params)
    all_records, page = [], 0
    while True:
        params["page"] = page
        r = requests.get(
            f"{PRIDE_BASE}/{endpoint}",
            params=params,
            headers={"Accept": "application/json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        batch = data.get("_embedded", {}).get(result_key, [])
        all_records.extend(batch)
        total_pages = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages or not batch:
            break
    return all_records
```
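
Before looping over a large result set, it can help to estimate how many requests the crawl will need. This is a minimal sketch that uses the `totalElements` field described above and the ~50 requests/minute budget from Prerequisites; `estimate_crawl` is an illustrative helper, and the element count in the example is made up.

```python
import math


def estimate_crawl(total_elements: int, page_size: int = 100,
                   seconds_per_request: float = 1.2) -> dict:
    """Estimate page count and minimum wall time for a paginated crawl."""
    if page_size < 1:
        raise ValueError("page_size must be at least 1")
    pages = math.ceil(total_elements / page_size)
    # One pause between consecutive requests, none after the last
    pauses = max(pages - 1, 0)
    return {"pages": pages,
            "min_seconds": round(pauses * seconds_per_request, 1)}


# Made-up example: 2,350 matching records at 100 per page
print(estimate_crawl(2350, page_size=100))
```

Reading `totalElements` from the first response and passing it here gives a quick lower bound on runtime, since network latency comes on top of the enforced pauses.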

## Common Workflows

### Workflow 1: Disease Proteomics Dataset Discovery

**Goal**: Find all PRIDE projects for a disease, summarize available data types, and export a ranked project list for manual review.

```python
import requests, time
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def discover_projects(disease: str, organism: str = "Homo sapiens",
                      page_size: int = 50) -> pd.DataFrame:
    """Retrieve and summarize all PRIDE projects for a disease."""
    records, page = [], 0
    while True:
        params = {
            "keyword": disease,
            "organisms": organism,
            "pageSize": page_size,
            "page": page
        }
        r = requests.get(
            f"{PRIDE_BASE}/projects",
            params=params,
            headers={"Accept": "application/json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        batch = data.get("_embedded", {}).get("compactprojects", [])
        for p in batch:
            records.append({
                "accession": p.get("accession"),
                "title": p.get("title", "")[:100],
                "submission_date": p.get("submissionDate"),
                "n_files": p.get("filesCount", 0),
                "instruments": ", ".join(
                    i.get("name", "") for i in p.get("instruments", [])[:2]
                ),
                "tissues": ", ".join(
                    t.get("name", "") for t in p.get("tissues", [])[:2]
                ),
            })
        total_pages = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages or not batch:
            break
        time.sleep(1.2)

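    # Fixed columns avoid a KeyError when the search returns no projects
    columns = ["accession", "title", "submission_date",
               "n_files", "instruments", "tissues"]
    df = pd.DataFrame(records, columns=columns)
    return df.sort_values("submission_date", ascending=False)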

disease = "breast cancer"
df = discover_projects(disease)
print(f"PRIDE projects for '{disease}': {len(df)}")
print(f"\nInstruments used:")
instr_counts = df["instruments"].str.split(", ").explode().value_counts()
print(instr_counts.head(8).to_string())
df.to_csv(f"{disease.replace(' ', '_')}_pride_projects.csv", index=False)
print(f"\nSaved {disease.replace(' ', '_')}_pride_projects.csv")
```

### Workflow 2: File Download Manager for a Project

**Goal**: List, filter, and generate download commands for files in a PRIDE project — selecting specific file types and formatting for wget or aria2c batch download.

```python
import requests
import pandas as pd
from pathlib import Path

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def build_download_manifest(accession: str,
                             file_types: list = None,
                             output_dir: str = ".") -> pd.DataFrame:
    """Build a download manifest for PRIDE project files.

    Parameters
    ----------
    accession : str
        PRIDE accession.
    file_types : list
        List of file types to include (e.g., ['RAW', 'PEAK', 'RESULT']).
        None = include all.
    output_dir : str
        Local directory for downloaded files.
    """
    params = {"pageSize": 100, "page": 0}  # 100 is the documented maximum page size
    records, page = [], 0
    while True:
        params["page"] = page
        r = requests.get(
            f"{PRIDE_BASE}/projects/{accession}/files",
            params=params,
            headers={"Accept": "application/json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        batch = data.get("_embedded", {}).get("files", [])
        for f in batch:
            ftype = f.get("fileCategory", {}).get("value", "OTHER")
            if file_types and ftype not in file_types:
                continue
            locations = f.get("publicFileLocations") or []
            # Prefer the FTP location; otherwise take the first one listed, if any
            url = next(
                (loc.get("value", "") for loc in locations
                 if loc.get("name") == "FTP Protocol"),
                locations[0].get("value", "") if locations else ""
            )
            records.append({
                "file_name": f.get("fileName"),
                "file_type": ftype,
                "size_mb": round(f.get("fileSize", 0) / 1e6, 1),
                "url": url,
                "local_path": str(Path(output_dir) / f.get("fileName", "unknown")),
            })
        total_pages = data.get("page", {}).get("totalPages", 1)
        page += 1
        if page >= total_pages or not batch:
            break

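    # Fixed columns keep downstream column access safe when nothing matches
    df = pd.DataFrame(
        records,
        columns=["file_name", "file_type", "size_mb", "url", "local_path"],
    )
    return df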

manifest = build_download_manifest(
    "PXD004131",
    file_types=["RAW", "RESULT"],
    output_dir="/data/pride/PXD004131"
)
print(f"Files to download: {len(manifest)}")
print(f"Total size: {manifest['size_mb'].sum():.0f} MB")
print(manifest.groupby("file_type")[["file_name", "size_mb"]].head(3).to_string())

# Export wget batch file, reusing the same output directory as the manifest
output_dir = "/data/pride/PXD004131"
wget_lines = [f"wget -P {output_dir} '{row.url}'"
              for _, row in manifest.iterrows() if row.url]
# Name the script after the project, since it may mix several file types
with open("download_PXD004131.sh", "w") as fh:
    fh.write("\n".join(wget_lines) + "\n")
print(f"\nExported wget script with {len(wget_lines)} download commands")
```
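
The goal above mentions aria2c, while the script writes only wget commands. This is a minimal sketch of the aria2c equivalent, assuming manifest rows with `url` and `file_name` fields like those built above. `to_aria2_input` is an illustrative helper, and the sample rows use placeholder URLs, not real PRIDE paths. aria2c reads such a file with `aria2c -i <file>`, where indented lines set options for the URL above them.

```python
def to_aria2_input(rows: list, output_dir: str) -> str:
    """Render manifest rows as the text of an aria2c input file."""
    lines = []
    for row in rows:
        url = row.get("url")
        if not url:
            continue  # skip files without a download location
        lines.append(url)
        lines.append(f"  dir={output_dir}")
        lines.append(f"  out={row['file_name']}")
    return "\n".join(lines) + ("\n" if lines else "")


# Placeholder rows for illustration only
sample_rows = [
    {"file_name": "run01.raw", "url": "ftp://example.org/PXD000000/run01.raw"},
    {"file_name": "run02.raw", "url": ""},
]
text = to_aria2_input(sample_rows, "/data/pride/example")
print(text)
```

With the manifest from this workflow, the rows can be obtained via `manifest.to_dict("records")` and the returned text written to a file for `aria2c -i`.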

### Workflow 3: Protein Evidence Summary Across Projects

**Goal**: For a list of proteins (e.g., from a differential expression analysis), retrieve proteomics evidence from PRIDE and summarize detection frequency.

```python
import requests, time
import pandas as pd
import matplotlib.pyplot as plt

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def get_protein_psm_counts(uniprot_acc: str, page_size: int = 100) -> dict:
    """Return PSM total and project count from the first page of evidence.

    Only the first `page_size` records are counted; paginate for full totals.
    """
    r = requests.get(
        f"{PRIDE_BASE}/proteins",
        params={"proteinAccession": uniprot_acc, "pageSize": page_size},
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    data = r.json()
    records = data.get("_embedded", {}).get("proteinevidences", [])
    total_psms = sum(rec.get("numberPSMs", 0) for rec in records)
    n_projects = len(set(rec.get("projectAccession") for rec in records if rec.get("projectAccession")))
    return {"uniprot": uniprot_acc, "total_psms": total_psms, "n_projects": n_projects}

# Candidate protein panel from a differential expression analysis
proteins_of_interest = ["P04637", "P38398", "P31749", "P40763", "O15530"]
# TP53, BRCA1, AKT1, STAT3, PDPK1

results = []
for acc in proteins_of_interest:
    try:
        r = get_protein_psm_counts(acc)
        results.append(r)
        print(f"  {acc}: {r['total_psms']:,} PSMs across {r['n_projects']} projects")
    except Exception as e:
        print(f"  {acc} failed: {e}")
    time.sleep(1.2)

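# Fixed columns keep sorting and plotting safe if every lookup failed
df = pd.DataFrame(
    results, columns=["uniprot", "total_psms", "n_projects"]
).sort_values("total_psms", ascending=False)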

# Bar chart
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(df["uniprot"], df["total_psms"], color="#3182BD")
ax.bar_label(bars, fmt="%d", fontsize=8, padding=2)
ax.set_xlabel("UniProt Accession")
ax.set_ylabel("Total PSMs in PRIDE")
ax.set_title("Proteomics Evidence Depth in PRIDE Archive")
plt.tight_layout()
plt.savefig("pride_protein_evidence.png", dpi=150, bbox_inches="tight")
print(f"\nSaved pride_protein_evidence.png")
df.to_csv("pride_protein_evidence_summary.csv", index=False)
```

## Key Parameters

| Parameter | Endpoint | Default | Range / Options | Effect |
|-----------|----------|---------|-----------------|--------|
| `keyword` | `GET /projects` | — | free-text string | Full-text search across title and description |
| `organisms` | `GET /projects` | — | organism name string (e.g., `"Homo sapiens"`) | Filter projects by organism |
| `tissues` | `GET /projects` | — | tissue name string (e.g., `"liver"`) | Filter projects by tissue |
| `diseases` | `GET /projects` | — | disease keyword | Filter projects by disease annotation |
| `instruments` | `GET /projects` | — | instrument name (e.g., `"Orbitrap"`) | Filter projects by MS instrument |
| `fileType` | `GET /projects/{acc}/files` | all | `RAW`, `PEAK`, `RESULT`, `FASTA`, `OTHER` | Filter files by category |
| `pageSize` | all list endpoints | `20` | `1`–`100` | Results per page |
| `page` | all list endpoints | `0` | non-negative integer | 0-indexed page for pagination |
| `projectAccessions` | `GET /peptides`, `/psms`, `/proteins` | — | `PXD######` string | Restrict identifications to a specific project |
| `proteinAccession` | `GET /peptides`, `/psms`, `/proteins` | — | UniProt accession | Filter by protein |
| `peptideSequence` | `GET /peptides`, `/psms` | — | amino acid sequence string | Filter by peptide sequence |

## Best Practices

1. **Start with `RESULT` files for fastest reanalysis**: mzIdentML or MaxQuant output files contain already-processed identifications and are much smaller than RAW files (MB vs GB). Use these for cross-study comparison without re-searching spectra.

2. **Add `time.sleep(1.2)` between project or file queries**: The PRIDE API enforces ~50 requests/minute. Batch scripts without delays will receive `HTTP 429` errors. For large surveys (100+ projects), implement exponential backoff.

3. **Prefer FTP URLs for large file downloads**: The API returns both HTTPS and FTP public file locations. FTP downloads are more reliable for large RAW files (>1 GB) and can be parallelized with `aria2c -x 8`.

4. **Filter by `fileType=RESULT` before downloading**: Projects often contain dozens of auxiliary files per sample. Fetching the file manifest and filtering by type avoids accidentally queuing gigabytes of RAW data when you only need identifications.

5. **Cross-reference PTMs with UniMod**: PTM annotations in PRIDE use PSI-MOD or UniMod accessions. When parsing modification fields from peptide or PSM responses, look up accessions at `https://www.unimod.org/` to translate to modification names and masses.
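
The exponential backoff mentioned in practice 2 can be sketched as follows. This is one simple variant under stated assumptions: delays start at the 1.2 s spacing used elsewhere in this document, double on each retry, and are capped at 60 s. `backoff_delays` and `call_with_backoff` are hypothetical helpers, and `fetch` is any zero-argument callable supplied by the caller.

```python
import time


def backoff_delays(max_retries: int = 5, base: float = 1.2,
                   factor: float = 2.0, cap: float = 60.0) -> list:
    """Delay (seconds) before each retry: base, base*factor, ... up to cap."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


def call_with_backoff(fetch, is_retryable, max_retries: int = 5,
                      sleep=time.sleep):
    """Call fetch() until is_retryable(result) is False or retries run out."""
    result = fetch()
    for delay in backoff_delays(max_retries):
        if not is_retryable(result):
            return result
        sleep(delay)
        result = fetch()
    return result  # last result, even if it is still retryable


print(backoff_delays(4))

# Possible use with requests (not executed here):
# resp = call_with_backoff(
#     lambda: requests.get(url, timeout=30),
#     lambda r: r.status_code == 429,
# )
```

Adding random jitter to each delay is a common refinement when several workers share one rate limit.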

## Common Recipes

### Recipe: Quick Project File Summary

When to use: Given a PRIDE accession, get a rapid count of file types and total dataset size before committing to a download.

```python
import requests

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

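def project_file_summary(accession: str) -> None:
    """Print file type breakdown and total size for a PRIDE project."""
    files, page = [], 0
    while True:
        r = requests.get(
            f"{PRIDE_BASE}/projects/{accession}/files",
            params={"pageSize": 100, "page": page},
            headers={"Accept": "application/json"},
            timeout=30
        )
        r.raise_for_status()
        data = r.json()
        batch = data.get("_embedded", {}).get("files", [])
        files.extend(batch)
        page += 1
        # Keep going until the last page so large projects are summarized in full
        if not batch or page >= data.get("page", {}).get("totalPages", 1):
            break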
    type_sizes: dict = {}
    for f in files:
        ftype = f.get("fileCategory", {}).get("value", "OTHER")
        size_mb = f.get("fileSize", 0) / 1e6
        type_sizes[ftype] = type_sizes.get(ftype, 0) + size_mb
    print(f"\n{accession} file summary:")
    for ftype, total_mb in sorted(type_sizes.items()):
        print(f"  {ftype:<10}  {total_mb:>8.0f} MB")
    print(f"  {'TOTAL':<10}  {sum(type_sizes.values()):>8.0f} MB")

project_file_summary("PXD004131")
```

### Recipe: Check If a Protein Has PRIDE Evidence

When to use: Quickly validate whether a protein of interest has any experimental proteomics evidence in PRIDE before designing targeted experiments.

```python
import requests

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def has_pride_evidence(uniprot_acc: str) -> tuple:
    """Return (has_evidence, n_projects, total_psms) for a UniProt accession.

    Counts cover only the first 10 evidence records, so treat them as lower bounds.
    """
    r = requests.get(
        f"{PRIDE_BASE}/proteins",
        params={"proteinAccession": uniprot_acc, "pageSize": 10},
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    records = r.json().get("_embedded", {}).get("proteinevidences", [])
    if not records:
        return False, 0, 0
    n_projects = len(set(rec.get("projectAccession") for rec in records))
    total_psms = sum(rec.get("numberPSMs", 0) for rec in records)
    return True, n_projects, total_psms

for acc in ["P04637", "Q99999"]:  # TP53, GAL3ST1
    has_ev, n_proj, n_psms = has_pride_evidence(acc)
    print(f"{acc}: evidence={has_ev}, projects={n_proj}, PSMs={n_psms}")
# Output format (actual values depend on current PRIDE contents):
# P04637: evidence=<True|False>, projects=<n>, PSMs=<n>
```

### Recipe: Find Projects with Specific PTM Data

When to use: Locate PRIDE datasets that include a specific post-translational modification (e.g., phosphorylation, ubiquitination) for a given organism.

```python
import requests
import pandas as pd

PRIDE_BASE = "https://www.ebi.ac.uk/pride/ws/archive/v2"

def find_ptm_projects(ptm_keyword: str, organism: str = "Homo sapiens",
                      page_size: int = 25,
                      require_ptm_annotation: bool = False) -> pd.DataFrame:
    """Search PRIDE for projects matching a PTM keyword.

    With require_ptm_annotation=True, keep only projects whose `ptms`
    annotation contains the keyword; otherwise keep every keyword hit.
    """
    r = requests.get(
        f"{PRIDE_BASE}/projects",
        params={
            "keyword": ptm_keyword,
            "organisms": organism,
            "pageSize": page_size
        },
        headers={"Accept": "application/json"},
        timeout=30
    )
    r.raise_for_status()
    projects = r.json().get("_embedded", {}).get("compactprojects", [])
    rows = []
    for p in projects:
        ptm_names = [m.get("name", "") for m in p.get("ptms", [])]
        annotated = any(ptm_keyword.lower() in name.lower() for name in ptm_names)
        if require_ptm_annotation and not annotated:
            continue
        rows.append({
            "accession": p.get("accession"),
            "title": p.get("title", "")[:80],
            "ptms": ", ".join(ptm_names[:4]),
            "n_files": p.get("filesCount", 0),
        })
    # Fixed columns keep column selection safe when nothing matches
    return pd.DataFrame(rows, columns=["accession", "title", "ptms", "n_files"])

phospho_projects = find_ptm_projects("phospho")
print(f"Phosphoproteomics projects: {len(phospho_projects)}")
print(phospho_projects[["accession", "title", "ptms"]].head(6).to_string(index=False))
```

## Troubleshooting

| Problem | Cause | Solution |
|---------|-------|----------|
| `HTTP 404` on project lookup | Accession not found or not public | Verify accession format (`PXD######`); some datasets are under embargo until publication |
| `HTTP 429 Too Many Requests` | Exceeded ~50 req/min rate limit | Add `time.sleep(1.2)` between requests; implement exponential backoff for bursts |
| Empty `_embedded` object | No results match the query | Broaden search terms; check organism spelling (exact match required, e.g., `"Homo sapiens"`) |
| Empty peptide/PSM results | Project has no identification data loaded | Newer projects may not yet have identifications indexed; use `RESULT` file download instead |
| Download URL is empty string | File not yet available on FTP | Check `publicFileLocations` list for alternative URLs; some files are HTTPS-only |
| Very large file manifest | Project has hundreds of files | Use `fileType` filter to restrict to relevant types; build a manifest before downloading |
| `ConnectionError` or `ReadTimeout` | Transient EBI infrastructure issue | Retry after 60 seconds; EBI services occasionally have brief maintenance windows |
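
Two rows in the table above (an empty `_embedded` object and an empty download URL) come down to defensive parsing of the response. This is a minimal sketch that works on already-parsed JSON dictionaries. `extract_embedded` and `first_location` are illustrative helpers, the `"FTP Protocol"` label follows Workflow 2, and the sample records are made up.

```python
def extract_embedded(data: dict, key: str) -> list:
    """Return the list under _embedded[key], or [] when any level is missing."""
    embedded = (data or {}).get("_embedded") or {}
    return embedded.get(key) or []


def first_location(file_record: dict, preferred: str = "FTP Protocol") -> str:
    """Return the preferred download URL, else the first listed, else ''."""
    locations = file_record.get("publicFileLocations") or []
    for loc in locations:
        if loc.get("name") == preferred:
            return loc.get("value", "")
    return locations[0].get("value", "") if locations else ""


# Made-up responses for illustration only
empty_response = {"page": {"totalElements": 0}}
record = {"publicFileLocations": [
    {"name": "Other Protocol", "value": "https://example.org/a.raw"},
    {"name": "FTP Protocol", "value": "ftp://example.org/a.raw"},
]}
print(extract_embedded(empty_response, "files"))
print(first_location(record))
print(repr(first_location({"publicFileLocations": []})))
```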

## Related Skills

- `interpro-database` — InterPro protein domain architecture and family classification; cross-reference proteins found in PRIDE with their structural annotations
- `uniprot-protein-database` — UniProt protein sequences, Swiss-Prot annotations, PTM sites, and disease associations; use after retrieving protein accessions from PRIDE
- `pdb-database` — PDB 3D structures for proteins with proteomics evidence in PRIDE

## References

- [PRIDE Archive REST API v2](https://www.ebi.ac.uk/pride/ws/archive/v2/) — Interactive Swagger API documentation and endpoint reference
- [Perez-Riverol et al., Nucleic Acids Research 2022](https://doi.org/10.1093/nar/gkab1038) — PRIDE 2022 update describing the repository and API
- [PRIDE web portal](https://www.ebi.ac.uk/pride/) — Interactive dataset browser and submission guide
- [ProteomeXchange Consortium](http://www.proteomexchange.org/) — Standard accession system shared across PRIDE, MassIVE, jPOST, and iProX