--- name: tcga-bulk-data-preprocessing-with-omicverse title: TCGA bulk data preprocessing with omicverse description: Guide Claude through ingesting TCGA sample sheets, expression archives, and clinical carts into omicverse, initialising survival metadata, and exporting annotated AnnData files. --- # TCGA bulk data preprocessing with omicverse ## Overview Follow this skill to recreate the preprocessing routine from [`t_tcga.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_tcga.ipynb). It automates loading TCGA downloads, generating raw/normalised matrices, initialising metadata, and running survival analyses through `ov.bulk.pyTCGA`. ## Instructions 1. **Gather required downloads** - Confirm the user has: - `gdc_sample_sheet..tsv` from the TCGA Sample Sheet export. - The decompressed `gdc_download_xxxxx` directory containing expression archives. - The `clinical.cart.` directory with clinical XML/JSON files. - Mention that sample data are available under [`omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/`](../../omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/). 2. **Initialise the TCGA helper** - Import `omicverse as ov` (and `scanpy as sc` if plotting) then call `ov.plot_set()`. - Instantiate `aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir)`. - Run `aml_tcga.adata_init()` to build the AnnData object with raw counts, FPKM, and TPM layers. 3. **Persist the dataset** - Encourage saving the initial AnnData: `aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip')`. - When reloading, reconstruct the class with the same paths and call `aml_tcga.adata_read()`. 4. **Initialise metadata and clinical information** - Populate sample metadata using `aml_tcga.adata_meta_init()` to convert gene IDs to symbols and attach patient info. - Add survival attributes via `aml_tcga.survial_init()` (note the intentional spelling in the API). 5. **Perform survival analyses** - Plot gene-level survival curves with `aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True)`. - To process all genes, call `aml_tcga.survial_analysis_all()`; warn that it may take time. 6. **Export results** - Save enriched metadata to a new AnnData file (`aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip')`). - Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python. 7. **Troubleshooting tips** - Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors. - The helper expects matching case IDs between the sample sheet and expression files—direct users to re-download if IDs do not align. - Survival plots require clinical dates; if absent, instruct users to check the `clinical_cart` contents. ## Examples - "Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts." - "Reload a saved AnnData file, attach survival annotations, and export the updated `.h5ad`." - "Run survival analysis for all genes and store the enriched dataset." ## References - Tutorial notebook: [`t_tcga.ipynb`](../../omicverse_guide/docs/Tutorials-bulk/t_tcga.ipynb) - Sample dataset: [`data/TCGA_OV/`](../../omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/) - Quick copy/paste commands: [`reference.md`](reference.md)