{ "cells": [ { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "# PMCGrab Tutorial\n", "\n", "**Welcome to PMCGrab!** This notebook walks you through the complete process of:\n", "\n", "1. Fetching scientific papers from PubMed Central (PMC)\n", "2. Converting messy XML to clean, AI-ready JSON\n", "3. Exploring the structured data\n", "4. Preparing data for AI/ML workflows (RAG, vector databases, LLM training)\n", "5. Saving results to disk\n", "\n", "**Prerequisites:** Make sure you have PMCGrab installed:\n", "```bash\n", "uv add pmcgrab\n", "# or\n", "pip install pmcgrab\n", "```\n" ] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## What Makes PMCGrab Special?\n", "\n", "PMCGrab transforms this messy XML:\n", "\n", "```xml\n", "<sec>\n", "  <title>Introduction</title>\n", "  <p>Machine learning <xref ref-type=\"bibr\">1</xref> has revolutionized...</p>\n", "</sec>\n", "```\n", "\n", "Into this clean structure:\n", "\n", "```json\n", "{\n", "  \"body\": {\n", "    \"Introduction\": \"Machine learning has revolutionized...\"\n", "  }\n", "}\n", "```\n" ] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Setup and Imports" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Import required libraries\n", "import json\n", "import warnings\n", "from pathlib import Path\n", "\n", "# PMCGrab imports\n", "from pmcgrab import Paper  # OOP interface (recommended)\n", "from pmcgrab.application.processing import process_single_pmc  # Dict-based interface\n", "\n", "print(\"All imports successful!\")\n", "print(\"Ready to process some scientific literature!\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Step 1: Process Your First Paper\n", "\n", "Let's start with a real paper from PMC. We'll use **PMC7114487** -- a freely available research paper.\n", "\n", "The `Paper.from_pmc()` method downloads the XML from NCBI and parses it into a clean Python object. Pass `suppress_warnings=True` to silence informational parsing warnings.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Process a single paper using the Paper class (recommended API)\n", "pmcid = \"7114487\"\n", "print(f\"Fetching PMC{pmcid} from PubMed Central...\")\n", "print(\"This might take 5-15 seconds...\\n\")\n", "\n", "paper = Paper.from_pmc(pmcid, suppress_warnings=True)\n", "\n", "if paper.has_data:\n", "    print(\"Success! Paper processed.\")\n", "    print(f\" Title: {paper.title}\")\n", "    print(f\" Abstract: {paper.abstract_as_str()[:120]}...\")\n", "    print(f\" Sections: {len(paper.body_as_dict())}\")\n", "else:\n", "    print(\"Failed to process the paper.\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Step 2: Explore the Paper Structure\n", "\n", "Let's look at what PMCGrab extracted, section by section:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# body_as_dict() returns {section_title: clean_text}\n", "body = paper.body_as_dict()\n", "\n", "print(\"PAPER SECTIONS:\")\n", "print(\"=\" * 50)\n", "\n", "for section_name, content in body.items():\n", "    word_count = len(content.split())\n", "    char_count = len(content)\n", "\n", "    print(f\"\\n{section_name}:\")\n", "    print(f\" {word_count:,} words | {char_count:,} characters\")\n", "\n", "    # Show preview\n", "    preview = content[:200].replace(\"\\n\", \" \").strip()\n", "    print(f\" Preview: {preview}...\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Step 3: Batch Processing - Build a Dataset\n", "\n", "For batch work, the dict-based `process_single_pmc()` is often more convenient -- it returns a plain dictionary that is already JSON-serializable.\n", "\n", "Key fields in the returned dict:\n", "- `title` -- article title (string)\n", "- `abstract_text` -- plain-text abstract (string)\n", "- `abstract` -- structured abstract (dict of section -> text)\n", "- `body` -- body sections (dict of section title -> text)\n", "- `authors` -- author list (list of dicts)\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "paper_collection = {\n", "    \"7114487\": \"Junin virus PI3K/Akt signalling\",\n", "    \"3084273\": \"Nuclear domain 10 functional interaction\",\n", "    \"7181753\": \"Single-cell transcriptomes of human skin\",\n", "}\n", "\n", 
"print(f\"Processing {len(paper_collection)} papers for our dataset...\")\n", "print(\"This will take 1-3 minutes...\\n\")\n", "\n", "dataset = {}\n", "stats = {\"ok\": 0, \"fail\": 0}\n", "\n", "# Suppress informational parsing warnings for cleaner output\n", "with warnings.catch_warnings():\n", "    warnings.simplefilter(\"ignore\")\n", "\n", "    for pmcid, description in paper_collection.items():\n", "        print(f\" PMC{pmcid}: {description}\")\n", "        try:\n", "            data = process_single_pmc(pmcid)\n", "            if data:\n", "                dataset[pmcid] = data\n", "                stats[\"ok\"] += 1\n", "                print(f\" -> '{data['title'][:60]}...'\\n\")\n", "            else:\n", "                stats[\"fail\"] += 1\n", "                print(\" -> FAILED\\n\")\n", "        except Exception as e:\n", "            stats[\"fail\"] += 1\n", "            print(f\" -> Error: {e}\\n\")\n", "\n", "print(f\"Done! {stats['ok']} successful, {stats['fail']} failed.\")\n", "print(f\"Dataset size: {len(dataset)} papers\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Step 4: Prepare Data for AI/ML Workflows\n", "\n", "### RAG (Retrieval-Augmented Generation) Preparation\n", "\n", "Each paper dict contains `abstract_text` (plain string) and `body` (dict of section title to text). We can turn the abstract and each body section into a chunk suitable for a vector database like Pinecone, Weaviate, or ChromaDB:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Prepare chunks for a RAG pipeline\n", "rag_chunks = []\n", "\n", "# Loop over paper_dict (not paper) so the Paper object from Step 1 is not overwritten\n", "for pmcid, paper_dict in dataset.items():\n", "    # Use abstract_text (plain string) -- NOT abstract (structured dict)\n", "    abstract_text = paper_dict.get(\"abstract_text\", \"\")\n", "\n", "    # Add abstract as a chunk\n", "    if abstract_text:\n", "        rag_chunks.append(\n", "            {\n", "                \"id\": f\"PMC{pmcid}_abstract\",\n", "                \"source\": f\"PMC{pmcid}\",\n", "                \"section\": \"Abstract\",\n", "                \"content\": abstract_text,\n", "                \"metadata\": {\n", "                    \"title\": paper_dict[\"title\"],\n", "                    \"journal\": paper_dict.get(\"journal_title\", \"\"),\n", "                    \"content_type\": \"abstract\",\n", "                    \"word_count\": len(abstract_text.split()),\n", "                },\n", "            }\n", "        )\n", "\n", "    # Add each body section as a chunk\n", "    for section_title, content in paper_dict[\"body\"].items():\n", "        rag_chunks.append(\n", "            {\n", "                \"id\": f\"PMC{pmcid}_{section_title.lower().replace(' ', '_')}\",\n", "                \"source\": f\"PMC{pmcid}\",\n", "                \"section\": section_title,\n", "                \"content\": content,\n", "                \"metadata\": {\n", "                    \"title\": paper_dict[\"title\"],\n", "                    \"journal\": paper_dict.get(\"journal_title\", \"\"),\n", "                    \"content_type\": \"section\",\n", "                    \"word_count\": len(content.split()),\n", "                },\n", "            }\n", "        )\n", "\n", "print(f\"Created {len(rag_chunks)} RAG chunks from {len(dataset)} papers\")\n", "if rag_chunks:\n", "    avg_words = sum(c[\"metadata\"][\"word_count\"] for c in rag_chunks) / len(rag_chunks)\n", "    print(f\"Average chunk size: {avg_words:.0f} words\")\n", "\n", "    # Show a sample chunk\n", "    print(\"\\nSAMPLE RAG CHUNK:\")\n", "    print(\"=\" * 50)\n", "    sample = rag_chunks[0]\n", "    print(f\"ID: {sample['id']}\")\n", "    print(f\"Section: {sample['section']}\")\n", "    print(f\"Word Count: {sample['metadata']['word_count']}\")\n", "    print(f\"Preview: {sample['content'][:200]}...\")" ], 
"execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Step 5: Save Your Data\n", "\n", "Let's save all the processed data for future use:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Create output directory\n", "output_dir = Path(\"pmcgrab_tutorial_output\")\n", "output_dir.mkdir(exist_ok=True)\n", "\n", "# 1. Save individual paper JSON files\n", "papers_dir = output_dir / \"papers\"\n", "papers_dir.mkdir(exist_ok=True)\n", "\n", "for pmcid, paper_dict in dataset.items():\n", "    dest = papers_dir / f\"PMC{pmcid}.json\"\n", "    with open(dest, \"w\", encoding=\"utf-8\") as f:\n", "        json.dump(paper_dict, f, indent=2, ensure_ascii=False)\n", "\n", "print(f\"Saved {len(dataset)} paper files to {papers_dir}/\")\n", "\n", "# 2. Save RAG chunks\n", "chunks_path = output_dir / \"rag_chunks.json\"\n", "with open(chunks_path, \"w\", encoding=\"utf-8\") as f:\n", "    json.dump(rag_chunks, f, indent=2, ensure_ascii=False)\n", "\n", "print(f\"Saved {len(rag_chunks)} RAG chunks to {chunks_path}\")\n", "print(f\"\\nAll data saved to '{output_dir}/'.\")" ], "execution_count": null, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "vscode": { "languageId": "raw" } }, "source": [ "## Next Steps\n", "\n", "You now know how to:\n", "\n", "- Fetch and parse PMC papers using `Paper.from_pmc()` (OOP) or `process_single_pmc()` (dict)\n", "- Explore structured sections and abstracts\n", "- Build RAG-ready chunks from the parsed data\n", "- Save everything as JSON\n", "\n", "### Where to go from here\n", "\n", "1. **Scale up** -- use `process_pmc_ids()` or the CLI (`pmcgrab --pmcids ...`) for bulk processing\n", "2. **Local XML** -- process pre-downloaded XML with `Paper.from_local_xml()` or `process_single_local_xml()`\n", "3. **Vector databases** -- load your RAG chunks into Pinecone, Weaviate, or ChromaDB\n", "4. **Knowledge graphs** -- use the citation and author data to build relational graphs\n", "\n", "### Quick reference\n", "\n", "| Task | Code |\n", "|------|------|\n", "| Single paper (OOP) | `Paper.from_pmc(\"7181753\", suppress_warnings=True)` |\n", "| Single paper (dict) | `process_single_pmc(\"7181753\")` |\n", "| Local XML file | `Paper.from_local_xml(\"article.xml\")` |\n", "| Batch (network) | `process_pmc_ids([\"7181753\", \"3084273\"])` |\n", "| CLI | `pmcgrab --pmcids 7181753 3084273 --output-dir ./out` |\n", "\n", "### Resources\n", "\n", "- [PMCGrab Documentation](https://rajdeepmondaldotcom.github.io/pmcgrab/)\n", "- [GitHub Repository](https://github.com/rajdeepmondaldotcom/pmcgrab)\n" ] } ],
"metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.11" } }, "nbformat": 4, "nbformat_minor": 2 }