{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "# PMCGrab Tutorial\n",
    "\n",
    "**Welcome to PMCGrab!** This notebook walks you through:\n",
    "\n",
    "1. Fetching a paper from PubMed Central (PMC) and converting its messy XML to clean, AI-ready JSON\n",
    "2. Exploring the structured data\n",
    "3. Batch-processing several papers into a small dataset\n",
    "4. Preparing data for AI/ML workflows (RAG, vector databases, LLM training)\n",
    "5. Saving results to disk\n",
    "\n",
    "**Prerequisites:** Make sure you have PMCGrab installed:\n",
    "```bash\n",
    "uv add pmcgrab\n",
    "# or\n",
    "pip install pmcgrab\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## What Makes PMCGrab Special?\n",
    "\n",
    "PMCGrab turns messy PMC XML like this:\n",
    "\n",
    "```xml\n",
    "<sec sec-type=\"intro\">\n",
    "  <title>Introduction</title>\n",
    "  <p>Machine learning <xref ref-type=\"bibr\" rid=\"B1\">1</xref> has revolutionized...\n",
    "    <fig id=\"F1\"><graphic xlink:href=\"figure1.jpg\"/></fig>\n",
    "  </p>\n",
    "</sec>\n",
    "```\n",
    "\n",
    "into this clean structure:\n",
    "\n",
    "```json\n",
    "{\n",
    "  \"body\": {\n",
    "    \"Introduction\": \"Machine learning has revolutionized...\"\n",
    "  }\n",
    "}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Setup and Imports"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Import required libraries\n",
    "import json\n",
    "import warnings\n",
    "from pathlib import Path\n",
    "\n",
    "# PMCGrab imports\n",
    "from pmcgrab import Paper  # OOP interface (recommended)\n",
    "from pmcgrab.application.processing import process_single_pmc  # Dict-based interface\n",
    "\n",
    "print(\"All imports successful!\")\n",
    "print(\"Ready to process some scientific literature!\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 1: Process Your First Paper\n",
    "\n",
    "Let's start with a real paper from PMC. We'll use **PMC7114487**, a freely available research paper on Junin virus and PI3K/Akt signalling.\n",
    "\n",
    "The `Paper.from_pmc()` method downloads the XML from NCBI and parses it into a clean Python object. Pass `suppress_warnings=True` to silence any informational parsing warnings.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Process a single paper using the Paper class (recommended API)\n",
    "pmcid = \"7114487\"\n",
    "print(f\"Fetching PMC{pmcid} from PubMed Central...\")\n",
    "print(\"This might take 5-15 seconds...\\n\")\n",
    "\n",
    "paper = Paper.from_pmc(pmcid, suppress_warnings=True)\n",
    "\n",
    "if paper.has_data:\n",
    "    print(\"Success! Paper processed.\")\n",
    "    print(f\"  Title:    {paper.title}\")\n",
    "    print(f\"  Abstract: {paper.abstract_as_str()[:120]}...\")\n",
    "    print(f\"  Sections: {len(paper.body_as_dict())}\")\n",
    "else:\n",
    "    print(\"Failed to process the paper.\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 2: Explore the Paper Structure\n",
    "\n",
    "Let's look at the body sections PMCGrab extracted, with a word count and a short preview for each:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# body_as_dict() returns {section_title: clean_text}\n",
    "body = paper.body_as_dict()\n",
    "\n",
    "print(\"PAPER SECTIONS:\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for section_name, content in body.items():\n",
    "    word_count = len(content.split())\n",
    "    char_count = len(content)\n",
    "\n",
    "    print(f\"\\n{section_name}:\")\n",
    "    print(f\"   {word_count:,} words | {char_count:,} characters\")\n",
    "\n",
    "    # Show preview\n",
    "    preview = content[:200].replace(\"\\n\", \" \").strip()\n",
    "    print(f\"   Preview: {preview}...\")"
   ],
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 3: Batch Processing - Build a Dataset\n",
    "\n",
    "For batch work, the dict-based `process_single_pmc()` is often more convenient -- it returns a plain dictionary that is already JSON-serializable.\n",
    "\n",
    "Key fields in the returned dict:\n",
    "\n",
    "- `title` -- article title (string)\n",
    "- `abstract_text` -- plain-text abstract (string)\n",
    "- `abstract` -- structured abstract (dict of section -> text)\n",
    "- `body` -- body sections (dict of section title -> text)\n",
    "- `authors` -- author list (list of dicts)\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "paper_collection = {\n",
    "    \"7114487\": \"Junin virus PI3K/Akt signalling\",\n",
    "    \"3084273\": \"Nuclear domain 10 functional interaction\",\n",
    "    \"7181753\": \"Single-cell transcriptomes of human skin\",\n",
    "}\n",
    "\n",
    "print(f\"Processing {len(paper_collection)} papers for our dataset...\")\n",
    "print(\"This will take 1-3 minutes...\\n\")\n",
    "\n",
    "dataset = {}\n",
    "stats = {\"ok\": 0, \"fail\": 0}\n",
    "\n",
    "# Suppress informational parsing warnings for cleaner output\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter(\"ignore\")\n",
    "\n",
    "    for pmcid, description in paper_collection.items():\n",
    "        print(f\"  PMC{pmcid}: {description}\")\n",
    "        try:\n",
    "            data = process_single_pmc(pmcid)\n",
    "            if data:\n",
    "                dataset[pmcid] = data\n",
    "                stats[\"ok\"] += 1\n",
    "                print(f\"    -> '{data['title'][:60]}...'\\n\")\n",
    "            else:\n",
    "                stats[\"fail\"] += 1\n",
    "                print(\"    -> FAILED\\n\")\n",
    "        except Exception as e:\n",
    "            stats[\"fail\"] += 1\n",
    "            print(f\"    -> Error: {e}\\n\")\n",
    "\n",
    "print(f\"Done! {stats['ok']} successful, {stats['fail']} failed.\")\n",
    "print(f\"Dataset size: {len(dataset)} papers\")"
   ],
   "execution_count": null,
   "outputs": []
  },
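  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The key fields listed above can be condensed into a short summary per paper. This is a minimal sketch that assumes only the fields named in this notebook (`title`, `abstract_text`, `body`, `authors`). `summarize_record` is a hypothetical helper, not a PMCGrab function, and the sample record is invented so the block runs without a network call.\n",
    "\n",
    "```python\n",
    "def summarize_record(record):\n",
    "    \"\"\"Return basic counts for one paper dict (missing fields count as empty).\"\"\"\n",
    "    body = record.get(\"body\") or {}\n",
    "    return {\n",
    "        \"title\": record.get(\"title\", \"\"),\n",
    "        \"n_sections\": len(body),\n",
    "        \"n_authors\": len(record.get(\"authors\") or []),\n",
    "        \"abstract_words\": len((record.get(\"abstract_text\") or \"\").split()),\n",
    "        \"body_words\": sum(len(str(text).split()) for text in body.values()),\n",
    "    }\n",
    "\n",
    "\n",
    "# Invented record for illustration only (not real paper data)\n",
    "sample_record = {\n",
    "    \"title\": \"Example title\",\n",
    "    \"abstract_text\": \"A short example abstract.\",\n",
    "    \"body\": {\"Introduction\": \"Some intro text.\", \"Methods\": \"Some methods text here.\"},\n",
    "    \"authors\": [{}, {}],\n",
    "}\n",
    "print(summarize_record(sample_record))\n",
    "```\n",
    "\n",
    "After running the batch cell, `[summarize_record(d) for d in dataset.values()]` would give one summary per paper.\n"
   ]
  },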
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 4: Prepare Data for AI/ML Workflows\n",
    "\n",
    "### RAG (Retrieval-Augmented Generation) Preparation\n",
    "\n",
    "Each paper dict contains `abstract_text` (plain string) and `body` (dict of section title to text). We can split these into chunks suitable for a vector database like Pinecone, Weaviate, or ChromaDB:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Prepare chunks for a RAG pipeline\n",
    "rag_chunks = []\n",
    "\n",
    "# Named paper_dict so the Paper object from Step 1 is not overwritten\n",
    "for pmcid, paper_dict in dataset.items():\n",
    "    # Use abstract_text (plain string) -- NOT abstract (structured dict)\n",
    "    abstract_text = paper_dict.get(\"abstract_text\", \"\")\n",
    "\n",
    "    # Add abstract as a chunk\n",
    "    if abstract_text:\n",
    "        rag_chunks.append(\n",
    "            {\n",
    "                \"id\": f\"PMC{pmcid}_abstract\",\n",
    "                \"source\": f\"PMC{pmcid}\",\n",
    "                \"section\": \"Abstract\",\n",
    "                \"content\": abstract_text,\n",
    "                \"metadata\": {\n",
    "                    \"title\": paper_dict[\"title\"],\n",
    "                    \"journal\": paper_dict.get(\"journal_title\", \"\"),\n",
    "                    \"content_type\": \"abstract\",\n",
    "                    \"word_count\": len(abstract_text.split()),\n",
    "                },\n",
    "            }\n",
    "        )\n",
    "\n",
    "    # Add each body section as a chunk (tolerate a missing or empty body)\n",
    "    for section_title, content in (paper_dict.get(\"body\") or {}).items():\n",
    "        rag_chunks.append(\n",
    "            {\n",
    "                \"id\": f\"PMC{pmcid}_{section_title.lower().replace(' ', '_')}\",\n",
    "                \"source\": f\"PMC{pmcid}\",\n",
    "                \"section\": section_title,\n",
    "                \"content\": content,\n",
    "                \"metadata\": {\n",
    "                    \"title\": paper_dict[\"title\"],\n",
    "                    \"journal\": paper_dict.get(\"journal_title\", \"\"),\n",
    "                    \"content_type\": \"section\",\n",
    "                    \"word_count\": len(content.split()),\n",
    "                },\n",
    "            }\n",
    "        )\n",
    "\n",
    "print(f\"Created {len(rag_chunks)} RAG chunks from {len(dataset)} papers\")\n",
    "if rag_chunks:\n",
    "    avg_words = sum(c[\"metadata\"][\"word_count\"] for c in rag_chunks) / len(rag_chunks)\n",
    "    print(f\"Average chunk size: {avg_words:.0f} words\")\n",
    "\n",
    "    # Show a sample chunk\n",
    "    print(\"\\nSAMPLE RAG CHUNK:\")\n",
    "    print(\"=\" * 50)\n",
    "    sample = rag_chunks[0]\n",
    "    print(f\"ID:         {sample['id']}\")\n",
    "    print(f\"Section:    {sample['section']}\")\n",
    "    print(f\"Word Count: {sample['metadata']['word_count']}\")\n",
    "    print(f\"Preview:    {sample['content'][:200]}...\")"
   ],
   "execution_count": null,
   "outputs": []
  },
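  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Whole sections can run to thousands of words, which may be more than an embedding model accepts in one input. The sketch below shows one possible way to split a section's text into overlapping word windows before indexing. `split_into_windows`, `max_words`, and `overlap` are hypothetical names chosen for illustration, not part of PMCGrab, and the sizes used are arbitrary.\n",
    "\n",
    "```python\n",
    "def split_into_windows(text, max_words=200, overlap=20):\n",
    "    \"\"\"Split text into windows of at most max_words words; neighbours share overlap words.\"\"\"\n",
    "    if max_words <= 0 or not 0 <= overlap < max_words:\n",
    "        raise ValueError(\"need max_words > 0 and 0 <= overlap < max_words\")\n",
    "    words = text.split()\n",
    "    step = max_words - overlap\n",
    "    windows = []\n",
    "    for start in range(0, len(words), step):\n",
    "        windows.append(\" \".join(words[start:start + max_words]))\n",
    "        if start + max_words >= len(words):\n",
    "            break  # this window already reaches the end of the text\n",
    "    return windows\n",
    "\n",
    "\n",
    "# Hypothetical stand-in for one body section (not real paper text)\n",
    "demo_text = \" \".join(f\"w{i}\" for i in range(450))\n",
    "demo_windows = split_into_windows(demo_text, max_words=200, overlap=20)\n",
    "print([len(w.split()) for w in demo_windows])\n",
    "```\n",
    "\n",
    "To use it in the RAG cell, you could call it on each section's `content` and append one chunk per window, adding the window index to the chunk `id` so the IDs stay unique.\n"
   ]
  },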
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 5: Save Your Data\n",
    "\n",
    "Let's save each paper as its own JSON file and write all RAG chunks to a single file for future use:\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "source": [
    "# Create output directory\n",
    "output_dir = Path(\"pmcgrab_tutorial_output\")\n",
    "output_dir.mkdir(exist_ok=True)\n",
    "\n",
    "# 1. Save individual paper JSON files\n",
    "papers_dir = output_dir / \"papers\"\n",
    "papers_dir.mkdir(exist_ok=True)\n",
    "\n",
    "for pmcid, paper_dict in dataset.items():\n",
    "    dest = papers_dir / f\"PMC{pmcid}.json\"\n",
    "    with open(dest, \"w\", encoding=\"utf-8\") as f:\n",
    "        json.dump(paper_dict, f, indent=2, ensure_ascii=False)\n",
    "\n",
    "print(f\"Saved {len(dataset)} paper files to {papers_dir}/\")\n",
    "\n",
    "# 2. Save RAG chunks\n",
    "chunks_path = output_dir / \"rag_chunks.json\"\n",
    "with open(chunks_path, \"w\", encoding=\"utf-8\") as f:\n",
    "    json.dump(rag_chunks, f, indent=2, ensure_ascii=False)\n",
    "\n",
    "print(f\"Saved {len(rag_chunks)} RAG chunks to {chunks_path}\")\n",
    "print(f\"\\nAll data saved to '{output_dir}/'.\")"
   ],
   "execution_count": null,
   "outputs": []
  },
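  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some ingestion tools expect JSON Lines (one JSON object per line) rather than a single large array. The sketch below writes chunks in that layout and reads them back. `write_jsonl` and `read_jsonl` are hypothetical helpers built on the standard library only, and the demo chunk is invented so the block runs on its own.\n",
    "\n",
    "```python\n",
    "import json\n",
    "import tempfile\n",
    "from pathlib import Path\n",
    "\n",
    "\n",
    "def write_jsonl(records, path):\n",
    "    \"\"\"Write one JSON object per line.\"\"\"\n",
    "    with open(path, \"w\", encoding=\"utf-8\") as f:\n",
    "        for record in records:\n",
    "            f.write(json.dumps(record, ensure_ascii=False) + \"\\n\")\n",
    "\n",
    "\n",
    "def read_jsonl(path):\n",
    "    \"\"\"Read a JSON Lines file back into a list of dicts, skipping blank lines.\"\"\"\n",
    "    with open(path, encoding=\"utf-8\") as f:\n",
    "        return [json.loads(line) for line in f if line.strip()]\n",
    "\n",
    "\n",
    "# Invented chunk for illustration only\n",
    "demo_chunks = [{\"id\": \"PMC0_abstract\", \"content\": \"Example text\"}]\n",
    "with tempfile.TemporaryDirectory() as tmp:\n",
    "    demo_path = Path(tmp) / \"rag_chunks.jsonl\"\n",
    "    write_jsonl(demo_chunks, demo_path)\n",
    "    round_trip = read_jsonl(demo_path)\n",
    "print(round_trip == demo_chunks)\n",
    "```\n",
    "\n",
    "With the real data, `write_jsonl(rag_chunks, output_dir / \"rag_chunks.jsonl\")` would sit alongside the JSON array saved above.\n"
   ]
  },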
  {
   "cell_type": "markdown",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Next Steps\n",
    "\n",
    "You now know how to:\n",
    "\n",
    "- Fetch and parse PMC papers using `Paper.from_pmc()` (OOP) or `process_single_pmc()` (dict)\n",
    "- Explore structured sections and abstracts\n",
    "- Build RAG-ready chunks from the parsed data\n",
    "- Save everything as JSON\n",
    "\n",
    "### Where to go from here\n",
    "\n",
    "1. **Scale up** -- use `process_pmc_ids()` or the CLI (`pmcgrab --pmcids ...`) for bulk processing\n",
    "2. **Local XML** -- process pre-downloaded XML with `Paper.from_local_xml()` or `process_single_local_xml()`\n",
    "3. **Vector databases** -- load your RAG chunks into Pinecone, Weaviate, or ChromaDB\n",
    "4. **Knowledge graphs** -- use the citation and author data to build relational graphs\n",
    "\n",
    "### Quick reference\n",
    "\n",
    "| Task | Code |\n",
    "|------|------|\n",
    "| Single paper (OOP) | `Paper.from_pmc(\"7181753\", suppress_warnings=True)` |\n",
    "| Single paper (dict) | `process_single_pmc(\"7181753\")` |\n",
    "| Local XML file | `Paper.from_local_xml(\"article.xml\")` |\n",
    "| Batch (network) | `process_pmc_ids([\"7181753\", \"3084273\"])` |\n",
    "| CLI | `pmcgrab --pmcids 7181753 3084273 --output-dir ./out` |\n",
    "\n",
    "### Resources\n",
    "\n",
    "- [PMCGrab Documentation](https://rajdeepmondaldotcom.github.io/pmcgrab/)\n",
    "- [GitHub Repository](https://github.com/rajdeepmondaldotcom/pmcgrab)\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}