---
name: pseo-data
description: Design and implement the structured data architecture that powers programmatic SEO pages, including content models, data sources, slug generation, and data-fetching layers. Use when setting up or refactoring the data foundation for pSEO, designing content models, or building the data pipeline that feeds page templates.
argument-hint: "[source: cms | json | database | api | mdx]"
allowed-tools: Read, Glob, Grep, Bash, Edit, Write
---

# pSEO Data Architecture

Design and implement the structured data layer that feeds all programmatic SEO pages. This is the foundation every other pSEO skill depends on.

## Core Principles

1. **Single source of truth**: All page data flows from one data layer
2. **SEO-complete models**: Every content model includes all fields needed for metadata, schema markup, and linking
3. **Unique slugs by construction**: Slug generation enforces uniqueness at the data level
4. **Type safety**: All data models are fully typed (TypeScript interfaces/types)
5. **Separation of concerns**: Data fetching is decoupled from page rendering

## Implementation Steps

### 1. Define Content Models

Create TypeScript interfaces for each page type using a two-tier model. The lightweight index tier is safe to hold in memory for all pages; the heavy full tier is loaded per-page only.

```typescript
// Index tier: safe to load all at once (~1KB per page)
interface PageIndex {
  slug: string;            // unique, URL-safe
  title: string;           // page title (50-60 chars target)
  metaDescription: string; // meta description (150-160 chars target)
  h1: string;              // primary heading (can differ from title)
  canonicalPath: string;   // canonical URL path
  category: string;        // for hub-spoke and breadcrumbs
  lastModified: string;    // ISO date for sitemap
}

// Full tier: extends PageIndex with heavy fields (~50-500KB per page)
interface BaseSEOContent extends PageIndex {
  introText: string;
  bodyContent: string;
  faqs?: FAQ[];
  relatedSlugs?: string[];
  featuredImage?: SEOImage;
}
```

Extend `BaseSEOContent` for each page type with domain-specific fields. The interfaces above show the minimum required fields. See `references/content-models.md` for the full definitions (which add `subcategory`, `tags`, `publishedDate`, `status`, and more) and extended type examples (LocationPage, ProductPage, ComparisonPage, CategoryPage).

### 2. Build the Data-Fetching Layer

Create a centralized data module (e.g., `lib/data.ts` or `src/data/index.ts`) that exports:

- `getAllSlugs()` - Returns all valid slugs for static generation. Must handle pagination internally when the data source has 1000+ records (fetch in batches, return the complete list).
- `getPageData(slug)` - Returns full content for a single page
- `getPagesByCategory(category, opts?)` - Returns pages in a category for hub pages. Accept optional `limit` and `offset` for paginated hub pages.
- `getRelatedPages(slug, limit?)` - Returns related pages for internal linking
- `getAllCategories()` - Returns all categories for navigation and hubs
- `getPageCount()` - Returns total page count (useful for sitemap splitting and build diagnostics)

All functions must be:

- Cached or memoized during build to avoid redundant reads
- Typed with explicit return types
- Guarded against missing or malformed data
- Internally paginated when the data source imposes limits (e.g., CMS APIs with 100-item pages). The consumer should never need to handle pagination; the data layer abstracts it.

### 3. Implement Slug Generation

Design a slug strategy that:

- Produces URL-safe, lowercase, hyphenated strings
- Guarantees uniqueness across the entire dataset
- Is deterministic (same input always produces same slug)
- Includes a collision detection mechanism
- Follows a consistent URL hierarchy (e.g., `/category/page-slug`)

### 4. Validate Data Integrity

Build a validation function or script that checks:

- No duplicate slugs exist
- All required fields are present and non-empty
- Title and description lengths are within SEO targets
- All category references resolve to valid categories
- No orphan pages (pages not reachable through any category)

### 5. Set Up Data Source Integration

Based on the data source (`$ARGUMENTS` or detected):

**JSON files**: Create a `data/` directory with typed JSON, a loader, and build-time validation.

**CMS (headless)**: Create API client with typed responses, implement caching, handle pagination for 1000+ items.

**Database**: Create a query layer with connection pooling, implement cursor-based pagination, add query caching.

**MDX files**: Set up frontmatter schema validation, create a content loader with gray-matter parsing.

**API**: Create a typed API client, implement rate limiting and retry logic, add response caching.

## Scale Limits

The in-memory and file-based patterns in this skill work up to ~10K pages.
Beyond that:

- **10K-50K pages**: Requires a database (PostgreSQL, MySQL). In-memory index tier becomes borderline at 50K (~50MB). File-based data sources are too slow.
- **50K-100K+ pages**: Requires database + cache layer (Redis) + cursor-based pagination. `getAllSlugs()` must use cursor iteration, not array return. Data sufficiency gating prevents generating thin pages.

See **pseo-scale** for the complete database-backed data layer, sufficiency scoring, and scale-specific patterns.

## Memory-Conscious Data Patterns

At 1000+ pages, how data is loaded matters more than what is loaded. A full content model with body text, FAQs, and images can be 50-500KB per page. Loading every page's full content into memory at once can exhaust available memory (OOM).

**Two-tier data model:**

Split the data layer into lightweight index data and full page data. The `PageIndex` and `BaseSEOContent` interfaces from section 1 define the two tiers:

- `getAllSlugs()`, `getRelatedPages()`, `getPagesByCategory()`: return `PageIndex[]` (lightweight, ~1KB per page)
- `getPageData()`: returns `BaseSEOContent` (or an extended type) for a single page (heavy, ~50-500KB per page, only one at a time)

**Never do this:**

```typescript
// Loads ALL full content into memory; will OOM at scale
const allPages = await Promise.all(slugs.map(s => getPageData(s)));
```

**Instead:**

```typescript
// Process pages one at a time or in small batches
for (const slug of slugs) {
  const page = await getPageData(slug);
  await processPage(page);
  // page goes out of scope here, so it is eligible for garbage collection
}
```

**CMS/API pagination:**

- Fetch in batches of 100-250 records
- Yield or push to an array incrementally; don't hold all API responses in memory simultaneously
- If using GraphQL, only request index fields in list queries, full fields in single-item queries

## File Organization

```
lib/
  data/
    index.ts          # public API (re-exports)
    types.ts          # TypeScript interfaces
    fetcher.ts        # data source integration
    slugs.ts          # slug generation and validation
    validation.ts     # data integrity checks
    cache.ts          # build-time caching utilities
```

## Quality Checks

Before considering this complete:

- [ ] All content models extend BaseSEOContent (which extends PageIndex)
- [ ] `getAllSlugs()` returns 0 duplicates
- [ ] Data validation passes with zero errors
- [ ] Data layer exports are fully typed with no `any`
- [ ] Fetching is memoized for build performance
- [ ] A test or script can validate the full dataset
- [ ] Two-tier data model implemented (index data vs. full page data)
- [ ] No function loads all full page content into memory simultaneously
- [ ] CMS/API fetching uses batched pagination internally

## Relationship to Other Skills

This skill provides the data foundation for:

- **pseo-templates**: Consumes `getPageData()` and `getAllSlugs()`
- **pseo-metadata**: Reads title, description, canonical from content models
- **pseo-schema**: Uses structured fields for JSON-LD generation
- **pseo-linking**: Uses `getRelatedPages()` and category data
- **pseo-quality-guard**: Validates against the content models
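
## Example: Slug Generation Sketch

The slug requirements in step 3 (URL-safe, lowercase, hyphenated, deterministic, with collision detection) can be illustrated with a minimal sketch. The names `SlugSource`, `slugify`, and `findSlugCollisions` are hypothetical and do not come from this skill's reference files. The sketch also assumes that slugs are derived from titles and that each record carries a stable `id`. It only reports collisions; how to resolve them (for example, adding a category or location qualifier) is left to the project.

```typescript
// Hypothetical input shape: any record with a stable id and a title.
interface SlugSource {
  id: string;    // assumed stable record identifier
  title: string; // text the slug is derived from (an assumption of this sketch)
}

/** Convert arbitrary text to a lowercase, hyphenated, URL-safe slug. */
function slugify(input: string): string {
  return input
    .normalize("NFKD")               // split accented characters into base + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks ("é" becomes "e")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")     // replace each run of other characters with one hyphen
    .replace(/^-+|-+$/g, "");        // trim leading and trailing hyphens
}

/**
 * Group record ids by generated slug and return only the slugs that are
 * claimed by more than one record. An empty map means no collisions.
 */
function findSlugCollisions(records: SlugSource[]): Map<string, string[]> {
  const idsBySlug = new Map<string, string[]>();
  for (const record of records) {
    const slug = slugify(record.title);
    const ids = idsBySlug.get(slug) ?? [];
    ids.push(record.id);
    idsBySlug.set(slug, ids);
  }

  const collisions = new Map<string, string[]>();
  idsBySlug.forEach((ids, slug) => {
    if (ids.length > 1) collisions.set(slug, ids);
  });
  return collisions;
}
```

Because `slugify` depends only on its input string, the same title always yields the same slug. A build-time validation script could call `findSlugCollisions` over the full dataset and fail the build when the returned map is non-empty, which is one way to enforce the "0 duplicates" quality check.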