---
layout: default
title: "Pages Index (pages.json) Documentation"
description: "Documentation for the pages.json content index used by The Sunil Abraham Project (TSAP)."
categories: [TSAP Documentation]
permalink: /tsap/pages-json-documentation/
created: 2026-06-08
---

{% include documentation-notice.html %}

The **Pages Index** is a machine-readable catalogue of content published on The Sunil Abraham Project (TSAP). It is generated automatically from page front matter and exported as a JSON file named `pages.json`.

The purpose of the Pages Index is to provide a structured representation of TSAP content that can be consumed by external tools, bots, scripts, search systems, and future applications without requiring direct access to Jekyll internals or repository source files.

The implementation was developed with support from ChatGPT. All design decisions, testing, debugging, editorial judgement, and final implementation choices were made by the project maintainer.

## Background

As TSAP grew beyond one thousand pages, it became increasingly desirable to expose a structured list of published content.

Human readers can navigate the website through categories, internal links, search engines, and navigation menus. Software tools, however, require a structured source of information.

Several future use cases were identified:

- Telegram bot integration.
- Search and discovery tools.
- Research utilities.
- Content analysis.
- Statistics and reporting.
- External applications consuming TSAP metadata.

A machine-readable index became increasingly useful as the project expanded.

## Why the Pages Index Was Created

Without a dedicated index, external tools would need to:

- Read Markdown files directly.
- Parse front matter.
- Understand Jekyll conventions.
- Reconstruct URLs.
- Determine which pages should be included.
- Handle future structural changes.

This creates unnecessary complexity and tightly couples external tools to repository internals.
The Pages Index was therefore created as a simple, portable, and machine-readable representation of published TSAP content.

## Architecture

The Pages Index is generated from page front matter.

The script scans Markdown files throughout the repository and extracts selected metadata fields.

Only pages containing a `created` field are included.

This rule was chosen because:

- TSAP pages are expected to contain a `created` field.
- Utility pages and temporary files can be excluded naturally.
- The resulting index focuses on published content.

The process is:

```text
Markdown Files
      ↓
Front Matter Extraction
      ↓
Metadata Selection
      ↓
pages.json Generation
      ↓
Publication
```

The generated file becomes a structured catalogue of TSAP content.

## Script Location

The Pages Index is generated by:

```text
scripts/generate_pages_json.py
```

The script scans Markdown files throughout the repository, extracts selected front matter metadata, and generates a machine-readable JSON index.

## Generated File

The output file is:

```text
pages.json
```

It is written to the repository root and published automatically by GitHub Pages.

Published URL:

```text
https://sunilabraham.in/pages.json
```

## Included Metadata

Each indexed page may contain:

- title
- description
- created
- date
- source
- authors
- categories
- permalink

A typical entry looks like:

```json
{
  "title": "Example Page",
  "description": "Example description",
  "created": "2026-06-08",
  "date": "2026-05-01",
  "source": "Example Source",
  "authors": ["Example Author"],
  "categories": ["Example Category"],
  "permalink": "https://sunilabraham.in/example-page/"
}
```

## Installation Requirements

The script was designed to remain lightweight and portable.

Requirements:

- Python 3
- PyYAML

On Ubuntu:

```bash
sudo apt install python3-yaml
```

Alternatively, within a Python virtual environment:

```bash
pip install pyyaml
```

No database is required.

No Jekyll plugin is required.

No GitHub Actions workflow is required.
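To make the pipeline described under Architecture more concrete, the following is a minimal sketch of how a generator of this kind could work. It is not the contents of `scripts/generate_pages_json.py`; the function names (`split_front_matter`, `select_metadata`, `build_index`, `write_index`) are hypothetical. The sketch also assumes that the index is a plain JSON array and that site-relative permalinks are prefixed with the site URL, as the example entry suggests. The YAML parser is passed in as a parameter, so the same code runs with PyYAML's `yaml.safe_load` or any stand-in.

```python
import json
from pathlib import Path

# Fields copied into each entry, taken from the "Included Metadata" list.
FIELDS = ("title", "description", "created", "date", "source",
          "authors", "categories", "permalink")

# Assumption: relative permalinks are made absolute with the site URL.
SITE_URL = "https://sunilabraham.in"


def split_front_matter(text):
    """Return the YAML block between the leading '---' lines, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening delimiter without a closing one


def select_metadata(front_matter):
    """Keep only the indexed fields; return None if `created` is absent."""
    if not isinstance(front_matter, dict) or "created" not in front_matter:
        return None
    entry = {key: front_matter[key] for key in FIELDS if key in front_matter}
    link = entry.get("permalink")
    if isinstance(link, str) and link.startswith("/"):
        entry["permalink"] = SITE_URL + link
    return entry


def build_index(root, parse_yaml):
    """Scan *.md files under root and collect one entry per indexed page."""
    entries = []
    for path in sorted(Path(root).rglob("*.md")):
        block = split_front_matter(path.read_text(encoding="utf-8"))
        if block is None:
            continue
        entry = select_metadata(parse_yaml(block))
        if entry is not None:
            entries.append(entry)
    return entries


def write_index(entries, out_path):
    """Serialise entries as JSON; default=str renders YAML dates as text."""
    payload = json.dumps(entries, ensure_ascii=False, indent=2, default=str)
    Path(out_path).write_text(payload, encoding="utf-8")
```

With PyYAML installed, the pieces could be combined as `write_index(build_index(".", yaml.safe_load), "pages.json")`. A real generator would probably also skip build output and vendor directories, which this sketch does not attempt.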
## Running the Generator

Navigate to the root of the repository:

```bash
cd /path/to/your/repository
```

Example:

```bash
cd ~/Projects/sunilabraham
```

Run the generator:

```bash
python3 scripts/generate_pages_json.py
```

Typical output:

```text
Created pages.json with 1048 pages.
```

## First Successful Build

The first successful generation occurred on 8 June 2026.

Results:

```text
Created pages.json with 1048 pages.
```

Generated file size:

```text
576 KB
```

This demonstrated that a machine-readable index of the entire project could be generated efficiently while remaining small enough for rapid download.

## Maintenance Workflow

A recommended workflow is:

1. Create, edit, or publish content.
2. Regenerate `pages.json`.
3. Review the result.
4. Commit the updated index.
5. Push changes.

Typical usage:

```bash
python3 scripts/generate_pages_json.py
git add pages.json
git commit -m "Update pages index"
git push
```

This ensures that the published index remains synchronised with site content.

## Current Uses

The Pages Index was originally created to support retrieval systems and machine-readable access to TSAP content.

Current uses include:

- Content discovery.
- Metadata retrieval.
- Project statistics.
- External tooling.
- Bot integration.

Future tools may consume the same index without requiring direct access to repository source files.

## Advantages and Limitations

Advantages:

- Simple architecture.
- Human-readable source data.
- Machine-readable output.
- No database required.
- No build plugins required.
- Compatible with GitHub Pages.
- Lightweight and portable.

Limitations:

- Requires manual regeneration.
- Newly created pages will not appear until the index is regenerated.
- Only pages containing a `created` field are included.
- Metadata quality depends upon front matter quality.

These limitations are considered acceptable given the project's emphasis on simplicity and maintainability.
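As an illustration of the uses listed above, a consumer needs nothing beyond the JSON file itself. The following is a minimal sketch, assuming that `pages.json` is a top-level JSON array of entries shaped like the example entry; the exact top-level layout is not specified here. The helper names (`load_pages`, `category_counts`, `find_by_title`) are hypothetical and not part of TSAP.

```python
import json
from collections import Counter


def load_pages(path="pages.json"):
    """Read the index from a local copy of pages.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def category_counts(pages):
    """Count how many indexed pages carry each category."""
    counts = Counter()
    for page in pages:
        categories = page.get("categories") or []
        if isinstance(categories, str):
            # Tolerate front matter that stores a single category as text.
            categories = [categories]
        counts.update(categories)
    return counts


def find_by_title(pages, needle):
    """Return entries whose title contains needle, ignoring case."""
    needle = needle.casefold()
    return [page for page in pages
            if needle in str(page.get("title", "")).casefold()]
```

Because every field is optional in an entry, the helpers read values with `.get()` rather than indexing directly, which keeps them tolerant of uneven front matter quality.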
## Future Improvements

Potential future enhancements include:

- Automatic generation during deployment.
- Additional metadata fields.
- Category-specific exports.
- Author-specific exports.
- Change tracking.
- Additional machine-readable indexes.

Any future development should continue to prioritise transparency, portability, and compatibility with GitHub Pages.

## Development History

Development began on 8 June 2026.

The immediate goal was to create a machine-readable representation of TSAP content that could be consumed by external tools without requiring direct access to repository files.

The chosen approach was deliberately simple. Rather than introducing databases, search engines, build plugins, or external services, a standalone Python script was created to scan Markdown files and export selected front matter metadata into a single JSON file.

The first successful run generated:

```text
Created pages.json with 1048 pages.
```

The resulting file was approximately 576 KB in size and was published at:

```text
https://sunilabraham.in/pages.json
```

This established the first structured content index for the project.

## Lessons Learned

The development of the Pages Index reinforced an important architectural lesson.

The most valuable improvement was not adding another AI model or another search interface. The most valuable improvement was creating a structured representation of TSAP's own content.

By transforming page metadata into a machine-readable index, TSAP content becomes accessible to future tools, bots, reports, and retrieval systems while remaining entirely within the project's existing static-site architecture.

Future development should continue to favour structured project data and retrieval mechanisms wherever practical.