---
name: robots-txt
description: When the user wants to configure, audit, or optimize robots.txt. Also use when the user mentions "robots.txt," "crawler rules," "block crawlers," "AI crawlers," "GPTBot," "allow/disallow," "disallow path," "crawl directives," "user-agent," "block Googlebot," "fix robots.txt," "robots.txt blocking," or "search engine crawling." For indexing, use indexing.
metadata:
  version: 1.1.1
---

# SEO Technical: robots.txt

Guides configuration and auditing of robots.txt for search engine and AI crawler control.

**When invoking**: On **first use**, if helpful, open with 1–2 sentences on what this skill covers and why it matters, then provide the main output. On **subsequent use** or when the user asks to skip, go directly to the main output.

## Scope (Technical SEO)

- **Robots.txt**: Configure Disallow/Allow, Sitemap, Clean-param; audit for accidental blocks
- **Crawler access**: Path-level crawl control; AI crawler allow/block strategy
- **Differentiation**: robots.txt = crawl control (who accesses what paths); noindex = index control (what gets indexed). See **indexing** for page-level exclusions.

## Initial Assessment

**Check for project context first:** If `.claude/project-context.md` or `.cursor/project-context.md` exists, read it for site URL and indexing goals.

Identify:

1. **Site URL**: Base domain (e.g., `https://example.com`)
2. **Indexing scope**: Full site, partial, or specific paths to exclude
3. **AI crawler strategy**: Allow search/indexing vs. block training data crawlers

## Best Practices

### Purpose and Limitations

| Point | Note |
|-------|------|
| **Purpose** | Controls crawler access; does NOT prevent indexing (disallowed URLs may still appear in search without snippet) |
| **Advisory** | Rules are advisory; malicious crawlers may ignore |
| **Public** | robots.txt is publicly readable; use noindex or auth for sensitive content. See **indexing** |

### Crawl vs Index vs Link Equity (Quick Reference)

| Tool | Controls | Prevents indexing? |
|------|----------|-------------------|
| **robots.txt** | Crawl (path-level) | No—blocked URLs may still appear in SERP |
| **noindex** (meta / X-Robots-Tag) | Index (page-level) | Yes. See **indexing** |
| **nofollow** | Link equity only | No—does not control indexing |

### When to Use robots.txt vs noindex

| Use | Tool | Example |
|-----|------|---------|
| **Path-level** (whole directory) | robots.txt | `Disallow: /admin/`, `Disallow: /api/`, `Disallow: /staging/` |
| **Page-level** (specific pages) | noindex meta / X-Robots-Tag | Login, signup, thank-you, 404, legal. See **indexing** for full list |
| **Critical** | Do NOT block in robots.txt | Pages that use noindex—crawlers must access the page to read the directive |

**Paths to block in robots.txt**: /admin/, /api/, /staging/, temp files. **Paths to use noindex** (allow crawl): /login/, /signup/, /thank-you/, etc.—see **indexing**.
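A minimal sketch of that split (the paths are the illustrative ones above, not a universal recommendation):

```
# robots.txt — path-level blocks only
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/

# /login/, /signup/, /thank-you/ are intentionally NOT listed here:
# they must stay crawlable so bots can read their noindex directive
```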
### Location and Format

| Item | Requirement |
|------|-------------|
| **Path** | Site root: `https://example.com/robots.txt` |
| **Encoding** | UTF-8 plain text |
| **Standard** | RFC 9309 (Robots Exclusion Protocol) |

### Core Directives

| Directive | Purpose | Example |
|-----------|---------|---------|
| `User-agent:` | Target crawler | `User-agent: Googlebot`, `User-agent: *` |
| `Disallow:` | Block path prefix | `Disallow: /admin/` |
| `Allow:` | Allow path (can override Disallow) | `Allow: /public/` |
| `Sitemap:` | Declare sitemap absolute URL | `Sitemap: https://example.com/sitemap.xml` |
| `Clean-param:` | Strip query params (Yandex) | See below |

### Critical: Do Not Block

| Do not block | Reason |
|--------------|--------|
| CSS, JS, images | Google needs them to render pages; blocking breaks indexing |
| `/_next/` (Next.js) | Blocking it breaks CSS/JS loading; static assets showing as "Crawled - not indexed" in GSC is expected. See **indexing** |
| Pages that use noindex | Crawlers must access the page to read the noindex directive; blocking in robots.txt prevents that |

**Only block** paths that don't need crawling: /admin/, /api/, /staging/, temp files.

### AI Crawler Strategy

robots.txt is effective for all measured AI crawlers ([Vercel/MERJ study](https://vercel.com/blog/the-rise-of-the-ai-crawler), 2024). Set rules per user-agent; check each vendor's docs for current tokens.

| User-agent | Purpose | Typical |
|------------|---------|---------|
| **OAI-SearchBot** | ChatGPT search | Allow |
| **GPTBot** | OpenAI training | Disallow |
| **Claude-SearchBot** | Claude search | Allow |
| **ClaudeBot** | Anthropic training | Disallow |
| **PerplexityBot** | Perplexity search | Allow |
| **Google-Extended** | Gemini training | Disallow |
| **CCBot** | Common Crawl (LLM training) | Disallow |
| **Bytespider** | ByteDance | Disallow |
| **Meta-ExternalAgent** | Meta | Disallow |
| **AppleBot** | Apple (Siri, Spotlight); renders JS | Allow for indexing |

**Allow vs Disallow**: Allow search/indexing bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot); Disallow training-only bots (GPTBot, ClaudeBot, CCBot) if you don't want content used for model training. See **site-crawlability** for AI crawler optimization (SSR, URL management).

### Clean-param (Yandex)

```
Clean-param: utm_source&utm_medium&utm_campaign&utm_term&utm_content&ref&fbclid&gclid
```

## Output Format

- **Current state** (if auditing)
- **Recommended robots.txt** (full file; see the starter example below)
- **Compliance checklist**
- **References**: [Google robots.txt](https://developers.google.com/search/docs/crawling-indexing/robots/create-robots-txt)

## Related Skills

- **indexing**: Full noindex page-type list; when to use noindex vs robots.txt; GSC indexing diagnosis
- **page-metadata**: Meta robots (noindex, nofollow) implementation
- **xml-sitemap**: Sitemap URL to reference in robots.txt
- **site-crawlability**: Broader crawl and structure guidance; AI crawler optimization
- **rendering-strategies**: SSR, SSG, CSR; content in initial HTML for crawlers
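## Example: Starter robots.txt

One possible starting point assembling the recommendations above—a sketch, not a drop-in file: the domain, sitemap path, and allow/block split are assumptions to adapt, and user-agent tokens should be verified against each vendor's docs.

```
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /staging/

# AI search/indexing bots: allow
# (per RFC 9309, a bot matching its own group ignores the * group—
# repeat the Disallow lines here if those paths must stay blocked)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-only bots: block site-wide
# (assumes content should not be used for model training)
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: CCBot
User-agent: Bytespider
User-agent: Meta-ExternalAgent
Disallow: /

Sitemap: https://example.com/sitemap.xml
```

Stacking multiple `User-agent:` lines on one group is valid under RFC 9309 and keeps the training-bot block list in one place.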
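If the site runs on Next.js (noted above for `/_next/`), the file can also be generated at build time via the App Router metadata route. A minimal sketch, assuming Next.js 13.3+ and an App Router project; paths and sitemap URL are placeholders:

```typescript
// app/robots.ts — Next.js serves /robots.txt from this metadata route
import type { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Path-level blocks; /_next/ is deliberately left crawlable
      { userAgent: '*', disallow: ['/admin/', '/api/', '/staging/'] },
      // Training-only bots blocked site-wide (adapt to your AI strategy)
      { userAgent: ['GPTBot', 'ClaudeBot', 'CCBot'], disallow: '/' },
    ],
    sitemap: 'https://example.com/sitemap.xml',
  }
}
```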