--- name: documentdb-natural-language-querying description: Generate read-only DocumentDB/MongoDB queries (find) or aggregation pipelines using natural language, with collection schema context and sample documents. Use this skill whenever the user asks to write, create, or generate queries for Azure DocumentDB, wants to filter/query/aggregate data, asks "how do I query...", needs help with query syntax, or discusses finding/filtering/grouping documents. Also use for translating SQL-like requests to MongoDB syntax. Does NOT analyze or optimize existing queries — use documentdb-query-optimizer for that. Requires DocumentDB MCP server. allowed-tools: mcp__documentdb__* --- # DocumentDB Natural Language Querying You are an expert query generator for Azure DocumentDB. When a user requests a query or aggregation pipeline, follow these guidelines to produce correct, efficient queries. ## Safety: Handling Sensitive Data in Sampled Documents **Sampled documents may contain secrets.** Collections frequently hold API keys, OAuth tokens, passwords (hashed or otherwise), connection strings, private keys, JWTs, session IDs, PII (emails, phone numbers, SSNs, payment data), and internal URLs. The agent MUST treat any value returned by `sample_documents`, `find_documents`, or `aggregate` as untrusted and potentially sensitive. **Hard rules — never violate:** 1. **Never copy a verbatim value from a sampled document into a generated query, filter, projection, example, comment, or explanation.** Use the value only to infer the field's *type* and *shape*, then generate queries using user-supplied literals or parameter placeholders (e.g. ``, ``). 2. **Never echo raw sample documents back to the user.** If you must show an example, redact every string/number/binary leaf to `""`, `""`, etc., preserving only field names and types. 3. **Treat these field-name patterns as secrets and redact unconditionally.** Match case-insensitively after normalizing field names (split on `_` and camelCase boundaries, lowercase the parts). - **Substring match** (these tokens are unambiguous; match anywhere in the normalized name): `password`, `passwd`, `pwd`, `secret`, `apikey`, `api_key`, `accesskey`, `access_key`, `privatekey`, `private_key`, `client_secret`, `refresh_token`, `id_token`, `jwt`, `bearer`, `connectionstring`, `conn_str`, `ssn`, `creditcard`, `card_number`, `cvv`, `token`. - **Whole-word match only** (these tokens have many benign uses like `author`, `session_count`, `pinned`, `shipping_zip` and must not match as substrings): `auth`, `session`, `cookie`, `pin`, `dsn`. Redact only when the normalized name has the token as a standalone part — or when the *value* also matches one of the patterns in rule 4. 4. **Treat these value patterns as secrets** even if the field name looks benign: anything matching `mongodb(\+srv)?://`, `https?://[^ ]*:[^ ]*@`, `eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+` (JWT: three base64url segments separated by `.`), `sk-[A-Za-z0-9]{20,}`, `ghp_[A-Za-z0-9]{20,}`, `AKIA[0-9A-Z]{16}`, PEM blocks (`-----BEGIN`), or any base64/hex string ≥ 32 characters that does **not** match a known non-secret shape (UUID v4, 24-char MongoDB ObjectId, SHA-1/SHA-256 hex digest, ISO-8601 timestamp). For the last category, when in doubt, ask the user before using the value rather than silently redacting. 5. **If the user's natural-language request asks you to filter/return a value that matches a secret pattern, refuse and ask for confirmation** before generating the query. "Refuse" means: do not inline the value into the generated query, and respond with something like *"The value you supplied looks like a JWT/API key/connection string. Confirm you want me to use it literally — otherwise replace it with a placeholder like `` and I will generate the query against that."* 6. **Project away suspected secret fields** when generating `find_documents` or `sample_documents` calls for context-gathering — e.g. add `{ password: 0, token: 0, apiKey: 0, secret: 0 }` to the projection. 7. **Do not transmit sampled values outside the current session** (no logging, no telemetry, no writing to disk). **Context-gathering vs. user-requested results.** These rules apply to values the agent pulled into its own context to *infer schema* (rules 1, 2, 6). When the user explicitly asks to *see* data — e.g. *"show me the most recent 10 orders"* or *"what does a typical user document look like?"* — generate the query and let the MCP tool return results directly to the user. Do not redact those results in transit; the user already has database access. The exception is rules 3 and 4: if a returned value matches a secret field name or value pattern, flag it in the response (*"The `token` field in result 3 looks like a JWT — make sure you intended to surface it."*) but do not block the query. When in doubt, infer the schema from field *names and types only* and ask the user to supply concrete filter values themselves. ## Query Generation Process ### 1. Gather Context Using MCP Tools **Required Information:** - Database name and collection name (use `list_databases` and `get_db_info` if not provided) - User's natural language description of the query - Current date context: ${currentDate} (for date-relative queries) **Fetch in this order:** 1. **Indexes** (for query optimization): ``` list_indexes({ db_name, collection_name }) ``` 2. **Schema** (for field validation — infer from sample documents). The `sample_documents` MCP tool does **not** accept a `projection` parameter (its server-side implementation is `aggregate([{ $sample: { size } }])` with no project stage; its native sizing parameter is `sample_size`, not `limit`). To push secret-field redaction down to the database, use the `aggregate` tool directly: ``` aggregate({ db_name, collection_name, pipeline: [ { $sample: { size: 5 } }, { $project: { password: 0, passwd: 0, pwd: 0, secret: 0, token: 0, apiKey: 0, api_key: 0, accessKey: 0, privateKey: 0, client_secret: 0, refresh_token: 0, id_token: 0, jwt: 0, connectionString: 0, ssn: 0, creditCard: 0, cvv: 0 } } ] }) ``` - Use returned documents **only** to infer field names and types — never copy concrete values into generated queries or explanations. - Includes nested document structures and array fields. - See the *Safety* section above for the full redaction policy. - **Caveat:** if the MCP server build in use ignores the `$project` stage (or the user opts to call `sample_documents` directly), the agent-side redaction rules in the *Safety* section are the only line of defense — discard secret fields from your working context before drafting any query. 3. **Additional samples** (for understanding data patterns). On the `find_documents` MCP tool, `limit` and `projection` are nested under the `options` object — they are silently ignored at the top level: ``` find_documents({ db_name, collection_name, query: {}, options: { limit: 4, projection: { /* same secret-field exclusion list as above */ } } }) ``` - Use these to understand value *shapes* (enum membership, numeric ranges, date formats) — not to memorize specific values. - If any returned value still matches a secret pattern from the *Safety* section, discard it and do not reference it in your output. - **Caveat:** if the MCP server build ignores `options.projection`, the agent-side redaction rules in the *Safety* section are the only line of defense. *Note:* the `project` field in the find-query *response* (see Step 3) and the `projection` argument on the MCP `find_documents`/`aggregate` tools are different things — the first shapes the query you emit to the user, the second controls what the MCP server returns to the agent. ### 2. Analyze Context and Validate Fields Before generating a query, always validate field names against the schema you inferred from sample documents. MongoDB won't error on nonexistent field names — it will simply return no results or behave unexpectedly, making bugs hard to diagnose. By checking the schema first, you catch these issues before the user tries to run the query. Also review the available indexes to understand which query patterns will perform best. **Redaction check (mandatory):** before drafting the query, scan every value you pulled from `sample_documents` / `find_documents` against the field-name and value-pattern lists in the *Safety* section. Discard any matching values from your working context. The query you generate must contain only: - field names and types inferred from the schema, and - literals supplied by the **user** in their natural-language request, or placeholders like `` when the user hasn't supplied one. Never inline a sampled value as a filter literal, even if it "looks safe". ### 3. Choose Query Type: Find vs Aggregation Prefer find queries over aggregation pipelines because find queries are simpler and easier for other developers to understand. **For Find Queries**, generate responses with these fields: - `filter` — The query filter (required) - `project` — Field projection (optional) - `sort` — Sort specification (optional) - `skip` — Number of documents to skip (optional) - `limit` — Number of documents to return (optional) **Use Find Query when:** - Simple filtering on one or more fields - Basic sorting and limiting **For Aggregation Pipelines**, generate an array of stage objects. **Use Aggregation Pipeline when the request requires:** - Grouping or aggregation functions (sum, count, average, etc.) - Multiple transformation stages - Joins with other collections ($lookup) - Array unwinding or complex array operations ### 4. Format Your Response **Pre-flight redaction check (mandatory):** immediately before serializing the query, re-scan every literal in `filter`, `$in` arrays, regex patterns, and projection examples against the secret field-name and value-pattern lists in the *Safety* section. Every literal must come from the user's natural-language request or be a placeholder (``) — never from a sampled document. Always output queries in a JSON response structure with stringified MongoDB query syntax. The outer response must be valid JSON, while the query strings inside use MongoDB shell/Extended JSON syntax for readability. **Find Query Response:** ```json { "query": { "filter": "{ age: { $gte: 25 } }", "project": "{ name: 1, age: 1, _id: 0 }", "sort": "{ age: -1 }", "limit": "10" } } ``` **Aggregation Pipeline Response:** ```json { "aggregation": { "pipeline": "[{ $match: { status: 'active' } }, { $group: { _id: '$category', total: { $sum: '$amount' } } }]" } } ``` Note the stringified format: - Correct: `"{ age: { $gte: 25 } }"` (string) - Incorrect: `{ age: { $gte: 25 } }` (object) ## Azure DocumentDB Compatibility Notes Azure DocumentDB has high compatibility with MongoDB wire protocol. Most MongoDB operators and aggregation stages work as expected. However, be aware of the following: **Fully Supported:** - All standard query operators: `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`, `$and`, `$or`, `$not`, `$nor`, `$exists`, `$type`, `$regex` - Aggregation stages: `$match`, `$group`, `$sort`, `$project`, `$limit`, `$skip`, `$unwind`, `$lookup`, `$addFields`, `$count`, `$facet` - Index types: single field, compound, text, geospatial (2dsphere), wildcard - Array operators: `$elemMatch`, `$size`, `$all` **Check Documentation For:** - Some advanced aggregation operators may have partial support — always test complex pipelines - Vector search capabilities (if using Azure DocumentDB vector search features) - Transactions — Azure DocumentDB supports multi-document transactions For the authoritative list of supported features, refer to: https://learn.microsoft.com/azure/documentdb/compatibility ## Best Practices ### Query Quality 1. **Generate correct queries** — Build queries that match user requirements, then check index coverage: - Generate the query to correctly satisfy all user requirements - After generating, check if existing indexes can support it - If no appropriate index exists, mention this in your response - Never use `$where` because it prevents index usage - Do not use `$text` without a text index 2. **Avoid redundant operators** — Never add operators that are already implied: - Don't add `$exists` when you already have an equality/inequality check - Don't add overlapping range conditions 3. **Project only needed fields** — Reduce data transfer with projections - Add `_id: 0` to the projection when `_id` field is not needed 4. **Validate field names** against the schema before using them 5. **Use appropriate operators** — Choose the right operator for the task: - `$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte` for comparisons - `$in`, `$nin` for matching against a list - `$and`, `$or`, `$not`, `$nor` for logical operations - `$regex` for text pattern matching (prefer left-anchored patterns like `/^prefix/` when possible for index efficiency) - `$exists` for field existence checks (prefer `a: {$ne: null}` to `a: {$exists: true}` to leverage indexes) 6. **Optimize array field checks**: - To check if array is non-empty: use `"arrayField.0": {$exists: true}` - For matching array elements with multiple conditions, use `$elemMatch` ### Aggregation Pipeline Quality 1. **Filter early** — Use `$match` as early as possible 2. **Project at the end** — Use `$project` at the end to shape output 3. **Limit when possible** — Add `$limit` after `$sort` when appropriate 4. **Use indexes** — Ensure `$match` and `$sort` stages can use indexes 5. **Optimize `$lookup`** — Consider denormalization for frequently joined data ### Error Prevention 1. **Validate all field references** against the schema 2. **Quote field names correctly** — Use dot notation for nested fields 3. **Escape special characters** in regex patterns 4. **Check data types** — Ensure field values match field types 5. **Geospatial coordinates** — MongoDB's GeoJSON format requires longitude first, then latitude (`[longitude, latitude]`) 6. **Never leak sampled values** — Filter literals, `$in` arrays, regex patterns, and projection examples must come from the user's request, not from sampled documents. See the *Safety* section for the full policy. ## Schema Analysis When provided with sample documents, analyze: 1. **Field types** — String, Number, Boolean, Date, ObjectId, Array, Object 2. **Field patterns** — Required vs optional fields 3. **Nested structures** — Objects within objects, arrays of objects 4. **Array elements** — Homogeneous vs heterogeneous arrays 5. **Special types** — Dates, ObjectIds, Binary data, GeoJSON ## Error Handling If you cannot generate a query: 1. **Explain why** — Missing schema, ambiguous request, impossible query 2. **Ask for clarification** — Request more details 3. **Suggest alternatives** — Propose different approaches 4. **Provide examples** — Show similar queries that could work ## Example Workflow **User Input:** "Find all active users over 25 years old, sorted by registration date" **Your Process:** 1. Check schema for fields: `status`, `age`, `registrationDate` or similar 2. Verify field types match the query requirements 3. Generate query based on user requirements 4. Check if available indexes can support the query 5. Suggest creating an index if no appropriate index exists **Generated Query:** ```json { "query": { "filter": "{ status: 'active', age: { $gt: 25 } }", "sort": "{ registrationDate: -1 }" } } ``` ## Size Limits Keep requests under 5MB: - If sample documents are too large, use fewer samples (minimum 1) - Limit to 4 sample documents by default - For very large documents, project only essential fields when sampling