---
name: wikidata-search
description: Search for items and properties on Wikidata and retrieve entity details, claims, and external identifiers. Supports both keyword search (Wikidata Action API) and semantic/hybrid search (Wikidata Vector Database), plus direct entity retrieval (Special:EntityData) and structured querying (WDQS SPARQL).
---

# Wikidata Search Skill

Search and retrieve data from Wikidata, the free knowledge base.

## Choosing An Access Method

Use the method that matches the task to reduce load and improve accuracy:

- Keyword search by label/alias/description: Action API `wbsearchentities`
- Semantic exploration / fuzzy concept search: Wikidata Vector Database (hybrid vector + keyword via RRF)
- Fetch a known entity's current JSON quickly: Special:EntityData
- Complex graph relations / reporting: Wikidata Query Service (WDQS) SPARQL

## API Endpoints

Base URL: `https://www.wikidata.org/w/api.php`

Entity JSON (often faster for current state): `https://www.wikidata.org/wiki/Special:EntityData/{ID}.json`

SPARQL endpoint: `https://query.wikidata.org/sparql`

Vector DB API: `https://wd-vectordb.wmcloud.org`

## Core Functions

### 1. Search Items (wbsearchentities)

Search for entities by label or alias.

```bash
curl 'https://www.wikidata.org/w/api.php?action=wbsearchentities&search=QUERY&language=en&format=json&type=item&limit=10'
```

Parameters:

- `search`: Search term (required)
- `language`: Language code (default: en)
- `type`: `item` (Q-entities) or `property` (P-entities)
- `limit`: Max results (1-50, default: 7)
- `continue`: Offset for pagination

Response fields per result:

- `id`: Entity ID (e.g., Q42)
- `label`: Primary label
- `description`: Short description
- `aliases`: Alternative names
- `url`: Wikidata page URL

### 2. Get Entity Details (wbgetentities)

Retrieve full entity data including claims/identifiers.

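
For more than a handful of entities, the request has to be split, since `wbgetentities` accepts at most 50 IDs per call. The following is a minimal sketch of building those batched request URLs with the standard library only. The function name `build_getentities_urls` and the constant names are illustrative assumptions, not part of `scripts/wikidata_api.py`.

```python
from urllib.parse import urlencode

API_URL = "https://www.wikidata.org/w/api.php"
MAX_IDS_PER_CALL = 50  # wbgetentities batch limit noted in this document


def build_getentities_urls(ids, props=("labels", "descriptions", "aliases", "claims"), languages=None):
    """Return one wbgetentities URL per batch of up to 50 entity IDs.

    Hypothetical helper for illustration; it only builds URLs and sends nothing.
    """
    urls = []
    for start in range(0, len(ids), MAX_IDS_PER_CALL):
        params = {
            "action": "wbgetentities",
            "ids": "|".join(ids[start:start + MAX_IDS_PER_CALL]),
            "props": "|".join(props),
            "format": "json",
        }
        if languages:
            # Restrict labels/descriptions/aliases to these language codes.
            params["languages"] = "|".join(languages)
        # urlencode percent-encodes the pipe separator as %7C.
        urls.append(f"{API_URL}?{urlencode(params)}")
    return urls


if __name__ == "__main__":
    for url in build_getentities_urls(["Q42", "Q937"], languages=["en"]):
        print(url)
```

Requesting only the `props` and `languages` actually needed keeps responses small, which matters most when fetching claims for many entities.
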
```bash
curl 'https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42&format=json&props=labels|descriptions|aliases|claims'
```

Parameters:

- `ids`: Pipe-separated entity IDs (max 50)
- `props`: `labels|descriptions|aliases|claims|sitelinks|info`
- `languages`: Filter languages (e.g., `en|fr|de`)

### 3. Get Claims Only (wbgetclaims)

Retrieve claims for specific entity/property.

```bash
curl 'https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q42&property=P31&format=json'
```

### 4. Semantic / Hybrid Search (Wikidata Vector Database)

When you don't know the exact label, or want "things like this" discovery, use the Vector DB.

Item search:

```bash
curl 'https://wd-vectordb.wmcloud.org/item/query/?query=QUERY&lang=all&K=20'
```

Property search:

```bash
curl 'https://wd-vectordb.wmcloud.org/property/query/?query=QUERY&lang=all&K=20&exclude_external_ids=false'
```

Optional parameters:

- `lang`: language code, or `all` for cross-language
- `K`: number of results
- `instanceof`: comma-separated QIDs to filter items by "instance of"
- `rerank`: `true|false` (slower)

Response fields:

- `QID` / `PID`
- `similarity_score`
- `rrf_score`
- `source`

### 5. Direct Entity JSON (Special:EntityData)

```bash
curl 'https://www.wikidata.org/wiki/Special:EntityData/Q42.json?flavor=simple'
```

`flavor`:

- `simple`: truthy statements + sitelinks/version
- `full`: full data

### 6. Structured Queries (WDQS SPARQL)

```bash
curl -G 'https://query.wikidata.org/sparql' --data-urlencode 'query=SELECT * WHERE { wd:Q42 ?p ?o } LIMIT 5' -H 'Accept: application/sparql-results+json'
```

## Extracting External Identifiers

External identifiers are stored as claims with datatype `external-id`.

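
Because the datatype is recorded on each statement's main snak, identifiers can be picked out without a hard-coded property list. A minimal sketch, assuming the entity dict has the shape returned by `wbgetentities` (each `mainsnak` carrying `snaktype`, `datatype`, and, for real values, `datavalue`). The function name `extract_external_ids` and the sample data are illustrative, not part of `scripts/wikidata_api.py`.

```python
def extract_external_ids(entity):
    """Map property ID -> list of identifier strings for external-id claims.

    Hypothetical helper: `entity` is one value from response["entities"].
    """
    identifiers = {}
    for pid, statements in entity.get("claims", {}).items():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("datatype") != "external-id":
                continue  # items, strings, dates, etc. are not identifiers
            if snak.get("snaktype") != "value":
                continue  # "novalue"/"somevalue" snaks carry no datavalue
            identifiers.setdefault(pid, []).append(snak["datavalue"]["value"])
    return identifiers


# Hand-written sample in the wbgetentities shape (not live data).
sample_entity = {
    "claims": {
        "P31": [{"mainsnak": {"snaktype": "value", "datatype": "wikibase-item",
                              "datavalue": {"value": {"id": "Q5"}, "type": "wikibase-entityid"}}}],
        "P214": [{"mainsnak": {"snaktype": "value", "datatype": "external-id",
                               "datavalue": {"value": "113230702", "type": "string"}}}],
        "P227": [{"mainsnak": {"snaktype": "novalue", "datatype": "external-id"}}],
    }
}

if __name__ == "__main__":
    print(extract_external_ids(sample_entity))  # {'P214': ['113230702']}
```

This sketch keeps every value of a property and ignores statement rank; filtering out deprecated statements would be a reasonable extension.
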
Common identifier properties:

| Property | Name                   | Example                |
| -------- | ---------------------- | ---------------------- |
| P214     | VIAF ID                | 75121530               |
| P227     | GND ID                 | 119033364              |
| P244     | Library of Congress ID | n79023811              |
| P213     | ISNI                   | 0000 0001 2144 9326    |
| P345     | IMDb ID                | nm0001354              |
| P646     | Freebase ID            | /m/0282x               |
| P349     | NDL ID                 | 00621256               |
| P268     | BnF ID                 | 11888092r              |
| P269     | IdRef ID               | 026927608              |
| P906     | SELIBR ID              | 182099                 |
| P396     | SBN author ID          | IT\\ICCU\\CFIV\\000163 |

To extract identifiers from a `wbgetentities` response:

```python
# claims = response['entities']['Q42']['claims']
# For each property P:
#   claims[P][0]['mainsnak']['datavalue']['value'] -> identifier string
```

## Python Script Usage

Use `scripts/wikidata_api.py` for programmatic access:

```python
from scripts.wikidata_api import WikidataAPI

wd = WikidataAPI()

# Search for items
results = wd.search("Albert Einstein", language="en", limit=5)

# Get entity with identifiers
entity = wd.get_entity("Q937", props=["labels", "descriptions", "claims"])

# Get external identifiers only (all values by default)
identifiers = wd.get_identifiers("Q937")
# Returns: {'P214': ['75121530', ...], 'P227': ['118529579'], ...}

# Semantic search (Vector DB)
candidates = wd.vector_search_items("a famous science fiction writer", lang="en", k=5)

# SPARQL
raw = wd.execute_sparql("SELECT * WHERE { wd:Q42 ?p ?o } LIMIT 5")
```

## Response Handling

### Search Response Structure

```json
{
  "searchinfo": {"search": "query"},
  "search": [
    {
      "id": "Q42",
      "label": "Douglas Adams",
      "description": "English writer and humorist",
      "aliases": ["Douglas Noël Adams"],
      "url": "//www.wikidata.org/wiki/Q42"
    }
  ]
}
```

### Entity Response Structure

```json
{
  "entities": {
    "Q42": {
      "type": "item",
      "id": "Q42",
      "labels": {"en": {"language": "en", "value": "Douglas Adams"}},
      "descriptions": {"en": {"language": "en", "value": "..."}},
      "claims": {
        "P31": [...],  // instance of
        "P214": [{"mainsnak": {"datavalue": {"value": "113230702"}}}]  // VIAF
      }
    }
  }
}
```

## Best Practices

1. **Choose the right access method**: search vs vector search vs entity fetch vs SPARQL
2. **Rate limiting**: add a delay of 500 ms to 1 s between requests
3. **Batch requests**: use pipe-separated IDs (max 50 per `wbgetentities` call)
4. **Set User-Agent**: include contact info in headers
5. **Handle 429**: respect `Retry-After` and back off
6. **Action API etiquette**: use `maxlag` and request only needed `props`