---
source: "https://github.com/huggingface/skills/tree/main/skills/huggingface-datasets"
name: hugging-face-dataset-viewer
description: Query Hugging Face datasets through the Dataset Viewer API for splits, rows, search, filters, and parquet links.
risk: unknown
---

# Hugging Face Dataset Viewer

## When to Use

Use this skill when you need read-only exploration of a Hugging Face dataset: executing Dataset Viewer API calls for exploration and extraction.

## Core workflow

1. Optionally validate dataset availability with `/is-valid`.
2. Resolve `config` + `split` with `/splits`.
3. Preview with `/first-rows`.
4. Paginate content with `/rows` using `offset` and `length` (max 100).
5. Use `/search` for text matching and `/filter` for row predicates.
6. Retrieve parquet links via `/parquet` and totals/metadata via `/size` and `/statistics`.

## Defaults

- Base URL: `https://datasets-server.huggingface.co`
- Default API method: `GET`
- Query params should be URL-encoded.
- `offset` is 0-based.
- `length` max is usually `100` for row-like endpoints.
- Gated/private datasets require `Authorization: Bearer <token>`.
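Assuming only the defaults above, the request URLs for the core workflow can be sketched in Python. This is URL construction only — nothing is fetched — and `stanfordnlp/imdb` / `plain_text` are just example values:

```python
from urllib.parse import urlencode

BASE = "https://datasets-server.huggingface.co"

def viewer_url(endpoint: str, **params) -> str:
    """Build a Dataset Viewer URL with URL-encoded query params."""
    return f"{BASE}/{endpoint}?{urlencode(params)}"

dataset = "stanfordnlp/imdb"  # example public dataset

# Steps 1-4: validate, resolve config/split, preview, paginate.
print(viewer_url("is-valid", dataset=dataset))
print(viewer_url("splits", dataset=dataset))
print(viewer_url("first-rows", dataset=dataset, config="plain_text", split="train"))
print(viewer_url("rows", dataset=dataset, config="plain_text",
                 split="train", offset=0, length=100))
```

Note that `urlencode` percent-encodes the `/` in the dataset name (`stanfordnlp%2Fimdb`), which the API accepts.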
## Dataset Viewer

- `Validate dataset`: `/is-valid?dataset=<dataset>`
- `List subsets and splits`: `/splits?dataset=<dataset>`
- `Preview first rows`: `/first-rows?dataset=<dataset>&config=<config>&split=<split>`
- `Paginate rows`: `/rows?dataset=<dataset>&config=<config>&split=<split>&offset=<offset>&length=<length>`
- `Search text`: `/search?dataset=<dataset>&config=<config>&split=<split>&query=<query>&offset=<offset>&length=<length>`
- `Filter with predicates`: `/filter?dataset=<dataset>&config=<config>&split=<split>&where=<where>&orderby=<orderby>&offset=<offset>&length=<length>`
- `List parquet shards`: `/parquet?dataset=<dataset>`
- `Get size totals`: `/size?dataset=<dataset>`
- `Get column statistics`: `/statistics?dataset=<dataset>&config=<config>&split=<split>`
- `Get Croissant metadata (if available)`: `/croissant?dataset=<dataset>`

Pagination pattern:

```bash
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=0&length=100"
curl "https://datasets-server.huggingface.co/rows?dataset=stanfordnlp/imdb&config=plain_text&split=train&offset=100&length=100"
```

When pagination is partial, use response fields such as `num_rows_total`, `num_rows_per_page`, and `partial` to drive continuation logic.

Search/filter notes:

- `/search` matches string columns (full-text-style behavior is internal to the API).
- `/filter` requires predicate syntax in `where` and an optional sort in `orderby`.
- Keep filtering and searches read-only and side-effect free.

## Querying Datasets

Use `npx parquetlens` with Hub parquet alias paths for SQL querying.
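Before dropping to SQL, the continuation fields above (`num_rows_total`, `partial`) can drive a paging loop. A minimal Python sketch, with the HTTP fetch left as a stub — the stub and its pages are illustrative, not real API responses:

```python
def iter_rows(fetch_page, page_size=100):
    """Yield rows page by page until num_rows_total is exhausted.

    `fetch_page(offset, length)` stands in for a GET /rows call and must
    return a dict shaped like the /rows response.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows = page.get("rows", [])
        yield from rows
        offset += len(rows)
        # Stop on an empty page or once every reported row has been seen.
        if not rows or offset >= page.get("num_rows_total", offset):
            break

# Illustrative stub standing in for the real endpoint.
def fake_fetch(offset, length):
    total = 250
    rows = [{"row_idx": i} for i in range(offset, min(offset + length, total))]
    return {"rows": rows, "num_rows_total": total, "partial": False}

print(sum(1 for _ in iter_rows(fake_fetch)))  # 250 rows across three pages
```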
Parquet alias shape:

```text
hf://datasets/<namespace>/<dataset>@~parquet/<config>/<split>/<filename>.parquet
```

Derive `<config>`, `<split>`, and `<filename>` from Dataset Viewer `/parquet`:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=cfahlgren1/hub-stats" \
  | jq -r '.parquet_files[] | "hf://datasets/\(.dataset)@~parquet/\(.config)/\(.split)/\(.filename)"'
```

Run a SQL query:

```bash
npx -y -p parquetlens -p @parquetlens/sql parquetlens \
  "hf://datasets/<namespace>/<dataset>@~parquet/<config>/<split>/<filename>.parquet" \
  --sql "SELECT * FROM data LIMIT 20"
```

### SQL export

- CSV: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.csv' (FORMAT CSV, HEADER, DELIMITER ',')"`
- JSON: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.json' (FORMAT JSON)"`
- Parquet: `--sql "COPY (SELECT * FROM data LIMIT 1000) TO 'export.parquet' (FORMAT PARQUET)"`

## Creating and Uploading Datasets

Use one of these flows depending on dependency constraints.

Zero local dependencies (Hub UI):

- Create the dataset repo in the browser: `https://huggingface.co/new-dataset`
- Upload parquet files on the repo's "Files and versions" page.
- Verify that the shards appear in the Dataset Viewer:

```bash
curl -s "https://datasets-server.huggingface.co/parquet?dataset=<namespace>/<dataset>"
```

Low-dependency CLI flow (`npx @huggingface/hub` / `hfjs`):

- Set the auth token:

```bash
export HF_TOKEN=<token>
```

- Upload a parquet folder to a dataset repo (auto-creates the repo if missing):

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data
```

- Upload as a private repo on creation:

```bash
npx -y @huggingface/hub upload datasets/<namespace>/<repo> ./local/parquet-folder data --private
```

After upload, call `/parquet` to discover `<config>/<split>/<filename>` values for querying with `@~parquet`.

## Limitations

- Use this skill only when the task clearly matches the scope described above.
- Do not treat the output as a substitute for environment-specific validation, testing, or expert review.
- Stop and ask for clarification if required inputs, permissions, safety boundaries, or success criteria are missing.