---
name: ia
description: Interact with Internet Archive (archive.org) - upload files, download items, and search the archive using the ia CLI tool. Use when working with archive.org, archiving content, or retrieving historical data.
allowed-tools: Bash
argument-hint: [search query | identifier | command]
---

# Internet Archive CLI Skill

This skill enables interaction with the Internet Archive (archive.org) using the `ia` command-line tool from the `internetarchive` Python package.

## Items

An item is the fundamental unit on archive.org - a logical grouping of related files sharing common metadata. An item can be a book, a song, an album, a dataset, a movie, an image or set of images, etc. Each item has a unique identifier across the entire archive.

Every item contains:
- Original uploaded files
- Derivative files (automatically generated by archive.org)
- `<identifier>_meta.xml` - item-level metadata
- `<identifier>_files.xml` - file-level metadata

Items must belong to a collection.

### Item Limits

| Constraint | Recommended | Hard Limit |
|------------|-------------|------------|
| Item total size | Under 100GB | ~1TB |
| Files per item | Under 10,000 | 250,000 (performance degrades >10,000) |
| Single file size | Under 50GB | 500-700GB |
| Daily upload | Under 1,000 files | 5,000 files (zips count as 1) |

**Permanent URL patterns:**
- Details page: `https://archive.org/details/<identifier>`
- Download directory: `https://archive.org/download/<identifier>`
- Specific file: `https://archive.org/download/<identifier>/<filename>`
- Item history: `https://archive.org/history/<identifier>`

**Warning:** Never link to server-specific URLs like `ia802304.us.archive.org` - these break when items migrate between servers. Always use the canonical `archive.org` URLs above.

For more details, see: https://archive.org/developers/items.html

## Derivatives

When you upload files to the Internet Archive, the system automatically generates derivative files - converted versions in different formats and resolutions.
For example:
- **Video**: Transcoded to h.264, Ogg, and various bitrates
- **Audio**: Converted to MP3 (multiple bitrates), Ogg Vorbis, FLAC
- **Text/Books**: OCR processing, searchable PDFs, EPUB, DjVu
- **Images**: Thumbnails, JPEG 2000, different resolutions

Derivatives make content accessible across different devices and bandwidths. You can identify derivatives in `ia list` output - they have an `original` field pointing to their source file.

To skip derivative generation during upload, use `--no-derive`:

```bash
ia upload my-item file.mp4 --metadata="mediatype:movies" --no-derive
```

For the complete list of source formats and their generated derivatives, see: **https://archive.org/help/derivatives.php**

## Metadata Schema

Internet Archive items use XML-based metadata. Key points:
- **Required fields:** `identifier`, `mediatype`
- **Recommended fields:** `title`, `description`, `creator`, `date`, `subject`, `collection`, `language`
- **Repeatable fields:** collections, creators, subjects, languages support multiple values
- **Custom fields:** You can define unlimited custom metadata fields (must follow XML naming rules)

**Identifier requirements:**
- ASCII alphanumeric, underscores, dashes, or periods only
- Must begin with alphanumeric character
- 5-100 characters (5-80 recommended)
- Unique and unchangeable once set

For the complete metadata schema reference, see: **https://archive.org/developers/metadata-schema**

## Collections

Collections group related items together.
Key points:
- **Only IA staff can create collections** - users must request creation
- **Minimum 50 items** required for a new collection
- Items must be related and typically share the same media type
- Collection creation takes up to two weeks after request

To request a collection, contact Internet Archive with:
- List of item identifiers or search query identifying items
- Desired collection identifier (5-80 chars, alphanumeric only)
- Collection title and description
- At least one subject tag

**Public upload collections** (anyone can upload to):
- `opensource_movies`, `opensource_audio`, `opensource_media` - general media
- `community_texts`, `community_video`, `community_audio` - community contributions

Other collections restrict uploads to designated uploaders only.

## Tool Detection and Installation

Before using any `ia` commands, check if the tool is installed:

```bash
ia --version
```

If the `ia` command is not found, install it using `uv`:

```bash
uv tool install internetarchive
```

Alternative installation methods:
- `pipx install internetarchive`
- `pip install internetarchive`

After installation, verify it works with `ia --version`.

## Global Options

These options work with all `ia` commands:

| Option | Description |
|--------|-------------|
| `-h, --help` | Show help message |
| `-v, --version` | Display version |
| `-c, --config-file FILE` | Path to config file |
| `-l, --log` | Enable logging |
| `-d, --debug` | Enable debug output |

## Configuration and Authentication

Check if `ia` is configured:

```bash
ia configure --whoami
```

If not configured (shows error or empty), the user needs to set up credentials:

1. **Interactive setup**: Run `ia configure` and follow prompts
2. **Get credentials**: IA-S3 keys from https://archive.org/account/s3.php
3. **Config location**: Saves to `~/.config/internetarchive/ia.ini` (older installs may use `~/.config/ia.ini`)

### Configure Options

| Option | Description |
|--------|-------------|
| `--whoami` | Print current authenticated user |
| `--show` | Print current config as JSON |
| `--check` | Validate IA-S3 keys (exit 0 if valid, 1 otherwise) |

```bash
# Show current config
ia configure --show

# Validate keys (useful in scripts)
ia configure --check && echo "Keys valid"
```

### Environment Variables

Alternative to config file:

```bash
export IA_ACCESS_KEY_ID="your-access-key"
export IA_SECRET_ACCESS_KEY="your-secret-key"
```

Note: Configuration is required for uploads and metadata modifications. Searching and downloading public items works without authentication.

## User-Agent Identification (Required)

**All requests to the Internet Archive must include a proper User-Agent string** that clearly identifies the source of the request. This applies to every request made via any tool - the `ia` CLI, Python library, direct API calls, curl, or any other HTTP client. This is critical for AI agents, bots, and automated tools.

The `ia` CLI automatically includes a default User-Agent with your access key:

```
internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0
```

When using Claude Code or other AI/LLM agents, **you must append a custom suffix** that includes:
- The tool/agent name and version (e.g., "Claude Code/1.0.0")
- The model being used if applicable (e.g., "claude-sonnet-4-20250514")
- Any relevant context about the automation

The `--user-agent-suffix` CLI option and `user_agent_suffix` config setting require `internetarchive` version 5.7.2 or newer. The default User-Agent (including access key) is always sent - your suffix is appended to it.
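The appending behaviour described above can be sketched in a few lines of Python. This is only an illustration of how a suffix is composed and attached, not the library's implementation: `build_suffix` and `full_user_agent` are hypothetical helper names, and the default string is copied from the example above.

```python
from typing import Optional


def build_suffix(tool: str, version: str, model: Optional[str] = None) -> str:
    """Format a suffix such as 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'."""
    suffix = f"{tool}/{version}"
    if model:
        suffix += f" ({model})"
    return suffix


def full_user_agent(default_ua: str, suffix: str) -> str:
    """Append the suffix to the default User-Agent; the default is never replaced."""
    return f"{default_ua} {suffix}".strip()


# Default User-Agent copied from the example above (ACCESS_KEY is a placeholder).
DEFAULT_UA = "internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0"

suffix = build_suffix("Claude Code", "1.0.0", "claude-sonnet-4-20250514")
print(full_user_agent(DEFAULT_UA, suffix))
```

Keeping the tool name, version, and model as separate arguments makes it harder to forget one of the pieces the suffix is expected to carry.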
**CLI:**

```bash
ia --user-agent-suffix "Claude Code/1.0.0 (claude-sonnet-4-20250514)" download my-item
```

**INI file (`~/.config/internetarchive/ia.ini`):**

```ini
[general]
user_agent_suffix = Claude Code/1.0.0 (claude-sonnet-4-20250514)
```

**Python API:**

```python
from internetarchive import get_session

session = get_session(config={
    'general': {'user_agent_suffix': 'Claude Code/1.0.0 (claude-sonnet-4-20250514)'}
})
```

The resulting User-Agent will look like:

```
internetarchive/5.7.2 (Linux x86_64; N; en; ACCESS_KEY) Python/3.11.0 Claude Code/1.0.0 (claude-sonnet-4-20250514)
```

This helps the Internet Archive track usage patterns, troubleshoot issues, and maintain service quality. Always be specific - include version numbers, model identifiers, and enough detail to distinguish your tool from others.

## Search Operations

Search the Internet Archive catalog:

```bash
ia search '<query>'
```

### Search Parameters

| Parameter | Description |
|-----------|-------------|
| `--itemlist` | Output identifiers only, one per line |
| `-n, --num-found` | Print only the count of results |
| `-s, --sort` | Sort results: `--sort='field desc'` or `--sort='field asc'` |
| `-f, --field` | Return specific metadata fields (repeatable) |
| `-F, --fts` | Full-text search (search within text content, not just metadata) |
| `--parameters` | Raw query parameters: `--parameters="page=N&rows=N"` |

```bash
# Get result count only
ia search 'collection:nasa' -n

# Sort by date descending
ia search 'mediatype:texts' --sort='date desc'

# Return specific fields
ia search 'collection:nasa' --field=identifier --field=title
```

### Sort Fields

Common sort fields for use with `--sort`:

| Field | Description |
|-------|-------------|
| `date` | Content date |
| `publicdate` | When item was published to archive.org |
| `addeddate` | When added to archive |
| `updatedate` | Last updated |
| `title` / `titleSorter` | Alphabetical by title |
| `creator` / `creatorSorter` | Alphabetical by creator |
| `downloads` | Total downloads |
| `week` | Downloads this week |
| `month` | Downloads this month |
| `num_reviews` | Number of reviews |
| `num_favorites` | Number of favorites |
| `item_size` | Total item size |
| `files_count` | Number of files |

Use `asc` or `desc` suffix:

```bash
ia search 'mediatype:audio' --sort='downloads desc'
ia search 'collection:books' --sort='publicdate asc'
ia search 'creator:NASA' --sort='title asc'
```

### Search Query Syntax

The Internet Archive uses **Apache Lucene query syntax**. By default, the operator is AND (all terms must be present).

#### Query Operators

| Operator | Description |
|----------|-------------|
| `AND` | All terms must be present (default) |
| `OR` | Any of the terms can be present |
| `NOT` | Exclude documents with term (requires at least one positive term) |
| `( )` | Group clauses to form subqueries |

#### Field-Specific Searches

Use `field:value` syntax to search specific metadata fields:

| Query | Description |
|-------|-------------|
| `'title:"search text"'` | By title |
| `'creator:"Author Name"'` | By creator/author |
| `'subject:"topic"'` | By subject |
| `'description:"text"'` | By description |
| `'collection:name'` | Items in a collection |
| `'mediatype:texts'` | By media type (texts, movies, audio, software, image, data) |
| `'contributor:smithsonian'` | By contributor |
| `'language:eng'` | By language code |
| `'format:pdf'` | Items containing specific file format |
| `'isbn:9780123456789'` | By ISBN |
| `'licenseurl:http*by-nc*'` | By Creative Commons license |

#### Range Queries

Search values between bounds using square brackets (inclusive) or curly braces (exclusive):

| Syntax | Description |
|--------|-------------|
| `[1000 TO 2000]` | Inclusive range (includes bounds) |
| `{1000 TO 2000}` | Exclusive range (excludes bounds) |
| `[1000 TO null]` | Open-ended range (1000 or greater) |
| `[null TO 2000]` | Open-ended range (2000 or less) |

#### Date Fields

Searchable date fields: `addeddate`, `createdate`, `date`,
`indexdate`, `publicdate`, `reviewdate`, `updatedate`, `oai_updatedate`

| Query | Description |
|-------|-------------|
| `'date:[2020-01-01 TO 2024-12-31]'` | Date range |
| `'publicdate:[2024-01-01 TO 2024-06-30]'` | By publication date |
| `'indexdate:[2024-01-01T00:00:00Z TO 2024-12-31T23:59:59Z]'` | With timestamp |
| `'date:2024*'` | Wildcard for year (non-range) |

#### Fuzzy Queries

Append `~` for approximate spelling matches:

```bash
ia search 'title:buttonwood~'

# Boost fuzzy matches with weights
ia search '(title:buttonwood~)^150 OR (subject:buttonwood~)^100'
```

#### Searching for Missing Fields

Find items where a field doesn't exist:

```bash
ia search 'collection:microfiche AND NOT _exists_:creator'
```

#### Searching by Uploader

Search by uploader's user item, screen name, or email:

```bash
ia search '_uploader_useritem:@username'
ia search '_uploader_screenname:"Display Name"'
ia search 'uploader:your@email.com'
```

#### Additional Searchable Fields

Beyond standard metadata, you can search by:
- `downloads` - download count
- `item_size` - total item size in bytes
- `files_count` - number of files
- `collection_size` - size of collection
- `item_count` - items in collection

```bash
ia search 'collection:opensource AND downloads:[1000 TO null]'
ia search 'mediatype:movies AND item_size:[1000000000 TO null]'
```

#### Combined Queries

```bash
# AND is implicit between terms
ia search 'collection:nasa mediatype:image'

# Explicit operators
ia search 'collection:nasa AND mediatype:image'
ia search 'mediatype:texts OR mediatype:audio'
ia search 'collection:opensource NOT mediatype:software'

# Grouped subqueries
ia search '(mediatype:texts OR mediatype:audio) AND creator:"Mark Twain"'
```

### Full-Text Search

Use the `-F` (or `--fts`) flag to search within the actual text content of items rather than just metadata. This is particularly powerful for searching text collections like books, documents, and OCR'd materials.
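When queries are assembled by a script rather than typed by hand, the `field:value` syntax, quoting, explicit `AND`, and the `-F` flag covered above compose mechanically. The following is a minimal sketch under stated assumptions: `lucene_term`, `build_query`, and `search_argv` are hypothetical helpers (not part of the `ia` tool), range values such as `[1000 TO null]` are passed through unquoted, and only the `ia search` and `-F` arguments already shown in this document are used.

```python
import shlex
from typing import List, Mapping, Optional


def lucene_term(field: str, value: str) -> str:
    """Render field:value, quoting values that contain whitespace.

    Range expressions ([a TO b] or {a TO b}) are left untouched.
    """
    if value[:1] not in "[{" and any(ch.isspace() for ch in value):
        value = '"' + value.replace('"', '\\"') + '"'
    return f"{field}:{value}"


def build_query(filters: Mapping[str, str], phrase: Optional[str] = None) -> str:
    """Join metadata filters (and an optional exact phrase) with explicit AND."""
    parts = [lucene_term(f, v) for f, v in filters.items()]
    if phrase:
        parts.append('"' + phrase.replace('"', '\\"') + '"')
    return " AND ".join(parts)


def search_argv(query: str, full_text: bool = False) -> List[str]:
    """Build the argument list for `ia search`, adding -F for full-text search."""
    argv = ["ia", "search"]
    if full_text:
        argv.append("-F")
    argv.append(query)
    return argv


query = build_query({"collection": "books"}, phrase="climate change")
# shlex.join shows how the command would be typed in a shell.
print(shlex.join(search_argv(query, full_text=True)))
# prints: ia search -F 'collection:books AND "climate change"'
```

Building an argument list (rather than one shell string) lets the result be handed to `subprocess.run` without further shell quoting.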
**Basic full-text search:**

```bash
ia search -F 'collection:collection_name "search phrase"'
```

**How it works:**
- Searches inside the full text of documents (OCR'd PDFs, text files, etc.)
- More powerful than metadata-only search for finding specific quotes or passages
- Requires items to have searchable text (OCR or text files)
- Can be combined with collection and metadata filters

**Full-text search syntax:**
- Use quotes for exact phrases: `"complete phrase"`
- Combine with metadata filters: `collection:name AND "text to find"`
- Works best with text collections that have been OCR'd

### Examples

```bash
# Search NASA images
ia search 'collection:nasa mediatype:image' --parameters="rows=10"

# Search public domain books
ia search 'subject:"public domain" mediatype:texts'

# Get just identifiers
ia search 'creator:"Mark Twain"' --itemlist

# Full-text search within a text collection
ia search -F 'collection:books "climate change"'

# Full-text search for a specific quote in public domain texts
ia search -F '"to be or not to be" mediatype:texts'

# Full-text search with collection filter and pagination
ia search -F 'collection:usgovernmentdocuments "artificial intelligence"' --parameters="rows=20"
```

## Download Operations

Download files from an Internet Archive item:

```bash
ia download <identifier>
```

### Download Parameters

| Parameter | Description |
|-----------|-------------|
| `--glob="*.ext"` | Download only matching files (use `\|` for multiple: `'*.mp4\|*.webm'`) |
| `--exclude="*pattern*"` | Exclude files matching pattern |
| `--format="FORMAT"` | Download specific derivative format |
| `--source=SOURCE` | Filter by source: `original`, `derivative`, `metadata` |
| `--exclude-source=SOURCE` | Exclude by source type |
| `--destdir=path` | Download to specific directory |
| `--no-directories` | Flatten directory structure |
| `-s, --stdout` | Write file to stdout (for piping) |
| `--dry-run` | Show what would be downloaded |
| `--checksum` | Skip files that already exist with correct checksum |
| `--on-the-fly` | Download on-the-fly files (generated derivatives) |
| `--search="QUERY"` | Download from search results |
| `--itemlist=FILE` | Download items listed in file |

#### Filtering by Source Type

Use `--source` and `--exclude-source` to filter by file origin:

```bash
# Download only original files (skip all derivatives)
ia download my-item --source=original

# Download originals and metadata, skip derivatives
ia download my-item --exclude-source=derivative

# Download only metadata files
ia download my-item --source=metadata
```

### Examples

```bash
# Download all files from an item
ia download TripDown1905

# Download specific files by name
ia download TripDown1905 file1.mp4 file2.ogv

# Download only MP4 files
ia download TripDown1905 --glob="*.mp4"

# Download MP4s but exclude low-quality versions
ia download TripDown1905 --glob="*.mp4" --exclude="*512kb*"

# Download specific format
ia download TripDown1905 --format='512Kb MPEG4'

# Download to specific directory
ia download TripDown1905 --destdir=./downloads

# Download from search results
ia download --search 'collection:opensource_movies' --glob="*.mp4"

# Download items from a list file
ia search 'collection:glasgowschoolofart' --itemlist > itemlist.txt
ia download --itemlist itemlist.txt

# Preview what will be downloaded
ia download my_item --dry-run
```

## Upload Operations

Upload files to the Internet Archive (requires authentication):

```bash
ia upload <identifier> file1 file2 --metadata="mediatype:value"
```

### Required Metadata

The `mediatype` field is required.
Common values:
- `texts` - Books, documents, PDFs
- `movies` - Video files
- `audio` - Music, podcasts, sound
- `software` - Programs, games
- `image` - Photos, graphics
- `data` - Datasets, archives

### Upload Parameters

| Parameter | Description |
|-----------|-------------|
| `--metadata="key:value"` | Set metadata (repeatable) |
| `--header="key:value"` | Set HTTP header |
| `--checksum` | Skip files already uploaded |
| `-v, --verify` | Verify data wasn't corrupted after upload |
| `--no-derive` | Skip derivative processing |
| `--retries=N` | Number of retry attempts |
| `--remote-name=NAME` | Set remote filename (for stdin uploads) |
| `--keep-directories` | Preserve directory structure in remote filename |
| `-o, --open-after-upload` | Open item in browser after upload |
| `--file-metadata=FILE` | File-level metadata from JSONL file |
| `--spreadsheet=FILE` | Bulk upload from CSV spreadsheet |

### Common Metadata Fields

```bash
--metadata="title:My Document Title"
--metadata="creator:Author Name"
--metadata="description:A description of the content"
--metadata="subject:topic1;topic2"
--metadata="collection:community_texts"
--metadata="date:2024-01-15"
--metadata="language:eng"
```

### Examples

```bash
# Upload a PDF document
ia upload my-document-2024 document.pdf \
  --metadata="mediatype:texts" \
  --metadata="title:My Document" \
  --metadata="creator:John Doe"

# Upload multiple files
ia upload my-archive file1.pdf file2.pdf file3.pdf \
  --metadata="mediatype:texts" \
  --metadata="title:Document Collection"

# Upload with retries, skipping files already uploaded (--checksum)
ia upload my-item large-file.zip \
  --metadata="mediatype:data" \
  --checksum \
  --retries=10

# Upload from stdin
cat data.gz | ia upload my-item - \
  --remote-name=data.gz \
  --metadata="mediatype:data"

# Bulk upload using spreadsheet
ia upload --spreadsheet=metadata.csv
```

**Notes:**
- Items receive `data` mediatype by default if not specified
- After upload, mediatype can only be changed with admin support
- Derivative generation takes seconds to days depending on file type and system load
- Items typically appear in search within minutes, but can take up to 24 hours

### Test Collection

Upload to `test_collection` for validation - items are automatically removed after ~30 days:

```bash
ia upload my-test-item file.pdf \
  --metadata="mediatype:texts" \
  --metadata="collection:test_collection"
```

### Identifier Guidelines

- Use lowercase letters, numbers, and hyphens
- No spaces or special characters
- Keep it descriptive but concise
- Check if identifier exists: `ia metadata <identifier> --exists`

### Item Thumbnail Image

To set a custom thumbnail for an item, upload an image named `<identifier>_itemimage.jpg`:

```bash
ia upload my-item my-item_itemimage.jpg
```

### Restricting Downloads

To make files streamable but not downloadable, add the item to the `stream_only` collection:

```bash
ia metadata <identifier> --append-list="collection:stream_only"
```

## Metadata Operations

View and modify item metadata:

```bash
# View metadata (JSON output)
ia metadata <identifier>

# Extract specific field with jq
ia metadata <identifier> | jq '.metadata.date'

# List file formats contained in an item
ia metadata <identifier> --formats

# Modify metadata (set or replace)
ia metadata <identifier> --modify="title:New Title"
ia metadata <identifier> --modify="foo:bar" --modify="baz:value"

# Remove a metadata field
ia metadata <identifier> --modify="fieldname:REMOVE_TAG"

# Append value to existing field
ia metadata <identifier> --append="title:Subtitle Here"

# Append to list field (e.g., subjects)
ia metadata <identifier> --append-list="subject:new topic"

# Remove specific value from list field
ia metadata <identifier> --remove="subject:old topic"

# Modify file-level metadata
ia metadata <identifier> --target="files/foo.txt" --modify="title:My File"

# Bulk updates from spreadsheet
ia metadata --spreadsheet=metadata.csv
```

## List Operations

List files in an Internet Archive item:

```bash
ia list <identifier>
```

Shows all files with details (name, size, format).
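The same file details can also be read from the JSON that `ia metadata` prints. The sketch below assumes that output has a top-level `files` array whose entries carry `name`, `size`, and `format` keys, and that derivative entries carry an `original` key naming their source file (as described in the Derivatives section). The sample record is invented purely for illustration, and `file_rows` and `derivatives_of` are hypothetical helper names.

```python
import json
from typing import List, Tuple

# Hypothetical sample shaped like item metadata JSON; all values are made up.
SAMPLE = """
{
  "metadata": {"identifier": "my-item", "mediatype": "movies"},
  "files": [
    {"name": "film.mpeg", "source": "original", "format": "MPEG2", "size": "1000"},
    {"name": "film.mp4", "source": "derivative", "format": "h.264", "size": "400",
     "original": "film.mpeg"},
    {"name": "my-item_meta.xml", "source": "metadata", "format": "Metadata", "size": "50"}
  ]
}
"""


def file_rows(item: dict) -> List[Tuple[str, int, str]]:
    """Return (name, size, format) tuples, one per file in the item."""
    return [
        (f["name"], int(f.get("size", 0)), f.get("format", ""))
        for f in item.get("files", [])
    ]


def derivatives_of(item: dict, original_name: str) -> List[str]:
    """Names of files whose `original` field points at the given source file."""
    return [
        f["name"] for f in item.get("files", [])
        if f.get("original") == original_name
    ]


item = json.loads(SAMPLE)
for name, size, fmt in file_rows(item):
    print(f"{name}\t{size}\t{fmt}")
print(derivatives_of(item, "film.mpeg"))
```

In practice the JSON text would come from the command's standard output (for example via `subprocess.run`) instead of the embedded sample.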
### List Parameters

| Parameter | Description |
|-----------|-------------|
| `--columns=name,size` | Specify columns to show |
| `--glob="*.pdf"` | Filter by pattern |
| `-l, --location` | Print full URLs for each file |
| `-a, --all` | List all available file information |
| `-v, --verbose` | Print column headers |

```bash
# List with full URLs
ia list my-item --location

# List all file info with headers
ia list my-item --all --verbose

# List specific columns
ia list my-item --columns=name,size,format
```

## Tasks and Jobs

Check status of catalog tasks (uploads, derives, etc.):

```bash
# Check tasks for a specific item
ia tasks <identifier>

# Check all your tasks
ia tasks
```

### Darking and Undarking Items

To make an item dark (hidden from public access) or undark it:

```bash
# Dark an item (requires comment)
ia tasks <identifier> --cmd=make_dark.php --comment="Reason for darking"

# Undark an item
ia tasks <identifier> --cmd=make_undark.php --comment="Reason for undarking"
```

## Bulk Operations with GNU Parallel

For batch processing many items, use [GNU Parallel](https://www.gnu.org/software/parallel/) to run `ia` commands concurrently.

### Installation

```bash
# macOS
brew install parallel

# Debian/Ubuntu
apt install parallel
```

### Basic Usage

Pipe item identifiers to parallel, using `{}` as placeholder:

```bash
# Fetch metadata for many items
cat itemlist.txt | parallel 'ia metadata {}'

# Download multiple items
cat itemlist.txt | parallel 'ia download {}'
```

### Careful Batch Processing

For reliable bulk operations, use job logging to track progress and handle failures:

```bash
# Step 1: Create item list
ia search 'collection:myproject' --itemlist > itemlist.txt

# Step 2: Run with job logging
cat itemlist.txt | parallel --joblog job.log 'ia download {}'

# Step 3: Check for failures
echo $?  # 0 = all succeeded

# Step 4: Retry only failed jobs
parallel --retry-failed --joblog job.log
```

### Job Log Benefits

The `--joblog` file tracks each command's exit status, allowing you to:
- Resume interrupted batch jobs
- Retry only failed items without re-processing successes
- Audit what succeeded and failed

### Dry Run First

Always preview before bulk execution:

```bash
cat itemlist.txt | parallel --dry-run 'ia download {}'
```

### Rate Limiting

Control concurrency to avoid overwhelming the server:

```bash
# Limit to 4 concurrent jobs
cat itemlist.txt | parallel -j4 'ia download {}'

# Add delay between jobs
cat itemlist.txt | parallel --delay 1 'ia download {}'
```

See: https://archive.org/developers/internetarchive/parallel.html

## Best Practices

1. **Always configure before uploading** - Run `ia configure` first
2. **Use meaningful identifiers** - Descriptive, lowercase, hyphenated
3. **Include proper metadata** - At minimum: mediatype, title, creator
4. **Check before uploading** - Verify identifier doesn't exist: `ia metadata <identifier>`
5. **Use checksums** - Add `--checksum` for large uploads to enable resume
6. **Respect rate limits** - Don't spam requests; add delays for bulk operations
7. **Test with dry-run** - Use `--dry-run` to preview operations
8. **Use test_collection first** - Validate uploads before committing to permanent collections
9. **Zip large file sets** - Bundle many small files into archives before uploading
10. **Specify language** - Set `language` metadata for proper OCR processing on texts

## Error Handling

| Error | Solution |
|-------|----------|
| "not configured" | Run `ia configure` or set environment variables |
| "identifier exists" | Choose a different identifier |
| "permission denied" | Check credentials at https://archive.org/account/s3.php |
| "network error" | Retry the operation; check internet connection |
| "item not found" | Verify the identifier spelling |
| "429 Too Many Requests" | Rate limited; wait and retry with `Retry-After` header value |
| Item not appearing in search | Usually appears within minutes; check `ia tasks <identifier>` for pending jobs |
| Derive task failed | Check filename characters, file format, language metadata |

## Quick Reference

```bash
# Search
ia search 'query'
ia search 'query' --itemlist

# Download
ia download <identifier>
ia download <identifier> --glob="*.pdf"

# Upload (requires auth)
ia upload <identifier> files --metadata="mediatype:texts"

# Metadata
ia metadata <identifier>
ia metadata <identifier> --modify="title:New Title"

# List files
ia list <identifier>

# Tasks
ia tasks <identifier>

# Config
ia configure
ia configure --whoami

# Install
uv tool install internetarchive
```

## API Reference

For programmatic access beyond the CLI, see the full developer documentation: **https://archive.org/developers**

### Core APIs

| API | Description |
|-----|-------------|
| [Items](https://archive.org/developers/items.html) | Understanding item structure and access |
| [Metadata Schema](https://archive.org/developers/metadata-schema/) | Complete metadata field reference |
| [Metadata Read](https://archive.org/developers/md-read.html) | Retrieve item metadata via API |
| [Metadata Write](https://archive.org/developers/md-write.html) | Modify item metadata via API |
| [IAS3](https://archive.org/developers/ias3.html) | S3-compatible API for uploads |
| [Tasks](https://archive.org/developers/tasks.html) | Task queue management |

### Additional APIs

| API | Description |
|-----|-------------|
| [Changes](https://archive.org/developers/changes.html) | Track item modifications across the archive |
| [Views](https://archive.org/developers/views_api.html) | Access viewing and download statistics |
| [Reviews](https://archive.org/developers/reviews.html) | Manage item reviews |
| [Simple Lists](https://archive.org/developers/simplelists.html) | Create item relationships and lists |
| [OCR Service](https://archive.org/developers/ocr.html) | Text recognition service |
| [PDF Service](https://archive.org/developers/pdf.html) | PDF generation and processing |

### Python Library

For Python integration: [internetarchive library](https://archive.org/developers/internetarchive/)

### TypeScript Library (Third-Party)

A community-maintained TypeScript port is available: [internetarchive-ts](https://github.com/karpour/internetarchive-ts) ([docs](https://karpour.github.io/internetarchive-ts/))

Note: This is a work in progress and not officially maintained by the Internet Archive.