--- namespace: aiwg platforms: [all] name: Archive Acquisition description: Patterns for acquiring content from Internet Archive and archival sources category: media-curator --- # Archive Acquisition Skill Specialized patterns and techniques for acquiring media content from Internet Archive (archive.org) and other archival sources. Focuses on bulk downloads, quality selection, format filtering, and collection management. ## Internet Archive Overview Internet Archive hosts massive collections of audio, video, text, and software. Key characteristics: - **Open access**: Most content freely downloadable - **Multiple formats**: Same item often available in FLAC, MP3, OGG, etc. - **Collections**: Items grouped by uploader, topic, or curator - **Metadata**: Rich JSON metadata for every item - **API access**: Full programmatic access via REST API ## Discovery Patterns ### Search API Find items matching specific criteria: ```bash # Search for items by keyword and mediatype search_archive() { local query="$1" local mediatype="$2" # audio, movies, texts, etc. local rows="${3:-100}" # Results per page curl -s "https://archive.org/advancedsearch.php" \ -d "q=$query AND mediatype:$mediatype" \ -d "fl[]=identifier,title,creator,date,format" \ -d "rows=$rows" \ -d "output=json" \ | jq '.response.docs[]' } # Example: Search for Grateful Dead concerts search_archive "grateful dead" "audio" 50 ``` ### Response Format ```json { "identifier": "gd1977-05-08.sbd.miller.97065.flac16", "title": "Grateful Dead Live at Barton Hall on 1977-05-08", "creator": "Grateful Dead", "date": "1977-05-08", "format": ["Flac", "VBR MP3", "Ogg Vorbis", "Metadata"] } ``` ### Collection Browsing List all items in a collection: ```bash # Get items from a specific collection get_collection_items() { local collection="$1" local rows="${2:-1000}" curl -s "https://archive.org/advancedsearch.php" \ -d "q=collection:$collection" \ -d "fl[]=identifier,title,format" \ -d "rows=$rows" \ -d "output=json" \ | jq -r '.response.docs[] | .identifier' } # Example: Get all items from GratefulDead collection get_collection_items "GratefulDead" 500 ``` ### Advanced Query Patterns ```bash # Find items by date range search_by_date() { local creator="$1" local start_date="$2" # YYYY-MM-DD local end_date="$3" curl -s "https://archive.org/advancedsearch.php" \ -d "q=creator:\"$creator\" AND date:[$start_date TO $end_date]" \ -d "fl[]=identifier,title,date" \ -d "rows=100" \ -d "output=json" \ | jq '.response.docs[]' } # Find items with specific format available search_by_format() { local query="$1" local format="$2" # FLAC, MP3, etc. curl -s "https://archive.org/advancedsearch.php" \ -d "q=$query AND format:$format" \ -d "fl[]=identifier,title,format" \ -d "rows=100" \ -d "output=json" \ | jq '.response.docs[]' } # Example: Find all FLAC recordings of specific artist search_by_format "pink floyd" "Flac" ``` ## Item Metadata Retrieval ### Get Item Details ```bash # Retrieve full metadata for an item get_item_metadata() { local identifier="$1" curl -s "https://archive.org/metadata/$identifier" \ | jq '.' } # Extract specific metadata fields get_item_files() { local identifier="$1" curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | select(.format != "Metadata") | "\(.name)\t\(.format)\t\(.size)"' } # Example output: # gd77-05-08d1t01.flac Flac 45123456 # gd77-05-08d1t01.mp3 VBR MP3 12345678 ``` ### Filter by Format ```bash # Get only FLAC files from an item get_flac_files() { local identifier="$1" curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | select(.format == "Flac") | .name' } # Get best quality audio files (prefer FLAC, fallback to 320K MP3) get_best_audio_files() { local identifier="$1" # Try FLAC first local flac_files=$(curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | select(.format == "Flac") | .name') if [[ -n "$flac_files" ]]; then echo "$flac_files" return 0 fi # Fallback to highest bitrate MP3 curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | select(.format == "VBR MP3" or .format == "320Kbps MP3") | .name' } ``` ## Download Patterns ### Single Item Download ```bash # Download all files from a single item download_item() { local identifier="$1" local output_dir="$2" local format_filter="$3" # Optional: FLAC, MP3, etc. mkdir -p "$output_dir" # Get file list local files if [[ -n "$format_filter" ]]; then files=$(curl -s "https://archive.org/metadata/$identifier" \ | jq -r ".files[] | select(.format == \"$format_filter\") | .name") else files=$(curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | select(.format != "Metadata") | .name') fi # Download each file while IFS= read -r file; do echo "Downloading: $file" wget -q --show-progress \ "https://archive.org/download/$identifier/$file" \ -P "$output_dir" done <<< "$files" } # Example: Download all FLAC files from an item download_item "gd1977-05-08.sbd.miller.97065.flac16" "/mnt/archive/grateful-dead/1977-05-08" "Flac" ``` ### Recursive Collection Download ```bash # Download entire collection with format filtering download_collection() { local collection="$1" local output_base="$2" local format="$3" # FLAC, MP3, etc. local max_items="${4:-100}" # Get collection items local items=$(get_collection_items "$collection" "$max_items") local count=0 while IFS= read -r identifier; do count=$((count + 1)) echo "[$count/$max_items] Downloading: $identifier" # Create item directory local item_dir="$output_base/$identifier" mkdir -p "$item_dir" # Download with format filter download_item "$identifier" "$item_dir" "$format" # Download metadata curl -s "https://archive.org/metadata/$identifier" \ > "$item_dir/.curator/metadata.json" # Rate limiting (be nice to archive.org) sleep 2 done <<< "$items" } # Example: Download first 50 FLAC items from collection download_collection "GratefulDead" "/mnt/archive/grateful-dead" "Flac" 50 ``` ### Parallel Collection Download ```bash # Download collection items in parallel (with concurrency limit) download_collection_parallel() { local collection="$1" local output_base="$2" local format="$3" local max_concurrent="${4:-3}" # Get collection items local items=$(get_collection_items "$collection" 1000) # Create job queue local queue_file="/tmp/curator-archive-queue-$$.txt" echo "$items" > "$queue_file" # Process queue with concurrency limit cat "$queue_file" | xargs -P "$max_concurrent" -I {} bash -c " identifier={} item_dir=\"$output_base/\$identifier\" mkdir -p \"\$item_dir\" echo \"Downloading: \$identifier\" # Download best quality audio files=\$(curl -s \"https://archive.org/metadata/\$identifier\" \ | jq -r '.files[] | select(.format == \"$format\") | .name') while IFS= read -r file; do wget -q --show-progress \ \"https://archive.org/download/\$identifier/\$file\" \ -P \"\$item_dir\" done <<< \"\$files\" # Save metadata curl -s \"https://archive.org/metadata/\$identifier\" \ > \"\$item_dir/.curator/metadata.json\" sleep 2 # Rate limiting " rm "$queue_file" } ``` ### wget Recursive Download ```bash # Use wget's recursive mode for efficient bulk download download_with_wget_recursive() { local identifier="$1" local output_dir="$2" local format_pattern="$3" # e.g., "*.flac" or "*.mp3" mkdir -p "$output_dir" wget --recursive \ --no-parent \ --no-directories \ --accept "$format_pattern" \ --directory-prefix="$output_dir" \ --wait=2 \ --random-wait \ "https://archive.org/download/$identifier/" } # Example: Download all FLAC files from item download_with_wget_recursive "gd1977-05-08.sbd.miller.97065.flac16" \ "/mnt/archive/gd-1977-05-08" \ "*.flac" ``` ## Quality Selection Strategy ### Format Preference Hierarchy For audio archival, prefer formats in this order: 1. **FLAC** - Lossless, widely supported 2. **320Kbps MP3** - High quality lossy, universal compatibility 3. **VBR MP3** - Variable bitrate, good quality/size ratio 4. **Ogg Vorbis** - Good quality, open format 5. **128Kbps MP3** - Acceptable for spoken word, not music ### Automatic Quality Selection ```bash # Select best available format for an item select_best_format() { local identifier="$1" # Get available formats local formats=$(curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | .format' | sort -u) # Check in preference order if echo "$formats" | grep -q "Flac"; then echo "Flac" elif echo "$formats" | grep -q "320Kbps MP3"; then echo "320Kbps MP3" elif echo "$formats" | grep -q "VBR MP3"; then echo "VBR MP3" elif echo "$formats" | grep -q "Ogg Vorbis"; then echo "Ogg Vorbis" else echo "$(echo "$formats" | grep -i mp3 | head -1)" fi } # Download using best available format download_best_quality() { local identifier="$1" local output_dir="$2" local best_format=$(select_best_format "$identifier") echo "Selected format: $best_format" download_item "$identifier" "$output_dir" "$best_format" } ``` ### Size vs Quality Trade-offs ```bash # Estimate download size before acquiring estimate_download_size() { local identifier="$1" local format="$2" local total_size=$(curl -s "https://archive.org/metadata/$identifier" \ | jq "[.files[] | select(.format == \"$format\") | .size | tonumber] | add") # Convert to human-readable local size_mb=$((total_size / 1048576)) local size_gb=$((size_mb / 1024)) if [[ $size_gb -gt 0 ]]; then echo "${size_gb}GB" else echo "${size_mb}MB" fi } # Check if size is acceptable before downloading download_if_acceptable() { local identifier="$1" local output_dir="$2" local format="$3" local max_size_gb="${4:-5}" # Default 5GB limit local estimated_size=$(estimate_download_size "$identifier" "$format") local size_value=$(echo "$estimated_size" | sed 's/[^0-9]//g') local size_unit=$(echo "$estimated_size" | sed 's/[0-9]//g') if [[ "$size_unit" == "GB" && $size_value -gt $max_size_gb ]]; then echo "SKIPPED: $identifier ($estimated_size exceeds ${max_size_gb}GB limit)" return 1 fi echo "Downloading: $identifier ($estimated_size)" download_item "$identifier" "$output_dir" "$format" } ``` ## Metadata Extraction and Storage ### Extract Archive Metadata ```bash # Convert Archive.org metadata to curator format extract_archive_metadata() { local identifier="$1" local output_file="$2" curl -s "https://archive.org/metadata/$identifier" | jq '{ source: "archive.org", identifier: .metadata.identifier, title: .metadata.title, creator: .metadata.creator, date: .metadata.date, venue: .metadata.venue, coverage: .metadata.coverage, description: .metadata.description, collection: .metadata.collection, downloads: .item_size.downloads, avg_rating: .reviews.info.avg_rating, num_reviews: .reviews.info.num_reviews, files: [ .files[] | { name: .name, format: .format, size: .size, md5: .md5, sha1: .sha1 } ] }' > "$output_file" } # Example usage extract_archive_metadata "gd1977-05-08.sbd.miller.97065.flac16" \ ".curator/metadata/gd-1977-05-08.json" ``` ### Merge with Curator Metadata ```bash # Merge Archive.org metadata with existing curator metadata merge_metadata() { local curator_metadata="$1" local archive_metadata="$2" local output_file="$3" jq -s '.[0] * .[1]' "$curator_metadata" "$archive_metadata" > "$output_file" } ``` ## Verification and Quality Assurance ### Checksum Verification Archive.org provides MD5 and SHA1 checksums for all files: ```bash # Verify downloaded files against Archive.org checksums verify_archive_downloads() { local identifier="$1" local download_dir="$2" # Get checksums from metadata local checksums=$(curl -s "https://archive.org/metadata/$identifier" \ | jq -r '.files[] | "\(.md5) \(.name)"') # Verify each file local failed=0 while IFS= read -r line; do local expected_md5="${line%% *}" local filename="${line#* }" local filepath="$download_dir/$filename" if [[ ! -f "$filepath" ]]; then echo "MISSING: $filename" failed=$((failed + 1)) continue fi local actual_md5=$(md5sum "$filepath" | awk '{print $1}') if [[ "$actual_md5" != "$expected_md5" ]]; then echo "CHECKSUM FAILED: $filename" echo " Expected: $expected_md5" echo " Actual: $actual_md5" failed=$((failed + 1)) else echo "VERIFIED: $filename" fi done <<< "$checksums" if [[ $failed -gt 0 ]]; then echo "Verification failed for $failed files" return 1 fi echo "All files verified successfully" return 0 } ``` ### File Integrity Check ```bash # Verify audio file integrity after download verify_audio_integrity() { local file="$1" # Check file is not zero-byte if [[ ! -s "$file" ]]; then echo "FAILED: Zero-byte file" return 1 fi # Verify with ffmpeg if command -v ffmpeg >/dev/null; then if ffmpeg -v error -i "$file" -f null - 2>&1 | grep -q "error"; then echo "FAILED: Corrupted audio file" return 1 fi fi # FLAC specific: verify with flac tool if [[ "$file" == *.flac ]] && command -v flac >/dev/null; then if ! flac -t "$file" 2>&1 | grep -q "ok"; then echo "FAILED: FLAC verification failed" return 1 fi fi echo "VERIFIED: $file" return 0 } # Verify all files in directory verify_directory_integrity() { local dir="$1" local total=0 local verified=0 local failed=0 find "$dir" -type f \( -name "*.flac" -o -name "*.mp3" -o -name "*.ogg" \) | while read -r file; do total=$((total + 1)) if verify_audio_integrity "$file"; then verified=$((verified + 1)) else failed=$((failed + 1)) fi done echo "Integrity check complete: $verified/$total verified, $failed failed" } ``` ## Rate Limiting and Politeness Archive.org requests responsible use of their bandwidth: ```bash # Implement polite download behavior polite_download() { local identifier="$1" local output_dir="$2" local format="$3" # Add delays between requests local wait_time=2 # seconds # Get file list local files=$(curl -s "https://archive.org/metadata/$identifier" \ | jq -r ".files[] | select(.format == \"$format\") | .name") while IFS= read -r file; do echo "Downloading: $file" wget --quiet \ --show-progress \ --wait="$wait_time" \ --random-wait \ --user-agent="AIWG-MediaCurator/1.0 (research/personal use)" \ "https://archive.org/download/$identifier/$file" \ -P "$output_dir" done <<< "$files" } ``` ### Bulk Download Best Practices 1. **Rate limit**: Wait 2-5 seconds between requests 2. **Random wait**: Add jitter to avoid patterns 3. **User agent**: Identify your tool 4. **Off-peak hours**: Download during US night hours 5. **Resume support**: Use `wget -c` to resume interrupted downloads 6. **Error handling**: Retry failed downloads after delay ## Collection Management ### Build Local Collection Index ```bash # Create searchable index of downloaded items build_collection_index() { local collection_dir="$1" local index_file="$2" echo '{"items": []}' > "$index_file" find "$collection_dir" -name "metadata.json" -type f | while read -r metadata_file; do local item_dir=$(dirname "$metadata_file") # Extract key metadata fields local item_data=$(jq '{ identifier: .metadata.identifier, title: .metadata.title, creator: .metadata.creator, date: .metadata.date, local_path: "'"$item_dir"'", file_count: (.files | length), total_size: ([.files[] | .size | tonumber] | add) }' "$metadata_file") # Append to index jq ".items += [$item_data]" "$index_file" > "$index_file.tmp" mv "$index_file.tmp" "$index_file" done echo "Index built: $index_file" } # Search local collection search_local_collection() { local index_file="$1" local query="$2" jq -r ".items[] | select(.title | test(\"$query\"; \"i\")) | \"\(.date) - \(.creator) - \(.title) (\(.local_path))\"" \ "$index_file" } ``` ### Deduplicate Collection ```bash # Find and remove duplicate items in collection deduplicate_collection() { local collection_dir="$1" # Build checksum database local checksum_file="/tmp/curator-checksums-$$.txt" find "$collection_dir" -type f \( -name "*.flac" -o -name "*.mp3" \) | while read -r file; do local checksum=$(md5sum "$file" | awk '{print $1}') echo "$checksum $file" >> "$checksum_file" done # Find duplicates sort "$checksum_file" | uniq -d -w 32 | while read -r line; do local checksum="${line%% *}" echo "Duplicate found (checksum: $checksum):" grep "^$checksum" "$checksum_file" | while read -r dup_line; do local dup_file="${dup_line#* }" echo " - $dup_file" done # Offer to delete all but first local files=($(grep "^$checksum" "$checksum_file" | cut -d' ' -f3-)) local keep="${files[0]}" echo "Keep: $keep" for ((i=1; i<${#files[@]}; i++)); do echo "Delete: ${files[$i]}" rm "${files[$i]}" done done rm "$checksum_file" } ``` ## Advanced Patterns ### Incremental Collection Sync ```bash # Download only new items added to collection since last sync sync_collection_incremental() { local collection="$1" local output_base="$2" local format="$3" local sync_state_file="$output_base/.curator/sync-state.json" # Get last sync timestamp local last_sync if [[ -f "$sync_state_file" ]]; then last_sync=$(jq -r '.last_sync' "$sync_state_file") else last_sync="1970-01-01" fi # Search for items added/updated since last sync local new_items=$(curl -s "https://archive.org/advancedsearch.php" \ -d "q=collection:$collection AND publicdate:[$last_sync TO null]" \ -d "fl[]=identifier" \ -d "rows=1000" \ -d "output=json" \ | jq -r '.response.docs[] | .identifier') # Download new items local count=0 while IFS= read -r identifier; do count=$((count + 1)) echo "Downloading new item [$count]: $identifier" download_item "$identifier" "$output_base/$identifier" "$format" sleep 2 done <<< "$new_items" # Update sync state echo "{\"last_sync\": \"$(date -Iseconds)\", \"items_added\": $count}" > "$sync_state_file" echo "Sync complete: $count new items added" } ``` ### Filtered Batch Download ```bash # Download items matching complex filter criteria download_filtered_batch() { local query="$1" local output_base="$2" local format="$3" local min_rating="${4:-4.0}" local min_year="${5:-1960}" # Search with filters local items=$(curl -s "https://archive.org/advancedsearch.php" \ -d "q=$query AND avg_rating:[$min_rating TO 5.0] AND year:[$min_year TO 2026]" \ -d "fl[]=identifier,avg_rating,year" \ -d "rows=500" \ -d "output=json") # Download each item echo "$items" | jq -r '.response.docs[] | .identifier' | while read -r identifier; do local rating=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .avg_rating") local year=$(echo "$items" | jq -r ".response.docs[] | select(.identifier == \"$identifier\") | .year") echo "Downloading: $identifier (rating: $rating, year: $year)" download_item "$identifier" "$output_base/$identifier" "$format" sleep 2 done } ``` ## Integration with Media Curator Framework ### Source Plan Generation ```bash # Generate source plan for /acquire command generate_archive_source_plan() { local collection="$1" local output_file="$2" local format="${3:-Flac}" # Get collection items local items=$(get_collection_items "$collection" 100) # Build YAML plan cat > "$output_file" <> "$output_file" <&1 | tee "$output_dir/.curator/metadata-debug.log" # Download with verbose wget wget --verbose \ --debug \ "https://archive.org/download/$identifier/" \ 2>&1 | tee "$output_dir/.curator/download-debug.log" set +x } ``` ## Performance Optimization ### Parallel Download Tuning ```bash # Optimal concurrent downloads based on connection speed determine_optimal_parallelism() { local speed_mbps="$1" if [[ $speed_mbps -lt 10 ]]; then echo 2 # Slow connection elif [[ $speed_mbps -lt 50 ]]; then echo 3 # Medium connection elif [[ $speed_mbps -lt 100 ]]; then echo 5 # Fast connection else echo 8 # Very fast connection fi } # Test connection speed test_connection_speed() { # Download test file and measure speed local test_file="https://archive.org/download/test_item/test.mp3" local start_time=$(date +%s) wget -q -O /tmp/speed-test.tmp "$test_file" local end_time=$(date +%s) local duration=$((end_time - start_time)) local size=$(stat -c%s /tmp/speed-test.tmp 2>/dev/null || stat -f%z /tmp/speed-test.tmp) local speed_mbps=$(( (size * 8) / (duration * 1000000) )) rm /tmp/speed-test.tmp echo "$speed_mbps" } ``` ### Caching Strategy ```bash # Cache metadata to reduce API calls cache_metadata() { local identifier="$1" local cache_dir=".curator/cache/metadata" local cache_file="$cache_dir/$identifier.json" mkdir -p "$cache_dir" # Check cache age if [[ -f "$cache_file" ]]; then local cache_age=$(( $(date +%s) - $(stat -c%Y "$cache_file" 2>/dev/null || stat -f%m "$cache_file") )) # Cache valid for 7 days if [[ $cache_age -lt 604800 ]]; then cat "$cache_file" return 0 fi fi # Fetch and cache curl -s "https://archive.org/metadata/$identifier" | tee "$cache_file" } ``` ## Summary Archive Acquisition skill provides: - **Discovery**: Search API patterns for finding content - **Quality Selection**: Automatic best-format selection (prefer FLAC, accept MP3 320K) - **Bulk Download**: Recursive wget and parallel download patterns - **Verification**: Checksum validation and integrity checks - **Collection Management**: Indexing, deduplication, incremental sync - **Integration**: Source plan generation for `/acquire` command - **Performance**: Parallel tuning, caching, rate limiting Use this skill whenever working with Internet Archive content in the Media Curator framework. ## References - @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/human-authorization.md — Seek explicit authorization before bulk downloads and irreversible archive operations - @$AIWG_ROOT/agentic/code/addons/aiwg-utils/rules/research-before-decision.md — Evaluate archive item quality and metadata before committing to download - @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/acquire/SKILL.md — General acquisition skill that delegates to archive-acquisition for archive.org sources - @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/integrity-verification/SKILL.md — Verify checksums after archive downloads - @$AIWG_ROOT/agentic/code/frameworks/media-curator/skills/quality-filtering/SKILL.md — Quality scoring used to select best format from archive items