---
id: ins_tejas-shahasane-piracy-ai-data-sourcing
title: 'Piracy is the only scalable way to get large digitized book datasets for AI training'
operator: Tejas Shahasane
operator_role: B2B | Content Marketer
source_url: https://www.linkedin.com/feed/update/urn:li:activity:7348641918928551936/
source_type: thread
source_title: 'Piracy is the only scalable way to get large digitized book datasets for AI trai'
source_date: 2026-04-10
captured_date: 2026-05-03
domain: [pmm]
lifecycle: []
maturity: applied
artifact_class: framework
score: { originality: 3, specificity: 3, evidence: 2, transferability: 3, source: 3 }
tier: B
related: []
raw_ref: raw/linkedin/reactions/linkedin-reactions-2026-04-10.md
---

# Piracy is the only scalable way to get large digitized book datasets for AI training

## Claim

The only remaining source for a large, organized, digitized repository of books is piracy: P2P torrents and shadow archives such as Z-Library or Anna's Archive.

## Mechanism

Obtaining books legally is cumbersome and does not scale. Anthropic spent millions buying books and manually digitizing them. Amazon cannot supply a digital library because it is a marketplace, not a reseller, so it would need to negotiate an opt-in with every publisher and author. That leaves piracy as the only viable source for massive training datasets.