---
id: ins_tejas-shahasane-piracy-ai-data-sourcing
title: 'Piracy is the only scalable way to get large digitized book datasets for AI training'
operator: Tejas Shahasane
operator_role: B2B | Content Marketer
source_url: https://www.linkedin.com/feed/update/urn:li:activity:7348641918928551936/
source_type: thread
source_title: 'Piracy is the only scalable way to get large digitized book datasets for AI trai'
source_date: 2026-04-10
captured_date: 2026-05-03
domain: [pmm]
lifecycle: []
maturity: applied
artifact_class: framework
score: { originality: 3, specificity: 3, evidence: 2, transferability: 3, source: 3 }
tier: B
related: []
raw_ref: raw/linkedin/reactions/linkedin-reactions-2026-04-10.md
---

# Piracy is the only scalable way to get large digitized book datasets for AI training

## Claim
The only remaining source for a large, organized, digitized repository of books is piracy: P2P torrents and archives like Z-Library or Anna's Archive.

## Mechanism
Obtaining books legally is cumbersome and does not scale: Anthropic spent millions buying and manually digitizing books. Amazon can't provide a digital library because it is a marketplace, not a reseller, so it would need to negotiate opt-ins with every publisher and author. That leaves piracy as the only viable source for massive training datasets.