---
title: "Vibe-Scraping: Write outcomes, not scrapers"
date: "2025-09-30T13:18:20Z"
lastmod: "2025-09-30T13:19:33Z"
categories:
  - coding
  - llms
  - visualisation
wp_id: 4220
description: I've shifted from writing manual BeautifulSoup scripts to "vibe-scraping" using AI agents. By focusing on outcomes rather than implementation, I generated Bollywood box office datasets and merchant audits in minutes, making traditional web scraping expertise feel obsolete.
keywords: [vibe-scraping, ai agents, codex, claude code, web scraping, data extraction, automation]
---

![Vibe-Scraping: Write outcomes, not scrapers](/blog/assets/image-12.webp)

There hasn't been a box-office explosion like Dangal in the history of Bollywood. CPI inflation-adjusted to 2024, it is the only film in the ₹3,000 Cr club.

3 Idiots (2009) is the first member of the ₹1,000 Cr club (2024-inflation-adjusted).

The hot streak was 2013-2017: **each** year, a film crossed that bar: Dhoom 3, PK, Bajrangi Bhaijaan, Dangal, Secret Superstar.

Since then, we haven't seen such a release, **except** in 2023 (Jawan, Pathaan).

But this story isn't about the box-office drought. It's about **vibe-scraping**.

---

To scrape the 1k Cr club data, here's what my process would have been in:

- 2008-2015. Requests + BeautifulSoup in Python. Takes ~1 day.
- 2015-2024. Puppeteer. Still takes ~1 day.
- 2025-Today. AI writes code. ~2 hours/site. **4x faster**.
- Today-???. Coding agents **scrape directly**. ~30 min/site. **16x faster**.

I gave Codex CLI (roughly) this prompt:

> Write scrape.py to scrape the highest-grossing films from Wikipedia's list of Hindi films: 1994 to 2024.\
> Read pages as required. Save results as CSV.

Here's what it did.

- Read the Wikipedia lists, starting with [1994](https://en.wikipedia.org/wiki/List_of_Hindi_films_of_1994)
- Failed on a missing BeautifulSoup dependency. I allowed the install.
- Discovered that tables below "grossing" or "box office" headings are relevant.
- Noticed "Rank" became "No" in the column header from 2016 onwards and adapted.
- Fixed all errors and generated a clean CSV.

That's… **incredible**! The code was a by-product. The prompt and evals matter.

When sites change, agents can fix the code. Or better agents will rewrite it.

I guess I'll call this **vibe-scraping**.

I also asked Claude Code to vibe-code a data story. Here are the links:

- [Visualization](https://sanand0.github.io/datastories/bollywood-top-grossing/)
- [Code](https://github.com/sanand0/datastories/blob/6f6ef4b92adccdb519a8de717ac330268ab4e341/bollywood-top-grossing/)
- [Scraper chat](https://github.com/sanand0/datastories/blob/6f6ef4b92adccdb519a8de717ac330268ab4e341/bollywood-top-grossing/prompts/scraper.md)
- [Dataviz chat](https://github.com/sanand0/datastories/blob/6f6ef4b92adccdb519a8de717ac330268ab4e341/bollywood-top-grossing/prompts/dataviz.md)

---

This afternoon, in front of a client, I spoke with Codex:

> Write a scrape.py that searches Dutch fashion merchant websites and lists what delivery carriers they use.

~10 minutes later, we had a table.

The client spotted one error that I couldn't have. Expert review still matters.

But what's redundant is my 20-year scraping experience!

If agents can scrape on the fly, what **new** questions do we ask?

[LinkedIn](https://www.linkedin.com/posts/sanand0_there-hasnt-been-a-box-office-explosion-activity-7378964378899054593-X4mP)
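
For the curious, here is a minimal sketch of the kind of header adaptation described in the Wikipedia scrape above, where "Rank" became "No" in later years. It is not the scrape.py that Codex generated. The alias table, the function names (`normalise_rows`, `to_csv`) and the two tiny sample tables are assumptions made up for illustration, and it skips fetching and HTML parsing entirely.

```python
import csv
import io

# Hypothetical alias table: header spellings seen across years -> one canonical name.
HEADER_ALIASES = {"rank": "rank", "no": "rank", "no.": "rank", "title": "title", "film": "title"}


def normalise_rows(year, table):
    """Turn one year's table (header row + data rows) into dicts with canonical keys."""
    header, *rows = table
    # Unknown headers fall back to their lowercased, stripped form.
    keys = [HEADER_ALIASES.get(cell.strip().lower(), cell.strip().lower()) for cell in header]
    return [{"year": year, **dict(zip(keys, row))} for row in rows]


def to_csv(records, fields=("year", "rank", "title")):
    """Serialise the normalised records as CSV text, ignoring any extra columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Illustrative input only: two tiny tables using the two header spellings.
tables = {
    2009: [["Rank", "Title"], ["1", "3 Idiots"]],
    2016: [["No", "Title"], ["1", "Dangal"]],
}
records = [rec for year, table in tables.items() for rec in normalise_rows(year, table)]
csv_text = to_csv(records)
print(csv_text)
```

Mapping every header variant to one canonical name keeps the CSV schema stable when a site renames a column, which is the sort of fix an agent can make on its own when the next change comes along.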