---
name: io-bound-data-processing
description: Processing, transforming, or moving datasets that may exceed RAM on a single low-compute box — covers memory discipline (streaming, generators, dtype shrinkage), I/O access patterns (sequential vs random, mmap, async), data formats (Parquet vs CSV vs JSON, predicate pushdown), chunking & batching, spill-to-disk (external merge sort, DuckDB/Polars), pipelining (bounded queues, backpressure, checkpointing), codec selection (zstd/lz4/gzip), concurrency for I/O-bound workloads (asyncio, threads, prefetch), and observability (iowait vs CPU%, rows/sec, py-spy/strace). Trigger on "process a large file", "stream this", "out-of-core", "OOM kill", "this is slow", or code with `pd.read_csv` of multi-GB files, `requests.get(...).content` on big bodies, `BytesIO` on unbounded inputs, per-row INSERTs, sequential `requests.get` loops, falling `tqdm` rates — even if I/O or memory isn't mentioned. Complement to computer-science-algorithms.
---
# I/O-Bound Data Processing on Constrained Resources: Community Best Practices

A reference for engineers processing datasets larger than RAM on a single low-compute box. Organized by execution-lifecycle impact: rules near the top of the table govern *whether the job runs at all*; rules near the bottom shave the last 10 %. Optimize from the top of the waterfall.

**Scope:** the patterns that show up in real ETL / data-engineering / batch work on a laptop, a 2-vCPU container, or a Raspberry Pi-class node — streaming, formats, chunking, spill, backpressure, codecs, and the concurrency model that actually matches an I/O-bound bottleneck. Out of scope (covered elsewhere): the algorithmic primitives themselves (see [`computer-science-algorithms`](../computer-science-algorithms/)), distributed compute beyond a single box (use Spark/Dask), and database-engine internals (see official docs).

Distilled from [Apache Arrow / Parquet docs](https://arrow.apache.org/docs/), [Polars User Guide](https://docs.pola.rs/), [DuckDB docs](https://duckdb.org/docs/), [pandas — Scaling to large datasets](https://pandas.pydata.org/docs/user_guide/scale.html), Linux man pages ([mmap(2)](https://man7.org/linux/man-pages/man2/mmap.2.html), [sendfile(2)](https://man7.org/linux/man-pages/man2/sendfile.2.html), [posix_fadvise(2)](https://man7.org/linux/man-pages/man2/posix_fadvise.2.html)), Brendan Gregg's [USE method](https://www.brendangregg.com/usemethod.html) and *Systems Performance*, Kleppmann's [*Designing Data-Intensive Applications*](https://dataintensive.net/), and the [zstd](https://github.com/facebook/zstd) / [lz4](https://github.com/lz4/lz4) reference benchmarks.

## When to Apply

Reach for these rules when:

- A job OOM-kills, swaps, or runs much slower than expected on a small box
- Input is *larger* than RAM and you need to scan, filter, aggregate, sort, or join it
- A pipeline has unbounded buffers between stages, or memory grows linearly during a "streaming" job
- You see one-row-per-RTT writes (`INSERT` per row, `requests.get` per URL, `f.read(32)` per record)
- You're picking a format/codec/serializer and the choice matters at scale
- `top` shows low CPU and high iowait, or you can't tell whether the job is CPU-bound or I/O-bound
- "It's slow but I don't know why" — start at the obs- category

## Rule Categories By Priority

| # | Category | Prefix | Impact | Why it cascades |
|---|----------|--------|--------|-----------------|
| 1 | Memory Discipline | `mem-` | CRITICAL | Slurping into RAM defeats every downstream technique on a constrained box |
| 2 | I/O Access Patterns | `io-` | CRITICAL | Disk/net are 10³–10⁶× slower than RAM; access pattern dominates wall-clock |
| 3 | Data Format & Encoding | `fmt-` | HIGH | Format fixes the lower bound on I/O volume + decode cost before any logic runs |
| 4 | Chunking & Batching | `batch-` | HIGH | Granularity controls peak memory and amortizes per-item overhead |
| 5 | Spill-to-Disk & External Memory | `spill-` | HIGH | When data > RAM, the choice is "spill cleanly" or "OOM" |
| 6 | Pipelining & Backpressure | `pipe-` | MEDIUM-HIGH | Unbounded buffers between fast producers and slow sinks = OOM |
| 7 | Compression & Serialization | `codec-` | MEDIUM | Trades CPU for I/O; right codec saves orders of magnitude |
| 8 | Concurrency for I/O-Bound Workloads | `conc-` | MEDIUM | Async / threads / processes are different tools; wrong model wastes CPU |
| 9 | Observability & Throughput Tuning | `obs-` | LOW-MEDIUM | Can't tune what you don't measure; iowait ≠ CPU-bound |

## Quick Reference

### 1. Memory Discipline (CRITICAL)

- [`mem-stream-dont-slurp`](references/mem-stream-dont-slurp.md) — Iterate sources chunk-by-chunk; peak RAM = chunk size, not file size
- [`mem-prefer-generators-over-lists-for-pipelines`](references/mem-prefer-generators-over-lists-for-pipelines.md) — Generators flow; lists materialize
- [`mem-shrink-dtypes-before-loading`](references/mem-shrink-dtypes-before-loading.md) — Narrow ints, categoricals; 2-8× memory reduction at load time
- [`mem-use-views-not-copies`](references/mem-use-views-not-copies.md) — Slicing without copying; NumPy/Arrow zero-copy semantics
- [`mem-bound-the-working-set`](references/mem-bound-the-working-set.md) — Chunk rows = budget ÷ (row size × amplification), not a round number
- [`mem-release-references-explicitly`](references/mem-release-references-explicitly.md) — Drop intermediates so peak ≠ N × chunk
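
As a rough illustration of the streaming and working-set rules above, the sketch below sums integers from a line-oriented source through a generator and derives a chunk row count from a memory budget. The helper names (`rows_per_chunk`, `read_values`), the 256 MiB budget, the 200-byte row size, and the 3× amplification factor are assumptions chosen for the example, not values taken from the rule files.

```python
import io
from typing import Iterable, Iterator


def rows_per_chunk(budget_bytes: int, row_bytes: int, amplification: float) -> int:
    """Rows that fit in the budget once intermediate copies are counted.

    `amplification` is an assumed multiplier for temporary copies made
    while a chunk is being processed.
    """
    return max(1, int(budget_bytes // (row_bytes * amplification)))


def read_values(lines: Iterable[str]) -> Iterator[int]:
    """Yield one parsed value at a time; nothing is materialized as a list."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield int(line)


# StringIO stands in for a large file opened with open(path).
source = io.StringIO("1\n2\n\n3\n")
total = sum(read_values(source))  # peak memory is one line, not the whole input

# Hypothetical budget: 256 MiB, ~200 bytes per row, ~3x temporary copies.
chunk_rows = rows_per_chunk(256 * 1024 * 1024, 200, 3.0)
```

The same generator works unchanged on a real file object, because iterating a text file yields one line at a time.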

### 2. I/O Access Patterns (CRITICAL)

- [`io-prefer-sequential-over-random`](references/io-prefer-sequential-over-random.md) — Sort offsets, advise the kernel, let readahead help
- [`io-buffer-explicitly-for-small-records`](references/io-buffer-explicitly-for-small-records.md) — `BufferedReader` cuts syscall count by up to ~1000×
- [`io-stream-http-bodies-with-iter-content`](references/io-stream-http-bodies-with-iter-content.md) — `stream=True` + `iter_content` instead of `.content`
- [`io-mmap-for-random-or-shared-large-files`](references/io-mmap-for-random-or-shared-large-files.md) — Zero-copy + on-demand paging for random access
- [`io-async-for-many-concurrent-streams`](references/io-async-for-many-concurrent-streams.md) — One thread, thousands of awaits
- [`io-batch-and-pipeline-network-roundtrips`](references/io-batch-and-pipeline-network-roundtrips.md) — `COPY`, pipelines, multi-key endpoints, HTTP/2
- [`io-zero-copy-when-moving-bytes-as-is`](references/io-zero-copy-when-moving-bytes-as-is.md) — `sendfile`, `copy_file_range`, `shutil.copyfile`
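
One way to picture the sequential, block-at-a-time access these rules favor is hashing a file without ever holding it whole in memory. This is a minimal sketch using only the standard library; the function name `sha256_streaming`, the block sizes, and the temporary sample file are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def sha256_streaming(path: str, block_size: int = 1024 * 1024) -> str:
    """Hash a file by reading fixed-size blocks front to back."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:  # binary mode returns a buffered reader
        # iter(callable, sentinel) keeps calling f.read until it returns b"".
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    payload = b"abc" * 100_000
    with open(path, "wb") as f:
        f.write(payload)

    streamed = sha256_streaming(path, block_size=4096)
    whole = hashlib.sha256(payload).hexdigest()  # reference value, small input only
```

Peak memory here is bounded by `block_size` regardless of file size, and the reads are strictly sequential, which lets kernel readahead help.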

### 3. Data Format & Encoding (HIGH)

- [`fmt-columnar-for-analytical-scans`](references/fmt-columnar-for-analytical-scans.md) — Parquet/Arrow for filter+project workloads
- [`fmt-line-delimited-for-streaming-row-ingest`](references/fmt-line-delimited-for-streaming-row-ingest.md) — NDJSON over JSON-array for streaming
- [`fmt-push-predicates-into-the-reader`](references/fmt-push-predicates-into-the-reader.md) — Row-group statistics skip whole chunks
- [`fmt-prefer-schema-on-write-when-possible`](references/fmt-prefer-schema-on-write-when-possible.md) — Typed columns beat schema-on-read every time
- [`fmt-avoid-deeply-nested-json-for-hot-paths`](references/fmt-avoid-deeply-nested-json-for-hot-paths.md) — Flat schema, or binary on hot pipes
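
To make the line-delimited rule concrete, here is a small sketch of reading NDJSON one record at a time, so a filter can run before anything accumulates. The helper name `iter_ndjson` and the sample records are hypothetical; a JSON array would instead need the whole document parsed before the first record is available to `json.loads`.

```python
import io
import json
from typing import Any, Dict, Iterator, TextIO


def iter_ndjson(stream: TextIO) -> Iterator[Dict[str, Any]]:
    """Yield one decoded record per non-empty line."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# StringIO stands in for a large .ndjson file or a streamed response body.
ndjson = io.StringIO(
    '{"id": 1, "ok": true}\n'
    '{"id": 2, "ok": false}\n'
    '{"id": 3, "ok": true}\n'
)

# Only the ids that pass the filter are kept in memory.
ok_ids = [rec["id"] for rec in iter_ndjson(ndjson) if rec["ok"]]
```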

### 4. Chunking & Batching (HIGH)

- [`batch-pick-chunk-size-by-memory-budget`](references/batch-pick-chunk-size-by-memory-budget.md) — Compute from budget, not a constant
- [`batch-use-vectorized-apis-not-row-loops`](references/batch-use-vectorized-apis-not-row-loops.md) — NumPy / Polars / Arrow kernels, not `iterrows`
- [`batch-process-with-stable-iterators`](references/batch-process-with-stable-iterators.md) — `iter_batches`, `chunksize=`, `collect(streaming=True)`
- [`batch-coalesce-writes-with-buffered-output`](references/batch-coalesce-writes-with-buffered-output.md) — `COPY` / `executemany`, sized write buffers
- [`batch-keyset-pagination-over-offset`](references/batch-keyset-pagination-over-offset.md) — `WHERE id > $last_id`, never `OFFSET N` on deep cursors
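
The write-coalescing and keyset-pagination rules can be sketched together with the standard-library `sqlite3` module. The table name `events`, the batch size of 250, the page size of 300, and the helpers `batched` and `keyset_pages` are all assumptions for the example; a production database would more likely use its bulk-load path (such as `COPY`) for the insert side.

```python
import sqlite3
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def batched(iterable: Iterable, size: int) -> Iterator[list]:
    """Group any iterable into lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def keyset_pages(conn: sqlite3.Connection, page_size: int) -> Iterator[List[Tuple]]:
    """Page by 'id > last seen id' so each query is an index seek, not a deep scan."""
    last_id = 0
    while True:
        page = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield page
        last_id = page[-1][0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

rows = ((i, f"event-{i}") for i in range(1, 1001))  # generator: never a full list
for batch in batched(rows, 250):
    with conn:  # one transaction (and one commit) per batch
        conn.executemany("INSERT INTO events (id, payload) VALUES (?, ?)", batch)

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
page_sizes = [len(page) for page in keyset_pages(conn, 300)]
```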

### 5. Spill-to-Disk & External Memory (HIGH)

- [`spill-external-merge-sort-when-data-exceeds-ram`](references/spill-external-merge-sort-when-data-exceeds-ram.md) — Out-of-core sort; delegate to `sort` / DuckDB when possible
- [`spill-partition-by-hash-for-out-of-core-groupby-join`](references/spill-partition-by-hash-for-out-of-core-groupby-join.md) — Hash-partition both sides; process per partition
- [`spill-use-temp-files-not-process-memory`](references/spill-use-temp-files-not-process-memory.md) — `SpooledTemporaryFile`, never unbounded `BytesIO`
- [`spill-use-engines-that-spill-automatically`](references/spill-use-engines-that-spill-automatically.md) — DuckDB / Polars / Dask manage spill for you
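
For intuition about what a spilling engine does on your behalf, the following is a deliberately small external merge sort: sort runs that fit in memory, write each run to a temporary file, then stream a k-way merge with `heapq.merge`. It is a teaching sketch under assumed names (`external_sort`, `_read_run`) and an integer-per-line run format, not a replacement for `sort` or DuckDB.

```python
import heapq
import tempfile
from itertools import islice
from typing import Iterable, Iterator, List, TextIO


def _read_run(f: TextIO) -> Iterator[int]:
    """Stream one sorted run back from disk, one value at a time."""
    for line in f:
        yield int(line)


def external_sort(values: Iterable[int], run_size: int) -> Iterator[int]:
    """Yield values in sorted order while holding at most `run_size` in memory."""
    runs: List[TextIO] = []
    it = iter(values)
    try:
        while True:
            run = sorted(islice(it, run_size))  # in-memory sort of one run
            if not run:
                break
            f = tempfile.TemporaryFile("w+")  # removed automatically on close
            runs.append(f)
            f.writelines(f"{v}\n" for v in run)
            f.seek(0)  # rewind so the merge phase can read it
        # k-way merge keeps one pending value per run in memory.
        yield from heapq.merge(*(_read_run(f) for f in runs))
    finally:
        for f in runs:
            f.close()


data = [5, 3, 9, 1, 4, 8, 2, 7, 6, 0]
sorted_out = list(external_sort(data, run_size=3))
```

With `run_size=3` the ten inputs spill into four runs, and the merge phase never holds more than one value per run.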

### 6. Pipelining & Backpressure (MEDIUM-HIGH)

- [`pipe-use-bounded-queues-for-producer-consumer`](references/pipe-use-bounded-queues-for-producer-consumer.md) — Bounded `Queue` is the backpressure mechanism
- [`pipe-apply-backpressure-from-slow-stages`](references/pipe-apply-backpressure-from-slow-stages.md) — Slow sink throttles the fast source
- [`pipe-prefer-pull-iteration-over-push-callbacks`](references/pipe-prefer-pull-iteration-over-push-callbacks.md) — Pull backpressures naturally; push needs policy
- [`pipe-checkpoint-progress-for-resumability`](references/pipe-checkpoint-progress-for-resumability.md) — Atomic checkpoint after each batch; redo bounded
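
The bounded-queue rule can be sketched with two threads and `queue.Queue(maxsize=...)`: when the consumer falls behind, `put` blocks and the producer slows to the consumer's pace. The queue depth of 4, the doubling step, and the sentinel-based shutdown are assumptions chosen for the example.

```python
import queue
import threading
from typing import Iterable, List

SENTINEL = object()  # marks end of stream


def producer(q: queue.Queue, items: Iterable[int]) -> None:
    for item in items:
        q.put(item)  # blocks while the queue is full: this is the backpressure
    q.put(SENTINEL)


def consumer(q: queue.Queue, out: List[int]) -> None:
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        out.append(item * 2)  # stand-in for a slow sink


q: queue.Queue = queue.Queue(maxsize=4)  # at most 4 items in flight
results: List[int] = []

t_prod = threading.Thread(target=producer, args=(q, range(100)))
t_cons = threading.Thread(target=consumer, args=(q, results))
t_prod.start()
t_cons.start()
t_prod.join()
t_cons.join()
```

Replacing `maxsize=4` with an unbounded queue would still produce the same results, but memory would then grow with the gap between producer and consumer speed.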

### 7. Compression & Serialization (MEDIUM)

- [`codec-zstd-or-lz4-as-defaults-not-gzip`](references/codec-zstd-or-lz4-as-defaults-not-gzip.md) — Pick codec by access pattern; gzip is legacy
- [`codec-dictionary-encoding-for-repetitive-strings`](references/codec-dictionary-encoding-for-repetitive-strings.md) — 5-50× on low-cardinality columns
- [`codec-prefer-binary-protocols-over-json-for-rpc`](references/codec-prefer-binary-protocols-over-json-for-rpc.md) — Protobuf / Arrow IPC / MsgPack on hot wires
- [`codec-train-a-zstd-dictionary-for-many-small-payloads`](references/codec-train-a-zstd-dictionary-for-many-small-payloads.md) — `zstd --train` for sub-1 KB messages
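
Dictionary encoding is simple enough to show in plain Python: each distinct string is stored once and every row keeps only a small integer code. This is a toy sketch with hypothetical helper names (`dictionary_encode`, `dictionary_decode`); columnar formats such as Parquet apply the same idea inside the file format, so you would not normally hand-roll it.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Return (distinct values in first-seen order, one integer code per row)."""
    index: Dict[str, int] = {}
    codes: List[int] = []
    for v in values:
        # setdefault assigns the next free code the first time a value appears.
        codes.append(index.setdefault(v, len(index)))
    return list(index), codes


def dictionary_decode(dictionary: Sequence[str], codes: Sequence[int]) -> List[str]:
    """Rebuild the original column from the dictionary and the codes."""
    return [dictionary[c] for c in codes]


column = ["US", "DE", "US", "US", "FR", "DE"]
dictionary, codes = dictionary_encode(column)
round_trip = dictionary_decode(dictionary, codes)
```

The saving grows with repetition: a column with millions of rows and a handful of distinct values stores the strings once plus one small code per row.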

### 8. Concurrency for I/O-Bound Workloads (MEDIUM)

- [`conc-asyncio-for-many-network-streams-not-for-cpu`](references/conc-asyncio-for-many-network-streams-not-for-cpu.md) — Asyncio multiplexes waits; useless for compute
- [`conc-thread-pools-for-blocking-io-libraries`](references/conc-thread-pools-for-blocking-io-libraries.md) — GIL releases during blocking I/O; threads work
- [`conc-overlap-compute-with-prefetch`](references/conc-overlap-compute-with-prefetch.md) — One batch ahead via background thread / coroutine
- [`conc-tune-parallelism-to-the-bottleneck`](references/conc-tune-parallelism-to-the-bottleneck.md) — Match worker count to the binding resource
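
A minimal sketch of the thread-pool rule, assuming a blocking call that spends its time waiting rather than computing. `fake_fetch` and its 50 ms `time.sleep` are stand-ins for a real blocking library call, and `max_workers=8` is an arbitrary starting point that would need tuning against the actual bottleneck.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_fetch(i: int) -> int:
    """Hypothetical blocking call; sleep releases the GIL like real I/O waits do."""
    time.sleep(0.05)
    return i * i


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # map preserves input order even though calls overlap in time.
    results: List[int] = list(pool.map(fake_fetch, range(16)))
elapsed = time.perf_counter() - start

# Run one after another, 16 calls would wait about 16 * 0.05 s in total.
sequential_estimate = 16 * 0.05
```

Comparing `elapsed` with `sequential_estimate` on your own machine is a quick check that the workload really is wait-bound; if the two are close, more threads will not help.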

### 9. Observability & Throughput Tuning (LOW-MEDIUM)

- [`obs-measure-iowait-not-just-cpu`](references/obs-measure-iowait-not-just-cpu.md) — `iostat -x`, `vmstat`, psutil — find the real bottleneck
- [`obs-instrument-throughput-rows-per-second`](references/obs-instrument-throughput-rows-per-second.md) — Rates normalize across runs; catch regressions
- [`obs-profile-with-py-spy-or-strace-for-syscall-storms`](references/obs-profile-with-py-spy-or-strace-for-syscall-storms.md) — Attach, don't guess
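
Instrumenting rows per second needs very little code. The sketch below is one possible shape, with a hypothetical `ThroughputMeter` class whose clock is injectable so the demo is deterministic; a real job would use the default `time.perf_counter` and log the rate periodically.

```python
import time
from typing import Callable


class ThroughputMeter:
    """Counts rows and reports rows per second since construction."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self._start = clock()
        self.rows = 0

    def add(self, n: int) -> None:
        self.rows += n

    def rows_per_second(self) -> float:
        elapsed = self._clock() - self._start
        return self.rows / elapsed if elapsed > 0 else 0.0


# Deterministic demo: a fake clock reporting 0 s at start and 2 s at read-out.
ticks = iter([0.0, 2.0])
meter = ThroughputMeter(clock=lambda: next(ticks))
for batch in ([1] * 400, [1] * 600):  # two batches of 400 and 600 rows
    meter.add(len(batch))
rate = meter.rows_per_second()
```

A rate that falls steadily over a run is often the first visible sign of growing memory pressure or a degrading access pattern, which is the cue to reach for the profiling rules.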

## How to Use

Start with the question that matches the problem:

- **"It OOMs / RAM grows linearly during a streaming job"** → `mem-` and `pipe-` (likely an unbounded buffer or per-iteration accumulation)
- **"Disk is 100 % busy but CPU is idle"** → `io-` (likely random access, unbuffered I/O, or wrong format)
- **"Why is reading this 10 GB file so slow?"** → start with `fmt-columnar-for-analytical-scans` and `io-buffer-explicitly-for-small-records`
- **"How big should the chunk be?"** → [`batch-pick-chunk-size-by-memory-budget`](references/batch-pick-chunk-size-by-memory-budget.md)
- **"Data doesn't fit in RAM"** → `spill-` (and prefer an engine that does it for you: [`spill-use-engines-that-spill-automatically`](references/spill-use-engines-that-spill-automatically.md))
- **"Lots of small network calls"** → [`io-batch-and-pipeline-network-roundtrips`](references/io-batch-and-pipeline-network-roundtrips.md), [`io-async-for-many-concurrent-streams`](references/io-async-for-many-concurrent-streams.md)
- **"Adding workers didn't help"** → [`conc-tune-parallelism-to-the-bottleneck`](references/conc-tune-parallelism-to-the-bottleneck.md), [`obs-measure-iowait-not-just-cpu`](references/obs-measure-iowait-not-just-cpu.md)
- **"It's slow but I don't know why"** → `obs-` first; profile before changing anything

Code examples are in Python (most readable across audiences). The reasoning generalizes — equivalent libraries in other ecosystems (Arrow C++/Rust/Java, Polars Rust, DuckDB everywhere, libuv-style async) follow the same patterns.

## Reference Files

| File | Description |
|------|-------------|
| [references/_sections.md](references/_sections.md) | Category definitions and ordering |
| [assets/templates/_template.md](assets/templates/_template.md) | Template for adding new rules |
| [metadata.json](metadata.json) | Version and reference information |
| [AGENTS.md](AGENTS.md) | Auto-built TOC navigation |

## Related Skills

- [`computer-science-algorithms`](../computer-science-algorithms/) — Algorithmic primitives this skill builds on (external merge sort, sketches, hash partitioning, sampling)