---
title: Faster data crunching
date: "2011-09-23T18:20:10Z"
categories:
  - data
wp_id: 2684
description: I compared data processing speeds using Python, cut, and awk across different hardware. I discovered that choosing the right combination of tools and server environments can improve performance by over 250x compared to standard local scripting.
keywords: [data processing, awk, python, unix command line, benchmarking, amazon ec2, performance optimization]
---
I’ve been playing with big data lately.
The good part is, it’s easy to get interesting results. The data is so unwieldy that even average-value calculations provoke an “Amazing! I didn’t know that” response. (No exaggeration. I heard this from two separate ~$1bn businesses this month.)
The bad part is that calculating even that simple average is slow.
For example, take this 40MB file (380MB unzipped) and extract the first column.
The simplest Python script to get the first column looks like this:
```python
import csv
import fileinput

# Print the first tab-separated column of every non-empty row.
for row in csv.reader(fileinput.input(), delimiter='\t'):
    if len(row) > 0:
        print(row[0])
```

That took a good 3 minutes to execute on my laptop.
Since I’m used to UNIX data processing, I tried `cut -f1`. Weirdly, that’s worse: over 5 minutes. Paradoxically, `awk '{print $1}'` takes only 17 seconds. That’s about 11 times faster than the Python script. Clearly the tool makes a big difference. And we always knew UNIX was fast.
But I also ran these on an Amazon EC2 server and a Hostgator server. Here are the results.
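For anyone who wants to repeat the comparison, here is a minimal sketch of a timing harness. It is not the method used for the numbers in this post; the helper name `time_command` and the `COMMANDS` table are made up for illustration. One caveat worth knowing: `cut -f1` splits on tabs, while `awk '{print $1}'` splits on any whitespace, so the two only agree when the first column contains no spaces (`awk -F'\t'` would force tab splitting).

```python
import subprocess
import time

# Hypothetical table of the shell commands being compared.
COMMANDS = {
    "cut": "cut -f1",
    "awk": "awk '{print $1}'",
}


def time_command(command, input_path):
    """Run a shell command with input_path on stdin and return elapsed seconds.

    Output is discarded so that terminal or pipe speed does not distort
    the measurement.
    """
    with open(input_path, "rb") as source:
        start = time.perf_counter()
        subprocess.run(
            command,
            shell=True,
            stdin=source,
            stdout=subprocess.DEVNULL,
            check=True,  # raise if the command exits with an error
        )
        return time.perf_counter() - start


def benchmark(input_path):
    """Time every command in COMMANDS against the same input file."""
    return {name: time_command(cmd, input_path) for name, cmd in COMMANDS.items()}
```

Calling `benchmark("data.tsv")` (with whatever file you are testing in place of the placeholder `data.tsv`) returns a dictionary of seconds per tool. A single run is noisy, especially on the first pass before the file is in the disk cache, so it is worth repeating each measurement a few times.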
| | python | cut | awk |
|---|---|---|---|
| My Dell E5400 | 3:04 (1x) | 5:42 (0.5x) | 0:17 (11x) |
| EC2 standard | 0:33 (6x) | 0:05.6 (33x) | 0:16 (11x) |
| Hostgator | 0:19 (10x) | 0:02.5 (74x) | 0:00.7 (265x) |
What took 3 minutes with Python on my Dell E5400 took less than a second on Hostgator’s server with awk. That’s over 250 times faster. (Not 250%. 250 times.)
And it’s not just hardware. A good tool (awk) made things 11x faster on my machine. Good hardware (Hostgator) made the same Python program 10x faster. But choosing the right combination can make things go faster than 11 x 10 = 110 times. Much faster.
There are a few things I’m taking away from this.
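The arithmetic behind that claim can be checked with a short sketch. It uses only the timings from the table above; the helper names `to_seconds` and `speedup` are illustrative. Because the table's timings are rounded, the combined figure comes out at about 263x rather than exactly the 265x quoted, presumably because the original measurement had more precision.

```python
def to_seconds(clock):
    """Convert a 'minutes:seconds' string such as '3:04' or '0:00.7' to seconds."""
    minutes, seconds = clock.split(":")
    return int(minutes) * 60 + float(seconds)


# Timings taken from the table above (minutes:seconds).
TIMINGS = {
    ("Dell E5400", "python"): "3:04",
    ("Dell E5400", "awk"): "0:17",
    ("Hostgator", "python"): "0:19",
    ("Hostgator", "awk"): "0:00.7",
}


def speedup(machine, tool, baseline=("Dell E5400", "python")):
    """How many times faster a run is than Python on the laptop."""
    return to_seconds(TIMINGS[baseline]) / to_seconds(TIMINGS[(machine, tool)])


tool_gain = speedup("Dell E5400", "awk")        # better tool, same laptop
hardware_gain = speedup("Hostgator", "python")  # same script, better server
combined_gain = speedup("Hostgator", "awk")     # both changes together

print("tool: %.1fx, hardware: %.1fx" % (tool_gain, hardware_gain))
print("product: %.0fx, measured: %.0fx" % (tool_gain * hardware_gain, combined_gain))
```

The measured combined speedup is roughly two and a half times larger than the product of the two individual gains, which is the point: the tool and the hardware do not contribute independently.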