Building Reliable Websites

Load and Performance Edition

Stephen Kuenzli

@author skuenzli

breaking systems for fun and profit since 2000

The Process

  1. determine expected site load
  2. validate site handles expected load
  3. stay operational when load exceeds expectations
  4. profit!

determine expected site load

Key Metrics

  • Throughput: requests per second
  • Performance: response time

What percentage of your customers do you care about?

Define a Service Level Agreement

  • Throughput: 42 requests per second
  • Performance: 99% of response times <= 100ms
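An SLA like the one above can be checked mechanically. A minimal sketch, assuming a file with one response time in milliseconds per line; the function name sla_met and its argument order are made up for illustration:

```shell
# sla_met LIMIT_MS PERCENT FILE   (hypothetical helper, not from the talk)
# Succeeds (exit 0) when at least PERCENT% of the samples in FILE
# are less than or equal to LIMIT_MS.
sla_met() {
  awk -v limit="$1" -v pct="$2" '
    $1 + 0 <= limit + 0 { ok++ }                      # count samples within the limit
    END { exit !(NR > 0 && ok * 100 >= pct * NR) }    # an empty file never passes
  ' "$3"
}

# usage (file name is an assumption):
#   sla_met 100 99 response_times_ms.txt && echo "SLA met" || echo "SLA missed"
```

The check compares counts rather than computing a percentile, so it needs no sorting and no interpolation rule.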

Don't Forget!

  • network latency and bandwidth
  • client processing power

HOWTO: measure historical throughput

# total number of GETs to /myservice for a given day
grep -c 'GET /myservice' logs/app*/access.log.2012-11-16

# estimate peak hour for service from sample
grep 'GET /myservice' logs/app??5/access.log.2012-11-16 | \
  perl -nle 'print m|/201\d:(\d\d):|' | sort -n | uniq -c

# total number of GETs to /myservice at peak hour
grep -c '2012-11-16 17:.*GET /myservice' \

HOWTO: measure response time

# processing times recorded by server in access log
grep "GET /myservice" logs/app*/access.log.2012-11-16 | \
  cut -d\" -f7 | sort -n > service.access_times.2012-11-16
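The times file produced above can be summarized as percentiles. A minimal sketch using the nearest-rank method, assuming one number per line and a whole-number percent; the function name percentile is an illustration, not part of the original deck:

```shell
# percentile P FILE   (hypothetical helper)
# Prints the nearest-rank P-th percentile of FILE (one number per line).
# P is assumed to be a whole number between 1 and 100.
percentile() {
  sort -n "$2" | awk -v p="$1" '
    { v[NR] = $1 }                          # values arrive sorted ascending
    END {
      if (NR == 0) exit 1                   # no samples, no percentile
      rank = int((p * NR + 99) / 100)       # ceil(p * NR / 100) for whole-number p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# usage:
#   for p in 50 95 99; do
#     echo "p$p: $(percentile "$p" service.access_times.2012-11-16)"
#   done
```

Nearest-rank always returns a value that was actually observed, which keeps the result easy to trace back to a log line.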

what about network latency and bandwidth?

does request fit in the client's resource budget? 50/95/99%?

all models are wrong; some models are useful

model +/- 20%

  1. count
  2. compute
  3. judge

adjust for

  1. growth
  2. seasonal loading
  3. margin of error
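The count, compute, adjust steps can be sketched as simple arithmetic: divide the peak-hour request count by 3600 and scale by each adjustment. The function name expected_rps and all factor values below are hypothetical, chosen only to illustrate the calculation:

```shell
# expected_rps PEAK_HOUR_REQUESTS GROWTH SEASONAL MARGIN   (hypothetical helper)
# Converts a peak-hour request count into a target requests-per-second
# figure, scaled by multiplicative adjustment factors (1.0 = no adjustment).
expected_rps() {
  awk -v n="$1" -v g="$2" -v s="$3" -v m="$4" \
    'BEGIN { printf "%.1f\n", n / 3600 * g * s * m }'
}

# usage with made-up numbers: 100000 requests in the peak hour,
# 25% growth, no seasonal uplift, 20% margin of error
#   expected_rps 100000 1.25 1.0 1.2    # prints 41.7
```

Treating each adjustment as a multiplier keeps the model easy to judge by eye, in keeping with the +/- 20% goal.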

validate site handles expected load

validation process

  1. select a tool
  2. build a simulation
  3. run simulation multiple times / periodically
  4. analyze trends

Gatling is an Open Source Stress Tool with:

  • A DSL to describe scenarios
  • High performance
  • HTTP support
  • Meaningful reports
  • Executable from the command line or Maven
  • A scenario recorder

build simulation

run simulation multiple times / periodically

  • gather statistically significant results
  • establish a baseline
  • verify site does [not] meet SLAs
  • detect changes over time

analyze trends

detect changes in trend with control charts

is process changing?
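One simplified way to answer that question: compute limits from baseline runs and flag any new result that falls outside them. The sketch below uses mean +/- 3 sample standard deviations, which is a simplification of a full control chart; the function names and the one-value-per-line input format are assumptions:

```shell
# control_limits FILE   (hypothetical helper)
# FILE holds one value per baseline run, e.g. the 99th-percentile time.
# Prints "lower mean upper" using mean +/- 3 sample standard deviations.
control_limits() {
  awk '
    { sum += $1; sumsq += $1 * $1 }
    END {
      if (NR < 2) exit 1                              # need at least two runs
      mean = sum / NR
      var = (sumsq - NR * mean * mean) / (NR - 1)     # sample variance
      if (var < 0) var = 0                            # guard against rounding
      sd = sqrt(var)
      printf "%.2f %.2f %.2f\n", mean - 3 * sd, mean, mean + 3 * sd
    }' "$1"
}

# out_of_control VALUE FILE   (hypothetical helper)
# Succeeds when VALUE lies outside the limits computed from FILE.
out_of_control() {
  control_limits "$2" | awk -v x="$1" '
    { found = 1; bad = (x + 0 < $1 || x + 0 > $3) }
    END { exit !(found && bad) }'
}
```

A single point outside the limits is only one possible signal; a fuller control chart also watches for runs and trends inside the limits.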


stay operational when load exceeds expectations

the needs of the many outweigh the needs of the few or the one

a dead site is no good to anyone

know the site's limits and stay within them

implement a series of circuit breakers that can be tripped to reduce load in a managed way

  • manual breakers, tripped by operations staff
  • automatic breakers, tripped by software
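Both kinds of breaker can share one mechanism. A minimal sketch, assuming a breaker is represented by a flag file: operations staff trip or reset it by hand, and software trips it when a measured value exceeds a limit. The directory, function names, and breaker names are all hypothetical:

```shell
# Directory holding one flag file per tripped breaker (hypothetical location).
BREAKER_DIR="${BREAKER_DIR:-/tmp/breakers}"

# manual controls, intended for operations staff
trip_breaker()  { mkdir -p "$BREAKER_DIR" && : > "$BREAKER_DIR/$1"; }
reset_breaker() { rm -f "$BREAKER_DIR/$1"; }

# succeeds when breaker NAME is tripped, i.e. the feature should be turned off
breaker_open()  { [ -e "$BREAKER_DIR/$1" ]; }

# auto_trip NAME CURRENT LIMIT
# automatic control: trips NAME when the whole number CURRENT exceeds LIMIT
auto_trip() {
  if [ "$2" -gt "$3" ]; then trip_breaker "$1"; fi
}

# usage:
#   auto_trip upload-images "$current_rps" 42
#   if breaker_open upload-images; then echo "uploads temporarily disabled"; fi
```

In this sketch a tripped breaker stays tripped until someone resets it, so the site does not flap between states when load hovers near the limit.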

Example services ranked by criticality

  1. send marketing email (off-line / off-hours)
  2. update customer's dashboard (breaker #1)
  3. upload images (breaker #2)
  4. render images (breaker #3)
  5. sign-in
  6. checkout
  7. save customer's work

there's usually a trade-off available


  • This presentation:
  • Concurrency Limiting Filter:
  • Web Operations:
  • Circuit Breaker Pattern:
  • Gatling:
  • USL
