Building Reliable Websites


Load and Performance Edition


Stephen Kuenzli

@author skuenzli


breaking systems for fun and profit since 2000



https://github.com/skuenzli

The Process

  1. determine expected site load
  2. validate site handles expected load
  3. stay operational when load exceeds expectations
  4. profit!


determine expected site load

Key Metrics

  • Throughput: requests per second
  • Performance: response time

What percentage of your customers do you care about?


50%

95%

99%


?

Define a Service Level Agreement


  • Throughput: 42 requests per second
  • Performance: 99% of response times <= 100ms

Don't Forget!


  • network latency and bandwidth
  • client processing power

HOWTO: measure historical throughput


# total number of GETs to /myservice for a given day
# (-h suppresses filenames so wc -l counts across all app servers)
grep -h 'GET /myservice' logs/app*/access.log.2012-11-16 | wc -l

# estimate peak hour for service from a sample of app servers
grep 'GET /myservice' logs/app??5/access.log.2012-11-16 | \
  perl -nle 'print m|/201\d:(\d\d):|' | sort -n | uniq -c

# total number of GETs to /myservice at the peak hour
grep -h '2012-11-16 17:.*GET /myservice' \
  logs/app*/access.log.2012-11-16 | wc -l
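For example (illustrative numbers): 151,200 GETs in the peak hour / 3,600 seconds ≈ 42 requests per second of expected peak throughput.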

HOWTO: measure response time


# processing times recorded by the server in the access log
# (assumes the time is the 7th quote-delimited field of each line)
grep 'GET /myservice' logs/app*/access.log.2012-11-16 | \
  cut -d\" -f7 | sort -n > service.access_times.2012-11-16
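With the times sorted on disk, checking the performance half of the SLA takes a few lines. A minimal sketch in Scala, assuming the file from the command above holds one processing time in milliseconds per line, using the nearest-rank percentile method:

import scala.io.Source

object CheckSla {
  def main(args: Array[String]): Unit = {
    // one response time (ms) per line, already sorted by `sort -n`
    val times = Source.fromFile("service.access_times.2012-11-16")
      .getLines().map(_.trim).filter(_.nonEmpty).map(_.toLong).toVector

    // nearest-rank method: p99 is the value at index ceil(0.99 * n) - 1
    val p99 = times(math.ceil(times.size * 0.99).toInt - 1)
    println(s"p99 = ${p99}ms; 100ms SLA ${if (p99 <= 100) "met" else "MISSED"}")
  }
}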

what about network latency and bandwidth?


does the request fit within the client's resource budget at the 50th/95th/99th percentile?
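For example (illustrative numbers): a 1 MB response over a 2 Mbit/s mobile link costs roughly 4 seconds of transfer time alone, so a server that meets a 100ms processing SLA can still feel slow to the 95th-percentile client.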

all models are wrong; some models are useful


model to within +/- 20%

  1. count
  2. compute
  3. judge

adjust for

  1. growth
  2. seasonal loading
  3. margin of error
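A worked example (illustrative numbers): a measured peak of 25 requests/second × 1.25 for a year of growth × 1.15 for seasonal peaks × 1.2 margin of error ≈ 43 requests/second to design for.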




validate site handles expected load

validation process

  1. select a tool
  2. build a simulation
  3. run simulation multiple times / periodically
  4. analyze trends

Gatling is an Open Source Stress Tool with:

  • A DSL to describe scenarios
  • High performance
  • HTTP support
  • Meaningful reports
  • Executable from the command line or Maven
  • A scenario recorder


gatling-tool.org

build simulation
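The simulation below is a minimal sketch in the Gatling Scala DSL (Gatling 3 syntax; the host, endpoint, injection profile, and thresholds are assumptions chosen to match the example SLA of 42 requests/second with 99% of responses <= 100ms):

import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

// run from the Gatling bundle: bin/gatling.sh -s MyServiceSimulation
class MyServiceSimulation extends Simulation {

  val httpProtocol = http.baseUrl("http://myservice.example.com")

  // each virtual user issues one GET to the service under test
  val scn = scenario("myservice-peak-load")
    .exec(
      http("get_myservice")
        .get("/myservice")
        .check(status.is(200)))

  setUp(
    // ramp up to the expected peak rate, then hold it
    scn.inject(
      rampUsersPerSec(1).to(42).during(2.minutes),
      constantUsersPerSec(42).during(10.minutes))
  ).protocols(httpProtocol)
   .assertions(
     // encode the SLA so the run fails loudly when it is missed
     global.responseTime.percentile(99).lte(100),
     global.failedRequests.percent.lte(1))
}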

run simulation multiple times / periodically

  • gather statistically significant results
  • establish a baseline
  • verify whether the site meets its SLAs
  • detect changes over time

analyze trends

detect changes in trend with control charts

is the process changing?
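One way to answer "is the process changing?": derive control limits from a baseline of past runs and flag any new run that falls outside them. A minimal sketch in Scala (the three-sigma limits and all numbers are illustrative assumptions):

object ControlChart {
  def main(args: Array[String]): Unit = {
    val baseline = Vector(92.0, 95.0, 90.0, 97.0, 93.0) // p99 in ms, per past run
    val newRuns  = Vector(94.0, 141.0)                   // p99 of recent runs

    // control limits: mean of the baseline +/- 3 standard deviations
    val mean  = baseline.sum / baseline.size
    val sigma = math.sqrt(baseline.map(t => math.pow(t - mean, 2)).sum / baseline.size)
    val (lcl, ucl) = (mean - 3 * sigma, mean + 3 * sigma)

    newRuns.foreach { t =>
      val flag = if (t < lcl || t > ucl) "OUT OF CONTROL" else "ok"
      println(f"p99 = $t%.1f ms (limits $lcl%.1f..$ucl%.1f) [$flag]")
    }
  }
}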



stay operational when load exceeds expectations



the needs of the many outweigh the needs of the few or the one


a dead site is no good to anyone


know the site's limits and stay within them

implement a series of circuit breakers that can be tripped to reduce load in a managed way


  • manual breakers, tripped by operations staff
  • automatic breakers, tripped by software (see the sketch below)
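The Circuit Breaker link under Resources is Akka's implementation, which fits the automatic case. A minimal sketch wrapping a hypothetical dashboard-update call (the names and thresholds are assumptions):

import akka.actor.ActorSystem
import akka.pattern.CircuitBreaker
import scala.concurrent.Future
import scala.concurrent.duration._

object DashboardBreaker {
  val system = ActorSystem("site")
  import system.dispatcher // ExecutionContext for the breaker and Futures

  // open after 5 consecutive failures or calls slower than 100ms;
  // allow a single trial call after 30s to see if the service recovered
  val breaker = new CircuitBreaker(
      system.scheduler,
      maxFailures  = 5,
      callTimeout  = 100.millis,
      resetTimeout = 30.seconds)
    .onOpen(println("dashboard breaker tripped; shedding load"))

  def updateDashboard(): Future[String] =
    breaker.withCircuitBreaker(Future {
      // ... call the (hypothetical) dashboard service here ...
      "ok"
    })
}

When the breaker is open, calls fail fast instead of queuing behind a struggling dependency, which is exactly the managed load-shedding described above.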

Example services ranked by criticality

  1. send marketing email (off-line / off-hours)
  2. update customer's dashboard (breaker #1)
  3. upload images (breaker #2)
  4. render images (breaker #3)
  5. sign-in
  6. checkout
  7. save customer's work


there's usually a trade-off available

Resources

  • This presentation: https://github.com/skuenzli/building-reliable-websites
  • Concurrency Limiting Filter: https://github.com/skuenzli/simplyreliable
  • Web Operations: http://shop.oreilly.com/product/0636920000136.do
  • Circuit Breaker Pattern: http://doc.akka.io/docs/akka/2.1.0-RC1/common/circuitbreaker.html
  • Gatling: http://gatling-tool.org
  • USL: Neil Gunther's Universal Scalability Law

fin

https://github.com/skuenzli