You don’t have to be a pre-cog to find and deal with infrastructure and
application problems; you just need good monitoring. We had quite a day
Monday during the EC2 EBS availability incident. Thanks to some early
alerts - which started coming in about 2.5 hours before AWS started reporting
problems - our ops team was able to intervene and make sure that our
customers’ data was safe and sound. I’ll start with screenshots of what
we saw and experienced, then get into what metrics to watch and alert on in
your environment, as well as how to do so in TraceView.
10:30 AM EST: Increased disk latency, data pipeline backup
Around 10 am, we started to notice that writes weren’t moving through our
pipeline as smoothly as before. Sure enough, pretty soon we started seeing
alerts about elevated DB load and disk latency. Here’s what it looked like:
Figure 1: At 10 AM, we s... (more)
In part 1 of this article, we covered writing web app load tests using
multi-mechanize. This post picks up where the other left off and will
discuss how to gather interesting and actionable performance data from a
load test, using (of course) TraceView as an example.
The big problem we had after writing load tests was that timing data gathered
by multi-mechanize is inherently external to the application. This means it
can tell us the response times of requests when the app is under load but
doesn't identify bottlenecks or configuration problems. So we need to be
gathering a bi... (more)
Many types of performance problems can result from the load created by
concurrent users of web applications, and all too often these scalability
bottlenecks go undetected until the application has been deployed in
production. Load-testing, the generation of simulated user requests, is a
great way to catch these types of issues before they get out of hand. I
recently presented on load testing with Canonical's Corey Goldberg at
the Boston Python Meetup and thought the topic deserved blog
discussion as well.
In this two-part series, I'll walk through generating lo... (more)
A few weeks back, webserver request queueing came under heightened scrutiny
as Rap Genius blasted Heroku for not using as much autotune as promised in
their “intelligent load balancing”. If you somehow missed
the write-up (or response), check it out for its great simulations of load
balancing strategies on Heroku.
What if you’re not running on Heroku? Well, the same wisdom still applies
– know your application’s load balancing and concurrency and measure its
performance. Let’s explore how request queueing affects applications in the
non-PaaS world and what you can do about it.
Our fundamental unit of performance data is the trace, an incredibly rich
view into the performance of an individual request moving through your web
application. Given all this data and the diversity of the contents of any
individual trace, it’s important to have an interface for understanding
what exactly was going on when a request was served. How did it get handled?
What parts were slow, and what parts were anomalous?
Over the past year, the TraceView team has been listening to your thoughts on
this topic as well as hatching some of our own. Today we get to share the
fruit of... (more)