Case: SEMA - Swedish Emergency Management Agency
How to verify, test and improve the performance of a web site ahead of an expected traffic peak
The Swedish Emergency Management Agency (SEMA), in Swedish Krisberedskapsmyndigheten (KBM), http://www.krisberedskapsmyndigheten.se/, is a government agency tasked with coordinating crisis preparedness for Sweden's citizens.
During the spring of 2008, SEMA carried out a large-scale communications exercise called SAMÖ2008. The purpose of the exercise was to test the preparedness of financial systems against a possible IT-based attack. The exercise drew approximately 5,000 participants from the Swedish government, other central government bodies, the counties, organizations and companies.
The focus of the exercise was on organized IT attacks that could threaten and bring down Sweden's financial systems, leading to a crisis of confidence. In cases like this, the normal functions of traditional payment systems can be severely disrupted.
The exercise, headed by SEMA, was a simulated crisis scenario played out with the participants; the scenario involved approximately 100 participants.
"Large communications exercises are an important means to strengthen the capacity to handle crises and to drill coordination between the different sectors that work with crisis preparedness in society", said Helena Lindberg, leader of the exercise SAMÖ2008 and Director General of SEMA, in a press release.
Technical challenge
The exercise had a central web site, www.samö2008.se, which delivered continuously updated information to all participants during the exercise. The performance of this site was of the utmost importance, because the thousands of participants periodically all logged in within the same few minutes to check their exercise status.
Among the key questions for SEMA were "Can the site handle 5,000 logged-in users at the same time?" and "How can we best scale our web application and server environments to avoid bottlenecks?"
Working closely with Apica, SEMA technical supervisor Per Soderstrom designed a load test of the production environment to verify the maximum performance of the site. The test was designed to simulate a scenario in which between 1,000 and 10,000 users log on to the site simultaneously and access information from a number of sub-pages. Test results were aggregated for:
- No. of active users
- Data from the server CPU and the web server
- Response times for the scenario as a whole and for individual URLs
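A test of this shape can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the SEMA test: the function `run_load_test`, the `fetch` callable and the example paths are hypothetical names introduced here. The sketch releases a number of simulated users at the same moment, lets each request a list of sub-pages, and aggregates response times per URL.

```python
import statistics
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(fetch, urls, users):
    """Simulate `users` concurrent visitors, each requesting every URL once.

    `fetch` is any callable taking a URL (hypothetical; in a real test it
    would issue an HTTP request). Returns per-URL timing statistics.
    """
    timings = {url: [] for url in urls}
    lock = threading.Lock()
    start_gate = threading.Barrier(users)  # hold users until all are ready

    def one_user():
        start_gate.wait()  # all simulated users start together
        for url in urls:
            started = time.perf_counter()
            fetch(url)
            elapsed = time.perf_counter() - started
            with lock:
                timings[url].append(elapsed)

    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [pool.submit(one_user) for _ in range(users)]
        for future in futures:
            future.result()  # re-raise any error from a simulated user

    return {
        url: {
            "requests": len(samples),
            "mean_s": statistics.mean(samples),
            "max_s": max(samples),
        }
        for url, samples in timings.items()
    }
```

One thread per simulated user is only practical for small numbers. Reaching thousands of simultaneous users, as in the test described here, generally calls for a dedicated load-testing tool that spreads the load over several machines.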
Test results and actions
The first test showed that, without any tuning, the site could handle approximately 4,000 users logged in at the same time. The response times, however, were too long.
This result did not confirm that the site could cope with the expected volume of traffic. One action that had a measurable impact was reducing the size of the landing page, but this was not sufficient.
The solution was to put a separate front-end cache in front of the SAMÖ2008 web site. The ordinary web site could then offload all static traffic to the cache. Larger web sites apply the same principle on a bigger scale with a Content Delivery Network, or CDN.
But why can't the ordinary cache on a typical web server handle this?
The answer depends on the kind of cache implemented in the web application. A separate front-end cache typically delivers much better performance than the cache built into an ordinary web server. Because the front-end cache handles all static content, it also sharply reduces the number of requests per second that reach the web server, which drastically lowers the server's CPU load.
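The offloading effect can be illustrated with a toy model. The sketch below is not Varnish and not the SEMA configuration; the class `FrontEndCache`, the suffix list used to decide what counts as static, and the example paths are all assumptions made for illustration. It only counts how many requests still reach the origin web server once static content is served from the cache.

```python
class FrontEndCache:
    """Toy front-end cache: serves static paths from memory, forwards the rest."""

    def __init__(self, origin, static_suffixes=(".css", ".js", ".png", ".jpg")):
        self.origin = origin                    # callable: path -> response body
        self.static_suffixes = static_suffixes  # assumed definition of "static"
        self.store = {}
        self.origin_requests = 0                # requests that reached the origin

    def get(self, path):
        is_static = path.endswith(self.static_suffixes)
        if is_static and path in self.store:
            return self.store[path]             # cache hit: origin not contacted
        self.origin_requests += 1
        body = self.origin(path)
        if is_static:
            self.store[path] = body
        return body


# Hypothetical origin server that just echoes the requested path.
cache = FrontEndCache(origin=lambda path: f"body of {path}")
for _ in range(1000):          # 1,000 simulated page views
    cache.get("/status")       # dynamic page, always forwarded
    cache.get("/logo.png")     # static, forwarded only on the first request
    cache.get("/site.css")     # static, forwarded only on the first request
print(cache.origin_requests)   # 1002 origin requests instead of 3000
```

In this model the origin still answers every dynamic request, but each static file is fetched from it only once, which is the reduction in requests per second described above.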
However, it is worth noting that even if content is flagged as cached in a web server, the sheer number of hits per second becomes a separate problem once the site reaches high request volumes.
It is impossible to say in general how a web cluster or web server will handle high load. The only way to be certain is to load test the production environment of the system and analyze how the separate components of the site react.
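A rough calculation shows why the hit rate matters on its own. In the sketch below only the figure of 5,000 users comes from this case; the 30 requests per page view, the three-minute window and the function name `requests_per_second` are assumptions chosen purely for illustration.

```python
def requests_per_second(users, requests_per_page_view, window_seconds):
    """Average request rate if every user loads one page within the window."""
    return users * requests_per_page_view / window_seconds


# 5,000 users is the figure from the case. The 30 requests per page view
# and the 3-minute (180 s) window are assumptions made up for this example.
without_cache = requests_per_second(5000, 30, 180)

# With static content offloaded, assume only the HTML page itself
# (1 request per page view) still reaches the web server.
with_cache = requests_per_second(5000, 1, 180)

print(round(without_cache), round(with_cache))  # 833 28
```

Under these assumed numbers the web server would see roughly 830 requests per second without a front-end cache and under 30 with one, regardless of how cheaply each individual static request is served.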
A front-end cache based on Varnish gives much better throughput for static content than an equivalent web server. The simplest explanation is that its design and code structure are optimized specifically for delivering static content such as images, not for generating complex web pages or providing all the other functionality that comes with a modern web server.
At the source-code level, Varnish is optimized to deliver maximum data per instruction. That type of optimization is impossible to achieve on a conventional web server.