Case Study: Handling the MN GOP Convention

This article is a technical case study of how Voter-Science's services handled a high-traffic event: the Minnesota GOP State Convention on May 30th. The MNGOP was clear that it was absolutely critical for the site to keep up with the surge in traffic and stay fully responsive throughout the event. The event was successful: surge periods exceeded 1,000 requests/second to our servers, and the servers averaged response times under 100 ms.

Here are the engineering steps we took to give the MNGOP that guarantee…

Some background

These state conventions are large events with thousands of users voting on a variety of important items, including platform, RNC delegates, and presidential electors.  Conventions also have added challenges around security, privacy, custom rules from bylaws, and seating of delegates and alternates.

Overall, the event was a success, with over 1,500 active participants and over 17,000 total ballots counted. The MNGOP staff did an excellent job with training and logistics, and with helping thousands of people adjust to online technology.

For scoping purposes here, we'll ignore the actual voting features of the site and focus on the critical problem: how do we ensure the site stays responsive under peak load?

Summary of solution

Our solution was built entirely on Microsoft Azure and Azure AppService (formerly Azure Websites). A single AppService is actually a group of machines and can scale up to 30 instances. The websites were stateless, which allowed them to be dynamically scaled up (larger machines) and out (more instances). All persistent state (such as configuration, voter permissions, and results) lived in storage accounts co-located in the same data center and accessible only from the front-end sites. The sites used highly tactical caches to avoid unnecessary network calls. We used Azure Blob and Table Storage and avoided SQL, because SQL is a common failure point, especially under high load.

We had a front-end load balancer that directed each user to one of the sites at random, and each site could be scaled out to 30 individual machines. With 3 sites, that gave us capacity for up to 90 individual machines if needed.


Here are some specific steps we took:

Fundamental Design Decisions:

[1] Design for scale out – A common problem is storing critical state in a server's memory. That's the easiest way to write a server, but it means you can't scale out by adding more servers. We designed the individual sites to be stateless (all state was stored in Azure Storage, not the server's memory) so that they could be rapidly scaled out. This meant we could start with 1 machine running the election and, within seconds, scale out to dozens of machines. Azure Websites would automatically load-balance across the pool.
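As a rough illustration of the difference (our production code was not Python, and the table and field names here are made up), a scale-out-friendly design keeps all state in shared storage instead of the web server's memory:

    # Anti-pattern: state kept in this process's memory. A second instance has its
    # own copy, so the site cannot safely be scaled out.
    # current_stage = {"election1": "registration"}

    # Scale-out pattern: every instance reads and writes the same external store.
    from azure.data.tables import TableClient, UpdateMode

    config = TableClient.from_connection_string(
        conn_str="<storage connection string>", table_name="ElectionConfig")

    def get_stage(election_id: str) -> str:
        # Any front-end instance sees the same value, so instances are interchangeable.
        return config.get_entity(partition_key=election_id, row_key="stage")["Value"]

    def set_stage(election_id: str, stage: str) -> None:
        config.upsert_entity(
            {"PartitionKey": election_id, "RowKey": "stage", "Value": stage},
            mode=UpdateMode.REPLACE)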

[2] Avoid SQL – SQL is incredibly powerful, but it can also be extremely unreliable and is one of the most common points of failure in web applications. It is common for a SQL database to work fine under light traffic but start timing out ("hanging") when more users try to access it. That can directly lead to the website becoming non-responsive and the entire system grinding to a halt. Sometimes you can mitigate this by buying a larger SQL instance, but that quickly becomes prohibitively expensive and still has very limited scaling. We avoided SQL entirely and ran on Azure Storage instead.

We had previously run a website for the Washington State Republican Party (a caucus locator that looked up names in a 5-million-row voter database and told each voter their caucus location). It was originally built on SQL and quickly started timing out as usage ramped up. We identified the problem and switched it over to Azure Tables, which easily handled the load.

[3] Review the SLA for all dependencies – Services fail because their dependencies fail (see SQL above). We minimized the dependencies (just Azure Websites and Azure Storage) and then reviewed the Service Level Agreement (SLA) for each one. For example, Azure Storage's SLA supports up to 20k table operations per second and guarantees that single table operations complete within 2 seconds. From that, we can compute what our system is capable of. For example, given our topology above and our access frequency, the storage SLA meant that we could "only" support about 500,000 active users at once – which was plenty of capacity for the event.
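The back-of-the-envelope math is simple. In the sketch below, the per-user rate is an assumed figure chosen to reproduce the 500,000 number, not our actual measurement:

    # Capacity ceiling implied by the storage SLA.
    SLA_TABLE_OPS_PER_SEC = 20_000      # scalability target cited above
    TABLE_OPS_PER_USER_PER_SEC = 0.04   # assumed: one storage round-trip per user every 25 seconds

    max_active_users = SLA_TABLE_OPS_PER_SEC / TABLE_OPS_PER_USER_PER_SEC
    print(f"Storage-bound ceiling: about {max_active_users:,.0f} active users")  # ~500,000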

[4] Review all storage access for contention and reader/writer – For every table and blob access, be aware of:

  • Who is reading it?
  • Who is writing it? What's the contention policy?
  • What's the consistency story? Read and understand the CAP Theorem.

When a round of voting opened, we'd get 1,000 votes within the first 10 seconds. Each vote was a table write to a unique row keyed by user and ballot – which meant votes had no contention, so we could easily handle a surge of voting via scale out. (In contrast, writing all votes to a single blob would be high contention.) If somebody attempted to double vote (and circumvented our client-side checks, for example by replaying a network request), the worst case is that they'd overwrite their own vote.
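A minimal sketch of that write pattern (shown here with the current Python azure-data-tables SDK purely for illustration; the table and field names are assumptions):

    from azure.data.tables import TableClient, UpdateMode

    votes = TableClient.from_connection_string(
        conn_str="<storage connection string>", table_name="Votes")

    def record_vote(ballot_id: str, user_id: str, choice: str) -> None:
        # One row per (ballot, user): concurrent voters never touch the same row,
        # so a surge of votes creates no write contention.
        # An upsert means a replayed request can only overwrite the voter's own row.
        votes.upsert_entity(
            {"PartitionKey": ballot_id, "RowKey": user_id, "Choice": choice},
            mode=UpdateMode.REPLACE)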

Engineering and Deployment Decisions

[5] We provisioned dedicated hardware in the data center closest to the users (in this case, the Central US region, located within 100 miles of Minnesota). While Voter-Science normally runs multi-tenant services, we chose dedicated hardware so that a freak spike in our other services would not impact the Minnesota convention.

We measured latency and found about a 40% improvement from switching to the closer data center.

[6] A backup deployment was provisioned at the next closest data center, in Illinois (North Central US). If needed, we would have switched the DNS entries to redirect traffic to the backup.

[7] We had the ability to rapidly scale up to 90 machines if needed. Specifically, based on the MNGOP's previous experience, there were concerns that we might be hit by a denial-of-service attack. We planned for enough capacity to absorb such an attack and keep the convention running uninterrupted. We could also single out the source of an attack and redirect legitimate users onto new hardware.

[8] Identified all network calls on the "hot path" – The "hot path" is the set of high-frequency code paths that voters exercise when using the product; management and low-frequency operations were designated the "cold path". Hot-path items might occur thousands of times a minute. Any hot-path network call was heavily scrutinized, removed if possible, and carefully tested for failure. The code was refactored to emphasize hot-path calls so that developers could not accidentally break a hot path.
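One common way to keep network calls off the hot path is a small time-bounded cache in front of slowly changing data. The sketch below is illustrative only (single-process, not thread-synchronized), not our production code:

    import time

    class TacticalCache:
        """A tiny time-bounded cache so hot-path requests skip the storage round-trip."""
        def __init__(self, fetch, ttl_seconds=5.0):
            self._fetch = fetch          # the slow network call, e.g. a table read
            self._ttl = ttl_seconds
            self._value = None
            self._expires = 0.0

        def get(self):
            now = time.monotonic()
            if now >= self._expires:     # refresh at most once per TTL window
                self._value = self._fetch()
                self._expires = now + self._ttl
            return self._value

    # Usage: thousands of hot-path requests per minute share one cached read.
    # ballot_config = TacticalCache(lambda: config.get_entity("election1", "ballotConfig"))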

Operations

[9] Attack yourselves first: We had multiple training sessions with 300 active users. However, we also ran live trials in a "stress mode", where each user's webpage spun up a background thread to hammer our servers and act like the load of 10-20 users. We essentially launched a distributed denial-of-service attack against ourselves to test the system and its capacity, and we did it during a real training session with hundreds of real users so we could be confident it was working. These trials let us simulate a total load of around 6,000 active users on a small server pool.
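For illustration, a load generator along those lines looks like this in Python (the real "stress mode" ran in each attendee's browser; the URL and rates below are placeholders):

    import threading, time
    import urllib.request

    def hammer(url: str, requests_per_sec: float, duration_sec: float) -> None:
        # Repeatedly hit one endpoint at a fixed rate, ignoring failures
        # (a real test would count and report them).
        deadline = time.monotonic() + duration_sec
        while time.monotonic() < deadline:
            try:
                urllib.request.urlopen(url, timeout=5).read()
            except Exception:
                pass
            time.sleep(1.0 / requests_per_sec)

    # Each worker imitates the traffic of roughly 10-20 extra users.
    workers = [threading.Thread(target=hammer, args=("https://example.org/status", 2.0, 60.0))
               for _ in range(20)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()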

[10] Measure your capacity: You can review SLAs to estimate how much load your system can handle, but you must measure it under real load to be sure. We took real measurements during training sessions to build a table of "# of users" versus "# of machine instances". We monitored requests per minute versus number of users to get a clear feel for the throughput of the nodes. From these tests, we were confident we could handle a load of at least 100,000 active users.

[11] We also closely monitored web traffic. We had the right telemetry in place to proactively know how the system was doing. For example, we had alerts for when traffic (requests/sec) scaled faster than the number of users, since that would suggest a possible attack.
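The check itself can be very simple; in the sketch below, the per-user rate and slack factor are assumed values, not our real thresholds:

    def looks_like_attack(requests_per_sec: float, active_users: int,
                          expected_rps_per_user: float = 0.5,
                          slack: float = 3.0) -> bool:
        # Flag traffic that grows much faster than the known user count.
        if active_users == 0:
            return requests_per_sec > 0
        return requests_per_sec > slack * expected_rps_per_user * active_users

    # Example: 1,000 req/s against only 200 known users trips the alert.
    assert looks_like_attack(1000.0, 200)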

[12] We also set up decoy servers to sniff out possible attacks. These were machines we were not actually using for the convention, but they were easy for an attacker to discover (and hence attack). Any traffic on these sites served as an early warning of a possible attack.
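A decoy can be as simple as a catch-all endpoint that alerts on any request at all. The sketch below assumes Flask and plain logging; it is an illustration, not what we actually deployed:

    from flask import Flask, request
    import logging

    app = Flask(__name__)
    logging.basicConfig(level=logging.WARNING)

    @app.route("/", defaults={"path": ""})
    @app.route("/<path:path>")
    def decoy(path):
        # The decoy serves no real traffic, so any hit is an early-warning signal.
        logging.warning("Decoy hit: %s /%s from %s",
                        request.method, path, request.remote_addr)
        return "OK", 200

    if __name__ == "__main__":
        app.run(port=8080)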

Contingency plans?

We made a "worst case scenarios" handbook and listed contingency plans for each. Some items included:

[13] What if we were hit by a denial-of-service attack? If Azure's existing DoS detection didn't protect us, we would have flipped DNS entries to spread the users to new machines. We timed this and could do a full migration of 5,000 users within 15 minutes.

[14] What if the Azure data center went down? We had a backup data center in another region that we would switch to.

[15] Power outage for administrator machines? During the convention, several privileged users managed the election. This included moving between the various election stages, pulling reports, updating credentials, etc. If these administrators lost power or network connections, we still needed to be able to operate the election. To handle this, we ensured that the key management operations could be performed from a smartphone over a cellular network. We also had contingencies for backup power, cellular hotspots, and multiple administrators at different locations.

As luck would have it, while it was sunny in Minnesota, we actually got thunderstorms in Seattle and our administrators lost power for part of the convention!  

Conclusion

There were many practical steps taken before and during the convention to ensure Voter-Science’s technology would work reliably.
