When I work with customers, I like to think of myself as House, the Hugh Laurie TV doctor who solves the strange and unique cases. A recent Mission client, Your Call Football, presented one of those challenges: deploying a high-volume mobile application in AWS that had to handle large traffic spikes at known times—so not your typical auto-scaling situation. We turned to Kubernetes (an open-source system that automates the deployment, scaling, and management of containers) to solve the problem. Here's how we did it.
Your Call Football is launching an app that is a cross between video games, fantasy football and real-life football. Each down, the offense’s coach picks three plays, and fans (including those playing at home) vote on which of the three plays will actually get called and run by the players on the field. Fans playing on the YCF app earn points based on how they pick and how well the plays work, and they compete with each other for prizes. It’s a really cool concept, and we did our first live game last week.
If you’re interested in trying it out, the next game is May 17th and you can get the app for free—and the prizes are pretty cool. But enough plugging our client! Back to the tech!
Your Call Football came to us with a very specific challenge regarding their application: they were making a product that would either have a handful of active users or potentially 100,000 active users—and nothing in between. This wasn't a simple case of autoscaling for production, because the app had to work at large scale at specific, scheduled times. The other wrench—and it's a big one—is that the voting would happen simultaneously, in a matter of seconds, for everyone involved. Any lag, delay in service, or timed-out request would defeat the whole purpose and represent a complete failure of the system; all it takes is one angry fan in the stands or at home getting on Twitter to undermine everyone's hard work.
These tight performance requirements meant that even autoscaling could be too slow, so we drew up a checklist of what we had to solve.
Now that we knew our challenge, it was time to get building.
The first step was building out the Infrastructure as Code (IaC). We knew YCF would be a perfect fit for IaC because pieces needed to be spun up quickly. Next was finding the right tools for the job, and we prescribed Kubernetes, Amazon RDS, and ElastiCache.
I've written extensively about the benefits of Amazon RDS—in my view, it's the best database solution out there. As for ElastiCache, we needed a fast in-memory data store that could absorb thousands of simultaneous votes—otherwise all the voting would be for nothing! The final piece of the puzzle was Kubernetes.
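To make the in-memory store's role concrete, here's a hedged sketch of how per-play vote tallying might work against a store like ElastiCache for Redis. The key scheme, class, and method names are my assumptions, not YCF's actual code, and a plain dict stands in for Redis so the logic runs without a server; real Redis would use atomic INCR on the same keys.

```python
from collections import defaultdict

class VoteCounter:
    """Mimics the atomic-increment pattern used for Redis-backed counters."""

    def __init__(self):
        self._counts = defaultdict(int)

    def cast_vote(self, down_id: str, play: int) -> int:
        # With real Redis this would be: redis.incr(f"votes:{down_id}:{play}")
        key = f"votes:{down_id}:{play}"
        self._counts[key] += 1
        return self._counts[key]

    def winning_play(self, down_id: str, plays=(1, 2, 3)) -> int:
        # With Redis: read the three counters (e.g. MGET), then take the max.
        return max(plays, key=lambda p: self._counts[f"votes:{down_id}:{p}"])

counter = VoteCounter()
for play in [1, 1, 3, 2, 1, 3]:
    counter.cast_vote("q1-down4", play)
print(counter.winning_play("q1-down4"))  # play 1 has the most votes
```

The point of the pattern is that each vote is a single atomic increment, so thousands of concurrent voters never contend on a read-modify-write cycle.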
Kubernetes' main benefit is that it scales further than ECS—and it's powered by a large community of users who regularly add new tools. In my experience, it's easier and faster to get a cluster up and running in Kubernetes, and it's more configurable than ECS.
To build this system, we needed to do some extensive instance research. AWS doesn't always make it easy to measure network performance—they categorize it with squishy terms like "Low to Moderate" or "High" without giving real specs—so we started by running tests to determine which instance type performed best.
T2s did nothing for us: we got hit hard by the networking limitations of the class and its throttling. Then we moved to M4s and continued testing. Even though M4s are labeled "Moderate" to "High," only the largest instances in the class offered 10 and 25 Gbps with enhanced networking, and those add-ons made it inefficient to scale. So we moved to R4s, which list "Up to 10 Gigabit" or higher across the whole class, and the r4.large instance type seemed to be the perfect size for the job.
With RDS, you need a terabyte of storage to max out your I/O. With anything less than a terabyte of volume, you run the risk of hitting I/O walls in the EBS I/O credit-and-burst economy when using SSD storage, which we learned the hard way. While we were trying to handle this load with a smaller 20 GB database volume, we hit I/O speed limits. By resizing the volume to 1 terabyte—well larger than we theoretically needed—we were able to leverage the additional baseline I/O without having to use Provisioned IOPS, which is a lot more expensive.
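The math behind that resize decision is worth spelling out. A quick sketch using AWS's published model for gp2 SSD volumes: baseline performance is 3 IOPS per GiB (minimum 100), volumes can burst to 3,000 IOPS, and each volume starts with a 5.4 million I/O credit bucket. (The exact volume sizes below are illustrative, not YCF's actual figures.)

```python
def gp2_baseline_iops(size_gib: int) -> int:
    # gp2 baseline: 3 IOPS per GiB, floored at 100 IOPS.
    return max(100, 3 * size_gib)

def burst_minutes_at_max(size_gib: int) -> float:
    """How long a full 5.4M-credit bucket sustains the 3,000 IOPS burst."""
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= 3000:
        return float("inf")  # baseline already exceeds burst; no cliff
    # Credits drain at (burst - baseline) per second while bursting flat out.
    return 5_400_000 / (3000 - baseline) / 60

print(gp2_baseline_iops(20))            # 100 IOPS baseline on a 20 GiB volume
print(round(burst_minutes_at_max(20)))  # ~31 minutes of burst, then throttling
print(gp2_baseline_iops(1024))          # 3072 IOPS sustained at 1 TiB
```

A 20 GiB volume gets roughly half an hour of peak I/O before falling off a cliff to 100 IOPS; at a full terabyte the baseline alone exceeds the burst ceiling, so there is no cliff to fall off.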
How deep is too deep for IaC? I think I discovered it with this project.
We’re doing everything through Terraform for Infrastructure as Code since it’s a key component of this design. We need the ability to build additional servers to spec in minutes and then tear them all down so we’re not burning money on servers that are idle between games. Terraform is a great tool for that.
Then we realized we needed to leverage Elastic Beanstalk for a late-addition microservice. That's when I discovered that Elastic Beanstalk deploys and manages its environment via CloudFormation. So we were using Terraform to deploy Elastic Beanstalk, which in turn deploys using CloudFormation, and that made simple configuration changes challenging.
It’s IAC all the way down.
Amazon's new Network Load Balancer (NLB) has been marketed as a solution for handling huge influxes of traffic. We gave it a shot but found that it couldn't handle the huge spikes quickly or smoothly enough. So we built a system with pre-warmed ELBs instead. This meant we could prep the load balancers before the massive load spikes for a smoother experience.
Service limits were another area of constant iteration as we tested a variety of options before finding the instance type that worked best. YCF's performance requirements definitely pushed some AWS services pretty hard, and we tested some of the theoretical limits of the AWS platform. Anytime you need to get into the theoretical limits of a service like ElastiCache, you know as a Solutions Architect that things are going to get interesting!
I’ve heralded the benefits and value of AWS Business Support—it’s one of the most essential tools for anyone running in AWS. They deserve another shout-out on this one, as they came through multiple times during this project. We knew our infrastructure would work when the time came because of the confidence AWS support supplied.
Kubernetes is like surfing: you just have to grab on and ride the waves as they come. It's self-healing, and I'm a big fan of that. Honestly, it's such a cool tool that there are still moments when it feels like magic to me. The graphs above are some examples of the Grafana performance dashboards, and digging through them is a little like finding a needle in a haystack because there is so much information available.
It's a cool tool, for sure, but you can't just kick back and let it do all your work for you. We used kops (Kubernetes Operations) to create the Terraform templates we needed to build the clusters, and we can now manage Kubernetes through Terraform going forward. This is a good solution because it lets us use variables for the values we continually change, such as cluster node sizes and counts, and automate rolling out those changes.
Another important lesson we learned with Kubernetes is that there is a limit on pods per node, and that limit is application dependent. It's roughly around 20, but Kubernetes also runs a handful of its own system pods, which vary depending on your build and add-on services and eat into that allotment.
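Where does that ceiling come from? When pods get their IPs from the AWS VPC CNI plugin (as in a typical kops cluster on AWS), the limit is set by each instance type's ENI and per-ENI IP allowances: max pods = ENIs × (IPv4 addresses per ENI − 1) + 2. A small sketch with AWS's published ENI limits for the instance types mentioned above; treat the exact figures as illustrative of the formula rather than a statement about YCF's cluster.

```python
# (ENIs, IPv4 addresses per ENI) from AWS's per-instance-type ENI limits.
ENI_LIMITS = {
    "t2.medium": (3, 6),
    "m4.large": (2, 10),
    "r4.large": (3, 10),
}

def max_pods(instance_type: str) -> int:
    """Pod ceiling under the AWS VPC CNI: one IP per ENI is reserved for
    the node itself, plus 2 for host-networked pods."""
    enis, ips_per_eni = ENI_LIMITS[instance_type]
    return enis * (ips_per_eni - 1) + 2

print(max_pods("m4.large"))  # 20 -- right around the limit we kept hitting
print(max_pods("r4.large"))  # 29, before system pods take their share
```

Subtract the kube-system pods (DNS, networking, monitoring agents, and whatever add-ons you run) from that number to get what's actually left for your application.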
We are starting to deploy with Fargate and EKS and are seeing some great wins: less management overhead along with faster and cheaper deployments. However, at the time of this writing, EKS (Elastic Kubernetes Service) is still in preview and not generally available. We are definitely looking forward to this new service, as it will save us even more of the time and energy we devoted to creating and managing the clusters.
Senior Solution Architect