AWS Cost Optimization Strategies
Learn more about the best practices, processes, and tools you need to implement a strong cost management practice on AWS.
Moving your IT infrastructure to AWS is very attractive because you can scale resources as needed and don’t have to worry about the nuts and bolts of underlying management operations. In theory, you pay only for what you use, and thus save money. However, unless you are careful and diligent, it’s easy to lose control of your cloud spend through over-provisioning, forgetting to turn off unwanted resources, and so on. This brings us to our topic for today: cost optimization and the gotchas to watch out for.
Here at Mission, we believe that even before we get into cost optimization, we need to start at a more fundamental level: security and fault tolerance. Optimizing your architecture is pointless if you don’t have proper security controls in place. You can use native tools in AWS to set up monitors and notifications for resources being spun up in unused regions. You have to make sure your account is secure to safeguard against a couple of scenarios: first, so that nobody can get in without authorization and start spinning up resources, and second, so that somebody who actually should have access to your account isn’t spinning up resources that they shouldn’t be. I’m sure you’ve all heard stories like this: Bitcoin mining on an employer’s dime, for example. There are ways to monitor that kind of usage and put up guardrails, following Center for Internet Security (CIS) best practices.
Fault tolerance is similar to security. If your site or app is down, it doesn’t matter how much money you are saving on infrastructure. In order to make this a real cost optimization opportunity, you need to design for failure. This includes using multiple availability zones to prevent your site from going down, making AMIs for all of your instances, and auto-scaling.
With your infrastructure secure and fault tolerant, you can move on to the actual cost-optimization activities.
Start with analyzing your infrastructure’s performance metrics to make sure you are using the right instance sizes (Right-sizing) and families (Right-typing) for the application you are running. For example, you don’t want to be running a compute-intensive application on a memory-optimized instance. Knowing the exact instance type and size, including generation, to run your application can be a big cost saving factor.
When you analyze your instances, also consider their memory and storage usage. Out of the box, AWS does not provide memory metrics for EC2 instances. The only way to get those is by installing an agent, such as the CloudWatch agent, which provides statistics on memory usage, disk usage, and so on.
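As a rough illustration of how CPU and memory metrics together can drive a right-sizing or right-typing decision, here is a minimal sketch. The function name and the 40% and 80% thresholds are assumptions chosen for the example, not AWS recommendations; tune them to your own workloads.

```python
def rightsizing_hint(cpu_peak_pct: float, mem_peak_pct: float,
                     low: float = 40.0, high: float = 80.0) -> str:
    """Suggest an action from peak CPU and memory utilization (percent).

    The thresholds are illustrative assumptions, not official guidance.
    """
    if cpu_peak_pct < low and mem_peak_pct < low:
        # Both resources are mostly idle: a smaller size may be enough.
        return "downsize"
    if cpu_peak_pct > high and mem_peak_pct < low:
        # CPU-bound with spare memory: a compute-optimized family may fit better.
        return "consider compute-optimized"
    if mem_peak_pct > high and cpu_peak_pct < low:
        # Memory-bound with spare CPU: a memory-optimized family may fit better.
        return "consider memory-optimized"
    return "keep"


print(rightsizing_hint(cpu_peak_pct=15.0, mem_peak_pct=22.0))
print(rightsizing_hint(cpu_peak_pct=92.0, mem_peak_pct=30.0))
```

Note that the memory figure is only available once an agent is reporting it, which is why installing one comes first.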
When AWS releases a new generation of instances, it usually has faster processors, better performance, and reduced cost. So it’s kind of a double win if you move up to those new generations at the same size, because you get better performance at lower cost. But, that’s not always the case. Since Windows does per-core licensing, there have been times when AWS has updated their instance generations, and the new generation has more cores, which has increased the cost of upgrading to that generation. So be on the lookout for that, but in general, moving to newer generations is going to save you money.
Instance scheduling, or the process of turning off instances overnight and whenever they are not being used, can actually help you save more money than making a reservation. Let’s break this down further: if you’re able to turn off your dev instances for 12 hours a night on weeknights (60 hours) and for the whole weekend (48 hours), then you’ll be turning the instances off for about 108 hours out of 168 in a week, which is roughly a 65% savings. That’s clearly going to save you more money than using reservations on those instances and just getting the 30-32% off.
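The arithmetic behind that figure is easy to check and to adapt to your own schedule. A minimal sketch, with function names invented for the example:

```python
HOURS_PER_WEEK = 7 * 24  # 168


def weekly_off_hours(night_hours: int = 12, weeknights: int = 5,
                     weekend_days: int = 2) -> int:
    """Hours per week an instance is stopped: overnight on weeknights, plus whole weekend days."""
    return night_hours * weeknights + 24 * weekend_days


def scheduling_savings(off_hours: int) -> float:
    """Fraction of weekly instance-hours that are no longer billed."""
    return off_hours / HOURS_PER_WEEK


off = weekly_off_hours()
print(off)                                      # 108
print(round(scheduling_savings(off) * 100, 1))  # 64.3
```

Keep in mind this covers compute hours only; storage attached to a stopped instance is still billed.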
There are third-party tools to do this, and you can use Lambda functions. I suggest automating it, so somebody doesn’t forget to turn it off when they leave, or leave it on over the weekend unnecessarily. The third-party tools you can use will have scheduling as well, and with some you can set the default setting to “off.” This can be very useful, because when a user does need to work on the instance, they have to snooze the “Off” state for say 8 or 10 hours that they intend to work. At the end of the day, the user doesn’t have to worry about remembering to turn it off - the instance will automatically go to the ‘Off’ state.
Once you’ve done your right-sizing and right-typing, and determined which instances can be scheduled, you are ready to make reservations. Let’s discuss one of the common issues we see with our customers: they have costs running away, they bought their Reserved Instances (RIs) but they aren’t seeing the cost savings they thought they would.
What’s the reason for this? Well, a lot of times a customer buys the RIs and then forgets about them. If there are any changes in their account, they’re not monitoring utilization of those RIs to know whether they are being used completely. For example, if you change an instance to a larger size, the reservation may still apply proportionally to the new size, but the instance is not going to be completely covered. If you change it to a smaller size, some of that reservation may go unused. Sometimes you may change to a different instance type altogether and forget that you have those RIs, and they become another expense that you’re not getting anything out of. So make sure that you’re monitoring utilization of your RIs.
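To see how a size change affects a reservation, here is a small sketch using the normalization factors AWS publishes for instance sizes (small = 1, medium = 2, large = 4, and so on). It assumes the reservation is size-flexible, which in practice means a regional RI in the same instance family on Linux with default tenancy; the function name and return shape are made up for the example.

```python
# AWS normalization factors for instance sizes (a subset).
NORMALIZATION_FACTOR = {
    "nano": 0.25, "micro": 0.5, "small": 1, "medium": 2,
    "large": 4, "xlarge": 8, "2xlarge": 16, "4xlarge": 32,
}


def ri_coverage(reserved_size: str, running_size: str,
                reserved_count: int = 1, running_count: int = 1):
    """Return (coverage, utilization) as fractions between 0 and 1.

    coverage    = share of running capacity billed at the reserved rate
    utilization = share of the reservation actually being used
    Assumes a size-flexible RI in the same instance family.
    """
    reserved_units = NORMALIZATION_FACTOR[reserved_size] * reserved_count
    running_units = NORMALIZATION_FACTOR[running_size] * running_count
    applied = min(reserved_units, running_units)
    return applied / running_units, applied / reserved_units


# Reserved one "large", then resized the instance up to "xlarge":
print(ri_coverage("large", "xlarge"))  # (0.5, 1.0)
# Reserved one "large", then resized the instance down to "medium":
print(ri_coverage("large", "medium"))  # (1.0, 0.5)
```

In the first case half of the bigger instance runs at on-demand rates; in the second, half of the reservation is paid for but unused. Either way the bill is higher than planned, which is why utilization needs watching.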
Something to note: AWS does not provide notifications about expiring RIs, and there’s no mechanism to automatically renew them. So I suggest setting a reminder in your calendar as soon as you buy the RIs, and giving yourself 30 days’ notice to plan any upgrades to the instances. That way, by the time your reservations expire, you have the work planned to move to the new instance size or type, and you can purchase new reservations quickly so you’re not running on-demand for long.
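If you keep a simple inventory of your reservations, the 30-day check is easy to automate alongside the calendar reminder. A minimal sketch follows; the record layout (`id` and `end` fields) and the sample IDs are hypothetical.

```python
from datetime import date, timedelta


def expiring_soon(reservations, today: date, notice_days: int = 30):
    """Return IDs of reservations that end within the next `notice_days` days."""
    cutoff = today + timedelta(days=notice_days)
    return [r["id"] for r in reservations if today <= r["end"] <= cutoff]


# Hypothetical inventory of purchased RIs.
reservations = [
    {"id": "ri-web", "end": date(2024, 7, 10)},
    {"id": "ri-db", "end": date(2024, 12, 1)},
    {"id": "ri-old", "end": date(2024, 5, 1)},  # already expired
]

print(expiring_soon(reservations, today=date(2024, 6, 15)))  # ['ri-web']
```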
Finally: don’t forget about other reservation options. A lot of our customers don’t realize that you can buy reserved usage for ElastiCache, DynamoDB, and Redshift as well, and there are considerable savings to be had there, so consider those options.
One of the biggest benefits of the cloud is that you can spin things up very quickly, say, for dev purposes or troubleshooting. The risk however is that, over time, these instances and the volumes attached to them are forgotten. AWS does not automatically delete them either, and they thus become “zombie” resources. This can really get away from people, and the cost can be considerable if you’re not paying attention, so be on the lookout for unused EC2 or RDS instances from test and dev environments.
There are third-party tools, as well as automated backup plans, that usually have a way to weed out the old instances and terminate them. However, manual snapshots do not get deleted automatically, so make sure you go in, find those old manually created snapshots, and delete them. Remember: when you terminate an instance, its EBS volumes may still be there, and Elastic IPs stick around if you don’t release them, for which AWS will charge you a small fee.
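Finding stale manual snapshots is mostly a filtering exercise once you have an inventory. A minimal sketch, assuming a list of records with hypothetical `id`, `created`, and `automated` fields, and an arbitrary 90-day cutoff:

```python
from datetime import datetime, timedelta


def stale_manual_snapshots(snapshots, now: datetime, max_age_days: int = 90):
    """Return IDs of manually created snapshots older than `max_age_days`.

    Automated snapshots are skipped because the backup tool that
    created them is expected to expire them on its own.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [s["id"] for s in snapshots
            if not s["automated"] and s["created"] < cutoff]


# Hypothetical inventory.
snapshots = [
    {"id": "snap-1", "created": datetime(2023, 1, 5), "automated": False},
    {"id": "snap-2", "created": datetime(2023, 1, 5), "automated": True},
    {"id": "snap-3", "created": datetime(2024, 5, 20), "automated": False},
]

print(stale_manual_snapshots(snapshots, now=datetime(2024, 6, 1)))  # ['snap-1']
```

Review the resulting list before deleting anything; some old snapshots are kept on purpose.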
AWS allows you to tag your infrastructure by product, owner, team, environment, business unit and so on. You can apply several tags per instance, and then use tools like AWS Cost Explorer and cloud management platforms like CloudCheckr and CloudHealth to slice and dice your data finely and see where your costs are coming from. The more tags you have, the more visibility you have into your environment.
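The kind of slicing these tools do can be illustrated in a few lines. This sketch groups cost line items by a tag key; the record layout and resource IDs are invented for the example and are not the format of any AWS report.

```python
from collections import defaultdict


def cost_by_tag(line_items, tag_key: str) -> dict:
    """Sum cost per value of `tag_key`; resources missing the tag land in '(untagged)'."""
    totals = defaultdict(float)
    for item in line_items:
        value = item.get("tags", {}).get(tag_key, "(untagged)")
        totals[value] += item["cost"]
    return dict(totals)


# Hypothetical monthly line items.
items = [
    {"resource": "i-aaa", "cost": 120.0, "tags": {"team": "web", "environment": "prod"}},
    {"resource": "i-bbb", "cost": 45.5, "tags": {"team": "data", "environment": "dev"}},
    {"resource": "vol-ccc", "cost": 10.0, "tags": {}},
    {"resource": "i-ddd", "cost": 30.0, "tags": {"team": "web", "environment": "dev"}},
]

print(cost_by_tag(items, "team"))
# {'web': 150.0, 'data': 45.5, '(untagged)': 10.0}
```

The "(untagged)" bucket is worth watching: spend that lands there is spend nobody has claimed.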
Humans are prone to make errors. By automating your processes in the cloud, you reduce waste, reduce human errors and you will be able to move much faster in the cloud.
If you use multiple availability zones for fault tolerance, you may have to pay extra for data transferred from one zone to another, including any Network Address Translation (NAT) charges on that traffic. Incorporate these inter-zone costs into your budgets.
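For budgeting purposes, the cross-zone transfer charge is simple arithmetic. The sketch below assumes a rate of $0.01 per GB billed in each direction, which is an assumption for illustration; check the current AWS pricing page for your region, and add NAT charges separately if your traffic goes through a NAT gateway.

```python
def cross_az_transfer_cost(gb_per_month: float,
                           rate_per_gb_each_direction: float = 0.01) -> float:
    """Estimated monthly cost of traffic between availability zones.

    The default rate is an assumed figure; traffic is billed on both
    the sending and the receiving side, hence the factor of 2.
    """
    return gb_per_month * rate_per_gb_each_direction * 2


# Example: 1,000 GB per month replicated between two zones.
print(f"${cross_az_transfer_cost(1000):.2f}")
```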
If you are not already doing so, think about how to architect the next generation of your app or site to leverage features like serverless, containerization and so on. That will prepare you to realize the further cost savings these new technologies enable. As always, reach out to us here at Mission to learn more.