AWS Well-Architected Framework: The Operational Excellence Pillar
This article is part of an ongoing series taking a deep look at Amazon’s Well-Architected Framework, which is an incredible collection of best practices for cloud-native organizations. This month, we’re digging into the Operational Excellence Pillar.
When most people hear the term “operational excellence,” they think of policies and procedures, runbooks, and best practices. The reality is that operational excellence is a consequence of culture. Amazon’s Operational Excellence Pillar Whitepaper certainly provides best practices and guidelines, but their summary of the pillar itself truly cuts to the heart of the issue:
The operational excellence pillar includes the ability to run and monitor systems to deliver business value and to continuously improve supporting processes and procedures.
Organizations that have the ability to be operationally excellent will have a culture of continuous improvement driven by a desire to create business value. What does that look like, in practice?
- Engineers working in tandem with product/service owners on roadmap prioritization, weighing features/enhancements and technical improvements on the merit of their ultimate value to the business.
- Highly visible, regularly scheduled “Game Days,” which highlight opportunities to evolve internal procedures.
- Commitments from product and marketing leadership to adopt a cadence of rapid, incremental evolution, rather than massive timed release cycles.
- Publicized post-mortems, both internal and external.
- Deep passion for automation, permeating all facets of the organization, with a strong distaste for manual operations.
Evolving company culture is intimidating, but leveraging the best practices and design principles of the Operational Excellence Pillar can help drive that change.
If you’re interested in learning about the birth of the AWS Well-Architected Framework, then check out our initial post, “Introducing the AWS Well-Architected Program”.
Amazon outlines six design principles for operational excellence in the cloud:
- Perform operations as code
- Annotate documentation
- Make frequent, small, reversible changes
- Refine operations procedures frequently
- Anticipate failure
- Learn from all operational failures
Perform Operations as Code
Over the last four decades, the practice of software engineering has evolved to become highly disciplined, with rigorous methods of testing, evolving, and iterating to minimize risk. Changes to software can be managed in version control, run through automated testing, and delivered with confidence.
Similarly, the advent of the cloud has enabled that rigor to extend to the entire application environment. With cloud infrastructure, everything can be defined, managed, and implemented in software. For operations, this has been a revolutionary change, enabling operations procedures to be codified, scripted, and automated, reducing human error, and ensuring consistency.
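As a minimal sketch of what a codified procedure can look like, the Python below models a runbook as an ordered list of steps and returns an audit log of what ran. Everything here is hypothetical and chosen for illustration (`Step`, `run_runbook`, the disk-cleanup scenario, and the 80% threshold); it is not a Mission or AWS tool.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One runbook step: a name plus the function that performs it."""
    name: str
    action: Callable[[dict], None]


def run_runbook(steps: List[Step], state: dict) -> List[str]:
    """Run steps in order, stop at the first failure, and return an audit log."""
    log = []
    for step in steps:
        try:
            step.action(state)
            log.append(f"OK: {step.name}")
        except Exception as exc:  # record the failure instead of crashing
            log.append(f"FAILED: {step.name}: {exc}")
            break
    return log


# Hypothetical procedure: rotate logs on a host whose disk is filling up.
def check_disk(state: dict) -> None:
    if state["disk_used_pct"] < 80:
        raise RuntimeError("disk usage is below the 80% action threshold")


def rotate_logs(state: dict) -> None:
    state["disk_used_pct"] -= 30  # stand-in for the real cleanup work


RUNBOOK = [Step("check disk usage", check_disk), Step("rotate logs", rotate_logs)]

if __name__ == "__main__":
    for line in run_runbook(RUNBOOK, {"disk_used_pct": 91}):
        print(line)
```

Because the procedure is ordinary code, it can be reviewed, versioned, and tested like any other change.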
Annotate Documentation
For years, Mission managed its infrastructure in data centers under our own control. Documentation was a tedious process, and while some of it could be generated by scripts, much more had to be created by hand. Worse, difficult change management processes had to be implemented to ensure that the documentation stayed in sync with the environment. Even with our best efforts, our documentation tended to drift from reality. When customer issues arose, it was difficult for us to rely on our own documentation, which greatly impacted our ability to troubleshoot.
With cloud, documentation can be created automatically as an artifact of the build process, and can be used by both humans and systems. By automating the creation of detailed, annotated documentation, you can have confidence that your documentation always reflects reality.
At Mission, we manage hundreds of large cloud workloads for our managed services customers, and our automated documentation gives our engineers the ability to quickly react, improving our efficiency and ability to deliver for our customers.
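One way to picture documentation as a build artifact is a small generator that renders an inventory from the same definition used to build the environment. The sketch below assumes a made-up definition format (a dict of logical names with `type` and `description` keys); `render_docs` and `RESOURCES` are illustrative names, not part of any AWS service.

```python
def render_docs(resources: dict) -> str:
    """Render a plain-text inventory from a resource definition.

    `resources` maps a logical name to a dict with "type" and "description"
    keys. This shape is an assumption made for the example.
    """
    lines = ["Environment inventory", ""]
    for name in sorted(resources):
        spec = resources[name]
        lines.append(f"{name} ({spec['type']}): {spec['description']}")
    return "\n".join(lines)


# Hypothetical definition that would normally live in version control.
RESOURCES = {
    "web": {"type": "load balancer", "description": "Public entry point."},
    "db": {"type": "database", "description": "Primary order store."},
}

if __name__ == "__main__":
    print(render_docs(RESOURCES))
```

Running a generator like this in the build pipeline means the document is rebuilt whenever the definition changes, which is what keeps it from drifting.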
Make Frequent, Small, Reversible Changes
Agile methodologies now permeate most organizations, so the concept of iteration is generally accepted. Yet product organizations still tend to bundle changes into large releases that span many iterations, imposing huge amounts of change not only on their customers, but also on dependent systems and components.
Operational excellence requires workloads to be designed for an environment that is constantly in flux, evolving, and improving. That requires a diligent approach to systems design, with small, highly-focused components that are resilient to failure, composed into a holistic system. This way, changes can be rolled out rapidly, and rolled back easily in the case of failure.
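The roll-forward and roll-back idea can be reduced to a few lines. This is a deliberately simplified sketch: `deploy` and the `is_healthy` callback are hypothetical, and a real pipeline would shift traffic gradually and watch real metrics rather than call a single function.

```python
from typing import Callable


def deploy(current: str, candidate: str, is_healthy: Callable[[str], bool]) -> str:
    """Switch to `candidate`; return to `current` if the health check fails.

    Returns the version that is live when the function finishes.
    """
    live = candidate          # roll forward
    if not is_healthy(live):
        live = current        # roll back to the known good version
    return live


if __name__ == "__main__":
    # Hypothetical health check that rejects version "v2".
    print(deploy("v1", "v2", lambda version: version != "v2"))
```

The smaller the change between `current` and `candidate`, the cheaper both the health check and the rollback become.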
Refine Operations Procedures Frequently
Most companies implementing agile methodologies have “retrospective” meetings on a regular cadence. These meetings are an opportunity for the team to reflect, suggest, and implement improvements to help make them more efficient. Yet, the vast majority of companies have no such equivalent for operations procedures, instead choosing only to reflect on potential improvements in “post-mortems,” if at all.
Operationally excellent companies will schedule regular “Game Days,” where operational procedures can be put to the test and practiced, ensuring their continuing evolution and improvement.
Anticipate Failure
Post-mortem exercises are useful, but are inherently designed to address failures after they’ve already happened. Operationally excellent teams will add the “pre-mortem” exercise to their routine, working to identify potential sources of failure before they wreak havoc in production. Failure scenarios can be tested, validated, and measured for impact during Game Days, enabling teams to improve resiliency.
Netflix and Chaos Engineering
Netflix is a world-renowned cloud-native organization, and their famous Chaos Monkey project is an excellent example of this design principle. Chaos Monkey, and other tools that embrace Chaos Engineering, encourage engineers to build resilient systems, and to root out sources of failure before they are exposed in production.
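To make the idea concrete, here is a toy failure-injection experiment that runs entirely in one process. It is not how Chaos Monkey is implemented; `call_with_failover`, `inject_failure`, and the replica functions are hypothetical stand-ins for a client library and the instances behind it.

```python
import random
from typing import Callable, List, Sequence


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in turn and return the first successful response."""
    for replica in replicas:
        try:
            return replica()
        except ConnectionError:
            continue  # this replica is down, try the next one
    raise RuntimeError(f"all {len(replicas)} replicas failed")


def healthy() -> str:
    return "ok"


def broken() -> str:
    raise ConnectionError("injected failure")


def inject_failure(replicas: Sequence[Callable[[], str]],
                   rng: random.Random) -> List[Callable[[], str]]:
    """Return a copy of `replicas` with one randomly chosen member broken."""
    victim = rng.randrange(len(replicas))
    return [broken if i == victim else r for i, r in enumerate(replicas)]


if __name__ == "__main__":
    rng = random.Random()
    fleet = inject_failure([healthy, healthy, healthy], rng)
    print(call_with_failover(fleet))  # the caller should not notice the outage
```

The experiment passes only if losing any single replica is invisible to the caller, which is the property a Game Day would try to confirm against real infrastructure.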
Learn From All Operational Failures
While post-mortems are common, they are frequently done in isolation, and their outputs are not always shared broadly. An operationally excellent organization will include all stakeholders in post-mortems, ensure that they are “blameless” in nature, and include suggested improvements for all parts of the business, from engineering to product, marketing, and finance.
Armed with these design principles, your organization can evolve its culture to drive operational excellence.
Areas of Operational Excellence
Amazon describes operational excellence in the cloud through three areas:
- Prepare
- Operate
- Evolve
Let’s dig into these areas.
Operational excellence cannot be achieved in a vacuum. It demands a detailed understanding of each workload, what it is trying to achieve, and how it will achieve those goals. Without these insights, it’s impossible to design a system that will surface its status, or to create a procedure that will effectively support that system.
Preparation: Operational Priorities
Successful operations teams are enlightened operations teams. They have a complete understanding of:
- Workloads they’re responsible for
- Shared business goals
- Their role in achieving the goals
- Regulatory or compliance requirements
Only with these insights can teams prioritize their efforts. If a particular workload has stringent compliance requirements, for example, those should be prioritized ahead of monitoring enhancements.
AWS provides a number of resources to help teams set their operational priorities:
- AWS Support, including the AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center
- AWS Documentation, which is now available on GitHub as an open source project
- AWS Cloud Compliance
- AWS Trusted Advisor, which includes core checks for environmental improvements
In addition, AWS Certified Partners like Mission can act as an extension of your operations team, providing consulting, professional services, and managed services to ensure that you are setting your operational priorities in such a way that you’ll be supporting shared business goals.
Preparation: Design for Operations
Well-architected workloads intentionally consider deployment, updates, and operations in their design. They’re observable by design, with logging, instrumentation, and metrics baked in from the start.
AWS enables you to model your entire workload in code, including applications, infrastructure, policy, governance, and operations. Applying rigorous engineering discipline not only to your application code, but to your entire stack, ensures that your workload is designed for operations from the start. AWS provides a huge number of tools and services that enable you to design for operations, including CloudFormation and the AWS Developer Tools.
At Mission, we help drive DevOps transformation for our customers by fully embracing everything-as-code. Our engineering teams leverage Terraform and AWS CloudFormation to create templated infrastructure that is version-controlled in Git, run through CI/CD pipelines, and then deployed with confidence. Mission also utilizes Ansible and Chef to do automated configuration management, with shared, tested libraries of cookbooks for common use cases.
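As a small illustration of templated infrastructure, the sketch below builds a CloudFormation template for a single S3 bucket as a Python dictionary and serializes it to JSON. The `bucket_template` helper and the `ArtifactBucket` logical ID are hypothetical; the template keys (`AWSTemplateFormatVersion`, `Resources`, `AWS::S3::Bucket`, `VersioningConfiguration`) are standard CloudFormation. It is not a description of Mission's actual tooling.

```python
import json


def bucket_template(logical_id: str, versioned: bool = True) -> dict:
    """Build a minimal CloudFormation template containing one S3 bucket."""
    resource = {"Type": "AWS::S3::Bucket"}
    if versioned:
        resource["Properties"] = {"VersioningConfiguration": {"Status": "Enabled"}}
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example bucket defined as code",
        "Resources": {logical_id: resource},
    }


if __name__ == "__main__":
    # The JSON printed here could be committed to Git and deployed by a pipeline.
    print(json.dumps(bucket_template("ArtifactBucket"), indent=2, sort_keys=True))
```

Generating templates from a function keeps conventions such as versioning in one place, so every bucket created this way inherits them.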
APM at Mission
Mission engineers have an obsession with data, as it enables us to quickly gain insights into our customers’ infrastructure. We capture metrics and information from many sources to surface anomalies and opportunities for improvement. In fact, one of Mission’s core competencies is Application Performance Management (APM), which we offer as a service.
Recently, a customer was struggling with what appeared to be a DDoS attack on their e-commerce website. By gathering extensive log data and aggregating it into Amazon CloudWatch, we were able to determine the cause of the problem, and quickly mitigate it through AWS WAF (Web Application Firewall).
Operating a workload requires observability, and AWS enables your teams to build highly observable systems with AWS CloudTrail, Amazon VPC Flow Logs, Amazon CloudWatch, and more. Application logs can be ingested into CloudWatch through a number of mechanisms (CLI, API, CloudWatch Events, etc.), and applications can even publish business and technical metrics directly to CloudWatch. For even deeper insights, AWS X-Ray can track requests end to end through your application.
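Publishing a business metric from application code might look like the sketch below. It assumes a boto3 CloudWatch client is passed in and uses that client's `put_metric_data` call; the `publish_metric` helper, the `Shop/Checkout` namespace, and the `OrdersPlaced` metric name are invented for the example.

```python
def publish_metric(client, namespace: str, name: str, value: float,
                   unit: str = "Count") -> dict:
    """Send one data point through a CloudWatch client's put_metric_data call."""
    request = {
        "Namespace": namespace,
        "MetricData": [{"MetricName": name, "Value": value, "Unit": unit}],
    }
    client.put_metric_data(**request)
    return request


# Example usage, assuming boto3 is installed and AWS credentials are configured:
#
#   import boto3
#   publish_metric(boto3.client("cloudwatch"), "Shop/Checkout", "OrdersPlaced", 1)
```

Passing the client in as an argument keeps the function easy to test with a stand-in object, without touching a real AWS account.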
Preparation: Operational Readiness
Operational excellence is about more than technology. It’s also about process and procedure. Teams that have mastered operational excellence create and maintain a consistent, repeatable process for deploying and operating their workloads, and have the follow-through to actually implement it. What does this look like?
- Documentation that accurately reflects the process, including checklists, runbooks, and playbooks.
- A properly trained team, right-sized to cover your operational activities. No shortcuts here! The team will need to be expert on your procedures, workloads, and the underlying AWS infrastructure.
- Governance that ensures no corners are cut before launching your workload.
Mission’s University Model
Given the current demand for AWS Certified talent, building a team isn’t always easy. AWS provides a huge number of resources, including AWS Online Tech Talks and AWS Training and Certification, along with instructor-led training. At Mission, we’ve implemented an education and training program to ensure that every one of our engineers is AWS Certified at our cost – a great benefit to our employees and their careers.
In addition to the right people and process, operationally excellent teams leverage automation and operations-as-code to codify runbooks, playbooks, and procedures to reduce risk. Leveraging resource tags, event triggers, and other features of AWS, operations teams can automate the evaluation of their environments. Procedures can be scripted using AWS Systems Manager’s Run Command and Lambda in response to CloudWatch Events. Configurations can be validated and automated using AWS Config rules, and can be benchmarked against best practices.
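The kind of evaluation described above can be sketched as a pure function that checks resources against a tagging policy. The resource shape, the `REQUIRED_TAGS` policy, and `evaluate_tags` are assumptions for illustration; the `COMPLIANT` and `NON_COMPLIANT` labels mirror the compliance states AWS Config reports, but this is not an AWS Config rule itself.

```python
from typing import Dict, Iterable, List, Sequence

REQUIRED_TAGS = ("Owner", "Environment")  # hypothetical tagging policy


def evaluate_tags(resources: Iterable[dict],
                  required: Sequence[str] = REQUIRED_TAGS) -> List[Dict[str, str]]:
    """Report each resource as COMPLIANT or NON_COMPLIANT with the tag policy."""
    results = []
    for resource in resources:
        tags = resource.get("tags", {})
        missing = [tag for tag in required if tag not in tags]
        results.append({
            "id": resource["id"],
            "compliance": "NON_COMPLIANT" if missing else "COMPLIANT",
            "missing": ", ".join(missing),
        })
    return results


if __name__ == "__main__":
    inventory = [
        {"id": "i-001", "tags": {"Owner": "web", "Environment": "prod"}},
        {"id": "i-002", "tags": {"Owner": "data"}},
    ]
    for result in evaluate_tags(inventory):
        print(result)
```

A check like this could run on a schedule or in response to change events, with non-compliant results feeding a ticket queue or a remediation script.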
At Mission, our team also leverages the power of AWS to validate change and practice procedures in temporary staging and testing environments. Complete environments are routinely spun up so that our team can practice and test improvements discovered in “pre-mortems,” Game Days, and more.
What does operational success look like? Based upon shared business goals, operations teams should create, publish, and agree upon key metrics and outcomes to define operational success for their business and workload. Clear definitions help your teams to respond to events quickly, and in ways that directly impact your business goals.
To operate successfully, your team must first understand, and then respond.
Operation: Understanding Operational Health
At Mission, we operate large-scale workloads on behalf of hundreds of customers with varying expectations, requirements, and use cases. Some of our customer workloads demand extremely low-latency, high-throughput performance, and will prioritize those requirements above all else, including cost. Others are willing to sacrifice some level of performance to cost-effectively ensure high availability. Thus, the definition of operational health varies from workload to workload, so it’s critical for us to understand the key metrics that properly capture the customer’s definition of operational success!
Similarly, your teams must huddle with key stakeholders from your business to define those key metrics and outcomes. What are the priorities of your business? Performance, cost, availability, latency, etc., must be balanced to properly support your business goals. With these key metrics in hand, your team can get to work gathering the data to understand the operational health of your workloads at a glance.
Implementing key metrics requires gathering lots of data, aggregating it, and surfacing insights through dashboards and alerting. Thankfully, AWS provides the tools and services your team needs to analyze your workloads:
- By sending log data to CloudWatch Logs, baselines can be established that define “normal.” CloudWatch Dashboards can then be used to create system level and business level views of those key metrics.
- Amazon Elasticsearch Service with Kibana can help create even more detailed visualizations for your operational health metrics.
- For monitoring portions of your workload that are delegated to AWS through the shared responsibility model, you can leverage the AWS Service Health Dashboard and the Personal Health Dashboard. If you have an AWS support subscription, you can integrate with the Personal Health Dashboard API.
In addition to the tools and services provided by AWS, consider integrating other best-of-breed tools and services like Logstash and Grafana.
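Establishing a baseline that defines "normal" can start very simply. The sketch below flags values that sit more than `k` standard deviations from the historical mean. This is a generic statistical rule chosen for illustration, not a description of how CloudWatch evaluates data, and the `anomalies` function and sample latencies are hypothetical.

```python
from statistics import mean, stdev
from typing import List, Sequence


def anomalies(history: Sequence[float], recent: Sequence[float],
              k: float = 3.0) -> List[float]:
    """Return recent values more than k standard deviations from the baseline mean."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation of the baseline window
    return [value for value in recent if abs(value - center) > k * spread]


if __name__ == "__main__":
    latency_ms_history = [100, 102, 98, 101, 99]   # made-up baseline window
    latency_ms_recent = [101, 120, 97]
    print(anomalies(latency_ms_history, latency_ms_recent))
```

The threshold `k` is a trade-off: lower values surface problems sooner but page the team more often, so it should be tuned per metric.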
Operation: Responding to Events
Being operationally excellent doesn’t guarantee that you won’t have to deal with operational events. That said, operational excellence requires you to properly anticipate that failures will happen, and be ready to respond quickly and effectively by leveraging your operational health metrics, processes, and procedures.
Considering Business Impact
One critical factor in responding to events is understanding the business impact of the components that make up your workloads, to help your teams prioritize their focus. Business impact metrics can be built into your dashboards, so that this critical information is available at a glance.
AWS enables operations teams to script event response through operations as code. By leveraging your monitoring data, you should create triggers to help mitigate in the case of failure. For example, automated rollbacks to known good versions of components can minimize impact to production services, giving you time to perform analysis in non-production environments.
The key AWS service for responding to events is Amazon CloudWatch, including Amazon CloudWatch Events, as they can act as a central hub for coordinated automated response. For example, when a key metric crosses a threshold, CloudWatch alarms can be created to auto scale Amazon EC2 instances, and to send notifications via Amazon SNS.
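A threshold alarm of this kind can be defined in code. The sketch below assumes a boto3 CloudWatch client is passed in and uses its `put_metric_alarm` call to notify an SNS topic when average CPU across an Auto Scaling group stays high. The `create_cpu_alarm` helper, the 80% threshold, the group name, and the topic ARN in the usage comment are all placeholders.

```python
def create_cpu_alarm(client, alarm_name: str, group_name: str, topic_arn: str,
                     threshold: float = 80.0) -> dict:
    """Create an alarm that notifies an SNS topic when average CPU stays high."""
    request = {
        "AlarmName": alarm_name,
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "AutoScalingGroupName", "Value": group_name}],
        "Statistic": "Average",
        "Period": 300,            # seconds per evaluation window
        "EvaluationPeriods": 2,   # consecutive windows that must breach
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],
    }
    client.put_metric_alarm(**request)
    return request


# Example usage, assuming boto3 is installed and AWS credentials are configured:
#
#   import boto3
#   create_cpu_alarm(
#       boto3.client("cloudwatch"),
#       "web-high-cpu",
#       "web-asg",
#       "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder ARN
#   )
```

Requiring two consecutive breaching periods is one way to avoid reacting to a single short spike; the right values depend on the workload.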
After an event has been navigated successfully, a root cause analysis, followed by a full post-mortem, should be included in your standard procedures, to make sure that opportunities for improvement don’t get lost in the shuffle.
In my experience running operations teams, the most important predictor of success is a passion for learning. Teams that want to achieve operational excellence need to cultivate a culture of curiosity and continuous improvement, where every experience is an opportunity first to learn, and then to share those lessons far and wide.
Evolution: Learning from Experience
While no operations team looks forward to production issues, I’ve found that the best teams enjoy digging in to learn from failure. As a leader, encouraging operations teams to analyze, experiment, and improve will pay great dividends over time.
AWS provides an extensive platform to enable analysis and experimentation:
- Amazon CloudWatch and CloudTrail can be combined with Amazon Elasticsearch Service with Kibana.
- Exporting large amounts of data to Amazon S3 enables analysis with Amazon Athena and Amazon QuickSight, including rich visualizations to help your teams gain insights.
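As one example of this kind of analysis, the sketch below starts an Athena query that counts failed API calls by event name. It assumes a boto3 Athena client is passed in and uses its `start_query_execution` call; the `cloudtrail_logs` table, its column names, the database name, and the S3 output location are assumptions about how the data was exported and cataloged.

```python
TOP_ERRORS_SQL = """
SELECT eventname, count(*) AS failures
FROM cloudtrail_logs
WHERE errorcode IS NOT NULL
GROUP BY eventname
ORDER BY failures DESC
LIMIT 10
""".strip()


def start_error_report(client, database: str, output_location: str) -> str:
    """Start the Athena query and return its execution id."""
    response = client.start_query_execution(
        QueryString=TOP_ERRORS_SQL,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Example usage, assuming boto3 is installed and AWS credentials are configured:
#
#   import boto3
#   start_error_report(boto3.client("athena"), "ops", "s3://example-bucket/athena/")
```

Athena runs queries asynchronously, so the returned execution id is what a caller would use to poll for completion and fetch results.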
When experimenting and evolving, be sure to pull in other parts of the business to add their points of view. Frequently, new opportunities for improvement will surface when additional perspectives are solicited.
Evolution: Share Learnings
Because of Mission’s status as an AWS Certified MSP, we have an opportunity to learn from hundreds of workloads, and dozens of use cases. Mission’s engineers enjoy little more than evolving our platform based upon those lessons. Every improvement we make rolls out to all of our customers over time, giving maximum benefit to our entire customer base. Similarly, many organizations have multiple product and operations teams. By sharing lessons broadly, you give the entire company the benefit of your own evolution.
AWS enables sharing of best practices. Your teams can define shared libraries for implementing best practices, including CloudFormation templates, Chef Cookbooks or Ansible Playbooks, Lambda functions for common operational tasks, and more. When sharing resources, leverage AWS IAM to define permissions enabling controlled access.
I hope this deep dive into operational excellence has inspired you to promote a culture of continuous improvement and curiosity in your operations teams. If you’d like to learn more about the AWS Well-Architected Framework, check out our webinar on the Operational Excellence Pillar. By following some of the best practices laid out in the AWS Well-Architected Whitepaper on Operational Excellence, you can help drive that change in your organization.
As an AWS Well-Architected Review Launch Partner, Mission provides assurance for your AWS infrastructure, performing compliance checks across the five key pillars. Reach out to Mission now to schedule a Well-Architected Review.