In today’s fast-paced technological landscape, organizations need to be able to create value efficiently and effectively. To master site reliability engineering, they must prioritize automation, resilience, observability, and velocity. However, this is easier said than done.
It requires a fundamental shift in mindset, which can be challenging to accomplish.
The adoption of Agile, DevOps, ITIL, and SRE cultures necessitates a change in how we approach problems and work collaboratively toward achieving wider organizational goals.
In this blog post, we will explore what it takes to master site reliability engineering, how to develop an SRE mindset, and how that mindset can help organizations build more resilient systems, increase efficiency, and deliver better customer experiences. So, let’s dive in!
What is Site Reliability Engineering (SRE)?
Site reliability engineering (SRE) is a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. (Source: Wikipedia)
Important considerations of SRE
SRE was started at Google in 2003 by Ben Treynor Sloss. Many organizations are still discovering what SRE is all about. There are myths about what SRE is and what it’s not.
Josh Armitage, a Distinguished Technologist at Contino, emphasizes three important considerations about SRE:
- SRE is not one size fits all
There is no doubt that Google is a leader in SRE, but adopting Google’s SRE model as-is is rarely feasible for other organizations.
Instead, organizations can adopt the parts of Google’s culture and best practices that fit their own context.
Organizations with complex, mission-critical systems may require a more robust implementation of SRE, while those with simpler systems may need a more lightweight approach.
Reliability and stability must be balanced with innovation and development.
SRE can only succeed if it is aligned with the overall business strategy and goals of the organization while remaining agile and responsive to changing demands.
- SRE is fundamentally different from traditional operations
Rebranding traditional operations as SRE and then claiming to do SRE adds no value.
Perfection is not an achievable goal. Aiming for 100% reliability is counterproductive and pulls a team away from its real objectives.
Traditional operations teams focus mainly on maintaining systems, but SRE teams look at the overall system reliability and scalability.
For example, SRE teams strive to improve the availability of services through automation, greater reliability engineering, and intelligent system design, not just by responding to outages.
- Failure is inevitable, and that’s ok
Systems do fail. System failures occur at Microsoft, Google, and Amazon. That’s ok.
A lot of this involves understanding that failure is inevitable. What builds your SRE practice is how you deal with it and how you learn from it.
The goal of SRE teams is not to eliminate all failures but to manage and mitigate their impacts through techniques like service level objectives, service level indicators, error budgets, and incident management.
When SRE teams establish clear priorities, communication channels, and escalation procedures, failures can be minimized and issues can be resolved quickly.
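As a rough illustration of how service level indicators, service level objectives, and error budgets fit together, here is a minimal Python sketch. It assumes a simple request-based availability SLI; the function names and the numbers in the usage note are hypothetical and are not taken from any particular tool.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """Return the measured availability (the SLI) as a fraction between 0 and 1."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Return the fraction of the error budget that is still unspent.

    1.0 means nothing has been spent, 0.0 means the budget is exhausted,
    and a negative value means the SLO has been breached.
    """
    if not 0 < slo_target < 1:
        # A 100% target leaves no budget at all, which is why it is impractical.
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures
```

For example, with a 99.9% SLO and one million requests, roughly 1,000 failures are tolerated. If 250 requests have failed, about three quarters of the budget remains; at 2,000 failures the result is negative, a signal that a team might pause feature work and focus on reliability.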
To succeed in SRE, you need a different mindset. It is important to learn the mental models behind it, how Google arrived at them, and the problems Google was trying to solve.
SRE is often treated as a ready-made solution to a problem; instead, people should first focus on the more fundamental improvements that sit underneath it.
Importance of Team Topologies
Teams must organize themselves in a way that facilitates decision-making. In an organization or within engineering functions, every team should be aligned to one of four team types.
This will enable them to deliver business value and make the right decisions.
(Source: Team Topologies)
- A complicated-subsystem team owns a part of the system that requires significant specialist expertise
- A stream-aligned team is aligned to the flow of work from a specific business domain
- A platform team provides a platform (e.g., a cloud platform) that other teams build on
- An enabling team helps other teams overcome obstacles
SRE falls into one of these four team types. An organization can practice Site Reliability Engineering in two flavors: ‘enabling team’ based SRE or ‘platform team’ based SRE.
On the platform side, a “kitchen sink” SRE team is generally how organizations get started. The team is still figuring out how it is going to work, and it uses what it learns to decide what to do next.
At this nascent stage, the team tries a bit of everything and works across teams to see what works.
The platform team, which might be the SRE team, provides infrastructure (Kubernetes or something similar) for other teams to consume.
In other cases, a dedicated SRE team handles critical operations for specific products and applications.
This can end up close to traditional operations and is generally not a wise idea: it restricts learning across the organization, creates silos, and tends to take us backward from DevOps.
On the enabling side, SRE team members are embedded in other teams to promote SRE practices and create bespoke tools that assist those teams.
In a consulting model, the SRE team works with another team on short-term projects to upskill it.
Platform teams scale through technology, and enabling teams scale through upskilling
Team upskilling is facilitated by combining the platform and enabling cultures. Feedback loops facilitate meaningful conversations and help to build a platform and infrastructure that serve teams well. Agile working methods, trade-offs, and prioritization offer the option of scaling both up and down.
The mindset of an SRE
The delivery teams build software, and the SRE teams own reliability, so where does reliability come from?
According to Emily Freeman, author of DevOps for Dummies, the Revolutionary Software Development model represents the dynamic, fluid, and non-linear nature of software development today.
From architecting at the outermost ring through developing, automating, and deploying to operating at the center, the five circles represent the critical roles of software development.
Securability, observability, reliability, testability, flexibility, and scalability are the six spokes that segment these rings.
A system’s reliability comes from all of the steps you take when building it.
These include operating, deploying, automating, developing, and architecting. Reliability is affected by every action.
No matter how many SREs there are, reliability can never come from the SRE team alone; every role has to contribute. This is how reliable systems are built.
The Quality Feedback Loop
Quality should be considered everywhere. We cannot expect a poorly developed system to operate reliably.
The SRE team at Google initially specified that they would not accept any system that hasn’t been running in production for at least six months and that has a proven track record of being generally stable.
This can look like pushback: the SRE team takes over operations only if the quality is good. If the quality bar is not met, the development team ends up doing the operations itself until the quality improves.
Getting feedback is essential. Developers might not have to be on call, but they need to be aware that, if the quality drops, they will be on call.
SRE teams at Google scaled up when they realized firefighting took up more than 50% of what the team does on a day-to-day or week-to-week basis.
Firefighting will not allow us to move forward. Too much technical debt prevents developers from moving forward.
They must continue to invest in aspects of the system that users might not be concerned about because they allow you to do what people care about tomorrow.
There is no clear way to say whether a 50/50 rule works for every organization, but you must understand what percentage on either side you are happy to accept.
Whether it is 50/50, 30/70, or 40/60, having this gauge of where you are helps you make better decisions about how to work.
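One hedged way to keep such a gauge is to track how a team's hours split between reactive work and engineering work. The Python sketch below assumes hours are logged per activity; the function names, activity names, and figures are hypothetical, and the 50% default simply mirrors the threshold discussed above.

```python
def firefighting_share(hours_by_activity: dict[str, float], reactive: set[str]) -> float:
    """Return the fraction of logged hours spent on reactive (firefighting) work."""
    total_hours = sum(hours_by_activity.values())
    if total_hours <= 0:
        raise ValueError("no hours logged")
    reactive_hours = sum(
        hours for activity, hours in hours_by_activity.items() if activity in reactive
    )
    return reactive_hours / total_hours


def exceeds_threshold(share: float, threshold: float = 0.5) -> bool:
    """Return True when reactive work takes up more than the agreed share."""
    return share > threshold


# Hypothetical week for one team, in hours.
week = {"incidents": 14, "tickets": 8, "automation": 12, "design": 6}
share = firefighting_share(week, reactive={"incidents", "tickets"})
needs_attention = exceeds_threshold(share)
```

In this made-up week, 22 of 40 hours are reactive, so the team is above a 50% threshold. An organization that has agreed on a 60/40 split would pass `threshold=0.6` instead.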
Value of SRE Certification
Earning a certification in Site Reliability Engineering (SRE) is a valuable way to demonstrate your knowledge and skills and can make you stand out during a competitive interview.
Certifications should not be viewed as a substitute for real-world experience or practical skills. Employers often seek candidates with both certifications and practical experience.
In addition, SRE certifications should not be viewed as the sole measure of expertise or success in this field.
If you want to learn more about the benefits of becoming an SRE-certified professional, be sure to check out our previous blog post “The Benefits of SRE Certification”.
It’s a great resource for understanding the advantages and opportunities that come with earning an SRE certification.
Thank you for reading!