This post defines the roles and responsibilities of a site reliability engineer and shows how SRE can improve the resilience of your people, processes, and technology.
Software development keeps getting faster and more complex, which puts more strain on IT operations teams than ever. DevOps gained popularity as a way to combat siloed workflows, poor collaboration, and a lack of visibility. While establishing a culture of DevOps has helped teams collaborate better and deliver reliable software faster, DevOps teams don’t necessarily have someone dedicated to developing systems that increase site reliability and performance. That’s where a site reliability engineer (SRE) comes into the picture.
The concept of SRE was first brought to life at Google by engineer Ben Treynor. Google later published its popular SRE book, which helped the movement gain traction across the industry. Site reliability engineers sit at the crossroads of traditional IT and software development. Basically, SRE teams are made up of software engineers who build and implement software to improve the reliability of their systems.
So, let’s first define the basic roles and responsibilities of a site reliability engineer and show how SRE can drastically improve the resilience of your people, processes, and technology.
What Is Site Reliability Engineering (SRE)?
In the words of Ben Treynor, SRE is “what happens when you ask a software engineer to design an operations function.” In a traditional setup of siloed IT operations and software development teams, developers would throw their code over to IT professionals. Then, IT would be in charge of deployment, maintenance, and any on-call responsibilities associated with the system in production. Luckily, DevOps came along and forced developers to share accountability for systems in production, own their code, and take on-call responsibilities.
DevOps pushed shared responsibility for the reliability of your applications and infrastructure. And, while this is a great first step forward, it doesn’t proactively help teams add resilience to their system. Many DevOps teams, even with shortened feedback loops and improved collaboration, can still find themselves deploying new, unreliable services into production at a rapid pace.
Site reliability engineering is a way to bridge the gap between developers and IT operations, even in a DevOps culture. It isn’t SRE vs. DevOps – it’s SRE with DevOps. You can think of SRE as a more proactive form of QA. Site reliability engineers are dedicated full-time to creating software that improves the reliability of systems in production, fixing issues, responding to incidents, and, in most cases, taking on-call responsibilities.
Common Roles and Responsibilities for a Site Reliability Engineer
Implementing an SRE team can benefit both IT operations and software development teams. Not only can SRE drive deeper reliability into systems in production, but it will likely help IT, support, and development teams spend less time on support escalations and more time building new features and services.
So, let’s quickly go over common site reliability engineering roles and responsibilities you can expect to see.
Building Software to Help Operations and Support Teams
SRE teams are in charge of proactively building and implementing services to make IT and support better at their jobs. This can be anything from adjustments to monitoring and alerting to code changes in production. A site reliability engineer can be tasked with building a homegrown tool from scratch to help with weaknesses in software delivery or incident management.
Fixing Support Escalation Issues
Similar to the point above, a site reliability engineer can expect to spend time fixing support escalation cases. But, as your SRE operations mature, your systems will become more reliable and you’ll see fewer critical incidents in production – leading to fewer support escalations. Because an SRE team touches so many different parts of the engineering and IT organization, they can be a great source of knowledge and can be helpful for routing issues to the right people and teams.
Optimizing On-Call Rotations and Processes
More often than not, site reliability engineers will need to take on-call responsibilities. At most organizations, the SRE role has a lot of say in how the team improves system reliability by optimizing on-call processes. SRE teams help add automation and context to alerts – leading to a better real-time collaborative response from on-call responders. Additionally, site reliability engineers can update runbooks, tools, and documentation to help prepare on-call teams for future incidents.
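To make "adding context to alerts" a little more concrete, here is a minimal Python sketch of one way it could look. Everything in it is an assumption for illustration: the Alert class, the SERVICE_CATALOG lookup table, the enrich_alert function, and the example service name and URL are hypothetical and not tied to any particular monitoring or on-call tool.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """A hypothetical, simplified alert as it might arrive from monitoring."""
    service: str
    summary: str
    severity: str
    context: dict = field(default_factory=dict)


# Hypothetical lookup table. In practice this information might live in a
# service catalog or a configuration repository.
SERVICE_CATALOG = {
    "checkout-api": {
        "owner": "payments-team",
        "runbook": "https://wiki.example.com/runbooks/checkout-api",
    },
}


def enrich_alert(alert, catalog=SERVICE_CATALOG):
    """Return a copy of the alert with owner and runbook context attached."""
    entry = catalog.get(alert.service, {})
    context = dict(alert.context)  # copy so the original alert is not mutated
    context["owner"] = entry.get("owner", "unassigned")
    context["runbook"] = entry.get("runbook", "no runbook on file")
    return Alert(alert.service, alert.summary, alert.severity, context)


if __name__ == "__main__":
    raw = Alert("checkout-api", "p99 latency above threshold", "critical")
    print(enrich_alert(raw).context)
```

The idea is that the on-call responder sees who owns the service and where the runbook lives as soon as the alert fires, instead of hunting for that information mid-incident.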
Documenting “Tribal” Knowledge
SRE teams gain exposure to systems in both staging and production, as well as all technical teams. They take part in work with software development, support, IT operations, and on-call duties – meaning they build up a great amount of historical knowledge over time. Instead of siloing this knowledge into the mind of one team or one person, site reliability engineers can be tasked with documenting much of what they know. Constant upkeep of documentation and runbooks can ensure that teams get the information they need right when they need it.
Conducting Post-Incident Reviews
Without thorough post-incident reviews, you have no way to identify what’s working and what’s not. SRE teams need to keep teams honest and ensure that everyone – software developers and IT professionals alike – is conducting post-incident reviews, documenting their findings, and taking action on what they learn. Site reliability engineers are then often assigned the action items for building or optimizing some part of the SDLC or incident lifecycle to bolster the reliability of their service.
Where Does SRE Fit on Your Team?
Site reliability engineering roles and responsibilities are crucial to the continuous improvement of people, processes, and technology within any organization. Whether your team has already adopted a full-blown DevOps culture or you’re still making the transition, SRE offers numerous benefits for speed and reliability. SRE fits right at the crossroads of IT operations, support, and software engineering, and it brings the blend of skills needed to tighten the relationship between IT and developers – leading to shorter feedback loops, better collaboration, and more reliable software.
Pros and Cons of Being a Site Reliability Engineer
Catchpoint recently put out their 2021 SRE Report showing that site reliability engineers were some of the happiest employees in software development and IT. While SREs can’t spend all of their time building new features for customers, they’re constantly making an impact on customer experience. In fact, if you’re looking for a role designed to help customers the most – then SRE is it.
Site reliability engineering not only improves the lives of customers but, when done right, improves the lives of on-call teams, IT professionals, and software developers. SRE can be one of the most fulfilling roles for a software engineer. It can help you better understand the struggles of IT and support, making you a better developer going forward.
See how we added SRE into our own DevOps culture – driving deeper reliability and collaboration across all of our teams. Check out our complete resource guide to see how site reliability engineering can increase system reliability and quickly drive value for your own team.