What is a Site Reliability Engineer (SRE)?

In a previous article for the Life in Tech section of FOSSlife, we looked at the duties and responsibilities of a system administrator. This time, we’ll look at the role of site reliability engineer (SRE), which is related to system administration but goes beyond that role and requires a markedly different skillset. We’ll explain what you need to know and provide an overview of the job expectations to help you understand this relatively new career path.

Background of SRE

The site reliability engineering concept originated at Google. The idea is closely related to the principles of DevOps and was conceived as a way to reduce tension between software engineers and product developers (Dev) and sys admins and operations staff (Ops) that can arise at scale due to differing costs, timelines, and perceived priorities. The SRE role can also serve as a bridge between development and operations and is rooted in the approach of applying a software engineering mindset to system administration concepts.

The Site Reliability Engineering book states:

“Traditional operations teams and their counterparts in product development thus often end up in conflict, most visibly over how quickly software can be released to production. At their core, the development teams want to launch new features and see them adopted by users. At their core, the ops teams want to make sure the service doesn’t break while they are holding the pager. Because most outages are caused by some kind of change—a new configuration, a new feature launch, or a new type of user traffic—the two teams’ goals are fundamentally in tension.”

Google addressed the issue by hiring software engineers to manage, execute, and automate the work that would otherwise be performed by sys admins. These engineers engage with the entire software lifecycle to build, deploy, monitor, and maintain the underlying software systems.

Ben Treynor Sloss, who founded the Site Reliability team at Google, described the role this way in conversation with Niall Murphy: “In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.”

Roles and Responsibilities

What does the SRE role involve day to day, and what skills does it require? According to VictorOps, common roles and responsibilities include:

  • Building software to help operations and support teams.
  • Fixing support escalation issues.
  • Optimizing on-call rotations and processes.
  • Documenting “tribal” knowledge.
  • Conducting post-incident reviews.

As a more specific example, the duties of an SRE at GitLab may include the following, although every company will have its own set of requirements:

  • Be on pager duty rotation to respond to incidents and provide support for service engineers with customer incidents.
  • Use your on-call shift to prevent incidents from ever happening.
  • Run our infrastructure with Chef, Terraform, and Kubernetes.
  • Make monitoring and alerting activate on symptoms and not on outages.
  • Document every action so your findings turn into repeatable actions, and then into automation.
  • Improve the deployment process to make it as boring as possible.
  • Debug production issues across services and levels of the stack.
  • Design, build, and maintain core infrastructure pieces; plan growth of infrastructure.
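
To make the "alert on symptoms" item in the list above more concrete, here is a minimal Python sketch. It is not taken from GitLab or any particular monitoring tool: the WindowStats shape, the should_page function, and the 1% threshold are all assumptions chosen for illustration. The idea is to page on something users actually experience (a rising error rate) rather than waiting for a component to be reported as down.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request counts for one monitoring window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def should_page(stats: WindowStats,
                max_error_rate: float = 0.01,
                min_requests: int = 100) -> bool:
    """Return True when the user-visible error rate exceeds the threshold.

    The 1% threshold and the 100-request minimum are illustrative
    assumptions, not recommended values.
    """
    # Too little traffic to judge: avoid paging on a handful of requests.
    if stats.total_requests < min_requests:
        return False
    error_rate = stats.failed_requests / stats.total_requests
    return error_rate > max_error_rate


if __name__ == "__main__":
    healthy = WindowStats(total_requests=1000, failed_requests=5)
    degraded = WindowStats(total_requests=1000, failed_requests=50)
    print(should_page(healthy), should_page(degraded))
```

In practice a check like this would usually live in the alerting rules of a monitoring system rather than in application code, but the logic is the same: the condition describes what users see, not which machine failed.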

Setting up for Success

Other essential traits and abilities for this role go beyond the necessary technical skills. In the article “7 Habits of Highly Successful Site Reliability Engineers” on New Relic, Kevin Casey describes these characteristics in more detail:

  1. You analyze every change in the context of the (much) bigger picture.
  2. You’re pragmatic and forward-thinking about that analysis.
  3. You are willing to move on when something isn’t actually helping.
  4. You embrace every opportunity to automate.
  5. You can persuade organizations to do what needs to be done.
  6. You expand your existing skillset to include new tools and approaches.
  7. You trust the process.

This big-picture mindset is key. In her excellent guide “How to Get Into SRE,” Alice Goldfuss states, “A Software Engineer who is on-call understands how their code works, how it can break, and how to fix it. A Site Reliability Engineer understands how that code fits into the larger tapestry of the company’s architecture and tries to set the whole system up for success.”

If you think a role in site reliability engineering is right for you, you can learn much more about what to expect and how to prepare from the resources below.

Learn More

Keys to SRE talk by Benjamin Treynor Sloss 
How to Get Into SRE by Alice Goldfuss
Site Reliability Engineering edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy
DevOps Jobs: 5 Trends to Watch
