NEW YORK — Site reliability engineers (SREs) are warning of a looming scalability ceiling and saying the adoption of AIOps isn’t happening at a high rate, according to a recent survey.
The survey of nearly 300 SREs is part of the “2021 SRE Report,” which has played a role in “defining the nature of what it means to be a SRE” since it launched four years ago, according to its authors.
Catchpoint, maker of a digital experience monitoring (DEM) platform, released the report this week with VMware Tanzu and the DevOps Institute.
“SREs deal with a very broad set of challenges that span across transformational and operational activities,” said Mehdi Daoudi, CEO, Catchpoint.
“This report arms them with the insights they need to help address these challenges — to balance the need for agility against the need for stability when building and operating massive, distributed and reliable systems.”
An SRE is “what you get when you treat operations as if it’s a software problem,” according to Google’s SRE team.
The SRE team at Google says it protects, provides for and progresses the software and systems behind the company’s products.
Details on the findings from the “2021 SRE Report” are below.
Multiple providers creating scalability ceiling
SREs are contending with the rising use of multi-cloud as well as the increase in the “volume, velocity and variety” of monitoring data, according to the authors.
The resulting lack of visibility across the stack (53%) is the most cited cloud-app monitoring challenge, and SREs are continually refining service level objectives (50%).
Kurt Andersen, an SRE architect at Blameless, addressed how companies can scale SRE implementations:
“Spanning the gaps between the interfaces and the data that each provider offers increases the difficulty for SRE teams to automate across those multiple providers,” Andersen said.
“These integrations are rarely simple, except for the most superficial aspects. Effectively mapping disparate data models together may be the next frontier for SRE in a multi-vendor environment.”
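To make the data-model gap Andersen describes concrete, here is a minimal Python sketch, not drawn from the report, that maps latency records from two providers into one shared shape. Both providers, their field names and the LatencySample type are hypothetical and exist only for illustration; real integrations involve far more than unit and timestamp conversions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LatencySample:
    """Provider-neutral record (hypothetical shape, for illustration only)."""
    service: str
    timestamp: datetime  # always stored as UTC
    latency_ms: float


def from_provider_a(record: dict) -> LatencySample:
    # Hypothetical provider A: epoch seconds, latency reported in seconds.
    return LatencySample(
        service=record["svc"],
        timestamp=datetime.fromtimestamp(record["ts"], tz=timezone.utc),
        latency_ms=record["latency_s"] * 1000.0,
    )


def from_provider_b(record: dict) -> LatencySample:
    # Hypothetical provider B: ISO 8601 timestamp with an explicit offset,
    # nested service name, latency reported in milliseconds.
    return LatencySample(
        service=record["resource"]["name"],
        timestamp=datetime.fromisoformat(record["time"]).astimezone(timezone.utc),
        latency_ms=float(record["latencyMillis"]),
    )


if __name__ == "__main__":
    a = from_provider_a({"svc": "checkout", "ts": 1609459200, "latency_s": 0.25})
    b = from_provider_b({
        "resource": {"name": "checkout"},
        "time": "2021-01-01T00:00:00+00:00",
        "latencyMillis": 250,
    })
    # Once normalized, the two records can be compared or alerted on together.
    print(a == b)
```

Each adapter handles one provider's quirks, so automation written against LatencySample does not need to know which provider a record came from.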
Slow move to AIOps
AIOps has been “widely touted to reduce laborious ops work” and sift through the volumes of data presented to organizations, according to the authors.
The survey of SREs, however, shows that many of them have never used AIOps, and their rating of its received value “evenly spanned the 1-9 value scale.”
J. Bobby Dorlus, a staff SRE at Twitter, addressed where AIOps could figure in the role:
“Most SREs working at this scale are already leveraging machine learning, especially when it comes to efficiencies around data centers (locations, cooling and all the things that happen inside it) for networks and building out infrastructure,” Dorlus said.
“Evolving that into AIOps is the next logical step.”
Observability and business KPIs
SREs who “fail to deliver customer value run the risk of being stuck in an operational toil rut,” according to the authors.
At the same time, companies that fail to recognize the importance of SRE activities “risk losing talented employees and their competitive edge.”
The survey of SREs indicates the highest-ranked driver for successful SRE implementations is incident resolution (60%), while expanding the business is the fifth lowest (33%).
SREs are focusing inward on IT operations rather than outward on “business results that deliver customer value.”
The report’s authors recommend that SRE teams expand their observability boundaries to include digital experience metrics and business KPIs.
“The balancing work of innovation while providing operational excellence has forced many IT teams to put heavy emphasis on improving reliability and stability of services and applications,” said Eveline Oehrlich, chief research and content officer, DevOps Institute.
“What SREs now need to do is make sure the value of these reliable services and applications are understood by the customer.”
Toil levels fall
For SREs, toil is considered the work tied to a production service that tends to be “manual, repetitive, automate-able and devoid of enduring value,” according to the authors.
Google says SREs should spend no more than 50% of their time on ops work, including toil, leaving at least 50% for development work.
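The 50% guideline comes down to simple arithmetic on how a team's hours are split. The Python sketch below is an illustration only: the function names and the way hours are logged are assumptions, not something Google or the report prescribes.

```python
def ops_share(ops_hours: float, dev_hours: float) -> float:
    """Return the fraction of total logged time spent on ops work (toil included)."""
    total = ops_hours + dev_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return ops_hours / total


def exceeds_ops_cap(ops_hours: float, dev_hours: float, cap: float = 0.5) -> bool:
    """True when ops work takes more than the cap (50% by default)."""
    return ops_share(ops_hours, dev_hours) > cap


if __name__ == "__main__":
    # Hypothetical week: 24 hours of ops work and 16 hours of dev work.
    print(ops_share(24, 16))
    print(exceeds_ops_cap(24, 16))
```

A team logging 24 ops hours in a 40-hour week would be over the cap, while an even 20/20 split would sit exactly at it.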
The 2021 report shows an average year-over-year drop in toil of 15% among SREs.
“The reason this is such an impactful insight is that the drop in toil was across all geographies,” said Tony Ferrelli, VP of technical operations, Catchpoint.
“If this drop in toil was because work felt more meaningful since COVID-19 led to SREs working from home, then will reported toil levels rise next year as people return to the office or a hybrid work environment?”