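Preface

While reading up on SRE and DevOps, I came across a Google Cloud Blog post dedicated to the topic. Several Chinese translations exist, but none fully meets the translator's standard of faithfulness, expressiveness and elegance (信, 达, 雅). This post therefore reproduces Google's original article and YouTube videos, adds a carefully done community translation, and mirrors the videos on bilibili (B站) for easier viewing. I hope it gives readers a deeper understanding of SRE and DevOps.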

SRE vs. DevOps: competing standards or close friends?

Change history

June 25, 2019 - First draft

Original post - https://wsgzao.github.io/post/sre-vs-devops/

Further reading

SRE vs. DevOps: competing standards or close friends? - https://cloud.google.com/blog/products/gcp/sre-vs-devops-competing-standards-or-close-friends
DevOps and SRE - https://blog.alswl.com/2018/09/devops-and-sre/


English original

SRE vs. DevOps: competing standards or close friends?

Seth Vargo: Staff Developer Advocate
Liz Fong-Jones: Site Reliability Engineer
May 9, 2018

Site Reliability Engineering (SRE) and DevOps are two trending disciplines with quite a bit of overlap. In the past, some have called SRE a competing set of practices to DevOps. But we think they’re not so different after all.

What exactly is SRE and how does it relate to DevOps? Earlier this year, we (Liz Fong-Jones and Seth Vargo) launched a video series to help answer some of these questions and reduce the friction between the communities. This blog post summarizes the themes and lessons of each video in the series to offer actionable steps toward better, more reliable systems.

1. The difference between DevOps and SRE

It’s useful to start by understanding the differences and similarities between SRE and DevOps to lay the groundwork for future conversation.

The DevOps movement began because developers would write code with little understanding of how it would run in production. They would throw this code over the proverbial wall to the operations team, which would be responsible for keeping the applications up and running. This often resulted in tension between the two groups, as each group’s priorities were misaligned with the needs of the business. DevOps emerged as a culture and a set of practices that aims to reduce the gaps between software development and software operation. However, the DevOps movement does not explicitly define how to succeed in these areas. In this way, DevOps is like an abstract class or interface in programming. It defines the overall behavior of the system, but the implementation details are left up to the author.

SRE, which evolved at Google to meet internal needs in the early 2000s independently of the DevOps movement, happens to embody the philosophies of DevOps, but has a much more prescriptive way of measuring and achieving reliability through engineering and operations work. In other words, SRE prescribes how to succeed in the various DevOps areas. For example, the table below illustrates the five DevOps pillars and the corresponding SRE practices:

DevOps | SRE
Reduce organizational silos | Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal | Have a formula for balancing accidents and failures against new releases
Implement gradual change | Encourage moving quickly by reducing costs of failure
Leverage tooling & automation | Encourage “automating this year’s job away” and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything | Believe that operations is a software problem, and define prescriptive ways for measuring availability, uptime, outages, toil, etc.

If you think of DevOps like an interface in a programming language, class SRE implements DevOps. While the SRE program did not explicitly set out to satisfy the DevOps interface, both disciplines independently arrived at a similar set of conclusions. But just like in programming, classes often include more behavior than just what their interface defines, or they might implement multiple interfaces. SRE includes additional practices and recommendations that are not necessarily part of the DevOps interface.
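The interface analogy can be written out as a small sketch. Python has no `implements` keyword, so an abstract base class stands in for the interface here. The method names and return strings are illustrative labels taken from the table above, not an official API, and `teach_customers` is a hypothetical extra method standing in for practices that go beyond the interface.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The five DevOps pillars as abstract methods (illustrative names)."""

    @abstractmethod
    def reduce_organizational_silos(self) -> str: ...

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_change(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete implementation of the DevOps 'interface'."""

    def reduce_organizational_silos(self) -> str:
        return "share ownership with developers"

    def accept_failure_as_normal(self) -> str:
        return "balance failures against new releases"

    def implement_gradual_change(self) -> str:
        return "reduce the cost of failure"

    def leverage_tooling_and_automation(self) -> str:
        return "automate this year's job away"

    def measure_everything(self) -> str:
        return "measure availability, uptime, outages and toil"

    # Extra behavior that the interface does not require (hypothetical name).
    def teach_customers(self) -> str:
        return "customer reliability engineering"
```

Instantiating `DevOps` directly raises `TypeError`, which mirrors the point that DevOps on its own leaves the implementation details open.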

DevOps and SRE are not two competing methods for software development and operations, but rather close friends designed to break down organizational barriers to deliver better software faster. If you prefer books, check out How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) for a more thorough explanation.

2. SLIs, SLOs, and SLAs

The SRE discipline collaboratively decides on a system’s availability targets and measures availability with input from engineers, product owners and customers.

It can be challenging to have a productive conversation about software development without a consistent and agreed-upon way to describe a system’s uptime and availability. Operations teams are constantly putting out fires, some of which end up being bugs in developers’ code. But without a clear measurement of uptime and a clear prioritization on availability, product teams may not agree that reliability is a problem. This very challenge affected Google in the early 2000s, and it was one of the motivating factors for developing the SRE discipline.

SRE ensures that everyone agrees on how to measure availability, and what to do when availability falls out of specification. This process includes individual contributors at every level, all the way up to VPs and executives, and it creates a shared responsibility for availability across the organization. SREs work with stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

  • SLIs are metrics over time such as request latency, throughput of requests per second, or failures per request. These are usually aggregated over time and then converted to a rate, average or percentile subject to a threshold.
  • SLOs are targets for the cumulative success of SLIs over a window of time (like “last 30 days” or “this quarter”), agreed upon by stakeholders.
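As a rough illustration of the two definitions, the sketch below aggregates per-request records into two SLIs and compares one of them against an SLO target. The `Request` record, the function names and the thresholds are assumptions made for this example; real systems usually compute these figures from monitoring data rather than in-memory lists.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to serve the request
    ok: bool           # True if the request succeeded


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests in the window that succeeded."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.ok for r in requests) / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Fraction of all requests served faster than threshold_ms."""
    if not requests:
        raise ValueError("no requests in the window")
    return sum(r.latency_ms < threshold_ms for r in requests) / len(requests)


def meets_slo(sli: float, slo_target: float) -> bool:
    """An SLO is met when the aggregated SLI reaches the agreed target."""
    return sli >= slo_target
```

For example, `meets_slo(availability_sli(window), 0.999)` would answer whether a hypothetical 99.9% availability SLO held over the requests in `window`.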

The video also discusses Service Level Agreements (SLAs). Although not specifically part of the day-to-day concerns of SREs, an SLA is a promise by a service provider, to a service consumer, about the availability of a service and the ramifications of failing to deliver the agreed-upon level of service. SLAs are usually defined and negotiated by account executives for customers and offer a lower availability than the SLO. After all, you want to break your own internal SLO before you break a customer-facing SLA.
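To make the relationship between an internal SLO and a customer-facing SLA concrete, here is a minimal sketch that converts each target into the downtime it allows over a window. The 99.9% and 99.5% figures and the 30-day window are hypothetical values chosen for illustration, not numbers from the article.

```python
def allowed_downtime_minutes(target: float, window_days: int = 30) -> float:
    """Minutes of downtime a given availability target permits in the window."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in the range (0, 1]")
    return (1.0 - target) * window_days * 24 * 60


INTERNAL_SLO = 0.999  # hypothetical internal objective
CUSTOMER_SLA = 0.995  # hypothetical contractual promise, looser than the SLO

slo_minutes = allowed_downtime_minutes(INTERNAL_SLO)  # roughly 43 minutes
sla_minutes = allowed_downtime_minutes(CUSTOMER_SLA)  # roughly 216 minutes

# The gap is the safety margin: the SLO is breached well before the SLA.
margin_minutes = sla_minutes - slo_minutes
```

With these assumed targets the internal objective is broken after about 43 minutes of downtime, leaving a margin of nearly three hours before the customer-facing promise is at risk.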

SLIs, SLOs and SLAs tie back closely to the DevOps pillar of “measure everything” and are one of the reasons we say class SRE implements DevOps.

3. Risk and error budgets

We focus here on measuring risk through error budgets, which are quantitative ways in which SREs collaborate with product owners to balance availability and feature development. This video also discusses why 100% is not a viable availability target.

Maximizing a system’s stability is both counterproductive and pointless. Unrealistic reliability targets limit how quickly new features can be delivered to users, and users typically won’t notice extreme availability (like 99.999999%) because the quality of their experience is dominated by less reliable components like ISPs, cellular networks or WiFi. Having a 100% availability requirement severely limits a team or developer’s ability to deliver updates and improvements to a system. Service owners who want to deliver many new features should opt for less stringent SLOs, thereby giving them the freedom to continue shipping in the event of a bug. Service owners focused on reliability can choose a higher SLO, but accept that breaking that SLO will delay feature releases. The SRE discipline quantifies this acceptable risk as an “error budget.” When error budgets are depleted, the focus shifts from feature development to improving reliability.
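The error budget idea reduces to simple arithmetic. The sketch below uses a request-based budget: the SLO target, the request counts and the function names are assumptions for this example, and the "focus" rule is a simplified reading of the policy described above.

```python
def error_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests that may fail in the window without breaking the SLO."""
    return (1 - slo_target) * total_requests


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves nothing to spend on releases or incidents.
        raise ValueError("a 100% SLO leaves no error budget")
    return 1 - failed_requests / budget


def next_focus(remaining: float) -> str:
    """Simplified policy: ship features while budget remains, else fix reliability."""
    return "feature development" if remaining > 0 else "reliability work"
```

With a hypothetical 99.9% SLO over one million requests, the budget is about 1,000 failed requests: 250 failures leave roughly three quarters of the budget, while 1,500 failures overspend it and shift the focus to reliability.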

As mentioned in the second video, leadership buy-in is an important pillar in the SRE discipline. Without this cooperation, nothing prevents teams from breaking their agreed-upon SLOs, forcing SREs to work overtime or waste too much time toiling to just keep the systems running. If SRE teams do not have the ability to enforce error budgets (or if the error budgets are not taken seriously), the system fails.

Risk and error budgets quantitatively accept failure as normal and enforce the DevOps pillar to implement gradual change. Non-gradual changes risk exceeding error budgets.

4. Toil and toil budgets

An important component of the SRE discipline is toil, toil budgets and ways to reduce toil. Toil occurs each time a human operator needs to manually touch a system during normal operations—but the definition of “normal” is constantly changing.

Toil is not simply “work I don’t like to do.” For example, the following tasks are overhead, but are specifically not toil: submitting expense reports, attending meetings, responding to email, commuting to work, etc. Instead, toil is specifically tied to the running of a production service. It is work that tends to be manual, repetitive, automatable, tactical and devoid of long-term value. Additionally, toil tends to scale linearly as the service grows. Each time an operator needs to touch a system, such as responding to a page, working a ticket or unsticking a process, toil has likely occurred.

The SRE discipline aims to reduce toil by focusing on the “engineering” component of Site Reliability Engineering. When SREs find tasks that can be automated, they work to engineer a solution to prevent that toil in the future. While minimizing toil is important, it’s realistically impossible to completely eliminate. Google aims to ensure that at least 50% of each SRE’s time is spent doing engineering projects, and these SREs individually report their toil in quarterly surveys to identify operationally overloaded teams. That being said, toil is not always bad. Predictable, repetitive tasks are great ways to onboard a new team member and often produce an immediate sense of accomplishment and satisfaction with low risk and low stress. Long-term toil assignments, however, quickly outweigh the benefits and can cause career stagnation.
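The 50% guideline can be checked mechanically once toil is reported. The sketch below assumes a survey format invented for this example: each team maps to a list of `(toil_hours, engineering_hours)` pairs, one per SRE. Overhead such as meetings is left out for simplicity, and teams are flagged when toil exceeds half of the reported time.

```python
from __future__ import annotations


def toil_fraction(toil_hours: float, engineering_hours: float) -> float:
    """Share of reported time that was spent on toil."""
    total = toil_hours + engineering_hours
    if total <= 0:
        raise ValueError("no hours reported")
    return toil_hours / total


def overloaded_teams(survey: dict[str, list[tuple[float, float]]],
                     limit: float = 0.5) -> list[str]:
    """Return the teams whose aggregate toil share exceeds the limit."""
    flagged = []
    for team, responses in survey.items():
        toil = sum(t for t, _ in responses)
        engineering = sum(e for _, e in responses)
        if toil_fraction(toil, engineering) > limit:
            flagged.append(team)
    return sorted(flagged)
```

For instance, a team reporting 55 toil hours against 25 engineering hours would be flagged, while one reporting 10 against 30 would not.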

Toil and toil budgets are closely related to the DevOps pillars of “measure everything” and “reduce organizational silos.”

5. Customer Reliability Engineering (CRE)

Finally, Customer Reliability Engineering (CRE) completes the tenets of SRE (with the help in the video of a futuristic friend). CRE aims to teach SRE practices to customers and service consumers.

In the past, Google did not talk publicly about SRE. We thought of it as a competitive advantage we had to keep secret from the world. However, every time a customer had a problem because they used a system in an unexpected way, we had to stop innovating and help solve the problem. That tiny bit of friction, spread across billions of users, adds up very quickly. It became clear that we needed to start talking about SRE publicly and teaching our customers about SRE practices so they could replicate them within their organizations.

Thus, in 2016, we launched the CRE program as both a means of helping our Google Cloud Platform (GCP) customers with improving their reliability, and a means of exposing Google SREs directly to the challenges customers face. The CRE program aims to reduce customer anxiety by teaching them SRE principles and helping them adopt SRE practices.

CRE aligns with the DevOps pillar of “reduce organizational silos” by forcing collaboration across organizations, and it also closely relates to the concepts of “accepting failure as normal” and “measure everything” by creating a shared responsibility among all stakeholders in the form of shared SLOs.

Looking forward with SRE

We are working on some exciting new content across a variety of mediums to help showcase how users can adopt DevOps and SRE on Google Cloud, and we cannot wait to share them with you. What SRE topics are you interested in hearing about? Please give us a tweet or watch our videos.


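Chinese translation

The translation was originally written in Traditional Chinese; I converted it to Simplified Chinese and replaced the videos with bilibili mirrors.

[Translation of a good article] Are you looking for SRE or DevOps?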

Neil Wei in KKStream
Aug 3, 2018

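Over the past six months my company has been hiring heavily, and many of the openings are for DevOps and SRE roles. However, HR (and colleagues in other departments) do not understand how the two differ. Some even think SRE is just the traditional operations team under a new name, moving from managing server rooms to managing the cloud. So what is the difference between the two?

This leaves a bad impression on candidates who come to interview, as if we ourselves do not know what kind of people we are looking for. So I spent some time explaining the difference to HR, and after four months of supporting the SRE team I am leaving this translation along with a few thoughts of my own.

First, remember…

SRE is a DevOps (a banana is a fruit)

DevOps is NOT an SRE (a fruit is not a banana)

DevOps is not a job title; SRE is.

(Translated and published with the permission of Seth Vargo, one of the original authors.)

Original article: https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html?m=1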


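Main text

Site Reliability Engineering (SRE) and DevOps are currently very popular development and operations cultures, and they overlap heavily. Early on, some people regarded SRE as a practice distinct from DevOps and felt they had to choose one of the two. Today, most people lean toward the view that the two are in fact very similar.

What do SRE and DevOps have in common? Earlier this year, two Google engineers (Liz Fong-Jones and Seth Vargo) produced a video series to answer these questions and to help reduce disagreement between the communities. This article summarizes the topics covered in the videos and how to actually build more reliable systems.


1. The difference between SRE and DevOps

Before we begin, let's look at where SRE and DevOps are alike and where they differ.

The DevOps culture arose because, in the early days (about ten years ago), many developers knew very little about how their programs actually ran in the real world. A developer's job was to package the program and hand it to the operations department, at which point their part of the cycle was over. Operations then installed and deployed it to every production machine, and had to use every trick and tool available to keep it running properly, even though they knew nothing about how the program was implemented.

This way of working easily sets the two departments against each other. Each has its own goals, and those goals may not line up with the company's business needs. DevOps emerged to bring a new software development culture that narrows the gap between development and operations.

However, DevOps does not, by nature, teach people how to succeed. Instead it lays down some basic principles and lets everyone work out the rest. In programming terms, DevOps is more like an abstract class or an interface: it defines what behavior the culture should have, while the implementation is decided jointly by the members of each department. Anything that satisfies this "interface" can be called a practice of DevOps culture.

The term SRE was coined by Google. It is a set of norms and practices that Google developed over more than a decade to cope with its ever-growing internal systems. Unlike DevOps, it implements the abstract methods that DevOps defines, and it prescribes in more detail how to use software engineering methods, from an operations point of view, to keep systems stable. Put simply, SRE implements the DevOps interface. Below are the five items that DevOps defines and how SRE implements each: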

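DevOps: Reduce organizational silos

SRE: Throughout the development cycle, use the same tools as the development team and share ownership with it. (Note: infrastructure as code, configuration as code)

DevOps: Accept failure, and treat it as an element of the development cycle

SRE: For new releases, establish quantifiable indicators for measuring "accidents" and "failures"

DevOps: Implement gradual change

SRE: Encourage teams to deliver quickly by lowering the cost of recovering from failure (that is, there is no need to get everything right in one go; change gradually instead)

DevOps: Make good use of tools and automation

SRE: Encourage teams to automate away this year's work, minimize what has to be done by hand, and put their energy into medium- and long-term system improvement.

DevOps: Everything can be measured

SRE: Believe that operations falls within software engineering, and prescribe measurements for availability, uptime, outages, what counts as toil, and so on.

If you accept that DevOps is an interface, then in programming-language terms: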

class SRE implements DevOps

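Although in practice each still has many principles of its own, and SRE does not implement every DevOps concept one-to-one, the two ultimately reach the same conclusions. As in a programming language, a class that implements an interface can extend it further and can also implement other interfaces. SRE includes many details that DevOps never defined.

In software development and operations, DevOps and SRE are not competing over which is the industry standard. On the contrary, both were designed to reduce barriers between organizations and to deliver better software faster. For more detail, How SRE relates to DevOps (Betsy Beyer, Niall Richard Murphy, Liz Fong-Jones) is worth reading.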


2. SLIs, SLOs, and SLAs

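One SRE principle is to give different measurements to different roles. For engineers, PMs and customers, how available the whole system is, and how that should be measured, are presented in different ways.

Without a way to measure a system's uptime and availability, it is very hard to operate a system that is already live. The operations team often ends up permanently firefighting, and when the root cause is finally found, it may turn out to be a problem in code written by the development team.

Without an agreed way to measure uptime and availability, development teams tend not to see "stability" as a potential problem. This issue troubled Google for many years, and it is one of the motivations for developing the SRE principles.

SRE makes sure everyone knows how to measure reliability and what to do when the service fails. This goes into enough detail that, when a problem occurs, everyone from VPs and CxOs down to every relevant employee in the organization has their own job to do. What each person should do is clearly specified. SREs talk with all stakeholders to decide on Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

SLIs are metrics that describe how the system responds, such as response time, throughput per second and request volume. They are often converted into a rate or an average.

SLOs are the levels that SLIs are expected to hold over a time window, agreed on with stakeholders, for example "the level the SLIs must reach each month". They are more of an internal target.

The video also discusses Service Level Agreements (SLAs), even though they are not a number SREs worry about every day. For an online service provider, an SLA is a promise to customers that guarantees the percentage of time the service keeps running. It is usually negotiated with the customer, for example that downtime per year (or per month) must not exceed a certain number of minutes.

The concepts of SLI, SLO and SLA closely match the DevOps idea that "everything can be measured", which is one of the reasons we say class SRE implements DevOps.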


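3. Risk and error budgets

We assess risk with error budgets. An error budget is a quantified value that describes how long a service may fail per day (or per month). If the service's SLO is 99.9%, the development team has an error budget of 0.1% to spend. The value is a balance struck in discussion with the product owner and the development team. The video below also explains why a zero error budget is not a suitable value.

Striving to keep a system 100% available is exhausting and pointless. Unrealistic targets limit how fast the development team can get new features to users, and users mostly will not notice the difference (say, 99.999999% reliability), because their ISP, 3G/4G network or home WiFi is probably less reliable than that. Striving for a 100% uninterrupted service severely limits how quickly the team can ship new features. To meet such a harsh constraint, developers often choose not to fix bugs, not to add features and not to improve the system. Instead, some slack should be kept so that the development team has room to work.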
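A back-of-the-envelope calculation shows why users rarely notice extreme availability. The sketch assumes the service and the user's network fail independently and sit in series, so their availabilities multiply. All three figures are hypothetical.

```python
import math


def end_to_end_availability(component_availabilities: list) -> float:
    """Availability seen by the user when every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(component_availabilities)


SERVICE_EXTREME = 0.99999999  # hypothetical "eight nines" service
SERVICE_MODEST = 0.9999       # hypothetical "four nines" service
USER_NETWORK = 0.99           # hypothetical ISP / mobile / WiFi availability

extreme = end_to_end_availability([SERVICE_EXTREME, USER_NETWORK])
modest = end_to_end_availability([SERVICE_MODEST, USER_NETWORK])

# Both results stay just below the network's own 99%; the extra nines
# change what the user experiences by less than 0.01 percentage points.
difference = extreme - modest
```

Under these assumptions the weakest component dominates, which is the article's argument for not chasing targets far beyond what the rest of the path can deliver.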

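One SRE principle is to work out a tolerable "error budget". Only once that budget is used up should the focus shift from continued feature development to improving reliability.

As the second video mentions, getting management to buy into this culture is the most important thing, because the SLOs are set by everyone together. If people do not play by the rules, SREs are again reduced to endless grunt work just to keep the system reasonably stable, nobody knows about it (because no standard was set), and in the end the service is bound to fail. Risk and error budgets treat mistakes as normal, and one way to improve is to release new features continuously and in small increments, which also matches DevOps principles.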


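4. Toil and toil budgets

Another SRE principle is keeping toil under control. What is toil, and how do we reduce it?

Toil is operations work that is manual, repetitive and automatable, or one-off work with no lasting value.

Toil is not "things I don't want to do", though. A company has many routine activities that recur again and again, such as meetings, communication and replying to email, and none of these is toil.

By contrast, manually logging in to a machine every day, fetching a file, processing it and sending the result out as a report is toil, because it is manual, repetitive and automatable.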
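The daily report in that example is the kind of task a short script can take over. The sketch below covers only the "process the file and build the report" step, and it assumes a hypothetical log format in which each line starts with a level such as ERROR or INFO. Fetching the file from the machine and sending the report are left out, since they depend on the environment.

```python
from collections import Counter
from pathlib import Path


def summarize_log(path: Path) -> Counter:
    """Count log lines by their first whitespace-separated token (the level)."""
    counts = Counter()
    with path.open() as log_file:
        for line in log_file:
            parts = line.split(maxsplit=1)
            if parts:  # skip blank lines
                counts[parts[0]] += 1
    return counts


def render_report(counts: Counter) -> str:
    """Format the counts as one 'LEVEL: n' line per level, sorted by level."""
    return "\n".join(f"{level}: {n}" for level, n in sorted(counts.items()))
```

Run from a scheduler, a script like this replaces the manual login and copy-paste, and the time saved goes back into engineering work.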

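The SRE principle is to try to eliminate this work with software engineering. When SREs find that something can be automated, they start building the automation so that the same work does not have to be done again. Minimizing toil matters, but in practice it can never be eliminated completely. Google aims to keep each SRE's day-to-day toil under 50% so that SREs can spend their time on more meaningful work, and the results are reviewed each quarter.

Toil is not entirely bad, either. For new team members, starting with routine tasks helps them learn what the service requires, with relatively low risk and low stress. In the long run, though, no engineer should be doing toil all the time.

Toil management also matches the DevOps principles that everything can be measured and that organizational silos should be reduced.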


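5. Customer Reliability Engineering (CRE)

Personally I feel this topic strays a little far for now, so I will not translate it sentence by sentence.

In short, it is about spreading SRE concepts so that GCP customers know how to use GCP services correctly, and about promoting an SRE culture.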


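Personal afterword

My company is in fact gradually transforming. We are on the road from traditional development and operations working independently of each other toward practicing a DevOps culture. In four months of supporting the SRE department I took part in many real-world challenges and worked with everyone to define automation processes and reduce existing toil. I hope everyone takes away the following:

SRE is software engineering; SREs should not be just operations staff or system administrators.

DevOps is not a job title; SRE is. You would not walk up to a vegetable stall and tell the vendor you want to buy "greens"; you would say whether you want cabbage or bok choy!

The ideal is always perfect, but we still have to face reality. Our company is not Google, and most people will never get into Google. A Google SRE may write code better than most companies' software engineers, understand networking better than network engineers and understand operations better than operations engineers. Many of the SRE openings around us are not as rosy as imagined: toil and manual work may still make up the majority, barriers between departments remain, there may be plenty of SREs who cannot code, and operations still takes up most of the day.

Traditional operations staff or IT network administrators who want to move toward SRE also need to change their mindset and leave their comfort zone. In an era when everything is "as code" and everything is "as a service", those who do not write code will simply be left behind.

Change is slow and has to be cultivated over time, so let us… huh… a P0 incident just fired! That's all for now!

Further reading

Thanks to everyone who shares their work and keeps pushing technology forward.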
