Part IV: The Second Way - The Technical Practices of Feedback
Part four explores:
- Creating telemetry to enable seeing and solving problems
- Using our telemetry to better anticipate problems and achieve goals
- Integrating user research and feedback into the work of product teams
- Enabling feedback so Dev and Ops can safely perform deployments
- Enabling feedback to increase the quality of our work through peer reviews and pair programming
Create telemetry to enable seeing and solving problems
- Strong DevOps performers have a Mean Time To Recover (MTTR) 165x better than their peers, with roughly 80% of MTTR typically spent determining what changed (see Puppet Labs' annual State of DevOps reports)
- Two top predictors of DevOps success:
- Operations (and Infrastructure) code is highly automated and under source control
- Proactive monitoring and telemetry
The authors share Turnbull's best-practice framework for monitoring (telemetry collection and consumption):
- Covers all layers: business, application, infrastructure, client software, deployment pipeline
- Centralize data for ease of access
- Self-service enabled for creation, access, analysis etc.
- Highly available
Make it easy, on a daily basis, to:
- Create telemetry
- Relate telemetry to
- production changes, deployments, release events
- business events etc.
- See - provide information radiators to all major stakeholders: Business (e.g. A/B test outcomes, usage anomalies...), Operations and Infrastructure, Development, Security...
Choosing the right logging level is important. Dan North, a former ThoughtWorks consultant who was involved in several projects in which the core continuous delivery concepts took shape, observes, “When deciding whether a message should be ERROR or WARN, imagine being woken up at 4 a.m. Low printer toner is not an ERROR.”
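Dan North's 4 a.m. test can be captured as a simple severity-mapping rule. A minimal sketch, assuming a hypothetical `toner_event` helper for his printer example (the thresholds are illustrative, not from the book):

```python
import logging

def toner_event(percent_remaining: int) -> int:
    """Map a toner reading to a log level using the 4 a.m. test:
    only conditions that actually stop service merit ERROR."""
    if percent_remaining == 0:
        return logging.ERROR    # printing is genuinely down
    if percent_remaining < 10:
        return logging.WARNING  # inconvenient, but nobody should be paged
    return logging.INFO         # routine status
```

Separating the classification from the logging call keeps the severity policy testable and reviewable on its own.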
To help ensure that we have information relevant to the reliable and secure operations of our service, we should ensure that all potentially significant application events generate logging entries, including those provided on this list assembled by Anton A. Chuvakin, a research VP at Gartner’s GTP Security and Risk Management group:
- Authentication/ authorization decisions (including logoff)
- System and data access
- System and application changes (especially privileged changes)
- Data changes, such as adding, editing, or deleting data
- Invalid input (possible malicious injection, threats, etc.)
- Resources (RAM, disk, CPU, bandwidth, or any other resource that has hard or soft limits)
- Health and availability
- Startups and shutdowns
- Faults and errors
- Circuit breaker trips
- Backup success/ failure
At the heart of monitoring and telemetry is the need for healthy organizations to make the reality of complex environments visible. We must avoid the counterproductive, blame-based "mean time until proven innocent" culture, in which rumor and hearsay fill the gap left by a lack of readily available facts from across the platform (app, ops, infra, sec).
Analyze telemetry to better anticipate problems and achieve goals
Goal: Create tools to find ever weaker failure signals in production telemetry so we can avoid catastrophic failures
Key idea: Identify outliers in Gaussian and non-Gaussian data sets, i.e. data with normal and other distributions (e.g. chi-squared operations data)
Netflix used outlier detection to identify servers, among thousands, that were unusual and likely misbehaving. They didn't try to understand what was wrong: it was easier and faster to simply destroy and recreate the servers. This improved the customer experience and cost far less than analyzing each outlier's problems.
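The book doesn't show Netflix's tooling; as a minimal sketch of the idea, a robust fleet-wide outlier check using the median absolute deviation (the metric name, threshold, and 0.6745 modified-z-score constant are assumptions of this illustration, not from the source):

```python
import statistics

def outlier_servers(latency_by_host: dict, threshold: float = 3.5) -> list:
    """Flag hosts whose latency deviates wildly from the fleet median.
    Uses a modified z-score based on median absolute deviation (MAD),
    which stays robust when the data isn't bell-curved."""
    values = list(latency_by_host.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [host for host, v in latency_by_host.items()
            if 0.6745 * abs(v - med) / mad > threshold]
```

A flagged host would then simply be destroyed and recreated rather than diagnosed.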
As usual the overall process (the what) is relatively simple and the implementation (the how) is tricky. In short:
- Analyze the data collected during severe incidents
- Identify the telemetry that would have enabled faster detection and diagnosis and alert on them
- Identify the telemetry that would have verified that the problem was resolved
- Continue this process to identify ever weaker failure signals
Many operational data sets are not Gaussian, i.e. not bell-curved. Apply smoothing techniques to avoid over-alerting and creating operational noise. Examples include fast Fourier transforms, linear regression, rolling averages, etc.
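A rolling average is the simplest of these smoothing techniques. A minimal sketch of alerting on the smoothed value rather than raw samples (class name, window size, and threshold are illustrative assumptions):

```python
from collections import deque

class SmoothedAlert:
    """Fire only when the rolling average of a metric exceeds a
    threshold, so a single noisy spike doesn't page anyone."""
    def __init__(self, window: int = 5, threshold: float = 100.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history to judge yet
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold  # True => raise the alert
```

One transient spike is absorbed by the window; only a sustained elevation crosses the smoothed threshold.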
A handful of example charts from the book to illustrate:
Feedback enables development and operations to safely deploy code and release features
DevOps success = automated deployments + integrated monitoring and telemetry + shared feeling of responsibility for the health of the entire value stream.
- Small, frequent changes that all can inspect and understand i.e. smooth, continuous flow
- Drive out fear! Focus on Mean Time To Recover rather than Mean Time Between Failures
- Tie telemetry reporting to deployments, as they are typically the top cause of failure
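Tying telemetry to deployments can be as simple as keeping a deployment-event log and checking it when an alert fires. A minimal sketch (the function names, window, and in-memory list are hypothetical, not from the book):

```python
from datetime import datetime, timedelta

deployments = []  # list of (timestamp, description) deployment markers

def record_deploy(description: str, when: datetime) -> None:
    """Record a deployment event so alerts can be correlated with it."""
    deployments.append((when, description))

def recent_deploy(alert_time: datetime, window_minutes: int = 30):
    """Return the most recent deployment in the window before an alert.
    Since deployments are typically the top cause of failure, this is
    the first question an on-call responder should be able to answer."""
    cutoff = alert_time - timedelta(minutes=window_minutes)
    candidates = [(t, d) for t, d in deployments if cutoff <= t <= alert_time]
    return max(candidates)[1] if candidates else None
```

In practice the same markers would be overlaid on metric graphs so anyone can see "what changed" at a glance.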
How do we inspire the needed collaboration and shared accountability across development, infrastructure, operations, etc.? The authors aim to create the conditions for empowered empathy by enriching the experiences of their people:
- Have developers experience the upstream and downstream impacts of their choices and omissions
- Pain users feel
- Pain downstream players feel keeping brittle things running
- Have developers work with operations, infrastructure, etc. to identify telemetry and automation opportunities
- Developers self-manage their applications post release for some amount of time prior to operations acceptance.
- If you don't have it, consider creating Launch Guidance. Launch guidance and requirements will likely include the following:
- Defect counts and severity: Does the application actually perform as designed?
- Type/ frequency of pager alerts: Is the application generating an unsupportable number of alerts in production?
- Monitoring coverage: Is the coverage of monitoring sufficient to restore service when things go wrong?
- System architecture: Is the service loosely-coupled enough to support a high rate of changes and deployments in production?
- Deployment process: Is there a predictable, deterministic, and sufficiently automated process to deploy code into production?
- Production hygiene: Is there evidence of enough good production habits that would allow production support to be managed by anyone else?
- Design for operations i.e. include definition of telemetry, monitoring and other operational requirements at the outset of every IT project
- For functional organizations, consider Google's Site Reliability Engineering (SRE) model
Hypothesis driven development and A/B testing
Fast, safe deployments enable us to learn from our customers' actual vs. predicted behavior in response to product changes. The Lean Startup explores this much broader topic in depth, e.g. the value hypothesis and the growth hypothesis.
A/B or Split Test website example:
- Create two cohorts - control and treatment
- Assign the current feature to the control
- Assign the new feature to the treatment
- Statistically significant results determine success or failure of a new feature
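The significance check in the last step is commonly done with a two-proportion z-test (one of several valid approaches; the book doesn't prescribe one). A minimal sketch with hypothetical names and a conventional alpha of 0.05:

```python
from math import sqrt, erf

def ab_significant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   alpha: float = 0.05) -> bool:
    """Two-sided, two-proportion z-test: is the treatment cohort's
    conversion rate significantly different from the control's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    # normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_value < alpha
```

A 10% vs. 15% conversion split over 1,000 users per cohort is significant; 10% vs. 10.5% is indistinguishable from noise at this sample size.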
Some great-sounding feature ideas actually make things worse, despite our organization's belief to the contrary.
Typical hypothesis example:
- We believe bigger pictures of hotels on the booking page
- Will result in improved conversion rates
- We will adopt it if we see at least an x% conversion improvement
Create review and coordination processes to increase the quality of our current work
This chapter of the book focuses on an irony the authors each find all too familiar. High-profile incidents tend to generate two "counterfactual arguments", i.e. arguments satisfying our human tendency to quickly identify plausible alternatives (ones that seem right) rather than ones that are factually correct. The two most common are:
- This was a change control failure - we should've had better reviews in place
- This was a testing failure - they should have found this sooner and recovered from it faster
These arguments typically lead to a downward spiral: more reviews are added, often from external reviewers who are farther from the context. These reviews require more complex coordination and push toward larger batch sizes. The distance and batch size reduce the effectiveness of the review process and, ironically, tend to reduce rather than improve quality, leading to more severe and more frequent production issues. One of the authors began their career as an auditor and found this irony a very painful lesson to learn.
The authors basically argue, based on value stream analysis, that the only practical means of ensuring quality is to build it in via:
- Small batches of
- Loosely coupled units of code
- Designed to address relevant functional and non-functional requirements
- Continuously integrated and tested via automated tests and
- Frequently reviewed by competent staff closest to the work
And they point out that the Agile practices of Extreme Programming (XP) enable all of the above.
Other key practices:
- Pull requests (peer review)
- Static code analysis tools
These practices are deceptively simple, i.e. they are hard to get right. They depend on people self-organizing, bottom-up, to address the inherent human and procedural complexities of software delivery and operation.
Supporting charts from the book:
External, cross-organizational reviews of high-risk changes are still needed and can add significant value. But they must be carefully managed and appropriately limited, because building quality into the product, as shown above, yields far higher returns at lower risk.
Next up: DevOps Handbook Summary 3 of 4 - The Third Way