Part IV: The Second Way - The Technical Practices of Feedback
Part four explores:
- Creating telemetry to enable seeing and solving problems
- Using our telemetry to better anticipate problems and achieve goals
- Integrating user research and feedback into the work of product teams
- Enabling feedback so Dev and Ops can safely perform deployments
- Enabling feedback to increase the quality of our work through peer reviews and pair programming
Create telemetry to enable seeing and solving problems
- Strong DevOps performers have a Mean Time To Recover (MTTR) 165x better than their peers, with roughly 80% of MTTR typically spent determining what changed (see Puppet Labs' annual State of DevOps reports)
- Two top predictors of DevOps success:
- Operations (and Infrastructure) code is highly automated and under source control
- Proactive monitoring and telemetry
The authors share Turnbull's best-practice framework for monitoring (telemetry collection and consumption):
- Covers all layers: business, application, infrastructure, client software, deployment pipeline
- Centralize data for ease of access
- Self-service enabled for creation, access, analysis etc.
- Highly available
Make it easy, on a daily basis, to:
- Create telemetry
- Relate telemetry to
- production changes, deployments, release events
- business events etc.
- See - provide information radiators to all major stakeholders: Business (e.g. A/B test outcomes, usage anomalies...), Operations and Infrastructure, Development, Security...
Choosing the right logging level is important. Dan North, a former ThoughtWorks consultant who was involved in several projects in which the core continuous delivery concepts took shape, observes, “When deciding whether a message should be ERROR or WARN, imagine being woken up at 4 a.m. Low printer toner is not an ERROR.”
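Dan North's 4 a.m. test can be captured as a simple severity-mapping rule. A minimal sketch, assuming a hypothetical `toner_event` helper for his printer example (the thresholds are illustrative, not from the book):

```python
import logging

def toner_event(percent_remaining: int) -> int:
    """Map a toner reading to a log level using the 4 a.m. test:
    only conditions that actually stop service merit ERROR."""
    if percent_remaining == 0:
        return logging.ERROR    # printing is genuinely down
    if percent_remaining < 10:
        return logging.WARNING  # inconvenient, but nobody should be paged
    return logging.INFO         # routine status
```

Separating the classification from the logging call keeps the severity policy testable and reviewable on its own.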
To help ensure that we have information relevant to the reliable and secure operations of our service, we should ensure that all potentially significant application events generate logging entries, including those provided on this list assembled by Anton A. Chuvakin, a research VP at Gartner’s GTP Security and Risk Management group:
- Authentication/ authorization decisions (including logoff)
- System and data access
- System and application changes (especially privileged changes)
- Data changes, such as adding, editing, or deleting data
- Invalid input (possible malicious injection, threats, etc.)
- Resources (RAM, disk, CPU, bandwidth, or any other resource that has hard or soft limits)
- Health and availability
- Startups and shutdowns
- Faults and errors
- Circuit breaker trips
- Backup success/ failure
At the heart of monitoring and telemetry is the need for healthy organizations to make the reality of complex environments visible. We must avoid the counterproductive, blame-based "mean time until proven innocent" culture, in which rumor and hearsay fill the gap left by a lack of readily available facts from across the platform (app, ops, infra, sec).
Analyze telemetry to better anticipate problems and achieve goals
Goal: Create tools to find ever weaker failure signals in production telemetry so we can avoid catastrophic failures
Key idea: Identify outliers in Gaussian and non-Gaussian data sets, i.e. data with normal and other distributions (e.g. chi-squared operations data)
Netflix used outlier detection to identify servers, among thousands, that were unusual and likely misbehaving. They didn't try to understand what was wrong: it was easier and faster to simply destroy and recreate the servers. This improved the customer experience and cost far less than analyzing each outlier's problems.
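The book doesn't show Netflix's tooling; as a minimal sketch of the idea, a robust fleet-wide outlier check using the median absolute deviation (the metric name, threshold, and 0.6745 modified-z-score constant are assumptions of this illustration, not from the source):

```python
import statistics

def outlier_servers(latency_by_host: dict, threshold: float = 3.5) -> list:
    """Flag hosts whose latency deviates wildly from the fleet median.
    Uses a modified z-score based on median absolute deviation (MAD),
    which stays robust when the data isn't bell-curved."""
    values = list(latency_by_host.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9
    return [host for host, v in latency_by_host.items()
            if 0.6745 * abs(v - med) / mad > threshold]
```

A flagged host would then simply be destroyed and recreated rather than diagnosed.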
As usual the overall process (the what) is relatively simple and the implementation (the how) is tricky. In short:
- Analyze the data collected during severe incidents
- Identify the telemetry that would have enabled faster detection and diagnosis and alert on them
- Identify the telemetry that would have verified that the problem was resolved
- Continue this process to identify ever weaker failure signals
Many operational data sets are not Gaussian, i.e. not bell-curved. Apply smoothing techniques to avoid over-alerting and creating operational noise. Examples include fast Fourier transforms, linear regression, rolling averages, etc.
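A rolling average is the simplest of these smoothing techniques. A minimal sketch of alerting on the smoothed value rather than raw samples (class name, window size, and threshold are illustrative assumptions):

```python
from collections import deque

class SmoothedAlert:
    """Fire only when the rolling average of a metric exceeds a
    threshold, so a single noisy spike doesn't page anyone."""
    def __init__(self, window: int = 5, threshold: float = 100.0):
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, value: float) -> bool:
        self.samples.append(value)
        if len(self.samples) < self.samples.maxlen:
            return False  # not enough history to judge yet
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold  # True => raise the alert
```

One transient spike is absorbed by the window; only a sustained elevation crosses the smoothed threshold.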
A handful of example charts from the book to illustrate:
Feedback enables development and operations to safely deploy code and release features
DevOps success = automated deployments + integrated monitoring and telemetry + shared feeling of responsibility for the health of the entire value stream.
- Small, frequent changes that all can inspect and understand i.e. smooth, continuous flow
- Drive out fear! Focus on Mean Time To Recover rather than Mean Time Between Failures
- Tie telemetry reporting to deployments, as they are typically the top cause of failure
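Tying telemetry to deployments can be as simple as keeping a deployment-event log and checking it when an alert fires. A minimal sketch (the function names, window, and in-memory list are hypothetical, not from the book):

```python
from datetime import datetime, timedelta

deployments = []  # list of (timestamp, description) deployment markers

def record_deploy(description: str, when: datetime) -> None:
    """Record a deployment event so alerts can be correlated with it."""
    deployments.append((when, description))

def recent_deploy(alert_time: datetime, window_minutes: int = 30):
    """Return the most recent deployment in the window before an alert.
    Since deployments are typically the top cause of failure, this is
    the first question an on-call responder should be able to answer."""
    cutoff = alert_time - timedelta(minutes=window_minutes)
    candidates = [(t, d) for t, d in deployments if cutoff <= t <= alert_time]
    return max(candidates)[1] if candidates else None
```

In practice the same markers would be overlaid on metric graphs so anyone can see "what changed" at a glance.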
How do we inspire the needed collaboration and shared accountability across development, infrastructure, operations, etc.? The authors aim to create the conditions for empowered empathy by enriching the experiences of their people:
- Have developers experience the upstream and downstream impacts of their choices and omissions
- Pain users feel
- Pain downstream players feel keeping brittle things running
- Have developers work with operations, infrastructure, etc. to identify telemetry and automation opportunities
- Developers self-manage their applications post release for some amount of time prior to operations acceptance.
- If you don't have it, consider creating Launch Guidance. Launch guidance and requirements will likely include the following:
- Defect counts and severity: Does the application actually perform as designed?
- Type/ frequency of pager alerts: Is the application generating an unsupportable number of alerts in production?
- Monitoring coverage: Is the coverage of monitoring sufficient to restore service when things go wrong?
- System architecture: Is the service loosely-coupled enough to support a high rate of changes and deployments in production?
- Deployment process: Is there a predictable, deterministic, and sufficiently automated process to deploy code into production?
- Production hygiene: Is there evidence of enough good production habits that would allow production support to be managed by anyone else?
- Design for operations i.e. include definition of telemetry, monitoring and other operational requirements at the outset of every IT project
- For functional organizations, consider Google's Site Reliability Engineering (SRE) model
Hypothesis driven development and A/B testing
Fast, safe deployments enable us to learn from our customers' actual vs. predicted behavior in response to product changes. The Lean Startup explores this much broader topic in depth, e.g. the value hypothesis and the growth hypothesis.
A/B or Split Test website example:
- Create two cohorts - control and treatment
- Assign the current feature to the control
- Assign the new feature to the treatment
- Statistically significant results determine success or failure of a new feature
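The significance check in the last step is commonly done with a two-proportion z-test (one of several valid approaches; the book doesn't prescribe one). A minimal sketch with hypothetical names and a conventional alpha of 0.05:

```python
from math import sqrt, erf

def ab_significant(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   alpha: float = 0.05) -> bool:
    """Two-sided, two-proportion z-test: is the treatment cohort's
    conversion rate significantly different from the control's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)            # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))   # standard error
    z = (p_b - p_a) / se
    # normal CDF via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_value < alpha
```

A 10% vs. 15% conversion split over 1,000 users per cohort is significant; 10% vs. 10.5% is indistinguishable from noise at this sample size.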
Some great-sounding feature ideas actually make things worse, despite our organization's belief to the contrary.
Typical hypothesis example:
- We believe bigger pictures of hotels on the booking page
- Will result in improved conversion rates
- We will adopt it if we see at least an x% conversion improvement
Create review and coordination processes to increase the quality of our current work
This chapter of the book focuses on an irony the authors each find all too familiar. High-profile incidents tend to generate two "counterfactual arguments", i.e. arguments satisfying our human tendency to quickly identify plausible alternatives (ones that seem right) rather than ones that are factually correct. The two most common are:
- This was a change control failure - we should've had better reviews in place
- This was a testing failure - they should have found this sooner and recovered from it faster
These arguments typically lead to a downward spiral: more reviews are added, often from external reviewers who are farther from the context. These reviews require more complex coordination and push toward larger batch sizes. The distance and batch size reduce the effectiveness of the review process and, ironically, tend to reduce rather than improve quality, leading to more severe and more frequent production issues. One of the authors began their career as an auditor and found this irony a very painful lesson to learn.
The authors basically argue, based on value stream analysis, that the only practical means of ensuring quality is to build it in via:
- Small batches of
- Loosely coupled units of code
- Designed to address relevant functional and non-functional requirements
- Continuously integrated and tested via automated tests and
- Frequently reviewed by competent staff closest to the work
And they point out that the Agile practices of Extreme Programming (XP) enable all of the above.
Other key practices:
- Pull requests (peer review)
- Static code analysis tools
These practices are deceptively simple, i.e. they are hard to get right. They depend on people self-organizing, bottom-up, to address the inherent human and procedural complexities of software delivery and operation.
Supporting charts from the book:
External, cross-organizational reviews of high-risk changes are still needed and can add significant value. But they must be carefully managed and appropriately limited, because building quality into the product, as shown above, yields far higher returns at lower risk.
Next up: DevOps Handbook Summary 3 of 4 - The Third Way