DevOps Handbook Summary 1 of 4 - The First Way

Book summary of The DevOps Handbook by Gene Kim et al. Excerpted content is formatted in italics.

Part I: Introduction

The first part of the book is an introduction to the three ways:

  • The First Way enables fast left-to-right flow of work from Development to Operations to the customer. In order to maximize flow, we need to make work visible, reduce our batch sizes and intervals of work, build in quality by preventing defects from being passed to downstream work centers, and constantly optimize for the global goals.
  • The Second Way enables the fast and constant flow of feedback from right to left at all stages of our value stream. It requires that we amplify feedback to prevent problems from happening again, or enable faster detection and recovery.
  • The Third Way enables the creation of a generative, high-trust culture that supports a dynamic, disciplined, and scientific approach to experimentation and risk-taking, facilitating the creation of organizational learning, both from our successes and failures. Furthermore, by continually shortening and amplifying our feedback loops, we create ever-safer systems of work and are better able to take risks and perform experiments that help us learn faster than our competition and win in the marketplace.
It sets the overall context and the top three supporting metrics as follows:

In DevOps, we typically define our technology value stream as the process required to convert a business hypothesis into a technology-enabled service that delivers value to the customer.

The three key metrics include:
  • Lead Time
  • Processing Time
  • %Complete and Accurate (%C/A, as judged by the next step in the value stream)
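These three metrics can be made concrete with a small sketch. The step names, times, and %C/A figures below are purely illustrative, as are the two derived measures: the activity ratio (process time as a share of lead time) and the rolled %C/A (the chance a work item passes every step without rework).

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One work center in a value stream (fields are illustrative)."""
    name: str
    lead_time_hours: float        # clock time from handoff-in to handoff-out
    process_time_hours: float     # time actually spent working the item
    pct_complete_accurate: float  # fraction of output usable as-is downstream

def activity_ratio(steps):
    """Share of total lead time spent on value-adding work."""
    total_lead = sum(s.lead_time_hours for s in steps)
    total_process = sum(s.process_time_hours for s in steps)
    return total_process / total_lead

def rolled_pct_ca(steps):
    """Probability a work item passes every step without rework."""
    result = 1.0
    for s in steps:
        result *= s.pct_complete_accurate
    return result

steps = [
    Step("Develop", 40, 16, 0.90),
    Step("Test", 24, 4, 0.75),
    Step("Deploy", 8, 1, 0.95),
]
print(f"Activity ratio: {activity_ratio(steps):.2f}")  # 21/72 ≈ 0.29
print(f"Rolled %C/A: {rolled_pct_ca(steps):.2%}")      # 0.90*0.75*0.95 ≈ 64.13%
```

Even this toy example shows why %C/A matters: three individually decent steps still mean a third of all work items need rework somewhere.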
The authors are well read and pull heavily from Lean and Kanban concepts such as:
  • Make work visible
  • Limit work in process
  • Reduce batch sizes - ideally to single piece flow
  • Reduce hand offs
And Goldratt's Five Focusing Steps (this is a book unto itself - for IT it's David J. Anderson's Kanban):
  1. Identify the constraint
  2. Decide how to exploit the constraint
  3. Subordinate everything else to the constraint
  4. Elevate the constraint
  5. If the constraint has been broken - go back to step 1 (but avoid inertia)
This handbook aligns closely with David J. Anderson's book on Kanban and his recipe for applying Kanban principles to IT work:
  1. Focus on quality
  2. Reduce WIP
  3. Deliver often
  4. Balance demand against throughput
  5. Prioritize
  6. Attack sources of variability to improve predictability
The typical DevOps constraints are:
  • Environment creation
  • Code deployment
  • Test set up and run
  • Architecture (tightly coupled)
The authors call out the Poppendiecks' IT wastes - partially done work, extra processes, extra features, task switching, waiting, motion, defects, non-standard or manual work, and heroics - but extend the definition of waste. More modern interpretations of Lean have noted that “eliminating waste” can have a demeaning and dehumanizing context; instead, the goal is reframed to reduce hardship and drudgery in our daily work through continual learning in order to achieve the organization’s goals. For the remainder of this book, the term waste will imply this more modern definition, as it more closely matches the DevOps ideals and desired outcomes.

Organizational learning and safety are prerequisites for a DevOps culture.

Dr. Ron Westrum was one of the first to observe the importance of organizational culture on safety and performance. He observed that in healthcare organizations, the presence of “generative” cultures was one of the top predictors of patient safety. Dr. Westrum defined three types of culture:

  • Pathological organizations are characterized by large amounts of fear and threat. People often hoard information, withhold it for political reasons, or distort it to make themselves look better. Failure is often hidden. 
  • Bureaucratic organizations are characterized by rules and processes, often to help individual departments maintain their “turf.” Failure is processed through a system of judgment, resulting in either punishment or justice and mercy.
  • Generative organizations are characterized by actively seeking and sharing information to better enable the organization to achieve its mission. Responsibilities are shared throughout the value stream, and failure results in reflection and genuine inquiry.

A key theme throughout the book is the idea that IT is a complex system and thus often unpredictable. The three ways are means not only of reducing the risks within that complexity but of enabling the safety required to create world class IT organizations and results.

Part II: Getting Started

Part two focuses on how to get started: which value stream to choose, how to approach it, and how to design your DevOps organization with Conway's law in mind. Lots of practical, if sometimes obvious, advice and ideas from The Lean Startup movement.

Get started with a business area that is struggling, as they're more willing to try something new.

  1. Start by finding Innovators and Early Adopters
  2. Build critical mass and a silent majority
  3. Identify the holdouts

Greenfield or brownfield, systems of record or systems of engagement - both are fine; you often can't achieve meaningful change in one without changing the other anyway.

Some have come to wrongly interpret Gartner's Bi-Modal IT as meaning poor quality is fine in the legacy world. Scott Prugh, VP of Product Development at CSG, observed, “We’ve adopted a philosophy that rejects bi-modal IT, because every one of our customers deserve speed and quality. This means that we need technical excellence, whether the team is supporting a 30 year old mainframe application, a Java application, or a mobile application.” Consequently, when we improve brownfield systems, we should not only strive to reduce their complexity and improve their reliability and stability, we should also make them faster, safer, and easier to change.

Once the business area is selected, create a value stream map. Identify all the relevant teams; a typical set might include product owners, Development, QA, Operations, Infosec, release managers, infrastructure, and technology executives or value stream managers. Determine overall cycle time, processing times, and %Complete and Accurate figures. Figure out what is slowing you down! Make all of this big and visible. Put another way... just read and apply David J. Anderson's Kanban.

As recommended in The Lean Startup, create a dedicated team empowered to work outside of the typical rule sets. Agree on a shared goal, keep your batch sizes small and planning horizons short, and reserve at least 20% of the team's time for NFRs (e.g. automation, infrastructure as code) and technical debt.

Leverage tools, e.g. chat rooms and shared work item management repositories such as Jira or ServiceNow, to reinforce desired behaviors.

Organizational Models and Policies

Great outcomes come from integrating the daily work of development and operations teams. Here's what they are seeing work:
  1. Develop E-shaped people on small teams
  2. Fund products and services, not projects (in general the authors align to SAFe's economic model without the formalization)
  3. Define team boundaries respecting Conway's law (they share some great stories on unintended consequences)
  4. Teams should share relevant work items, planning cycles, stories, etc.
  5. Ideally, architect for loosely coupled, tightly aligned services, apps, etc.

While all three major archetypes of organizational models - Functional, Matrix, and Market (value stream aligned) - can be found in high performers, the authors strongly favor Market-based models, which optimize for end-to-end speed (not necessarily functional efficiency).



Overall three strategies have the most traction:

  1. Create self-service operations capabilities for developers
  2. Embed operations into development teams
  3. Assign ops liaisons to development teams and business value streams

Part III: The First Way - The Technical Practices of Flow

Create the foundation of our Pipelines!

Their definition of done is a good summary of the intended capability:

  • Validated in a production-like environment
  • Automatically (i.e. on demand) able to
    • Provision hardware
    • Deploy and test code across development, test, and staging environments
  • All configuration, code, DB change scripts, and test data under source control in one repository
    • Including build and deployment scripts, etc.
Ideally this should happen using immutable infrastructure, where no manual changes occur - make it easier to destroy and rebuild than to reconfigure.
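The "destroy and rebuild" idea can be sketched in a few lines. The in-memory "fleet" and helper names below are purely illustrative; real implementations bake images with tools like Packer and replace instances via Terraform, CloudFormation, or a container scheduler.

```python
import uuid

# Hypothetical in-memory fleet, standing in for a real provisioning API.
RUNNING = {}

def build_image(version):
    """Bake a new immutable image; nothing is ever patched in place."""
    return {"id": f"img-{uuid.uuid4().hex[:8]}", "version": version}

def deploy(image):
    """Bring up fresh servers from the image, then destroy the old ones."""
    old_ids = set(RUNNING)                       # snapshot the current fleet
    new_ids = {f"srv-{uuid.uuid4().hex[:8]}" for _ in range(2)}
    for sid in new_ids:
        RUNNING[sid] = image                     # create replacements first
    for sid in old_ids:
        del RUNNING[sid]                         # then tear down the old fleet
    return new_ids

deploy(build_image("1.0"))
deploy(build_image("1.1"))                       # no mutation: whole fleet replaced
assert all(v["version"] == "1.1" for v in RUNNING.values())
```

The point of the pattern is that drift cannot accumulate: every deploy starts from a known-good image, so "configure" is replaced by "rebuild".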

Enable fast and reliable testing


Key principles / requirements:
  1. Design, architect, and build for testability. The easiest way to do this is to adopt Test Driven Development (TDD)
  2. Create a proper test pyramid (many fast unit tests, fewer integration tests, fewest end-to-end tests)
  3. Tests must be fast
    • How fast? Faster than that - automate!
    • Run in parallel if needed
  4. Tests must cover all code: application, infrastructure (e.g. Chef, Puppet), and operations (e.g. deployment, rollback), plus security...
  5. Some manual testing will always be needed, e.g. exploratory; minimize it
  6. Be practical, though, when chasing #4
  7. Pull the andon cord (stop the line) when the pipeline breaks
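The "pull the andon cord" principle can be sketched as a pipeline gate that halts on the first failing stage. The stage names and placeholder commands are illustrative; a real pipeline would invoke your actual test runners.

```python
import subprocess
import sys

# Illustrative stages: each is (name, command). The commands here are
# placeholders; substitute your project's real unit/integration runners.
STAGES = [
    ("unit",        [sys.executable, "-c", "print('unit ok')"]),
    ("integration", [sys.executable, "-c", "print('integration ok')"]),
]

def run_pipeline(stages):
    """Run stages in order; stop the line on the first failure."""
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"ANDON: stage '{name}' failed; halting the pipeline")
            return False   # nothing downstream runs until this is fixed
        print(f"stage '{name}' passed")
    return True

run_pipeline(STAGES)
```

The key behavior is the early return: when the line stops, fixing the pipeline becomes the team's top priority rather than something routed around.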

Continuous Integration and Trunk Based Development

  • Optimize for teams
  • Small batches!
  • Leverage short-lived branches
    • Individuals: hours to a couple of days
    • Teams: days to a week, max
  • Keep trunk in a deployable state

Automate Deployments

Enablers:
  1. Automate deployments to all environments
  2. Make and keep all environments as similar as is practical (they must be and remain consistent - eliminate, or at least minimize, drift)
  3. Provide self-service capabilities with appropriate controls
  4. Include all steps, for example:
      • Packaging code in ways suitable for deployment 
      • Creating pre-configured virtual machine images or containers 
      • Automating the deployment and configuration of middleware 
      • Copying packages or files onto production servers 
      • Restarting servers, applications, or services 
      • Generating configuration files from templates 
      • Running automated smoke tests to make sure the system is working and correctly configured 
      • Running testing procedures 
      • Scripting and automating database migrations
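The steps above can be sketched as a scripted deployment where each phase is a function. All helper names, the version string, and the hostnames are hypothetical; in practice each function would shell out to real packaging, provisioning, and configuration tooling.

```python
# A minimal sketch of a scripted deployment; each step mirrors a bullet above.

def package_code(version):
    """Package code in a form suitable for deployment."""
    return f"app-{version}.tar.gz"

def provision_image(artifact):
    """Create a pre-configured image with middleware already set up."""
    return {"image": f"vm-{artifact}", "middleware": "configured"}

def render_config(template, env):
    """Generate configuration from a template - never hand-edited."""
    return template.format(**env)

def smoke_test(url):
    """Real code would hit a health endpoint; we simulate success here."""
    return {"url": url, "healthy": True}

def deploy(version, env):
    artifact = package_code(version)
    image = provision_image(artifact)
    config = render_config("db_host={db_host}", env)
    health = smoke_test(f"https://{env['host']}/health")
    assert health["healthy"], "roll back: smoke test failed"
    return {"artifact": artifact, "image": image["image"], "config": config}

result = deploy("2.3.1", {"host": "staging.example.com", "db_host": "db1"})
print(result["config"])   # db_host=db1
```

Because every step is code, the same script runs identically against development, test, and staging, which is exactly what keeps the environments consistent.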


    Decouple Deployment from Release

    Deploy code regularly to stay safe. Release features as needed to deliver value.

    Common strategies:
    • Environment based:
      • Blue Green - bring up the old and new versions on separate clusters/servers and route traffic to the new; if problems are experienced, switch back
      • Canary - serve new version to a subset of the customer base prior to broad release to validate all is well
      • Cluster immune system - builds on the canary pattern such that deployments/releases are tied to business and environmental telemetry that will automatically roll or switch back to the previous version if problems are experienced, e.g. the average number of orders per minute drops dramatically post-release
    • Application based:
      • Feature flags - enable features to be toggled on and off to
        • Roll back easily
        • A/B test
        • Gracefully degrade performance
        • Increase resiliency via SOA e.g. toggle a service based on its availability 
      • Dark launch - make a feature available in production for testing evaluation prior to making it available to customers (see below) 
      • Graceful degradation under stress e.g. if performance begins to falter a lesser version of the feature is toggled on to maintain at least the basic capability
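The application-based patterns above can be sketched with a simple percentage-based feature flag using deterministic per-user bucketing. The flag name and in-memory store are illustrative; real systems use a config service or a product like LaunchDarkly or Unleash.

```python
import hashlib

# Hypothetical in-memory flag store.
FLAGS = {"new_search": {"enabled": True, "rollout_pct": 10}}

def is_enabled(flag, user_id):
    """Deterministic per-user bucketing so a user's experience is stable."""
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100     # stable bucket in [0, 100)
    return bucket < cfg["rollout_pct"]

# The toggle doubles as a kill switch: flip enabled=False to roll back
# instantly, with no deployment needed.
FLAGS["new_search"]["rollout_pct"] = 100
assert is_enabled("new_search", "alice")
FLAGS["new_search"]["enabled"] = False
assert not is_enabled("new_search", "alice")
```

Ramping `rollout_pct` from 1 to 100 gives a canary-style release; A/B tests fall out of the same bucketing by comparing users inside and outside the rollout.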
    Dark Launch exposition:
    Feature toggles allow us to deploy features into production without making them accessible to users, enabling a technique known as dark launching. This is where we deploy all the functionality into production and then perform testing of that functionality while it is still invisible to customers. For large or risky changes, we often do this for weeks before the production launch, enabling us to safely test with the anticipated production-like loads. 

    For instance, suppose we dark launch a new feature that poses significant release risk, such as new search features, account creation processes, or new database queries. After all the code is in production, keeping the new feature disabled, we may modify user session code to make calls to new functions— instead of displaying the results to the user, we simply log or discard the results. 

    For example, we may have 1% of our online users make invisible calls to a new feature scheduled to be launched to see how our new feature behaves under load. After we find and fix any problems, we progressively increase the simulated load by increasing the frequency and number of users exercising the new functionality. By doing this, we are able to safely simulate production-like loads, giving us confidence that our service will perform as it needs to.
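The shadow-call idea described above can be sketched as follows. The function names, the 1% sampling, and the injectable random source are illustrative; the essential property is that the new code path is exercised under real traffic while users only ever see the legacy result.

```python
import logging
import random

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dark-launch")

DARK_LAUNCH_PCT = 1   # fraction of requests that exercise the new path

def legacy_search(query):
    return ["old-result"]

def new_search(query):
    """The risky new code path being dark launched."""
    return ["new-result"]

def handle_request(query, rng=random.random):
    if rng() * 100 < DARK_LAUNCH_PCT:
        try:
            shadow = new_search(query)   # exercised under real load...
            log.info("dark launch ok: %d results", len(shadow))
        except Exception:
            log.exception("dark launch failure (user unaffected)")
    return legacy_search(query)          # ...but users only ever see this

assert handle_request("books", rng=lambda: 0.0) == ["old-result"]
```

Raising `DARK_LAUNCH_PCT` step by step is the "progressively increase the simulated load" part of the technique; the logs and metrics from the shadow path provide the confidence to launch.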

    Jez Humble wrote Continuous Delivery and addresses recent confusion of Continuous Delivery and Continuous Deployment this way:

    When all developers are working in small batches on trunk, or everyone is working off trunk in short-lived feature branches that get merged to trunk regularly, and when trunk is always kept in a releasable state, and when we can release on demand at the push of a button during normal business hours, we are doing continuous delivery. Developers get fast feedback when they introduce any regression errors, which include defects, performance issues, security issues, usability issues, etc. When these issues are found, they are fixed immediately so that trunk is always deployable. 

    In addition to the above, when we are deploying good builds into production on a regular basis through self-service (being deployed by Dev or by Ops)— which typically means that we are deploying to production at least once per day per developer, or perhaps even automatically deploying every change a developer commits— this is when we are engaging in continuous deployment.

    Each organization will need to find the balance that makes sense for them. Even Google and other innovators are not all or nothing in this area.

    Architect for low risk releases

    Amazon Case Study Excerpt:

    One of the most studied architecture transformations occurred at Amazon. In an interview with ACM Turing Award-winner and Microsoft Technical Fellow Jim Gray, Amazon CTO Werner Vogels explains that Amazon.com started in 1996 as a “monolithic application, running on a web server, talking to a database on the back end. This application, dubbed Obidos, evolved to hold all the business logic, all the display logic, and all the functionality that Amazon eventually became famous for: similarities, recommendations, Listmania, reviews, etc.” 

    As time went by, Obidos grew too tangled, with complex sharing relationships meaning individual pieces could not be scaled as needed. Vogels tells Gray that this meant “many things that you would like to see happening in a good software environment couldn’t be done anymore; there were many complex pieces of software combined into a single system. It couldn’t evolve anymore.” 

    Describing the thought process behind the new desired architecture, he tells Gray, "We went through a period of serious introspection and concluded that a service-oriented architecture would give us the level of isolation that would allow us to build many software components rapidly and independently.” 

    Vogels notes, “The big architectural change that Amazon went through in the past five years [from 2001–2005] was to move from a two-tier monolith to a fully-distributed, decentralized, services platform serving many different applications. A lot of innovation was necessary to make this happen, as we were one of the first to take this approach.” The lessons from Vogels’s experience at Amazon that are important to our understanding of architecture shifts include the following: 
    • Lesson 1: When applied rigorously, strict service orientation is an excellent technique to achieve isolation; you achieve a level of ownership and control that was not seen before. 
    • Lesson 2: Prohibiting direct database access by clients makes performing scaling and reliability improvements to your service state possible without involving your clients. 
    • Lesson 3: Development and operational process greatly benefits from switching to service-orientation. The services model has been a key enabler in creating teams that can innovate quickly with a strong customer focus. Each service has a team associated with it, and that team is completely responsible for the service— from scoping out the functionality to architecting, building, and operating it. 
    The extent to which applying these lessons enhances developer productivity and reliability is breathtaking. In 2011, Amazon was performing approximately fifteen thousand deployments per day. By 2015, they were performing nearly 136,000 deployments per day.

    Architectural Implications of DevOps:
    • The modern bias is for micro-service architectures
    • Micro-services best match the philosophy of DevOps and the Market-based organizational model leveraging small teams
    • It isn't that monolithic architectures are "wrong" but rather:
      • It's a realization that they have created unnecessary constraints that must be managed well as a business changes, and that
      • The internal design / implementation of monolithic architectures can support rather than constrain this evolution, e.g. Clean Code / Clean Architecture

    Dealing with architectural entropy that constrains business value delivery and DevOps adoption:
    • Consider Martin Fowler's Strangler Pattern to incrementally re-architect tightly coupled, brittle applications and platforms. Here's a nice bottom-up example of the pattern in action.
    • Apply clean code and architectural principles
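The strangler idea reduces to a routing facade: migrated paths go to the new service while everything else still hits the legacy monolith. The route prefixes and handler names below are purely illustrative.

```python
# A sketch of strangler-style routing. As functionality migrates, prefixes
# move into MIGRATED_PREFIXES until the legacy app handles nothing at all.
MIGRATED_PREFIXES = ["/search", "/recommendations"]

def legacy_app(path):
    """Stands in for the old monolith."""
    return f"legacy:{path}"

def new_service(path):
    """Stands in for the newly extracted service."""
    return f"new:{path}"

def route(path):
    """The 'strangler facade' sitting in front of both systems."""
    if any(path.startswith(p) for p in MIGRATED_PREFIXES):
        return new_service(path)
    return legacy_app(path)

print(route("/search?q=devops"))   # new:/search?q=devops
print(route("/checkout"))          # legacy:/checkout
```

Because the facade owns the routing, each migration step is small and reversible: remove a prefix from the list and traffic falls back to the monolith.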
