DevOps Handbook Summary 3 of 4 - The Third Way

Book summary of The DevOps Handbook by Gene Kim et al. Excerpted content is formatted in italics.

Part V: The Third Way - The Technical Practices of Learning

In Part V, The Third Way: The Technical Practices of Learning, we present the practices that create opportunities for learning, as quickly, frequently, cheaply, and as soon as possible. This includes creating learnings from accidents and failures, which are inevitable when we work within complex systems, as well as organizing and designing our systems of work so that we are constantly experimenting and learning, continually making our systems safer. The results include higher resilience and an ever-growing collective knowledge of how our system actually works, so that we are better able to achieve our goals. 

In the following chapters, we will institutionalize rituals that increase safety, continuous improvement, and learning by doing the following: 

  • Establish a just culture to make safety possible
  • Inject production failures to create resilience
  • Convert local discoveries into global improvements
  • Reserve time to create organizational improvements and learning

Enable and inject learning into daily work

IT work operates within a complex system, and as a result it is often impossible to predict the outcomes of our actions. Put another way: accidents are going to happen, and our aim is to make their impacts ever smaller until most are barely noticeable.

An empowering insight of Lean is that the root of most problems lies not in our people but in flaws in our processes and organizational systems. These processes and systems have often become invisible and are taken for granted: "that's just how it is" or "how we do it here." Thus well-intentioned people will attempt to take or assign "blame" or responsibility for problems without recognizing that the system is more at fault than they are. Beyond inspiring heroism, little actionable learning results. Worse, this does them and our business an injustice.

A just culture goes beyond arbitrary acceptance of responsibility or assignment of blame to factual analysis of root cause. This creates an environment where people feel safe to openly examine their role, the role of the system, the role of random chance, and so on. This safety is the prerequisite for achieving a self-diagnosing, problem-solving, resilient DevOps culture.

The Blameless Postmortem
In the blameless post-mortem meeting, we will do the following: 

  1. Construct a timeline and gather details from multiple perspectives on failures, ensuring we don’t punish people for making mistakes 
  2. Empower all engineers to improve safety by allowing them to give detailed accounts of their contributions to failures
  3. Enable and encourage people who do make mistakes to be the experts who educate the rest of the organization on how not to make them in the future 
  4. Accept that there is always a discretionary space where humans can decide to take action or not, and that the judgment of those decisions lies in hindsight 
  5. Propose countermeasures to prevent a similar accident from happening in the future and ensure these countermeasures are recorded with a target date and an owner for follow-up 
To enable us to gain this understanding, the following stakeholders need to be present at the meeting:

  • The people involved in decisions that may have contributed to the problem 
  • The people who identified the problem 
  • The people who responded to the problem 
  • The people who diagnosed the problem 
  • The people who were affected by the problem 
  • And anyone else who is interested in attending the meeting.

By recording the outcomes of these meetings in a central place, we help others avoid making the same mistakes - provided it's easy to record, and then search for and find, relevant learning on demand. Etsy's approach:

This desire to conduct as many blameless post-mortem meetings as necessary at Etsy led to some problems— over the course of four years, Etsy accumulated a large number of post-mortem meeting notes in wiki pages, which became increasingly difficult to search, save, and collaborate from. 

To help with this issue, they developed a tool called Morgue to easily record aspects of each accident, such as the incident MTTR and severity, better address time zones (which became relevant as more Etsy employees were working remotely), and include other data, such as rich text in Markdown format, embedded images, tags, and history. 

Morgue was designed to make it easy for the team to record:
  • Whether the problem was due to a scheduled or an unscheduled incident
  • The post-mortem owner
  • Relevant IRC chat logs (especially important for 3 a.m. issues when accurate note-taking may not happen)
  • Relevant JIRA tickets for corrective actions and their due dates (information particularly important to management)
  • Links to customer forum posts (where customers complain about issues)
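To make the shape of such a record concrete, here is a minimal sketch of a post-mortem record capturing the kinds of fields Morgue tracks. The class name, field names, and severity scale are all illustrative assumptions, not Etsy's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical post-mortem record, loosely modeled on the fields the
# book says Morgue captures. Names and types are illustrative only.
@dataclass
class PostmortemRecord:
    title: str
    owner: str                      # the post-mortem owner
    scheduled: bool                 # scheduled vs. unscheduled incident
    severity: int                   # e.g. 1 (critical) .. 4 (minor)
    started_at: datetime
    resolved_at: datetime
    irc_log_urls: list = field(default_factory=list)   # chat logs
    ticket_ids: list = field(default_factory=list)     # corrective actions
    forum_post_urls: list = field(default_factory=list)
    tags: list = field(default_factory=list)           # for search

    @property
    def ttr_minutes(self) -> float:
        """Time to restore for this incident, one input to the team's MTTR metric."""
        return (self.resolved_at - self.started_at).total_seconds() / 60
```

Storing records as structured data like this, rather than free-form wiki pages, is what makes them searchable and aggregable on demand.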

If embraced, over time these techniques enable organizations to reduce incident tolerances and focus on ever weaker failure signals. The truth is that high performers fail more often but their failures are smaller. Just telling everyone to do this doesn't work. Specific and interlocking conditions must be in place:

  • Small batches
  • Frequent, automated deploys
  • Strong telemetry and analytical capabilities
  • A/B testing
  • Blue-green and canary deployments
These conditions make it safe to take calculated risks and redefine failure as a natural part of organizational learning. They engender resilient products, services, and culture. They can culminate in designing, deploying, and testing failure modes in production. Examples include:
  • Graceful feature degradation under unexpectedly high loads
  • Rehearsing large scale failures
  • Injecting controlled faults in production to practice resolution - e.g. Netflix's Chaos Monkey

Convert local discoveries into global improvements

Use chat rooms (e.g. Slack, HipChat) and bots to automate work and collect organizational knowledge - e.g. bots to deploy a change, send a deployment or pull request notification, retrieve FAQ information, announce a product issue, etc.

Benefits experienced:
  • Is visible to all, unlike email
  • Documents events (in real time as a part of daily work rather than something extra)
    • Testing sessions
    • Incidents
    • Change coordination
    • ...
Chat logs that are effectively organized, archived, and searchable can have a dramatic effect on knowledge sharing.

GitHub example:
They created Hubot, a software application that interacted with the Ops team in their chat rooms, where it could be instructed to perform actions merely by sending it a command (e.g., “@hubot deploy owl to production”). The results would also be sent back into the chat room. Having this work performed by automation in the chat room (as opposed to running automated scripts via command line) had numerous benefits, including: 
  • Everyone saw everything that was happening. 
  • Engineers on their first day of work could see what daily work looked like and how it was performed. 
  • People were more apt to ask for help when they saw others helping each other. 
  • Rapid organizational learning was enabled and accumulated.
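A ChatOps bot in the spirit of Hubot reduces to a small dispatch loop: parse a mention, run the command, and post the result back where everyone can see it. This is a hypothetical sketch - the `@bot` mention, the command table, and the deploy stub are all assumed names, not Hubot's API:

```python
# Minimal ChatOps sketch: a bot that parses chat commands and posts
# results back to the room so the whole team sees the work happen.

def deploy(app, env):
    # In a real bot this would trigger the deployment pipeline.
    return f"{app} deployed to {env}"

COMMANDS = {"deploy": deploy}

def handle_message(message, room_log):
    """Dispatch '@bot <command> <args...>' and echo the result to the room."""
    if not message.startswith("@bot "):
        return None                      # ignore ordinary chat
    parts = message.split()[1:]          # drop the '@bot' mention
    command, args = parts[0], parts[1:]
    result = COMMANDS[command](*args)
    room_log.append(result)              # visible to everyone in the room
    return result
```

Because the result lands in the shared room rather than on one engineer's terminal, every invocation doubles as real-time documentation of how the work is done.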

Automate standard operating processes in software for reuse and store it in a single source code repository (in addition to application source code):
  • Configuration standards for our libraries, infrastructure, and environments (Chef recipes, Puppet manifests, etc.) 
  • Deployment tools 
  • Testing standards and tools, including security 
  • Deployment pipeline tools 
  • Monitoring and analysis tools 
  • Tutorials and standards
  • Code templates for global standards (providing templates, automation, etc. makes following the standards easier than not following them)
Justin Arbuckle was chief architect at GE Capital in 2013 when he said, “We needed to create a mechanism that would allow teams to easily comply with policy— national, regional, and industry regulations across dozens of regulatory frameworks, spanning thousands of applications running on tens of thousands of servers in tens of data centers.” By encoding our manual processes into code that is automated and executed, we enable the process to be widely adopted, providing value to anyone who uses them. Arbuckle concluded that “the actual compliance of an organization is in direct proportion to the degree to which its policies are expressed as code.”
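Arbuckle's point - compliance in proportion to policy expressed as code - can be made concrete with a small sketch: a compliance rule written as an executable check that runs in the deployment pipeline instead of living in an audit document. The policy values and config fields here are illustrative assumptions, not any real regulatory framework:

```python
# Hypothetical "policy as code": an executable compliance check.
# The policy thresholds and config keys are invented for illustration.

POLICY = {
    "tls_min_version": 1.2,
    "password_auth_disabled": True,
}

def violations(server_config):
    """Return a list of human-readable policy violations for one server."""
    found = []
    if server_config.get("tls_min_version", 0) < POLICY["tls_min_version"]:
        found.append("TLS version below %.1f" % POLICY["tls_min_version"])
    if not server_config.get("password_auth_disabled", False):
        found.append("password authentication must be disabled")
    return found
```

Run against every server definition in the pipeline, a check like this turns "are we compliant?" from a periodic manual audit into a question the build answers on every change.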

Ownership supports knowledge maintenance and sharing

Every shared library at Google has an owner who is responsible for ensuring it passes tests of all its consumers. They are responsible for safely migrating each consuming group across versions. They seek to support one version of a library in production at a time.

Spread knowledge by using automated tests as documentation

Shared libraries and services must support change across the organization. Write automated tests at various levels to make them safe to use and change. Test Driven Development (TDD or Test Driven Design) automatically provides these benefits.
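A brief sketch of what "tests as documentation" looks like in practice: the test names and assertions below spell out the contract of a hypothetical shared `slugify()` helper, so a consuming team can read the tests instead of prose docs. Both the helper and the test class are invented for illustration:

```python
import unittest

def slugify(title):
    """Hypothetical shared-library helper: turn a title into a URL-safe slug."""
    return "-".join(title.lower().split())

# The tests double as documentation: each test name states one
# guarantee of the contract, and the assertion demonstrates it.
class SlugifyContract(unittest.TestCase):
    def test_lowercases_and_joins_words_with_hyphens(self):
        self.assertEqual(slugify("DevOps Handbook"), "devops-handbook")

    def test_collapses_repeated_whitespace(self):
        self.assertEqual(slugify("a   b"), "a-b")
```

Unlike a wiki page, this documentation fails loudly the moment the library's behavior drifts, which is what makes the shared code safe to change.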

Design for operations through codified Non Functional Requirements

Implementing these non-functional requirements will enable our services to be easy to deploy and keep running in production, where we can quickly detect and correct problems and ensure they degrade gracefully when components fail. Examples of non-functional requirements include ensuring that we have:
  • Sufficient production telemetry in our applications and environments 
  • The ability to accurately track dependencies 
  • Services that are resilient and degrade gracefully 
  • Forward and backward compatibility between versions 
  • The ability to archive data to manage the size of the production data set 
  • The ability to easily search and understand log messages across services 
  • The ability to trace requests from users through multiple services 
  • Simple, centralized runtime configuration using feature flags and so forth 
By codifying these types of non-functional requirements, we make it easier for all our new and existing services to leverage the collective knowledge and experience of the organization. These are all responsibilities of the team building the service.
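The last item on the list, simple centralized runtime configuration, can be sketched as a tiny in-process feature-flag store. The flag names are hypothetical, and a real system would back this with a central service so all instances agree:

```python
# Minimal feature-flag sketch for the "centralized runtime
# configuration" requirement. Names are illustrative only.

class FeatureFlags:
    def __init__(self, defaults=None):
        self._flags = dict(defaults or {})

    def is_enabled(self, name):
        return self._flags.get(name, False)   # unknown flags default to off

    def set(self, name, enabled):
        self._flags[name] = enabled           # flip at runtime, no redeploy

flags = FeatureFlags({"new-checkout": False})

def checkout():
    if flags.is_enabled("new-checkout"):
        return "new checkout flow"
    return "legacy checkout flow"
```

Defaulting unknown flags to off is the safety property that matters: a new code path stays dark until someone deliberately enables it, and can be turned off again without a deployment.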

Advice on standards

View technology choices holistically across development, operations, infrastructure, and security to support organizational goals. We typically see local optimizations within each function (development, operations, etc.) that inadvertently impede rather than support flow.

If we do not have a list of technologies that Operations will support, collectively generated by Development and Operations, we should systematically go through the production infrastructure and services, as well as all their dependencies that are currently supported, to find which ones are creating a disproportionate amount of failure demand and unplanned work. Our goal is to identify the technologies that: 
  • Impede or slow down the flow of work 
  • Disproportionately create high levels of unplanned work 
  • Disproportionately create large numbers of support requests 
  • Are most inconsistent with our desired architectural outcomes (e.g. throughput, stability, security, reliability, business continuity)
In a presentation that he gave with Olivier Jacques and Rafael Garcia at the 2015 DevOps Enterprise Summit, Ralph Loura, CIO of HP, stated: 
Internally, we described our goal as creating “buoys, not boundaries.” Instead of drawing hard boundaries that everyone has to stay within, we put buoys that indicate deep areas of the channel where you’re safe and supported. You can go past the buoys as long as you follow the organizational principles. After all, how are we ever going to see the next innovation that helps us win if we’re not exploring and testing at the edges? As leaders, we need to navigate the channel, mark the channel, and allow people to explore past it.

Make time to learn, fund it (you'll pay either way)

One of the practices that forms part of the Toyota Production System is called the improvement blitz (or sometimes a kaizen blitz), defined as a dedicated and concentrated period of time to address a particular issue, often over the course of several days. Dr. Spear explains, “... blitzes often take this form: A group is gathered to focus intently on a process with problems… The blitz lasts a few days, the objective is process improvement, and the means are the concentrated use of people from outside the process to advise those normally inside the process.” 

Spear observes that the output of the improvement blitz team will often be a new approach to solving a problem, such as new layouts of equipment, new means of conveying material and information, a more organized workspace, or standardized work. They may also leave behind a to-do list of changes to be made down the road.

Improvement blitzes, hackathons, etc.

DevOps organizations from Target to Google and beyond have implemented versions of this approach. They range in levels of formality and intensity. They often leverage internal consulting and may not involve anyone outside the team. Finally, some spill out of DevOps into more general business sessions seeking to improve the company's end products, e.g. including product management to focus on identifying and building new features. While helpful, until you've achieved a high level of DevOps maturity, feature building may keep your cobbler's children barefoot.

The Target example is highly formalized and intense. They hold 30-day challenges in a dedicated facility, where up to eight whole teams come and work with coaches to address significant problems. Outcomes include reducing lead times for environment creation and other services from three to six months to days. They also hold "flash build" sessions to deliver MVPs in two to three days.

Many DevOps organizations have adapted the improvement blitz by holding day- to week-long sessions where one team, one department, or the entire organization focuses on improving their daily work: removing workarounds, tackling technical debt, automating repetitive tasks, etc. No feature or project work is allowed. In some cases the outcomes have been dramatic. For example, Facebook teams built the HipHop compiler, which handled 6x user loads by converting interpreted PHP code into compiled C++ binaries.

What makes improvement blitzes so powerful is that we are empowering those closest to the work to continually identify and solve their own problems. Consider for a moment that our complex system is like a spider web, with intertwining strands that are constantly weakening and breaking. If the right combination of strands breaks, the entire web collapses. There is no amount of command-and-control management that can direct workers to fix each strand one by one. Instead, we must create the organizational culture and norms that lead to everyone continually finding and fixing broken strands as part of our daily work. As Dr. Spear observes, “No wonder then that spiders repair rips and tears in the web as they occur, not waiting for the failures to accumulate.”

While not discussed in the book, SAFe embodies this spirit by including Innovation and Planning iterations within each Program Increment.

Enable everyone to teach and learn

At the lowest level of detail, creating a self-diagnosing, problem-solving, and resilient culture requires nearly all employees to teach and learn from each other and from their daily experiences. To flourish, these skills need to be made visible and valued - or more simply put, nurtured. This excerpt says it well:

Steve Farley, VP of Information Technology at Nationwide Insurance, said, “We have five thousand technology professionals, who we call ‘associates.’ Since 2011, we have been committed to create a culture of learning— part of that is something we call Teaching Thursday, where each week we create time for our associates to learn. For two hours, each associate is expected to teach or learn. The topics are whatever our associates want to learn about— some of them are on technology, on new software development or process improvement techniques, or even on how to better manage their career. The most valuable thing any associate can do is mentor or learn from other associates.” 

As has been made evident throughout this book, certain skills are becoming increasingly needed by all engineers, not just by developers. For instance, it is becoming more important for all Operations and Test engineers to be familiar with Development techniques, rituals, and skills, such as version control, automated testing, deployment pipelines, configuration management, and creating automation. Familiarity with Development techniques helps Operations engineers remain relevant as more technology value streams adopt DevOps principles and patterns.

Often DevOps goals are based not on cost reduction via job elimination but rather on increasing speed and agility and on repurposing talent. Accomplishing this goal requires a specific focus on helping employees teach and learn both the skills they have and those they need to develop.

Create internal consulting and coaching to spread practices - and a final word on TDD

I'm closing this section with a long excerpt relating Google's internal consulting story. It reflects how they applied the principles of the Third Way in ways that leveraged their culture. It also highlights the critical importance of thorough testing to achieving DevOps success. What it lacks is an exposition of why and how testing enables this success.

This entire book stresses the importance of:
  • Small batches
  • Loosely coupled code at the unit, component, service, application, and architectural levels
  • Continuous integration and automated testing
  • Building quality in and reviewing it for quality as close to the source as practical
The "easiest" way to reliably achieve these important goals is to adopt Test Driven Development from the unit to the acceptance test level. It delivers code that is intrinsically testable and thus loosely coupled. The process focuses on testing code one small unit at a time and continuously integrating it via automated builds and deployments, thus delivering both small batches and the foundation of release automation. Finally, the pull request, pairing, and related practices build quality in and keep reviewers close to the source.

"Easiest" was in quotes because it is not easy to adopt. It is not a silver bullet, though it is as close to one as I've ever experienced. It alone will not deliver great designs and architecture; you need skilled craftsmen for that. But it will help those less skilled create code that is orders of magnitude better than code built with traditional test-after techniques. Better means better able to change and adapt as circumstances warrant, e.g. from a monolithic to a modern microservice architecture.

The Google testing example

Earlier in the book, we began the story of how the Testing Grouplet built a world-class automated testing culture at Google starting in 2005. Their story continues here, as they try to improve the state of automated testing across all of Google by using dedicated improvement blitzes, internal coaches, and even an internal certification program. 

Bland said, at that time, there was a 20% innovation time policy at Google, enabling developers to spend roughly one day per week on a Google-related project outside of their primary area of responsibility. Some engineers chose to form grouplets, ad hoc teams of like-minded engineers who wanted to pool their 20% time, allowing them to do focused improvement blitzes. 

A testing grouplet was formed by Bharat Mediratta and Nick Lesiecki, with the mission of driving the adoption of automated testing across Google. Even though they had no budget or formal authority, as Mike Bland described, “There were no explicit constraints put upon us, either. And we took advantage of that.”

They used several mechanisms to drive adoption, but one of the most famous was Testing on the Toilet (or TotT), their weekly testing periodical. Each week, they published a newsletter in nearly every bathroom in nearly every Google office worldwide. Bland said, “The goal was to raise the degree of testing knowledge and sophistication throughout the company. It’s doubtful an online-only publication would’ve involved people to the same degree.” 

Bland continues, “One of the most significant TotT episodes was the one titled, ‘Test Certified: Lousy Name, Great Results,’ because it outlined two initiatives that had significant success in advancing the use of automated testing.” 

Test Certified (TC) provided a road map to improve the state of automated testing. As Bland describes, “It was intended to hack the measurement-focused priorities of Google culture... and to overcome the first, scary obstacle of not knowing where or how to start. Level 1 was to quickly establish a baseline metric, Level 2 was setting a policy and reaching an automated test coverage goal, and Level 3 was striving towards a long-term coverage goal.” 

The second capability was providing Test Certified mentors to any team who wanted advice or help, and Test Mercenaries (i.e., a full-time team of internal coaches and consultants) to work hands-on with teams to improve their testing practices and code quality. The Mercenaries did so by applying the Testing Grouplet’s knowledge, tools, and techniques to a team’s own code, using TC as both a guide and a goal. Bland was eventually a leader of the Testing Grouplet from 2006 to 2007, and a member of the Test Mercenaries from 2007 to 2009. 

Bland continues, “It was our goal to get every team to TC Level 3, whether they were enrolled in our program or not. We also collaborated closely with the internal testing tools teams, providing feedback as we tackled testing challenges with the product teams. We were boots on the ground, applying the tools we built, and eventually, we were able to remove ‘I don’t have time to test’ as a legitimate excuse.” 

He continues, “The TC levels exploited the Google metrics-driven culture— the three levels of testing were something that people could discuss and brag about at performance review time. The Testing Grouplet eventually got funding for the Test Mercenaries, a staffed team of full-time internal consultants. This was an important step, because now management was fully onboard, not with edicts, but by actual funding.” 

Another important construct was leveraging company-wide “Fixit” improvement blitzes. Bland describes Fixits as “when ordinary engineers with an idea and a sense of mission recruit all of Google engineering for one-day, intensive sprints of code reform and tool adoption.” He organized four company-wide Fixits, two pure Testing Fixits and two that were more tools-related, the last involving more than one hundred volunteers in over twenty offices in thirteen countries. He also led the Fixit Grouplet from 2007 to 2008. 

These Fixits, as Bland describes them, show that we should provide focused missions at critical points in time to generate excitement and energy, which helps advance the state of the art. Each big, visible effort helps the long-term culture change mission reach a new plateau. 

The results of the testing culture are self-evident in the amazing results Google has achieved, presented throughout the book.
