Kanban Summary 6 of 7: Improve - Bottlenecks and Waste

Book summary of Kanban: Successful Evolutionary Change for Your Technology Business by David J. Anderson. Excerpted content in italics.

Making Improvements

Use models to identify and make improvements. David covers the three most common models for addressing the most common improvement opportunities using Kanban. Each has its own body of knowledge and many practitioners, and other models are also available. We'll cover background on each and then dive deep.

Theory of Constraints (bottlenecks) Background

Finding and addressing bottlenecks using the Five Focusing Steps is the basis of improvement in the Theory of Constraints. Eli Goldratt, the author of The Goal, developed this theory in the early 1980s.

In the 1990s this theory evolved into a set of Thinking Processes for root cause analysis and change management. However, the community also moved away from the Five Focusing Steps and tended to accept rather than challenge paradigms. For example, it accepted the triple constraint and dependency graph model of task management and created Critical Chain analysis, rather than modeling projects as value streams with flow problems as David did in his book Agile Management for Software Engineering.

Background on Lean Models: Reduce Waste, Reduce Variability

The Machine That Changed the World, by Womack, Jones, and Roos, popularized Lean in the early 1990s. The book outlined the Toyota Production System and, as such, was incomplete. Key underlying ideas from Deming's System of Profound Knowledge, such as managing variability, were omitted. Generally, early Lean often oversimplified concepts like waste reduction, viewing them in isolation rather than as part of the cultural and systemic whole. Examples include:
  • Focusing on eliminating waste while ignoring the key concepts of Value, Value Stream, and Flow. Fully utilized human resources typically reduce rather than improve system-wide performance.
  • Reducing variability using a top-down, command-and-control approach to Six Sigma (Statistical Process Control) instead of a bottom-up Kaizen approach to continuous improvement. The former tends to breed a low-trust culture rather than a generative Lean culture.

The Five Focusing Steps (Dealing with Bottlenecks)

The Five Focusing Steps is a simple formula for a process of ongoing improvement. It states: 
  1. Identify the constraint. 
  2. Decide how to exploit the constraint. 
  3. Subordinate everything else in the system to the decision made in Step 2. 
  4. Elevate the constraint. 
  5. Avoid inertia; identify the next constraint and return to step 2. 
Kanban uses protection and subordination actions. The steps address capacity and non-instant availability resource constraints.

Step 1: "Identify the constraint" is asking us to find a bottleneck in our value stream. 

A user interface designer supporting two teams can't keep up with demand even though she is working intensely. She's a capacity-constrained resource. As The Mythical Man-Month illustrates, the coordination cost (getting a new person up to speed) and transaction cost (finding, hiring, etc.) of adding capacity in the near term will typically slow rather than speed delivery. The rule of thumb is that "elevating a constraint," i.e. changing the system to remove the constraint, is the last resort.

Step 2: "Decide how to exploit the constraint," asks us to identify the potential throughput of that bottleneck and compare that to what is actually happening. As you will see, the bottleneck is rarely, if ever, working at its full capacity. So ask, “What would it take to get the full potential out of our bottleneck? What would we need to change to make that happen?” This is the “decide” part of step 2.
Situations where the actual throughput is less than 20 percent of potential turn out to be quite common in knowledge work such as requirements analysis and software development. It is often possible to see improvements of up to four times in delivery rate by exploiting a bottleneck.
Assume a test team is a bottleneck. 
  • We can remove all non-value-added activities from their day, e.g. have others fill out time sheets, skip future project planning sessions, or ask business analysts to author test plans.
  • Another approach is to bifurcate, or split, the testing work, i.e. have the test team handle the higher-risk testing and delegate lower-risk testing to others such as business analysts and developers.
  • Aggressively address external blockers. Project managers, leaders, and the whole project team if necessary should swarm to resolve any issue blocking the test team from testing. Identifying, escalating, and resolving issues is critical to addressing capacity-constrained resources.
  • Never allow a capacity-constrained resource to be idle due to a temporary lack of work: protect the resource with a buffer.
The buffer is intended to absorb the variability in the arrival rate of new work queuing, in this example, for our user-experience designer. Buffering adds total WIP to our system. From a Lean perspective, adding a buffer of work adds waste, and it increases lead times. However, the throughput advantage provided by ensuring a steady flow of work through our capacity-constrained resource is usually a better trade. You will get more work done despite the slightly longer lead time and the slightly greater total work-in-progress.
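The buffer trade-off can be illustrated with a toy model. The sketch below is not from the book: the arrival pattern, the capacity of 2 items per day, and the rule that work which cannot fit in the buffer is simply never started upstream are all assumptions chosen for illustration.

```python
def simulate(arrivals, capacity, buffer_limit):
    """Toy model of a bottleneck fed by an uneven upstream.

    arrivals     -- items offered to the bottleneck each day (hypothetical data)
    capacity     -- items the bottleneck can finish per day
    buffer_limit -- WIP limit of the queue in front of the bottleneck

    Assumption: anything that fits neither in today's capacity nor in the
    buffer is never started upstream (the upstream step is blocked).
    Returns (items completed, days the bottleneck ran below capacity).
    """
    buffered = 0
    completed = 0
    starved_days = 0
    for arriving in arrivals:
        available = buffered + arriving
        done = min(available, capacity)
        completed += done
        if done < capacity:
            starved_days += 1
        # Leftover work waits in the buffer, up to its limit.
        buffered = min(available - done, buffer_limit)
    return completed, starved_days


# Hypothetical bursty arrivals: 4 items every other day, capacity of 2 per day.
bursty = [4, 0, 4, 0, 4, 0]
no_buffer = simulate(bursty, capacity=2, buffer_limit=0)
with_buffer = simulate(bursty, capacity=2, buffer_limit=2)
```

With these made-up numbers the unbuffered bottleneck is starved on three of the six days and finishes 6 items, while a buffer of 2 keeps it busy every day and finishes 12, at the price of up to 2 extra items of WIP.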
Step 3: "Subordinate everything else in the system to the decision made in Step 2" asks us to make whatever changes are necessary to implement the ideas from step 2. This may involve making additional changes elsewhere in the value stream in order to get the maximum capacity from the bottleneck. This action of maximizing the bottleneck’s capability is known as “exploiting the bottleneck.”

Counter-intuitively, most bottleneck management happens away from the bottleneck. Many of the changes focus on reducing failure load to the bottleneck in order to maximize its throughput. As a general rule, expect to maximize exploitation of bottleneck capacity, and hence maximize throughput, and as a result, minimize the delivery time on your project by taking actions all over the value stream, and most likely not at the bottleneck itself.
Step 4: "Elevate the constraint" suggests that if the bottleneck is operating at its full capability and is still not producing enough throughput, its capability needs to be enhanced in order to increase throughput. Step 4 asks us to implement an improvement to enhance capability and increase throughput sufficiently so that the current bottleneck is relieved and the system constraint moves elsewhere in the value stream. 

Automation is often a powerful means of elevation:
  • Test automation
  • Environment creation
  • Code deployment 

Step 5: "Avoid inertia; identify the next constraint and return to step 2" requires that we give the changes time to stabilize and then identify the new bottleneck in the value stream and repeat the process. The result is a system of continuous improvement in which throughput is always increasing. If the Five Focusing Steps is institutionalized properly, a culture of continuous improvement will have been achieved throughout the organization.

Non-instant Availability Resource Constraints

Non-instant availability in knowledge work occurs when shared resources try to work on many things at once. While called multi-tasking, in reality, it is task switching with its attendant personal and organizational switching (coordination) costs. Moreover, if we are asked to work on three things simultaneously, we work on the first thing for a while, then switch to the second, then to the third. If someone is waiting for us to finish the first thing while we are working on the second or third, then we would appear to be non-instantly available from that person’s (and the first task’s) perspective.
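To make the effect concrete, here is a minimal sketch comparing one person working three tasks one after another with the same person rotating between them a day at a time. The task sizes are hypothetical, and the model deliberately ignores switching costs, which would make the rotating case look even worse.

```python
def finish_days_sequential(durations):
    """Day on which each task finishes when tasks are done one at a time."""
    day = 0
    finishes = []
    for duration in durations:
        day += duration
        finishes.append(day)
    return finishes


def finish_days_round_robin(durations):
    """Day on which each task finishes when one day is spent on each
    unfinished task in turn (task switching, with zero switching cost)."""
    remaining = list(durations)
    finishes = [0] * len(durations)
    day = 0
    while any(r > 0 for r in remaining):
        for i in range(len(remaining)):
            if remaining[i] > 0:
                day += 1
                remaining[i] -= 1
                if remaining[i] == 0:
                    finishes[i] = day
    return finishes


# Three hypothetical tasks of three days each.
tasks = [3, 3, 3]
sequential = finish_days_sequential(tasks)    # [3, 6, 9]
round_robin = finish_days_round_robin(tasks)  # [7, 8, 9]
```

Both approaches finish the last task on day 9, but the person waiting on the first task gets it on day 3 in the sequential case and on day 7 when the work is interleaved. From that person's perspective the shared resource was not instantly available.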

Dave tells the story of Doug the build engineer. Only configuration team members were allowed to build and deploy code into the test environment. Having project teams do this themselves had led to long outages due to conflicting changes. The build engineer was responsible for avoiding these outages by staying on top of the various teams' changes. Build engineers were assigned to projects. Since each project needed only a couple of hours of their time per week, they were often assigned to multiple projects and/or other responsibilities such as production maintenance work.

Doug would work on a build/deploy for an hour each morning and then apply patches and OS upgrades to production servers. While he was working in production all build requests would sit idle. Since Doug worked in a Kanban shop the impact was quickly visible - the requesting team would be blocked and testers, developers, and many others would be idled, often until the next morning.

Exploitation/Protection Actions

Work was backing up when he wasn’t available because the kanban limits were tightly defined. Because Doug was a source of variability in flow, the correct course of action was to place a buffer of work in front of Doug. The trick was to make this buffer big enough to allow flow to continue without making it so large that Doug would become a capacity-constrained resource.

They determined that Doug could handle up to 7 builds and deploys in the hour he had for builds each day. They created a buffer, called "build ready" on the Kanban board, with a limit of 7. This added 20% to work in progress but enabled flow. Later they realized that if they split Doug's build hour in half, 30 minutes in the morning and 30 in the afternoon, flow would be even better: a buffer of 2 or 3 limited the work-in-progress impact to 10%.
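The figures in this story are consistent with sizing the buffer to what Doug can clear in a single session. The sketch below works through that arithmetic. The baseline WIP of 35 is not stated in the summary; it is an assumption inferred from "a buffer of 7 added 20%", and the function names are hypothetical.

```python
BUILDS_PER_HOUR = 7        # from the story: up to 7 builds in Doug's hour
IMPLIED_BASELINE_WIP = 35  # assumption: 7 items being 20% implies 35 items


def buffer_for_session(session_hours):
    """Buffer size equal to what can be cleared in one build session."""
    return BUILDS_PER_HOUR * session_hours


def wip_increase(buffer_size, baseline_wip=IMPLIED_BASELINE_WIP):
    """Fraction by which the buffer raises total work in progress."""
    return buffer_size / baseline_wip


one_session = buffer_for_session(1.0)   # one 60-minute session per day
two_sessions = buffer_for_session(0.5)  # two 30-minute sessions per day

increase_one = wip_increase(one_session)
increase_two = wip_increase(two_sessions)
```

Under these assumptions a one-hour session calls for a buffer of 7 (a 20% WIP increase) and a half-hour session for 3.5 (10%). Rounding down to a limit of 3 gives about 8.6%, in line with the "2 or 3" and roughly 10% quoted in the story.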

As a general rule, when encountering non-instant availability problems, think about how to improve the availability. The ultimate is to turn a non-instant availability problem into an instantly available resource.

Subordination Actions

Steps one and two solved Doug's immediate problem. But since we've added WIP and lengthened lead time, we continue to ask questions:

  1. Was asking Doug to do three different functions the best alternative? The manager determined that the team liked the variety and liked being generalists, which gave the manager options and avoided other resource capacity constraints. Net: asking team members to focus only on builds was not the best option.
  2. Should we dedicate Doug to the builds and accept the idle time? Mitigating the risk of shared test environment downtime created the situation; is it worth hiring someone else to do Doug's other work? To determine this we looked at the "cost of delay" and determined that nothing these projects worked on introduced a strategically significant cost of delay.
  3. Other subordination alternatives, such as providing dedicated test environments or having the teams do their own builds, were considered but rejected. The team concluded that their continuing to provide the service made sense in the near term.
The solution at this point was tactical and effective but carried costs, for example, the additional WIP and lead time impacts.

Elevation Actions

Doug's team considered elevation actions as well:
  1. Could the process be automated? Yes, and the investment would be significant. Automation, configuration management, and project coordination capabilities would be needed. A team to build the system would be needed as well.
  2. Could the environments be virtualized? Yes, again with significant investment required, but this could mitigate the original risk by allowing projects to have their own environments. Virtualizing the shared environment could also enable it to be quickly re-established after any failure.

Automation was eventually implemented. It took around six months in total elapsed time, and two contractors for eight weeks. The total financial cost was around $60,000. However, the end result was that Doug was no longer required for builds, and builds were instantly available when developers needed them. At this point, it was possible to eliminate the buffer and reduce the system WIP. That, in turn, resulted in a slight reduction in lead time.

An Economic Model For Lean

Lean uses waste as a metaphor for activities, often necessary ones, that do not add value to the product. This confuses people, so David has adopted and refined an alternative representation: thinking of wastes as costs.

Describe Waste as a Cost

David classifies costs into three main abstract categories: Transaction Costs; Coordination Costs; and Failure Load.

Over time there are a number of value-added activities for an iteration or project. Surrounding those activities are transaction and coordination costs. The capacity for value-added activities can be displaced by work that can be considered failure load, that is, work that is either rework or demand placed on the system because of a previously poor implementation.
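One way to picture this model is as a simple capacity budget: whatever is not consumed by transaction costs, coordination costs, and failure load is left for value-added work. The sketch below is illustrative only; the class name and the hour figures are hypothetical and do not come from the book.

```python
from dataclasses import dataclass


@dataclass
class IterationCapacity:
    """Hours in one iteration, split by cost category (hypothetical numbers)."""
    total: float
    transaction: float   # setup and cleanup
    coordination: float  # communicating and scheduling
    failure_load: float  # rework and avoidable demand

    def value_added(self) -> float:
        """Hours left for new value-added work."""
        return self.total - self.transaction - self.coordination - self.failure_load

    def value_added_share(self) -> float:
        """Value-added hours as a fraction of total hours."""
        return self.value_added() / self.total


before = IterationCapacity(total=400, transaction=40, coordination=60, failure_load=100)
# Same iteration after halving failure load through better initial quality.
after = IterationCapacity(total=400, transaction=40, coordination=60, failure_load=50)
```

With these made-up numbers, halving failure load raises the hours available for new value from 200 to 250 without adding any people.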

Transaction costs for projects and iterations: Set up and Clean Up

  • Project Setup Costs
    • Project planning, resource planning
    • Recruitment, budgeting, estimating
    • Risk planning, communication planning
    • Facilities acquisition
    • Etc.
  • Project Cleanup or Back-end Costs
    • Delivery to the customer
    • Tear down of environments
    • Retrospectives, reviews
    • Audits, user training
    • Etc.
  • Iteration Setup Costs
    • Iteration planning and backlog selection (or requirements scoping)
    • Estimation
    • Budgeting
    • Resourcing
    • Environment setup
    • Etc.
  • Iteration Cleanup or Back-end Costs
    • Integration
    • Delivery
    • Retrospective
    • Environment tear down
    • Etc.
Coordination Costs
Coordination is necessary as soon as two or more people try to achieve a common goal together. Generally, coordination costs on projects are any activities that involve communicating and scheduling. 

Meetings that directly add value to the product or service are not a cost or waste. For example, meetings to:
  • Create a design
  • Complete a test
  • Do a piece of analysis
  • Write code
  • etc.
Examples of meetings that are coordination costs:
  • Status meetings
  • Daily stand ups
  • Task assignment meetings
  • etc.
Effective coordination is required because, without it, we will fail. However, it should be accomplished as efficiently as practical. When in doubt, ask yourself: “If this activity is truly value-adding, would we do more of it?”

Is correctly classifying the type of wasteful cost important? No. Is minimizing or eliminating such costs worthwhile? Yes. Does this mean we shouldn't have stand-ups? Not necessarily; ask whether the objectives of a stand-up can be fully met another way.

Central Point on Self-Organization

In Kanban, self-organization is facilitated by making the current circumstances and policies clear so that a person or a team can make good decisions without needing to engage (i.e. coordinate or meet with) a manager or other authority. Some decisions should only be made with the collaboration and support of others, e.g. a manager, product executive, or architect. Others can and should be made by a person on the team or by the team together. Kanban seeks to help differentiate the two types of decisions and to make the information needed for good decisions easier to get.

Failure Load

Failure load is demand generated by the customer that might have been avoided through higher quality delivered earlier. The customer may be the end customer or the next person in your value chain, e.g. a developer may omit the monitoring needed by the operations team. Examples include:

  • Bugs
  • Non-functional problems
    • unexpected performance issues
    • usability problems
  • Missed requirements 
Failure load doesn’t create new value; it enables value “left on the table” from a previous delivery. More than likely the earlier release of the product or service failed to deliver on its projected payoff function. Although some of this may be due to market variability or unpredictability, some of the shortfalls will have been discovered to come from problems with the earlier release.

So the picture is muddy. Failure load still adds value. But what is important is that it adds value that should have been there already. Reducing failure load reduces the opportunity cost of delay.

Reduced costs mean more profits sooner. Reduced failure load means more of the available capacity can be spent on new functionality. Reduced failure load enables a business to pursue more market niches, with more product offerings. Reduced failure load enables more options. Reduced failure load may enable a reduction in team size, and hence, a reduction in direct costs.
