Kanban Summary 7 of 7: Improve - Reduce Variability, Resolve Issues

Book summary of Kanban: Successful Evolutionary Change for Your Technology Business by David J. Anderson. Excerpted content in italics.

Preface/Review: Sources of Variability and Predictability (i.e. making dates)

Generally speaking, business owners and senior managers value predictability more than throughput. Predictability builds and holds trust, a core Agile value, better than does delivery of more with less reliability.

  • Variability results in more work-in-progress and longer lead times. 
  • Variability creates a greater need for slack in non-bottleneck resources in order to cope with the ebb and flow of work as its effects manifest on the flow of work through the value stream. (Slack is needed to make dates and meet SLAs.)
  • Variability in the size of requirements, and in the amount of effort expended on analysis, design, coding, testing, build integration, and delivery adversely affects the throughput of a process and the costs of running a software development value stream.

Types of Variability

Walter Shewhart pioneered the analysis of industrial process variability in the 1920s. The Toyota Production System, Six Sigma methods, and the Software Engineering Institute build on and extend his work to this day. Statistical Process Control (SPC) is the common quantitative toolset more generally applied to variability today. However, the use of SPC is considered an advanced and high-maturity topic that will be addressed in a later book. Here, we will talk about variation in the most general terms and its simplest form.

Shewhart classified variability and variations in process performance into two categories:
  • Internal sources of variation are under the control of the system in operation. Shewhart named these "chance cause" variations. "Common cause" is a prevalent synonym.
    • “Chance” implies that the variation is random and the randomness is a direct consequence of the system (process) design. It does not imply that the randomness is evenly distributed or follows a standard distribution. Changes to the process design via changes in internal policies will affect any variation’s mean, spread, and shape of distribution.
    • From the name - chance cause - we infer that while a specific single cause may be unclear a set of likely causes and opportunities to address them exist.
    • An internal, chance-cause variation would be the number of bugs created per line of code, per requirement, per task, or per unit of time. The mean number, the spread, and the distribution of the bug (or defect) rate can be affected by changing the tools and process, such as by insisting on unit tests, continuous integration, and peer code reviews.
  • External sources of variation are events that the immediate team or management cannot directly control. Shewhart used "assignable cause" to signify these. "Special cause" is a prevalent synonym.
    • External sources of variation require a different approach to managing them. They cannot be directly affected by policies, but a process can be put in place to deal effectively with external variations. The body of knowledge that relates directly to this field is issue and risk management.
    • By “assignable,” he implied that someone (or a group of people) could easily point at the source of the problem and consistently describe it— such as, “There was a storm. It rained really hard and our server farm was flooded.” Assignable-cause variations cannot be controlled by the local team or management but they can be predicted, and plans made and processes designed to cope with them gracefully.
What matters most is that the problem is visible and that we take action, not that we have precisely defined its source.

Internal Sources of Variability

The software development and project-management process in place, coupled with the organizational maturity and capability of the individuals on the team, determines the number of internal sources of variability and the degree of that variability.

To avoid confusion, Kanban must not be thought of as a software development lifecycle process or a project-management process. Kanban is a change-management technique that requires making alterations to an existing process: changes such as adding work-in-progress limits to it.

Work Item Size

Whether the work items are use cases or stories, decomposing requirements introduces chance-cause variation. London's XP community offers a quick case study. They had been using index cards to describe requirements; the size of the index card was intended to limit the size of the requirement. However, they experienced excessive variation: story completion time ranged from half a day to five weeks!

By changing their policy to follow the now commonly accepted story structure:
As a < user >, I want a < feature >, in order to < deliver some value >
one of the creators of this approach, Tim Mackinnon, reported to me in 2008 that he now had data to show that the average user story was 1.2 days of effort and the spread of variation was a half-day to about four days.

Work Item Type Mix

Different types of work frequently require different amounts of time to complete: for example, an Epic vs. a Story, a Use Case vs. a Requirement, or a Production Bug vs. an Internal Bug. Managing and measuring completion of these differently sized items as one type increases delivery variability, thus reducing predictability. By using techniques to identify different work item types, we can change the mean and spread of variability and improve the predictability in the system for any one type of work.

One strategy to improve predictability is to allocate total WIP capacity by type. 
For example, some XP teams have come to classify three types of stories: Epics, which may take several weeks to complete; Stories, which might typically take less than a week; and Grains of Sand, which a pair might complete in less than a day.
Consider a team with a Kanban board on which there is a limit of two epics, eight regular stories, and four grains of sand. Two epics are in progress. A slot opens up in the queue for a regular story but there are none in the immediate backlog ready to start. The team has a choice of starting an epic or a grain of sand, or sticking to the type allocation and incurring some idle time.  
If they start an epic, and a few days later a regular story shows up in the backlog, they would be unable to start the regular story for quite a while. This will increase the lead-time spread for regular stories. Starting a smaller grain of sand is a better choice, as it might be finished before another regular story is ready to start. In that case, there is no impact, but there is a benefit from additional throughput. However, if they don’t get lucky and they fail to complete the smaller item before a story is ready to start, then the lead-time spread for regular stories will be affected adversely, though not as badly as in the epic scenario.
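
The strict type allocation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `Board` class, its method names, and the in-progress counts are hypothetical and not from the book; only the limits (two epics, eight stories, four grains of sand) come from the example.

```python
from dataclasses import dataclass, field

# Per-type WIP limits taken from the example above.
WIP_LIMITS = {"epic": 2, "story": 8, "grain": 4}


@dataclass
class Board:
    """Hypothetical board that enforces one WIP limit per work item type."""
    limits: dict
    in_progress: dict = field(default_factory=dict)  # type -> items in progress

    def free_slots(self, item_type: str) -> int:
        return self.limits[item_type] - self.in_progress.get(item_type, 0)

    def can_pull(self, item_type: str) -> bool:
        # Strict policy: an item may start only in a free slot of its own type.
        return self.free_slots(item_type) > 0

    def pull(self, item_type: str) -> None:
        if not self.can_pull(item_type):
            raise ValueError(f"WIP limit reached for {item_type}")
        self.in_progress[item_type] = self.in_progress.get(item_type, 0) + 1


# Assumed state: both epic slots and all grain slots are in use, one story slot is open.
board = Board(limits=WIP_LIMITS, in_progress={"epic": 2, "story": 7, "grain": 4})
print(board.can_pull("story"))  # True
print(board.can_pull("epic"))   # False
```

Under this strict policy the open story slot stays idle until a story is ready. Letting a grain of sand use it would be a deliberate policy change, with the lead-time trade-off described above.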

Class of Service Mix

Manage classes of service in the same way as work item types: allocate WIP capacity by class of service. 
For example, consider a team with a WIP limit of 20, allocated as 4 fixed-date items, 10 standard-class items, and 6 intangible-class items. You can have a policy that these limits must be strictly adhered to, or you can loosen the rule and allow a standard or intangible item to fill a slot for a fixed-date item when there is insufficient seasonal demand for fixed-date items. These policies can be switched over at different times of year to improve the overall economic outcome and ensure that the system remains fairly predictable.
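
The two policies can be compared with a small sketch. The function names and the `allow_borrowing` flag are assumptions made for illustration; the allocation of 4, 10, and 6 slots comes from the example above.

```python
# Allocation from the example above: 20 WIP slots split by class of service.
ALLOCATION = {"fixed_date": 4, "standard": 10, "intangible": 6}


def borrowed_slots(in_progress: dict) -> int:
    """Fixed-date slots currently occupied by standard or intangible items."""
    return sum(
        max(0, in_progress.get(cls, 0) - ALLOCATION[cls])
        for cls in ("standard", "intangible")
    )


def can_start(item_class: str, in_progress: dict, allow_borrowing: bool) -> bool:
    """Decide whether an item of item_class may start.

    allow_borrowing=False models the strict policy. True models the loosened
    policy, where standard or intangible items may fill idle fixed-date slots.
    """
    spare_fixed = (
        ALLOCATION["fixed_date"]
        - in_progress.get("fixed_date", 0)
        - borrowed_slots(in_progress)
    )
    if item_class == "fixed_date":
        return spare_fixed > 0
    if in_progress.get(item_class, 0) < ALLOCATION[item_class]:
        return True
    return allow_borrowing and spare_fixed > 0


# Hypothetical off-season state: only one fixed-date item is in progress.
state = {"fixed_date": 1, "standard": 10, "intangible": 6}
print(can_start("standard", state, allow_borrowing=False))  # False
print(can_start("standard", state, allow_borrowing=True))   # True
```

In this sketch a borrowed slot is unavailable to fixed-date work until the borrowing item finishes, which is the cost of the loosened policy.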


Rework

Rework, whether from internal bugs being fixed before release or production defects displacing new customer-valued work, affects variability. If defects occur at a predictable rate and are consistently sized, then the system can be tuned to handle them gracefully. This is often not the case. Unplanned rework due to bugs lengthens lead times, tends to increase the spread of variation, and greatly reduces throughput.

The best strategy for reduction of variability due to defects is to relentlessly pursue high quality with very low defect counts. 

Making changes to the software development lifecycle process can greatly affect defect rates. Use of peer reviews, pair programming, unit tests, automated testing frameworks, continuous (or very frequent) integration, small batch sizes, cleanly defined architectures, and well-factored, loosely coupled, highly cohesive code design will greatly reduce defects. Changes that directly affect defect rates and indirectly improve the predictability of the system are directly under the control of the local management and the team.

Irregular Flow - Internal Source Conclusion

Kanban avoids irregular flow, i.e. it avoids letting the time needed to complete most work become unpredictable. Irregular flow is caused by internal and external sources. Predictability breeds trust.

When rigorously followed, the various work-in-process limits handle the randomness of items and classes of items of different sizes, risk profiles, etc. However, the greater their variability, the more buffering is used. More buffering causes higher work in process. The higher the work in process rises, the longer it takes work to flow through the system. Trading somewhat longer lead times for predictability is generally the more desirable outcome, as managers, owners, and usually customers value predictability over the random chance of a shorter lead time or greater throughput.
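
The link between work in process and lead time is commonly expressed with Little's Law, which for a stable system states that average lead time equals average WIP divided by average throughput. The sketch below illustrates this with made-up numbers; the throughput of 5 items per week is an assumption, not a figure from the book.

```python
def average_lead_time(avg_wip: float, avg_throughput: float) -> float:
    """Little's Law for a stable system: lead time = WIP / throughput."""
    if avg_throughput <= 0:
        raise ValueError("throughput must be positive")
    return avg_wip / avg_throughput


# Hypothetical team finishing 5 items per week on average.
for wip in (10, 20):
    weeks = average_lead_time(wip, 5.0)
    print(f"WIP {wip}: average lead time {weeks} weeks")
```

Doubling the buffered work in process doubles the average lead time if throughput stays the same.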

External Sources of Variability

External sources of variability come from places that are not directly controlled by the team's software development process or project management method. Examples include server failures, environmental outages, power outages, other teams 

Requirements Ambiguity

Poorly written requirements, ill-defined business plans, and lack of strategic planning, vision, or any other context-setting information may mean that a team member is unable to make a decision and therefore unable to complete a piece of work. A work item becomes blocked due to this inability to make a decision; new information is required to clarify the situation so that the team member can make a good-quality decision, allowing the work-in-progress to flow toward completion. 

In order to reduce the impact of such blockages, the team and direct management need to implement an effective issue-management and resolution process.

Requirements ambiguity, like many other external sources, can be influenced but not controlled. For example:

At Corbis in 2007, this was achieved through a gradual process. First, the kanban system was implemented, including a visual board, an electronic tracking system, and the transparency that comes with that. The business became more and more involved and interested in the software development activity and in monitoring the process performance. A report was generated showing the number of open issues, the number of work items blocked, and the average time to resolve. (See Figure 12.6, the Issues and Blocked Work Items report). 

When a requirement made it the whole way through to acceptance testing before it was rejected as not what the business really needed, the team reacted by creating a waste bin on the board and placing the ticket in it, as shown in Figure 19.1. Management then asked for a small set of electronic reports that showed work that had entered the system but had failed to make it the whole way through (Figure 19.2).

The combination of transparency, reporting, and building awareness of the impact and cost of poor requirements resulted in the business voluntarily changing its behavior.

Expedite Requests

Expedite requests happen because of external events, such as an unexpected customer order, or due to some breakdown in a company’s internal process, for example, a lack of communication that results in the late discovery of some important requirement.

Expediting is known in industrial engineering to be bad. It affects the predictability of other requests. It increases mean lead time and the spread of variability and it reduces throughput. Evidence collected at Corbis throughout 2007 demonstrated that this industrial engineering result held true for software development processes: Expediting is undesirable even if it is being done to generate value.

Ideally, application of the Kanban system to a process makes the impact of expedite requests on delivery clear and motivates strict limits to be set for them. Over time they should be eliminated.

Environment Availability

Environment availability is a very common assignable-cause variation.

Irregular Flow - External Source Conclusion

Teams see external sources of variability as blockers. Reliance on environments and on shared specialist resources such as DBAs and system and deployment engineers is a typical source of blockage. Sadly, organizations often lack an effective issue and risk management capability.

There are two fundamental approaches to smoothing the irregular flow that blocked items create:

  1. Define higher work in process limits and passively accept longer lead times with less predictability. 
    • Fits immature organizations in domains that can absorb the cost and schedule implications.
    • Teams experience some of the Kanban system's benefits: reduced variability from internal sources, improved quality, transparency, self-organization, etc.
    • The pressure is off and the catalytic effect of Kanban is lost. Organizational improvements are never discussed, so the dysfunction remains.
  2. Maintain tight work in process limits, keep buffer sizes low and actively resist the longer lead times and lower predictability.
    • Pursue issue management and resolution relentlessly and, as the team matures, move toward root-cause analysis and elimination with specific improvements designed to prevent assignable-cause variations in the future.
    • Fits more mature organizations and domains that cannot absorb the cost and schedule implications.
    • Create the conditions e.g. capacity, shared objectives, etc. to support the cross team collaboration needed to identify and address root causes. 
    • The pressure is on the organization and the catalytic effect of Kanban is realized.
There is now sufficient evidence to suggest that Kanban does provoke a culture that is focused on continuous improvement. The consistent process elements among the examples seem to be a willingness to enforce tight WIP policies, to mark work as blocked, to allow the line to stop, to incur idle time, and to pursue issue management and resolution as an organizational discipline.

Issue Management and Resolution

The convention in Kanban is to indicate blocked items on a card wall by attaching an Issue card, a sticker, a border, or any other cue that makes the blocked item obvious and hard to ignore. Select and configure electronic systems so that blocked items and their corresponding issues are easy to see and manage and hard to ignore. Making them visible is easy, but building an organizational capability to take action, conduct root-cause analysis, and resolve blockers is not.

Managing Issues

    • Use an Issue work item type and treat it as a first-class work item
      • Attach it to the blocked item on the card wall (See figure 20.1)
      • Associate it to the blocked item in the electronic tracking tool
    • Track the blocked item. 
      • Minimum required information includes:
        • Reason for blockage
        • Start and end date
        • Link to impacted work items
      • Additional information might include:
        • Resolution history of who has done what so far including the escalation path followed
        • Estimated time to resolution
        • Impact assessment
        • Suggested root-cause fixes for future prevention
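
The tracked fields listed above map naturally onto a small record type. The following is a minimal sketch, assuming an in-house tracker; the `Issue` class, its field names, and the sample ticket ID are hypothetical and not taken from the book or from any particular tool.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Issue:
    # Minimum required information
    reason: str
    start_date: date
    blocked_item_ids: List[str]          # links to the impacted work items
    end_date: Optional[date] = None      # None while the issue is unresolved
    # Additional information
    resolution_history: List[str] = field(default_factory=list)
    estimated_resolution: Optional[date] = None
    impact_assessment: str = ""
    suggested_root_cause_fixes: List[str] = field(default_factory=list)

    @property
    def is_open(self) -> bool:
        return self.end_date is None

    def days_blocked(self, today: date) -> int:
        """Days from start to resolution, or to today if still open."""
        end = self.end_date or today
        return (end - self.start_date).days


# Hypothetical example record.
issue = Issue(
    reason="Test environment unavailable",
    start_date=date(2024, 3, 4),
    blocked_item_ids=["STORY-12"],
)
print(issue.is_open, issue.days_blocked(date(2024, 3, 8)))  # True 4
```

Keeping the start and end dates on every issue is what makes the time-to-resolve and trend reports possible later.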

      Example Issues:

      • Ambiguity in the requirements and a knowledgeable person is not available to instantly resolve the ambiguity.
      • An environment setup is required and an engineer to perform this task is currently unavailable
      • A specialist is required to work on the item and that person is unavailable due to vacation, sickness, or other out-of-office time.
      Kanban emphasizes team self-organization based on visible information and policies for making the right decisions about that information, e.g. who is working on what kind of work item and why. This enables stand-ups to focus more on maintaining flow and addressing blocked work items.

      Questions should be asked about who is working on resolving the issue and the status of progress on resolution. Does the issue need to be escalated? If so, to whom? Idle team members should be encouraged to volunteer to track down issues and generally to swarm on problems and assist however they can to resolve them and restore flow to the system. 
      A team with strong self-organization capability will tend to do this naturally. Team members will volunteer to help resolve issues. Where that self-organization capability has yet to emerge, the project manager may need to assign team members to work issues to resolution.

      Escalating Issues

      When the team is unable to resolve an issue on its own, or an external party is required to resolve an issue and is unavailable or unresponsive, the issue must be escalated to a more senior manager or other department. An ineffective escalation policy/process means flow will frequently be impeded and value lost.

      The basis of sound issue escalation is a collaboratively defined process and policy set. The value stream participants must:
      • Make the process simple and unambiguous. 
      • Ensure anyone working in the stream:
        • Is aware of, has easy access to, and is aligned with the process.
        • Is committed to following the process.
      By taking the time to define escalation paths and write policy around them, the team knows where to send issues for resolution. This saves time figuring out to whom an issue should be escalated, and it sets expectations for those more senior individuals that they are expected to be a part of the process. Senior managers need to take responsibility for resolving issues. This will help to maintain flow and ultimately to minimize cost of delay (or optimize payoff from speedy delivery).

      Tracking and Reporting Issues

      Even though pink tickets on the card wall provide a strong visualization of how many items are currently blocked, it is also useful to track and report issues in other ways. A cumulative-flow diagram of issues and blocked work items provides a strong visual indicator of the organizational capability at issue management and resolution. 

      The trend in blocked work items over time indicates whether a capability of root-cause analysis and resolution— improvement opportunities to eliminate assignable-cause variations— is developing.
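
The numbers behind such a report can be derived from issue start and end dates alone. The sketch below is illustrative only: the function names and the sample dates are made up, and a real report would read the data from the tracking tool.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

# Each issue is (opened, resolved); resolved is None while still open.
IssueSpan = Tuple[date, Optional[date]]


def open_issue_trend(
    issues: List[IssueSpan], first: date, last: date
) -> List[Tuple[date, int]]:
    """Count the issues open on each day from first to last, inclusive."""
    trend = []
    day = first
    while day <= last:
        count = sum(
            1
            for opened, resolved in issues
            if opened <= day and (resolved is None or resolved > day)
        )
        trend.append((day, count))
        day += timedelta(days=1)
    return trend


def average_days_to_resolve(issues: List[IssueSpan]) -> Optional[float]:
    """Mean days from open to resolution over resolved issues, or None."""
    durations = [
        (resolved - opened).days
        for opened, resolved in issues
        if resolved is not None
    ]
    return sum(durations) / len(durations) if durations else None


# Hypothetical sample data.
sample = [
    (date(2024, 3, 1), date(2024, 3, 4)),
    (date(2024, 3, 2), None),
    (date(2024, 3, 3), date(2024, 3, 4)),
]
for day, count in open_issue_trend(sample, date(2024, 3, 1), date(2024, 3, 4)):
    print(day.isoformat(), count)
print(average_days_to_resolve(sample))  # 2.0
```

A falling daily count together with a shrinking average time to resolve would suggest that the resolution capability is maturing.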

      These reports should be presented at each operations review and time should be set aside to discuss the emergence and maturity of organizational capability of issue management and resolution and root-cause analysis and resolution. 

      The organization should be aware of the failure-load impact of blocking issues. This will enable objective decisions about improvement opportunities and the likely benefits of investment in root-cause fixes to prevent special-cause variations. 
