Getting Beyond Operator Error: Using systems to analyse events

copyright American Avalanche Association


(my article as it appeared in the February 2014 issue of The Avalanche Review)

The emerging systems based approach to risk management planning has altered the way we conceive, organize and implement risk systems. Many high risk industries have incorporated systems based risk management to analyze and understand critical events beyond the default causes of inherent risk and operator error. This paper introduces a systems approach to looking beyond operator error and understanding the latent and organizational causes of events and accidents. While my own perspective and this model’s assumptions are based on a guide/operator within an organizational setting, ‘organization’ can be interpreted at the widest level: recreational groups, ski areas, or events are a form of ‘organization’ beyond the typical guide-for-hire or backcountry program.

Operator error:

“Human error is a consequence, not a cause.” Reason (1997)
Mountain guiding belongs to a small group of industries in which both ‘production’ and ‘protection’ lie in the hands of a sole operator. The guide is responsible for creating and delivering a backcountry experience while at the same time overseeing and balancing the safety and protection of clients. There is continual tension between these two poles, and in some cases outright conflict. Given that purposeful exposure to risk is the defining feature of an adventure activity, production involves seeking risk while protection requires insulation from it. For any specific event, the balance between positive exposure (production) and negative exposure (too little protection) is open to interpretation. In hindsight, it is easy to second-guess the operator’s on-the-spot balance between the two. It takes the right combination of small errors, at a particular time, to cascade into a large-scale crisis. In hindsight, these factors become errors (Weick, 1990).

Writes industrial psychologist James Reason: “Human fallibility, like gravity, weather or terrain, is just another foreseeable hazard...” (Reason, 1997). He continues, “... The issue is not why an error occurred but how it failed to be corrected.”

Why we blame the guide:

When something goes wrong, the spotlight historically shines almost exclusively on the hazard at hand and on the individual’s actions and decisions in the moments or events leading up to encountering it. Underlying this is the assumption of the ‘fallible guide’ – somewhere, someone made a mistake, and by dissecting the event an error or cause will be found.

On top of this, there are predictable psychological factors at play. Consider Attribution Error, where people tend to blame the person over circumstance (Ross & Nisbett, 1991); Confirmation Bias, which is the tendency to match a situation with what is already suspected or known (Reason, 2001); or Hindsight Bias, where retrospective connections seem obvious that might not have been visible at the time (Hoffrage, Hertwig & Gigerenzer, 2000).

Regardless of human tendency and a history predisposed to blame the operator, program managers attempt to devise systems, policies and procedures that will prevent error, or at the least minimize it. Systems based risk planning represents the most sophisticated form of this to date. But consider:
“While the probability of operator error can often be reduced, there is no evidence whatever that it can be eliminated altogether... Human errors are fundamentally ‘caused’ by human variability, which cannot be designed away.” Ayres and Rohatgi (1987)

Understanding errors:

The field of error management recognizes two types of errors: active and latent (Table 1). Active errors are the immediate, guide based slips, lapses, and mistakes – the “sharp end”  (Reason, 1990) of a risk event. But Perrow (1999) cautions:  “Be suspicious of operator error…” as it is often the easy target in an unclear scenario. He claims 60-80% of system errors are blamed on the operator.

System errors are considered latent errors: dormant, long-term conditions that set the stage for any number of unconnected active errors. Latent errors are the “blunt end” of a risk event, and could include anything from poor equipment design to bad management decisions, poor planning, communication difficulties, or legislative or regulatory failure. Latent errors are created by the system that hosts them and are difficult to detect, since the ‘active’ and visible portion of the risk event usually takes the focus. What is more, the current ‘objective hazard + subjective hazard + unsafe act’ formula does not look for these latent, background contributors.

Active errors                            Latent errors
Guide slips, lapses, mistakes            Dormant, long-term conditions
‘sharp end’                              ‘blunt end’
Focus of trigger/event based RM          Focus of systems based RM

Table 1: Active and latent errors

Writes James Reason (1990) in Human Error, “There is a growing awareness… [that to] discover latent failures is the best means of limiting [active] error.” Mountain guides inherit the system defects and latent errors that set them up for active errors: staffing decisions, logistics restrictions, client screening, continuing down a possibly long list. While it is the guide who pulls the trigger, so to speak, it is the organization that put the gun in their hand.
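
As a rough illustration only – not part of the published framework – the distinction in Table 1 can be sketched as a simple data model. The class names and example factors below are hypothetical:

    from dataclasses import dataclass
    from enum import Enum

    class ErrorType(Enum):
        ACTIVE = "active"   # guide slips, lapses, mistakes - the 'sharp end'
        LATENT = "latent"   # dormant, long-term conditions - the 'blunt end'

    @dataclass
    class ContributingFactor:
        description: str
        error_type: ErrorType
        source: str         # e.g. "guide", "management", "equipment design"

    # A single critical event typically collects one or two active errors
    # and a longer list of latent conditions sitting behind them.
    factors = [
        ContributingFactor("missed wind loading above the up-track",
                           ErrorType.ACTIVE, "guide"),
        ContributingFactor("no formal client screening process",
                           ErrorType.LATENT, "management"),
        ContributingFactor("tight schedule left no time to reassess",
                           ErrorType.LATENT, "logistics"),
    ]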

Using systems to understand and analyse critical events:
A systems based approach to understanding critical events is based on the premise that “Human error is a consequence, not a cause” (Reason, 1997). It incorporates the operator’s contributing actions (active error) within a greater context of social, organizational, and latent factors (Figure 1).

Step 1: Understanding what happened


Figure 1: Systems based approach to understanding critical events
Understanding what happened precedes any deeper analysis, and includes actions, decisions, conversations and events both leading up to and following the critical event. This may prove deceptively difficult, given the subjective nature of human memory, especially when challenged with confusing, complex and stressful situations (Hoffrage, Hertwig & Gigerenzer, 2000).

This step also deals with the active error, but steers away from blame and towards what is known as ‘sensemaking’. Rather than looking for bad people making poor decisions (operator error), sensemaking tries to understand how good people attempted to make sense of a situation and enacted what they likely thought was the best course of action given their understanding of it (Weick, 1988). This particular step is not the focus of this paper.

Step 2: Substitution test

The substitution test is an important lens through which to assess an event. It defines an event as either a true operator error situation, or one involving latent factors (Johnston, 1995). The substitution test asks this question:
‘Given how events unfolded and were perceived in real time, is it likely that a new individual, with the same training and experience, would have behaved any differently?’
If the answer is an honest ‘yes’ (accounting for the hindsight bias and attribution error mentioned previously) – that is, a similar person would not have behaved the same way – then the event could be considered primarily an operator error situation: a slip, lapse, or mistake. In such a case, driving toward ‘why’ yields little information to improve safety or prevent a similar event, and the investigation can end here.

If the answer is ‘no’ – a similar person would likely have behaved in the same way – then latent conditions played some role in causing the event. The substitution test implies that if the scenario were to present itself again, another individual would respond in the same way. These latent conditions are explored next.
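
For readers who like the logic spelled out, the substitution test reduces to a single branch. The sketch below is a minimal, hypothetical rendering; the function name and return strings are mine, not part of the model:

    def substitution_test(would_behave_differently: bool) -> str:
        """Classify an event using the substitution test question: given how
        events unfolded and were perceived in real time, is it likely that a
        new individual, with the same training and experience, would have
        behaved any differently?"""
        if would_behave_differently:
            # A comparable guide would not have acted the same way:
            # treat as primarily operator error (slip, lapse, mistake).
            return "operator error - investigation can end here"
        # A comparable guide would likely have done the same thing:
        # latent conditions played a role; continue to Steps 3 and 4.
        return "latent conditions involved - examine group and organizational factors"

    print(substitution_test(would_behave_differently=False))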

Step 3: Group contribution

The first layer of latent conditions is the social interactions that directly or indirectly steer action, decisions and sensemaking in the moment. This layer is rich in explanatory power, although it is difficult to access given the complex nature of social groups. Primarily these interactions revolve around authority and role definition, and the assumptions and expectations they create. A guiding situation is influenced by the organization and its management/supervisory structure, while recreational groups are subject to much looser assumptions regarding expertise and leadership. This analysis can also extend to team functionality, peer pressure and group interaction. Human factors have been introduced into the avalanche world as a means of addressing these, but they are only the tip of the iceberg. These particular interactions are not the focus of this paper, but readers are directed to the work of Snook (2000) and his analysis of group interaction as a latent cause in one particular case.

Step 4: Organizational factors

Key organizational processes and factors form the base layer of potential latent errors and causes of events. Any of these may be perfectly functional in ‘normal’ conditions, but can prove to be poorly conceived, implemented or supervised when faced with an abnormal situation or when combined in unforeseen ways (Perrow, 1999). Organizational factors with the most potential for latent errors are briefly introduced below.

4.1   Risk tolerance

Risk tolerance is the articulated limits on the nature and magnitude of hazards and uncertainty to which an organization will expose its clients, its staff and itself. Best when explicitly stated, it can also be inferred from program parameters and the exposure limits inherent in the organization’s chosen activities or operating environments.

As an analysis tool, the guide’s sensemaking and contributing actions are read as a reflection of their understanding of the organization’s risk tolerance; any discrepancies here need to be examined. It is important to note that a written risk tolerance statement serves little use if it conflicts with the actual risk culture of the organization (its true risk tolerance). The prevalence of a culture of safety (versus production), where management chooses to direct its attention, and where money gets spent are all signals the guide interprets in forming their own understanding of risk tolerance.


Social groups in a recreational setting also have a risk tolerance, although it is heavily skewed by individual target levels of risk and by social dynamics. This social risk tolerance is best examined as a group contribution in Step 3, above.

4.2   Core process map

Systems based risk management planning is organized around a core process: the central interactions that produce the programs, trips or services the organization offers (Figure 2). As it is the focal point of systems based planning, analyzing the core process map in detail looks for gaps, failures or inadequate system performance standards that may have created latent conditions. In effect, this asks the question ‘Did everything perform as it is supposed to?’ Answering no shows a clear breakdown, but even with the answer yes there is a follow-up question: is the current vision of how it is supposed to work good enough? ‘Good enough’ needs to be related to the organization’s risk tolerance, performance standards and expectations, sense of values, and industry standards. This continues by assessing the seven systems (below).

Ski areas or events that inadvertently host BC skiing raise interesting questions at this point. What kind of expectations were set up in advance? What messages were being communicated to potential participants? While the core process as envisioned here revolves around a commercial contract, a similar process can be imagined which generates social expectations or an inadvertent duty of care.

4.3 Seven systems analysis

Risk management planning is, at its heart, systems planning: systems are turned into processes and routines, and these routines ensure that system and organizational targets are met. At this step, those systems and routines are examined in light of the event, the guide’s sensemaking and contributing actions, the assessment of risk tolerance, and the basic interactions of the core process. This examination looks for more subtle or sophisticated interactions, and detailed system maps make these points apparent.
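
A minimal sketch of this examination, assuming a hypothetical process/system map with invented element names, checks whether each element performed as designed and, if it did, whether the design itself is good enough:

    # Hypothetical process/system map; element names and results are invented
    # purely to illustrate the two-level check described in the text.
    core_process = {
        "client screening":    {"performed_as_designed": True,  "meets_standard": False},
        "trip planning":       {"performed_as_designed": True,  "meets_standard": True},
        "field communication": {"performed_as_designed": False, "meets_standard": True},
    }

    for step, result in core_process.items():
        if not result["performed_as_designed"]:
            print(f"{step}: clear breakdown - a likely latent condition")
        elif not result["meets_standard"]:
            print(f"{step}: performed as designed, but the design falls short of "
                  f"the organization's risk tolerance or industry standards")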


Latent errors need not be a single thing; they can be a combination of well-intentioned, normally adequate system or operational structures.

4.4 Operational features

If systems provide the structure, organizations adapt them to their own needs. As operations grow and evolve, certain operational features may lend themselves to latent errors (Perrow, 1999; Reason, 1997).

Coupling refers to the amount of slack or free space in an operation or activity. A tightly scheduled, high-volume, highly efficient operation is more at risk of error, for the simple fact that there is less time to correct errors before they compound. Small errors cascade quickly in an environment where things happen quickly – the typical BC setting. Inserting slack into an operation is always a good idea when it comes to preventing errors, but it is directly at odds with efficiency (an example of the conflict between production and protection).
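
As a back-of-the-envelope illustration (the numbers are invented), the coupling trade-off can be made concrete with simple arithmetic on a day plan:

    # Illustrative only: a tightly coupled day plan leaves little time to
    # catch and correct small errors before they cascade.
    available_daylight_min = 9 * 60         # hypothetical winter field day
    planned_activity_min = 8 * 60 + 30      # travel, transitions, objectives
    slack_min = available_daylight_min - planned_activity_min

    # Thirty minutes of slack for the whole day: one route-finding error or
    # slow transition consumes the buffer, and every later decision is made
    # under time pressure - production squeezing out protection.
    print(f"Slack in the plan: {slack_min} minutes")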

Operational consistency, supervisory and management models, and complexity creep all play a role. Critical incident experience is an indicator of future individual and system resiliency: a system that has been tested is more predictable than one that hasn't been (even if it failed the first time). Individuals within it, and the system itself, will have experience recognizing what failure looks like, and can either predict and prevent it or effectively manage it before it escalates (Jackson, 2009).

Within this, individual experience at failure level is good for error prevention. Training above and beyond normal operating levels (to the point of failure) builds an understanding of where the edge lies, and how events unfold there. The point is to be able to recognize when failure is near, and have the ability to make sense of a critical situation as it unfolds.
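
Pulling the steps together, the overall flow of the analysis might be sketched as follows. Every function below is a placeholder stub returning invented example findings – a sketch of the sequence, not a working tool:

    # Hypothetical skeleton of the analysis flow described in this article.
    def reconstruct_what_happened(event):
        return {"event": event}                                   # Step 1: sensemaking, not blame

    def substitution_test_implicates_system(account):
        return True          # Step 2: 'a similar guide would have done the same'

    def examine_group_contribution(account):
        return ["unclear leadership and role assumptions"]        # Step 3

    def examine_organizational_factors(account):
        return ["risk tolerance never made explicit",             # Step 4.1
                "core process gap: no client screening",          # Step 4.2 / 4.3
                "tight coupling: no slack in the daily schedule"] # Step 4.4

    def analyse_event(event):
        account = reconstruct_what_happened(event)
        if not substitution_test_implicates_system(account):
            return ["primarily operator error (slip, lapse, mistake)"]
        return examine_group_contribution(account) + examine_organizational_factors(account)

    print(analyse_event("hypothetical near miss on day 3"))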

Conclusion:

This article provides a systems approach to looking beyond operator error and understanding the latent and organizational causes of events and accidents. This analysis framework examines the operator’s contributing actions, but also looks at group and system contributions. From a systems perspective, risk tolerance, the core process and system maps provide concrete points of examination, as do operational factors such as coupling and supervisory models. This systems based analysis model can be applied to critical and non-critical events, and to different program and organizational structures.
Reason (1997) writes, “We cannot change the human condition; people will always make errors.” He continues, however, to assert, “We can change the conditions under which they work and make unsafe acts less likely.” Understanding the system and operational factors that contribute to latent errors is how we make progress in minimizing them.

Bibliography:

Hoffrage, U., Hertwig, R., & Gigerenzer, G. (2000). Hindsight bias: A by-product of knowledge updating? Journal of Experimental Psychology: Learning, Memory and Cognition, 26(3), 566-581.
Jackson, J. (2009). SCIRA: A risk system management tool. Proceedings of the 2009 Wilderness Risk Management Conference. http://www.nols.edu/wrmc/resources.shtml
Jackson, J., & Heshka, J. (2010). Managing Risk: Systems Planning for Outdoor Adventure Programs. Direct Bearing Inc., Palmer Rapids, ON.
Johnston, N. (1995). Do blame and punishment have a role in organizational risk management? Flight Deck, Spring 1995.
Perrow, C. (1999). Normal Accidents: Living with High-Risk Technologies. Princeton University Press, Princeton, NJ (reprint of 1984 Basic Books edition).
Reason, J. (1990). Human Error. Cambridge University Press, New York, NY.
Reason, J. (1997). Managing the Risks of Organizational Accidents. Ashgate, Aldershot, England.
Reason, J. T. (2001). Understanding adverse events: The human factor. In C. Vincent (Ed.), Clinical Risk Management: Enhancing Patient Safety (2nd ed., pp. 9-30). BMJ Books, London.
Ross, L., & Nisbett, R. E. (1991). The Person and the Situation: Perspectives of Social Psychology. McGraw Hill, New York, NY.
Snook, S. (2000). Friendly Fire: The Accidental Shootdown of U.S. Black Hawks over Northern Iraq. Princeton University Press, Princeton, NJ.
Weick, K. (1988). Enacted sensemaking in crisis situations. Journal of Management Studies, 25(4).
Weick, K. (1990). The vulnerable system: An analysis of the Tenerife air disaster. Journal of Management, 16(3).