Getting Beyond Operator Error: Using
systems to analyse events
(my article as it appeared in the Feb. 2014 issue of The Avalanche Review)
The emerging systems based approach to risk
management planning has altered the way we conceive, organize and implement
risk systems. Many high risk industries have incorporated systems based risk
management to analyze and understand critical events beyond the default causes
of inherent risk and operator error. This paper introduces a systems approach
to looking beyond operator error and understanding the latent and
organizational causes of events and accidents. While my own perspective and
this model’s assumptions are based on a guide/operator within an organizational
setting, ‘organization’ can be interpreted at the widest level: recreational
groups, ski areas, or events are a form of ‘organization’ beyond the typical
guide-for-hire or backcountry program.
Operator error:
“Human error is a
consequence, not a cause.” Reason (1997)
Mountain guiding belongs to a small group
of industries in which both ‘production’ and ‘protection’ lie in the hands of a
sole operator. The guide is responsible for creating and delivering a backcountry experience while at the same time overseeing and balancing the safety and protection of clients. There is continual tension between these two poles, and
in some cases outright conflict. Given the purposeful exposure to risk as the
defining feature of an adventure activity, production involves seeking risk
while protection requires insulation from it. For any specific event, the
balance between positive exposure (production) and negative exposure (too
little protection) is open to interpretation. In hindsight, it is easy to second-guess the operator’s on-the-spot balance between the two. It takes the right combination of small errors, at a particular time, to cascade into a large-scale crisis; only in hindsight do these factors become errors (Weick, 1990).
Writes industrial psychologist James Reason, “Human fallibility, like gravity, weather or terrain, is just another foreseeable hazard...” (Reason, 1997). He continues, “...The issue is not why an error occurred but how it failed to be corrected.”
Why we blame the guide:
When something goes wrong, the spotlight historically shines almost exclusively on the hazard at hand and on the individual’s actions and decisions in the moments or events leading up to encountering it. The underlying assumption is the idea of the ‘fallible guide’ – somewhere, someone made a mistake, and by dissecting the event an error or cause will be found.
On top of this, there are predictable
psychological factors at play. Consider Attribution Error, where people tend to
blame the person over circumstance (Ross & Nisbett, 1991); Confirmation
Bias, which is the tendency to match a situation with what is already suspected
or known (Reason, 2001); or Hindsight Bias, where connections that might not have been visible at the time seem obvious in retrospect (Hoffrage, Hertwig & Gigerenzer, 2000).
Regardless of human tendency and a history
predisposed to blame the operator, program managers attempt to devise systems,
policies and procedures that will prevent error, or at the least minimize it.
Systems based risk planning represents the most sophisticated form of this to
date. But consider:
“While
the probability of operator error can often be reduced, there is no evidence
whatever that it can be eliminated altogether... Human errors are fundamentally
‘caused’ by human variability, which cannot be designed away.” Ayres and
Rohatgi (1987)
Understanding errors:
The field of error management recognizes
two types of errors: active and latent (Table 1). Active errors are the
immediate, guide based slips, lapses, and mistakes – the “sharp end” (Reason,
1990) of a risk event. But Perrow (1999) cautions: “Be suspicious of operator error…” as it is
often the easy target in an unclear scenario. He claims 60-80% of system errors are blamed on the
operator.
System errors are considered latent errors: dormant, long-term conditions that set the stage for any number of unconnected
active errors. Latent errors are the “blunt end” of a risk event, and could
include anything from poor equipment design, bad management decisions, poor
planning, communication difficulties, or legislative or regulatory failure.
Latent errors are created by the system that hosts them and are difficult to
detect, since the ‘active’ and visible portion of the risk event usually takes
the focus. Plus, the current ‘objective hazard + subjective hazard + unsafe act’ model does not look for these latent, background contributors.
Active errors:
• Guide slips, lapses, mistakes
• ‘sharp end’
• Focus of trigger/event based RM

Latent errors:
• Dormant, long-term conditions
• ‘blunt end’
• Focus of systems based RM

Table 1: Active and latent errors
Writes James Reason (1990) in Human
Error, “There is a growing awareness… [that to] discover latent failures is
the best means of limiting [active] error.” Mountain guides inherit the system
defects and latent errors that set them up for active errors: staffing decisions, logistics restrictions, client screening, and so on down a possibly long list. While it is the guide who pulls the trigger, so to speak, it is the organization that puts the gun in their hand.
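For those who prefer to see the distinction in Table 1 operationalized, a minimal Python sketch (illustrative only – the categories follow the table, but the tagged factors and their wording are invented) shows how debrief notes might be sorted into sharp-end and blunt-end contributors:

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    ACTIVE = "active"   # guide slips, lapses, mistakes -- the 'sharp end'
    LATENT = "latent"   # dormant, long-term conditions -- the 'blunt end'

@dataclass
class Factor:
    description: str
    error_type: ErrorType

# Hypothetical debrief notes, tagged by type during a review
factors = [
    Factor("Guide misread wind loading on the exit slope", ErrorType.ACTIVE),
    Factor("Back-to-back trips left no time for a morning hazard meeting", ErrorType.LATENT),
    Factor("Client screening did not flag a fitness mismatch", ErrorType.LATENT),
]

latent = [f for f in factors if f.error_type is ErrorType.LATENT]
print(f"{len(latent)} of {len(factors)} recorded factors are latent (blunt-end) conditions")
```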
Using systems to understand and analyse
critical events:
A systems based approach to understanding
critical events is based on the premise that “Human error is a consequence, not
a cause” (Reason, 1997). It incorporates the operator’s contributing actions
(active error) within a greater context of social, organizational, and latent
factors (Figure 1).
Figure 1:
Systems based approach to understanding critical events
Step 1: Contributing actions and sensemaking
This step deals with the active error, but steers away from blame and towards what is known as ‘sensemaking’. Rather than looking for bad people making poor decisions (operator error), sensemaking tries to understand how good people attempted to make sense of a situation and enacted what they likely thought was the best idea given their understanding of it (Weick, 1988). This particular step is not the focus of this paper.
Step 2: Substitution test
The substitution test is an important lens through which to assess
an event. It defines an event as either a true operator error situation, or one
involving latent factors (Johnston, 1995). The substitution test asks this
question:
‘Given how events
unfolded and were perceived in real time, is it likely that a new individual,
with the same training and experience, would have behaved any differently?’
If the answer is an honest ‘yes’ (accounting for the hindsight bias and attribution error mentioned previously) – that is, a similar person would not have behaved the same way – then the event could be considered primarily an operator error situation: a slip, lapse, or mistake. In such a case, pressing further into ‘why’ yields little information to improve safety or prevent a similar event. The investigation can end here.
If the answer is ‘no’ - a similar person
would likely have acted and behaved in a similar way - then latent conditions
played some role in causing the event. The substitution test implies that if
the scenario were to present itself again, another individual would respond in
the same way. These latent conditions are explored next.
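The branching logic of the substitution test is simple enough to sketch in code. The function below is illustrative only – the name, the single yes/no input and the returned messages are my own shorthand, not a formal tool:

```python
def substitution_test(similar_person_would_act_differently: bool) -> str:
    """Route an event analysis using the substitution test.

    The input answers the question: given how events unfolded and were
    perceived in real time, would a new individual with the same training
    and experience likely have behaved any differently?
    """
    if similar_person_would_act_differently:
        # Primarily an operator-error situation (slip, lapse, mistake);
        # pressing further into 'why' yields little safety information.
        return "operator error -- investigation can end here"
    # A similar person would likely have acted the same way, so latent
    # conditions played a role; continue to Steps 3 and 4.
    return "latent conditions involved -- examine group and organizational factors"

print(substitution_test(similar_person_would_act_differently=False))
```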
Step 3: Group contribution
This first layer of latent conditions comprises the social interactions which directly or indirectly steer action, decisions and sensemaking in the moment. This layer is rich in explanatory power, although it is difficult to access given the complex nature of social groups. Primarily these interactions revolve around authority and role definition and the assumptions and expectations they create. A guiding situation is influenced by the organization and its management/supervisory structure, while recreational groups fall victim to much looser assumptions regarding expertise and leadership. This analysis can also extend to team functionality, peer pressure and group interaction. The concept of human factors has been introduced into the avalanche world as a means of addressing these, but it is only the tip of the iceberg. These particular
interactions are not the focus of this paper, but readers are directed to the
work of Snook (2000) and his analysis of group interaction as latent cause in
one particular case.
Step 4: Organizational factors
Key organizational processes and factors
form the base layer of potential latent errors and causes of events. Any of
these may be perfectly functional in ‘normal’ conditions, but can prove to be
poorly conceived, implemented or supervised when faced with an abnormal
situation or when combined in unforeseen ways (Perrow, 1999). Organizational
factors with the most potential for latent errors are briefly introduced below.
4.1 Risk tolerance
Risk tolerance comprises the articulated limits on the nature and magnitude of hazards and uncertainty to which an organization will expose its clients, staff and itself. Best when explicitly stated, it can also be
viewed within program parameters and the exposure limits inherent in the
organization’s chosen activities or operating environments.
As an analysis tool, the guide’s sensemaking and contributing actions reflect their understanding of the organization’s risk tolerance, and any discrepancies here need to be examined. It is important to note that a written risk tolerance statement serves little use if it conflicts with the actual risk culture in the organization (its true risk tolerance). The prevalence of a culture of safety (vs. production), where management chooses to direct its attention, and where money gets spent are all signals that the guide interprets in forming their own understanding of risk tolerance.
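One illustrative way to compare the written statement against the actual risk culture is to check logged field decisions against the stated limits. Everything in the sketch below – the danger-scale threshold, the permitted terrain classes and the log entries – is a hypothetical example, not a prescription:

```python
# Hypothetical written risk-tolerance statement: operate at or below
# 'Considerable' avalanche danger, on simple terrain only.
STATED_MAX_DANGER = 3          # 1=Low ... 5=Extreme (assumption for illustration)
STATED_TERRAIN = {"simple"}    # terrain classes permitted by the written statement

# Hypothetical operating log entries: (date, danger rating, terrain class)
field_log = [
    ("2014-01-04", 3, "simple"),
    ("2014-01-11", 4, "challenging"),   # exceeds the written statement
    ("2014-01-18", 2, "simple"),
]

discrepancies = [
    entry for entry in field_log
    if entry[1] > STATED_MAX_DANGER or entry[2] not in STATED_TERRAIN
]
for date, danger, terrain in discrepancies:
    print(f"{date}: operated at danger {danger} on {terrain} terrain -- "
          "the actual risk culture may differ from the written tolerance")
```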
4.2 Core process map
Systems based risk management planning is organized around a core process: the central interactions that produce the programs, trips or services the organization offers (Figure 2). As the focal point of systems based planning, the core process map is analyzed in detail for gaps, failures or inadequate system performance standards that may have created latent conditions. In effect, this asks the question ‘Did everything perform as it is supposed to?’ Answering no shows a clear breakdown, but even with the answer yes, there is a follow-up question: is the current vision of how it is supposed to work good enough? ‘Good enough’ needs to be related to the organization’s risk tolerance, performance standards and expectations, sense of values, and industry standards. The analysis continues by assessing the seven systems (below).
Ski areas or events that inadvertently host BC skiing raise interesting questions at this point. What kind of expectations were set up in advance? What messages were being communicated to potential participants? While the core process as envisioned here revolves around a commercial contract, a similar process can be imagined which generates social expectations or an inadvertent duty of care.
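The two questions asked of the core process map can also be framed as a simple review loop. In the sketch below, the step names and flags are placeholders standing in for an organization’s actual map (Figure 2):

```python
# Placeholder core-process steps; a real map would come from the
# organization's own systems planning.
core_process = [
    {"step": "marketing and client expectations", "performed_as_designed": True,  "design_good_enough": False},
    {"step": "client screening",                  "performed_as_designed": False, "design_good_enough": True},
    {"step": "trip planning and approval",        "performed_as_designed": True,  "design_good_enough": True},
]

for item in core_process:
    if not item["performed_as_designed"]:
        # 'Did everything perform as it is supposed to?' -- answered no
        print(f"Breakdown: '{item['step']}' did not perform as designed -- potential latent condition")
    elif not item["design_good_enough"]:
        # Performed as designed, but the design itself falls short of
        # risk tolerance, performance standards or industry norms
        print(f"Gap: '{item['step']}' worked as designed, but the design is not good enough")
```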
4.3 Seven systems analysis
Risk management planning is about systems planning: systems are turned into processes and routines, and these routines ensure that system and organizational targets are met. In this step, the systems and routines are examined in light of the event, the guide’s sensemaking and contributing actions, the assessment of risk tolerance, and the basic interactions of the core process. This examination looks for more subtle or sophisticated interactions, and detailed system maps make these points apparent.
4.4 Operational features
If systems provide the structure,
organizations adapt them to their own needs. As operations grow and evolve,
certain operational features may lend themselves to latent errors (Perrow,
1999; Reason, 1997).
Coupling describes the amount of slack or free space in an operation or activity. A tightly scheduled, high volume, or lean and efficient operation is more at risk of error, for the simple fact that there is less time to correct small errors – and small errors cascade quickly in an environment where things happen quickly, the typical BC setting. Inserting slack into an operation is always a good idea when it comes to preventing errors, but it is directly at odds with efficiency (an example of the conflict between production and protection).
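Coupling can be gauged roughly by how much recovery time a day plan actually contains. The schedule and the half-hour buffer in this sketch are invented purely to illustrate the idea:

```python
# Hypothetical day plan: (activity, planned start hour, planned end hour)
day_plan = [
    ("morning hazard meeting", 7.0, 7.5),
    ("transceiver check and departure", 7.5, 8.0),
    ("ascent to treeline", 8.0, 11.0),
    ("ski objective", 11.0, 15.0),
    ("return and debrief", 15.0, 17.0),
]

MIN_SLACK_HOURS = 0.5   # assumed minimum buffer needed to absorb small errors

# Flag back-to-back blocks with little or no slack between them
for (name_a, _, end_a), (name_b, start_b, _) in zip(day_plan, day_plan[1:]):
    slack = start_b - end_a
    if slack < MIN_SLACK_HOURS:
        print(f"Tight coupling between '{name_a}' and '{name_b}': "
              f"{slack:.1f} h of slack to catch and correct small errors")
```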
Operational consistency, supervisory and management models, and complexity creep all play a role. Critical incident experience is an indicator of future individual and system resiliency. A system that has been tested is more predictable than one that hasn’t been (even if it failed the first time). Individuals within it, and the system itself, will have experience recognizing what failure looks like, and can either predict and prevent it or effectively manage it prior to escalation (Jackson, 2009).
Within this, individual experience at the failure level is valuable for error prevention. Training above and beyond normal
operating levels (to the point of failure) builds an understanding of where the
edge lies, and how events unfold there. The point is to be able to recognize
when failure is near, and have the ability to make sense of a critical
situation as it unfolds.
Conclusion:
This article provides a systems approach to
looking beyond operator error and understanding the latent and organizational
causes of events and accidents. This analysis framework examines the operator’s
contributing actions, but also looks at group and system contributions. From a
systems perspective, risk tolerance, the core process and system maps provide
concrete points of examination, as do operational factors such as coupling and
supervisory models. This systems based analysis model can be applied to
critical and non-critical events, and to different program and organizational
structures.
Reason (1997) writes, “We cannot change the
human condition; people will always make errors.” He continues, however, to
assert “We can change the conditions under which they work and make unsafe acts
less likely.” To understand the system and operational factors that contribute to latent errors is to make progress in minimizing them.
Bibliography:
Hoffrage, U., Hertwig, R., &
Gigerenzer, G. (2000). Hindsight bias: a by-product of knowledge updating? Journal
of Experimental Psychology: Learning, Memory and Cognition, 26(3),
566-581.
Jackson, J. (2009). SCIRA: A Risk System Management Tool, Proceedings from the 2009 Wilderness Risk Management Conference, http://www.nols.edu/wrmc/resources.shtml
Jackson, J. & Heshka, J. (2010). Managing
Risk, Systems Planning for Outdoor Adventure Programs, Direct Bearing Inc.,
Palmer Rapids, ON.
Johnston, N. (1995). Do blame and
punishment have a role in organizational risk management? Flight Deck, Spring 1995.
Perrow, C. (1999). Normal Accidents,
Living with high risk technologies. Princeton University Press, Princeton,
N.J.; reprint of 1984 Basic Books.
Reason, J. (1990). Human Error,
Cambridge University Press, New York, NY.
Reason, J. (1997). Managing the Risks of
Organizational Accidents. Ashgate, Aldershot, England.
Reason, J. T. (2001). Understanding adverse
events: the human factor. In C. Vincent (Ed.), Clinical risk management.
Enhancing patient safety (2nd ed., pp. 9-30). London: BMJ Books.
Ross, L., & Nisbett, R. E. (1991). The
person and the situation. Perspectives of social psychology. New
York: McGraw Hill.
Snook, S. (2000). Friendly Fire. The
accidental shootdown of U.S. Black Hawks over Northern Iraq. Princeton University Press, Princeton, N.J.
Weick, K. (1988). Enacted sensemaking in crisis situations. Journal of Management Studies, 25(4).
Weick, K. (1990). The vulnerable system: An analysis of the Tenerife air disaster. Journal of Management, 16(3).