Human factors in aviation accident investigation
CASI Human Factors Symposium
Member, Transportation Safety Board of Canada
Toronto, Ontario, 30 April 2013
Check against delivery.
Slide 1: Title page
Good afternoon. Thank you for the invitation to speak. It's a real pleasure to be here.
Slide 2: Outline
Today I'll start off with a little bit of background on the Transportation Safety Board of Canada—who we are, what we do. Then I'll briefly review the evolution of accident investigation, from simply looking at mechanical failure or old attitudes of blaming the pilot(s), to examining the organizational factors that played a role.
I'm going to take you through a specific accident investigation recently conducted by the TSB and the human factors issues raised during that investigation.
I'll then talk about how organizations drift into failure. I will argue that many, if not most, accidents can be attributed to a breakdown in the way organizations proactively identify and mitigate hazards and manage risks. I'll also explain how the components of Safety Management Systems (SMS) were designed to provide a formal, structured process to help companies "find trouble, before trouble finds them."
Slide 3: Who we are
Slide 4: What we do
Our only mandate is to advance transportation safety. We do that by conducting independent investigations, identifying safety deficiencies, identifying causes and contributing factors, making recommendations, and making our reports public.
Slide 5: TSB offices
Our head office is located in Gatineau, Quebec, and our Engineering Branch is in Ottawa. We have eight other regional offices across the country.
Slide 6: Communicating what we know
The TSB has a variety of escalating tools to communicate identified safety issues to those stakeholders who can make a difference, such as regulators, manufacturers, operators and industry associations. I'll touch specifically on a few of the items listed here.
Following an occurrence, Safety Information Letters and Safety Advisories can be used by our Directors of Investigation to communicate with regulatory or industry stakeholders about unsafe conditions the TSB has found. The TSB can also decide to conduct a full investigation and release a public report on its findings, so that the regulator or industry can take the necessary safety actions.
A Board Safety Concern provides a marker to the industry and the regulator that there is a safety deficiency that warrants further attention, while a Board Recommendation is a powerful tool for change, used to identify systemic safety issues that need to be addressed to reduce the risk of similar future occurrences.
Slide 7: The Watchlist
In March 2010, and again in June 2012, the Board issued a Watchlist of those transportation safety issues posing the greatest risk to Canadians. In each case, the TSB has found that actions taken to date are inadequate, and that industry and regulators need to take additional concrete measures to eliminate the risks.
That's why, when we updated the Watchlist in 2012, we were able to remove old issues, where enough progress had been made, and add new ones as transportation risks have evolved.
This slide shows those items currently on the Watchlist, including, notably, the issue of Safety Management Systems, which I'll touch on later.
Slide 8: Background
In the immediate aftermath of a serious aviation accident, the questions people usually ask are "What happened?"; "Was it mechanical breakdown or human error?"; "What or who is to blame?" But modern aircraft accident investigations look beyond the "what" to try to determine the "why," because the primary objective is not to find fault or attribute blame, but to advance transportation safety by identifying the underlying causal or contributing factors, and the ones that create risks in the transportation system.
In the early days of aviation, accidents were more often attributed to mechanical breakdown, bad weather, or pilot handling errors. With the advent of human factors research and analysis, investigators started to look at how aircraft and cockpit design ergonomics may have contributed to "pilot errors". Later, more attention was paid to physiological factors—fatigue, circadian rhythms and spatial illusions—and psychological biases that could influence pilot decision-making and risk-taking behaviour. Over time, this has evolved to not only look at the performance of individual pilots but also of flight crews, leading to the modern concepts of Crew Resource Management (CRM) and Threat and Error Management (TEM).
Following a number of serious accidents involving complex, safety-critical technologies, investigators and researchers began to examine what role management and organizational factors played in these accidents. Dr. James Reason developed the well-known Reason's (or Swiss cheese) model to illustrate how management policies and decisions can contribute to latent pre-conditions which, combined with active operational failures and breakdowns in system defences, may converge to allow a window of opportunity for accidents to occur.
Slide 9: TSB Investigation Report A11F0012
So let's look at a real-life example of an occurrence investigated by the TSB, and the human factors issues involved.
On January 14, 2011, an Air Canada Boeing 767 was flying from Toronto, Ontario, to Zurich, Switzerland. Approximately halfway across the Atlantic, during the hours of darkness, the aircraft experienced a 46-second "pitch excursion." This resulted in an altitude deviation of minus 400 feet to plus 400 feet from the assigned altitude of 35 000 feet above sea level.
The seatbelt sign had been selected "on" approximately 40 minutes prior to the pitch excursion. Yet 14 passengers and 2 flight attendants were injured. The flight continued to destination, whereupon 7 passengers were sent to hospital and later released.
Slide 10: The media response…
Some of you may remember reading about this in the media after the TSB published its final report. Here is a sampling of what was said in the press at the time.
Slide 11: More headlines…
Here are some more headlines.
But let's look at what really happened, and why.
Slide 12: What really happened?
On screen, you can see the detailed sequence of events from that night.
Slide 13: What really happened? (cont'd)
Here is some more of what our investigators found.
Slide 14: Why did this happen?
The TSB's report contained a number of findings. Here are a few.
Slide 15: Factors to consider
We live in a 24/7 world. Many employees have to work overnight. Pilots flying overseas face an additional challenge because they also have to adapt to changing time zones.
One of the issues is that flying overnight, when our body naturally wants to sleep, and through a natural circadian low which occurs during the early morning hours, can lead to performance decrements.
Many organizations with shift workers have adopted formal fatigue risk management plans which include:
- Education and awareness training and strategies
- Policies for napping and controlled rest
- Scheduling practices and, in the case of airlines, use of a relief pilot
Allowing naps and controlled rest can lead to another issue known as sleep inertia. The TSB report says:
Sleep inertia refers to the post-sleep performance decrements that occur immediately after awakening. Sleep inertia is a transient physiological state characterized by confusion, disorientation, low arousal, and deficits in various types of cognitive and motor performance. Although the duration of sleep inertia is usually short, from 1 to 15 minutes, some deleterious effects can last 30 minutes or longer.
Research indicates that the duration and severity of sleep inertia can be worse
- if naps are longer;
- if naps occur during the circadian core body temperature trough or circadian low (normally in the middle of the night for a diurnally-oriented person);
- when the person is sleep-deprived or has been awake for an extended period; and
- if the nap contains or ends with slow-wave sleep.
One of the detrimental effects of sleep inertia is a decrease in cognitive processing speed. For example, it takes longer than normal for a person experiencing sleep inertia to filter out incongruous visual information.
So short naps (that is, of 20 to 40 minutes' duration) are better, to avoid going into slow-wave sleep; and it is important to allow sufficient recovery time after a nap to offset sleep inertia's effects.
Slide 16: Controlled rest
This informed Air Canada's controlled rest policy.
The quote is taken from TSB Investigation Report A11F0012, which in turn takes it from the Air Canada Flight Operations Manual, Section 2.9.10, Alertness Management.
Slide 17: Factors to consider (cont'd)
The TSB's report into this occurrence also identified some other factors.
Slide 18: Balancing competing priorities
Many organizations claim "safety is our first priority". There is, however, convincing evidence that, for some, the top priority is really customer service or return on shareholder investment. Even so, products and services still need to be "safe" if an organization wants to stay in business and maintain public confidence while avoiding accidents and costly litigation.
Therefore, balancing competing priorities and managing risk is part of any manager's decision-making process. And while some risks are easier to assess than others, it is very difficult to foresee what combination of circumstances might result in an incident or accident. This is particularly challenging in a complex socio-technical organization with a very low accident rate—e.g., air traffic control, and flight operations.
Slide 19: Limits of acceptable performance
Rasmussen suggests that, under the influence of pressure toward cost-effectiveness in an aggressive, competitive environment, organizations tend to migrate to the limits of acceptable performance. In other words, they drift.
Slide 20: Organizational drift
Sidney Dekker explains that organizational drift is generated by normal processes of reconciling differential pressures on an organization (efficiency, capacity utilization, safety) against a background of uncertain technology and imperfect knowledge. Drift may be visible to outsiders, especially following an adverse outcome. But drift is not necessarily obvious within the organization, since incremental changes are always occurring.
However, given the constant need to reconcile competing goals and the uncertainties involved in assessing safety risks, how can managers recognize if or when they are drifting outside the boundaries of safe operation while they focus on customer service, productivity and efficiency?
The focus of this question falls heavily on management because, by their nature, management decisions tend to have a wider sphere of influence on how the organization operates and a longer term effect than the individual actions of operators (e.g., pilots, or maintenance engineers). Managers create the operating environment by establishing and communicating the goals and priorities and by providing the tools, training, and resources to accomplish the tasks required to produce the goods and services.
So decision-makers and organizations need to develop "mindfulness" of the impact of their priorities, policies and processes. To quote Reason: "If eternal vigilance is the price of liberty, then chronic unease is the price of safety."
Slide 21: Safety Management Systems (SMS)
Traditional approaches to safety management (based primarily on compliance with regulations, reactive responses following accidents and incidents, and a "blame and punish" philosophy) have been recognized as being insufficient to reduce accident rates (ICAO, 2008). SMS was designed around evolving concepts about risk management and safety culture, including the research into High Reliability Organizations, which are believed to offer great potential for more effective safety management.
A successful SMS is systematic, explicit, and comprehensive. Reason says it "becomes part of an organization's culture, and of the way people go about their work."
Some people in this country misconstrue SMS as a form of deregulation or industry self-regulation. However, just as organizations rely on internal financial and human resources management systems to manage their financial assets and human resources, SMS is a framework designed to enable companies to better manage their safety risks. This does not obviate the need for effective regulatory oversight.
Slide 22: SMS Requires the Following
The generally accepted components of a safety management system include:
- an accountable executive or designated authority for safety;
- a safety policy on which the system is based (articulating senior management commitment);
- a process for setting safety goals and measuring their attainment;
- a process for identifying hazards, evaluating and managing the associated risks;
- a process for ensuring that personnel are trained and competent to perform their duties;
- a process for internal reporting and analyzing of hazards, incidents and accidents and for taking corrective actions;
- a process for documenting SMS and making staff aware of their responsibilities; and
- a process for conducting periodic reviews or audits of the SMS.
In essence, SMS requires:
- Proactive hazard identification
- Incident reporting and analysis
- Strong safety culture.
Slide 23: Investigating for organizational factors
Previous research into TSB accident investigations involving air, rail and marine operators revealed that organizational factors often played a role, and there were patterns of accident precursors, specifically:
- Goal conflicts: production versus protection
- Inadequate risk analysis, including:
  - no formal risk analysis conducted
  - risk analysis conducted, but hazard not identified
  - hazard identified, but residual risk underestimated
  - risk-control procedures not in place, or in place but not followed
- Employee adaptations
- Failure to heed "weak signals," including:
  - inadequate tracking or follow-up of safety deficiencies
  - ineffective sharing of information before, during or after the event, including verbal communications, record-keeping, or other documentation
In keeping with Reason's model, there was often a complex interaction of causal and contributing factors. In other words, there was no single factor that "caused" the accident.
Let's look at these four points individually—and how they played out in the Air Canada pitch excursion.
Slide 24: Employee adaptations
This leads me to the next factor of employee adaptations.
In theory, everybody has procedures that specify how work should be performed. But as we all know, these don't always describe how work actually gets done. This difference can cause problems, and "employee adaptations" can inadvertently sabotage safety.
Accident investigation reports often refer to these as "violations" or "deviations from SOPs". But let's look at this in a different light. Think about it in the context of limited resources: faced with time pressures and multiple goals, workers and management may be tempted to create "locally efficient practices" in order to get the job done. This can put safety goals directly in conflict with customer service and financial goals. It's often tough to say exactly how much less safe such a practice is, but if it works—and the more often it does work—then departures from the routine become the routine. Past successes are taken as a guarantee of future safety.
The important point here is that organizations need to anticipate or look for such adaptations and think about the possible consequences—good or bad.
There are many reasons why procedures may not be followed. Sometimes it may reflect a lack of training, practice or awareness about the procedure, or about the rationale behind the procedure. In other cases, strictly following the procedures may conflict with other goals, whereas taking a shortcut may save time and effort. Perhaps there is a lack of supervision or quality assurance. The key is to understand the context-specific reasons for the gap between written procedures and real practices. This will help organizations better understand this natural phenomenon, and allow for more effective interventions, beyond simply telling the workers to "Follow the rules!" or to "Be more careful!"
In many accident investigations, "weak signals", indicating potential trouble, were missed. By their nature, weak signals may not be sufficient to attract the attention of busy managers, who often suffer information overload while juggling many competing priorities under significant time pressures.
Employees will submit more incident reports if they are trained to recognize specific hazardous situations or conditions and areas they think the SMS should review. If all employees do not fully understand their reporting obligations and have not adopted a safety reporting culture as part of everyday operations, SMS will be less effective in managing risks.
SMS is intended to provide an infrastructure in which "weak signals" can be amplified to the point where they will be acted upon before an accident occurs.
Slide 25: Pilot error or management error?
As this example has shown, organizational drift, goal conflicts and employee adaptations occur naturally in any complex organization. Organizations can and should learn from these occurrences, since they also demonstrate patterns of accident precursors (e.g., not thinking ahead to what might go wrong, not having an effective means to track and highlight recurrent maintenance or other safety deficiencies, insufficient training and/or resources to deal with unexpected events).
Although an individual operator's actions or inactions can clearly cause or contribute to incidents and accidents, "human error" is an attribution made following an adverse outcome, usually with the benefit of hindsight. Most people don't set out to make an error or cause an accident; they just want to get the work done. So it's important to view their actions/inactions in the organizational context in which they occurred. And, following an accident, it is important to figure out "why" they did what they did.
Decision makers at all levels of an organization set and communicate the goals and priorities. They are usually aware of risks that need to be addressed and the need to make trade-offs. Routine pressures to get the job done can often exert an insidious influence on operators and decision makers, leading them to act in ways that are riskier than they realize.
Slide 26: Conclusions
Decision makers normally don't want to cause or contribute to an accident either; but just wanting operations to be safe won't create safety unless this commitment is also supported by "mindful" processes such as formal risk assessments, increased hazard reporting, tracking of safety deficiencies, and effective follow-up.
Ultimately, an SMS is only as effective as the safety culture in which it is embedded. There is a complex relationship between culture and process. SMS won't take hold unless there is a strong underlying commitment and buy-in to safety. While process changes (such as formal risk assessments and incident reporting) can stimulate changes in culture, they will only be sustainable in the long term if they are seen to add value.
While SMS may not eliminate all accidents, a properly implemented safety management system can help reduce the risk. Over time, this should reduce the accident rate. And that's why the TSB has included the implementation of SMS in the Marine and Aviation industries on our Watchlist.
Slide 27: Questions?
Thank you for your time and attention. I'd be happy to take any questions you may have.
Slide 28: Canada wordmark