High reliability organizations in commercial aviation

International High Reliability Organizing Conference

Kathy Fox
Member, Transportation Safety Board of Canada
Midland, Michigan, April 08, 2013

[Check against delivery]

Slide 1: Title page

Good morning. Thank you for the invitation to speak. It's a real pleasure to be here.

Today I'd like to talk about how the principles of High Reliability Organizations underpin the evolution of the formal safety risk management systems that are being adopted in the transportation industry, particularly in commercial aviation.

Slide 2: Outline

I'll start off with a little bit of background on the evolution of accident investigation, from simply looking at mechanical failure or pilot error, to examining the organizational factors that played a role. I'll talk about how organizations drift into failure and how the components of Safety Management Systems (SMS) are intended to provide a formal, structured process—or, in HRO terms, "a mindful infrastructure"—helping companies "find trouble before trouble finds them". Using real-life examples from TSB investigations, I'm going to argue that many, if not most, accidents can be attributed to a breakdown in the way organizations proactively identify and mitigate hazards and manage risks.

I'll also examine the role that governance plays, both internal and external, to hold companies accountable for having effective risk management systems in place.

Slide 3: Background

In the immediate aftermath of a serious aviation accident, the questions people usually ask are "What happened?"; "Was it mechanical breakdown or human error?"; "What or who is to blame?" But modern aircraft accident investigations look beyond the "what" to try to determine the "why," because the primary objective is not to find fault or attribute blame, but to advance transportation safety by identifying the underlying causal and contributing factors, as well as those that create risks in the transportation system.

In the early days of aviation, accidents were more often attributed to mechanical breakdown, bad weather, or pilot handling errors. With the advent of human factors research and analysis, investigators started to look at how aircraft and cockpit design ergonomics may have contributed to "pilot errors". Later, more attention was paid to physiological factors—fatigue, circadian rhythms and spatial illusions—and psychological biases that could influence pilot decision-making and risk-taking behaviour. Over time, the focus has broadened from the performance of individual pilots to that of entire flight crews, leading to the modern concepts of Crew Resource Management (CRM) and Threat and Error Management (TEM).

Following a number of serious accidents involving complex, safety-critical technologies, investigators and researchers began to examine what role management and organizational factors played in these accidents. Dr. James Reason developed the well-known "Swiss cheese" model to illustrate how management policies and decisions can create latent pre-conditions which, combined with active operational failures and breakdowns in system defences, may converge to open a window of opportunity for accidents to occur.

Slide 4: Balancing competing priorities

Many organizations claim "safety is our first priority". There is, however, convincing evidence that, for some, the top priority is really customer service or return on shareholder investment. Even so, products and services still need to be "safe" if an organization wants to stay in business and maintain public confidence—while avoiding accidents and costly litigation.

Therefore, balancing competing priorities and managing risk is part of any manager's decision-making process. And while some risks are easier to assess than others, it is very difficult to foresee what combination of circumstances might result in an accident. This is particularly challenging in a complex socio-technical organization with a very low accident rate—e.g., air traffic control or flight operations.

Slide 5: Limits of acceptable performance

Rasmussen suggests that, under the influence of pressure toward cost-effectiveness in an aggressive, competitive environment, organizations tend to migrate to the limits of acceptable performance. In other words, they drift.

Slide 6: Organizational drift

Sidney Dekker explains that organizational drift is generated by normal processes of reconciling differential pressures on an organization (efficiency, capacity utilization, safety) against a background of uncertain technology and imperfect knowledge. Drift may be visible to outsiders, especially following an adverse outcome. But drift is not necessarily obvious within the organization, since incremental changes are always occurring.

However, given the constant need to reconcile competing goals and the uncertainties involved in assessing safety risks, how can managers recognize if or when they are drifting outside the boundaries of safe operation while they focus on customer service, productivity and efficiency? Well, to quote Reason: "If eternal vigilance is the price of liberty, then chronic unease is the price of safety."

Slide 7: Impact of management

The focus of this question falls heavily on management because, by their nature, management decisions tend to have a wider sphere of influence on how the organization operates, and a longer-term effect, than the individual actions of operators (e.g., pilots or maintenance engineers). Managers create the operating environment by establishing and communicating the goals and priorities, and by providing the tools, training, and resources to accomplish the tasks required to produce the goods and services.

So decision-makers and organizations need to develop "mindfulness" of the impact of their priorities, policies and processes.

Slide 8: A "mindful infrastructure" would …

Weick and Sutcliffe define "mindfulness" as "a rich awareness of discriminatory detail." Of course, we all have blind spots, expectations that guide us toward soothing perceptions that confirm our hunches and away from more troublesome ones. However, "a mindful infrastructure" would continually do the following:

  • Track small failures
  • Resist oversimplification
  • Remain sensitive to operations
  • Maintain capabilities for resilience
  • Take advantage of shifting locations of expertise
  • Listen for and heed "weak signals"

Slide 9: Characteristics of effective safety management

But are "mindful" processes enough?

In a review paper prepared for the Canadian Civil Air Navigation Services provider, Westrum (1999) identified a number of organizational factors indicative of a strong safety culture, namely:

  1. a strong organizational emphasis on safety;
  2. high collective efficacy (i.e., a high degree of cooperation and cohesiveness);
  3. congruence between tasks and resources;
  4. a culture encouraging effective and free-flowing communications;
  5. clear mapping of its safety state;
  6. a learning orientation; and
  7. clear lines of authority and accountability.

Slide 10: Safety management systems (SMS)

Traditional approaches to safety management—based primarily on compliance with regulations, reactive responses following accidents and incidents, and a "blame and punish" philosophy—have been recognized as insufficient to reduce accident rates (ICAO, 2008). SMS was designed around evolving concepts of risk management and safety culture—including the research into High Reliability Organizations—that are believed to offer great potential for more effective safety management.

A successful SMS is systematic, explicit, and comprehensive. Reason says it "becomes part of an organization's culture, and of the way people go about their work."

Some people misconstrue SMS as a form of deregulation or industry self-regulation. However, just as organizations rely on internal management systems to manage their finances and human resources, SMS is a framework designed to enable companies to better manage their safety risks. It does not obviate the need for effective regulatory oversight.

Slide 11: SMS requires the following

The generally accepted components of a Safety Management System include:

  • an accountable executive or designated authority for safety;
  • a safety policy on which the system is based (articulating senior management commitment);
  • a process for setting safety goals and measuring their attainment;
  • a process for identifying hazards, evaluating and managing the associated risks;
  • a process for ensuring that personnel are trained and competent to perform their duties;
  • a process for the internal reporting and analysis of hazards, incidents and accidents, and for taking corrective action;
  • a process for documenting the SMS and making staff aware of their responsibilities; and
  • a process for conducting periodic reviews or audits of the SMS.

In essence, SMS requires:

  • Proactive hazard identification
  • Incident reporting and analysis
  • Strong safety culture

Slide 12: Investigating for organizational factors

Previous research into TSB accident investigations involving air/rail/marine operators revealed that organizational factors often played a role, specifically:

  • Goal conflicts
  • Inadequate risk analysis, including:
    • no formal risk analysis conducted
    • risk analysis conducted, but hazard not identified
    • hazard identified, but residual risk underestimated
    • risk-control procedures not in place, or in place but not followed
  • Employee adaptations
  • Failure to heed "weak signals," including:
    • inadequate tracking or follow-up of safety deficiencies
    • ineffective sharing of information before, during or after the event, including verbal communications, record-keeping, or other documentation

In keeping with Reason's model, there was often a complex interaction of causal and contributing factors. In other words, there was no single factor that "caused" the accident.

Let's look at these four points individually, starting with goal conflicts.

Slide 13: Goal conflicts

In 2004, a Boeing 747 on an international cargo flight crashed during takeoff from Halifax International Airport. The crew had inadvertently used the aircraft weight from a previous leg to calculate takeoff performance data. This resulted in a thrust setting too low to enable the aircraft to take off safely at its actual weight. Crew fatigue, combined with the dark takeoff environment, likely increased the probability of errors in calculating the takeoff performance data and degraded the flight crew's ability to detect them.
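
To get a rough sense of why a stale weight figure is so dangerous, consider that takeoff reference speeds scale roughly with the square root of weight. Here is a minimal sketch of that relationship; the figures are invented for illustration, not taken from the investigation:

```python
import math

def scaled_vr(vr_ref_kt: float, w_ref_kg: float, w_actual_kg: float) -> float:
    """Scale a rotation speed for weight, using the rough rule Vr ~ sqrt(W)."""
    return vr_ref_kt * math.sqrt(w_actual_kg / w_ref_kg)

# Invented figures for illustration only:
w_stale = 240_000    # weight carried over from the previous leg (kg)
w_actual = 350_000   # actual takeoff weight (kg)
vr_computed = 140.0  # rotation speed computed from the stale weight (kt)

vr_needed = scaled_vr(vr_computed, w_stale, w_actual)
print(f"Vr from the stale weight: {vr_computed:.0f} kt")
print(f"Vr the actual weight demanded: {vr_needed:.0f} kt")  # roughly 169 kt
```

With numbers like these, the aircraft would be asked to fly well below the speed its actual weight required, with thrust set to match the lighter figure. A simple gross-error check (for example, comparing the entered weight against the zero-fuel weight plus fuel load) is exactly the kind of defence a formal risk analysis might have prescribed.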

Why? Well, the company was experiencing significant growth and a shortage of flight crews. Over the previous four years, the company had gradually increased the maximum allowable duty period from 20 hours (with a maximum of 16 flight hours) to 24 hours (with a maximum of 18 flight hours). Originally, the crew complement consisted of two captains, two co-pilots and two flight engineers, but it was then revised to three pilots and two flight engineers. At the time of the accident, the flight crew had been on duty for almost 19 hours and, due to earlier delays, would likely have been on duty for approximately 30 hours at their final destination had the remaining flights continued uneventfully.

Prior to the crash, MK Airlines' Crewing Department routinely scheduled flights in excess of the 24-hour limit. In fact, this routine non-adherence to the Company Operations Manual contributed to an environment in which some employees and company management felt it was acceptable to deviate from company policy and/or procedures when it was considered "necessary to complete a flight or a series of flights."

We might also ask: what risk analysis, if any, was conducted before making these changes to crew levels and duty times? The answer is none. Let's look at another example.

Slide 14: Inadequate risk analysis

On November 11, 2007, a Global 5000 corporate jet touched down 7 feet short of the runway in Fox Harbour, Nova Scotia. The main landing gear was damaged when it struck the edge of the runway, and directional control was lost when the right main landing gear collapsed. The aircraft departed the right side of the runway and came to a stop 1000 feet from the initial touchdown point.

The company had introduced a pretty substantial equipment change—a new and bigger aircraft—without first carrying out an effective risk analysis.

Slide 15: Aircraft Attitude at Threshold

In particular, the company had transferred many of its standard operating procedures and practices from the smaller Challenger 604 aircraft it had previously operated to the new and larger Global 5000. The investigation determined that the operator endorsed a practice whereby flight crews would "duck under" visual glide slope indicator systems to land as close as possible to the beginning of the relatively short runway. The pilots had previously flown into this airport in a Challenger 604 and were still adjusting to the larger aircraft. They were not aware of the Global's eye-to-wheel height, or of the fact that the visual glide slope indicator system in use at Fox Harbour was not suitable for that aircraft type; as a result, the aircraft did not meet the manufacturer's recommendation of crossing the threshold at 50 feet. The crew flew the same profile they had flown on previous flights, without accounting for the Global 5000's greater size; they misjudged their height and did not recognize that they were too low. This practice, combined with a number of other factors, reduced the threshold crossing safety margin to an unacceptable level, contributing to the accident.
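
The geometry behind that finding is worth a quick sketch. A visual glide slope indicator guides the pilot's eyes along a fixed path; the wheels trail below that path by the aircraft's eye-to-wheel height, so the same visual picture yields less wheel clearance in a taller aircraft. The numbers below are invented for illustration, not the actual figures for these types:

```python
def wheel_crossing_height(eye_tch_ft: float, eye_to_wheel_ft: float) -> float:
    """Main-wheel height over the threshold when the pilot's eyes follow a
    visual glide slope whose eye path crosses the threshold at eye_tch_ft."""
    return eye_tch_ft - eye_to_wheel_ft

# Invented figures for illustration only:
eye_tch = 25.0       # eye threshold-crossing height given by the indicator (ft)
smaller_type = 10.0  # eye-to-wheel height of the smaller aircraft (ft)
larger_type = 17.0   # eye-to-wheel height of the larger aircraft (ft)

print(wheel_crossing_height(eye_tch, smaller_type))  # 15.0 ft of wheel clearance
print(wheel_crossing_height(eye_tch, larger_type))   # 8.0 ft: margin quietly eroded
```

Nothing looks different out the windshield, yet the margin shrinks; add a deliberate "duck under" and it can vanish entirely. That is precisely the kind of hazard a pre-introduction risk analysis is meant to surface.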

Slide 16: Employee adaptations

This leads me to the next factor of employee adaptations.

In theory, everybody has procedures that specify how work should be performed. But as we all know, these don't always describe how work actually gets done. This difference can cause problems, and "employee adaptations" can inadvertently sabotage safety.

Accident investigation reports often refer to these as "violations" or "deviations from SOPs". But let's look at this in a different light. Think about it in the context of limited resources: faced with time pressures and multiple goals, workers and management may be tempted to create "locally efficient practices" in order to get the job done. It's often tough to say exactly how much less safe such a practice is. If it works, and the more often it works, departures from the routine become the routine. Past successes are taken as a guarantee of future safety.

The important point here is that organizations need to anticipate or look for such adaptations and think about the possible consequences, good or bad.

Slide 17: Employee adaptations (continued)

On 16 July 2011, a Boeing 727 on a scheduled cargo flight with 3 crew members on board landed at St. John's International Airport. Following touchdown, the crew was unable to stop the aircraft on the wet runway; it overran the end of the runway and came to rest in the grass, with the nose wheel approximately 350 feet beyond the end of the pavement. Fortunately, there were no injuries, and the aircraft sustained only minor damage.

The investigation found that 3 of the 4 tires were in excess of 80% worn, while the 4th was about 65% worn. Using tires that are more than 80% worn reduces wet-runway traction, thereby increasing the risk of hydroplaning and possible runway overruns.

At some of the line bases, maintenance personnel had adopted a local practice of misting the wheels with water to aid in brake cooling after landing, as a means of ensuring that brake temperatures were within the allowable limits for departure. This method was not identified in the company's SOPs, nor was it an approved standard practice of the manufacturer. Yet this local practice was never flagged as a potential safety hazard for further assessment through the company's Risk Management System (RMS).

Each main wheel is equipped with a thermal fuse plug that will melt if the temperature of the wheel gets too high. The purpose of the fuse plug is to protect against an explosive release of tire nitrogen. Fuse-plug releases are not considered normal wear and tear and, in accordance with company procedures, should be reported through the company's RMS. The company carried out a review covering an 8-month period and identified 7 unique occurrences of fuse-plug releases, all of which had been reported through various maintenance procedures but not through its RMS.
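
Notice how mechanical that gap was: the occurrences existed in one reporting channel but never reached the safety system. Conceptually, surfacing them is a simple reconciliation between channels, as this sketch shows (the record IDs are invented for the example):

```python
# Occurrence IDs logged through maintenance channels (invented for illustration):
maintenance_log = {"FP-01", "FP-02", "FP-03", "FP-04", "FP-05", "FP-06", "FP-07"}

# What actually reached the company's risk management system:
rms_reports: set[str] = set()

missed = sorted(maintenance_log - rms_reports)
print(f"{len(missed)} fuse-plug releases never entered the RMS: {missed}")
```

The hard part, of course, is not the set difference; it is ensuring that someone is tasked with looking across channels at all, and that what they find feeds a risk assessment.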

Comprehensive reporting of safety hazards is essential to an effective safety management system. This includes reporting non-routine situations that could represent hazards, such as local practices or adaptations of documented procedures (e.g., misting of wheels), as well as occurrences that could have been reported through the company's RMS (e.g., fuse-plug releases) but were not. Here, the gaps meant a missed opportunity to identify potential safety risks (loss of directional control, hydroplaning, runway overruns) and take appropriate mitigating action.

Employees will submit more incident reports if they are trained to recognize specific hazardous situations or conditions, and the kinds of issues they think the SMS should review. If employees do not fully understand their reporting obligations, and have not adopted a safety reporting culture as part of everyday operations, the SMS will be less effective in managing risks.

There are many reasons why procedures may not be followed. Sometimes it may reflect a lack of training, practice or awareness about the procedure, or about the rationale behind the procedure. In other cases, strictly following the procedures may conflict with other goals, whereas taking a shortcut may save time and effort. Perhaps there is a lack of supervision or quality assurance. The key is to understand the context-specific reasons for the gap between written procedures and real practices. This will help organizations better understand this natural phenomenon, and allow for more effective interventions, beyond simply telling the workers to "Follow the rules!" or to "Be more careful!"

The fuse-plug releases in this accident also provide an example of "weak signals".

Slide 18: Weak signals

In many accident investigations, "weak signals" indicating potential trouble were missed. By their nature, weak signals may not be sufficient to attract the attention of busy managers, who often suffer information overload while juggling many competing priorities under significant time pressures.

Routine pressures to get the job done can exert an insidious influence on operators and decision makers, leading them to act in ways that are riskier than they realize. In a 2008 accident in British Columbia, a Grumman Goose amphibious aircraft with one pilot and seven passengers aboard departed Vancouver for Powell River in marginal visual conditions. Nineteen minutes later, the aircraft crashed in dense fog.

The client sometimes used other operators when this company would not fly, putting the company's safety goals directly in conflict with its customer service and financial goals.

This pilot was known for pushing the weather, and clients often requested him because he flew when others didn't. The company discussed its concerns over his decision-making with the pilot three times; however, these discussions were not documented, as required by the company's SMS.

Slide 19: Weak signals (continued)

In another example, on January 7, 2007, a Beech King Air was on a medevac flight with two pilots and two emergency medical technicians aboard. During a long landing flare at night, the crew decided to go around. The aircraft did not maintain a positive climb rate and crashed into the trees beyond the far end of the runway. The captain was killed; the others survived with injuries.

The TSB investigation found that the crew of two pilots was unable to work effectively as a team to avoid, trap, or mitigate errors and safely manage the risks associated with the flight. As our lead investigator at the time put it, "This crew did not employ basic strategies that could have helped prevent the chain of events leading to this accident." This lack of coordination can be attributed in part to the fact that the crew had not received crew resource management (CRM) training. Previously, there had been numerous "crew pairing issues" with respect to this crew. The company's management knew about this, although they were unaware of the extent to which these factors could impair effective crew coordination.

Slide 20: Weak signals (continued)

In a 2009 article, William Voss, then President and CEO of the Flight Safety Foundation said: "As random as these recent accidents look, though, one factor does connect them. We didn't see them coming and we should have… the data were trying to tell us something but we weren't listening."

SMS is intended to provide an infrastructure in which "weak signals" can be amplified to the point where they will be acted upon before an accident occurs.

Slide 21: SMS in air carrier operations

A Boeing 737 was departing from Toronto–Lester B. Pearson International Airport with 189 passengers and 7 crew members on board. During the takeoff run, at about 90 knots, the auto-throttle disengaged after takeoff thrust was set. As the aircraft approached the critical engine failure recognition speed, the first officer, who was the pilot flying, noticed an AIRSPEED DISAGREE alert and transferred control of the aircraft to the captain, who continued the takeoff. During the initial climb, the aircraft received a stall warning (stick shaker), followed by a flight director command to pitch to a 5° nose-down attitude. The takeoff was being conducted in visual conditions, allowing the captain to determine that the flight director commands were erroneous; he ignored them and maintained a climbing attitude. The crew advised the air traffic controller of a technical problem that required a return to Toronto. They did not declare an emergency, but requested that aircraft rescue and firefighting services be placed on standby due to the overweight landing.

Initially, this occurrence was not reported to the TSB, because neither the crew nor the company realized that placing rescue and firefighting services on standby constituted a reportable occurrence.

The operator subsequently carried out an assessment of the occurrence. Its draft SMS report identified 2 issues. The first was the airspeed disagreement, whose root cause was deemed to be a technical issue (pitot tube contamination) requiring no further analysis. The second was that the event had not been classified as a TSB-reportable occurrence.

The TSB investigation identified a number of other underlying hazards and risks that the company had not identified or assessed. The recognition of hazards and the management of risk are central to the SMS concept, which underpins the regulation of scheduled air carrier operations in Canada. In this occurrence, the operator did not recognize any hazards worthy of analysis by its SMS. The crew's effective performance masked underlying risks that the guidance, training and procedures available to them did not mitigate.

Slide 22: Pilot error or management error?

As these examples have shown, organizational drift, goal conflicts and employee adaptations occur naturally in any complex organization. Organizations can and should learn from these occurrences, since they also demonstrate patterns of accident precursors (e.g., not thinking ahead to what might go wrong, not having an effective means to track and highlight recurrent maintenance or other safety deficiencies, insufficient training and/or resources to deal with unexpected events).

Although an individual operator's actions or inactions can clearly cause or contribute to incidents and accidents, "human error" is an attribution made following an adverse outcome, usually with the benefit of hindsight. Most people don't set out to make an error or cause an accident; they just want to get the work done. So it's important to view their actions/inactions in the organizational context in which they occurred. And, following an accident, it is important to figure out "why" they did what they did.

The decision to value production over safety is usually implicit—and unrecognized. Moreover, persistent success leads to a tendency to underestimate the amount of risk involved.

Slide 23: Pilot Error or Management Error? (continued)

Decision makers at all levels of an organization set and communicate the goals and priorities. They are usually aware of risks that need to be addressed and the need to make trade-offs. Charles Perrow says that "if investing in safety would improve a company's quarterly or yearly rate of return, they would do it. But often, if the investment—say, rebuilding pipelines that are becoming hazardous—might pay off only if a storm hits that particular area in the next twenty years, a company would be more reluctant to do so, especially if the executive making the decision is to retire in five years".

That being said, decision makers normally don't want to cause or contribute to an accident; but just wanting operations to be safe won't create safety unless this commitment is also supported by "mindful" processes such as formal risk assessments, increased hazard reporting, tracking of safety deficiencies, and effective follow-up.

Ultimately, an SMS is only as effective as the safety culture in which it is embedded. There is a complex relationship between culture and process. SMS won't take hold unless there is a strong underlying commitment and buy-in to safety. While process changes (such as formal risk assessments and incident reporting) can stimulate changes in culture, they will only be sustainable in the long term if they are seen to add value.

Slide 24: The role of governance / oversight

So, who holds senior management to account for the consequences of their trade-offs and the decisions they make about risk?

The owner or Board of Directors is usually the first line of governance in most private or public organizations and holds senior management to account. But depending on their background and level of expertise, owners and directors may or may not fully grasp the possible impact of certain decisions and practices on safety risks. Most Boards of Directors have created committees to oversee key management areas (e.g., HR, Finance, Pensions). Similarly, a Board Safety Committee could serve as a focal point for overseeing a company's safety management activities and safety performance.

Shareholders also have a real interest in the company's financial viability, but, as with owners and Boards of Directors, they may not be aware of the details of the day-to-day operational safety risks.

Customers (and, by extension, the general public) have a vested interest in the safe transport of their goods and personnel, whose very lives are in the hands of the air operator. Customers (particularly major oil and chemical companies) are becoming increasingly insistent that transportation companies demonstrate effective safety/risk management systems before being contracted to transport their employees and products. Thus, SMS provides at least a marketing advantage, if not a prerequisite for doing business with these companies.

Insurance companies? Well, they're ultimately on the hook, up to policy limits, for loss of life, property damage and third-party liability claims. And a company with repeated insurable losses may face more than a premium increase: it may become uninsurable.

Transport regulators and regulations are in place to protect the public by requiring operators to meet certain minimum standards.

Ultimately, it's all of the above.

But—are regulators doing enough?

Slide 25: Governance / oversight (continued)

In June 2010, a Beech King Air 100 departed the Quebec City airport with a reduced engine power setting—a procedure established by the company to reduce wear and tear on the engines, but one not endorsed by the manufacturer. As a result, the aircraft's takeoff performance was lower than that established during the aircraft's type certification. Within seconds after takeoff, the crew reported a problem with the right engine and stated their intention to return to land. Because the right engine's propeller blades were not feathered, there was excessive aerodynamic drag, which compromised the aircraft's ability to climb or maintain level flight. The aircraft descended and struck the terrain at the end of the runway, hitting a berm; the 5 passengers and 2 crew members died in the post-impact fire.

The TSB found numerous safety deficiencies in the areas of pilot training, company operating procedures, maintenance documentation, and the company's safety culture. And while inspections performed by Transport Canada (TC) revealed unsafe practices, the measures TC took to ensure compliance with regulations were not effective, and the unsafe practices continued.

At the end of the day, what evidence and processes are necessary to substantiate a regulator's decision to shut down a non-compliant operator?

Slide 26: Governance / oversight (continued)

Some short-sighted operators may not be truly committed to implementing sound risk management policies, processes, and practices, believing these add more bureaucracy than value. Meanwhile, regulators are increasingly encouraging operators to adopt new technology, training and safety programs (including SMS) on a voluntary basis, to avoid lengthy and costly rulemaking processes that don't always pass a cost-benefit analysis based on an already low accident rate.

In a recent message from William Voss, then President and CEO of the Flight Safety Foundation (AeroSafety World, July 2012), he wrote: "A major U.S. airline that implements all the voluntary FAA programs is clearly very safe, but that airline may have to compete with another carrier that decides to cut costs and not implement any of the same programs. The gap between what is legal and what is safe already is large, and it will get bigger… Is this regulatory approach sustainable? Is it fair to airlines that do everything right? Is it fair to an unknowing public?"

Slide 27: Conclusions

While SMS may not eliminate all accidents, a properly implemented safety management system can help reduce the risk. Over time, this should reduce the accident rate. Many lessons can be drawn from accident investigations and the experience of those operators who have implemented SMS.

It's no longer about mere "operator error." Rather, safety is about identifying the risks that inevitably crop up because we're human, and then managing or minimizing them. Organizational drift, goal conflicts, competing priorities, local adaptations … We'll never get rid of them entirely, so it's how we deal with them that matters.

Getting back to the best principles of HRO, SMS can help develop a "mindful infrastructure".

Mindful infrastructure: SMS is only as effective as the organizational safety culture in which it is embedded. Just wanting safety isn't enough. Organizations must "institutionalize" their approach so that goals, policies, processes, practices, communications, and culture are consistent and integrated.

Accountability is key: True "accountability" is about more than "blame and retrain" or firing the "bad apples." It is about looking forward, and it requires organizations and professions to take full responsibility for fixing identified problems. So-called "near-miss" incidents must be viewed as "free opportunities" for organizational learning; people will only report if they feel "safe" to do so.

Regulatory oversight: There is a risk that some short-sighted companies will take a minimalist, bureaucratic, or checklist approach to adopting SMS and then believe they are "safe" because they have a "compliant" SMS. Regulators should not hesitate to enforce existing regulations to the point of limiting a company's ability to operate until the operator has demonstrated that effective actions have been implemented and unsafe practices eliminated.

Success takes time: Organizations must recognize it will take unrelenting commitment, time, resources and perseverance to implement an effective SMS.

Slide 28: Questions?

Slide 29: Canada wordmark