Safety, Risk Management, Governance, and Accountability

Signal Charlie Seminar

Kathy Fox
Board member, Transportation Safety Board of Canada
Pensacola, Florida, 13-14 September, 2012

Check against delivery.

Slide 1: Title Page

Good afternoon. It's a pleasure to be here today.

Today I'd like to talk about safety, risk management, governance, and accountability.

Slide 2: Outline

I'll start off with a little bit of background on the evolution of accident investigation, from simply looking at mechanical failure or pilot error, to examining the organizational factors that played a role. I'll talk about how organizations drift into failure and how the components of SMS are intended to provide a formal, structured process to help companies “find trouble, before trouble finds them”. Using real-life examples from TSB investigations, I'm going to argue that many, if not most, accidents can be attributed to a breakdown in the way organizations proactively identify and mitigate hazards and manage risks.

I'll also examine the role that governance plays, both internal and external, to hold companies accountable for having effective risk management systems in place.

Slide 3: Background

In the immediate aftermath of a serious aviation accident, the questions people usually ask are “What happened?”; “Was it mechanical breakdown or human error?”; “What or who is to blame?” But modern aircraft accident investigations look beyond the “what” to try to determine the “why,” because the primary objective is not to find fault or attribute blame, but to advance transportation safety by identifying the underlying causal and contributing factors, as well as those that create risks in the transportation system.

In the early days of aviation, accidents were more often attributed to mechanical breakdown, bad weather, or pilot handling errors. With the advent of human factors research and analysis, investigators started to look at how aircraft and cockpit design ergonomics may have contributed to “pilot errors”. An early example led to the redesign of flap and landing gear controls to make them more readily distinguishable by touch, reducing the risk of inadvertent landing gear retraction. Later, more attention was paid to physiological factors—fatigue, circadian rhythms and spatial illusions—and psychological biases that could influence pilot decision-making and risk-taking behaviour. Over time, this has evolved to look not only at the performance of individual pilots but also of flight crews, leading to the modern concepts of Crew Resource Management and Threat and Error Management.

Slide 4: Reason's Model (“Swiss Cheese”)

Following a number of serious accidents involving complex, safety-critical technologies, investigators and researchers began to examine what role management and organizational factors played in these accidents. Dr. James Reason developed the well-known “Swiss cheese” model to illustrate how management policies and decisions can contribute to latent pre-conditions which, combined with active operational failures and breakdowns in system defences, may converge to allow a window of opportunity for accidents to occur.

Slide 5: Balancing Competing Priorities

Many organizations claim “safety is our first priority”. There is, however, convincing evidence that, for some, the top priority is really customer service, or return on shareholder investment. Yet products and services still need to be “safe” if an organization wants to stay in business and maintain public confidence—while avoiding accidents and costly litigation.

Therefore, balancing competing priorities and managing risk is part of any manager's decision-making process. And while some risks are easier to assess than others, it is very difficult to foresee what combination of circumstances might result in an accident. This is particularly challenging in a complex socio-technical organization with a very low accident rate—e.g., air traffic control or flight operations.

Slide 6: Limits of Acceptable Performance

Rasmussen suggests that, under the influence of pressure toward cost-effectiveness in an aggressive, competitive environment, organizations tend to migrate to the limits of acceptable performance. In other words, they drift.

Slide 7: Organizational Drift

Sidney Dekker explains that organizational drift is generated by normal processes of reconciling differential pressures on an organization (efficiency, capacity utilization, safety) against a background of uncertain technology and imperfect knowledge. Drift may be visible to outsiders, especially following an adverse outcome. But drift is not necessarily obvious within the organization, since incremental changes are always occurring (Dekker, 2005).

Slide 8: MK Airlines

Here's an example of drift, from one of the TSB's own investigations—the 2004 crash of a Boeing 747 on an international cargo flight during takeoff. The crew had inadvertently used the aircraft weight from a previous leg to calculate takeoff performance data. This resulted in incorrect V speeds and a thrust setting too low to enable the aircraft to take off safely given its actual weight. Crew fatigue likely increased the probability of errors in calculating the takeoff performance data and, combined with the dark takeoff environment, degraded the flight crew's ability to detect those errors.
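
The mechanism is worth making concrete. Here is a minimal sketch, with invented numbers rather than actual Boeing 747 performance data, of how takeoff figures are derived from weight, and how a stale weight carried over from a lighter leg silently yields a rotation speed and thrust setting too low for the aircraft actually being flown:

```python
# Simplified sketch with invented numbers; not actual B747 performance data.
# Shows how a stale (lighter) weight propagates into rotation speed and
# thrust figures that are too low for the aircraft's actual weight.

def takeoff_numbers(weight_tonnes):
    """Linearly interpolate rotation speed (kt) and thrust (% N1)
    from a toy weight-indexed performance table."""
    table = [  # (weight t, Vr kt, thrust % N1): illustrative values only
        (250, 135, 88.0),
        (300, 150, 92.0),
        (350, 165, 96.0),
    ]
    for (w0, v0, t0), (w1, v1, t1) in zip(table, table[1:]):
        if w0 <= weight_tonnes <= w1:
            f = (weight_tonnes - w0) / (w1 - w0)
            return v0 + f * (v1 - v0), t0 + f * (t1 - t0)
    raise ValueError("weight outside table range")

stale = takeoff_numbers(260)   # weight carried over from the previous leg
actual = takeoff_numbers(350)  # the aircraft's real weight on this leg
print(f"figures used:   Vr = {stale[0]:.0f} kt, thrust = {stale[1]:.1f}% N1")
print(f"figures needed: Vr = {actual[0]:.0f} kt, thrust = {actual[1]:.1f}% N1")
```

The arithmetic is trivial; the point is that nothing in the resulting numbers flags the error, so only an independent gross-error check, or an alert and rested crew, is likely to catch it.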

Why? Well, the company was experiencing significant growth and a shortage of flight crews. During the previous four years, the company had gradually increased the maximum allowable duty period from 20 hours (with a maximum of 16 flight hours) to 24 hours (with a maximum of 18 flight hours). Originally, the crew complement consisted of two captains, two co-pilots and two flight engineers, but was then revised to include three pilots and two flight engineers. At the time of the accident, the flight crew had been on duty for almost 19 hours and, due to delays experienced earlier, would likely have been on duty for approximately 30 hours at their final destination had the remaining flights continued uneventfully.

Let's ask one obvious question: What, if any, risk assessment was done at the time to assess the impact of these changes to crew duty days? It's worth asking because, prior to the crash, MK Airlines' Crewing Department routinely scheduled flights in excess of the 24-hour limit. In fact, this routine non-adherence to the Company Operations Manual contributed to an environment where some employees and company management felt that it was acceptable to deviate from company policy and/or procedures when it was considered “necessary to complete a flight or a series of flights.”

Slide 9: Impact of Management

The focus of this question falls heavily on management because, by their nature, management decisions tend to have a wider sphere of influence on how the organization operates and a longer term effect than the individual actions of operators (e.g., pilots, or maintenance engineers). Managers create the operating environment by establishing and communicating the goals and priorities and by providing the tools, training, and resources to accomplish the tasks required to produce the goods and services.

However, given the constant need to reconcile competing goals and the uncertainties involved in assessing safety risks, how can managers recognize if or when they are drifting outside the boundaries of safe operation while they focus on customer service, productivity and efficiency? Well, to quote Reason: “If eternal vigilance is the price of liberty, then chronic unease is the price of safety.”

Another way of putting it is for decision-makers and organizations to develop “mindfulness.”

Slide 10: A “mindful infrastructure” would …

“Mindfulness” is defined as “a rich awareness of discriminatory detail.” Of course, we all have blind spots, expectations that guide us toward soothing perceptions that confirm our hunches and away from more troublesome ones. However, “a mindful infrastructure” would continually do the following:

  • Track small failures
  • Resist oversimplification
  • Remain sensitive to operations
  • Maintain capabilities for resilience
  • Take advantage of shifting locations of expertise
  • Listen for and heed "weak signals"

Slide 11: Safety Management System (SMS)

Traditional approaches to safety management—based primarily on compliance with regulations, reactive responses following accidents and incidents, and a ‘blame and punish' philosophy—have been recognized as insufficient to reduce accident rates (ICAO, 2008). SMS was designed to provide a more effective approach to risk management. A successful SMS is systematic, explicit, and comprehensive. Reason says it “becomes part of an organization's culture, and of the way people go about their work.”

Some people misconstrue SMS as a form of deregulation or industry self-regulation. However, just as organizations rely on internal financial and HR management systems to manage their financial assets and human resources, SMS is a framework designed to enable companies to better manage their safety risks. This does not obviate the need for effective regulatory oversight.

Slide 12: SMS Requires the Following

The generally accepted components of a Safety Management System include:

  • an accountable executive or designated authority for safety;
  • a safety policy on which the system is based (articulating senior management commitment);
  • a process for setting safety goals and measuring their attainment;
  • a process for identifying hazards, evaluating and managing the associated risks;
  • a process for ensuring that personnel are trained and competent to perform their duties;
  • a process for the internal reporting and analysis of hazards, incidents and accidents, and for taking corrective actions;
  • a process for documenting SMS and making staff aware of their responsibilities; and
  • a process for conducting periodic reviews or audits of the SMS.

In essence, SMS requires:

  • Proactive hazard identification
  • Incident reporting and analysis
  • Strong safety culture
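
To make the hazard identification and risk evaluation component above concrete, here is a deliberately minimal sketch of a hazard log with classic risk-matrix scoring. All names, scales and thresholds here are invented for illustration; a real SMS defines its own in its documentation.

```python
# Minimal, illustrative hazard log; names, scales and thresholds are invented.
from dataclasses import dataclass, field

@dataclass
class Hazard:
    description: str
    likelihood: int                      # 1 (rare) .. 5 (frequent)
    severity: int                        # 1 (negligible) .. 5 (catastrophic)
    mitigations: list[str] = field(default_factory=list)

    @property
    def risk_index(self) -> int:
        # Classic risk-matrix scoring: likelihood x severity
        return self.likelihood * self.severity

ACCEPTABLE = 6   # invented threshold: anything above requires mitigation

log = [
    Hazard("crew duty days routinely exceed the 24-hour limit", 4, 5),
    Hazard("visual glide slope aid unsuitable for new aircraft type", 2, 4),
]

for h in log:
    action = ("accept and monitor" if h.risk_index <= ACCEPTABLE
              else "mitigate and track")
    print(f"{h.description}: risk index {h.risk_index} -> {action}")
```

In a full SMS, that scoring would feed the corrective-action and periodic-review processes listed above, so that residual risk is tracked rather than assumed away.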

Slide 13: SMS Requirements in Canada

In Canada, the transportation regulator, Transport Canada, currently requires SMS in the air, rail and marine transportation modes. Specifically, in aviation, SMS is required for scheduled airline operators, maintenance organizations, certified airport authorities, and the privatized providers of air navigation services. While not yet required for commuter airlines, air taxi operators or flight training units, the regulator has stated its intention to move in that direction over the next few years.

However, the transition is not easy. It takes significant commitment, time, and resources for a company's SMS to become fully effective.

Slide 14: Investigating for Organizational Factors

Previous research into TSB accident investigations involving air/rail/marine operators revealed that organizational factors often played a role, specifically:

  • Inadequate risk analysis, including:
    • no formal risk analysis conducted
    • risk analysis conducted but hazard not identified
    • hazard identified but residual risk underestimated
    • risk control procedures not in place, or in place but not followed
  • Employee adaptations
  • Goal conflicts
  • Failure to heed “weak signals,” including:
    • inadequate tracking or follow-up of safety deficiencies
    • ineffective sharing of information before, during or after the event, including verbal communications, record-keeping or other documentation

In keeping with Reason's model, there was often a complex interaction of causal and contributing factors. In other words, there was no single factor that “caused” the accident.

Let's look at these four points individually.

Slide 15: Inadequate Risk Analysis

On November 11, 2007, a Global 5000 corporate jet touched down 7 feet short of the runway in Fox Harbour, Nova Scotia. The main landing gear was damaged when it struck the edge of the runway, and directional control was lost when the right main landing gear collapsed. The aircraft departed the right side of the runway and came to a stop 1000 feet from the initial touchdown point.

The company had introduced a pretty substantial equipment change—a new and bigger aircraft—without first carrying out an effective risk analysis.

Slide 16: Aircraft Attitude at Threshold

In particular, the company had transferred many of its standard operating procedures and practices from the Challenger 604 aircraft previously used to the new and larger Global 5000. The investigation determined that the operator endorsed a practice whereby flight crews would “duck under” visual glide slope indicator systems to land as close as possible to the beginning of the relatively short runway. The crew had previously flown into this airport in a Challenger 604 and were still adjusting to the new, larger aircraft. They were not aware of the Global's eye-to-wheel height, or of the fact that the visual glide slope indicator system in use at Fox Harbour was not suitable for that aircraft type; as a result, the aircraft did not meet the manufacturer's recommendation of crossing the threshold at 50 feet. The crew flew the same profile they had flown on previous flights, without taking into account that the Global 5000 was bigger than the Challenger 604. They misjudged their height and did not recognize that they were too low. This practice, combined with a number of other factors, reduced the threshold crossing safety margin to an unacceptable level, contributing to the accident.
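
The geometry behind this finding is simple but unforgiving: a visual glide slope indicator guides the pilot's eyes down a fixed path, and the wheels trail below by the aircraft's eye-to-wheel height. A short sketch with invented heights (not type-certified figures) shows how flying the same visual profile in a larger aircraft quietly erodes the wheel clearance:

```python
# Illustrative only: invented heights, not type-certified data.
# On a visual glide slope the pilot's eyes follow the guidance beam;
# wheel clearance over the threshold is eye height minus eye-to-wheel height.

eye_height_at_threshold_ft = 35   # a "duck under" profile flown by eye

for aircraft, eye_to_wheel_ft in [("smaller, familiar type", 12),
                                  ("new, larger type", 25)]:
    clearance_ft = eye_height_at_threshold_ft - eye_to_wheel_ft
    print(f"{aircraft}: wheel clearance at threshold = {clearance_ft} ft")

# Same visual picture from the flight deck; far thinner margin
# underneath the larger aircraft.
```

Nothing in the cockpit view changes, which is why the erosion of margin goes unnoticed without a formal risk analysis of the equipment change.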

Slide 17: Inadequate Risk Analysis (continued)

In another 2007 accident, a Beech King Air 100 was flying an instrument approach to an airport in Quebec in IFR weather conditions with two pilots aboard. They missed the first approach and went around to attempt another. On the second approach, the aircraft was left of the runway centreline. The crew made a right turn and then a steep left turn. After the left turn, the aircraft struck the runway about 500 feet from the threshold. A severe post-impact fire ensued, which killed the crew and destroyed the aircraft.

The crew was qualified for the intended flight and had received Crew Resource Management (CRM) training. However, both pilots had limited experience flying in instrument meteorological conditions and working in a multi-crew environment. A robust hazard identification system could have identified this risk and supported a more appropriate crew-pairing decision for the flight conditions.

Slide 18: Employee Adaptations

In theory, everybody has procedures that specify how work should be performed. But as we all know, these don't always describe how work actually gets done. This difference can cause problems, and “employee adaptations” can inadvertently sabotage safety.

Accident investigation reports refer to these as “violations” or “deviations from SOPs”. But let's look at this in a different light. Think about it in the context of limited resources: faced with time pressures and multiple goals, workers and management may be tempted to create “locally efficient practices” in order to get the job done. It's often tough to say exactly how much less safe such a practice is, and if it works—and the more often it works—then departures from the routine become the routine. Past successes are taken as a guarantee of future safety.

Slide 19: Employee Adaptations (continued)

Here's a 2009 example involving a risk of collision between a CL600 and an airport snow sweeper: On initial contact, no current position report or estimate for the airport was given by the crew or requested by the tower controller. The tower requested the crew to report 10 miles final, and advised that runway sweeping was in progress. The crew acknowledged the request. The aircraft landed approximately nine minutes later, after flying over two runway snow sweepers operating on the portion of the runway located before the displaced threshold for Runway 31L. A position report was not provided to the tower at 10 miles final, and no landing clearance was issued.

This investigation identified a number of deviations from Standard Operating Procedures by both ATC and the flight crew, including some procedures developed to guard against “forgetting” about a vehicle on the runway or an aircraft on final.

There are many reasons why procedures may not be followed. Sometimes it may reflect a lack of training, practice or awareness about the procedure, or about the rationale behind the procedure. In other cases, strictly following the procedures may conflict with other goals, whereas taking a shortcut may save time and effort. Perhaps there is a lack of supervision or quality assurance. The key is to understand the context-specific reasons for the gap between written procedures and real practices. This will help organizations better understand this natural phenomenon, and allow for more effective interventions, beyond simply telling the workers to “Follow the rules!” or to “Be more careful!”

Slide 20: Goal Conflicts

Routine pressures to get the job done can often exert an insidious influence on operators and decision makers, leading them to act in ways that are riskier than they realize. In a 2008 accident in British Columbia, a Grumman Goose amphibious aircraft with one pilot and seven passengers aboard departed from Vancouver on a flight to Powell River, in marginal VFR conditions. Nineteen minutes later, the aircraft crashed in dense fog.

The pilot was known for pushing the weather, and clients often requested this pilot, since he flew when others didn't. The company discussed its concerns over his decision making with the pilot three times; however, these discussions were not documented, as required by the company SMS. Sometimes the client used other operators when this company would not fly, putting safety goals directly in conflict with customer service and financial goals.

Slide 21: Weak Signals

In many accident investigations, “weak signals” indicating potential trouble were missed. By their nature, weak signals may not be sufficient to attract the attention of busy managers, who often suffer information overload while juggling many competing priorities under significant time pressures.

On January 7, 2007, a Beech King Air was on a medevac flight with two pilots and two emergency medical technicians aboard. While in a long landing flare on approach at night, the crew decided to go around. The aircraft did not maintain a positive climb rate and crashed into the trees at the other end of the runway. The captain was killed, while the others survived with injuries.

The TSB investigation found that the crew of two pilots was unable to work effectively as a team to avoid, trap, or mitigate errors and safely manage the risks associated with the flight. As our lead investigator at the time put it, “This crew did not employ basic strategies that could have helped prevent the chain of events leading to this accident.” This lack of coordination can be attributed in part to the fact that the crew had not received crew resource management (CRM) training. Previously, there had been numerous “crew pairing issues” with respect to this crew. The company's management knew about this, although they were unaware of the extent to which these factors could impair effective crew coordination.

Slide 22: Weak Signals (continued)

In a 2009 article, William Voss, President and CEO of the Flight Safety Foundation, said: “As random as these recent accidents look, though, one factor does connect them. We didn't see them coming and we should have… the data were trying to tell us something but we weren't listening.”

SMS is intended to provide an infrastructure in which ‘weak signals' can be amplified to the point where they will be acted upon before an accident occurs.

Slide 23: Pilot Error or Management Error?

As these examples have shown, organizational drift, goal conflicts and employee adaptations occur naturally in any complex organization. Organizations can and should learn from these occurrences, since they also demonstrate patterns of accident precursors (e.g., not thinking ahead to what might go wrong, not having an effective means to track and highlight recurrent maintenance or other safety deficiencies, insufficient training and/or resources to deal with unexpected events).

Although an individual operator's actions or inactions can clearly cause or contribute to incidents and accidents, “human error” is an attribution made following an adverse outcome, usually with the benefit of hindsight. Most people don't set out to make an error or cause an accident; they just want to get the work done. So it's important to view their actions/inactions in the organizational context in which they occurred. And, following an accident, it is important to figure out “why” they did what they did.

The decision to value production over safety is implicit—and unrecognized. Moreover, persistent success leads to a tendency to underestimate the amount of risk involved.

Slide 24: Pilot Error or Management Error? (continued)

Decision makers at all levels of an organization set and communicate the goals and priorities. They are usually aware of risks that need to be addressed and the need to make trade-offs. Charles Perrow says that “if investing in safety would improve a company's quarterly or yearly rate of return, they would do it. But often, if the investment—say, rebuilding pipelines that are becoming hazardous—might pay off only if a storm hits that particular area in the next twenty years, a company would be more reluctant to do so, especially if the executive making the decision is to retire in five years”.

That being said, decision makers normally don't want to cause or contribute to an accident either. But just wanting operations to be safe won't create safety unless this commitment is also supported by ‘mindful' processes such as formal risk assessments, increased hazard reporting, tracking of safety deficiencies and effective follow-up.

Ultimately, an SMS is only as effective as the safety culture in which it is embedded. There is a complex relationship between culture and process. SMS won't take hold unless there is a strong underlying commitment and buy-in to safety. While process changes (such as formal risk assessments and incident reporting) can stimulate changes in culture, they will only be sustainable in the long term if they are seen to add value.

Slide 25: The Role of Governance / Oversight

So, who holds senior management to account for the consequences of their trade-offs and the decisions they make about risk?

The owner or Board of Directors is usually the first line of governance in most private or public organizations and holds senior management to account. But depending on the owner's or director's background and level of expertise, he or she may or may not fully grasp the possible impact of certain decisions and practices on safety risks. Most Boards of Directors have created committees to oversee key management areas (e.g., HR, Finance, Pensions). Similarly, a Board Safety Committee could serve as a focal point for overseeing a company's safety management activities and safety performance.

Shareholders also have a real interest in the company's financial viability, but, as with owners and Boards of Directors, they may not be aware of the details of the day-to-day operational safety risks.

Customers (and by extension the general public) have a vested interest in the safe transport of their goods and personnel, whose very lives are in the hands of the air operator. Customers (particularly major oil and chemical companies) are becoming increasingly insistent that transportation companies demonstrate they have effective safety/risk management systems before contracting with them for transportation of their employees and products. Thus, an SMS is at least a marketing advantage, if not a prerequisite, for doing business with these companies.

Insurance companies? Well, they're ultimately on the hook, up to policy limits, for loss of life, property damage and third-party liability claims. A company with repeated insurable losses may face more than premium increases; it may become uninsurable.

Ultimately, transport regulators and regulations are in place to protect the public by requiring operators to meet certain minimum standards.

But—are regulators doing enough?

Slide 26: Governance / Oversight (continued)

In June 2010, a Beech King Air 100 departed the Quebec City airport with a reduced engine power setting—a procedure established by the company to reduce wear and tear on the engine, but not endorsed by the manufacturer. As such, the aircraft's performance during the takeoff was lower than that established during the aircraft's type certification. Within seconds after takeoff, the crew reported a problem with the right engine and stated their intention to return to land. Because the right engine's propeller blades were not feathered, there was excessive aerodynamic drag, which compromised the aircraft's ability to climb or maintain level flight. The aircraft descended and struck the terrain at the end of the runway, hitting a berm and killing the five passengers and two crew members in the post-impact fire.

The TSB found numerous safety deficiencies in the areas of pilot training, company operating procedures, maintenance documentation and the company's safety culture. And while inspections performed by Transport Canada (TC) revealed unsafe practices, the measures TC took to ensure compliance with regulations were not effective. As such, the unsafe practices continued.

At the end of the day, what evidence and processes are necessary to substantiate a regulator's decision to shut down a non-compliant operator?

Slide 27: Governance / Oversight (continued)

Some short-sighted operators may not be truly committed to implementing sound risk management policies, processes, and practices—believing these to add more bureaucracy than value. Regulators are increasingly encouraging operators to adopt new technology, training and safety programs (including SMS) on a voluntary basis, to avoid lengthy and costly rulemaking processes which don't always pass a cost-benefit analysis when the accident rate is already low.

In a recent message (AeroSafety World, July 2012), William Voss, President and CEO of the Flight Safety Foundation, wrote: “A major U.S. airline that implements all the voluntary FAA programs is clearly very safe, but that airline may have to compete with another carrier that decides to cut costs and not implement any of the same programs. The gap between what is legal and what is safe already is large, and it will get bigger… Is this regulatory approach sustainable? Is it fair to airlines that do everything right? Is it fair to an unknowing public?”

Slide 28: Conclusions

While SMS may not eliminate all accidents, a properly implemented safety management system can help reduce the risk. Over time, this should reduce the accident rate. Many lessons can be drawn from accident investigations and the experience of those operators who have implemented SMS.

Old views of safety are changing: It's no longer about mere “operator error.” Rather, safety is about identifying the risks that inevitably crop up because we're human, and then managing or minimizing them. Organizational drift, adaptations, goal conflicts, competing priorities … We'll never get rid of them entirely, so it's how we deal with them that matters.

No one can predict the future with perfect accuracy: That's impossible. However, a formal and sophisticated risk assessment process is a good start.

Mindful infrastructure: SMS is only as effective as the organizational safety culture in which it is embedded. Just wanting safety isn't enough. Organizations must “institutionalize” their approach so that goals, policies, processes, practices, communications, and culture are consistent and integrated.

Accountability is key: True “accountability” is about more than “blame and retrain” or firing the “bad apples.” It is about looking forward, and requires organizations and professions to take full responsibility to fix identified problems. So-called “near-miss” incidents must be viewed as “free opportunities” for organizational learning; people will only report if they feel “safe” to do so.

Regulatory oversight: There is a risk that some short-sighted companies will take a minimalist, bureaucratic, or checklist approach to adopting SMS and then believe they are “safe” because they have a “compliant” SMS. Regulators should not hesitate to enforce existing regulations to the point of limiting a company's ability to operate until the operator has demonstrated that effective actions have been implemented and unsafe practices eliminated.

Success takes time: Organizations must recognize it will take unrelenting commitment, time, resources and perseverance to implement an effective SMS.

Slide 29: Questions?

Slide 30: Canada wordmark