Evolving Approaches to Managing Safety and Investigating Accidents: "From Individual to Organizational Safety Management"

Kathy Fox
Board Member
Transportation Safety Board of Canada

26th International System Safety Conference
Vancouver, British Columbia
August 28, 2008

Introduction

Good morning, ladies and gentlemen. It's a pleasure to be here. This is the first time I have attended the International System Safety Conference. I've learned a lot listening to the other speakers and enjoyed meeting many of you in the past few days.

In keeping with the theme of this conference, "The Next Generation of System Safety Professionals", I would like to share with you some of my experiences and the lessons I've learned about managing safety risks during the various phases of my career, and how my own thinking has evolved about what needs to be done to reduce, if not eliminate, safety risks.

During my talk, I will briefly review some of the schools of thought about accident causation and, more importantly, prevention, and how that thinking led to the development of safety management systems in the Canadian aviation industry. In particular, I will outline some of the lessons I learned with respect to hazard identification, incident reporting and safety measurement. And finally, I will discuss how the Transportation Safety Board of Canada investigates transportation occurrences in the interests of advancing safety. While much of my talk will use examples from the transportation sector, I believe that many of the principles, processes and lessons learned are transferable to other safety-critical industries, such as nuclear power and medicine.

Early Thoughts on Safety

As you heard in my bio, my whole career has been about practicing safety. As an air traffic controller, I was responsible for ensuring that aircraft didn't run into each other or into the ground, and for providing pilots with operational information about weather, runway conditions and anything else necessary for flight safety. As a commercial pilot, I was responsible for transporting people safely from Point A to Point B. As a flight instructor, I teach people the basics of safe flying so they can earn their pilot's licence. As a Pilot Examiner, I assess pilots' skills to make sure they meet Transport Canada licensing standards and can fly safely. And in my current role as a Member of the Transportation Safety Board of Canada, I am responsible, together with the other Board Members, for analyzing the safety deficiencies found in accident investigations and for making recommendations to regulators, operators and manufacturers on what needs to be done to reduce risks to transportation safety.

It may seem odd, but early in my career, I didn't often think about what the word "safety" really meant. I was taught and believed that as long as you followed standard operating procedures, paid attention to what you were doing, didn't make any stupid mistakes or break any rules, and as long as equipment didn't fail, things would be "safe".

Accidents happened to those who didn't follow the necessary steps. This belief persisted when I went on to be the manager responsible for investigating Air Traffic Control losses of separation and other incidents where safety may have been compromised. We often attributed incidents to "not following procedures" or "loss of situational awareness" without really looking deeper into the "whys".

When I became Director of Safety at NAV CANADA, the then-recently privatized Air Navigation Service provider, I was responsible for implementing and managing a corporate safety oversight program. This involved developing a number of policies, processes and practices focused on identifying operational safety hazards, analyzing risks and reporting to the decision-makers who could take action on them.

Looking Beyond Operator Error

This role marked a major turning point in my thinking about safety. For the first time, I started to think and speak explicitly of risk management versus safety. I came to realize that safety does not mean zero risk, and that organizations must balance competing priorities and manage multiple risks: safety, economic, regulatory, financial, environmental and technological, to name a few.

While many safety critical organizations state that "safety is our first priority", I would argue that "customer service" or "return on shareholder investment" are really their top priorities. However, products and services must be "safe" if companies want to stay in business, avoid accidents and costly litigation and maintain customer confidence and a competitive advantage.

Any large safety-critical organization, be it an airline, an ANS provider, a railway, a nuclear power plant or hospital, must reconcile competing priorities and manage multiple risks.

Production pressures are often in conflict with safety goals. Case studies and accident investigations have frequently revealed how organizational factors such as the emphasis on productivity and cost control can result in trade-offs which inadvertently contribute to the circumstances that lead to an accident.

Reason's Model

I had the opportunity to meet Dr. James Reason at a safety conference and to discuss risk management concepts with him briefly. I found his work very compelling in showing how a chain of events, including organizational factors, can converge to create a window of opportunity for an accident to occur. Viewed through his accident causation model, how and why accidents happened all seemed logical.1

Reason's model was also useful for explaining to colleagues working in non-operational departments, such as Finance and Human Resources, how their policies and practices could inadvertently create the conditions leading to an accident. Safety, in other words, wasn't only the responsibility of the Operations, Engineering or Maintenance departments.

Here's a safety enhancement example involving a non-operational department: when NAV CANADA decided to replace its fleet of vans used to go out on runways at airports to maintain navigational aids, the Maintenance department was able to convince Finance to purchase school bus yellow vans rather than less expensive white vans, since the yellow ones are easier to see on snow-covered runways.

Westrum and SMS in Aviation

In 1998, NAV CANADA commissioned a review by Dr. Ron Westrum, a sociology professor at Eastern Michigan University, to identify the desirable characteristics of organizations that effectively manage safety.

Here they are, on the screen. There is much similarity with what other scholars have identified as the characteristics of High Reliability Organizations.

The characteristics Dr. Westrum identified became the genesis of the development of NAV CANADA's Safety Management System.

Since 2001, Transport Canada has been implementing requirements for Safety Management Systems, or SMS, in the railway and marine sectors. The commercial aviation sector has gradually been implementing SMS since 2005.

Now, let's put some of the requirements for SMS in aviation from the Canadian Aviation Regulations up on the screen.

As you see, Transport Canada's requirements map very closely to the characteristics of an organization that effectively manages safety, as identified by Dr. Westrum.

Let's look at some of these elements of SMS more closely. I will go over these also in the context of some other major concepts on system safety. These same concepts guide the work of the Transportation Safety Board when it conducts investigations, identifies safety deficiencies and makes recommendations to reduce risks.

Sidney Dekker

Some of my greatest insights came from studying the works of Dr. Sidney Dekker, Professor of Human Factors and Flight Safety and Director of Research at the School of Aviation at Lund University in Sweden. Professor Dekker maintains that:

  • Safety is never the only goal: organizations exist to provide goods and services and to make money at it;


  • People do their best to reconcile different goals simultaneously: for example, service or efficiency versus safety;


  • A system isn't automatically safe: people actually have to create safety through practice at all levels of the organization;


  • Production pressures influence people's trade-offs: this can make what was previously thought of as irregular or unsafe come to seem normal or acceptable.2

He goes on to say that "human error is not a cause of failure. Human error is the effect, or symptom, of deeper trouble. Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment."3

This is important because it is the organization that creates the operating environment and provides the tools, training and resources required to produce goods and services. Hence, in my view, organizational commitment to safety must permeate day-to-day operations if it is to be truly meaningful and effective.

Safety Hazards

The proactive identification of safety hazards is an important part of SMS.

For example, NAV CANADA adopted a policy that a formal risk assessment and safety management plan would be completed for any significant change to procedures, technology or organizational structure. When making changes to levels of service, the company would also conduct follow-up assessments 90 days and one year afterwards to confirm that potential hazards had been identified and mitigated and that no new hazards had been introduced.
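For illustration only, the following sketch (in Python) shows one simplified way a hazard log might score likelihood and severity for a proposed change and flag which risks call for mitigation and follow-up review. The 5-by-5 scale, the thresholds and the example hazards are assumptions made for this example; they are not NAV CANADA's or Transport Canada's actual criteria.

```python
# Illustrative only: a minimal hazard-log sketch showing how a formal risk
# assessment might score identified hazards for a proposed operational change.
# The 5x5 likelihood/severity scale and acceptability thresholds are assumed.
from dataclasses import dataclass


@dataclass
class Hazard:
    description: str
    likelihood: int   # 1 (extremely improbable) .. 5 (frequent)
    severity: int     # 1 (negligible) .. 5 (catastrophic)
    mitigation: str = ""

    @property
    def risk_index(self) -> int:
        # Simple likelihood x severity index, as in many risk matrices.
        return self.likelihood * self.severity

    def acceptability(self) -> str:
        # Assumed thresholds, for illustration only.
        if self.risk_index >= 15:
            return "unacceptable - redesign or do not proceed"
        if self.risk_index >= 8:
            return "tolerable - proceed only with mitigation and follow-up review"
        return "acceptable - monitor"


# Example: assessing a hypothetical change to a level of service.
hazards = [
    Hazard("Controller unfamiliarity with revised procedure", likelihood=3, severity=4,
           mitigation="briefings and simulator training before cut-over"),
    Hazard("Maintenance vehicle not visible on snow-covered runway", likelihood=2, severity=5,
           mitigation="high-visibility (yellow) vehicles"),
]

for h in hazards:
    print(f"{h.description}: index={h.risk_index} -> {h.acceptability()}")
```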

Identifying safety hazards is not without its challenges:

At the design phase, it is virtually impossible to predict all possible interactions of seemingly unrelated subsystems in a complex system, such as an airliner. Charles Perrow, a Yale University sociology professor who introduced the concepts of "complex interactions"4 and "tight coupling"5, gave one example of this: an incident in the early 1980s where frozen drinking water in the galley of a DC-8 cracked the water tank. Heat from ducts to the tail section melted this ice, and the resultant water spray landed on an outflow valve in the cabin pressurization system, which in turn made cabin pressure control difficult and led to an emergency landing. To quote Perrow: "This is not an interaction that a design engineer would normally think of when placing a drinking water tank next to the fuselage of a jetliner."6

In an operational context, changes in procedure might inadvertently compromise safety. Sidney Dekker asked thoughtfully, "why do safe systems fail?" One factor he identified was the "drift into failure"7: the slow, incremental movement of a system's operations towards the edge of its safety envelope. Pressures of scarcity and competition typically fuel this drift. Without knowledge of where the boundaries actually are, people don't see the drift and thus don't do anything to stop it.8 Dekker used the example of Alaska Airlines Flight 261, where the failure of a jackscrew in the MD-80 series aircraft's trim system resulted in a loss of control and the aircraft crashing off the coast of California, with the loss of all souls on board. The investigation revealed that, over time, the lubrication interval for the failed jackscrew had been extended from once every 300 to 350 flight hours to once every 8 months, or approximately 2550 flight hours. The jackscrew recovered from the accident site showed no evidence of adequate lubrication at the previous interval, meaning it might have been more than 5000 hours since the assembly had last been greased.9 The question then becomes: what risk assessment, if any, was done at the time to assess the impact of changing the lubrication intervals?

Even when risk assessments are done, we are limited by our own imagination. People don't fully appreciate what can go wrong, how seemingly independent events can interact, or the effects of proximity and common-mode failures. Boston College sociologist Diane Vaughan, who also researched this phenomenon, identified limited abilities to search for information and solutions to problems, combined with the influence of deadlines, limited participation and the number of problems under consideration, as factors that limit the scope of risk assessments.10

An example of this was the risk assessment done prior to the decision to launch the Space Shuttle Challenger in 1986 in spite of very cold temperatures. In her extensive review of the ill-fated shuttle launch, Vaughan describes how the Solid Rocket Booster (SRB) working group negotiated the risk of the SRB joints. The working group first identified a technical deviation, which was subsequently reinterpreted as within the norm for acceptable joint performance, and finally officially labelled an acceptable risk. In other words, they redefined evidence that deviated from an acceptable standard so that it became the standard.11 She termed this behaviour the "normalization of deviance".12

Incident Reporting

Incident reporting is a critical component of SMS.

A key aspect in establishing an incident reporting system is what types of events are categorized as reportable incidents. Traditionally they have been events resulting in adverse outcomes. By defining incidents too narrowly, an organization risks losing information about other types of events that indicate potential risks and vulnerabilities in the system, such as evidence of drift into failure or normalization of deviance.

In the Canadian Air Navigation System, not only are actual losses of separation reported - those instances where the minimum spacing between aircraft was not maintained - but also those where spacing was achieved but not assured. This generates much richer data for analysis of system vulnerabilities.
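To illustrate why the broader definition matters, here is a simplified sketch (in Python) of how occurrences might be classified into "loss of separation" versus "separation not assured" and then summarized for analysis. The classification logic and the 5-nautical-mile / 1,000-foot figures are assumptions for the example, not the actual separation standards applicable in every class of airspace.

```python
# Illustrative sketch only: classifying ATC occurrences more broadly than
# "loss of separation" so that precursor events are also captured for analysis.
# The 5 NM / 1000 ft minima below are simplified assumptions for the example.
from typing import Dict, List


def classify(lateral_nm: float, vertical_ft: float, separation_assured: bool) -> str:
    minimum_met = lateral_nm >= 5.0 or vertical_ft >= 1000.0
    if not minimum_met:
        return "loss of separation"       # minimum spacing not achieved
    if not separation_assured:
        return "separation not assured"   # spacing achieved, but not guaranteed by plan
    return "no event"


def summarize(occurrences: List[Dict]) -> Dict[str, int]:
    counts: Dict[str, int] = {}
    for o in occurrences:
        label = classify(o["lateral_nm"], o["vertical_ft"], o["assured"])
        counts[label] = counts.get(label, 0) + 1
    return counts


# A reporting system that records only actual losses of separation would see
# one reportable event here; the broader definition surfaces two opportunities
# to learn about system vulnerabilities.
occurrences = [
    {"lateral_nm": 3.2, "vertical_ft": 600, "assured": False},
    {"lateral_nm": 6.1, "vertical_ft": 0, "assured": False},
    {"lateral_nm": 8.0, "vertical_ft": 2000, "assured": True},
]
print(summarize(occurrences))
```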

In his book "Forgive and Remember: Managing Medical Failure", Charles Bosk cautions that dangerous near-misses are, as a rule, only appreciated as harbingers of disaster after a disaster has materialized. Until then, they are weak or missed signals.13 He also goes on to say:

"Medical decision making is a probabilistic enterprise. Presented with a patient with a set of symptoms, the physician makes a diagnosis and decides on an intervention. not all diagnoses and treatments that later experience proves wrong are mistakes; some are actions that any reasonable physician would have made under the circumstances."14

If data on "near misses" is not analyzed properly, opportunities are lost to prevent further incidents from occurring.15

A caution though: simply counting errors doesn't necessarily generate any meaningful or relevant safety data. Furthermore, measuring performance based solely on error trends can be misleading, as the absence of error and incidents does not imply an absence of risk.16

Another key aspect is what kind of processes and structure an organization will need to support incident reporting. Will reporting be voluntary or mandatory? Will the system identify reporters, or will it be confidential or anonymous? To whom will reports be submitted? Will reporters be subject to discipline?

The consensus, according to Dekker and Laursen, is that fear of retribution hampers people's willingness to report. Conversely, non-punitive systems generate more reports - and, by extension, more learning - because people feel free to tell about their troubles.17 Interestingly, Dekker and Laursen also found that the main reason operators made reports was not the absence of retribution but rather the realization that they could "make a difference". They note, too, the importance of reporting to an operationally knowledgeable safety group, which in turn helps the reporter make sense of the performance and the context in which the incident occurred.18

Organizational Culture

Transport Canada is introducing SMS to aviation because it believes that proactive safety management fully integrated into a company's day-to-day operations is a more effective way to further reduce an already low accident rate.

But SMS is only as effective as the organizational culture in which it is embedded.

Diane Vaughan says, "The work group culture contributes to decision making … by becoming part of the worldview that the individuals in the work group bring to the interpretation of information."19 Each sub-unit within a larger organization may exhibit a different culture.

A finding of the Columbia Accident Investigation Board, which investigated the loss of the Space Shuttle Columbia in 2003, was that detrimental cultural traits and organizational practices had been allowed to develop. Key among these were:

  • reliance on past success as a substitute for sound engineering practices;


  • organizational barriers that prevented effective communication of safety-critical information and stifled professional differences of opinion; and,


  • the evolution of an informal chain of command and decision-making processes that operated outside the organization's rules.20

Charles Perrow warns of the risk of undesirable cultural traits that can form in an organization's management team. He said, "Managers come to believe their own rhetoric about safety first because information indicating otherwise is suppressed for reasons of organizational politics."21

Safety and Accountability

How do organizations respond to failure? How are safety and accountability balanced? These questions are particularly relevant given the apparent and growing trend towards the criminalization of human error.

The criminalization of human error can have a detrimental effect on safety. Sidney Dekker's recently released book, Just Culture, explores this: "When a professional mistake is put on trial, safety almost always suffers. Rather than investing in safety improvements, people in the organization or profession invest in defensive posturing. Rather than increasing the flow of safety-related information, legal action has a way of cutting off that flow."22

He goes on to describe the desirable traits of a "just culture", which you will now see on the screen.

Just before I retired in 2007, NAV CANADA management and the Canadian air traffic controllers' union had begun discussions about developing a "just culture" framework for deciding when "operational misbehaviour" might warrant discipline. The emphasis was on developing an open and transparent process, involving management and peers, for determining whether a specific incident warranted discipline. In my view, the most important aspect of this joint initiative was the willingness of the union and management to work together to solve problems and enhance safety.

The Transportation Safety Board (TSB)

I would now like to talk a little bit about the Transportation Safety Board of Canada and how many of the concepts I have discussed so far have shaped the thinking of this organization.

The TSB is an independent government organization with a mandate to advance transportation safety by conducting investigations in the marine, rail, air and pipeline modes.

When the TSB investigates accidents, our mandate is not to assign fault or determine civil or criminal liability.

In the early 1990s, the TSB adopted Reason's model of accident causation as a fundamental framework underlying its approach to accident investigation. Not only did it capture the notion of multicausality, but it clearly demonstrated that, although human error may appear as the last step in an accident sequence, it in fact takes place within a broader organizational context. From this perspective, human error is merely the starting point for an investigation.

In the mid-1990s, the TSB formalized its accident investigation methodology, the Integrated Safety Investigation Methodology (ISIM). The ISIM process begins immediately after the notification of an occurrence, when data is collected and assessed to determine whether a full investigation is warranted. This decision hinges on whether there is significant potential for the investigation to reduce future risks to people, property or the environment. At the core of the methodology is Reason's model, which is one of a number of human factors frameworks used to analyze the information collected.

In the late 1990s, Westrum's concepts were incorporated into the investigation methodology with the issuance of a Guide for Investigating Organizational and Management Factors. More recently, investigators have been taught concepts such as Snook's "practical drift"23, Vaughan's "normalization of deviance"24 and Dekker's work on the need to understand an accident from the perspective of the individuals involved - in particular, understanding why their actions made sense to them at the time, given the specific operational and organizational context.25

Swissair Flight 111

Let me illustrate these concepts using the crash of Swissair Flight 111 on September 2, 1998, as an example. As some of you might not be familiar with what happened that day, I will take a few minutes to provide a synopsis.

Swissair Flight 111, a McDonnell Douglas MD-11, departed New York City on a scheduled flight to Geneva, Switzerland, with 215 passengers and 14 crew members on board.

About 53 minutes later, while cruising at flight level 330, the crew smelled an abnormal odour in the cockpit. Their attention was drawn to the area behind and above them, and they began to investigate what they believed to be the source - the air conditioning system. After further troubleshooting, they assessed that there was definitely smoke and decided to divert to Halifax.

While the flight crew was preparing to land, they were unaware that a fire was spreading above the cockpit ceiling. Soon thereafter, the aircraft's flight data recorder logged a rapid succession of system failures. The crew declared an emergency and an immediate need to land.

About one minute later, radio communications and radar contact were lost, and the flight recorders stopped functioning. About five and a half minutes later, the aircraft crashed into the ocean with the loss of all 229 souls on board.

The crew did what made sense to them at the time. Piecing together the sequence of events, and knowing what the crew knew then, we ran a number of detailed scenarios and concluded that they could not have landed the plane.

As with all our investigations, the TSB took the time necessary to conduct a thorough examination of the safety deficiencies, causes and contributing factors. We did not assign fault or blame. In employing ISIM, we looked beyond the immediate causes to find the underlying failures in the system in which aircraft and humans operate, in order to make recommendations to prevent a similar occurrence in the future.

The Board made a total of 23 recommendations as part of the Swissair investigation. They can be grouped in the following areas:

  • on-board recorders,
  • circuit breaker resetting procedures,
  • the supplemental type certification process,
  • material flammability, and
  • in-flight firefighting.

Let's take a look at the area of material flammability in the context of some of the ideas I talked about earlier.

The TSB investigation found that the most significant deficiency behind the Swissair crash was the presence of flammable materials that allowed the fire to ignite and propagate. A fire started above the cockpit ceiling, where electrical arcing ignited adjacent insulation blankets and allowed the fire to spread. The blankets' cover material, metallized polyethylene terephthalate (MPET), was certified according to the flammability standards in place at the time.

The investigation found that the existing flammability testing requirements were neither stringent nor comprehensive enough to represent the full range of potential ignition sources. Nor did the test procedures replicate the behaviour of materials as they are found in typical aircraft installations and realistic operating environments. These factors illustrate both the difficulty of predicting all the possible interactions of various subsystems, as Perrow indicates, and the limited scope of risk assessments, as described by Diane Vaughan.

TSB investigators also looked at the factors influencing the flammability standards in place at the time of the accident. For example, the FAA concentrated its fire prevention efforts on cabin interior materials, such as those passengers can see or touch, and on materials in designated fire zones. Consequently, a lower priority was placed on fire threats from other areas.

Prior to the Swissair accident, there had been some ground-fire occurrences involving the type of insulation material used in the accident aircraft. These prompted McDonnell Douglas, other manufacturers and civil aviation authorities, including the FAA, to do additional work to better understand the risks, but the FAA did not mandate any action to mitigate them. McDonnell Douglas even stopped using MPET-covered insulation in its production aircraft and issued a service bulletin recommending its removal, but the FAA did not issue any directives to mandate its removal at the time.

The action taken with regard to insulation flammability involved a number of manufacturers and international civil aviation authorities. Here we see how the cultural traits of the organizations involved may have been a factor. Given these early warning signs, some were willing to take action, while others, for various reasons, were reluctant to do so - perhaps because prior solutions had become institutionalized through past success, rather than being revisited when a problem was identified.

The TSB issued eight flammability-related recommendations as part of the Swissair investigation. They ranged from more rigorous flammability test standards and the removal of materials that do not meet those standards, to certification requirements that more closely represent realistic operating conditions and interactions among subsystems.

Following the investigation, regulatory authorities mandated the removal of MPET insulation from many aircraft and instituted a new, more rigorous flammability test.

However, more action needs to be taken to review other insulation materials used in aircraft, to evaluate the risks posed by materials that failed the new flammability requirements, and to develop more realistic standards for evaluating wiring failure characteristics and how aircraft systems and components might worsen fires. Unfortunately, 10 years after this devastating accident, work still remains to be done on 18 of the 23 Swissair recommendations. So the Board decided to issue a call for action to remind regulators, manufacturers and operators that these issues and recommendations are still relevant.

Conclusion

In closing, and in keeping with the theme of this conference, here is some of what I think the safety professional of the future needs to consider:

  • That adverse outcomes arise from a complex interaction of organizational, technical and human performance factors that are difficult, if not impossible, to predict, even using formal and sophisticated risk assessment processes;


  • That people at all levels in the organization create safety - or not - and that organizations must "internalize" their approach to safety risk management so that goals, policies, processes, practices, communications and culture are consistent and integrated (admittedly much easier said than done!);


  • That so called "near-miss" incidents must be viewed as a "free opportunity" for organizational learning; that people will only report if they feel "safe" to do so, knowing that their input will be treated seriously, and they are empowered to offer their own suggestions for solutions;26


  • That accident investigations are at best "constructions" of what happened and why, often based on incomplete data. Investigators must focus on why it made sense for those involved to do what they did at the time; refrain from using judgmental language; and try to avoid the traps of hindsight bias, micro-matching, and cherry-picking data to fit a world they now know to be true;27


  • That true "accountability" is about more than retribution (i.e., finding and retraining, disciplining or firing the "bad apples"); it is forward-looking, and it requires organizations and professions to take full responsibility for fixing identified problems.28, 29

I encourage you to read widely but critically, to discuss and debate these issues with your colleagues and other subject matter experts in the field of safety risk management, and to continually challenge the assumptions people make and the language they use about what promotes effective safety management or leads to accidents.

Thank you for your attention.

Bibliography

  1. Bosk, C.L. (2003). Forgive and Remember: Managing Medical Failure, 2nd Edition. The University of Chicago Press.

  2. Columbia Accident Investigation Board (2003). Columbia Accident Investigation Report, Vol. 1, August 2003.

  3. Dekker, S. (2005). Ten Questions About Human Error. Lawrence Erlbaum Associates.

  4. Dekker, S. (2006). The Field Guide to Understanding Human Error. Ashgate Publishing Ltd.

  5. Dekker, S. and Laursen, T. (2007). "From Punitive Action to Confidential Reporting." Patient Safety and Quality Healthcare, September/October 2007.

  6. Dekker, S. (2007). Just Culture. Ashgate Publishing Ltd.

  7. Perrow, C. (1999). Normal Accidents. Princeton University Press.

  8. Reason, J. (1995). "A Systems Approach to Organizational Error." Ergonomics, 39, 1708-1721.

  9. Rochlin, G., La Porte, T. and Roberts, K. (1987). "The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea." Naval War College Review, Autumn 1987.

  10. Sharpe, V.A. (2004). Accountability: Patient Safety and Policy Reform. Georgetown University Press.

  11. Snook, S. (2000). Friendly Fire. Princeton University Press.

  12. Vaughan, D. (1996). The Challenger Launch Decision. The University of Chicago Press.

1.   Reason, J. (1995), "A Systems Approach to Organizational Error," Ergonomics, vol. 39, pp. 1708-1721.

2.   Dekker, S. (2006), The Field Guide to Understanding Human Error, Ashgate Publishing Ltd., p. 3.

3.   Dekker, S. (2006), p. 15.

4.   Perrow, C. (1999), Normal Accidents, Princeton University Press, p. 78.

5.   Perrow, C. (1999), pp. 89-90.

6.   Perrow, C. (1999), pp. 135-136.

7.   Dekker, S. (2005), Ten Questions About Human Error, Lawrence Erlbaum Associates, p. 18.

8.   Dekker, S. (2005), p. 18.

9.   Dekker, S. (2005), pp. 18-24.

10.   Vaughan, D. (1996), The Challenger Launch Decision, The University of Chicago Press, p. 37.

11.   Vaughan, D. (1996), p. 65.

12.   Vaughan, D. (1996).

13.   Bosk, C.L. (2003), Forgive and Remember: Managing Medical Failure, 2nd Edition, The University of Chicago Press, p. xxiv.

14.   Bosk, C.L. (2003), pp. 23-24.

15.   Dekker, S. and Laursen, T. (2007), "From Punitive Action to Confidential Reporting," Patient Safety and Quality Healthcare, September/October 2007.

16.   Reason, J. (1995).

17.   Dekker, S. and Laursen, T. (2007), p. 50.

18.   Dekker, S. and Laursen, T. (2007), p. 54.

19.   Vaughan, D. (1996), pp. 64-65.

20.   Columbia Accident Investigation Report, Vol. 1, August 2003, p. 177.

21.   Perrow, C. (1999), p. 370.

22.   Dekker, S. (2007), Just Culture, Ashgate Publishing Ltd., p. 21.

23.   Snook, S. (2000), Friendly Fire, Princeton University Press.

24.   Vaughan, D. (1996).

25.   Dekker, S. (2005), (2006).

26.   Dekker, S. and Laursen, T. (2007).

27.   Dekker, S. (2006), p. 29.

28.   Dekker, S. (2007).

29.   Sharpe, V.A. (2004), Accountability: Patient Safety and Policy Reform, Georgetown University Press.