Opening Remarks at the International Women in Aviation Conference
Transportation Safety Board of Canada
Evolving Approaches to Managing Safety and Investigating Accidents
"From Individual to Organizational Safety Management"
27 February, 2010
Slide 1 - Introduction
Good afternoon. It's a pleasure to be here. I've learned a lot listening to the other speakers and enjoyed meeting many of you in the past few days.
I would like to share with you some of my experiences and the lessons I've learned about managing safety risks during the various phases of my career, and how my own thinking has evolved about what needs to be done to reduce, if not eliminate, safety risks.
Slide 2 - Presentation Outline
During my talk, I will briefly review some of the schools of thought about accident causation, and more importantly, prevention, and how that led to the introduction of safety management systems in the Canadian aviation industry. In particular, I will outline some of the lessons I learned about organizational drift into failure, employee adaptations, hazard identification, incident reporting and safety measurement.
And finally, I will discuss how the Transportation Safety Board of Canada investigates transportation accidents with the goal of advancing safety. While much of my talk will use examples from the aviation sector, I believe that many of the principles, processes and lessons learned are transferable to other safety critical industries, such as nuclear power and medicine.
Slide 3 - Early Thoughts on Safety
Like many of you who fly or work in aviation, my whole career has been about practicing safety. As an air traffic controller, I was responsible for ensuring that aircraft didn't run into each other or into the ground, and providing pilots information necessary for flight safety, such as weather or runway conditions. As a commercial pilot, I was responsible for transporting people safely from Point A to Point B. As a flight instructor, I teach people the basics of safe flying so they can earn their pilot's license. As a Pilot Examiner, I assess pilots' skills to make sure they meet Transport Canada licensing standards and can fly safely. And in my current role as a Member of the Transportation Safety Board of Canada, I am responsible with the other Board members for analyzing the safety deficiencies found in accident investigations and making recommendations to regulators, operators and manufacturers on what needs to be done to reduce risks to transportation safety.
It may seem odd, but early in my career, I didn't often think about what the word "safety" really meant. I was taught and believed that as long as you followed standard operating procedures, paid attention to what you were doing, didn't make any stupid mistakes or break any rules, and as long as equipment didn't fail, things would be "safe".
Accidents happened to those who didn't follow the necessary steps. This belief persisted when I went on to be the manager responsible for investigating Air Traffic Control losses of separation and other incidents where safety may have been compromised. We often attributed the causes to "not following procedures" or "loss of situational awareness" without really looking deeper into the "whys".
When I became Director of Safety at NAV CANADA, the then-recently privatized Air Navigation Service provider, I was responsible for implementing and managing a corporate safety oversight program. This involved developing a number of policies, processes and practices for identifying operational safety hazards, analyzing risks and reporting them to decision-makers who could take action on them.
This role caused a major turning point in my thinking about safety. For the first time, I started to explicitly think and speak of risk management versus safety.
Slide 4 - Balancing Competing Priorities
I came to realize that safety does not mean zero risk, and that organizations must balance competing priorities and manage multiple risks: safety, economic, regulatory, financial, environmental and technological, to name a few.
While many safety critical organizations state that "safety is our first priority", there is convincing evidence to suggest that "customer service" or "return on shareholder investment" are really their top priorities. However, products and services must be "safe" if companies want to stay in business, avoid accidents and costly litigation and maintain customer confidence and a competitive advantage.
Production pressures are often in conflict with safety goals. Case studies and accident investigations often reveal how organizational factors such as the emphasis on productivity and cost control can result in trade-offs which inadvertently contribute to the circumstances leading to accidents.
Slide 5 - Reason's Model
Some of you are familiar with Reason's model, more commonly known as the "Swiss cheese" model of accident causation.
I had the opportunity to meet Dr. James Reason at a safety conference and briefly discussed risk management concepts with him. I found his work very compelling in terms of showing how a chain of events, including organizational factors, can converge to create a window of opportunity for an accident to occur. Using his accident causation model, it all seemed logical how and why accidents happened. Reason's model was also useful for explaining to colleagues working in non-operational departments, such as Finance and HR, how their policies and practices could inadvertently create the conditions leading to an accident. As such, safety wasn't only the responsibility of the Operations, Engineering or Maintenance departments.
Slide 6 - Sidney Dekker - Understanding Human Error
Some of my greatest insights came from studying under Dr. Sidney Dekker, Professor of Human Factors and Flight Safety and Director of Research at the School of Aviation at Lund University in Sweden. Professor Dekker maintains that:
- (click) Safety is never the only goal: organizations exist to provide goods and services and to make money at it;
- (click) People do their best to reconcile different goals simultaneously: for example, service or efficiency versus safety;
- (click) A system isn't automatically safe: people actually have to create safety through practice at all levels of the organization;
- (click) Production pressures influence people's trade-offs: this makes what was previously thought of as irregular or unsafe, normal or acceptable.
Slide 7 - Sidney Dekker - Understanding Human Error (Cont'd)
He goes on to say that "human error is not a cause of failure. Human error is the effect, or symptom, of deeper trouble. Human error is not random. It is systematically connected to features of people's tools, tasks and operating environment."
This is important because it is the organization that creates the operating environment, provides the tools, training and resources required to produce goods and services. Hence, in my view, organizational commitment to and investment in safety must permeate the day-to-day operations if it's to be truly meaningful and effective.
Given the constant need to reconcile competing goals and the uncertainties involved in assessing safety risks, how can managers recognize if or when they are drifting outside the boundaries of safe operation while they focus on their other priorities?
Safety Management Systems were designed to help build safety into everything your company does.
Slide 8 - Safety Management Systems (SMS)
James Reason describes Safety Management Systems (or SMS) as "A systematic, explicit and comprehensive process for managing safety risks... it becomes part of that organization's culture, and of the way people go about their work." (J. Reason 2001)
SMS is generally defined as a formalized framework for integrating safety into an organization's daily operations, including the necessary organizational structures, accountabilities, policies and procedures. The concept of Safety Management Systems (SMS) evolved from the contemporary principles of high reliability organizations, strong safety culture and organizational resilience. It originated in the chemical industry in the 1980s and has since evolved and been progressively adopted in other safety critical industries around the world. In particular, the International Civil Aviation Organization recognized that traditional approaches to safety management, based primarily on compliance with regulations, reactive responses following accidents and incidents, and a 'blame and punish' or 'blame and retrain' philosophy, were insufficient to reduce accident rates.
Slide 9 - Safety Management Systems (SMS)
Since 2001, Transport Canada has been implementing requirements for SMS in the railway and marine sectors. The commercial aviation sector has gradually been implementing SMS since 2005. The requirements are listed here.
Some people misconstrue SMS as a form of deregulation or industry self-regulation. However, just as organizations rely on internal financial and HR management systems to manage their financial assets and human resources, SMS is a framework designed to enable companies to better manage their safety risks. This does not preclude the need for effective regulatory oversight.
Slide 10 - Elements Of SMS
SMS includes the following vital elements: hazard identification, a system to report and analyze information about incidents and a company safety culture with clear lines of accountability that ensures hazards and incidents continue to be identified, reported, analyzed and acted upon.
Let's look at some of these elements of SMS more closely. I will also go over these within the context of some of the major concepts on system safety we have discussed thus far. These same concepts guide the work of the Transportation Safety Board of Canada when it conducts investigations, identifies safety deficiencies and makes recommendations to reduce risks.
Slide 11 - SMS: Hazard Identification
The proactive identification of safety hazards is a key cornerstone of SMS. It's really about using a structured process to think through what might go wrong - to find trouble before trouble finds you.
Identifying safety hazards is not without its difficulties. Before an accident, it can be quite challenging intellectually to try to identify all of the ways that things might go wrong. Sociologist Ron Westrum calls this ability "requisite imagination". "The cultivation of imaginative inquiry into potential problems often avoids the occurrence of these problems in real life... Members of the organization are given a license to think, and use it to probe into things that might go wrong."
Slide 12 - SMS: Hazard Identification (cont'd)
Often, a company doesn't recognize changes in its operations or doesn't consider the impacts of equipment design factors, operator training/experience/workload or local adaptations. In describing why deteriorations in safety defences leading up to two accidents had not been detected and repaired, Reason suggested "the people involved had forgotten to be afraid...If eternal vigilance is the price of liberty, then chronic unease is the price of safety."
In an operational context, changes in procedure might inadvertently compromise safety. Sidney Dekker asked thoughtfully "why do safe systems fail?" One factor he identified was the "drift into failure". This is about the slow and incremental movement of systems operations towards the edge of their safety envelope. Pressures of scarcity and competition typically fuel this drift. Without knowledge of where the safety boundaries actually are, people don't see the drift and thus don't do anything to stop it.
Slide 13 - MK Airlines - (October 2004)
With the best of intentions, organizations may develop policies and procedures to mitigate known safety risks which then subsequently erode under production pressures. The following case provides a concrete example:
A Boeing 747 on an international cargo flight crashed during take-off when the crew inadvertently used the aircraft weight from a previous leg to calculate take-off performance data. This resulted in incorrect V speeds and a thrust setting too low to enable the aircraft to take off safely given its actual weight. Crew fatigue likely increased the probability of errors in calculating takeoff performance data and degraded the flight crew's ability to detect the errors, in combination with the dark take-off environment.
The company was experiencing significant growth and a shortage of flight crews. During the previous four years, the company had gradually increased the maximum allowable duty period from 20 hours (with a maximum of 16 flight hours) to 24 hours (with a maximum of 18 flight hours). Originally, the crew complement consisted of two captains, two co-pilots and two flight engineers but was then revised to include three pilots and two flight engineers. At the time of the accident, the flight crew had been on duty for almost 19 hours and, due to earlier delays experienced, would likely have been on duty for approximately 30 hours at their final destination had the remaining flights continued uneventfully. The Crewing Department routinely scheduled flights in excess of the 24 hour limit. This routine non-adherence to the Operations Manual contributed to an environment where some employees and company management felt that it was acceptable to deviate from company policy and/or procedures when it was considered necessary to complete a flight or a series of flights.
Sociologist Diane Vaughan has defined this as the "normalization of deviance" - when deviations from a standard are institutionalized so that they, in effect, become a new standard.
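The duty-period figures quoted in the MK Airlines case can be checked with simple arithmetic. The short Python sketch below is only an illustration: the limits and hours come from the talk, while the function name and variable names are hypothetical and are not taken from the company's Operations Manual or the TSB report.

```python
# Illustrative check of the duty-period figures quoted in the talk.
# The hours are taken from the text; all names here are hypothetical.

ORIGINAL_LIMIT_HOURS = 20.0  # earlier maximum allowable duty period
REVISED_LIMIT_HOURS = 24.0   # maximum allowable duty period at the time


def hours_over_limit(duty_hours: float, limit_hours: float) -> float:
    """Return how far a duty period exceeds a limit (0.0 if within it)."""
    return max(0.0, duty_hours - limit_hours)


on_duty_at_accident = 19.0       # "almost 19 hours" in the talk
projected_at_destination = 30.0  # "approximately 30 hours" in the talk

# Within the revised limit at the time of the accident.
print(hours_over_limit(on_duty_at_accident, REVISED_LIMIT_HOURS))       # 0.0
# Projected duty time against the revised and the original limits.
print(hours_over_limit(projected_at_destination, REVISED_LIMIT_HOURS))  # 6.0
print(hours_over_limit(projected_at_destination, ORIGINAL_LIMIT_HOURS)) # 10.0
```

On these figures, the planned series of flights would have exceeded even the relaxed 24-hour limit by about 6 hours, and the original 20-hour limit by about 10 hours.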
Slide 14 - Organizational Drift / Employee Adaptations
Organizational drift usually isn't visible from inside the organization because incremental changes are always occurring. It often becomes visible to outsiders only after an adverse outcome (such as an accident), and then often thanks primarily to the benefits of hindsight.
Drift can also occur at an operation's front lines. In the context of limited resources, time pressures and multiple goals, workers often create "locally efficient practices" to get the job done. Accident investigation reports sometimes describe these as "violations" or "deviations from SOPs". But let's look at this in a different light. Dekker says: "Emphasis on local efficiency or cost-effectiveness pushes operational people to achieve or prioritize one goal or a limited set of goals... (that are) easily measurable ... whereas it is much more difficult to measure how much is borrowed from safety. Past success is taken as a guarantee of future safety. Each operational success achieved at incremental distances from the formal, original rules can establish a new norm....Departures from the routine become routine...violations become compliant behaviour".
Slide 15 - Fox Harbour - Touch Down Short Of Runway (November 2007)
A practical example of this was revealed by the TSB investigation into why a Global 5000 business jet touched down 7 feet short of the runway in Fox Harbour, Nova Scotia. The TSB learned that the operator endorsed a practice whereby flight crews would "duck" under visual glide slope indicator systems to land as close as possible to the beginning of the relatively short runway. The crew had previously flown into this airport in a Challenger 604 and was still adjusting to this new, larger aircraft. They weren't aware of the Global's eye-to-wheel height or of the fact that the VGSI in use was not suitable for that type. As a result, the aircraft did not meet the manufacturer's recommendation of crossing the threshold at a height of 50 feet.
Understanding the context-specific reasons for the gap between written procedures and real practices will help organizations better understand this natural phenomenon and allow for more effective interventions beyond simply telling the workers to "Follow the rules!" or to "Be more careful!".
Slide 16 - SMS: Incident Reporting
Incident reporting is another critical component of SMS.
The main question has always been, "what's a reportable incident?" Traditionally these have been defined as events resulting in adverse outcomes. By defining incidents too narrowly, an organization risks losing information about events that could indicate potential system vulnerabilities, such as evidence of drift into failure or the normalization of deviance.
In the Canadian Air Navigation System, not only are actual losses of IFR separation reported - those instances where minimum spacing between aircraft was not achieved - but also those where adequate spacing was achieved but not assured. This generates much richer data for analyzing system vulnerabilities.
But information is only as good as how it's analyzed. While organizations gather incident reports, many organizations have limited resources to properly analyze them, keep track of deficiencies and identify patterns. Perhaps the persons who are in the best position to identify and act on known hazards are stretched too thin or focussed on other priorities.
In his book "Forgive and Remember: Managing Medical Failure", Charles Bosk cautions that dangerous near-misses are, as a rule, only appreciated as harbingers of disaster after a disaster has materialized. Until then, they are weak or missed signals. If data on 'near misses' is not analyzed properly, organizations lose opportunities to learn how to prevent future incidents.
By their nature, 'weak signals' may not be sufficient to attract the attention of busy managers, who often suffer from information overload while juggling many competing priorities under significant time pressures. In several accidents, early warning signs of a hazardous situation were either not recognized or not effectively addressed.
Further to this, in an article reviewing a number of recent aviation accidents, William Voss, President and CEO of Flight Safety Foundation said: "As random as these recent accidents look, though, one factor does connect them. We didn't see them coming and we should have... the data were trying to tell us something but we weren't listening."
Slide 17 - SMS: Incident Reporting (Cont'd)
A caution though: simply counting errors doesn't necessarily generate any meaningful or relevant safety data. Furthermore, measuring performance based solely on error trends can be misleading, as the absence of error and incidents does not imply an absence of risk.
Another key aspect is what kind of processes and structure an organization will need to support incident reporting. Will it be voluntary or mandatory? Will the system identify reporters, be confidential or anonymous? To whom will the reports be submitted? Are reporters susceptible to discipline?
The consensus, according to Dekker and Laursen, is that fear of retribution hampers people's willingness to report. Conversely, non-punitive systems generate more reports - and by extension, more learning - because people feel free to tell about their troubles. Additionally, Dekker and Laursen found that the main reason operators made reports was not the lack of retribution but rather the realization that they could 'make a difference'. They also note the importance of reporting to an operationally knowledgeable safety group, which in turn helps the reporter make sense of the performance and the context in which the incident occurred.
Slide 18 - SMS: Organizational Culture
Canada's aviation regulator, Transport Canada, is introducing SMS to aviation because it believes that integrating proactive safety management into a company's day-to-day operations would further reduce an already low accident rate.
But SMS is only as effective as the organizational culture in which it is enshrined.
Diane Vaughan says "The work group culture contributes to decision making ... by becoming part of the worldview that the individuals in the work group bring to the interpretation of information." Each sub-unit within a larger organization may exhibit a different culture.
The investigation into the loss of the Space Shuttle Columbia in 2003 found that detrimental cultural traits and organizational practices were allowed to develop. Key among these were:
- reliance on past success as a substitute for sound engineering practices;
- organizational barriers that prevented effective communication of safety-critical information and stifled professional differences of opinion; and
- the evolution of an informal chain of command and decision-making processes that operated outside the organization's rules.
Charles Perrow warns of the risk of undesirable cultural traits that can form in an organization's management team. He said, "Managers come to believe their own rhetoric about safety first because information indicating otherwise is suppressed for reasons of organizational politics."
Slide 19 - SMS: Accountability
How do organizations respond to failure? How do they balance safety and accountability? This is particularly relevant with the apparent and increased trend towards criminalizing human error.
The criminalization of human error can have a detrimental effect on safety. Sidney Dekker's book, Just Culture, explores this: "When a professional mistake is put on trial, safety almost always suffers. Rather than investing in safety improvements, people in the organization or profession invest in defensive posturing ... Rather than increasing the flow of safety-related information, legal action has a way of cutting off that flow."
Slides 20 and 21
*Please note that items for the next two slides will start appearing one by one as soon as you click on slide 20.
He goes on to describe the desirable traits of a "just culture", which you will now see on the screen. (PAUSE)
Slide 22 - SMS Benefits and Pitfalls
Implementing SMS does not, and realistically cannot, totally immunize organizations against failing to identify hazards and mitigate risks. Nor can it ensure that goal conflicts are always reconciled in favour of safety, or prevent insidious adaptations and drift. In fact, nothing can.
But there is considerable evidence that SMS is having a positive impact on how many organizations make decisions and manage risk. In particular, some organizations have adopted more formal, structured approaches to searching for and documenting hazards (i.e. a "mindful infrastructure"), causing decision-making criteria to shift conservatively. By implementing such process changes, thinking about what might go wrong becomes part "of the way people go about their work" and "part of (that) organization's culture". They are receiving more reports from employees about 'near misses', events they would not have heard about previously, but only when employees feel "safe" to report them without retribution and confident that their reports will be acted upon. These reports can help organizations to identify "drift" and the "boundaries of safe operation". Amplifying 'weak signals' and self-auditing their safety management processes continue to be a challenge, but this will likely improve as companies gain more experience and their SMS matures.
Slide 23 - About The TSB
I would now like to talk a little bit about the Transportation Safety Board of Canada and how many of the concepts I have discussed so far have shaped the thinking of this organization.
The TSB is an independent government organization with a mandate to advance transportation safety by conducting investigations in the marine, rail, air and pipeline modes.
When the TSB investigates accidents, our mandate is not to assign fault or determine civil or criminal liability.
Slide 24 - About The TSB (cont'd)
In the early 1990s, the TSB adopted Reason's model of accident causation as a fundamental framework underlying its approach to accident investigation. Not only did it represent the notion of multicausality, but it clearly demonstrated that although investigators see human error as a last step in an accident sequence, in fact it takes place within a broader organizational context. From this perspective, human error is merely seen as a starting point for an investigation.
In the mid-1990s, the TSB formalized its accident investigation methodology, the Integrated Safety Investigation Methodology (ISIM). It incorporates all the modern concepts about accident causation and human and organizational factors, such as Reason's model, Snook's "practical drift", Vaughan's "normalization of deviance" and Dekker's work on the need to understand an accident from the perspective of the individual involved, in particular, understanding why the individual's actions made sense at the time given the specific operational and organizational context.
The ISIM process begins as soon as the TSB is notified of an accident. Investigators collect data and assess it to determine if a full investigation is warranted. This decision hinges on whether there is significant potential for the investigation to reduce future risks to people, property or the environment. The TSB takes the time necessary to conduct a thorough investigation of the safety deficiencies, causes and contributing factors to an accident. We do not lay any fault or blame. Using this methodology, we look beyond the immediate causes to find underlying failures in the system in which aircraft and humans operate, so that we can make recommendations to prevent a similar accident in the future.
Slide 25 - Summary
In closing, here is some of what I think the safety professional of the future needs to consider:
- That adverse outcomes arise from a complex interaction of organizational, technical and human performance factors that are difficult, if not impossible to predict, even with formal and sophisticated risk assessment processes;
- That people at all levels in the organization create safety - or not - and that organizations must "internalize" their approach to safety risk management so that goals, policies, processes, practices, communications and culture are consistent and integrated (admittedly much easier said than done!);
- That so called "near-miss" incidents must be viewed as "free opportunities" for organizational learning; that people will only report if they feel "safe" to do so, knowing that their input will be treated seriously, and they are empowered to offer their own suggestions for solutions;
Slide 26 - Summary
- That accident investigations are at best "constructions" of what happened and why, often based on incomplete data. Investigators must focus on why it made sense for those involved to do what they did at the time; refrain from using judgmental language; and try to avoid the trap of hindsight bias, micro-matching or cherry picking data with a world they now know to be true;
- That true "accountability" is about more than retribution (i.e. find and retrain, discipline or fire the "bad apples"). It is forward-looking and requires organizations and professions to take full responsibility to fix identified problems.
I encourage you to read widely but critically, to discuss/debate these issues with your colleagues and other subject matter experts, and to continually challenge the assumptions people make and the language they use about what promotes effective safety management or leads to accidents.
Slide 27 - Canada Wordmark
Thank you for your attention and I look forward to your questions.
- 30 -