Presentation to the Canadian Women in Aviation (CWIA) Conference
Kathy Fox, Member
Transportation Safety Board of Canada
Montreal, QC
June 17, 2011

Check against delivery

Slide 1: Title Page

Bonjour! Good morning!

It's a pleasure to be here. I believe this is the 4th CWIA conference I have attended and it's always great to reconnect with old friends and make new ones. As well, this is a fabulous opportunity to listen to a variety of speakers who can contribute to expanding your horizons, so I hope you take full advantage of the sessions offered.

The theme of this conference is "Building Towards the Unknown" and that's certainly applicable when we talk about managing safety risks.

Slide 2: Outline

Today, I would like to share with you some of my experiences and the lessons I've learned about managing safety risks during the various phases of my career, and how my own thinking has evolved about what needs to be done to reduce, if not eliminate, safety risks.

During my talk, I will briefly review some of the schools of thought about accident causation, and more importantly, prevention, and how that led to the introduction of safety management systems in the Canadian aviation industry.  In particular, I will outline some of the lessons I learned about organizational drift into failure, employee adaptations, hazard identification, incident reporting and safety measurement.

Slide 3: Early Thoughts on Safety

Like many of you who fly or work in aviation, my whole career has been about practicing safety.  As an air traffic controller, I was responsible for ensuring that aircraft didn't run into each other or into the ground, and providing pilots information necessary for flight safety, such as weather or runway conditions.  As a commercial pilot, I was responsible for transporting people safely from Point A to Point B.  As a flight instructor, I teach people the basics of safe flying so they can earn their pilot's license.  As a Pilot Examiner, I assess pilots' skills to make sure they meet Transport Canada licensing standards and can fly safely.  And in my current role as a Member of the Transportation Safety Board of Canada, I am responsible with the other Board members for analyzing the safety deficiencies found in accident investigations and making recommendations to regulators, operators and manufacturers on what needs to be done to reduce risks to transportation safety.

It may seem odd, but early in my career, I didn't often think about what the word "safety" really meant.  I was taught and believed that as long as you followed standard operating procedures, paid attention to what you were doing, didn't make any stupid mistakes or break any rules, and as long as equipment didn't fail, things would be "safe". 

Accidents happened to those who didn't follow the necessary steps.  This belief persisted when I went on to be the manager responsible for investigating Air Traffic Control losses of separation and other incidents where safety may have been compromised.   We often attributed the causes to "not following procedures" or "loss of situational awareness" without really looking deeper into the "whys".

Slide 4: Safety ≠ Zero Risk

When I became Director of Safety at NAV CANADA, the then-recently privatized Air Navigation Service provider, I was responsible for implementing and managing a corporate safety oversight program.  This involved developing a number of policies, processes and practices for identifying operational safety hazards, analyzing risks and reporting them to decision-makers who could take action on them.

This role caused a major turning point in my thinking about safety.  For the first time, I started to explicitly think and speak of risk management versus safety. 

I came to realize that safety does not mean zero risk, and that organizations must balance competing priorities and manage multiple risks: safety, economic, regulatory, financial, environmental and technological, to name a few.

Slide 5: Balancing Competing Priorities

While many safety critical organizations state that "safety is our first priority", there is convincing evidence to suggest that "customer service" or "return on shareholder investment" are really their top priorities.  However, products and services must be "safe" if companies want to stay in business, avoid accidents and costly litigation and maintain customer confidence and a competitive advantage.

Production pressures are often in conflict with safety goals.  Case studies and accident investigations often reveal how organizational factors such as the emphasis on productivity and cost control can result in trade-offs which inadvertently contribute to the circumstances leading to accidents.

Slide 6: Reason's Model

Some of you are familiar with Reason's model, more commonly known as the "Swiss cheese" model of accident causation.

I had the opportunity to meet Dr. James Reason at a safety conference and briefly discussed risk management concepts with him.  I found his work very compelling in terms of showing how a chain of events, including organizational factors, can converge to create a window of opportunity for an accident to occur.  Using his accident causation model, it all seemed logical how and why accidents happened. 

Reason's model was also useful for explaining to colleagues working in non-operational departments, such as Finance and Human Resources, how their policies and practices could inadvertently create the conditions leading to an accident. As such, safety wasn't only the responsibility of the Operations, Engineering or Maintenance departments.

However, Reason's model may appear somewhat simplistic insofar as it may lead people to believe that accidents occur as a linear sequence of events. The world is far too random for that.

Slide 7: Sidney Dekker

Some of my greatest insights came from studying under Dr. Sidney Dekker, then Professor of Human Factors and Flight Safety and Director of Research at the School of Aviation at Lund University in Sweden.  Professor Dekker maintains that:

  • People do their best to reconcile different goals simultaneously: for example, service or efficiency versus safety;
  • A system isn't automatically safe: people actually have to create safety through practice at all levels of the organization;
  • Production pressures influence people's trade-offs: this makes what was previously thought of as irregular or unsafe, normal or acceptable.

Slide 8: Sidney Dekker (cont'd)

He goes on to say that "human error is not a cause of failure.  Human error is the effect, or symptom, of deeper trouble.  Human error is not random.  It is systematically connected to features of people's tools, tasks and operating environment."

Given the constant need to reconcile competing goals and the uncertainties involved in assessing safety risks, how can managers recognize if or when they are drifting outside the boundaries of safe operation while they focus on their other priorities?

This is important because it is the organization that creates the operating environment, provides the tools, training and resources required to produce goods and services.  Hence, in my view, organizational commitment to and investment in safety must permeate the day-to-day operations if it's to be truly meaningful and effective.

Slide 9: Why Focus on Management?

Slide 10: Safety Management System (SMS)

Safety Management Systems were designed to help build safety into everything your company does.

SMS is generally defined as a formalized framework for integrating safety into an organization's daily operations, including the necessary organizational structures, accountabilities, policies and procedures. The concept of Safety Management Systems (SMS) evolved from the contemporary principles of high reliability organizations, strong safety culture and organizational resilience.

It originated in the chemical industry in the 1980s and has since evolved and been progressively adopted in other safety critical industries around the world. In particular, the International Civil Aviation Organization recognized that traditional approaches to safety management, based primarily on compliance with regulations, reactive responses following accidents and incidents, and a "blame and punish" or "blame and retrain" philosophy, were insufficient to reduce accident rates.

Slide 11: Safety Management System (SMS) (cont'd)

Since 2001, Transport Canada has been implementing requirements for SMS in the railway and marine sectors. The commercial aviation sector has gradually been implementing SMS since 2005. The requirements are listed here.

Some people misconstrue SMS as a form of deregulation or industry self-regulation. However, just as organizations rely on internal financial and HR management systems to manage their financial assets and human resources, SMS is a framework designed to enable companies to better manage their safety risks. This does not preclude the need for effective regulatory oversight.

Slide 12: Key Elements of SMS

The proactive identification of safety hazards is a key cornerstone of SMS.  It's really about using a structured process to think through what might go wrong – to find trouble before trouble finds you.

Identifying safety hazards is not without its difficulties. Before an accident, it can be quite challenging intellectually to try to identify all of the ways that things might go wrong.

Slide 13: SMS: Hazard Identification

Often, a company doesn't recognize changes in its operations or doesn't consider the impacts of equipment design factors, operator training/experience/workload or local adaptations. In describing why deteriorations in safety defences leading up to two accidents had not been detected and repaired, Reason suggested "the people involved had forgotten to be afraid...If eternal vigilance is the price of liberty, then chronic unease is the price of safety."

Slide 14: SMS: Hazard Identification (cont'd)

In an operational context, changes in procedure might inadvertently compromise safety.  Sidney Dekker asked thoughtfully "why do safe systems fail?"  One factor he identified was the "drift into failure".   This is about the slow and incremental movement of systems operations towards the edge of their safety envelope.  Pressures of scarcity and competition typically fuel this drift.  Without knowledge of where the safety boundaries actually are, people don't see the drift and thus don't do anything to stop it.

Let's look at some practical examples of this.

Slide 15: Alaska Airlines Flight 261

In 2000, Alaska Airlines Flight 261 crashed off the coast of California with the loss of all souls on board, when the in-flight failure of a jackscrew in the MD-80 aircraft trim system resulted in a loss of airplane pitch control. The thread failure was caused by excessive wear resulting from Alaska Airlines' insufficient lubrication of the jackscrew assembly. The National Transportation Safety Board (NTSB) investigation revealed that, over time, the lubrication interval for the failed jackscrew had been extended from once every 300-350 flight hours to once every 8 months, or approximately 2550 hours. The jackscrew recovered from the accident site revealed no evidence that there had been adequate lubrication at the previous interval, meaning it might have been more than 5000 hours since the assembly had last been greased. The question then becomes: what, if any, risk assessment was done at the time to assess the impact of changes to lubrication intervals?

Slide 16: MK Airlines

Here's another example, this one investigated by us at the Transportation Safety Board (TSB): On 14 October 2004, a Boeing 747 on an international cargo flight crashed during take-off when the crew inadvertently used the aircraft weight from a previous leg to calculate take-off performance data. This resulted in incorrect V speeds and a thrust setting too low to enable the aircraft to take off safely given its actual weight. The TSB's investigation (A04H0004) showed that crew fatigue, in combination with the dark take-off environment, likely increased the probability of errors in calculating take-off performance data and degraded the flight crew's ability to detect the errors.

The company was experiencing significant growth and a shortage of flight crews. During the previous four years, the company had gradually increased the maximum allowable duty period from 20 hours (with a maximum of 16 flight hours) to 24 hours (with a maximum of 18 flight hours). Originally, the crew complement consisted of two captains, two co-pilots and two flight engineers but was then revised to include three pilots and two flight engineers. At the time of the accident, the flight crew had been on duty for almost 19 hours and, due to earlier delays, would likely have been on duty for approximately 30 hours at their final destination had the remaining flights continued uneventfully. The Crewing Department routinely scheduled flights in excess of the 24-hour limit. This routine non-adherence to the Operations Manual contributed to an environment where some employees and company management felt that it was acceptable to deviate from company policy and/or procedures when it was considered necessary to complete a flight or a series of flights.

Slide 17: Organizational Drift / Employee Adaptations

Organizational drift usually isn't visible from inside the organization because incremental changes are always occurring. It often becomes visible to outsiders only after an adverse outcome (such as an accident), and then often thanks primarily to the benefits of hindsight.

Drift can also occur at an operation's front lines. In the context of limited resources, time pressures and multiple goals, workers often create "locally efficient practices" to get the job done.  Accident investigation reports sometimes describe these as "violations" or "deviations from SOPs". But let's look at this in a different light. Dekker says: "Emphasis on local efficiency or cost-effectiveness pushes operational people to achieve or prioritize one goal or a limited set of goals... (that are) easily measurable … whereas it is much more difficult to measure how much is borrowed from safety. Past success is taken as a guarantee of future safety. Each operational success achieved at incremental distances from the formal, original rules can establish a new norm….Departures from the routine become routine…violations become compliant behaviour".

Understanding the context-specific reasons for the gap between written procedures and real practices will help organizations better understand this natural phenomenon and allow for more effective interventions beyond simply telling the workers to "Follow the rules!" or to "Be more careful!"

Slide 18: Touchdown Short of Runway

A practical example of this was revealed by the TSB investigation into why a Global 5000 business jet touched down 7 feet short of the runway in Fox Harbour, Nova Scotia in 2007. During the investigation (A07A0134) the TSB learned that the operator endorsed a practice whereby flight crews would "duck" under visual glide slope indicator systems to land as close as possible to the beginning of the relatively short runway.

Slide 19: Aircraft Attitude at Threshold

The crew had previously flown into this airport in a Challenger 604 and was still adjusting to the new, larger aircraft. They weren't aware of the Global's eye-to-wheel height or of the fact that the VGSI in use was not suitable for that type. This resulted in the aircraft not meeting the manufacturer's recommendation of crossing the threshold at a height of 50 feet.

Furthermore, the company hadn't conducted a thorough risk assessment of the implications of operating a new and larger aircraft, as was evidenced by the fact that they simply transferred most of their operating procedures from the CL604 over to the Global 5000.

Slide 20: SMS: Incident Reporting

Traditionally, incidents have been defined as events resulting in adverse outcomes. By defining incidents too narrowly, an organization risks losing information about events that could indicate potential system vulnerabilities, such as evidence of drift into failure or the normalization of deviance.

In the Canadian Air Navigation System, reporting covers not only actual losses of separation (those instances where minimum spacing between aircraft was not achieved) but also those instances where spacing was achieved but not assured. This generates much richer data for analysis of system vulnerabilities.

Slide 21: Weak Signals

By their nature, "weak signals" may not be sufficient to attract the attention of busy managers, who often suffer from information overload while juggling many competing priorities under significant time pressures. In several accidents, early warning signs of a hazardous situation were either not recognized or not effectively addressed.

In 2007 a medical evacuation flight crashed in Sandy Bay, Saskatchewan, killing the pilot in command. The subsequent TSB investigation (A07C0001) found that the crew of two pilots was unable to work effectively as a team to avoid, trap, or mitigate errors and safely manage the risks associated with the flight. As our lead investigator at the time put it, "This crew did not employ basic strategies that could have helped prevent the chain of events leading to this accident." This lack of coordination can be attributed in part to the fact that the crew had not received crew resource management (CRM) training.

Previously, there had been numerous "crew pairing issues" with respect to this particular crew. The company's management knew about this, although they were unaware of the extent to which these factors could impair effective crew coordination.

In fairness, it's hard to have as much context as you'd like, sometimes—and it's almost impossible to know the full extent and implication of any single event. For instance, are one or two reported problems just that—isolated conflicts or hazards—or are they evidence, warning signs of a dangerous trend?

Slide 22: SMS: Incident Reporting

However, no matter how much information you have, if data on "near misses" is not analyzed properly, organizations lose opportunities to learn how to prevent future incidents. Merely counting errors doesn't necessarily generate meaningful or relevant safety data, and measuring performance based solely on error trends can be misleading, as the absence of error and incidents does not imply an absence of risk. Moreover, many organizations simply have limited resources available to analyze reports, keep track of deficiencies, and identify patterns. Sometimes, much to our dismay, the one person who is best positioned to identify and act on a known hazard is stretched too thin, or focused on other priorities.

Slide 23: SMS: Organizational Culture

Slide 24: SMS: Accountability

Slide 25: Elements of a "Just Culture"

Slide 26: SMS: Benefits and Pitfalls

Slide 27: About the TSB

Slide 28: About the TSB (cont'd)

Slide 29: Summary

Slide 30: Summary

Slide 31: References

Slide 32: END