Software Engineering Lessons From The Flight Deck

I’ve tech-led teams at Facebook, WhatsApp, and Twitter. I’m also a licensed private pilot, and I’ve found that many lessons from flying can benefit software engineering teams. In this post, I’d like to share some of these lessons while also giving you a glimpse into aviation.

One of the important things I researched before starting flight training was: how dangerous is flying? An obvious yardstick for comparison was automobiles. It turns out that aviation overall is the safest mode of transport, with about sixty times fewer fatal accidents per billion miles than cars¹! General aviation (GA), the category that includes small planes like the ones I would fly, is about nineteen times more dangerous than driving, but also 25% safer than riding motorcycles².

GA being nineteen times more dangerous than cars might seem like a lot. But flying is an inherently far more complex undertaking than driving. While flying, you are soaring thousands of feet above the ground, and you can’t just pull over if there’s a problem; you have to keep going. Yet it is safer than motorcycles and only nineteen times worse than cars. I found this to be a remarkably good safety record. Over the course of my training, I’ve seen first-hand what makes aviation so safe. There are excellent lessons in safety, robustness, and risk management that I’ve found translate very well to software engineering. I’d like to describe four of these lessons that will help you build better software without slowing you down.

1. Respect the craft

There are certain elements of piloting that are not as exciting as the rest. Before even sitting in the aircraft, pilots go through a thorough inspection of its exterior, called the pre-flight. In my flying club, as part of the pre-flight, student pilots were required to clean the windshield. I would sometimes skimp on it when I felt it was clean enough. My instructor never failed to remind me: a clean windshield helps us spot other airplanes more easily, and hence leads to a safer flight. One day he said, in a very matter-of-fact tone, “I’ve found that students who do a better job of cleaning the windshield turn out to be better pilots.” At the time, I didn’t make much of it. But later that day, as I was replaying the flight lesson in my mind, it clicked: cleaning the windshield is a powerful illustration of the ethos of the modern-day pilot.

Piloting is no longer the swashbuckling undertaking it was in the early days of aviation. WWI took place only a decade after the Wright brothers completed the world’s first successful powered flight. At that time, pilots were seen as larger-than-life heroes who used their superior reflexes and spatial-orientation skills to complete complex missions. A lot of that emphasis has since shifted towards maintaining operational excellence, following standard procedures, and doing the small things right. Frank Borman, the commander of Apollo 8, helped spread this message with a clever quote.

“A superior pilot uses his superior judgement to avoid getting into situations which require the use of his superior skills.”

The windshield-cleaning metaphor applies aptly to software engineering. When one of my mentors at work was leaving the company, I asked him to share any parting wisdom on how to build the best software engineering team. He said, “Everybody knows what the best practices in software engineering are. But nobody wants to spend the time to follow them. Follow the best engineering practices and you will become the best team.” In other words, just as in aviation, it’s important to spend time on some of the unappealing aspects of software engineering. This includes cleaning up unused code; writing specs for features and keeping the specs up to date; ensuring that your build & deploy is a single step, or ideally continuous; having good test coverage; making your internal monitoring & debugging tools as high-quality as your external-facing code; and much more. For folks looking for actionable pointers, I highly recommend the book The Pragmatic Programmer and some of Joel’s writing.

2. Communicate clearly using shared vocabulary

Nothing makes you feel more like a pilot than talking to air traffic control on the radio. To a layperson, “aviation-speak” feels like a different language altogether, because of all the lingo and abbreviated phrasing. For pilots, that’s probably part of the charm. Beneath the surface, however, the goal of radio communication rules is not to sound cool, but to communicate with brevity while being as accurate as possible. Brevity is important because all aircraft in the same area are typically talking on the same frequency channel; being long-winded takes valuable on-air time away from other pilots. Even more important than brevity, though, is accuracy, so as to avoid any miscommunication. Everything in radio communication revolves around these goals. For example:

  • The NATO phonetic alphabet (Alpha, Bravo, Charlie, and so on) is used to distinguish similar-sounding letters.

  • Numbers (like a wind direction of 320°) are always pronounced as individual digits (“three two zero”) to avoid miscommunication in case the audio cuts out at an inopportune moment while saying, for example, “three hundred and twenty”.

  • The digit 9 is pronounced niner instead of nine to differentiate it from the German word nein, which means no.

These are interesting examples. But what I find particularly elegant and transferable to software engineering is aviation’s use of specific phraseology to mean very specific things. When the controller says “hold short of {position}”, the instruction is crystal clear. There is no confusion about whether you need to stop before the position, at the position, or right after it. “Climb and maintain {new altitude}” very tersely conveys that you need to climb and then stay at that altitude; it leaves no room for interpretation about what to do once you have reached the new altitude. Similarly, when a pilot says “Mayday, mayday, mayday”, the receiver instantly knows there’s an emergency, without having to guess the severity of the situation.

I’ve found a need for shared vocabulary in software engineering teams as well. During team syncs, team members often ask questions like “is item X complete?” or “how long will it take to do Y?”. This opens up room for miscommunication because of differing definitions of complete. One person’s definition of complete might be that the basic functionality is done, whereas another person’s might be that the product is thoroughly tested and all the edge cases (including error retries) are covered. You can imagine that these two people will have very different answers to the above questions. This can be solved with a shared vocabulary. For instance, my teammate proposed a Levels of Polish framework to objectively describe the different levels of complete.

  • Level 1 Functional: User can complete the basic task. For example— User can post a picture on the app.

  • Level 2 Polished: User finds the experience polished. For example— When the user posts a picture, they see a progress indicator. Once the post is successfully uploaded, they receive both visual and haptic feedback.

  • Level 3 Delightful: User’s experience is elevated to the next level. For example— The user finds the upload speed very fast because the picture is “optimistically uploaded”. All corner cases are handled; network errors result in automatic retries; and if there’s no network, the post is saved as a draft, so the user can come back to it later.

It doesn’t matter what exact leveling system you use. But having a shared framework will help the team discuss what level of polish is required for individual features, and how much time it will take to get that done.
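As a sketch of how a team might encode such a framework, here is a minimal example in Python. All the names and types here are my own illustration, not something from the post or any real codebase.

```python
from dataclasses import dataclass
from enum import IntEnum


class Polish(IntEnum):
    """Hypothetical 'Levels of Polish' scale from the framework above."""
    FUNCTIONAL = 1   # user can complete the basic task
    POLISHED = 2     # progress indicators, visual/haptic feedback
    DELIGHTFUL = 3   # optimistic upload, retries, offline drafts


@dataclass
class Feature:
    """One feature, tagged with its current and target polish level."""
    name: str
    current: Polish
    target: Polish

    def gap(self) -> int:
        # How many levels remain before the feature hits its target.
        return max(0, self.target - self.current)


photo_upload = Feature("photo upload", current=Polish.FUNCTIONAL,
                       target=Polish.DELIGHTFUL)
print(photo_upload.gap())  # → 2
```

Tagging features this way makes the “how complete is X?” conversation concrete: the team debates the target level per feature, and the remaining gap is unambiguous.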

3. Foster blameless post-mortem culture

Every reported aviation incident, big or small, is thoroughly investigated by the national safety board of the country where it happened. I once attended a flight safety seminar in which the speaker discussed the infamous Tenerife Airport Disaster of 1977, where two Boeing 747 passenger jets collided on a foggy runway, resulting in 583 fatalities. In summary, the KLM jet began its takeoff after mistaking air traffic control’s (ATC) routing instructions for a takeoff clearance. The other jet, operated by Pan Am, was still on the same runway, just about to exit! After telling the story, the speaker paused to ask the attendees, “Whose fault do you think it was?” Was it the KLM captain’s fault for misinterpreting ATC’s instructions? Was it the Pan Am crew’s fault for missing their taxiway exit a few minutes earlier, which would have kept them off the runway? Was it ATC’s fault for relaying instructions in a strong accent, and for letting both jets taxi on the same runway so close to each other in the first place? The fact is, there was a chain of events that led to the disaster, and breaking any single link in that chain would have avoided the whole nightmare altogether.

The aviation industry follows a blameless culture, where the focus of the investigation is on identifying all the contributing factors that led to the incident and recommending changes to prevent entire classes of problems from happening again. For this reason, the author Nassim Taleb cites the aviation industry as an example of an antifragile system: a system that is not just resilient to shocks, but actually gets better with them.

This isn’t completely new to software engineering, but it’s worth reiterating. A good software engineering team must use every incident (bug, outage, even a near-miss) as a learning opportunity. There should be a detailed post-mortem of each incident, with a focus on preventing the same type of incident from happening again. A very useful framework for achieving this is DERP, which stands for detection, escalation, remediation, and prevention.

  • Detection: How could the issue have been detected faster (alarms, dashboards, user reports)?

  • Escalation: How could the right people have gotten involved quickly?

  • Remediation: What steps can be taken to fix the issue if it happens again? Can these steps be automated?

  • Prevention: What improvements could remove the risk of this type of failure happening again? How could you have failed gracefully?
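One way to make the four DERP sections hard to skip is to bake them into the post-mortem template itself. Here is a minimal sketch in Python; the class and field names are my own invention, purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PostMortem:
    """Hypothetical DERP-style post-mortem record."""
    title: str
    detection: list = field(default_factory=list)    # faster alarms, dashboards
    escalation: list = field(default_factory=list)   # getting the right people in
    remediation: list = field(default_factory=list)  # fix-it steps, ideally automated
    prevention: list = field(default_factory=list)   # remove the class of failure

    def is_complete(self) -> bool:
        # A review isn't done until every DERP section has at least one action item.
        return all([self.detection, self.escalation,
                    self.remediation, self.prevention])


pm = PostMortem(title="Photo upload outage")
pm.detection.append("Alert when upload error rate exceeds 1%")
print(pm.is_complete())  # → False: three sections are still empty
```

The point of the structure is social, not technical: an incomplete record is visible at a glance, so the review doesn’t end at “we fixed it” without a prevention item.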

4. Have clear owners

The Federal Aviation Regulations (FAR) are rules, prescribed by the FAA, that govern all aviation activity in the United States, from small gliders to jumbo jets. They are a long and dry read, as most legal documents are. However, there’s one regulation that I find particularly powerful.

14 CFR Part 91.3 - Responsibility and authority of the pilot in command

(a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.

(b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.

That’s a lot of power vested in the pilot in command! Not only does the pilot have full control of the aircraft, they also have advance permission to break any rule in the book if required to handle an emergency. Why does the FAA do this? The regulation is based on the belief that nobody has a better assessment of an in-flight situation than the pilot themself. Having complete authority helps the pilot respond faster to emergencies. It also cuts down on ambiguity and on diffusion of responsibility, the psychological phenomenon where people are less likely to take responsibility when other people are present.

Clear ownership is equally important in software engineering. On the surface, this piece of advice often comes off as trivial and already well understood. I felt the same way until our team decided to follow a formal framework for quantifying and improving the software and operational health of the team. As part of this framework, we filled out a detailed discovery sheet, listing every “asset” (a piece of functionality or code) the team owned, along with the owners, test coverage, bug list, alerts, and metrics for each one. We found that most assets had clear owners, as we had suspected. However, it was eye-opening to see that a few assets lying on the “boundaries” between our team and a partner team had unclear owners: either our team felt the other team was responsible for them, or vice versa. It was also interesting to find that the assets with unclear owners caused most of the bugs and inefficiencies. Things improved drastically when we worked with our partner teams to clarify ownership and SLAs.
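A discovery sheet like the one described can be as simple as a list of records with an “owner” column, plus a query that surfaces the unowned rows. The sketch below is illustrative only; the field names are assumptions, not the actual framework our team used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Asset:
    """One row of a hypothetical ownership discovery sheet."""
    name: str
    owner: Optional[str]   # team on the hook for bugs, alerts, and SLAs
    test_coverage: float   # fraction of the asset covered by tests, 0.0-1.0


def unowned(assets: List[Asset]) -> List[str]:
    # Surface the assets nobody has clearly claimed — in our experience,
    # these boundary assets are where most bugs hide.
    return [a.name for a in assets if not a.owner]


sheet = [
    Asset("upload pipeline", owner="media-team", test_coverage=0.8),
    Asset("thumbnail cache", owner=None, test_coverage=0.3),
]
print(unowned(sheet))  # → ['thumbnail cache']
```

The exercise is cheap to run and the output is actionable: every name it prints is a conversation to have with a partner team.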

I hope these lessons resonate with you and help you build better software. In aviation, the consequences of making mistakes are very clear: people die. This skin in the game has propelled aviation to lead the way in safety, robustness, and better engineering. In most software companies, the stakes don’t feel as high, and often there are good reasons to move fast and break things™. However, it’s important to realize that software is eating the world, and a lot of people rely more and more on software for critical things in their lives, especially post-COVID. When your video chat software has a bug, that’s no longer a mere annoyance. I hope this empathy for your customers pushes you to take pride in the craft of building software and to strive for excellence.