Aug
12

Downtime Disasters. When Airlines Computer Systems Crash And Burn.

Delta data center downtime caused chaos in US airports this week, with hundreds of flights canceled and others severely delayed. Delta systems were down for roughly six hours due to a power outage at the company headquarters in Atlanta. While the Delta CEO has apologized, and the company has issued compensation for affected passengers, the incident shows how vital the backup and recovery process is to ensuring business continuity in any industry. For airlines, the complexity of the systems and sometimes the age of the technology they use can cause major issues.

Airline Computer Outages Show The Importance of Business Continuity

Fields like transportation are particularly vulnerable, as the cost can be high for even one hour of downtime. The damage translates into lost revenue, fuming customers and harm to the company's brand and reputation. From a consumer perspective, we can only empathize and hope it doesn’t happen to us. The disruption ruins holiday plans, cancels family reunions, causes long lines in airports, and creates confusion and exhaustion for employees and travelers alike.

The power outage at Delta's mission control in Atlanta happened around 2:40 a.m. and impacted computer systems and flight operations worldwide. Huge lines formed at the check-in gates. Agents wrote out boarding passes with pen and paper. The company issued warnings of severe delays and flight cancellations.

Passengers soon found themselves stranded in airports, with screens incorrectly showing flights on time. What’s more, travelers were not able to avoid the chaos, as the flight status update system wasn’t working either. Passengers only learned about flight problems when they arrived at the airport, instead of being able to stay home and wait for a resolution.

It took six hours to get systems back up and running. On Monday, the company canceled 870 flights (over 12% of its daily traffic).

Delta Disaster: When Backup Systems Fail To Take Over

Quoted in The Wall Street Journal, the responding electric company said the incident was triggered by failed switchgear – a device that routes and distributes power. Ideally, Delta’s backup systems should have taken over, but that didn’t happen. The quoted electric company said, “Following the power loss, some critical systems and network equipment didn’t switch over to Delta’s backup systems.” They added that recovery efforts continued well into the evening.

The following day brought even more misery to passengers, with a huge backlog of delays and cancellations. The ripple effects from the disruption were expected to take a few days to calm. Travelers were urged to check their flight status on the company website, where they could also rebook their flights. By Wednesday morning (day 3), 150 flights were still canceled and the company announced operations would only be back to normal in the late afternoon.

While Delta engineers are investigating the issue, it may take months before a cause is found. This speaks volumes about the importance of backup plans and the ability to recover quickly after a system-wide outage.

Downtime = Damage In the $Millions For Business

Delta is not the first airline to be hit by a computer system disaster. In July this year, a lone router failed at Love Field data centers in Dallas. The hardware malfunction crippled hundreds of software applications belonging to Southwest Airlines, reported Dallas News.

CEO Gary Kelly described the incident as a “once-in-a-thousand-year-flood”. He added that a partial failure didn’t allow the system to trigger a backup recovery process. It sounds odd that recovery didn’t trigger unless the component gave way 100%, but it shows how an unusual failure like that can make it hard to avoid disaster.

Both United Airlines and American Airlines had computer problems in the summer of 2015. They fixed the problems within one day, but in their industry, even short outages cause mayhem and long lines in airports.

In October last year, Southwest kept its passengers waiting yet again on another computer systems fix. The glitch affected the main website at Southwest.com, the mobile app, and all of its call centers. Over 800 Southwest Airlines flights were canceled, and again, gate agents rushed to check in passengers with pen and paper because the automated kiosks were down. And while Southwest Airlines managed to recover in about a day, the downtime cost them between $5 million and $10 million.

That is one expensive glitch.

Can You Afford Downtime? Learn More About How You Can Calculate Downtime Losses

Unplanned Data Center Outages Cost a Pretty Penny

And the costs are not about to get any lower. Data from the Ponemon Institute quoted by Emerson Network Power shows that the average cost of a data center outage has steadily increased from a little over $690,000 in 2013 to over $740,000 in 2016 (an increase of roughly 7 percent).

Source: Ponemon Institute Data for Unplanned Data Center Outages – 2016

The data shows exactly how costly an unplanned outage can be. The following summary shows the cost breakdown for one unplanned outage incident for organizations that took part in the study:

  • Third parties – $9,927;
  • Equipment – $9,478;
  • Ex-post activities – $8,428;
  • Recovery – $21,177;
  • Detection – $26,712;
  • IT productivity – $61,880;
  • End-user productivity – $138,193;
  • Lost revenue – $208,599;
  • Business disruption – $255,963.
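
As a quick check on these figures, the components can be added up and expressed as shares of the total. The following is a minimal Python sketch rather than anything from the study itself; the names `COSTS` and `cost_shares` are made up for illustration, and only the dollar amounts come from the list above.

```python
# Per-incident cost components as quoted above (Ponemon Institute, 2016), in US dollars.
COSTS = {
    "Third parties": 9_927,
    "Equipment": 9_478,
    "Ex-post activities": 8_428,
    "Recovery": 21_177,
    "Detection": 26_712,
    "IT productivity": 61_880,
    "End-user productivity": 138_193,
    "Lost revenue": 208_599,
    "Business disruption": 255_963,
}


def cost_shares(costs):
    """Return the total cost and each category's share of it, in percent."""
    total = sum(costs.values())
    shares = {name: round(100 * value / total, 1) for name, value in costs.items()}
    return total, shares


total, shares = cost_shares(COSTS)
print(f"Total per incident: ${total:,}")

# List the categories from largest to smallest share.
for name, pct in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {pct:>5.1f}%")
```

The components add up to $740,357, which lines up with the "over $740,000" average quoted earlier. Business disruption and lost revenue together make up well over half of that total, while the equipment itself is one of the smallest items.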


Data Center Downtime: Human Error And Hardware Failure Are Top Causes

Understanding the causes of unplanned outages is a key piece in the business continuity puzzle. The statistics show that hardware failure, cyber attacks and human error are behind most incidents. According to the Ponemon Institute study, the causes of outages in 2016 were as follows:

  • UPS system failure – 25 percent;
  • Cyber attack (DDoS) – 22 percent;
  • Accidents/human error – 22 percent;
  • Water, heat or CRAC failure – 11 percent;
  • Weather related – 10 percent;
  • Generator failure – 6 percent;
  • IT equipment failure – 4 percent.

Weather, water and heat together account for roughly one in five incidents (21 percent of the total above), and events like these can affect an entire site at once. This is why it’s a good idea to keep backups in multiple locations and test them often.
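
To see how the causes stack up when similar ones are combined, the percentages can be grouped into broader buckets. The sketch below is illustrative only: the grouping in `GROUPS` is an assumption made for this example (the study does not define these buckets), and `group_shares` is a hypothetical helper name.

```python
# Root causes of unplanned outages as quoted above (Ponemon Institute, 2016), in percent.
CAUSES = {
    "UPS system failure": 25,
    "Cyber attack (DDoS)": 22,
    "Accidents/human error": 22,
    "Water, heat or CRAC failure": 11,
    "Weather related": 10,
    "Generator failure": 6,
    "IT equipment failure": 4,
}

# ASSUMPTION: this grouping is chosen for illustration and is not part of the study.
GROUPS = {
    "Power and IT hardware": ["UPS system failure", "Generator failure", "IT equipment failure"],
    "Environmental": ["Water, heat or CRAC failure", "Weather related"],
    "Cyber attack": ["Cyber attack (DDoS)"],
    "Human error": ["Accidents/human error"],
}


def group_shares(causes, groups):
    """Sum the percentage of each cause that belongs to a group."""
    return {group: sum(causes[name] for name in names) for group, names in groups.items()}


grouped = group_shares(CAUSES, GROUPS)
for group, pct in sorted(grouped.items(), key=lambda item: item[1], reverse=True):
    print(f"{group:<22} {pct:>3} percent")
```

Under this grouping, power and IT hardware failures come out on top, with human error and cyber attacks close behind and environmental causes not far off. A different grouping would shift the numbers, so treat the buckets as a way to read the list rather than as a finding.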

What Can We Learn From The Airline Outages?

Engineers can learn from these incidents by ensuring that backup and recovery plans are kept up to date and thoroughly tested. They can implement automated verification steps and have fail-safe manual recovery options in place. Moreover, they can perform emergency drills to test the backup system’s ability to meet their recovery point and recovery time objectives. Businesses must ensure they have skilled IT resources ready to respond to a failure event. IT needs to restore quickly from a recent backup, to make sure sales and production don’t stop when systems are affected.
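
One way to score an emergency drill against recovery point and recovery time objectives is sketched below in Python. This is a simplified illustration, not a feature of any particular backup product: `DrillResult` and `evaluate_drill` are hypothetical names, and the timestamps and objectives are invented for the example (the six-hour recovery loosely echoes the outage described above).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    last_backup: datetime    # most recent backup that could be restored
    failure_time: datetime   # simulated moment of failure
    restored_time: datetime  # moment systems were confirmed working again


def evaluate_drill(result, rpo, rto):
    """Compare a drill's data-loss window and recovery duration with the objectives."""
    data_loss_window = result.failure_time - result.last_backup
    recovery_duration = result.restored_time - result.failure_time
    return {
        "data_loss_window": data_loss_window,
        "recovery_duration": recovery_duration,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": recovery_duration <= rto,
    }


# Hypothetical drill: backup 40 minutes before the failure, recovery after six hours.
drill = DrillResult(
    last_backup=datetime(2016, 1, 1, 2, 0),
    failure_time=datetime(2016, 1, 1, 2, 40),
    restored_time=datetime(2016, 1, 1, 8, 40),
)

# Assumed objectives: lose at most one hour of data, recover within four hours.
report = evaluate_drill(drill, rpo=timedelta(hours=1), rto=timedelta(hours=4))
for key, value in report.items():
    print(f"{key}: {value}")
```

In this made-up drill the recovery point objective is met, because the last backup was only 40 minutes old, but the recovery time objective is missed, because restoring took six hours against a four-hour target. Running a check like this after every drill makes it obvious whether the plan still meets its targets.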

Better safe than sorry, they say – so make sure your business continuity plan passes the disaster test with flying colors. Backup & disaster recovery software like the StorageCraft Recovery Solution can ensure data protection and peace of mind whether you are using Windows, Linux or hybrid IT environments in your organization.

So keep safe and always, always back up your data.