Customer Experience Lessons from the Akamai Outage

Leigh Finch

On Thursday, 17 June 2021, a small configuration error by a usually ‘invisible’ cloud service provider had a massive impact on some of the world’s leading businesses. The Reserve Bank of Australia and three of the Big Four banks were severely affected, along with Australia Post and Virgin Australia. Online services halted, staff couldn’t access the internet, contact centers went down and planes couldn’t take off, degrading the end-user experience and damaging brand reputation with customers.

What happened?

Big brands are constant targets for a range of ideological, political, commercial or sheer criminal reasons. They must remain proactive against persistent cyber threats, including Distributed Denial of Service (DDoS) attacks originating from anywhere in the world. DDoS scrubbing is a powerful form of defense, and Prolexic from US-based global content delivery network (CDN) Akamai is a leading choice.

Prolexic monitors traffic entering large networks, such as web queries or mobile app requests, and determines whether it is valid or malicious. Valid traffic is forwarded to the network of the bank, airline or other business. Traffic that is not considered valid is blocked.

Unfortunately, an erroneous value in a routing table caused a failure in Prolexic that affected around 500 organizations globally. Some were rerouted automatically, while others had to be rerouted manually.

All up, it took between 30 and 120 minutes for services to be restored, causing widespread angst and frustration for the customers of affected brands. All-points apologies via social media were reputation-damaging. “We’re aware some of you are experiencing difficulties accessing our services and we’re urgently investigating,” tweeted CBA. “We’ll be back soon… We are currently experiencing a system outage which is impacting our website and Guest Contact Centre,” said Virgin Australia. For some consumers, it might even have been the last straw, causing them to switch providers.

How would Unified NPM have helped?

Customers with Riverbed’s Unified Network Performance Monitoring (NPM) platform have the advantage of visibility in both directions: up and down. The cause of the fault would have been quickly located outside the network, because no traffic would have been detected in the GRE tunnel. In other words, “Everything’s fine, but there’s no load!” That would have sped remediation, which could have been as simple as turning off the Akamai DDoS scrubber or switching over to another one.
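The “everything’s fine, but there’s no load” reasoning can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Riverbed’s implementation or API: the `Snapshot` fields, the `locate_fault` function and the 5% floor are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One monitoring observation (hypothetical fields, for illustration only)."""
    tunnel_bps: float         # traffic currently arriving over the GRE tunnel
    expected_bps: float       # traffic normally seen at this time of day
    internal_checks_ok: bool  # internal devices and applications report healthy


def locate_fault(snapshot: Snapshot, floor_ratio: float = 0.05) -> str:
    """Return 'upstream', 'internal' or 'none' for a single observation.

    floor_ratio is an assumed threshold: tunnel traffic below this fraction
    of the expected level is treated as "no load".
    """
    starved = snapshot.tunnel_bps < snapshot.expected_bps * floor_ratio
    if not snapshot.internal_checks_ok:
        return "internal"
    if starved:
        # Everything inside is healthy, but almost nothing is arriving:
        # the fault most likely sits with the upstream provider.
        return "upstream"
    return "none"


if __name__ == "__main__":
    outage = Snapshot(tunnel_bps=0.0, expected_bps=1e9, internal_checks_ok=True)
    print(locate_fault(outage))
```

A result of "upstream" is the cue to bypass the scrubbing service or fail over to an alternative, rather than to keep troubleshooting internal systems.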

Unified NPM is able to protect customer experience by monitoring all key metrics—packets, flows and device data—all of the time. This gives you end-to-end visibility to:

  1. Understand what normal looks like. How much traffic should we be expecting? Where is that traffic coming from, or not coming from?
  2. Baseline the traffic using passive (packets/flows) and active (synthetic) monitoring.
  3. Alert on KPI deviations to help isolate the problem.
  4. Implement a mitigation or business continuity strategy.
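Steps 1 to 3 can be illustrated with a minimal Python sketch: build a baseline from recent traffic samples, then flag any reading that deviates sharply from it. The function names, the sample values and the z-score threshold of 3 are assumptions made for this example. A production NPM platform would use richer baselines, such as time-of-day and seasonality.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize 'normal' traffic as (mean, standard deviation)."""
    return mean(samples), stdev(samples)


def deviation_alert(current: float, baseline: tuple[float, float],
                    z_threshold: float = 3.0) -> bool:
    """Return True when the current reading is far from the baseline.

    z_threshold is an assumed value; tune it to the KPI being watched.
    """
    mu, sigma = baseline
    if sigma == 0:
        # Perfectly flat history: any change at all is a deviation.
        return current != mu
    return abs(current - mu) / sigma >= z_threshold


if __name__ == "__main__":
    # Hypothetical inbound traffic samples in Mbps.
    history = [980.0, 1010.0, 1005.0, 995.0, 1020.0, 990.0]
    baseline = build_baseline(history)
    for reading in (1000.0, 0.0):
        print(reading, "ALERT" if deviation_alert(reading, baseline) else "ok")
```

With this history, a reading close to 1000 Mbps passes quietly, while a sudden drop to zero raises an alert, which is the signal to move to step 4.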

This level of granularity provides NetOps and SecOps teams with quantitative, empirical evidence of precisely where faults lie, so they can be remediated fast. If, as in the Prolexic case, the fault lies beyond the network, the responsible service provider can be alerted and its services diverted or switched off.

Unified NPM also provides valuable forensic information after an event. Once systems are up and running again, you have solid evidence to use in the development of mitigation tactics internally between teams and with your external service providers—with the objective of avoiding such outages in the future.

What have we learned?

The Akamai incident highlights the importance of redundancy for an essential service like DDoS scrubbing and a ready-to-go mitigation strategy. Once network and applications teams worked out that Akamai was the problem, they could have switched to an internal DDoS scrubber. In fact, many organizations principally use these less costly options and only switch to cloud providers like Akamai and Fastly when they are overwhelmed by a high level of incoming threats.

Network, application and security engineers could have been saved extended, high-intensity troubleshooting on a Thursday afternoon if they had been able to pinpoint the fault in minutes rather than hours. Most importantly, faster recovery would have meant fewer consumers suffering a poor customer experience.

If you’d like to know more about Network Performance Monitoring, our recent webinar The Art of Troubleshooting is Back! is now available on-demand.