Tech Roundup: Are large-scale outages more common?

Gavin
3 min read

On October 20th, AWS went down and tanked about half the internet. This week, the same situation unfolded with another tech giant: Cloudflare. With most of the internet's traffic passing through a handful of providers, it feels like large-scale outages are happening more and more. The internet was built on the principle of security through distribution, but it now seems like we are in a vulnerable position.

We are sorry for the impact to our customers and to the Internet in general.

What is Cloudflare?

Cloudflare provides a range of delivery and cybersecurity services to online platforms. It offers a great deal of reliability by routing traffic through a global network, which also gives clients a security layer to block threats. That makes it essential for many platforms, but as a traffic router it also serves as a linchpin for much of the internet.

Why did Cloudflare go down?

Cloudflare have written a candid, comprehensive account of why their systems went down at 11:20 am UTC on November 18th. The cause has been identified as a change to configuration settings that had an unexpected impact. As a result, the Cloudflare network fell over, millions upon millions of 500 errors (server failures) were sent to dependent systems worldwide, and thousands of online systems were brought to a screeching halt.

500 errors logged by Cloudflare

(Source: Cloudflare)

The team initially had no idea what caused this outage, and at first thought it looked like a DDoS attack. DDoS stands for distributed denial-of-service, and it is a common form of cyber attack. It is a simple and effective method of bringing a system offline by flooding it with more requests than it can handle. Once the team had ruled this out and figured out the core issue, they were able to bring systems back online around 14:30 UTC. "We are really sorry for the impact to our customers and to the Internet in general," they wrote in the above-mentioned post-mortem.

What was the impact?

X, Spotify and ChatGPT were among the high-profile platforms affected by this outage, as they are among the many online tools that rely on Cloudflare for security. According to Down Detector, a community-led outage tracker, more than 10,000 people reported service issues before, ironically, Down Detector itself fell silent.

Just like on October 20th, when AWS was at the centre of the outage (and which was also affected by this week's outage), the error has had a wide and meaningful impact and has disrupted the lives of many. This raises the question: has the internet become less resilient?

The original distributed nature of the internet has evaporated and we are now on a far more centralised system, with most of the world's traffic funnelling through a few key players. Not all of the internet is cloud, but for the majority of services that are cloud hosted, here is a distribution of the providers.

Source: Statista

The chart shows that over 60% of the internet's cloud infrastructure is provided by only three players: AWS (Amazon), Azure (Microsoft), and Google Cloud (Google). Cloudflare integrates with each of these.

AWS, the largest provider, was first on the scene in 2002. AWS services are delivered to customers via a network of AWS server farms throughout the world, which customers can easily tap into to host websites and online services. Gone were the days of owning your own physical server in order to run a web page. Here are just a few of the top value propositions provided by cloud:

  • Rapid time to market for digital services.

  • Reduced spend on physical infrastructure.

  • Costs scale with usage and can be optimised.

  • Out-of-the-box managed services and tools, such as AI and data analysis.

  • Facilitation of collaboration and remote work.

With benefits like these, it is easy to see why so many organisations have adopted a cloud approach to their digital development. Google joined the party in 2008, with Azure following in 2010, and over the next decade cloud computing began to really take shape, with "cloud-based" becoming a term you heard more and more often.

There are drawbacks to these solutions, of course, with security issues and hidden costs among them. The biggest issue, however (or at least the one we are feeling right now), is the centralised dependency on these tools. When AWS went down on October 20th, numerous functional tools went down with it: people could not hear the doorbell thanks to Amazon's Ring service going down. Precious year-long Snap streaks were suddenly in jeopardy as Snap Inc was offline. Massive tools like Zoom were out of action, stalling Monday morning stand-ups across the globe. In addition, Airtable itself went down.

The pressure on the cloud giants has increased. The number of customers has grown, and the teams assembled to serve them have also grown. Moreover, while the internet has become centralised, the workforce has done the opposite and become more distributed, thanks to online collaborative tools enabled by cloud technology itself. Is a distributed workforce as reliable as its previously centralised counterpart? We are living in a world where small mistakes become more impactful, and perhaps more common.

Be ready

If you turn on your desk lamp with Amazon's Alexa and have no access to a backup switch, there will be no way of seeing the light during the next outage. It's always important to have a backup available for the next time an outage strikes.

(Thankfully Airtable and Notion users with CSV Getter enabled were able to turn to their Google Sheets backup 🎉)
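For anyone building their own tools, the same "backup switch" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names ServiceUnavailable, fetch_with_fallback, primary_source and backup_source are all hypothetical, and the two sources are stand-ins rather than real API calls to any provider. The point is simply to try the live source first and quietly fall back to a copy when it fails.

```python
from typing import Callable, Sequence


class ServiceUnavailable(Exception):
    """Raised by a (hypothetical) data source when its provider is down."""


def fetch_with_fallback(sources: Sequence[Callable[[], str]]) -> str:
    """Try each source in order and return the first successful result."""
    errors = []
    for source in sources:
        try:
            return source()
        except ServiceUnavailable as exc:
            # Remember the failure and move on to the next source.
            errors.append(exc)
    raise ServiceUnavailable(f"all {len(sources)} sources failed: {errors}")


def primary_source() -> str:
    # Stand-in for a live service behind a provider suffering an outage.
    raise ServiceUnavailable("500 Internal Server Error")


def backup_source() -> str:
    # Stand-in for a separately hosted copy of the same data (made-up rows).
    return "name,email\nAda,ada@example.com"


data = fetch_with_fallback([primary_source, backup_source])
print(data.splitlines()[0])
```

A fallback like this only helps if the backup lives somewhere that does not depend on the same provider as the primary.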