Amazon Web Services Outage Explained: A Typo Took Down A Huge Part Of The Internet

Earlier this week, a four-hour outage at Amazon Web Services, the cloud computing division of Amazon, took down thousands of websites in the United States, showing how integral the cloud has become to the internet's infrastructure.

Dave Bartoletti, an analyst for Forrester, estimated that the outage affected as many as 100,000 websites, as AWS is currently the leading cloud provider. Amazon has since fully recovered from the outage, though it did not initially reveal the reason behind the incident.

Amazon Explains AWS Outage

In a post published on the Amazon Web Services website, Amazon apologized and explained the reason for the outage.

Amazon said that while the Simple Storage Service, or S3, team was debugging an issue that was slowing down the service's billing system, a team member executed a command intended to remove a small number of servers from one of the S3 subsystems used by the billing process.

However, as Amazon explains, "one of the inputs to the command was entered incorrectly," causing a larger group of servers than intended to be removed. This set off a domino effect, as the removed servers supported two other important S3 subsystems in the East Coast region, and a full restart was required to bring those subsystems back up.

Fully restarting these systems is not as simple as restarting a laptop, though, which is why the outage lasted for hours. Amazon had to carry out a series of safety checks to make sure that no stored files were corrupted in the process, and it took the company four hours and 17 minutes to get its systems running again.

During the outage, Amazon's direct S3 customers were not the only ones affected. Other cloud services and their customers also went down because they, too, depend on S3.

What Is Amazon Doing To Prevent Future Similar Incidents?

It is mind-boggling that a simple typo could cause a big part of the internet to crash. While Amazon was able to rectify the problem, the incident is still a cause for concern.

To make sure that no similar incidents happen in the future, Amazon said that it has modified the tool that the team member used to remove servers. Previously, the tool allowed too much capacity to be taken down too quickly with a single incorrectly entered command. The tool will now remove capacity more slowly, and additional safeguards will prevent capacity from being removed when doing so would bring any subsystem below its minimum required capacity level.
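Amazon has not published the tool itself, so the following is only a minimal sketch of the two safeguards described above, written in Python. Every name in it (Subsystem, remove_capacity, min_required, max_per_step) is hypothetical and chosen for illustration, and the server counts are invented. It shows the general idea: refuse any request that would push a subsystem below its floor, and cap how much a single invocation can take down.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal request would violate a safeguard."""


@dataclass
class Subsystem:
    # All fields are hypothetical; the real tool's data model is not public.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem cannot operate


def remove_capacity(subsystem: Subsystem, requested: int, max_per_step: int = 2) -> int:
    """Remove servers in one bounded step, never crossing the minimum floor.

    Returns the number of servers actually removed by this call.
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")

    # Safeguard 1: reject the whole request if it would breach the minimum,
    # so a mistyped (too large) input fails instead of taking servers down.
    if subsystem.active_servers - requested < subsystem.min_required:
        raise CapacityGuardError(
            f"removing {requested} servers would leave {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove capacity slowly by capping each invocation.
    step = min(requested, max_per_step)
    subsystem.active_servers -= step
    return step


if __name__ == "__main__":
    billing = Subsystem(name="billing", active_servers=10, min_required=6)
    print(remove_capacity(billing, requested=3))  # only one capped step is removed
    try:
        remove_capacity(billing, requested=40)  # simulated typo
    except CapacityGuardError as err:
        print("blocked:", err)
```

In this sketch, an operator who mistypes the number of servers gets an error instead of an outage, and even a valid request is carried out in small steps that can be stopped partway.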

Other Amazon Web Services News

Amazon Web Services made headlines last month for reasons other than a typo-induced shutdown, as it rolled out Chime and was said to be working on a productivity suite.

Chime, which is available on Windows, Mac, Android, and iOS devices, is Amazon Web Services' own video conferencing and communication tool. The productivity suite that Amazon Web Services is working on, meanwhile, is said to be powerful enough to rival Microsoft Office 365 and Google's G Suite.