What caused the AWS outage, and what can we take away from it?
- Paige Haines

- Oct 27
- 2 min read
On the 20th of October, AWS experienced an outage originating in its US-East-1 (N. Virginia) region, creating a global cascading effect on services hosted there. Users were left unable to access applications or online services that relied on that region, with consequences ranging from minor confusion to, in some cases, actual physical implications.
What was the issue?
The issue began with a DNS failure in Amazon's DynamoDB cloud database service that caused routing information to disappear. According to Amazon, their engineers "identified the trigger of the event as DNS resolution issues for the regional DynamoDB service endpoints." In other words, while the servers were still alive and working, they were unreachable.
This failure rippled across the AWS ecosystem, with EC2 virtual machines unable to launch, Lambda functions failing to execute, and Network Load Balancers (NLBs) dropping perfectly healthy connections. Amazon later confirmed a “latent race condition” in its DNS automation, which allowed an empty DNS record to overwrite the correct one until engineers intervened manually.
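To make the "alive but unreachable" failure mode concrete, here is a minimal Python sketch (illustrative only, not part of Amazon's tooling) that checks whether the regional DynamoDB endpoint still resolves. When a DNS record is emptied or missing, the lookup fails even though the servers behind the name may be perfectly healthy.

```python
import socket

# Regional DynamoDB service endpoint for US-East-1 (the endpoint family named
# in Amazon's summary), used here purely for illustration.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

def can_resolve(hostname: str) -> bool:
    """Return True if the hostname currently resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, 443)) > 0
    except socket.gaierror:
        # An empty or missing DNS record surfaces here: the servers behind the
        # name may be healthy, but no client can find them.
        return False

if __name__ == "__main__":
    status = "resolves" if can_resolve(ENDPOINT) else "does NOT resolve"
    print(f"{ENDPOINT} {status}")
```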
Why does this matter for Australia’s energy sector?
While this outage did not directly lead to power system outages, we can't help but wonder what the true implications of this type of event could be. One widely reported example involved "smart beds" thinking for themselves, increasing their temperature and tilting into a sharp incline, with some beds even triggering wake-up alarms. Perhaps not the doomsday failure mode we might have contemplated, but nonetheless a hint of what could happen on a larger scale.
As most of us in the sector are acutely aware, the Australian electricity system is becoming rapidly interconnected, riddled like Swiss cheese with DER management platforms, back-to-base analytics systems, and remote monitoring. When these are hosted on a platform like AWS, a DNS outage could cause country-wide operational disruption through the loss of key data feeds or command interfaces.
So, over the last two weeks we have made the effort to ask many utility operators and OEMs about the outage and its impact on Australian power systems. The answer? No-one is really quite sure...
Any real-time data aggregation hosted in the cloud becomes a weakness during these outages, and even an hour of downtime or added latency could take down critical data or control systems, or kill our defensive capabilities.
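There is no single fix, but one practical hedge is to design cloud-fed systems to degrade gracefully rather than block or fail outright. The sketch below is a simple illustration of that idea; the feed URL, cache file, and timeout are hypothetical placeholders, not a reference to any particular vendor's platform. It fetches the latest readings with a short timeout and falls back to the last locally cached values when the cloud endpoint cannot be reached or resolved.

```python
import json
import time
import urllib.request
from pathlib import Path

# Hypothetical telemetry feed and local cache; both are placeholders.
FEED_URL = "https://example-der-platform.invalid/api/latest-readings"
CACHE_FILE = Path("last_known_readings.json")
TIMEOUT_SECONDS = 3

def fetch_readings() -> dict:
    """Fetch the latest readings, falling back to the local cache on failure."""
    try:
        with urllib.request.urlopen(FEED_URL, timeout=TIMEOUT_SECONDS) as resp:
            readings = json.load(resp)
        # Refresh the cache so the next outage has recent data to fall back on.
        CACHE_FILE.write_text(json.dumps({"fetched_at": time.time(),
                                          "data": readings}))
        return readings
    except OSError:
        # DNS failures, timeouts, and other network errors all surface as
        # OSError subclasses: serve the last known-good readings instead of
        # losing the feed entirely.
        if CACHE_FILE.exists():
            return json.loads(CACHE_FILE.read_text())["data"]
        raise RuntimeError("Cloud feed unreachable and no cached readings available")

if __name__ == "__main__":
    print(fetch_readings())
```

Even a pattern this simple keeps local monitoring and control logic working on stale-but-sane data while the cloud feed recovers, which is a very different outcome from losing visibility entirely.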
Considerations moving forward
These events are an opportunity for learning, improving, and preparing for a much larger and more extreme event, such as what happened on the Iberian Peninsula earlier this year. The clearest lesson we can see is that our situational awareness of the signatures and impacts of these events is still very low, as is our posture to respond appropriately.
Food for thought before the next major Cloud infrastructure outage.


