
AWS Outage Takes Down Internet Blamed on DNS -- We Are a Fat Finger Away from Apocalypse

2025-10-20 Science & Technology
29.5k
1.4k
286
Eli the Computer Guy
1.1m subscribers

AWS DNS Outage Analysis: Single Points of Failure in Cloud Infrastructure

Investigate the root cause of wide-scale cloud outages, focusing on the fragility of Domain Name System (DNS) and the operational risks created by overwhelming workloads.

Short Summary

  • The recent AWS outage was traced back to a critical DNS failure that immediately crippled downstream services like DynamoDB.
  • Companies often skip costly redundancy measures (like cross-region failover) to save money, accepting higher risk profiles.
  • Extreme work hours (100+ hour weeks) increase the likelihood of catastrophic human error, such as wiping the wrong production server.
  • This discussion urges policymakers and IT leaders to establish robust oversight for essential cloud infrastructure, mirroring old utility standards.

This episode analyzes the cascading effects of the recent Amazon Web Services outage, demonstrating how a single configuration mistake in Domain Name System (DNS) routing can halt global commerce. We explore the trade-offs companies make when they forgo redundancy and the human factors behind infrastructure mishaps.
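
The cross-region failover mentioned in the summary is often built on health-checked DNS records with a short TTL. The following is a minimal Python sketch of that idea, not any provider's actual API: the `Endpoint` type, both function names, the endpoint labels, and the check interval and failure threshold are all hypothetical, and the IP addresses come from the reserved documentation ranges. The downtime figure is a rough upper bound that assumes resolvers honor the TTL.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str      # hypothetical label for the site
    ip: str        # documentation-range address (RFC 5737)
    healthy: bool  # result of the most recent health check


def records_to_serve(endpoints: list[Endpoint]) -> list[str]:
    """Return the IPs a health-checked record set would answer with."""
    healthy = [e.ip for e in endpoints if e.healthy]
    # If every check fails, serve all records rather than an empty answer.
    return healthy or [e.ip for e in endpoints]


def worst_case_failover_seconds(check_interval: int,
                                failure_threshold: int,
                                ttl: int) -> int:
    """Rough upper bound: time to detect the failure plus cache expiry."""
    return check_interval * failure_threshold + ttl


# Assumed setup: one cloud-hosted site that is failing, one local site that is up.
endpoints = [
    Endpoint("cloud-primary", "192.0.2.10", healthy=False),
    Endpoint("on-prem", "198.51.100.20", healthy=True),
]
served = records_to_serve(endpoints)

# Assumed values: 30-second checks, 3 consecutive failures, 60-second TTL.
downtime = worst_case_failover_seconds(check_interval=30,
                                       failure_threshold=3,
                                       ttl=60)
```

Under these assumed numbers the bound works out to 150 seconds, which is in line with the "few minutes" of downtime that a setup like this would be expected to show. A longer TTL lowers query volume but stretches that window, which is the cost trade-off the summary describes.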


Description

Support Content at - https://donorbox.org/etcg
LinkedIn at - https://www.linkedin.com/in/eli-etherton-a15362211/

Top Comments (10)

@Plxlinixy 2025-10-20

Isn't it funny how after decades of networking and IT pioneers warned various companies that we need to make sure we have redundancies and not over centralized networks. WHAT'S THE FIRST THING THE CORPORATE LEADERS CHOOSE TO DO!?

199 11 replies
@aquatrax123 2025-10-21

I'm just a low-paid sys admin who had servers in AWS that went down. I took the extra time to configure AWS health checks and automatic failover. I have servers in AWS and hosted locally, and when AWS went down, AWS health checks automatically detected the failure and removed the failed IPs from DNS. With a 60-second TTL, our services were only down for a few minutes while the health checks detected the outage and changed DNS. It amazes me how many other admins don't take the time to configure redundancy.

43 1 replies
@alexkatsanos8475 2025-10-20

As soon as I heard about the cloud outage I knew you were gonna give us your take pretty quickly.

44
@foley2k2 2025-10-20

Self hosting has its advantages. One day people will learn.

114 29 replies
@richsadowsky8580 2025-10-21

Back in the mid-80s I worked for a small LA startup that made medical billing software for CP/M systems using 8" floppy disks. One client in Honolulu kept saying our software stopped working after every update. They finally flew me out to fix it (not a bad gig!). Within 10 minutes I found the problem: they were pinning their 8" diskettes to a whiteboard—with magnets! Every “update” was getting wiped before it ever ran. I replaced the disks and spent the rest of the day getting a scenic tour of Oahu.

18 2 replies
@ajponte1 2025-10-21

I entered the IT world right as the cloud movement was beginning. Tools like Docker were there for us to create essentially cloud-agnostic architectures. It seemed that as soon as companies realized they could offload their infrastructure to a 3rd party, they gave up on investing in strong systems and infrastructure engineers.

9 1 replies
@rikachiu 2025-10-20

Told my staff there is nothing I can do. AWS caused a lot of egg on my face today.

34 2 replies
@steveoc64 2025-10-21

Senior engineers - the ones brought up on C64s, soldering irons and machine code - are getting towards retirement. The original guys and gals that built the operating systems and compilers and the internet protocols are also retiring. Junior engineers are not getting apprenticeships. The millennial 10x types are leaving corporate to build their own startups. All that’s left to run ops at these big critical parts of the chain are the career types who played politics and won. Give it 5 more years, and “the internet is down again” will be the new normal.

69 7 replies
@f1aziz 2025-10-20

Reading the book on Network Programming in Java some 15-20 years ago as a junior SE, one of the recommendations that stuck with me was, write code/design systems with network failures in mind and have disaster recovery in place.

18 5 replies
@mantaramg60 2025-10-21

It's one thing if Salesforce goes down; it's a whole different thing if life safety systems go down.

8
