AWS Down

Discussion in 'Lounge' started by J0n3s, Oct 20, 2025 at 7:09 PM.

  1. Not sure if I'm feeling more 404 or 500 over this?

    (nerdy joke)
     
    • Funny x 2
  2. Yeah... my brain had an internal error, and besides, it couldn't find the page you were referring to... :)

    Quite why anybody thinks it's a good idea to configure a network with a single point of failure is beyond me. I notice AWS haven't detailed how the DNS server failures came about, and I see a Wired article mentions that "there is no indication that Monday's AWS outages were nefarious."

    Btw, if anyone is interested, see the list of HTTP error/status codes on Wikipedia: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
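
    For anyone who wants the joke spelled out in code, here's a minimal sketch using Python's standard-library http.server: 404 means "you asked for something I can't find", 500 means "something broke on my end". The paths and messages are invented for illustration, nothing AWS-specific.

    # Minimal sketch of the 404 vs 500 distinction, standard library only.
    # The paths and messages are made up for illustration.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class DemoHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/exists":
                self.send_response(200)   # client asked for something we have
                self.end_headers()
                self.wfile.write(b"All good\n")
            elif self.path == "/broken":
                self.send_response(500)   # server-side fault: our problem
                self.end_headers()
                self.wfile.write(b"Internal error\n")
            else:
                self.send_response(404)   # client-side miss: we can't find it
                self.end_headers()
                self.wfile.write(b"Not found\n")

    if __name__ == "__main__":
        HTTPServer(("localhost", 8080), DemoHandler).serve_forever()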
     
    #2 Andy Bee, Oct 21, 2025 at 12:28 PM
    Last edited: Oct 21, 2025 at 12:39 PM
    • Like x 1
  3. What's disappointing about the high street news coverage of the outage is the complete lack of domain knowledge from reporters who convince themselves they are experts. And why oh why has the Graun reporter quoted the GMB union, which has no domain knowledge (or members) in the AWS or tech teams and is conflating people who work in the warehouses with those working in the tech teams. https://www.theguardian.com/technol...exposed-uk-states-17bn-reliance-on-tech-giant

    Amazon treat and pay their AWS teams very, very well; they are highly valued and among the most targeted and poached employees they have. Salaries >$250k are very common in those teams. In the UK the going rate for contract AWS experts is over £1k per day.

    The issue was related to DNS, probably one of the most complicated areas of IP technology. Experts in the area acknowledge this and have the utmost respect for the people tackling the issue. DNS issues are generally very gnarly, hard to pinpoint and harder to fix without bringing everything down.

    We use a mixture of Google Cloud, Azure and AWS, the latter being by far the best and also the most cost effective. Those writing in the Graun and other rags proposing that AWS be broken up and that we should be using smaller companies fundamentally don't understand the technology, or why AWS's size is exactly the reason for its success. It's not a case of if things break, it's a case of when, and when it happens you want "fail over" and lots of it, all over the world. We have tried the smaller-company route; never again. Things always break, and the smaller companies don't have the resources or culture to fix them fast; you will go days and weeks waiting for updates and resolutions.
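
    To illustrate the "fail over, and lots of it" point, here's a rough Python sketch of a client that walks an ordered list of regional endpoints until one answers. The hostnames and path are made up, and real deployments usually push this into DNS routing policies or a global load balancer rather than client code.

    # Rough sketch of client-side multi-region failover.
    # Endpoints and paths are hypothetical.
    import urllib.request
    import urllib.error

    REGION_ENDPOINTS = [
        "https://eu-west-1.example.com",        # primary (invented hostnames)
        "https://us-east-1.example.com",        # first fallback
        "https://ap-southeast-2.example.com",   # last resort
    ]

    def fetch_with_failover(path: str, timeout: float = 2.0) -> bytes:
        last_error = None
        for base in REGION_ENDPOINTS:
            try:
                with urllib.request.urlopen(base + path, timeout=timeout) as resp:
                    if resp.status == 200:
                        return resp.read()
            except (urllib.error.URLError, OSError) as exc:
                last_error = exc   # note the failure, try the next region
        raise RuntimeError(f"All regions failed; last error: {last_error}")

    # Usage: data = fetch_with_failover("/health")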

    Amazon were very open about the failure, its cause, and its resolution. It's a completely different proposition to their warehouses and logistics operations.
     
    • Like x 1
  4. Interesting post, although The Grauniad reporting comes as no surprise. Jeff Bezos is one of their despised billionaire tech oligarchs, so any chance of giving him a kicking, regardless of the facts, is not to be missed. Which does sortov go against their much-trumpeted "Comment is free but facts are sacred" dictum. It's also funny that many of their readers who hate Amazon for its retail operations are unaware that around 70% of Amazon's profits (and hence Bezos's wealth) come from AWS. And further, they may not realise that, ironically, the online Grauniad itself utilises Amazon's service & hosting functions.

    Assuming the issue was due to incorrect DNS data configuration, it still surprises me that there aren't rigid processes that allow for a system fallback to previously working data. My own DNS knowledge is pretty thin & flaky, but I understand data changes are tightly managed, with only a few super users having edit permissions. Seemingly DNS servers can be configured in primary/secondary/standby/slave arrangements for load sharing & fault tolerance, but all servers seem to be automatically updated when the master is updated.
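
    On the primary/secondary point: secondaries track the primary's zone via the serial number in the SOA record, so one crude way to see whether a change has propagated is to compare serials across the nameservers. A sketch using the third-party dnspython library (2.x assumed); the zone name and server addresses are placeholders:

    # Crude check that secondaries have picked up the primary's latest zone:
    # compare the SOA serial each server reports. Requires dnspython;
    # zone and IPs below are placeholders.
    import dns.resolver

    ZONE = "example.com"
    NAMESERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # primary + secondaries

    def soa_serial(server_ip: str, zone: str) -> int:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [server_ip]
        answer = resolver.resolve(zone, "SOA")
        return answer[0].serial

    serials = {ip: soa_serial(ip, ZONE) for ip in NAMESERVERS}
    print(serials)
    if len(set(serials.values())) == 1:
        print("All nameservers agree on the zone serial.")
    else:
        print("Serial mismatch: at least one server is serving stale data.")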

    I'm an old-school telecomms guy, and whenever configuration data changes were made on a live system there was always an extended run-time period to ensure the change was error free. And if errors were detected there was a well-defined & tested rollback path, or a switchover to the old working configuration.
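
    That change/soak/rollback discipline translates fairly directly into code. A generic Python sketch, where apply_config, health_check and rollback are placeholders for whatever the platform actually provides, not a real API:

    # Generic change-management sketch: apply a config, soak it for a while
    # with health checks, and roll back automatically if anything looks wrong.
    import time

    SOAK_SECONDS = 600      # extended run time before declaring success
    CHECK_INTERVAL = 30

    def deploy_with_rollback(new_config, old_config,
                             apply_config, health_check, rollback):
        apply_config(new_config)
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if not health_check():
                rollback(old_config)   # well-defined path back to known-good data
                return False
            time.sleep(CHECK_INTERVAL)
        return True                    # survived the soak period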

    I latterly worked in network optimisation, and most customers had revoltingly complex and awfully optimised networks because they were built organically & on the fly, by just chucking some more routers & links at the problem as and when money or demand required. The fluid nature of IP technology & its advancement helped to facilitate this, until it reached a point where any changes or updates became fiendishly difficult to achieve. It wouldn't surprise me if this were a contributing factor in this instance.
     