Inside Datadog’s $5M Outage

Rui Carlos

A good article about an incident that caused a global outage at Datadog: Inside Datadog’s $5M Outage

Of the points covered in the article, I would highlight two:

  • The circular dependency between the software that manages the infrastructure and the infrastructure itself, which is increasingly common.
  • The fact that running on several different cloud providers did not stop all of them from failing at the same time.


So the control plane going offline was the real problem. Had the control plane been unaffected, the outage would have likely been brief. In that case, Datadog could have just re-added the vanished nodes to the routes using the control plane. But with the control planes also gone, the first order of business had to be to get this control plane back and figure out how it disappeared in the first place.

This circular dependency, where the infrastructure control plane depends on the infrastructure it manages, recalls what happened when the video game Roblox went down for three days straight in October 2021. Then, the dependency was that Nomad (orchestrating containers) and Vault (secrets management service) both used Consul (service discovery). So when Consul became unhealthy, Vault went offline. But Vault was needed to operate Consul: a circular dependency.
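To make the circular dependency concrete, here is a minimal sketch that finds a cycle in a service dependency graph with a depth-first search. The graph is hypothetical: it only mirrors the Nomad/Vault/Consul relationships as the quote describes them, not the real Roblox or Datadog configuration, and `find_cycle` is an illustrative helper rather than part of any tool.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of services, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in deps.get(node, []):
            state = color.get(dep, WHITE)
            if state == GRAY:
                # dep is already on the current path, so we have closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hypothetical graph: "X depends on Y" edges taken from the quoted description.
deps = {
    "nomad": ["consul"],
    "vault": ["consul"],
    "consul": ["vault"],
}

print(find_cycle(deps))
```

A check like this could run against a declared dependency map before a rollout, but it only catches dependencies that someone has written down. The ones that cause incidents like these are often implicit.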


It’s interesting to consider how much Datadog did to avoid a global outage: operating a multi-cloud, multi-region, multi-zone setup with separate infrastructure control planes per region. But despite these efforts, the unforeseen event of a parallel operating system update, and the impact of this update, brought it down. This is a reminder that mitigation is just as important as prevention.


Datadog's postmortem is available here.
