24.12.2023

Network fault on 11.12.2023

On Monday, 11.12.2023, there was a network outage between 14:24 and 14:31. In this post we look at the technical background of the fault, which was the first and so far only total failure of our network since the commissioning of our own AS (58212).

Tim Lauderbach

Background

We made our first attempts at operating our own network in 2018, still significantly dependent on our former data center. Those first steps were, of course, not yet comparable with our current, very stable network operation. We learned a lot from our mistakes in those first years and then put our own AS (Autonomous System) into operation at the beginning of 2020, once again investing time and money to iron out all the teething problems of the first setup. Of course: carriers (of which we have several) can fail, Internet exchanges such as DE-CIX (to which we have been directly connected for years) can be disrupted... but by design, such problems resolve themselves within seconds or minutes and only ever affect individual paths. Until 11.12.2023, when the DataForest network was completely offline for just under eight minutes in the afternoon, we had not had a single total failure of the current network setup, a run of over three and a half years. It was almost four.

Monday

Although a monitoring alert concerning our network usually comes with a heightened state of alert for those on call, it is rarely anything dramatic. In this case, however, two minutes after the start of the fault it was clear that we had a major problem, so all available employees were notified in order to handle the many calls on our hotline, which worked out well thanks to well-rehearsed escalation processes. Since many of our readers here are interested in the technology, we want to share the main results of our first and second analysis with you transparently. For anyone who is not interested, there is a brief summary below.

If our monitoring reports “everything down” and this can then be confirmed, the first (figurative) step is to our management network, which we operate “out of band”, i.e. completely independently of our own network. This structure is extremely important so that we never lock ourselves out, whether through a misconfiguration or a network failure. In an absolute emergency, it also gives us access to the serial interfaces of our routing and switching infrastructure.
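
For readers who want a concrete picture: on a Junos router, such a dedicated management interface is typically configured per routing engine via configuration groups, roughly like this (a purely illustrative sketch, not our actual configuration; the addresses are documentation placeholders and the re0 hostname is assumed by analogy to re1):

set groups re0 system host-name re0-mx480.mc-fra01.as58212.net
set groups re0 interfaces fxp0 unit 0 family inet address 192.0.2.10/24
set groups re1 system host-name re1-mx480.mc-fra01.as58212.net
set groups re1 interfaces fxp0 unit 0 family inet address 192.0.2.11/24
set apply-groups re0
set apply-groups re1

The fxp0 port is the management Ethernet interface on the routing engine itself; attached to a separate management network, it remains reachable independently of the production interfaces on the line cards.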

Since the management network was reachable, we were able to rule out a power failure, which probably makes every administrator breathe a sigh of relief for a moment, despite the 100% SLA on power and air conditioning at our Maincubes site. A login on the active router then showed, first of all, that a failover had taken place from the primary routing engine to the standby routing engine. That can happen and has happened in the past (both planned and unplanned), but by design nothing ever comes of it, and nothing ever had. Our network design aims for almost 100% availability, which essentially means that there is no central component whose failure, whether on the hardware or the software side, would be likely to cause a serious malfunction. So what had happened?
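
Before we get to that, a small aside for anyone following along on their own equipment: the mastership state and general health of both routing engines is a single operational command away (we spare you the raw output here).

root@re1-mx480.mc-fra01.as58212.net > show chassis routing-engine

Among other things, this shows for each routing engine whether it is currently master or backup, how long it has been up and the reason for its last reboot.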

Acute analysis

All system-relevant processes were running, all line cards and power supplies were present and operating without errors, and no errors were apparent (our monitoring records these as well, but they are delivered via SNMP and would therefore only have arrived with some delay). The only exception was the alarm indicating that the standby routing engine was currently active, and even the primary routing engine, which we had initially feared dead, was still running and had not undergone a “spontaneous reboot” or anything of the kind. Then a bad premonition was confirmed:

root@re1-mx480.mc-fra01.as58212.net > show bgp summary
error: the routing subsystem is not running

root@re1-mx480.mc-fra01.as58212.net > show route table inet.0
error: the routing subsystem is not running

That is really not a message you want to see. But it made sense in the context of the previous look at the processes, because we already knew that the “routing protocol process” (“rpd” for short) had only just restarted. A confirming look at the logs showed that it had just crashed and was still in the process of recovering. From that we could tell that everything would be back online shortly, and that is exactly what happened. By the time of this discovery, roughly five minutes had gone into the initial analysis, and two minutes later the first recovery messages arrived from Opsgenie. In the end, we did not have to intervene at all to actually clear the fault. Not a pleasant incident, but the self-healing did restore a bit of trust.
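
For the curious, the analysis described above boils down to a handful of operational commands along these lines (an illustrative sketch, not a transcript of the actual session):

root@re1-mx480.mc-fra01.as58212.net > show system processes extensive | match rpd
root@re1-mx480.mc-fra01.as58212.net > show log messages | match rpd
root@re1-mx480.mc-fra01.as58212.net > show system uptime

The first command shows whether (and with which PID) the “rpd” process is running, the second pulls the relevant crash and restart entries out of the system log, and the third shows, among other things, when the routing protocols were last (re)started.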

Cause(s) of the failure

As described above, a failover to the standby routing engine had taken place, and it was reasonable to assume that this process was also the cause. It quickly became clear that this was not the case — in fact, the network continued running for almost exactly five minutes after the failover. Why the failover — and why the five minutes? These were the key questions that needed to be clarified.

Once doesn't count

Our Junos configuration (Junos is the operating system of network equipment manufacturer Juniper Networks) provides that in the event of a crash of the “rpd”, which plays the leading role in this incident, a switchover to the standby routing engine is performed, even if the rest of the active routing engine is working properly. This is generally best practice (exceptions prove the rule) and will remain so for us. The configuration ensures that everything keeps running without an outage in the event of such a crash. It is true that the line cards do not need the “rpd” to forward data packets for the moment, but routing protocols such as BGP would lose their state, and that in turn would lead to at least a partial outage within a very short time. The setting therefore makes sense and did exactly what it should here, namely switch routing engines without an outage after the crash. Because the two routing engines constantly synchronize with each other, such a switchover is possible at any time without an outage (and we have already performed it several times for maintenance purposes without problems or outages). That answered the first question.
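
To make this a little more tangible: the relevant part of such a redundancy setup looks roughly like this in Junos “set” syntax (a simplified sketch of the general mechanism, not our complete or exact configuration):

set chassis redundancy graceful-switchover
set routing-options nonstop-routing
set system commit synchronize
set system switchover-on-routing-crash

Graceful switchover (GRES) keeps kernel and interface state synchronized to the standby routing engine, nonstop routing (NSR) additionally replicates the state of the routing protocols, “commit synchronize” ensures that both routing engines always carry the same configuration, and “switchover-on-routing-crash” is what triggers the failover when the rpd dies.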

But twice is twice too much

The failover as a result of the rpd crash took place at 14:21. At 14:23, the process crashed on the standby routing engine as well. And that is when the disaster began: the “rpd” on the primary routing engine simply had not yet fully restarted, so in Junos terms that routing engine was simply “not ready for switchover”. From that point, there were only one to two minutes left until all of our BGP upstreams had “forgotten” us and removed us from their routing tables, because there was no longer any BGP running on our side. The fact that the line cards kept forwarding traffic “without direction” for a while no longer helped us, because our AS gradually disappeared from the Internet.
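
Two details for the technically inclined: whether a routing engine is ready to take over can be checked on the backup routing engine itself (the re0 hostname here is again purely illustrative), and the window until the upstreams drop the sessions is governed by the BGP hold time, which defaults to 90 seconds on Junos and is negotiated per session with each peer.

root@re0-mx480.mc-fra01.as58212.net > show system switchover

In the healthy case, this reports the configuration and kernel databases as ready and the switchover status as “Ready”, which is exactly the state the primary routing engine could not reach in time while its rpd was still starting up.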

Key cause and solution

That the rpd crashes from time to time: as mentioned, it happens, but it is very rare. That it does so several times in a row and takes the network down with it normally does not happen and had never happened to us before. After intensive analysis of the core dumps and log files available to us, we were able to identify a Junos bug that clearly matches the behavior observed here and was fixed in a more recent release, which we subsequently installed.
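
For completeness, the two checks that matter at this stage, again as an illustrative sketch rather than a transcript:

root@re1-mx480.mc-fra01.as58212.net > show system core-dumps
root@re1-mx480.mc-fra01.as58212.net > show version invoke-on all-routing-engines

The first lists the core files written by crashed processes, which are also what you hand to Juniper support to match a crash against known problem reports; the second confirms, after the upgrade, that both routing engines are actually running the fixed release.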

Bad timing

On 03.12.2023, we had a six-minute disruption in a virtualization rack which affected around 50% of our SolusVM-based vServer host systems, but fortunately no other products such as dedicated servers or colocation. That network issue was due to a misconfiguration. At around 3 o'clock on 10.12.2023, one of our carriers had a total failure, which affected 80% of our IPv4 networks for a good two minutes and repeated itself at around 11 o'clock on 20.12.2023. None of these incidents had anything to do with the network outage this article is about, but we understand that this accumulation has raised general questions about network stability in December among some customers, and we are working hard to deliver a long period of stability again, at least as far as what is within our control.

Conclusion and potential for improvement

That our setup has now had a problem for once after almost four years does not shake our fundamental trust in it: hardware and software are not error-free, such bugs occur (rarely) and can usually be resolved quickly. The Juniper MX series we use has been regarded on the market as “rock solid” for operating highly available network infrastructure for almost two decades. And so we are convinced that we have actually solved the problem with the software update. However, we can learn a few lessons from this:

  1. Launch of our status page: Admittedly, this has been wanted by you and by us not just since this fault, but it is now reason enough to finally create a proper solution for communicating faults and maintenance work at status.dataforest.net and to retire the current mix of social media and emails. From the beginning of January 2024, this will be the central hub. Of course, you can also subscribe to the status page (for now via email, Slack and Atom/RSS).
  2. Sub-optimal emergency maintenance: For 17.12.2023, 3:00 a.m., we had announced packet loss in the range of seconds. On the one hand, the announcement came at very short notice, because the bug had suddenly reappeared several times the night before, after a week without any problems (without causing an outage in each case, but it alarmed us, of course). In future, we will announce a corresponding software update as soon as possible after such a fault and then, if necessary, bring it forward. On the other hand, there was an annoying downtime of a full five minutes, because the router's line cards restarted completely. That was simply a misjudgement on our part, for which we can only apologize at this point. In future, we will consider migrating our customers and carriers to a replacement router for such an update in order to avoid real downtime, or simply planning for a larger downtime and not only announcing it explicitly, but also placing it at an even less critical time. However, if you want to avoid downtime almost completely, such a migration is by no means a matter of minutes, and therefore not always a realistic option.
  3. Better separation of edge and core routing: Independently of this fault, this was already a work in progress: in future, we will physically separate our edge routing, i.e. the termination of the BGP sessions to our upstreams, from the core routing where the actual customer VLANs terminate. In the event of faults and maintenance work of this kind, this will reduce the convergence time.

With this in mind: Thank you for reading and Merry Christmas!