Workgroup:
Internet Engineering Task Force
Internet-Draft:
draft-fft-architecture-00
Published:
October 2024
Intended Status:
Standards Track
Expires:
24 April 2025
Authors:
D. Li
Tsinghua University
K. Gao
Zhongguancun Laboratory
S. Wang
Zhongguancun Laboratory
L. Chen
Zhongguancun Laboratory
X. Geng
Huawei

Fast Fault Tolerance Architecture for Programmable Datacenter Networks

Abstract

This document introduces a fast rerouting architecture that enhances network resilience through rapid failure detection and swift traffic rerouting within the programmable data plane, leveraging in-band network telemetry and source routing. Unlike traditional methods that rely on the control plane and face significant delays in traffic rerouting, the proposed architecture uses white-box modeling of the data plane to distinguish and analyze packet losses accurately, enabling immediate identification of link failures (including black-hole and gray failures). By utilizing real-time telemetry and SR-based rerouting, the proposed solution reduces rerouting times to a few milliseconds, a substantial improvement over existing practices and a significant advancement in the fault tolerance of datacenter networks.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 24 April 2025.

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Architecture Overview
   4.  Fast Fault Tolerance Architecture
     4.1.  Failure Detection Mechanism
       4.1.1.  Counter Deployment
       4.1.2.  Counter Comparison
       4.1.3.  Failure Recovery Detection
       4.1.4.  An Example
     4.2.  Failure Notification Mechanism
     4.3.  Path Management Mechanism
   5.  Security Considerations
   6.  IANA Considerations
   Acknowledgements
   References
     Normative References
   Authors' Addresses

1. Introduction

In the rapidly evolving landscape of network technologies, ensuring the resilience and reliability of data transmission has become paramount. Traditional approaches to network failure detection and rerouting, heavily reliant on the control plane, often suffer from significant delays due to the inherent latency in failure notification, route learning, and route table updates. These delays can severely impact the performance of time-sensitive applications, making it crucial to explore more efficient methods for failure detection and traffic rerouting. Fast fault tolerance (FFT) architecture leverages the capabilities of the programmable data plane to significantly reduce the time required to detect link failures and reroute traffic, thereby enhancing the overall robustness of datacenter networks.

The FFT architecture integrates in-band network telemetry (INT [RFC9232]) with source routing (SR [RFC8402]) to facilitate rapid path switching directly within the data plane. Unlike traditional schemes that treat the data plane as a "black box" and struggle to distinguish between different types of packet losses, FFT adopts a "white box" model of the data plane's packet processing logic. This allows for a precise analysis of packet loss types and the implementation of targeted statistical methods for failure detection. By deploying packet counters at both ends of a link and comparing them periodically, FFT can identify fault-induced packet losses quickly and accurately.

Furthermore, by maintaining a path information table in advance and utilizing SR (e.g., SRv6 [RFC8986] and SR-MPLS [RFC8660]), the FFT architecture enables the sender to quickly switch traffic to alternative paths without control plane intervention. This not only circumvents the delays associated with traditional control plane rerouting but also overcomes the limitations of data plane rerouting schemes that cannot prepare for all failure scenarios in advance. The integration of INT allows for real-time failure notification, making it possible to keep traffic recovery times within a few milliseconds, significantly faster than conventional methods. This document details the principles, architecture, and operational mechanisms of FFT, aiming to contribute to the development of more resilient and efficient datacenter networks.

1.1. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Terminology

Packet Counters:

The counter or data structure used to count the number of packets passing through in a given time interval.

Path Information Table:

The table maintained by the sender that contains information about the available paths and their associated metrics.

Upstream Meter (UM):

The meter used to measure the number of packets passing through the upstream egress port of a link.

Downstream Meter (DM):

The meter used to measure the number of packets passing through the downstream ingress port of a link.

FDM-U:

The Failure Detection Mechanism (FDM) agent deployed on the upstream switch; it generates probe (request) packets to collect the UM and DM counts.

FDM-D:

The FDM agent deployed on the downstream switch; it generates response packets that feed the UM and DM counts back to FDM-U.

3. Architecture Overview

              4.SR-based Rerouting     Switch#3
                       +----------> +------------+
                       |     +------|            |---------+
    Endhost#1          |     |      +------------+         |
+---------------+      |     |                        Switch#4
|               |------+     |                      +-----------+
| +-----------+ |        Switch#1                   |+--------- |
| |  3. Path  | |      +----------+                 ||  Packet ||
| | Management| +------|          |                 || Counters||
| | Mechanism | |      +----------+                 ||in Inport||
| +-----------+ |<------+    |                      |+---------+|
|               |       |    |         Switch#2     +-----------+
+---------------+       |    |      +------------+         ^ |
                        |    |      |+----------+|         | |
                        |    |      ||   Packet ||         | |
                        |    +------||  Counters||---------+ |
                        +-----------||in Outport|| <---------+
 2.Failure Notification Mechanism   |+----------+|      1.FDM
                                    +------------+
Figure 1: Fast Fault Tolerance Architecture.

Traditional network failure detection methods generate probe packets through the control plane (such as BFD [RFC5880]), treating the network data plane as a "black box". If there is no response to a probe, a link failure is assumed to have occurred, without the ability to distinguish between fault-induced packet loss and non-fault packet loss (e.g., congestion loss or policy-based drops). FFT models the packet processing logic in the data plane as a white box, analyzing all types of packet loss and designing corresponding statistical methods. As shown in Figure 1, FFT deploys packet counters at both ends of a link, which tally the total number of packets passing through as well as the number of non-fault packet losses, and periodically compares the two sets of counters to precisely measure fault-induced packet loss. This method operates entirely in the data plane, with probe packets generated directly by programmable network chips, thus allowing for a higher probe frequency and the ability to detect link failures within a millisecond.

After detecting a link failure, FFT enables fast path switching for traffic in the data plane by combining INT with source routing. As shown in Figure 1, after a switch detects a link failure, it promptly notifies the sender of the failure information using INT technology; the sender then quickly switches the traffic to another available path using source routing, based on a path information table maintained in advance. All processes of this method are completed in the data plane, allowing traffic recovery time to be controlled within a few RTTs (on the order of milliseconds).

4. Fast Fault Tolerance Architecture

The fast fault tolerance architecture involves accurately detecting link failures within the network, distinguishing between packet losses caused by failures and normal packet losses, and then having switches convey failure information back to the end hosts via INT [RFC9232]. The end hosts, in turn, utilize SR (e.g., SRv6 [RFC8986] and SR-MPLS [RFC8660]) to reroute traffic. The architecture therefore comprises three processes: failure detection (Section 4.1), failure notification (Section 4.2), and path management (Section 4.3).

4.1. Failure Detection Mechanism

         Upstream Switch                   Downstream Switch
+--------------------------------+  +------------------------------+
|+--------------+  +------------+|  |+-----------------+ +--------+|
||         +---+|  |+--+        ||  ||        +--++---+| |        ||
|| Ingress |FDM||->||UM| Egress ||  || Ingress|DM||FDM|+>| Egress ||
||Pipeline | -U||  ||  |Pipeline||  ||Pipeline|  || -D|| |Pipeline||
||         +---+|  |+--+        ||  ||        +--++---+| |        ||
|+--------------+  +------------++->|+-----------------+ +--------+|
|          +---+    +---+--+     |  |  +---+--+--+                 |
|          |Req|->  |Req|UM|->   |  |  |Req|UM|DM|--->             |
|          +---+    +---+--+     |  |  +---+--+--+                 |
|                                |  |      +----+--+--+            |
|                                |  | <----|Resp|UM|DM|            |
|                                |  |      +----+--+--+            |
+--------------------------------+  +------------------------------+
Figure 2: Failure Detection Mechanism: counter deployment locations and request generation.

This document describes a failure detection mechanism (FDM) based on packet counters, leveraging the programmable data plane. As shown in Figure 2, this mechanism employs counters at both ends of a link to tally packet losses. Adjacent switches can thus collaborate to detect failures of any type (including gray failures), and the mechanism accurately distinguishes non-failure packet losses, avoiding false positives.

4.1.1. Counter Deployment

FDM places a pair of counter arrays on two directly connected programmable switches to achieve rapid and accurate failure detection. Figure 2 illustrates the deployment locations of these counters, which include two types of meter arrays: (1) the Upstream Meter (UM) is positioned at the beginning of the egress pipeline of the upstream switch; (2) the Downstream Meter (DM) is located at the end of the ingress pipeline of the downstream switch. Each meter records the number of packets passing through. With this arrangement, the difference between UM and DM represents the number of packets lost on the link. It is important to note that packets dropped due to congestion in the switch buffers are not counted, as the counters do not cover the buffer areas.

Furthermore, to exclude packet losses caused by non-failure reasons, each meter array includes some counters to tally the number of non-failure packet losses (e.g., TTL expiry). Therefore, FDM is capable of accurately measuring the total number of failure-induced packet losses occurring between UM and DM, including losses due to physical device failures (e.g., cable dust or link jitter) and control plane oscillations (e.g., route lookup misses).
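
The meter arrays described above can be thought of as a small set of per-link counters: one counting packets that pass the meter and another counting drops from known non-failure causes. The following non-normative Python sketch is purely illustrative (the MeterArray class and its method names are hypothetical, not a switch API) and shows the state each meter accumulates within one measurement batch.

   # Illustrative model of one meter array; a real deployment
   # implements these as counters in the programmable switch pipeline.
   class MeterArray:
       def __init__(self):
           self.passed = 0           # packets that passed this meter
           self.non_fault_drops = 0  # drops from known non-failure causes

       def count_packet(self):
           self.passed += 1

       def count_non_fault_drop(self):
           # e.g., TTL expiry or an ACL drop recorded by this meter array
           self.non_fault_drops += 1

   um = MeterArray()  # beginning of the upstream egress pipeline
   dm = MeterArray()  # end of the downstream ingress pipeline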

                                               +----------+
                                               | switch#3 |
                                               +-----+    |
         +----------+    +---------------+  +->|DM#2 |    |
         |          |    |         +-----+  |  +-----+    |
         |    +-----+    +-----+   |UM#2 |--+  +----------+
         |    |UM#1 |--->|DM#1 |   +-----+
         |    +-----+    +-----+   +-----+
         |          |    |         |UM#3 |--+  +----------+
         | switch#1 |    |switch#2 +-----+  |  +-----+    |
         +----------+    +---------------+  +->|DM#3 |    |
                                               +-----+    |
                                               | switch#4 |
                                               +----------+
Figure 3: FDM (UM and DM) deployment on all network links.

Figure 3 illustrates the deployment method of FDM across the entire datacenter network. Similar to the BFD mechanism, FDM needs to cover every link in the network. Therefore, each link in the network requires the deployment of a pair of UM and DM. It is important to note that although only the unidirectional deployment from Switch#1 to Switch#2 is depicted in Figure 3, Switch#2 also sends traffic to Switch#1. To monitor the link from Switch#2 to Switch#1, FDM deploys a UM on the egress port of Switch#2 and a DM on the ingress port of Switch#1. Consequently, FDM utilizes two pairs of UM and DM to monitor a bidirectional link.
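
Continuing the illustrative sketch above, the following hypothetical helper (not part of any specified interface) captures the point that monitoring a bidirectional link requires one (UM, DM) pair per direction, i.e., two pairs in total.

   # Hypothetical helper: one (UM, DM) pair per directed link, so a
   # bidirectional link between two switches needs two pairs.
   def deploy_fdm(switch_a, switch_b):
       return {
           (switch_a, switch_b): {"UM": MeterArray(), "DM": MeterArray()},
           (switch_b, switch_a): {"UM": MeterArray(), "DM": MeterArray()},
       }

   meters = deploy_fdm("Switch#1", "Switch#2")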

4.1.2. Counter Comparison

As shown in Figure 2, the FDM agent in the upstream switch (FDM-U) periodically sends request packets to the link's opposite end. These request packets record specific data of UM and DM along the path through the INT mechanism. Upon detecting a request packet, the FDM agent in the downstream switch (FDM-D) immediately rewrites it as a response packet and bounces it back, allowing the packet containing the UM and DM data to return to FDM-U. Subsequently, FDM-U processes the response packets and calculates the packet loss rate of the link over the past period. If FDM-U continuously fails to receive a response packet, indicating that either the response or request packets were lost, FDM-U considers the packet loss rate of that link to be 100%. This can be used to detect black-hole failures on the link. In other scenarios, if the packet loss rate exceeds a threshold (e.g., 5%) for an extended period, FDM-U marks that outgoing link as failed.
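
The per-period decision logic of FDM-U can be summarized by the following non-normative sketch. The 5% threshold comes from the example above; the window size, function name, and state dictionary are illustrative assumptions rather than specified values.

   LOSS_THRESHOLD = 0.05   # e.g., 5%, as mentioned above
   FAIL_WINDOW = 3         # consecutive bad periods before marking failed

   def evaluate_period(um_passed, dm_passed, non_fault_drops,
                       responded, state):
       if not responded:
           # No response this period: the request or response was lost,
           # so treat the link as 100% lossy (black-hole suspicion).
           loss_rate = 1.0
       elif um_passed == 0:
           loss_rate = 0.0
       else:
           lost = um_passed - dm_passed - non_fault_drops
           loss_rate = lost / um_passed

       if loss_rate > LOSS_THRESHOLD:
           state["bad_periods"] += 1
       else:
           state["bad_periods"] = 0
       if state["bad_periods"] >= FAIL_WINDOW:
           state["link_failed"] = True
       return loss_rate

   state = {"bad_periods": 0, "link_failed": False}
   evaluate_period(100, 90, 2, True, state)   # returns 0.08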

             Upstream Switch           Downstream Switch
         +----------------------+    +--------------------+
         |    +---+      +---+  |    |    +---+    +---+  |
         | 000|Req|000000|Req|00+--->|0000|Req|0000|Req|0 |
         |    +---+      +---+  |    |    +---+    +---+  |
         +----------------------+    +--------------------+
         Req: INT request packet
         0: data packet
Figure 4: An example for illustrating the batch synchronization provided by request packets.

To ensure the correctness of packet loss rate statistics, FDM must ensure that the packets recorded by UM and DM belong to the same batch. Request packets naturally provide this batch synchronization: FDM only needs to reset the counters upon receiving a request packet and then start counting the new batch. Specifically, since packets between two directly connected ports do not get reordered, the sequence of packets passing through UM and DM is consistent. As shown in Figure 4, the request packets isolate different intervals and record the number of packets in each interval. When such a request packet reaches the downstream switch, the DM records the number of packets for the same interval. Thus, UM and DM count the same batch of packets. However, the loss of a request packet would disrupt FDM's batch synchronization. To avoid this, FDM configures active queue management to prevent the dropping of request packets during buffer congestion. If a request packet is still lost, it must be due to a fault.
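
The batch synchronization described above amounts to a snapshot-and-reset on every request packet. The sketch below reuses the MeterArray model from Section 4.1.1 and uses hypothetical handler and class names; it does not describe an actual INT encoding.

   class Probe:
       def __init__(self):
           self.metadata = []        # (passed, non_fault_drops) per meter
           self.is_response = False

   def on_request_at_meter(meter, probe):
       # Close the batch that just ended at this meter: record its
       # counters into the probe, then start counting the next batch.
       probe.metadata.append((meter.passed, meter.non_fault_drops))
       meter.passed = 0
       meter.non_fault_drops = 0

   def fdm_d_bounce(probe):
       # FDM-D turns the request into a response and sends it back,
       # carrying both the UM and DM readings to FDM-U.
       probe.is_response = True
       return probe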

4.1.3. Failure Recovery Detection

To ensure stable network operation after failure recovery, FDM also periodically monitors the recovery status of links. This requires the FDM-U to send a batch of test packets, triggering UM and DM to count. Then, the FDM-U sends request packets to collect data from UM and DM. If the link's packet loss rate remains below the threshold for an extended period, FDM-U will mark the link as healthy. To reduce the bandwidth overhead of FDM, considering that the detection of failure recovery is not as urgent as failure detection, FDM can use a lower recovery detection frequency, such as once every second.
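
As a rough illustration of the recovery check, the sketch below reuses LOSS_THRESHOLD from the earlier sketch; the one-second interval comes from the text, while HEALTHY_WINDOW and the function name are assumptions.

   RECOVERY_INTERVAL = 1.0   # seconds; lower frequency than failure detection
   HEALTHY_WINDOW = 5        # consecutive good checks before declaring recovery

   def recovery_check(loss_rate, state):
       # Called once per RECOVERY_INTERVAL after a batch of test packets
       # and a request/response exchange have produced a loss rate.
       if loss_rate <= LOSS_THRESHOLD:
           state["good_checks"] += 1
       else:
           state["good_checks"] = 0
       if state["good_checks"] >= HEALTHY_WINDOW:
           state["link_failed"] = False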

4.1.4. An Example

This section presents an example of how FDM calculates the packet loss rate of a link. Assume that 100 packets pass through the upstream switch UM, which records [100,0], with 0 representing no non-fault-related packet loss. Suppose 8 packets are dropped on the physical link and 2 packets are dropped at the ingress pipeline of the downstream switch due to ACL rules. Then, the DM records [90,2], where 90 represents the number of packets that passed through DM, and 2 represents the number of packets dropped due to non-fault reasons. Finally, by comparing the UM with DM, FDM calculates the packet loss rate of the link as 8% ((100-90-2)/100), rather than 10%.
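
The same calculation, expressed in a few lines of Python with the values taken from this example:

   um_passed, dm_passed, non_fault_drops = 100, 90, 2   # UM=[100,0], DM=[90,2]
   loss_rate = (um_passed - dm_passed - non_fault_drops) / um_passed
   assert loss_rate == 0.08   # 8% fault-induced loss, not 10%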

4.2. Failure Notification Mechanism

Traditional control plane rerouting schemes require several steps after detecting a failure, including failure notification, route learning, and routing table updates, which can take several seconds to modify traffic paths. Data plane rerouting schemes, on the other hand, cannot prepare alternative routes for all possible failure scenarios in advance. To achieve fast rerouting in the data plane, FFT combines INT with source routing to quickly reroute traffic.

Assume that the sender periodically sends INT probe packets along the path of the traffic to collect fine-grained network information, such as port rates and queue lengths. After a switch detects a link failure, it promptly notifies the sender of the failure information within the INT probe. Specifically, when a probe emitted by an end host is about to be forwarded to an egress link that has failed, FFT immediately bounces the probe back within the data plane and marks the failure status in the probe. Finally, the probe carrying the failure status returns to the sender.
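
The bounce behavior can be sketched as follows. The IntProbe class, the function name, and the link representation are hypothetical; a real implementation performs the equivalent operations in the switch egress pipeline.

   from dataclasses import dataclass

   @dataclass
   class IntProbe:
       src: str
       dst: str
       failure_flag: bool = False

   def handle_probe_at_egress(probe, egress_link, failed_links):
       # If the probe is about to leave on a failed link, mark the
       # failure status and bounce it back toward the sender.
       if egress_link in failed_links:
           probe.failure_flag = True
           probe.src, probe.dst = probe.dst, probe.src
           return "bounced"
       return "forwarded"

   probe = IntProbe(src="Endhost#1", dst="Switch#4")
   handle_probe_at_egress(probe, ("Switch#2", "Switch#4"),
                          failed_links={("Switch#2", "Switch#4")})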

4.3. Path Management Mechanism

To enable sender-driven fast rerouting, the sender needs to maintain a path information table in advance so that it can quickly switch to another available path upon detecting a network failure. Specifically, within the transport layer protocol stack of the sender, this document describes a Path Management Mechanism (PMM), which periodically probes all available paths to other destinations. Alternatively, this information can be obtained through other means, such as from an SDN controller. For a new flow, the sender selects an available path (e.g., at random among the best candidates) from the path information table and uses source routing (e.g., SRv6 [RFC8986] and SR-MPLS [RFC8660]) to control the path of this flow. Similarly, the sender also controls the path of the INT probes using source routing, allowing them to probe the path taken by the traffic flow. The fine-grained network information brought back by these probes can be used for congestion control, such as HPCC [hpcc].

When the failure notification mechanism described above takes effect and the INT information makes the sender aware of a failure on the path, the sender immediately marks this path as faulty in the path information table and chooses another available path, modifying the source routing headers of both the data packets and the INT probes accordingly. To keep track of the availability of other paths, PMM periodically probes them and updates the path information table, covering both failure onset and failure recovery.
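
A sender-side path information table along the lines described above might look like the following sketch. The structure, field names, and selection policy are assumptions; a real stack would carry SRv6 or SR-MPLS segment lists and path metrics.

   import random

   class PathTable:
       def __init__(self):
           # destination -> list of {"segments": [...], "healthy": bool}
           self.paths = {}

       def add_path(self, dst, segments):
           self.paths.setdefault(dst, []).append(
               {"segments": segments, "healthy": True})

       def mark_faulty(self, dst, segments):
           for p in self.paths.get(dst, []):
               if p["segments"] == segments:
                   p["healthy"] = False

       def pick_path(self, dst):
           healthy = [p for p in self.paths.get(dst, []) if p["healthy"]]
           return random.choice(healthy) if healthy else None

   table = PathTable()
   table.add_path("Endhost#2", ["Switch#1", "Switch#2", "Switch#4"])
   table.add_path("Endhost#2", ["Switch#1", "Switch#3", "Switch#4"])
   table.mark_faulty("Endhost#2", ["Switch#1", "Switch#2", "Switch#4"])
   new_path = table.pick_path("Endhost#2")   # reroute onto the healthy path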

5. Security Considerations

TBD.

6. IANA Considerations

This document makes no request of IANA.

Note to RFC Editor: this section may be removed on publication as an
RFC.

Acknowledgements

TBD.

References

Normative References

[RFC9232]
Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and A. Wang, "Network Telemetry Framework", RFC 9232, DOI 10.17487/RFC9232, May 2022, <https://www.rfc-editor.org/rfc/rfc9232>.
[RFC8986]
Filsfils, C., Ed., Camarillo, P., Ed., Leddy, J., Voyer, D., Matsushima, S., and Z. Li, "Segment Routing over IPv6 (SRv6) Network Programming", RFC 8986, DOI 10.17487/RFC8986, February 2021, <https://www.rfc-editor.org/rfc/rfc8986>.
[RFC8660]
Bashandy, A., Ed., Filsfils, C., Ed., Previdi, S., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing with the MPLS Data Plane", RFC 8660, DOI 10.17487/RFC8660, December 2019, <https://www.rfc-editor.org/rfc/rfc8660>.
[RFC8402]
Filsfils, C., Ed., Previdi, S., Ed., Ginsberg, L., Decraene, B., Litkowski, S., and R. Shakir, "Segment Routing Architecture", RFC 8402, DOI 10.17487/RFC8402, July 2018, <https://www.rfc-editor.org/rfc/rfc8402>.
[RFC5880]
Katz, D. and D. Ward, "Bidirectional Forwarding Detection (BFD)", RFC 5880, DOI 10.17487/RFC5880, June 2010, <https://www.rfc-editor.org/rfc/rfc5880>.
[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.

Authors' Addresses

Dan Li
Tsinghua University
Beijing
China
Kaihui Gao
Zhongguancun Laboratory
Beijing
China
Shuai Wang
Zhongguancun Laboratory
Beijing
China
Li Chen
Zhongguancun Laboratory
Beijing
China
Xuesong Geng
Huawei
Beijing
China