Coordinated Congestion Management

Internet-Draft	CCM	April 2024
Lyu, et al.	Expires 21 October 2024

Abstract

AI fabrics are sensitive to bandwidth. Congestion management, which includes congestion control and load balancing, is a primary means of fully utilizing network resources. However, current congestion management mechanisms are not coordinated, which leads to reduced throughput. This document provides a scheme to coordinate different congestion management mechanisms. It describes the design principles and the behaviors of network switches and hosts in the scheme, and gives an example of the end-to-end procedure.

1. Introduction

ML/AI has progressed rapidly over the last decade. ChatGPT is a milestone of generative AI and has ignited the industry's enthusiasm for large AI models. A single AI accelerator, or a single server with multiple AI accelerators, cannot train these large models because it lacks sufficient memory and compute power. It is therefore imperative to employ distributed systems with parallel processing to train those models.

AI training is bandwidth sensitive. Taking data parallelism and MoE, two parallel processing methods commonly used in AI training, as examples, the required bandwidth is at the GB level. That poses a big challenge to AI fabrics. Increasing link speed, from 400Gbps to 800Gbps or even 1.6Tbps in the future, is an important approach. Beyond that, how to use the bandwidth effectively also becomes a critical issue. The link bandwidth is expected to be fully utilized to achieve high throughput. Network congestion is a major problem that degrades performance, so congestion management is always applied in the network to alleviate it. Usually, congestion management includes congestion control and load balancing. Today, however, congestion control and load balancing work independently, without any coordination.

This document discusses the lack of coordination among current congestion management mechanisms, which leads to throughput issues that are particularly harmful in AI fabrics. It proposes a scheme for coordinating different congestion management mechanisms that can be deployed effectively and widely in AI fabrics.

4. Existing congestion management

Congestion management includes congestion control and load balancing. Flow control such as PFC is not discussed in this document. It is useful as the last line of defense against packet loss, but this document does not count it as part of congestion management.

There are many congestion control mechanisms, such as DCQCN [DCQCN] and Timely [Timely]. Although they follow different procedures and use different algorithms, their common purpose is to control the sending rate at the source. Basically, congestion control identifies network congestion from network status, such as the queue length of a switch port or the end-to-end round-trip time (RTT), and then adjusts the sending rate at the sender to alleviate congestion. Two questions are essential to a congestion control mechanism: how to bring the rate down quickly enough to avoid packet loss, and how to recover the rate so that throughput loss is minimized.

Load balancing, by contrast, alleviates congestion by adjusting the forwarding paths of traffic. ECMP is one form of load balancing. It hashes each flow onto a specific path based on the 5-tuple of the flow. This does not work well for AI workloads, because AI traffic consists of a small number of flows, most of which are large, so ECMP cannot distribute the traffic evenly across the network. Adaptive routing is therefore preferred. Adaptive routing changes the path of a single flow according to network status. For example, flow 1 originally uses path 1 for forwarding. When a network switch detects that the path is becoming heavy-loaded, it selects another, light-loaded path, path 2, for the following packets in the flow. The path status can be indicated by local link status, downstream link status, or both. How to judge whether a path is heavy-loaded is implementation dependent. Adaptive routing can select a path for each packet, thus using network resources in the most efficient way. However, avoiding unnecessary path switching is critical, because each path switch may increase system complexity, for example through packet re-ordering. Another load balancing mechanism is packet spray, in which the source host or network switch distributes packets evenly across all paths without considering actual path status. Compared with adaptive routing, it is easier to implement but less optimal. This document focuses on adaptive routing; the proposed scheme is also applicable to packet spray.
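
As a rough illustration of why per-flow hashing can leave a handful of large flows unevenly spread, the following Python sketch pins each flow to one path by hashing its 5-tuple. The CRC32 hash, the addresses, ports, flow sizes, and path count are all assumptions made for this example; real switches use their own hash functions.

```python
import zlib

def ecmp_path(five_tuple, num_paths):
    """Map a flow to a path index by hashing its 5-tuple.

    CRC32 is an illustrative stand-in, not the hash of any real switch.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    return zlib.crc32(key) % num_paths

def load_per_path(flows, num_paths):
    """Sum the bytes carried by each path when every flow is pinned to one path."""
    load = [0] * num_paths
    for five_tuple, size_bytes in flows:
        load[ecmp_path(five_tuple, num_paths)] += size_bytes
    return load

# Four large flows (hypothetical addresses, ports, and sizes) over four paths.
flows = [
    (("192.0.2.1", "192.0.2.9", 4791, 4791, "UDP"), 10**9),
    (("192.0.2.2", "192.0.2.9", 4791, 4791, "UDP"), 10**9),
    (("192.0.2.3", "192.0.2.9", 4791, 4791, "UDP"), 10**9),
    (("192.0.2.4", "192.0.2.9", 4791, 4791, "UDP"), 10**9),
]
load = load_per_path(flows, num_paths=4)
print(load)
```

Because every packet of a flow follows the hashed path regardless of load, nothing prevents two or more of these flows from landing on the same path while another path stays idle. Whether that happens for a given set of flows depends on the hash.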

Currently, congestion control and adaptive routing work independently, without coordination, which harms system performance. For example, when congestion caused by imbalanced load occurs on a switch, both DCQCN and adaptive routing are activated. ECN is marked in data packets, causing CNPs to be sent back to the sender, so the sender reduces the sending rate of the congested flow. Meanwhile, the switch changes the path for packets of the congested flow, steering newly arriving packets to a light-loaded path. The result is that the congested flow is forwarded on the light-loaded path at a low rate, and DCQCN then needs some time to recover the sending rate on the new path. This reduces effective bandwidth and seriously impacts computation efficiency in AI training. As another example, if the congestion is caused by in-cast traffic, congestion control alone should be enough. Additional adaptive routing adjustments not only fail to mitigate the congestion, but may also introduce more out-of-order packets.

The underlying problem is that current congestion management does not distinguish the cause of congestion; it simply triggers its mechanisms whenever congestion is detected. In principle, in-cast congestion cannot be mitigated by load balancing, and using congestion control to reduce the flow rate in response to congestion caused by load imbalance (in-network congestion) decreases network efficiency.

5. Design principle of coordinated congestion management

Coordinated congestion management is designed to coordinate congestion control and adaptive routing. The design principles are as follows.

Avoid unnecessary sending rate reduction
AI fabrics are bandwidth sensitive, and high throughput is extremely important. Multipath is needed to make full use of network bandwidth. Slowing down the sending rate while paths are still available for the traffic wastes network resources, thereby increasing communication time in the AI cluster and reducing AI training performance.
Fully use multipath while reducing invalid path switching
When searching for light-loaded paths for load balancing, new paths should be located quickly and accurately. The search should not be restricted to local paths but should extend to available paths upstream. Invalid path switching should be avoided. One example of invalid path switching is switching in-cast traffic: no matter which path the traffic is switched to, it will eventually be congested at the last hop.
Reuse current CC algorithm and AR algorithm
A variety of CC and AR algorithms already exist. They can still be used in the congestion management coordination scheme. The scheme enables CC and AR to be triggered in a coordinated manner, adjusting the sending rate or switching paths depending on the cause of congestion.
Applicable to various topologies
Most AI fabrics use Clos or fat-tree topologies, but there are also new studies considering the use of direct topologies, such as torus, dragonfly, and dragonfly+. Some existing solutions for CC and AR coordination, e.g., PLB [PLB], rely on ECMP, which can only be used in topologies with equal-cost paths, such as Clos. For topologies without equal-cost paths, such as dragonfly+, such solutions do not work. The coordination scheme should be applicable to different topologies.

6. Coordinated congestion management scheme

The key to coordinated congestion management is to distinguish CC traffic from non-CC traffic so that they are treated differently in the network when congestion occurs. CC traffic consists of the packets that cause in-cast congestion. Non-CC traffic consists of all other packets in the network.

CC traffic recognized by the network is reported to the source host, which tags the subsequent packets of the same flow. The tag instructs network switches to apply the CC mechanism to those packets instead of AR. For non-CC traffic, the network switch first performs AR. Only when the AR mechanism cannot find a light-loaded path to switch to does the traffic become CC traffic, and CC is then run to alleviate congestion.
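
The rule above can be summarized in a few lines of Python. This is only a sketch of the decision order; the names `Action` and `choose_action` are hypothetical and do not appear in this document.

```python
from enum import Enum

class Action(Enum):
    ADAPTIVE_ROUTING = "AR"
    CONGESTION_CONTROL = "CC"

def choose_action(cc_indicator_set: bool, light_loaded_path_available: bool) -> Action:
    """Pick the mechanism a congested switch applies to a packet.

    cc_indicator_set: the source host tagged the packet as CC traffic.
    light_loaded_path_available: AR can find a light-loaded path to switch to.
    """
    if cc_indicator_set:
        # Tagged (in-cast) traffic is handled by CC instead of AR.
        return Action.CONGESTION_CONTROL
    if light_loaded_path_available:
        # Non-CC traffic tries AR first.
        return Action.ADAPTIVE_ROUTING
    # No light-loaded path: the traffic becomes CC traffic.
    return Action.CONGESTION_CONTROL
```

Note that rate reduction is the fallback for non-CC traffic, which matches the design principle of avoiding unnecessary sending rate reduction.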

Coordinated congestion management requires interaction between network switches and source hosts. The following sections explain the details of the scheme.

6.1. Coordination tag

The coordination tag is inserted into data packets by the source host when it sends the packets. The tag contains a CC indicator and an AR indicator.

CC indicator: indicates whether the packet may cause in-cast congestion.
AR indicator: indicates the location of the upstream AR point where adaptive routing can be performed. The AR point can be a network switch or a source host. The AR indicator can be an ID, an IP address, or other information that tells a switch how to send a message to the AR point.

The tag can be carried in data packets using an in-band telemetry scheme. A newer method, CSIG [I-D.draft-ravi-ippm-csig], may provide another option.
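
This document does not define an encoding for the tag. Purely as an in-memory illustration, the two indicators could be modeled as follows. The class name, field names, and the use of `None` for "no upstream AR point" are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoordinationTag:
    # True when the packet may cause in-cast congestion.
    cc_indicator: bool = False
    # ID or IP address of the upstream AR point; None means no AR point is known.
    ar_indicator: Optional[str] = None

    def upstream_ar_available(self) -> bool:
        """Report whether the tag points to an upstream AR point."""
        return self.ar_indicator is not None

# A freshly sent packet of a non-in-cast flow: CC indicator clear, no AR point yet.
tag = CoordinationTag()
```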

6.2. Notification message

There are three types of notification.

Type 1: congestion control required
Example: A type 1 message is sent from the switch experiencing in-cast congestion to the source host, notifying the source host to tag (set the CC indicator in) the packets belonging to the flow that causes the in-cast congestion.
Type 2: congestion control released
Example: When the in-cast congestion is eliminated, the switch sends a type 2 message to the corresponding source hosts, notifying them to clear the CC indicator in the subsequent packets of the corresponding flow.
Type 3: upstream AR required
Example: If the switch determines that AR should be performed upstream, it sends a type 3 message to the upstream AR point. The upstream AR point can be a one-hop neighbour of the switch or a point multiple hops away.

The notification message includes the source IP, destination IP, notification type, and flow key. The source IP is the IP address of the switch that sends the notification. The destination IP is the IP address of the destination that will handle the notification message. The notification type is one of the three types above. The flow key is information identifying the flow to be handled, such as its 5-tuple.
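
The four fields listed above could be represented as in the following sketch. No wire format is implied; the class names, enum member names, and the example addresses (taken from the documentation range 192.0.2.0/24) are assumptions.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Tuple

class NotificationType(IntEnum):
    CC_REQUIRED = 1           # Type 1: congestion control required
    CC_RELEASED = 2           # Type 2: congestion control released
    UPSTREAM_AR_REQUIRED = 3  # Type 3: upstream AR required

# Flow key modeled as a 5-tuple: source IP, destination IP,
# source port, destination port, protocol.
FlowKey = Tuple[str, str, int, int, str]

@dataclass(frozen=True)
class Notification:
    source_ip: str        # switch that sends the notification
    destination_ip: str   # host or AR point that handles it
    notification_type: NotificationType
    flow_key: FlowKey

# Example: a last-hop switch asks a source host to tag one flow.
msg = Notification(
    source_ip="192.0.2.254",
    destination_ip="192.0.2.1",
    notification_type=NotificationType.CC_REQUIRED,
    flow_key=("192.0.2.1", "192.0.2.9", 4791, 4791, "UDP"),
)
```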

6.3. Behavior of network switches

6.3.1. Identify congestion type

When congestion is detected, the network switch determines whether it is in-cast congestion.

If congestion occurs at a switch egress port and the switch is the last-hop switch to the destination host, the congestion is determined to be in-cast congestion. The flows causing the in-cast congestion are identified as in-cast flows.

There may be other methods to identify the congestion type. This document does not restrict them.

6.3.2. Notify CC congestion

When the network switch determines that congestion is in-cast congestion, it generates a type 1 notification message for each identified flow and sends the messages to the source hosts of the flows. When the congestion is eliminated, the switch sends type 2 notification messages to the source hosts.
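
A minimal sketch of the last-hop check from Section 6.3.1 combined with the notifications described here. The `Flow` and `Notice` types and the function names are hypothetical, and how a switch learns which flows traverse the congested port is left out.

```python
from typing import List, NamedTuple

class Flow(NamedTuple):
    flow_key: str
    source_host_ip: str

class Notice(NamedTuple):
    notification_type: int  # 1 = CC required, 2 = CC released
    destination_ip: str
    flow_key: str

def is_incast(egress_port_congested: bool, is_last_hop: bool) -> bool:
    """In-cast congestion: a congested egress port on the last-hop switch."""
    return egress_port_congested and is_last_hop

def cc_required_notices(flows: List[Flow], egress_port_congested: bool,
                        is_last_hop: bool) -> List[Notice]:
    """Build one type 1 notification per flow when the congestion is in-cast."""
    if not is_incast(egress_port_congested, is_last_hop):
        return []
    return [Notice(1, f.source_host_ip, f.flow_key) for f in flows]

def cc_released_notices(flows: List[Flow]) -> List[Notice]:
    """Build one type 2 notification per flow once the congestion is eliminated."""
    return [Notice(2, f.source_host_ip, f.flow_key) for f in flows]
```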

6.3.3. Notify upstream point to perform AR

When the network switch determines that AR should be performed but cannot perform it locally, and the AR indicator in the data packet shows that AR is available upstream, it sends a type 3 notification message to the upstream point identified by the AR indicator.

6.3.4. Perform congestion control

The network switch performs congestion control in the following cases.

The congestion is identified as in-cast congestion.
The congestion is not identified as in-cast congestion, but adaptive routing cannot be used because no new path is available for traffic switching, either locally or upstream.

This document does not limit which CC mechanism is used.

6.3.5. Perform adaptive routing

The network switch performs adaptive routing in the following cases.

The packet is not in-cast traffic. The CC indicator in the data packet is used to determine whether it is in-cast traffic.
A type 3 notification message is received. Based on the flow information in the notification, a new path is selected for the subsequent packets of the flow.

To enable upstream AR, the AR indicator in data packets must be updated hop by hop. When a data packet arrives at a network switch:

If several local light-loaded paths are available for AR on the switch, the switch updates the AR indicator in the data packet to refer to itself, for example by writing its own ID. The switch then selects an appropriate local path on which to send the data packet. This document does not define the algorithm for local path selection; it depends on the routing strategy of the network switch.
If only one local light-loaded path is available for AR, the network switch can only select that path for the traffic. The AR indicator in the data packet is not updated.
If there is no local light-loaded path, the network switch learns whether upstream AR is available by reading the AR indicator in the data packet. If the AR indicator shows that an upstream point can perform AR, the network switch generates a type 3 notification message and sends it directly to that upstream point. Otherwise, the network switch triggers the congestion control mechanism, for example by setting ECN in the data packet.
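
The three per-hop cases can be sketched as one function. The `Packet` type, the action strings, and the choice of the first listed path are placeholders introduced for this example; local path selection is explicitly outside the scope of this document.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Packet:
    cc_indicator: bool = False
    ar_indicator: Optional[str] = None  # ID of the upstream AR point, if any

def process_at_switch(switch_id: str, packet: Packet,
                      light_loaded_paths: List[str]) -> Tuple[str, Optional[str]]:
    """Return (action, detail) for a data packet arriving at a switch."""
    if packet.cc_indicator:
        # In-cast traffic is not subject to AR; congestion is left to CC.
        return ("skip_ar", None)
    if len(light_loaded_paths) > 1:
        # Several candidates: this switch becomes the AR point for the packet.
        packet.ar_indicator = switch_id
        # Placeholder selection: the real choice depends on routing strategy.
        return ("forward", light_loaded_paths[0])
    if len(light_loaded_paths) == 1:
        # Only one choice: forward on it and leave the AR indicator untouched.
        return ("forward", light_loaded_paths[0])
    if packet.ar_indicator is not None:
        # No local path: ask the recorded upstream AR point (type 3 notification).
        return ("notify_upstream_ar", packet.ar_indicator)
    # No path locally or upstream: fall back to CC, e.g. by setting ECN.
    return ("congestion_control", None)
```

Leaving the AR indicator untouched in the single-path case means that a downstream switch with no local alternative still knows the nearest upstream point that had a real choice.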

6.4. Behavior of source hosts

When receiving a type 1 notification message, the source host sets the CC indicator in the subsequent packets of the corresponding flow.

When receiving a type 2 notification message, the source host clears the CC indicator in the subsequent packets of the corresponding flow.

When receiving a type 3 notification message, the source host performs AR on the subsequent packets of the corresponding flow.

When receiving congestion control signals while the CC indicator is set, the source host performs CC on the flow.
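
The host-side rules amount to a small amount of per-flow state. The following sketch assumes a hypothetical `SourceHost` class that tracks flows by an opaque flow key; how the host actually performs AR or CC is not modeled.

```python
class SourceHost:
    def __init__(self):
        self.cc_flows = set()      # flows whose packets carry the CC indicator
        self.ar_requested = set()  # flows for which upstream AR was requested

    def on_notification(self, notification_type: int, flow_key: str) -> None:
        """Update per-flow state from a type 1, 2, or 3 notification."""
        if notification_type == 1:
            self.cc_flows.add(flow_key)
        elif notification_type == 2:
            self.cc_flows.discard(flow_key)
        elif notification_type == 3:
            self.ar_requested.add(flow_key)
        else:
            raise ValueError(f"unknown notification type: {notification_type}")

    def cc_indicator_for(self, flow_key: str) -> bool:
        """CC indicator value to place in the next packet of the flow."""
        return flow_key in self.cc_flows

    def should_perform_cc(self, flow_key: str, congestion_signal: bool) -> bool:
        """Perform CC only when a congestion signal arrives for a tagged flow."""
        return congestion_signal and flow_key in self.cc_flows
```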

7. An example of end-to-end procedure

The network topology is shown in Figure 1. It is a 4-layer fat-tree topology with n computing racks and m switching racks. Computing racks contain source hosts, layer 1 switches, and layer 2 switches. Switching racks contain layer 3 and layer 4 switches.

      Switching Rack 1    Switching Rack m
      +---------------+   +---------------+
      |L4-1-1...L4-1-e|   |L4-m-1...L4-m-e|
      |  | \    / |   |   |  | \    / |   |
      |  |  \  /  |   |   |  |  \  /  |   |
      |  |   \/   |   |   |  |   \/   |   |
      |  |   /\   |   |...|  |   /\   |   |
      |  |  /  \  |   |   |  |  /  \  |   |
      |  | /    \ |   |   |  | /    \ |   |
      |L3-1-1...L3-1-d|   |L3-m-1...L3-m-d|
      +--+-----------\    +-/----------+--+
         |            \    /           |
         |             \  /            |
         |  ......      \/     ......  |
         |              /\             |
         |             /  \            |
         |            /    \           |
      +--+-----------/      \----------+---+
      |L2-1-1...L2-1-c|    |L2-n-1...L2-n-c|
      |  | \    / |   |    |  | \    / |   |
      |  |  \  /  |   |    |  |  \  /  |   |
      |  |   \/   |   |    |  |   \/   |   |
      |  |   /\   |   |... |  |   /\   |   |
      |  |  /  \  |   |    |  |  /  \  |   |
      |  | /    \ |   |    |  | /    \ |   |
      |L1-1-1...L1-1-b|    |L1-n-1...L1-n-b|
      |  +        +   |    |  +        +   |
      | H-1-1... H-1-a|    | H-n-1... H-n-a|
      +---------------+    +---------------+
      Computing Rack 1     Computing Rack n

Figure 1: Network Topology

Host H-1-1 in computing rack 1 sends a data packet P1 belonging to flow F1 to H-n-1 in computing rack n. The CC indicator in the packet tag is not set, indicating that the packet belongs to a non-in-cast flow. The AR indicator in the packet tag does not point to any available AR point.
P1 arrives at switch L1-1-1 in computing rack 1. L1-1-1 has multiple light-loaded paths for AR. The path from L1-1-1 to L2-1-1 is selected for P1. The AR indicator in the P1 tag is updated to L1-1-1.
P1 arrives at switch L2-1-1. L2-1-1 also has multiple light-loaded paths for AR. The path from L2-1-1 to L3-1-1 is selected for P1. The AR indicator in the P1 tag is updated to L2-1-1.
P1 arrives at switch L3-1-1. L3-1-1 has only one light-loaded path. The only path, from L3-1-1 to L4-1-1, is selected for P1. The AR indicator in the P1 tag remains L2-1-1.
P1 arrives at switch L4-1-1. L4-1-1 is congested and has no local path available for AR. By reading the AR indicator in P1, L4-1-1 sends a type 3 notification to L2-1-1.
After receiving the AR notification, L2-1-1 switches the path from L2-1-1->L3-1-1 to L2-1-1->L3-m-1 for the newly arriving packets of flow F1.
After a while, L1-n-1 becomes congested due to in-cast. Flow F1 is identified as an in-cast flow. L1-n-1 sends a type 1 notification to H-1-1.
On receiving the type 1 notification, H-1-1 sets the CC indicator in the subsequent packets of F1, indicating that the packets belong to an in-cast flow. AR is therefore not performed on those packets. The sending rate of F1 is also reduced according to the congestion control algorithm.
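
The upward half of this walk-through can be replayed with a short Python sketch that tracks only the AR indicator of P1. The function name is hypothetical, and "multiple light-loaded paths" is modeled as a count of 2, which is an assumption; the example does not give actual counts.

```python
from typing import List, Optional, Tuple

def walk_path(hops: List[Tuple[str, int]]) -> Tuple[List[Optional[str]], Optional[str]]:
    """Replay a packet over hops given as (switch ID, light-loaded path count).

    Returns the AR indicator after each hop and, if a hop has no light-loaded
    path, the upstream AR point that receives the type 3 notification
    (None means no AR point is recorded and CC is the fallback).
    """
    ar_indicator: Optional[str] = None
    trace: List[Optional[str]] = []
    for switch_id, light_paths in hops:
        if light_paths > 1:
            ar_indicator = switch_id   # this switch becomes the AR point
        elif light_paths == 0:
            trace.append(ar_indicator)
            return trace, ar_indicator  # notify the recorded AR point
        trace.append(ar_indicator)      # one path: indicator unchanged
    return trace, None

hops = [("L1-1-1", 2), ("L2-1-1", 2), ("L3-1-1", 1), ("L4-1-1", 0)]
trace, notified = walk_path(hops)
print(trace)     # ['L1-1-1', 'L2-1-1', 'L2-1-1', 'L2-1-1']
print(notified)  # L2-1-1
```

The type 3 notification goes to L2-1-1 rather than L3-1-1 because L3-1-1 had only one light-loaded path and therefore never wrote itself into the AR indicator.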
