Network Working Group                                            J. Zhao
Internet-Draft                                                     CAICT
Intended status: Informational                                  Q. Xiong
Expires: 24 April 2025                                   ZTE Corporation
                                                         21 October 2024


 Scenarios and Deployment Considerations for High Performance Wide Area
                                Network
                draft-zhao-hpwan-scenarios-deployment-00

Abstract

   This document describes the typical scenarios and deployment
   considerations for High Performance Wide Area Networks (HP-WANs). It
   also provides simulation results for data transmission in WANs and
   analyses the impact on throughput.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF). Note that other groups may also distribute
   working documents as Internet-Drafts. The list of current Internet-
   Drafts is at https://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on 24 April 2025.

Copyright Notice

   Copyright (c) 2024 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (https://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document. Code Components extracted from this document must
   include Revised BSD License text as described in Section 4.e of the
   Trust Legal Provisions and are provided without warranty as described
   in the Revised BSD License.

Zhao & Xiong              Expires 24 April 2025                 [Page 1]

Internet-Draft   Scenarios and Deployment Considerations    October 2024

Table of Contents

   1.  Introduction
     1.1.  Requirements Language
   2.  Terminology
   3.  Typical Scenarios for HP-WANs
     3.1.  Long-distance Data Transmission
     3.2.  Collaborative and Interactive Data Transmission
   4.  Deployment Considerations for HP-WANs
     4.1.  Host Optimization Deployment
     4.2.  WAN Optimization Deployment
     4.3.  Gateway Deployment
   5.  Simulation Results
     5.1.  The Impact of Long-distance Delay
     5.2.  The Impact of Packet Loss
   6.  Security Considerations
   7.  IANA Considerations
   8.  Acknowledgements
   9.  References
     9.1.  Normative References
   Authors' Addresses

1.  Introduction

   As described in [I-D.xiong-hpwan-uc-req-problem], High Performance
   Wide Area Networks (HP-WANs) place higher performance requirements on
   WANs. High-performance data transmission should provide low latency,
   high throughput and low CPU utilization, which can significantly
   improve the performance and efficiency of intra-DC and DC
   interconnection networks. Tests and deployments of long-distance,
   high-performance data transmission have already been carried out in
   operators' WANs, cloud service providers' DC interconnection networks
   and research institutions' private networks.
   However, providing high performance in long-distance, wide area
   network deployments still faces challenges:

   *  achieving high utilization and high throughput on long-distance
      links;

   *  efficient congestion control mechanisms that avoid packet loss;

   *  fair sharing of link bandwidth resources among multiple concurrent
      applications;

   *  packet ACK delay that grows with distance, which is challenging
      for high-performance applications, especially distributed
      processing models.

   This document describes the typical scenarios and deployment
   considerations for High Performance Wide Area Networks (HP-WANs). It
   also provides simulation results for data transmission in WANs and
   analyses the impact on throughput.

1.1.  Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and
   "OPTIONAL" in this document are to be interpreted as described in
   BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all
   capitals, as shown here.

2.  Terminology

   The terminology used in this document is defined in
   [I-D.xiong-hpwan-uc-req-problem].

3.  Typical Scenarios for HP-WANs

   Depending on transmission distance and deployment requirements,
   high-throughput transmission falls into two types of scenarios:
   high-volume data transmission over thousands of kilometers in WANs,
   and collaborative data transmission over hundreds of kilometers in
   MANs.

3.1.  Long-distance Data Transmission

   This category covers two scenarios: massive research data
   transmission between HPCs, and transmission of training samples
   between DCs for AI. The long-distance data transmission scenario is
   shown in Figure 1, where data flows are transmitted between two sites
   or DCs located 100 km to 1000 km apart.

                      +---100km~1000km---+
                      |                  |
      +--------+      |                  |      +--------+
      | Host A |------+       WAN        +------| Host B |
      +--------+      |                  |      +--------+
       Site/DC        |                  |       Site/DC
                      +------------------+

          Figure 1: Long-distance Data Transmission over WANs

   Massive research data transmission between HPCs: Big data migration
   over thousands of kilometers mainly refers to the high-throughput
   transmission of massive data sets between scientific research
   institutions. Research programs such as ESnet6 in the US and EuroHPC
   in the EU are deploying wide area high-performance networks to
   support the construction and operation of high-performance computing
   and data interconnection infrastructure. In this scenario, data is
   usually transferred on a schedule or on demand, each transfer ranges
   from a few terabytes to several hundred terabytes, and transmission
   cost has to be balanced against security.

   Data transmission of training samples between the DCs for AI: The
   construction of large-scale DCs for AI is limited by energy and land
   resources. Allocating training tasks to data centers where computing
   power and electricity are cheaper has become a cost-effective option.
   When the distance between DCs exceeds 1000 km, a wide area
   high-performance network is required to transmit training samples and
   corpus data at high throughput. Training large models on billions or
   trillions of tokens usually requires from several hundred terabytes
   to more than a petabyte of corpus data, with a large amount of data
   transferred per session, which places high demands on transmission
   throughput and stability.

3.2.  Collaborative and Interactive Data Transmission

   This category covers two scenarios: data transmission between DCs
   that separate storage from computing, and high-throughput data
   transmission between DCs for distributed intelligent computing.
   The collaborative and interactive data transmission scenario is shown
   in Figure 2, where data flows are transmitted between two or more DCs
   located 80 km to 100 km apart.

           +-------------80km~100km-----------------+
           |                                        |
      +----+----+                              +----+----+
      | Core DC |                              | Core DC |
      +----+----+             MAN              +----+----+
           |                                        |
           |                                        |
      +----+----+          +---------+         +----+----+
      | Edge DC +----------+ Edge DC +---------+ Edge DC |
      +---------+          +---------+         +---------+

    Figure 2: Collaborative and Interactive Data Transmission over MANs

   Storage and computing separation scenario: Cloud service providers
   deploy multiple data centers in a MAN (under 100 km), with storage
   and intelligent computing devices deployed separately. By extending
   the high-performance transmission technology originally used within a
   DC so that it runs across data centers, a DC cluster with separated
   storage and computing is constructed. In 2023, Amazon implemented a
   storage and computing separation data center with high-throughput
   data transmission over a MAN at 100 Gbps across 100 kilometers. In
   addition, the training samples of customers in industries such as
   government and finance are sensitive data, and the consequences of
   data leakage are very serious. The sample data needs to be stored in
   the customer's private DC and connected to the cloud service
   provider's DC for AI through a wide area high-performance network.

   Distributed coordination reasoning scenario: To improve the user
   experience of computing services, an architecture with centralized
   training and distributed inference is deployed. Training is carried
   out at core computing nodes that are far away from the user, while
   inference requests are answered at distributed edge nodes that are
   closer to the user, giving shorter latency and a better experience.
   Local sample data needs to be transmitted back between the core and
   edge DCs through a high-performance MAN to fine-tune and optimize the
   trained model. In addition, user inference requests and response
   data require low-latency transmission.

4.  Deployment Considerations for HP-WANs

4.1.  Host Optimization Deployment

   Host optimization deployment mainly runs an improved transport layer
   protocol on the NIC of the host server to achieve efficient
   long-distance transmission over lossy networks. The transport layer
   optimization may involve caching and reassembly of out-of-order
   packets, packet loss tolerance and error correction mechanisms for
   lossy networks, and so on. The host optimization deployment is shown
   in Figure 3.

     +--------------+      +---------------+      +--------------+
     |              |      |               |      |              |
+----+----+         |      |      WAN      |      |         +----+----+
| Host A  |         +------+    (lossy)    +------+         | Host B  |
+----+----+         |      |               |      |         +----+----+
     |DCN or        |      |               |      |DCN or        |
     |dedicated line|      |               |      |dedicated line|
     +--------------+      +---------------+      +--------------+
The NIC with transport                           The NIC with transport
protocol optimization                            protocol optimization

            Figure 3: Host Optimization Deployment Consideration

4.2.  WAN Optimization Deployment

   The WAN is optimized for packet loss, bandwidth utilization and
   latency to provide high-throughput data transmission between DCs.
   WAN optimization may involve path selection, congestion control, flow
   control and so on. Deterministic forwarding may also reduce the
   packet loss ratio, latency and jitter in wide area networks. The WAN
   optimization deployment is shown in Figure 4.

     +--------------+      +------------------+      +--------------+
     |              |      |                  |      |              |
+----+----+         |      |       WAN        |      |         +----+----+
| Host A  |         +------+(High performance)+------+         | Host B  |
+----+----+         |      |                  |      |         +----+----+
     |DCN or        |      |                  |      |DCN or        |
     |dedicated line|      |                  |      |dedicated line|
     +--------------+      +------------------+      +--------------+
                    The optimization of packet loss,
               bandwidth utilization, and latency in WAN

            Figure 4: WAN Optimization Deployment Consideration

4.3.  Gateway Deployment

   This solution requires gateway devices to be deployed at the DC edge
   to isolate or relay traffic between the data center and the wide area
   network. The gateway devices should support packet caching,
   buffering and retransmission for high-performance services. They
   should also collaborate and interact with the WAN by running
   optimized high-performance transport layer protocols, including
   awareness of high-performance services, route selection and
   congestion control. In addition, the gateway needs to map and
   convert between the different high-performance protocols running in
   the data center and in the WAN. The gateway deployment is shown in
   Figure 5.

                            +-------------+
+---------+   +---------+   |             |   +---------+   +---------+
| Host A  +---+ Gateway +---+     WAN     +---+ Gateway +---+ Host B  |
+---------+   +---------+   |   (Lossy)   |   +---------+   +---------+
                            +-------------+

               Figure 5: Gateway Deployment Consideration

5.  Simulation Results

5.1.  The Impact of Long-distance Delay

   Building on current implementations over 100 km, the delay parameters
   in this experiment target wide area scenarios of 100 km to 2000 km,
   with a round-trip time (RTT) of 1 ms to 20 ms. The experiment
   verifies the results incrementally from 100 km (1 ms delay) to
   2000 km (20 ms delay). The impact of long-distance delay on
   throughput is shown in Figure 6.
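
   The distance and delay pairs quoted above (100 km for 1 ms, 2000 km
   for 20 ms) are consistent with a propagation-only estimate. The
   following is a minimal sketch, assuming roughly 5 microseconds of
   one-way delay per kilometre of fibre and ignoring queuing, switching
   and the extra length of real fibre routes. The function name and the
   constant are illustrative and are not taken from the experiment.

```python
# Rough propagation-only RTT estimate for a fibre path.
# ASSUMPTION: ~5 us/km one-way delay in fibre (refractive index ~1.5);
# queuing, switching and serialization delays are ignored.
FIBRE_DELAY_US_PER_KM = 5.0


def propagation_rtt_ms(distance_km: float) -> float:
    """Return the round-trip propagation delay in milliseconds."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    one_way_us = distance_km * FIBRE_DELAY_US_PER_KM
    return 2.0 * one_way_us / 1000.0


if __name__ == "__main__":
    # Distances taken from the 100 km to 2000 km range discussed above.
    for km in (100, 200, 500, 1000, 2000):
        print(f"{km:>5} km -> {propagation_rtt_ms(km):.1f} ms RTT")
```

   Measured RTT on a real path will normally be higher than this
   estimate, since it adds device and queuing delay on top of
   propagation.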

+-------------+----------------------+---------------+----------------+
|RTT latency  |Message length (bytes)|Distance       |Throughput      |
+-------------+----------------------+---------------+----------------+
|less than 1ms|less than 1024        |less than 100km|>90% of 100 Gbps|
+-------------+----------------------+---------------+----------------+
| 1ms         | 256K                 | 100km         |>90% of 100 Gbps|
+-------------+----------------------+---------------+----------------+
| 2ms         | 512K                 | 200km         |>90% of 100 Gbps|
+-------------+----------------------+---------------+----------------+
| 5ms         | 1M                   | 500km         |>90% of 100 Gbps|
+-------------+----------------------+---------------+----------------+
| 10ms        | 8M                   | 1000km        |>90% of 100 Gbps|
+-------------+----------------------+---------------+----------------+

       Figure 6: The Impact of Long-distance Delay on Throughput

   The transmission performance of RDMA was verified in different
   network environments. The impact of long distance and latency on
   throughput is shown in Figure 6. As latency increases (1 ms to
   20 ms), the RDMA message size needs to be increased continuously to
   achieve high-performance transmission with 100% throughput. Because
   the maximum message length is 2 GB, a bandwidth of 100 Gbit/s can be
   achieved when there is no packet loss, which is consistent with the
   theoretical throughput equation:

      Throughput = Window_Size / RTT                                (1)

   The overall analysis shows that, by adjusting RDMA parameters such as
   the message length, high-performance transmission over 1000 km (with
   over 90% throughput) can be achieved. In practice, the message
   length setting depends on the specific network application, the
   device cache space and the cache threshold settings, so the message
   length cannot be increased without limit.

5.2.  The Impact of Packet Loss

   Traditional RDMA uses the Go-Back-N retransmission mechanism, which
   retransmits the dropped packet N and all packets sent after it.
   Packet loss can therefore cause significant performance degradation
   in RDMA. TCP, by contrast, only needs to retransmit the individual
   lost packets, and the latest RDMA network cards have started to use
   selective repeat. The TCP formula relating packet loss rate (p),
   message size (MSS), latency (RTT) and bandwidth capacity (C) can
   therefore be used as a reference:

      Throughput = Min{(MSS / RTT) * C * (1 / p)}                   (2)

   The measured performance of RDMA differs from that of TCP. The main
   impact of wide area networks is latency, while the retransmission and
   congestion control models are similar. The theoretical rate of RDMA
   is therefore estimated empirically by adjusting the value of
   parameter C in equation (2). (The TCP empirical value is C = 1.0.)

   When larger delay and packet loss coexist, achieving over 80%
   throughput on a 100G link requires the packet loss rate in the data
   center to be less than 0.005%. In the wide area DC interconnection
   scenario, propagation delay increases the retransmission cost and the
   response time, so the packet loss threshold is stricter than within
   the data center and the network needs to be as close to lossless as
   possible. In a wide area scenario, even with selective
   retransmission, it is difficult to achieve a bandwidth utilization of
   over 70% when the packet loss rate is less than 0.001%.

   In general, the network performance indicators for RDMA over a wide
   area of 1000 kilometers are as follows: the throughput of RDMA over a
   wide area is directly proportional to the message size, and inversely
   proportional to the network packet loss rate and the latency.
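
   Equations (1) and (2) can be explored numerically. The sketch below
   is illustrative only and makes several assumptions that are not
   stated in the text: the function and parameter names are
   hypothetical, Window_Size is read as the total number of bytes in
   flight, and the Min in equation (2) is read as a cap at the link
   rate. Equation (2) is coded as printed, with a 1/p term. The
   loss_exponent parameter is an added knob, since the widely cited TCP
   model by Mathis et al. uses 1/sqrt(p) (an exponent of 0.5).

```python
# Numerical sketch of equations (1) and (2).
# ASSUMPTIONS (not from the draft): names, units, the link-rate cap
# and the loss_exponent knob are illustrative.


def required_window_bytes(target_bps: float, rtt_s: float) -> float:
    """Equation (1) rearranged: Window_Size = Throughput * RTT, in bytes."""
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    return target_bps * rtt_s / 8.0


def loss_limited_throughput_bps(
    mss_bytes: float,
    rtt_s: float,
    loss_rate: float,
    link_bps: float,
    c: float = 1.0,
    loss_exponent: float = 1.0,
) -> float:
    """Equation (2) as printed: (MSS/RTT) * C * (1/p), capped at link rate.

    loss_exponent=1.0 reproduces the 1/p term; 0.5 gives 1/sqrt(p).
    """
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    model_bps = (mss_bytes * 8.0 / rtt_s) * c / (loss_rate ** loss_exponent)
    return min(link_bps, model_bps)


if __name__ == "__main__":
    link = 100e9  # 100 Gbit/s link, as in the simulation
    for rtt_ms in (1, 2, 5, 10, 20):
        window = required_window_bytes(link, rtt_ms / 1000.0)
        print(f"RTT {rtt_ms:>2} ms -> {window / 1e6:.1f} MB in flight")
    # 1460-byte segments are a hypothetical example value.
    for p in (5e-5, 1e-5):  # 0.005% and 0.001% loss
        bps = loss_limited_throughput_bps(1460, 0.010, p, link)
        print(f"p={p:.0e}, 10 ms RTT: {bps / 1e9:.1f} Gbit/s")
```

   Under equation (1), a 100 Gbit/s target at 10 ms RTT implies about
   125 MB in flight. That amount may be spread over several outstanding
   messages, and the text does not state the queue depth used in the
   experiment.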

   To ensure 80% throughput on links of 100 Gbps over 1000 kilometers,
   the message length needs to be greater than 512 KB, and the increased
   latency makes the packet loss rate requirement extremely strict.

6.  Security Considerations

   TBA

7.  IANA Considerations

   This document makes no requests for IANA action.

8.  Acknowledgements

   TBA

9.  References

9.1.  Normative References

   [I-D.xiong-hpwan-uc-req-problem]
              Xiong, Q., Yao, K., Huang, C., Zhengxin, H., and J. Zhao,
              "Use Cases, Requirements and Problems for High Performance
              Wide Area Network", Work in Progress, Internet-Draft,
              draft-xiong-hpwan-uc-req-problem-00, 12 October 2024.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119,
              DOI 10.17487/RFC2119, March 1997.

   [RFC3168]  Ramakrishnan, K., Floyd, S., and D. Black, "The Addition
              of Explicit Congestion Notification (ECN) to IP",
              RFC 3168, DOI 10.17487/RFC3168, September 2001.

   [RFC7424]  Krishnan, R., Yong, L., Ghanwani, A., So, N., and B.
              Khasnabish, "Mechanisms for Optimizing Link Aggregation
              Group (LAG) and Equal-Cost Multipath (ECMP) Component Link
              Utilization in Networks", RFC 7424, DOI 10.17487/RFC7424,
              January 2015.

   [RFC8174]  Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC
              2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174,
              May 2017.

   [RFC8664]  Sivabalan, S., Filsfils, C., Tantsura, J., Henderickx, W.,
              and J. Hardwick, "Path Computation Element Communication
              Protocol (PCEP) Extensions for Segment Routing", RFC 8664,
              DOI 10.17487/RFC8664, December 2019.

   [RFC9232]  Song, H., Qin, F., Martinez-Julia, P., Ciavaglia, L., and
              A. Wang, "Network Telemetry Framework", RFC 9232,
              DOI 10.17487/RFC9232, May 2022.

   [RFC9438]  Xu, L., Ha, S., Rhee, I., Goel, V., and L. Eggert, Ed.,
              "CUBIC for Fast and Long-Distance Networks", RFC 9438,
              DOI 10.17487/RFC9438, August 2023.

Authors' Addresses

   Junfeng Zhao
   CAICT
   Beijing
   China
   Email: zhaojunfeng@caict.ac.cn


   Quan Xiong
   ZTE Corporation
   China
   Email: xiong.quan@zte.com.cn