Internet Engineering Task Force                                  N. Kuhn
Internet-Draft                                       Thales Alenia Space
Intended status: Standards Track                              E. Stephan
Expires: 24 April 2025                                            Orange
                                                            G. Fairhurst
                                                               R. Secchi
                                                  University of Aberdeen
                                                              C. Huitema
                                                    Private Octopus Inc.
                                                         21 October 2024

         Convergence of Congestion Control from Retained State
                   draft-ietf-tsvwg-careful-resume-11

Abstract

This document specifies a cautious method for IETF transports that enables fast startup of congestion control for a wide range of connections. It reuses a set of computed congestion control parameters that are based on previously observed path characteristics between the same pair of transport endpoints. These parameters are saved, allowing them to be later used to modify the congestion control behavior of a subsequent connection. It describes assumptions and defines requirements for how a sender utilizes these parameters to provide opportunities for a connection to more rapidly get up to speed and rapidly utilize available capacity.

  1.4. Examples of Scenarios of Interest
  1.5. Design Principles
2. Language, Notation and Terms
  2.1. Requirements Language
  2.2. The Remote Endpoint
  2.3. Notation and Terms
3. The Phases of CC using Careful Resume
  3.1. Observing
  3.2. Reconnaissance Phase
  3.3. Unvalidated Phase
  3.4. Validating Phase
  3.5. Safe Retreat Phase
    3.5.1. Loss Recovery after entering Safe Retreat
  3.6. RTO Expiry while using Careful Resume
  3.7. Normal Phase
4. Implementation Notes and Guidelines
  4.1. Observing the Path Capacity
  4.2. Confirming the Path in the Reconnaissance Phase
    4.2.1. Confirming the Path
  4.3. Safety for the Unvalidated Phase
    4.3.1. Lifetime of CC Parameters
    4.3.2. Pacing in the Unvalidated Phase
    4.3.3. Exit from the Unvalidated Phase because of Variable Network Conditions
  4.4. The Validating Phase
  4.5. Implementation Notes for using BBR
  4.6. Safety in the Safe Retreat Phase
  4.7. Returning to Normal Congestion Control
  4.8. Limitations from Transport Protocols
5. QLOG support for QUIC
  5.1. cr_phase Event

1. Introduction

All Internet transports are required to either use a Congestion Control (CC) algorithm, or to constrain their rate of transmission [RFC8085]. In 2010, a survey of alternative CC algorithms [RFC5783] noted that there are challenges when a CC algorithm operates across an Internet path with a high and/or varying Bandwidth-Delay Product (BDP). This mechanism targets a solution for these challenges.

A CC algorithm typically takes time to ramp-up the sending rate, called the "Slow-Start phase", informally known as the time to "Get up to speed". This defines a time in which a sender intentionally uses less capacity than might be available, with the intention to avoid or limit overshoot of the available capacity for the path. Overshoot can increase queuing (latency or jitter) and/or congestion packet loss for the flow. Any overshoot can have a detrimental effect on other flows sharing a common bottleneck.
A sender can use a method that observes the rate of acknowledged data, and seek to avoid an overshoot of the bottleneck capacity (e.g., Hystart++ [RFC9406]). In the extreme case, an overshoot can result in persistent congestion with unwanted starvation of other flows [RFC8867] (i.e., preventing other flows from successfully sharing the capacity at a common bottleneck). The present document specifies a CC mechanism, called Careful Resume, which is expected to reduce the time to complete a transfer when the transfer sends significantly more data than allowed by the Initial congestion Window (IW), and where the BDP of the path is also significantly more than the IW. It introduces an alternative mechanism to select initial CC parameters, that seek to more rapidly and safely grow the sending rate controlled by the congestion window (CWND). CC algorithms that are rate-based can make similar adjustments to their target sending rate. Careful Resume is based on temporal sharing (sometimes known as caching) of a saved set of CC parameters that relate to previous observations of the same path. The parameters include: the saved_cwnd for the path and the minimum Round Trip Time (RTT). These parameters are saved and used to modify the CC behavior of a subsequent connection between the same endpoints. Some congestion control algorithms may use other parameters. For example, implementations using BBR also retain the value of the bottleneck bandwidth required to reach the capacity available to the flow (BBR.max_bw, see [I-D.cardwell-iccrg-bbr-congestion-control]). When used with the QUIC transport, this provides transport services that resemble those that could be implemented in TCP, using methods such as TCP Control Block (TCB) [RFC9040] caching. 1.1. Use of saved CC parameters by a Sender CC parameters are used by Careful Resume for three functions: 1. Information to confirm whether a saved path corresponds to the current path. 2. 
Information about the utilised path capacity to set CC parameters.
3. Information to check the CC parameters are not too old.

"Generally, implementations are advised to be cautious when using saved CC parameters on a new path", as stated in [RFC9000]. While this statement has been proposed in the context of QUIC standardization, this advice is appropriate for any IETF transport protocol. Care is therefore needed to assure safe use and to be robust to changes in traffic patterns, network routing, and link/node conditions. There are cases where using the saved parameters of a previous connection is not appropriate (see Section 3.2).

1.2. Receiver Preference

Whilst a sender could take optimization decisions without considering the receiver's preference, there are cases where a receiver could have information that is not available at the sender, or might benefit from understanding that Careful Resume might be used. In these cases, a receiver could explicitly ask to enable or inhibit Careful Resume when an application initiates a new connection.

Examples where a receiver might request to inhibit using Careful Resume include:

1. a receiver that can predict the pattern of traffic (e.g., insight into the volume of data to be sent, the expected length of a connection, or the requested maximum transfer rate);
2. a receiver with a local indication that a path/local interface has changed since the CC parameters were saved;
3. knowledge of the current hardware limitations at a receiver;
4. a receiver that can predict additional capacity will be needed for other concurrent or later flows (i.e., prefers to activate Careful Resume for a different connection).

A related document proposes an extension for QUIC that allows sender-generated CC parameters to be stored at the receiver [I-D.kuhn-quic-bdpframe-extension]. This avoids the need for a sender to retain transport state for each receiver.
It also allows the receiver to express a preference for whether a sender ought to use Careful Resume.

1.3. Transport Protocol Interaction

The CWND is one factor that limits the sending rate of a transport protocol. Other mechanisms also constrain the maximum sending rate. These include the sender pacing rate and the receiver-advertised window (or flow credit), see Section 4.8.

1.4. Examples of Scenarios of Interest

This section provides a set of examples where Careful Resume is expected to improve performance. Either endpoint can assume the role of a sender or a receiver. Careful Resume also supports a bidirectional data transfer, where both endpoints simultaneously send data (e.g., remote execution of an application, or a bidirectional video conference call).

Without a new method, each connection would need to individually discover appropriate CC parameters, whereas Careful Resume allows the flow to use a rate based on the previously observed CC parameters.

In another example, an application connects after a disruption had temporarily reduced the path capacity. When the endpoint returns to use the path using Careful Resume, the sending rate can be based on the previously observed CC parameters.

There is particular benefit for any path with an RTT that is much larger than typical Internet paths. In a specific example, an application connected via a satellite access network [IJSCN] could take 9 seconds to complete a 5.3 MB transfer using standard CC, whereas a sender using Careful Resume could reduce this transfer time to 4 seconds. The time to complete a 1 MB transfer could similarly be reduced by 62 % [MAPRG111]. This benefit is also expected for other sizes of transfer and for different path characteristics when a path has a large BDP.

1.5.
Design Principles Resuming a connection with parameters that were observed during a previous connection is inherently a tradeoff between the potential performance gains for the new connection and the risks of degraded performance for other connections that share a common bottleneck. We describe a careful process that is designed to obtain good performance when resuming is appropriate, while seeking to minimise the impact on other connections when it is not appropriate. The following design principles seek to mitigate the risk that a sender adds excessive congestion to an already congested path: The first precaution is to recognize whether the conditions have changed so much that the saved values are no longer valid. We describe that as the "reconnaissance phase". During that phase, the sender will not send more data than allowed for any new connection, e.g., using the recommended maximum IW for the first RTT of transmitting data [RFC9000]. The sender will only proceed with the resume process if the reconnaissance succeeds. If it fails, for example if previous packets in a connection experience congestion or the RTT is significantly different, the sender will follow the standard process for new connections. This provides some protection against aggravating severe congestion and to establish the minimum RTT. The second precaution is to cautiously use the saved parameters when resuming, in what we will call the "unvalidated phase". For example, the jump in the size of cwnd/rate is restricted to a fraction (1/2) of the saved cwnd, to avoid starving other flows that may have started or increased their capacity after the last measurement. The same principle applies for algorithms that use different parameters to classic TCP congestion control: do not push more than a fraction of the remembered values. For example, a connection using BBR will set the pacing rate to half the remembered value of the "bottleneck bandwidth". 
The sender also needs to pace all unvalidated packets, to ensure the rate does not exceed the previously used rate. This is intended to avoid a sudden influx of a large number of packets that could result in building bottleneck queues and disrupt existing flows. After successfully validating that the used rate did not result in congestion, the cwnd can be increased further.

The third precaution is to perform a "careful retreat" if the validation fails, for example if congestion is detected during validation. The risk here is that the trial use of the saved parameters could have disrupted existing connections. Consider, for example, a connection using classic TCP congestion control. In "slow start" mode, a Reno congestion control would normally converge on a "slow start threshold" set to half the volume of data in flight, but during this validation the value is restored from the saved parameters. The resultant sending rate could be much larger than the value that would have been reached by a "standard" slow start process, and the overload of the path could cause significant congestion to other flows. Instead of continuing with that "too large" value, the retreat process resets the congestion window to a value no greater than what a standard process would have discovered. For other congestion control algorithms, such as Cubic [RFC9438] or BBR, the implementation details may differ, but the principle remains: trying and failing should not confer an undue advantage (e.g., starving) over existing connections that share a common bottleneck.

2. Language, Notation and Terms

This section provides a brief summary of key terms and the requirements language.

2.1.
Requirements Language The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here. 2.2. The Remote Endpoint The Remote Endpoint is an implementation-dependent value that identifies the sender's view of the network path being used. This is used to match the current path with a set of CC parameters associated with a previously observed path. It includes: * an identifier representing the sending interface (e.g., a globally assigned address/prefix or other local identifier); * an identifier representing the destination (e.g., a name or IP address). The Remote Endpoint could include information such as the DSCP, the transport ports, a flow label, etc. This information needs to be set consistently for a resumed connection to the same endpoint. Although additional information could improve the path differentiation, it could reduce the re-usability of saved parameters. 2.3. Notation and Terms The document uses language drawn from a range of IETF RFCs. The following terms are defined: Beta: A scaling factor between 0.5 and 1, the default value is 0.5. Careful Resume (CR): The method specified in this document to select initial CC parameters and to more rapidly and safely increase the initial sending rate. CC parameters: A set of saved congestion control parameters from observing the capacity of an established connection (see Section 1.1). CWND: The congestion window, or equivalent CC variable limiting the maximum sending rate; current_remote_endpoint: The Remote Endpoint; current_rtt: A sample measurement of the current RTT; flight_size: The current volume of unacknowledged data; jump_cwnd: The resumed CWND, used in the Unvalidated Phase. LifeTime: The time for which the saved CC parameters can be safely re-used. 
max_jump: The configured maximum jump_cwnd;
PipeSize: A measure of the validated available capacity based on the acknowledged data;
Remote Endpoint: See Section 2.2;
saved_cwnd: The preserved capacity derived from observation of a previous connection (see Section 4.1);
saved_remote_endpoint: The Remote Endpoint associated with a set of CC parameters;
saved_rtt: The preserved minimum RTT (see Section 4.1).
Unvalidated Packet: A packet sent when the CWND has been increased beyond the size normally permitted by the congestion control algorithm; if such a packet is acknowledged, it contributes to the PipeSize, but if congestion is detected, it triggers entry to the Safe Retreat Phase.

3. The Phases of CC using Careful Resume

This section defines a series of phases that the congestion controller moves through as a connection uses Careful Resume.

Observing ...> Connect -> Reconnaissance --------------------> Normal
 (Normal)                     |                                  ^
                              v                                  |
                          Unvalidated ---------------------------+
                            |   |                                |
                            |   +--> Validating -----------------+
                            |             |                      |
                            |             |                      |
                            +-------------+--> Safe Retreat -----+

Figure 1: Key transitions between Phases in Careful Resume

Examples of the transitions between phases are provided in Appendix A.

3.1. Observing

An established connection in the Normal Phase can save a set of CC parameters for the specific path to the current endpoint. Each set of CC parameters includes the saved_remote_endpoint and the LifeTime (e.g., as a timestamp after which the parameters must not be used).

* Observing (saved_cwnd): The saved_cwnd is a measure of the currently utilised capacity for the connection, measured as the volume of bytes sent during an RTT. This could be computed by measuring the volume of data acknowledged in one RTT. If the measured CWND is less than four times the Initial Window (IW) a sender can choose to not save the CC parameters, because the additional actions associated with performing Careful Resume for a small CWND would not justify its use.
* Observing (saved_rtt): The minimum RTT is saved as the saved_rtt.

Implementation notes are provided in Section 4.1.

3.2. Reconnaissance Phase

A sender enters the Reconnaissance Phase after connection setup. In this phase, the CWND is initialised to the IW, and the sender transmits initial data. The CWND MAY be increased using normal CC as each ACK confirms delivery of previously unacknowledged data (i.e., the CC is unchanged).

The phase seeks to determine if the path is consistent with a previously observed path (saved as a set of CC parameters).

The following conditions need to be confirmed before the sender enters the Reconnaissance Phase:

* Reconnaissance Phase (Endpoint change): If the current_remote_endpoint is not the same as one of the saved_remote_endpoints, the sender MUST enter the Normal Phase. (A difference in the Remote Endpoint indicates that the network path differs from the one that was observed.)

* Reconnaissance Phase (Lifetime of saved CC parameters): The CC parameters are temporal. If the LifeTime of the observed CC parameters is exceeded, the CC parameters are not used and the sender enters the Normal Phase.

The following actions are performed during the Reconnaissance Phase:

* Reconnaissance Phase (Confirming the RTT): During this phase, a sender MUST record the minimum RTT for the current connection as the current_rtt.

* Reconnaissance Phase (Detected congestion): If the sender detects congestion (e.g., packet loss or ECN-CE marking), the sender MUST enter the Normal Phase to respond to the detected congestion.

* Reconnaissance Phase (Using saved_cwnd): Only one connection can use a specific set of saved CC parameters. If another connection has already started to use the saved_cwnd, the sender MUST enter the Normal Phase.

* Reconnaissance Phase (Path confirmed): When a sender has confirmed the RTT and also has received an acknowledgement for the initial data without reported congestion, it MAY then enter the Unvalidated Phase.
This transition occurs when more data is sent than normally permitted by the congestion control algorithm. If a sender is rate-limited [RFC7661], it might send insufficient data to be able to validate transmission at the higher rate. A sender is allowed to remain in the Reconnaissance Phase and to not transition to the Unvalidated Phase until there is more data in the transmission buffer than normally permitted by the congestion control algorithm.

When a path is not confirmed, Careful Resume is not used and the sender enters the Normal Phase with a CWND that was not modified by Careful Resume.

Implementation notes are provided in Section 4.2.

3.3. Unvalidated Phase

The Unvalidated Phase is designed to enable the CWND to more rapidly get up to speed by using paced transmission of a tentatively increased CWND.

The following conditions need to be confirmed before the sender enters the Unvalidated Phase:

* Unvalidated Phase (Confirming the path on entry): If the current_rtt is less than or equal to (saved_rtt / 2) or the current_rtt is greater than or equal to (saved_rtt x 10) (see Section 4.2.1), the sender MUST enter the Normal Phase (see trigger rtt_not_validated in Section 5).

On entry to the Unvalidated Phase, the sender:

* Unvalidated Phase (Initialising PipeSize): The variable PipeSize is initialised to the flight_size on entry to the Unvalidated Phase. This records the window before a jump is applied.

* Unvalidated Phase (Setting the jump_cwnd): To avoid starving other flows that could have either started or increased their use of capacity after the Observation Phase, the jump_cwnd MUST be no more than half of the saved_cwnd. Hence, jump_cwnd is less than or equal to Min(max_jump,(saved_cwnd/2)). CWND = jump_cwnd.

The following actions are performed during the Unvalidated Phase:

* Unvalidated Phase (Pacing transmission): All packets sent in the Unvalidated Phase MUST use pacing based on the current_rtt.
* Unvalidated Phase (Confirming the path during transmission): If a sender determines that the previous CC parameters are not valid (due to a detected path change), the Safe Retreat Phase is entered. (In the Unvalidated Phase, insufficient time has passed for a sender to receive feedback validating the jump in CWND. Therefore, any detected congestion must have resulted from packets sent before the Unvalidated Phase.)

* Unvalidated Phase (Completed sending all unvalidated packets): The sender enters the Validating Phase when the flight_size equals the CWND.

* Unvalidated Phase (Tracking PipeSize): The variable PipeSize is increased by the volume of data acknowledged by each received ACK. (This indicates a previously unvalidated packet has been successfully sent over the path.)

* Unvalidated Phase (Receiving acknowledgement for an unvalidated packet): The sender enters the Validating Phase when an acknowledgement is received for the first packet number (or higher) that was sent in the Unvalidated Phase or when more than 1 RTT has passed in the Unvalidated Phase (see first_unvalidated_packet_acknowledged in Section 5).

Implementation notes are provided in Section 4.3.

3.4. Validating Phase

The Validating Phase checks whether all packets sent in the Unvalidated Phase were received without inducing congestion. The CWND remains unvalidated and the sender typically remains in this phase for one RTT.

On entry to the Validating Phase, the sender:

* Validating Phase (Check flight_size on entry): On entry to the Validating Phase, if the flight_size is less than or equal to the PipeSize, the Normal Phase is entered with the CWND reset to the PipeSize. (The unvalidated part of the jump_cwnd was not utilised.)

* Validating Phase (Limiting CWND on entry): On entry to the Validating Phase (when flight_size is greater than the PipeSize), the CWND is set to the flight_size.
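The two on-entry checks for the Validating Phase can be sketched as follows. This is a minimal illustration and not part of the specification: the class CarefulResumeState, the function enter_validating, and the phase strings are hypothetical names, all sizes are assumed to be in bytes, and the first check is read as "flight_size is less than or equal to PipeSize".

```python
from dataclasses import dataclass


@dataclass
class CarefulResumeState:
    """Hypothetical holder for the per-connection variables named in the text."""
    cwnd: int           # CWND, in bytes
    flight_size: int    # volume of unacknowledged data, in bytes
    pipe_size: int      # PipeSize: validated capacity, in bytes
    phase: str = "unvalidated"


def enter_validating(state: CarefulResumeState) -> CarefulResumeState:
    """Apply the on-entry rules when leaving the Unvalidated Phase."""
    if state.flight_size <= state.pipe_size:
        # The unvalidated part of jump_cwnd was not utilised:
        # go straight to the Normal Phase with CWND reset to PipeSize.
        state.cwnd = state.pipe_size
        state.phase = "normal"
    else:
        # Limit CWND to the data actually in flight and keep validating.
        state.cwnd = state.flight_size
        state.phase = "validating"
    return state


# Example: a sender that only partly used a 100 kB jump.
partly_used = enter_validating(
    CarefulResumeState(cwnd=100_000, flight_size=80_000, pipe_size=30_000))
print(partly_used.phase, partly_used.cwnd)
```

In both branches the CWND ends up no larger than what the sender has actually used or validated, so an unused part of the jump is never carried forward.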
During the Validating Phase, the sender performs the following actions:

* Validating Phase (Tracking PipeSize): The PipeSize is increased by the volume of acknowledged data for each received ACK that indicates a packet was successfully sent over the path.

* Validating Phase (Updating CWND): The CWND is updated using the normal rules for the current congestion controller. This typically allows CWND to be increased for each received acknowledgement that indicates a packet has been successfully sent across the path.

* Validating Phase (Congestion indication): If a sender determines that congestion was experienced (e.g., packet loss or ECN-CE marking), Careful Resume enters the Safe Retreat Phase (see trigger packet_loss and ECN_CE in Section 5).

* Validating Phase (Receiving acknowledgement of the unvalidated packets): The sender enters the Normal Phase when an acknowledgement is received for the last packet number (or higher) that was sent in the Unvalidated Phase (see last_unvalidated_packet_acknowledged in Section 5). This means that the packets sent in the Unvalidated Phase were acknowledged without congestion.

When using BBR (Section 4.5), validation is performed using the regular BBR rules for exiting Startup. The measured delivery rate will reflect the actual capacity of the network. If congestion is experienced while the "carefully-resuming" flag is True (e.g., packet losses were observed), BBR exits the Startup state and enters the Drain state.

3.5. Safe Retreat Phase

This phase is entered when congestion is detected for an unvalidated packet. It drains the path of other unvalidated packets. (This trigger is the same as used by a

On entry to the Safe Retreat Phase, the sender:

* Safe Retreat Phase (Removing saved information): The set of saved CC parameters for the path are deleted, to prevent these from being used again by other flows.

* Safe Retreat Phase (Re-initializing CWND): The CWND MUST be reduced to no more than (PipeSize/2).
This avoids persistent starvation by allowing capacity for other flows to regain their share of the total capacity. The minimum CWND in QUIC is 2 packets (see Section 4.8 of [RFC9002]).

* Safe Retreat Phase (QUIC recovery): When the CWND is reduced, a QUIC sender can immediately send a single packet prior to the reduction [RFC9002]. (This speeds up loss recovery if the data in the lost packet is retransmitted and is similar to TCP as described in Section 5 of [RFC6675].)

In the Safe Retreat Phase, the sender performs the following actions:

* Safe Retreat Phase (Tracking PipeSize): The sender continues to update the PipeSize after processing each acknowledgement. (This PipeSize is used to reset the ssthresh when leaving this phase, it does not modify CWND.)

* Safe Retreat Phase (Maintaining CWND): The CWND MUST NOT be increased in the Safe Retreat Phase.

* Safe Retreat Phase (Acknowledgement of unvalidated packets): The sender enters the Normal Phase when the last packet (or a later packet) sent during the Unvalidated Phase has been acknowledged.

On leaving the Safe Retreat Phase, the ssthresh MUST be set to no larger than the most recently measured PipeSize x Beta, where Beta is a scaling factor between 0.5 and 1. The default value is 0.5, chosen to reduce the probability of inducing a second round of congestion. Cubic defines a Beta__cubic of 0.7 [RFC9438]. (The log is updated to exit_recovery, see Section 5.)

When using BBR, the Safe Retreat Phase is entered if the Drain state is entered while the "carefully-resuming" flag is still True, i.e., if fewer than 2 full rounds have elapsed after the sender entered the Unvalidated Phase. The delivery rates measured in these conditions are tainted, because packets sent during the attempt are still queued at the bottleneck and may have "pushed out" competing traffic.
The delivery rates measured in Drain state MUST be discarded if the "carefully-resuming" flag is set to True. This flag is cleared upon exiting the Drain state.

Implementation notes are provided in Section 4.6.

3.5.1. Loss Recovery after entering Safe Retreat

Unacknowledged packets that were sent in the Unvalidated Phase can be lost. Loss recovery commences using the reduced CWND that was set on entry to the Safe Retreat Phase and continues until acknowledgment of the last packet number (or a later packet) sent in the Unvalidated Phase. If the last unvalidated packet is not cumulatively acknowledged, then additional packets might need to be retransmitted.

3.6. RTO Expiry while using Careful Resume

A sender that experiences a Retransmission Time Out (RTO) expiry ceases to use Careful Resume. The sender enters the Normal Phase. If using BBR, the normal processing of packet losses will cause it to enter the Drain state while the "carefully-resuming" flag is set to True, implementing the Safe Retreat Phase (Section 4.5).

As in loss recovery, data sent in the Unvalidated Phase could be later acknowledged after an RTO event (see Section 3.5.1).

3.7. Normal Phase

In the Normal Phase, the sender transitions to using the normal CC algorithm (e.g., in congestion avoidance if CWND is more than ssthresh).

(Note that when the sender did not use the entire jump_cwnd the CWND was reduced on entering the Validating Phase.)

Implementation notes are provided in Section 4.7.

4. Implementation Notes and Guidelines

This section provides guidance for implementation and use.

4.1. Observing the Path Capacity

There are various approaches to measuring the capacity used by a connection. Congestion controllers, such as CUBIC or Reno, can estimate the capacity by utilizing the CWND or flight_size.
A different approach could estimate the same parameters for a rate-based congestion controller, such as BBR [I-D.cardwell-iccrg-bbr-congestion-control], or by observing the rate at which data is acknowledged by the remote endpoint.

Implementations are required to calculate a saved_rtt, measuring the minimum RTT while observing the capacity. For example, this could be the minimum of a set of RTT measurements made over the previous 5 minutes.

Implementations are expected to include a LifeTime parameter in the CC parameters that can be used to remove old CC parameters when no longer needed, or when the CC parameters are out of date.

* There are cases where the current CWND does not reflect the path capacity. At the end of slow start, the CWND can be significantly larger than needed to fully utilize the path (i.e., a CWND overshoot). It is inappropriate to use an overshoot in the CWND as a basis for estimating the capacity. In most cases, the CWND will converge to a stable value after several more RTTs. One mitigation could be to set the saved_cwnd based on the flight_size, or an averaged CWND.

* When a sender is rate-limited, or in the RTT following a burst of transmission, a sender typically transmits less data than allowed by the CWND. Such observations could be discounted when estimating the saved_cwnd (e.g., when a previous observation recorded a higher value).

4.2. Confirming the Path in the Reconnaissance Phase

In the Reconnaissance Phase, a sender initiates a connection and starts sending initial data, while measuring the current_rtt. The CC is not modified. A sender therefore needs to limit the initial data, sent in the first RTT of transmitted data, to not more than the IW [RFC9000]. This transmission using the IW is assumed to be a safe starting point for any path to avoid adding excessive load to a potentially congested path.

Careful Resume does not permit multiple concurrent reuse of the saved CC parameters.
When multiple new concurrent connections are made to a server, each can have a valid saved_remote_endpoint, but the saved_cwnd can only be used once (i.e., if two connections start simultaneously they cannot both use the saved_cwnd to perform a jump). This is to prevent a sender from performing multiple jumps in the CWND, each individually based on the same saved_cwnd, and hence creating an excessive aggregate load at the bottleneck.

The method that is used to prevent re-use of the saved CC parameters will depend upon the design of the server (e.g., if all connections from a given client IP arrive at the same server process, then the server process could use a hash table, whereas when using some types of load balancing, a distributed system might be needed to ensure this invariant when the load balancing hashes connections by 4-tuple and hence multiple connections from the same client device are served by different server processes).

4.2.1. Confirming the Path

Path characteristics can change over time for many reasons. This can result in the previously observed CC parameters becoming irrelevant.

To help confirm the path, the sender compares the saved_rtt with each of a series of current_rtt samples. If the current_rtt sample is less than a half of the saved_rtt, this is regarded as too small, and is an indicator of a path change. (This factor of two arises, because the rate should not exceed the observed rate when the saved_cwnd was measured, because the jump_cwnd is calculated as half the measured saved_cwnd.)

If the current RTT is larger than saved_rtt (when the saved_cwnd was measured), this results in a proportionally lower resumed rate, because the transmission using Careful Resume is paced based on the current_rtt (i.e., a larger RTT sample in the Unvalidated Phase would reduce the paced sending rate, and hence is still safe).
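The factor-of-two argument above can be checked numerically. The sketch below is illustrative only: the function names observed_rate and resumed_rate, the use of milliseconds, and the sample values are assumptions made for this example and are not part of this specification. It compares the rate obtained by pacing the jump_cwnd over the current_rtt against the rate at which the saved_cwnd was originally observed.

```python
from typing import Optional


def observed_rate(saved_cwnd: int, saved_rtt_ms: int) -> float:
    """Rate (bytes per ms) at which the saved_cwnd was observed."""
    return saved_cwnd / saved_rtt_ms


def resumed_rate(saved_cwnd: int, current_rtt_ms: int,
                 max_jump: Optional[int] = None) -> float:
    """Paced rate (bytes per ms) when jump_cwnd is spread over current_rtt."""
    jump_cwnd = saved_cwnd // 2  # no more than half of the saved_cwnd
    if max_jump is not None:
        jump_cwnd = min(max_jump, jump_cwnd)  # optional configured cap
    return jump_cwnd / current_rtt_ms


# Hypothetical saved parameters: 1 MB observed over a 600 ms path.
SAVED_CWND = 1_000_000
SAVED_RTT_MS = 600
limit = observed_rate(SAVED_CWND, SAVED_RTT_MS)

for current_rtt_ms in (200, 300, 600, 1200):
    rate = resumed_rate(SAVED_CWND, current_rtt_ms)
    verdict = "within" if rate <= limit else "exceeds"
    print(f"current_rtt={current_rtt_ms} ms: "
          f"{rate:.1f} B/ms {verdict} the observed rate")
```

With these sample values, a current_rtt of exactly half the saved_rtt gives a resumed rate equal to the observed rate, a smaller current_rtt would exceed it, and a larger current_rtt gives a proportionally lower rate.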
If the current_rtt is incorrectly measured as larger than the actual path RTT, the sender will receive an ACK for an unvalidated packet before it would have completed the Unvalidated Phase. Careful Resume uses this ACK to reset the CWND to reflect the flight_size, and the sender then enters the Validating Phase.

A current_rtt more than ten times the saved_rtt is indicative of a path change. (The value of ten was chosen to accommodate both increases in latency from buffering on a path, and any variation between RTT samples.) A sender also verifies that the initial data was acknowledged. (An excessive current_rtt or unacknowledged initial data could each be indicative of persistent congestion.)

A sender in the Reconnaissance Phase reverts to the Normal Phase if congestion is detected. Some transport protocols implement CC mechanisms that infer potential congestion from an increase in the current_rtt. In the Reconnaissance Phase, this indication can occur earlier than congestion that is reported by loss or by ECN marking. Designs need to consider whether such an indication is a suitable trigger to revert to the Normal Phase.

4.3.  Safety for the Unvalidated Phase

This section considers the safety for using saved CC parameters to tentatively update the CWND. This is designed to mitigate the risk of adding excessive congestion to an already congested path.

A connection must not use the previously saved_cwnd to directly initialize a new flow, which would cause it to resume sending at the same rate. The jump_cwnd must therefore be no more than half the previously saved_cwnd.

4.3.1.  Lifetime of CC Parameters

The long-term use of the previously observed parameters is not appropriate. A lifetime therefore needs to be specified during which the saved CC parameters can be safely re-used.
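One way to picture the lifetime requirement, together with the single-use rule from Section 4.2, is as a small store of saved records. The Python sketch below is a single-process illustration under stated assumptions: the SavedParams and ParamStore names, the field names, the one-hour default lifetime, and the "claimed" flag are all hypothetical and are not specified by this document. A multi-process or load-balanced server would need a shared mechanism instead.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SavedParams:
    saved_cwnd: int            # bytes
    saved_rtt: float           # seconds
    saved_at: float            # time the parameters were observed (seconds)
    lifetime: float = 3600.0   # hypothetical default, not from this document
    claimed: bool = False      # set once a connection has used the parameters

    def expired(self, now: float) -> bool:
        return now - self.saved_at > self.lifetime


class ParamStore:
    """Hypothetical single-process store keyed by remote endpoint."""

    def __init__(self) -> None:
        self._store: Dict[str, SavedParams] = {}

    def save(self, endpoint: str, params: SavedParams) -> None:
        self._store[endpoint] = params

    def claim(self, endpoint: str, now: float) -> Optional[SavedParams]:
        """Return the parameters at most once; drop them when expired."""
        params = self._store.get(endpoint)
        if params is None:
            return None
        if params.expired(now):
            del self._store[endpoint]   # remove out-of-date parameters
            return None
        if params.claimed:
            return None                 # another connection already jumped
        params.claimed = True
        return params


store = ParamStore()
store.save("192.0.2.1", SavedParams(saved_cwnd=450_000, saved_rtt=0.6, saved_at=0.0))
first = store.claim("192.0.2.1", now=10.0)    # parameters returned
second = store.claim("192.0.2.1", now=11.0)   # None: already used for a jump
```

The caller supplies the clock value, so the same check works with any monotonic time source.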
[RFC9040] provides guidance on the implementation of TCP Control Block Interdependence, but does not specify how long a saved parameter can safely be reused.

[RFC7661] specifies a method for managing an unvalidated CWND. This states: "After a fixed period of time (the non-validated period (NVP)), the sender adjusts the cwnd (Section 4.4.3). The NVP SHOULD NOT exceed five minutes." Section 5 of [RFC7661] discusses the rationale for choosing that period. However, RFC 7661 targets rate-limited connections using normal CC. Careful Resume includes additional mechanisms to avoid and mitigate the effects of overshoot, which can justify a longer lifetime for the saved_cwnd when using Careful Resume.

4.3.2.  Pacing in the Unvalidated Phase

A QUIC sender must avoid sending a burst of packets greater than IW as a result of a step-increase in the CWND. This is consistent with [RFC8085] and [RFC9000].

Pacing packets as a function of the current_rtt, rather than the saved_rtt, provides additional safety during the Unvalidated Phase, because it avoids a smaller saved_rtt inflating the sending rate. Pacing also places a limitation on the minimum acceptable current_rtt to avoid sending at a rate higher than was previously observed.

The following example provides a relevant pacing rhythm using the RTT and the saved_cwnd. The Inter-packet Transmission Time (ITT) is determined by using the current Maximum Message Size (MMS), the saved_cwnd and the current_rtt. A safety margin can be configured to avoid sending more than a maximum (max_jump):

   jump_cwnd = Min(max_jump, saved_cwnd/2)

   ITT = (current_rtt x MMS)/jump_cwnd

This follows the idea presented in [RFC4782], [I-D.irtf-iccrg-sallantin-initial-spreading] and [CONEXT15]. Other sender mitigations have also been suggested to avoid line-rate bursts (e.g., []).

4.3.3.
Exit from the Unvalidated Phase because of Variable Network Conditions

*  Careful Resume has been designed to be robust to changes in network conditions due to variations in the forwarding path, such as reconfiguration of equipment, or changes in the link conditions. This is mitigated by path confirmation.

*  Careful Resume has been designed to be robust to changes in network traffic, including the arrival of new flows that compete for capacity at a shared bottleneck. This is mitigated by jumping to no more than half of the saved_cwnd and by using pacing.

*  Careful Resume has been designed to avoid unduly suppressing flows that have used the capacity since the available capacity was measured. This is further mitigated by bounding the duration of the Unvalidated Phase (and the following Validating Phase), and the conservative design of the Safe Retreat Phase.

4.4.  The Validating Phase

The purpose of the Validating Phase is to trigger an entry to the Safe Retreat Phase if the capacity is not validated.

When a sender completes the Unvalidated Phase, by sending a jump_cwnd of data, after one RTT, or on receiving an acknowledgment for an unvalidated packet, it ceases to use the unvalidated CWND. If the flight_size was less than or equal to the PipeSize, the sender resets the CWND to the PipeSize, and enters the Normal Phase.

Otherwise, if the CWND is larger than the flight_size, the CWND is reset to the flight_size. The sender then awaits reception of ACKs to validate the use of this capacity.

New packets are sent when previously sent data is newly acknowledged. The CWND is increased during the Validating Phase, based on received ACKs. This allows new data to be sent, but has no lasting impact on the CWND if congestion is subsequently detected.

4.5.
Implementation Notes for using BBR

When the flow is controlled using BBR, Careful Resume is implemented by setting the pacing rate from the saved congestion control parameters, with the following precautions:

*  The flag "carefully-resuming" is added to the BBR state to indicate that the sender is allowed to send unvalidated packets. This is initialized to "False" when the BBR flow starts;

*  Prerequisites for using Careful Resume are described in Section 3.2;

*  Careful Resume is allowed to transmit unvalidated packets only when the BBR flow is in the Startup state;

*  The probing rate is configured to 1/2 of the bottleneck bandwidth, as derived from the CWND calculation specified in the saved congestion control parameters according to the requirements in Section 3.3;

*  The sender starts the Unvalidated Phase at the beginning of a BBR round, and sets the "carefully-resuming" flag to "True";

*  When the "carefully-resuming" flag is set, the BBR congestion controller sets the BBR pacing rate to the larger of the nominal pacing rate ( multiplied by BBRStartupPacingGain) or the calculated probing rate. Then, CWND is set to the larger of and the probing rate, multiplied by BBR.rtt_min times BBRStartupCwndGain;

*  The "carefully-resuming" flag is reset to "False" two rounds after it is set, i.e., after all the packets sent in the first round of "carefully-resuming" have been received and acknowledged by the peer. At that stage (after the capacity has been validated), the measured delivery rate is expected to reflect the probing rate.

*  If congestion is experienced while the "carefully-resuming" flag is "True", BBR exits the Startup state and enters the Drain state (implementing the Safe Retreat Phase).

4.6.  Safety in the Safe Retreat Phase

This section considers the safety after congestion has been detected for unvalidated packets.
The Safe Retreat Phase sets a safe CWND value to drain any unvalidated packets from the path after a packet loss has been detected or after ACKs indicate that sent packets were ECN CE-marked. The CC parameters that were used are invalid, and are removed.

The Safe Retreat reaction differs from a traditional reaction to detected congestion, because a jump_cwnd can result in a significantly higher rate than would be allowed by Slow-Start. This jump could aggressively feed a congested bottleneck, resulting in overshoot where a disproportionate number of packets from existing flows are displaced from the buffer at the congested bottleneck. For this reason, a sender in the Safe Retreat Phase needs to react to detected congestion by reducing CWND significantly below the saved_cwnd.

During loss recovery, a receiver can cumulatively acknowledge data that was previously sent in the Unvalidated Phase in addition to acknowledging successful retransmission of data. [RFC3465] describes how to appropriately account for such ACKs.

ACKs received for unvalidated packets are tracked to measure the maximum available capacity, called the PipeSize. (The first unvalidated packet can be determined by recording the sequence number of the first packet sent in the Unvalidated Phase.) This calculated PipeSize is later used to reset the ssthresh. However, note that this is not a safe measure of the currently available share of the capacity whenever there was also a significant overshoot at the bottleneck, and must not be used to reinitialise the CWND.

The Proportional Rate Reduction (PRR) [RFC6937] assumes that it is safe to reduce the rate gradually when in congestion avoidance. PRR is therefore not appropriate when there might be significant overshoot in the use of the capacity, which can be the case when the Safe Retreat Phase is entered.

The recovery from loss depends on the design of a transport protocol. A TCP or SCTP sender is required to retransmit all lost data [RFC5681].
For QUIC and DCCP, the need for loss recovery depends on the sender policy for retransmission. On entry to the Safe Retreat Phase, the CWND can be significantly reduced. When multiple packets were lost, a sender recovering all lost data could then take multiple RTTs to complete.

4.7.  Returning to Normal Congestion Control

After using Careful Resume, the CC controller returns to the Normal Phase. The implementation details for different transports depend on the design of the transport.

In the Normal Phase, a sender is permitted to start Observing the capacity of the path.

4.8.  Limitations from Transport Protocols

The CWND is one factor that limits the sending rate of a sender. Other mechanisms can also constrain the maximum sending rate of a transport protocol. A transport protocol might need to update these mechanisms to fully utilise the CWND made available by Careful Resume:

*  A TCP sender is limited by the receiver window (rwnd). Unless configured at a receiver, the rwnd constrains the rate of increase for a connection and reduces the benefit of Careful Resume.

*  QUIC includes flow control mechanisms and mechanisms to prevent amplification attacks. In particular, a QUIC receiver might need to issue proactive MAX_DATA frames to increase the flow control limits of a connection that is started when using Careful Resume to gain the expected benefit.

5.  QLOG support for QUIC

This section provides definitions that enable a Careful Resume implementation to generate qlog events when using QUIC. It introduces an event to report the current phase of a sender, and an associated description.

The event and data structure definitions in this section are expressed in the Concise Data Definition Language (CDDL) [RFC8610] and its extensions described in [I-D.ietf-quic-qlog-quic-events].

The current convention is to use long names for variables.
For example, "CWND" is expanded as "congestion_window" and "saved_cwnd" is expanded as "saved_congestion_window".

5.1.  cr_phase Event

Importance: Extra

This event is emitted when the CC algorithm changes the Careful Resume Phase described in Section 3 of this specification.

Definition:

   RecoveryCarefulResumePhaseUpdated = {
       ? old_phase: CarefulResumePhase,
       new_phase: CarefulResumePhase,
       state_data: CarefulResumeStateParameters,
       ? restored_data: CarefulResumeRestoredParameters,
       ? trigger:
           ; for the Unvalidated phase, when no unvalidated packets
           "congestion_window_limited" /
           ; for the Validating phase
           "first_unvalidated_packet_acknowledged" /
           ; for the Normal phase
           ; and no remaining unvalidated packets to be acknowledged
           "last_unvalidated_packet_acknowledged" /
           ; for the Normal phase, when CR not allowed
           "rtt_not_validated" /
           ; for the Normal phase,
           ; when sending fewer unvalidated packets than CWND permits
           "rate_limited" /
           ; for the Safe Retreat phase, when loss detected
           "packet_loss" /
           ; for the Safe Retreat phase,
           ; when ECN congestion experienced reported
           "ECN_CE" /
           ; for the Normal phase 1 RTT after a congestion event
           "exit_recovery"
   }

   CarefulResumePhase =
       "reconnaissance" /
       "unvalidated" /
       "validating" /
       "normal" /
       "safe_retreat"

   CarefulResumeStateParameters = {
       pipesize: uint,
       first_unvalidated_packet: uint,
       last_unvalidated_packet: uint,
       ? congestion_window: uint,
       ? ssthresh: uint
   }

   CarefulResumeRestoredParameters = {
       saved_congestion_window: uint,
       saved_rtt: float32
   }

                               Figure 2

+------+---------+---------+------------+-----------+------------+
|Phase |Normal   |Recon.   |Unvalidated |Validating |Safe Retreat|
+------+---------+---------+------------+-----------+------------+
|      |Observing|Confirm  |Send faster |Validate   |Drain path; |
|      |CC params|path     |using saved |new CWND;  |Update PS   |
|      |         |         |_cwnd       |Update PS  |            |
+------+---------+---------+------------+-----------+------------+
|On    |    -    |CWND=IW  |PS=FS       |If (FS>PS) |CWND=(PS/2) |
|entry:|         |         |jump_cwnd   |{CWND=FS}  |            |
|      |         |         |=saved_cwnd |else       |            |
|      |         |         |/2;         |{CWND=PS;  |            |
|      |         |         |CWND        |enter      |            |
|      |         |         |=jump_cwnd  |Normal}    |            |
+------+---------+---------+------------+-----------+------------+
|CWND: |When in  |CWND     |CWND is not |CWND can   |CWND is not |
|      |observe, |increases|increased   |increase   |increased   |
|      |measure  |using SS |            |using      |            |
|      |saved    |         |            |normal CC  |            |
|      |_cwnd    |         |            |           |            |
+------+---------+---------+------------+-----------+------------+
|PS:   |    -    |    -    |              PS+=ACked              |
+------+---------+---------+------------+-----------+------------+
|RTT:  |Measure  |Measure  |     -      |     -     |     -      |
|      |saved_rtt|current  |            |           |            |
|      |         |_rtt     |            |           |            |
+------+---------+---------+------------+-----------+------------+
|If    |Normal   |Normal   |         Enter          |     -      |
|loss  |CC       |CC;      |         Safe           |            |
|or    |         |CR is not|         Retreat        |            |
|ECNCE:|         |allowed  |                        |            |
+------+---------+---------+------------+-----------+------------+
|Next  |Observing|If (     |If (FS=CWND |If (ACK    |If (ACK     |
|Phase:|(as      |FS=CWND, |or >1 RTT   |>= last    |>= last     |
|      |needed)  |Lifetime,|has passed  |unvalidated|unvalidated |
|      |         |and RTT  |or ACK      |packet),   |packet),    |
|      |         |confirmed|>= first    |enter      |ssthresh =  |
|      |         |), enter |unvalidated |Normal     |PS x Beta;  |
|      |         |Unvalidat|packet),    |           |and enter   |
|      |         |ed, else |enter       |           |Normal      |
|      |         |enter    |Validating  |           |            |
|      |         |Normal   |            |           |            |
+------+---------+---------+------------+-----------+------------+

     Figure 3: Illustration of the operation of Careful Resume

The following abbreviations are used: SS = Slow-Start; FS = flight_size; PS = PipeSize; ACK = highest acknowledged packet.

The PipeSize tracks the validated part of the CWND. It is set to the CWND on entry to the Unvalidated Phase and is updated as each additional packet is acknowledged.

The default value of Beta is 0.5.

Note: For an implementation that keeps track of transmitted data in terms of packets: In the Unvalidated Phase, the first unvalidated packet corresponds to the highest sent packet recorded on entry to this phase. In the Validating Phase and Safe Retreat Phase, this corresponds to the last unvalidated packet. It is also the highest sent packet number recorded on entry to this phase.

The remaining subsections provide informative examples of use.

Note: Although the QLOG variables are expressed in bytes, to simplify the description, these examples are described in terms of packet numbers.

A.1.  Example with No Loss

In the first example of using Careful Resume, the sender starts by sending IW packets, assumed to be 10 packets, in the Reconnaissance Phase, and then continues in a subsequent RTT to send more packets until the sender becomes CWND-limited (i.e., flight_size = CWND).

The sender in the Reconnaissance Phase then confirms the RTT and other conditions for using Careful Resume. In this example, this is confirmed when the sender has 29 packets in flight.

The sender then enters the Unvalidated Phase. (This path confirmation could have happened earlier if data had been available to send.) The sender initialises the PipeSize to the flight_size (in this case, 29 packets) and then sets the CWND to 150 packets (based upon half of the previously observed saved_cwnd of 300 packets).

The sender now sends 121 unvalidated packets (the unused portion of the current CWND).
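The numbers in this example can be reproduced using the pacing formula from Section 4.3.2. In the Python sketch below, everything is counted in packets (so the MMS term becomes 1). The max_jump of 1000 packets and the current_rtt of 600 ms are assumed values chosen for illustration and do not appear in this document; the function names are also hypothetical.

```python
def jump_cwnd(saved_cwnd: int, max_jump: int) -> int:
    """jump_cwnd = Min(max_jump, saved_cwnd/2), in packets."""
    return min(max_jump, saved_cwnd // 2)


def inter_packet_time(current_rtt: float, jump: int, mms: int = 1) -> float:
    """ITT = (current_rtt x MMS)/jump_cwnd; MMS is 1 when counting packets."""
    return current_rtt * mms / jump


saved_cwnd = 300      # packets, from the example
flight_size = 29      # packets in flight when the path is confirmed
max_jump = 1000       # assumed safety limit (hypothetical)
current_rtt = 0.6     # assumed RTT in seconds (hypothetical)

pipesize = flight_size                      # PS = FS on entry to Unvalidated
cwnd = jump_cwnd(saved_cwnd, max_jump)      # half of the saved_cwnd
unvalidated = cwnd - flight_size            # unused portion of the CWND
itt = inter_packet_time(current_rtt, cwnd)  # pacing interval in seconds

print(f"PipeSize={pipesize} CWND={cwnd} unvalidated={unvalidated} "
      f"ITT={itt * 1000:.1f} ms")
```

With these assumed values, the 150-packet window is spread evenly over one current_rtt rather than sent as a burst.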
Each time a packet is sent, the sender checks whether more than 1 RTT has passed since entering the Unvalidated Phase (if so, the Validating Phase is entered). This check triggers only for cases where the sender is rate-limited (see the following example).

The PipeSize increases after each ACK is received.

When the first unvalidated packet is acknowledged (packet number 30), the sender enters the Validating Phase. (This transition would also occur if the flight_size increased to equal CWND.) During this phase, the CWND can be increased for each ACK that acknowledges an unvalidated packet, because this indicates that the packet was indeed validated.

When an ACK is received for the last packet that was sent in the Unvalidated Phase, the sender completes using Careful Resume. It then enters the Normal Phase. If CWND is less than ssthresh, a Reno or Cubic sender in the Normal Phase is permitted to use Slow-Start to grow the CWND towards the ssthresh, and will then enter congestion avoidance.

A.2.  Example with No Loss, Rate-Limited

A rate-limited sender will not fully utilize the available CWND when using Careful Resume, and CWND is therefore reset on entry to the Validating Phase, as described below.

The sender starts by sending IW packets (10) in the Reconnaissance Phase. It commences as described in the first example, transitioning to the Unvalidated Phase. The CWND is set to 150 packets, and the PipeSize to the flight_size (i.e., 29 packets).

The sender then becomes rate-limited because it only sends 50 unvalidated packets.

After about one RTT (detected by using local timestamps or by receiving an ACK for the first unvalidated packet), the sender will still not have fully used the CWND. It then enters the Validating Phase and resets the CWND to the current flight_size (i.e., 50 packets).
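The window reset at the end of the Unvalidated Phase (Section 4.4) can be sketched as follows. This is a simplified Python illustration in units of packets: the function name and the return convention are assumptions, and the PipeSize of 29 packets ignores any ACKs that arrive during the Unvalidated Phase.

```python
from typing import Tuple


def leave_unvalidated(cwnd: int, flight_size: int, pipesize: int) -> Tuple[str, int]:
    """Return (next_phase, new_cwnd) when the Unvalidated Phase ends."""
    if flight_size <= pipesize:
        # Nothing beyond the validated capacity is in flight.
        return "normal", pipesize
    # Unvalidated packets are in flight: cap CWND at what was actually sent.
    return "validating", min(cwnd, flight_size)


# First example: the sender used the whole CWND.
print(leave_unvalidated(cwnd=150, flight_size=150, pipesize=29))
# Rate-limited example: only part of the CWND was used.
print(leave_unvalidated(cwnd=150, flight_size=50, pipesize=29))
```

In the rate-limited case, the unused part of the jump is discarded, so the sender cannot later release a burst against a window that was never exercised.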
During this phase, the CWND can be increased for each received ACK that validates reception of an unvalidated packet. The PipeSize also increases with each ACK received, to reflect the discovered capacity.

When an ACK is received for the last packet sent in the Unvalidated Phase, the sender has completed using Careful Resume. It then enters the Normal Phase, as in the example with no loss.

A.3.  Example with Loss detected in the Reconnaissance Phase

When a packet is lost in the Reconnaissance Phase, the sender enters the Normal Phase and recovers the loss using the normal method. (This sender has discovered a potential capacity limit and is not allowed to continue to use Careful Resume. There is therefore no change to the CC method, and the CWND is the same as if Careful Resume had not been attempted.)

A.4.  Example with Loss detected in the Validating Phase

As in the first example, the sender enters the Unvalidated Phase with a CWND of 150 packets and with the PipeSize initialized to the flight_size (i.e., 29 packets). The sender now sends 121 unvalidated packets (the remaining unused CWND).

This example considers the case when one of the unvalidated packets is lost, which we choose to be packet 64 (the 35th packet sent in the Unvalidated Phase).

ACKs confirm the first 34 unvalidated packets are received without loss. The PipeSize at this point is equal to 63 (29 + 34) packets.

The loss is then detected (by a timer or by receiving three ACKs that do not cover packet number 64), and the sender enters the Safe Retreat Phase because the window was not validated. Counting the three further ACKs that revealed the loss, the PipeSize at this point is equal to 66 (29 + 34 + 3) packets. Assuming that the IW was 10 packets, the CWND is reset to Max(10,PS/2) = Max(10,66/2) = 33 packets.
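The Safe Retreat arithmetic above can be sketched in Python. The helper names are hypothetical; the values (an IW of 10 packets and a Beta of 0.5) are those given in this appendix, and the exit calculation follows the "ssthresh = PS x Beta" entry in Figure 3.

```python
def safe_retreat_cwnd(pipesize: int, iw: int = 10) -> int:
    """CWND on entry to Safe Retreat: Max(IW, PS/2), in packets."""
    return max(iw, pipesize // 2)


def safe_retreat_exit_ssthresh(pipesize: int, beta: float = 0.5) -> float:
    """ssthresh on leaving Safe Retreat: PS x Beta, in packets."""
    return pipesize * beta


print(safe_retreat_cwnd(66))   # 33, as in the example
print(safe_retreat_cwnd(12))   # 10: the IW floor applies to a small PipeSize
```

The IW floor only matters when very few packets were validated before the loss; otherwise the CWND is half of the PipeSize measured at that point.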
This CWND is used during the Safe Retreat Phase, because congestion was detected and the sender does not yet know if the remaining unvalidated packets will be successfully acknowledged. A conservative CWND calculation ensures the sender drains the path after this potentially severe congestion event. There is no further increase in CWND in this phase.

The sender continues to receive ACKs for the remaining 86 (121-35) unvalidated packets. Recall that the 35th unvalidated packet was lost and had packet number 64 (29+35). The PipeSize tracks the capacity discovered by acknowledgments for the unvalidated packets and continues to be increased for each received ACK that acknowledges new data. Although the PipeSize cannot be used to safely initialise the CWND (because it was measured when the sender had aggressively created overload), the estimated PipeSize (which, in this case, is 121-1 = 120 packets) can be used to set the ssthresh on exit from Safe Retreat, since it does indicate an upper limit to the current capacity.

At the point where all packets sent in the Unvalidated Phase have been either acknowledged or have been declared lost, the sender updates ssthresh and enters the Normal Phase. Because CWND will now be less than ssthresh, a sender in the Normal Phase is permitted to use Slow-Start to grow the CWND towards the ssthresh, after which it will enter congestion avoidance.

Appendix B.  Internet Draft Revision details

Previous individual submissions were discussed in TSVWG and QUIC.

WG -00 included clarifications and restructuring to form the 1st WG draft.

WG -01 included review comments and suggestions from John Border, and follows the setting of the TSVWG milestone with an intended status of "Proposed Standard".

WG -02 includes steps to complete the spec.
In particular, consideration of rate-limited senders; selection of reasoned parameters; specification of the Safe Retreat Phase; and improvements to the consistency throughout. Added the Validating Phase.

WG -03, explain entry to Validating Phase, editorial tidy.

WG -04, update based on review comments from Kazuho Oku.

WG -05, update based on review comments from Neal Cardwell. WG feedback from IETF-118. Reviewed the requirements v. guidelines; clarified that CC is not changed in recon., but the recon. info is used to steer the next phase; clarified saved_cwnd can be computed from ACK rate; use jump once; that real server platforms are complex. Clarified lifetime for saved CC params. Incorporates comments from Tong.

WG -06, SR updated following Hackathon comments from Kazuho Oku, and rework of use of PipeSize. Added an informative summary of actions, on suggestion by Tong. Added examples based on text by Ana Custura.

WG -07, Use "rate-limited" uniformly instead of application and data limited. Updated to exit early when the unvalidated CWND is not utilised, detected in tests by Q Misell. Change pipe_size to be PipeSize.

WG -08, Updated CDDL, and made constraints to Observing into guidance: they say what makes sense, but do not need to be followed for conformance. Updated table in the appendix to align with text.

WG -09, Cleaning text to separate guidelines and specification and adjust wording to improve clarity based on questions received during implementation.

WG -10, CH developed text to explain expected operation with BBR. This also fixed some typos introduced in previous edits. Fix XML and fix CDDL bugs for submission. Changed the ssthresh value used after an exit of Safe Retreat to be (PipeSize/2).

WG -11, JD fixed mistakes. GF clarified text. RS added that after SR, ssthresh ought to match the behaviour of Cubic/Reno.
Updated the text to allow an implementation to update CWND ahead of entering the Unvalidated Phase, and to clarify that the Unvalidated Phase starts when the first unvalidated packet is actually sent.

Authors' Addresses

   Nicolas Kuhn
   Thales Alenia Space
   Email:

   Emile Stephan
   Orange
   Email:

   Godred Fairhurst
   University of Aberdeen
   Department of Engineering
   Fraser Noble Building
   Aberdeen
   AB24 3UE
   United Kingdom
   Email:

   Raffaello Secchi
   University of Aberdeen
   Department of Engineering
   Fraser Noble Building
   Aberdeen
   AB24 3UE
   United Kingdom
   Email:

   Christian Huitema
   Private Octopus Inc.
   Email: