Working Group name A. Fressancourt
Internet-Draft L. Iannone
Intended status: Standards Track D. Lou
Expires: 10 April 2025 D. Trossen
Huawei
7 October 2024
Handling inter-DC/Edge AI-related network traffic: Problem statement
draft-aft-ai-traffic-00
Abstract
The growth in the number of parameters of LLMs, as well as the need
to use or train those models with private or protected data, will
require service providers operating LLM-based services to cooperate
to train, specialize or serve LLM-based services across datacenters.
Given their structure, the number of parameters they incorporate and
the collective communication libraries they are built with, LLM
training and inference (or serving) network traffic has specific
characteristics.
In that regard, understanding the specificities of AI-related
workloads is critical to determine how to operate AI-based services
in a federated setting across datacenters.
Status of This Memo
This Internet-Draft is submitted in full conformance with the
provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF). Note that other groups may also distribute
working documents as Internet-Drafts. The list of current Internet-
Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
This Internet-Draft will expire on 10 April 2025.
Copyright Notice
Copyright (c) 2024 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents (https://trustee.ietf.org/
license-info) in effect on the date of publication of this document.
Please review these documents carefully, as they describe your rights
and restrictions with respect to this document. Code Components
extracted from this document must include Revised BSD License text as
described in Section 4.e of the Trust Legal Provisions and are
provided without warranty as described in the Revised BSD License.
Table of Contents
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 3
2. Applicability of AI . . . . . . . . . . . . . . . . . . . . . 4
2.1. xGPT-like Services: an example centralized training use
case . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2. Decentralized training with centralized inference use
cases . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3. Federated building management: a decentralized inference
use case . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4. Key Takeaways . . . . . . . . . . . . . . . . . . . . . . 6
3. How ML systems are built . . . . . . . . . . . . . . . . . . 7
3.1. Lifecycle of a typical large language model . . . . . . . 7
3.2. The distributed nature of Machine Learning systems . . . 8
3.3. The topology of distributed machine learning systems . . 10
3.4. Deployment considerations for AI systems . . . . . . . . 12
4. Challenges (in Networking for AI) . . . . . . . . . . . . . . 14
4.1. Network resource management at regional scale . . . . . . 14
4.2. Latency sensitivity of LLM training and inference . . . . 15
4.3. Aligning communications patterns to the Internet's
constraints . . . . . . . . . . . . . . . . . . . . . . . 16
4.4. Managing incast traffic related to AI inference . . . . . 17
4.5. Securing and attesting AI-related traffic . . . . . . . . 18
5. Problem statement . . . . . . . . . . . . . . . . . . . . . . 19
6. Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . 19
7. Security Considerations . . . . . . . . . . . . . . . . . . . 20
8. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 20
9. Informative References . . . . . . . . . . . . . . . . . . . 20
Appendix A. A primer on Machine learning (extended version of
Section 3) . . . . . . . . . . . . . . . . . . . . . . . 24
A.1. Machine learning model lifecycle . . . . . . . . . . . . 25
A.2. System model . . . . . . . . . . . . . . . . . . . . . . 26
A.3. Parallelization modes . . . . . . . . . . . . . . . . . . 28
A.4. Collective communication methods . . . . . . . . . . . . 32
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 37
1. Introduction
AI has attracted a fair bit of attention in the networking
community, including the IETF, with regard not just to applications
but also to the protocols and technologies needed for the
realization of those applications.
While AI is a large area, this document focuses on the method of
(training and inferencing over) Large Language Models (LLMs). The
starting point is the distributed nature of implementing LLMs, both
for training and inferencing. For this, a network of so-called
'workers' is realized that, over time, trains a model over which
inferencing can in turn be performed. This distributed nature
involves a number of communication patterns to exchange suitable
information, and those patterns need to be realized in suitable
protocols. Differences in those protocols emerge from the
deployment choices for AI platforms and the challenges that arise
from those deployments.
The training of LLMs in ever growing large-scale data centres (DCs)
is a current method to push the boundaries of key performance
indicators, most notably the number of parameters included in the
training model [LLMSize]. Observations in recent works [AIBackbone],
however, point to the fact that distribution across more than one DC
will quickly be necessary to continue the growth of LLM training.
LLM deployments may also be inherently distributed from the start,
not because of their size, but because of their use case. An example
here is content moderation in fediverse environments [FEDI], where
decentralized data governance and computation drives the equally
decentralized realization of the workers developing a shared model
over time. Other examples of such decentralized use cases stem from
health (e.g., where independent health trusts may train a shared
model over strictly locally governed patient data), IoT (where
AI-based features may be derived from localized sensor data), or
network management (where operator data used for training may not be
technically possible or legally permitted to disclose to a
centralized training entity). Realizations of platforms for those
use cases often refer to 'federated learning' [FLOWER] to capture
the equal basis of participation, both at the data and processing
level, of the participating entities of the platform.
The intention of this document is to highlight the possible
challenges in the spectrum of realizations for distributed LLM
training and inferencing. For this, we provide more details and
examples for use cases in Section 2, followed by a primer on key AI/
LLM techniques in Section 3. Then, a number of challenges to, e.g.,
resource management, latency, and security, are identified in
Section 4 as a starting point for a focused discussion in the IETF
community on the challenges AI may pose for networks, network
protocols and technologies.
As the spectrum of realization ranges from centralized intra-DC to
highly distributed, federated AI platforms, there is strong
relevance in solving the identified challenges within the IETF.
This leads to the formulation of a problem statement in Section 5
along those lines and to the next steps proposed in Section 6 for
the IETF community to pursue.
2. Applicability of AI
This section introduces the applicability of AI/LLMs across a
spectrum of decentralization for realising machine learning training
and inferencing tasks. These two tasks are the most intensive in AI
workflows, and they are inherently distributed compute problems
given the amount of memory and processing power they require.
Section 3 will introduce the prominent techniques relevant to
implementing those tasks in deployed systems.
It can be observed upfront that the realization of AI/LLM use cases
is driven by two main aspects, namely the decentralization of data
and that of compute, during the training phase, the inference phase
or both. The proposed examples are introduced with regard to those
two aspects, for reasons that are detailed in the text.
2.1. xGPT-like Services: an example centralized training use case
Various GPT-like services have gained much attention recently,
advocating the training of ultra-large language models, providing
chat-like inferencing services to users for, _e.g._, text generation,
generative AI-based image creation, and many others.
The key success factor of those services is the ingestion of vast
amounts of data into a centralized training entity. This data may
come from public sources, search input, or through license
arrangements with, _e.g._, video providers like YouTube, among
others.
In this use case, centralization pertains to the ownership of the
training resources under a single entity. Until now, the prevalent
deployment of such centralized training is within a single, large-
scale DC with a growing number of GPUs to execute the necessary model
training operations over a sustained period of time. However, the
growing need for more compute, together with physical and energy
limitations in existing DCs, makes it likely that those services
will need to scale out beyond the reach of a single DC in the
future.
2.2. Decentralized training with centralized inference use cases
In some cases, it is impossible to gather the amount of data
necessary to properly train a machine learning model, whether for
security, privacy or regulatory reasons. The necessity to tackle
those cases triggered the development of Federated Learning
[FederatedLearning], in which several entities can contribute in a
decentralized way to the training of a single large model.
Health is a traditional area for reasoning-based systems. Its
richness in data, both historical and current (_e.g._, from
patients), lends itself to training LLMs over which inferencing can
take place for, _e.g._, diagnosis, drug development, developing
treatment plans and many others.
Individual health organizations are often well equipped in terms of
compute resources, albeit not at the scale that suffices to perform
centralized LLM training on their own. Thus, federating compute
resources is useful for scale and incentivized through the sharing of
the resulting LLM in the clinical pathway.
Furthermore, data is also localized, since patients sign up to local
health organizations, practices, or health trusts, which in turn
manage their data. In many countries, data sharing is performed for
transferability (_e.g._, when patients change location and thus
local health contacts) but also for treatment across health
organizations. Overall, data governance needs to be strictly adhered
to for any application of the data, with a possible development of
an LLM being just one such application.
Network management is another use case for federated learning, where
the federation is driven by the common goal to develop an
increasingly refined model for, _e.g._, intrusion detection, network
failure detection, and others, but where suitable training data is
only shared in a limited manner or not at all for confidentiality
reasons.
In those use cases, the decentralization of the training stems from
the constraints limiting the exchange of data between entities for
technical feasibility, privacy, confidentiality or regulatory
reasons. Once trained, the models, which include the common
knowledge gathered in the training phase, can be used by each single
entity without necessarily requiring collaboration during the
inference phase.
2.3. Federated building management: a decentralized inference use case
Building management, where smart buildings are often equipped with
micro-DC capabilities, can be federated to improve energy management
in a larger geographic area, utilizing local yet private data to
train a suitable (possibly locally limited) LLM. Works like
[AIConst] expand this use case into other areas like construction
management, architectural design, material design, and others.
Similar to the previous use case, the deployment across shared
infrastructures (as well as the sharing of compute resources itself)
is a key aspect, utilizing Internet technologies for realizing the
necessary exchange within the training and inferencing scenario at
hand. Here, the decentralization of the inference task is a
necessity given that the goal is to reach a global optimum (_e.g._,
energy savings across an area or region) by clustering the
capabilities of buildings while keeping the data used for training
and inferencing local for security and privacy reasons.
2.4. Key Takeaways
The following key takeaways can be derived from the proposed
applicability examples:
* _LLM AI is inherently distributed_: This is due to its scale in
required training nodes and model size. Although the distribution
may be handled entirely within a central, single data centre, the
execution of tasks across often many thousands of workers still
remains a key aspect of LLM AI.
* _Centralized LLM AI training does not need to touch the Internet_:
This is because it may be performed entirely within the limits of
a single DC, or at least among a few DCs interconnected by private
lines under the control of a single entity.
* _Centralization of compute implies centralization of data_: This
is a consequence of the model creation being centralized: the data
the model is based on must be transferred towards the centralized
compute nodes.
* _Federation allows for decentralization of data_: This is possible
with worker nodes that may individually train on data, while
contributing to the global model. However, centralization of the
federation again leads to the observation in the previous item,
_i.e._, data is centralized, too.
* _Inferencing inherently touches the Internet_: This is especially
true in any applicability scenario involving end users residing on
the Internet. The impact may be the creation of very large amounts
of traffic to (_e.g._, centralized) nodes that hold the suitable
model information for performing the desired inferencing.
The next section outlines the system aspects of LLM AI, in light of
the above takeaways, _e.g._, on centralization, while Section 4 more
directly takes these takeaways into account when formulating key
challenges for networking, stemming from the insights in this
section.
3. How ML systems are built
3.1. Lifecycle of a typical large language model
In the last few years, the impressive AI boom has been associated
with the development of Large Language Models (LLMs). LLMs are
models, or representations of systems that are observed or measured
with data. Models have an underlying structure, consisting of a
parametrized graph linking together small operators performing
operations on *tensors*, or multi-dimensional arrays of finite size.
Models are built, or *trained*, using a reference data set (or
training set), consisting of data labelled with its underlying
structure or meaning.
Before its use for training, the data is collected and pre-processed
to structure it and chunk it into *tokens*, the basic data units
used as inputs and outputs of models. After the training data set
has been prepared, the model is trained. During this training phase,
the model is parametrized, i.e., the parameters ruling the strength
of the relationships between the tensor operators constituting the
model are modified so that the model is able to abstract the
behaviour of the system from the dataset representing it.
After the training phase, the model can be used during an *inference*
phase to derive information or make predictions from new data
presented in the form of a sequence of tokens. Inference operations
are not necessarily done by the same nodes that have trained the
model. Models can be transferred to other worker nodes, sometimes
placed close to the users making requests, to perform inference
tasks. Besides, those transferred models can be re-trained or
fine-tuned in order to better serve requests in the context in which
they are made. This can be done, for instance, with private or
locally relevant data.
3.2. The distributed nature of Machine Learning systems
To improve the accuracy of LLMs, those models incorporate an
increasing number of parameters and are trained using ever larger
datasets. Modern LLMs use 125 million up to 405 billion parameters,
where each parameter can be encoded on 4 bits to 4 bytes depending
on the required precision. This increase in models' sizes has
important consequences for the power consumption and the memory
footprint of the systems used to train those models.
From a power consumption perspective, the increase in computing
power needed to accomplish the training tasks of large models
requires the deployment of more powerful Tensor Processing Units
(TPUs) or Graphics Processing Units (GPUs) with higher power and
cooling demands. It is becoming physically unsustainable to bring
the required power to the datacenters hosting those machine learning
worker nodes. A possible way to address this challenge is to
distribute machine learning workloads beyond the realm of a single
datacenter, possibly between two or a few datacenters interconnected
by mid- to long-range private links.
From a memory perspective, this means that storing a model and its
parameters will cost from 62.5 MB (for a model using 125 million
parameters encoded on 4 bits) to 1620 GB (for a model using 405
billion parameters encoded on 4 bytes). The memory footprint of
modern large language models makes those models difficult to
manipulate on a single node. Besides, as mentioned in
[SchedulingSharding], the pre-training of Llama3 with 405 billion
parameters required from 8192 to 16384 GPUs, which were obviously
hosted on multiple, connected nodes.
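
The arithmetic above can be illustrated with a short Python sketch.
The parameter counts and encodings are those quoted in this section;
the helper name is purely illustrative.

   # Minimal sketch: memory footprint of the model parameters alone
   # (optimizer state, activations and gradients are ignored here).

   def param_memory_bytes(n_params: float, bits_per_param: int) -> float:
       """Storage needed for the parameters, in bytes."""
       return n_params * bits_per_param / 8

   small = param_memory_bytes(125e6, 4)    # 125 million params at 4 bits
   large = param_memory_bytes(405e9, 32)   # 405 billion params at 4 bytes

   print(f"small model: {small / 1e6:.1f} MB")   # 62.5 MB
   print(f"large model: {large / 1e9:.0f} GB")   # 1620 GB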
To cope with the memory and computing needs of workloads associated
with LLM operations, the data preparation, training and inference
tasks mentioned in Section 3.1 can be distributed among nodes with
specific functional roles (task execution, result aggregation,
orchestration, etc.). The detailed description of those functional
roles is given in Appendix A.2.
From a networking perspective, some elements are important to
mention:
* Some roles (aggregator or orchestrator) tend to concentrate
network traffic, thus creating some issues in properly managing
the incast traffic.
* The parallel execution of training and inference tasks by worker
nodes follows specific parallelism modes, presented in Figure 1 and
detailed in Appendix A.3. Those modes have various requirements in
terms of volume of data exchanged and latency (see
[SchedulingSharding] for instance).
a. *Data parallelism*, in which data is split into chunks to be
used by different worker nodes to train the model. This
parallelism mode can cope with rather large latencies (several
100s of ms) but requires exchanging a large volume of data.
b. *Pipeline model-parallelism*, in which a model is separated
into stages consisting of a few consecutive layers of the entire
model. Each stage is allocated to a separate worker node, and
intermediate states are exchanged between successive worker nodes.
This parallelism mode can cope with large latencies (10s of ms)
and requires exchanging less data than data parallelism.
c. *Tensor model-parallelism*, in which model layers are split
into chunks that can be operated on by different nodes. This
parallelism mode needs to operate in low latency networks (10s of
us) and requires exchanging a lot of data.
d. *Mixture of Expert parallelism*, in which nodes hold smaller but
specialized models, trained over a smaller amount of data and having
fewer parameters. This parallelism mode can cope with latencies in
the ms range.
* Machine learning applications rely most of the time on collective
communication libraries [xCCL] that use patterns presented in detail
in Appendix A.4, such as All-to-All or All-Reduce.
[I-D.yao-tsvwg-cco-problem-statement-and-usecases] has already
introduced some networking challenges associated with the use of
collective communication in machine learning systems. From a
networking perspective, those communication patterns translate into
"on-off" traffic, with a few large flows starting simultaneously to
allow the nodes involved in a collective to exchange data and
parameters (a toy sketch of such a synchronized exchange is given
after Figure 1). The volume, synchronism and imbalance of the
traffic generated by those communication patterns are a burden to
the distribution of machine learning workloads, and a challenge to
be addressed by the networking community.
+----+ +-+ +-+
| | |P| |P| +-+
| | |.| |.| |O|
|Data| ==> |1|X|2| ==> |u|
| 1. | |-| |-| |t| +----+
| | | | | | +-+ +-+--+ | +-+ +-+
| | +-+ +-+ | | | |P| +-+ |P| +-+
+----+ | | | |.| |I| |.| |O|
|Data| | ==> |1|X|n|X|2| ==> |u|
+----+ +-+ +-+ | | | |-| |t| |-| |t|
| | |P| |P| +-+ | | | | | +-+ | | +-+
| | |.| |.| |O| | +-+ +-+ +-+
|Data| ==> |1|X|2| ==> |u| +----+
| 2. | |-| |-| |t|
| | | | | | +-+
| | +-+ +-+
+----+
(a) Data parallelism (b) Pipeline model-parallelism
+-+ +-+ +-+ +-+
|P| |P| +-+ |P| |P|
|.| |.| |O| |.| |.|
+----+ |1|X|2| ==> |-| +----+ ^ |1|X|2|\
+-+--+ | ^ |-| |-| |1| +-+--+ | / |-| |-| \
| | | / |1| |1| +-+ +-+ | | | / |1| |1| v +-+
| | | / +-+ +-+ \ |O| | | | +-+ +-+ +-+ |O|
|Data| | | | |u| |Data| |==>|R| ---------- |u|
| | | \ +-+ +-+ / |t| | | | +-+ +-+ +-+ |t|
| | | \ |P| |P| +-+ +-+ | | | |P| |P| +-+
| +-+ v |.| |.| |O| | +-+ |.| |.|
+----+ |1|X|2| ==> |-| +----+ |1|X|2|
|-| |-| |2| |-| |-|
|2| |2| +-+ |2| |2|
+-+ +-+ +-+ +-+
(c) Tensor model-parallelism (d) Mixture of Expert parallelism
Figure 1: Parallelism models used in machine learning systems
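
As a toy illustration of the synchronized exchanges described above,
the following Python sketch (not a real collective communication
library) shows a naive data-parallel all-reduce, in which every
worker contributes and receives a full gradient-sized buffer at the
same step. Worker count and gradient values are invented for the
example.

   # Minimal sketch of an all-reduce over data-parallel workers: each
   # worker contributes its local gradient and every worker ends up
   # holding the same averaged result, which is why all of them send
   # and receive comparable volumes of data at the same time.

   from typing import List

   def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
       """Return, for every worker, a copy of the element-wise mean."""
       n_workers = len(worker_grads)
       length = len(worker_grads[0])
       mean = [sum(g[i] for g in worker_grads) / n_workers
               for i in range(length)]
       return [list(mean) for _ in range(n_workers)]

   # Four data-parallel workers, each with a gradient over 3 parameters.
   grads = [[0.1, 0.2, 0.3], [0.3, 0.2, 0.1],
            [0.0, 0.4, 0.2], [0.2, 0.0, 0.4]]
   print(all_reduce_mean(grads)[0])  # ~[0.15, 0.2, 0.25] on every worker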
3.3. The topology of distributed machine learning systems
To boost the workload distribution efficiency, nodes participating
in the common execution of machine learning workloads are
interconnected following specific topologies. Depending on the
number of nodes involved, the nodes' capacities to connect directly
with other nodes and the control of the machine learning application
administrator over the physical connectivity substrate, different
topologies can be used. Those topologies can inherit either from the
work done in the high-performance computing community, if the nodes
are homogeneous,
high-capacity computers under the control of a single administrator,
or from the peer-to-peer or swarm computing community, if the nodes
are more diverse and numerous.
Topologies inspired by the HPC networking community are typically
found in distributed systems in which all the nodes are located in
the same datacenter. In specific cases, those topologies can be
extended beyond a single datacenter. Then, the distribution of the
workloads is done so as to avoid frequent and latency-bound
communications over those links, and congestion control mechanisms
are tuned to avoid deadlocks and buffer bloat in the network
equipment managing the inter-datacenter link. Among those
HPC-inspired topologies, we can find the n-dimension torus, the
Dragonfly topology, the fat-tree topology and the hypercube
topology.
On the other end of the spectrum, topologies inspired by swarm
computing and peer-to-peer networks have emerged as solutions
connecting heterogeneous nodes involved in distributed machine
learning tasks outside datacenters, at the edge of the networks or
among clusters of nodes hosted on public cloud platforms. Those
topologies can be built either from dynamic wireless connections,
which can be rearranged easily, or as overlay links on top of an
immutable physical connectivity substrate. Among those
swarm-inspired topologies, presented in Figure 2, we can find the
full mesh, the star topology, the Peer-to-Peer topology, which
presents a limited diameter without too many connections, the random
topology and the clustered topology. Note that in the case of the
clustered topology, nodes can be gathered based on their proximity
to a proxy or an aggregator, or based on their interests or
capabilities. Besides, the clustered topology can be structured
according to a hierarchy.
o-----o o o o-----o o o o o o o o o
/ \ / \ \ / / ____/ \ | / o o / o o
/ \ / \ \ / / / \ o---o o \ o
o-----+-----o o--o--o o-+ +-o \ \ o o o
\ / \ / / \ \ ____/ / o---o--+ / o \ o
\ / \ / / \ \ / / o / o o o o o
o-----o o o o-----o +--o o o o
Full mesh Star Peer-to-Peer Random Clustered
topology topology topology topology topology
Figure 2: Decentralized topologies
The topology connecting the nodes together is not anecdotal. If it
is under the control of the machine learning system administrator,
the construction of a proper topology able to efficiently support
the communication patterns generated by the parallelization model
and the
collective communication method used is a key performance factor. If
the topology is inherited from the environment, machine learning
system administrators will need to adapt the communication patterns
they use to the characteristics of the topology.
3.4. Deployment considerations for AI systems
Given the computing and memory requirements of machine learning
workloads, modern machine learning systems are fundamentally
distributed. By combining functional roles together, distributing
them using a combination of parallelization modes, and communicating
following specific patterns, a variety of systems with different
shades of decentralization can be built. Indeed, distributing
workloads across several workers does not necessarily mean that the
control and orchestration of those workloads is distributed. In case
a parameter server is used, the orchestrator role is played by a
single node. Yet, following the Federated Learning
[FederatedLearning] approach, machine learning systems can be
decentralized as shown in Figure 3.
+----+ +----+ +----+ +----+
| D | | D | | DM | | DM |
+--+-+ +-+--+ +--+-+ +-+--+
\ / \ /
+-+---+-+ +-+---+-+
|Central| |Central|
| Inst. | | serv. |
| DM (Σ)| | M (Σ)|
+-+---+-+ +-+---+-+
/ \ / \
+--+-+ +-+--+ +--+-+ +-+--+
| D | | D | | DM | | DM |
+----+ +----+ +----+ +----+
Centralized learning Centralized federated learning
+-----------+ +-----------+
| DM (Σ) | | DM (Σ) |
+-+---+---+-+ +-+---+---+-+
/ | \ / | \
+---------+-+ | +-+---------+ +---------+-+ | +-+---------+
| DM +---*---+ DM | | DM (Σ) +---*---+ DM (Σ) |
+--------+--+ / \ +--+--------+ +--------+--+ / \ +--+--------+
\ / \ / \ / \ /
+-------+-+-+ +-+-+-------+ +-------+-+-+ +-+-+-------+
| DM +-+ DM | | DM (Σ) +-+ DM (Σ) |
+-----------+ +-----------+ +-----------+ +-----------+
Semi-decentralized Fully decentralized
federated learning federated learning
Figure 3: Different centralization models in federated learning
In federated learning, the coordination role of the orchestrator or
the aggregation of parameters done by the aggregator can be
performed by a subset of, or all, the worker nodes deployed in the
system, using a distributed protocol to exchange information,
synchronize parameters and agree on a reference. Of course, this
comes at a communication cost, but decentralization also has
benefits.
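
As a rough illustration of such an aggregation step, the following
Python sketch shows a weighted parameter average in the spirit of
[FederatedLearning]. The function name, parameter vectors and sample
counts are illustrative, not a prescribed protocol.

   # Minimal sketch of federated averaging: an aggregator (or the
   # workers themselves, in a decentralized variant) combines locally
   # trained parameters, weighted by the amount of local data, without
   # ever seeing the raw data itself.

   from typing import List

   def federated_average(models: List[List[float]],
                         n_samples: List[int]) -> List[float]:
       """Average per-worker parameter vectors, weighted by data size."""
       total = sum(n_samples)
       return [
           sum(w[i] * n for w, n in zip(models, n_samples)) / total
           for i in range(len(models[0]))
       ]

   # Three workers trained locally on 100, 300 and 600 samples.
   local_models = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]]
   print(federated_average(local_models, [100, 300, 600]))  # ~[0.94, 2.06]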
One of the first reasons for decentralizing machine learning systems
is to allow users to retain ownership of their data, and to control
how their data is transferred or used. In a centralized setting,
users need to trust the central entity managing the system to
respect the data management policy agreed upon with the user. In a
decentralized setting, users can either infer from a copy of a
generic model sent by a central entity (centralized federated
learning), fine-tune the model and infer from it (semi-decentralized
setting) or even cooperate with others to train a model together and
use it afterwards (fully decentralized setting).
Besides, in a decentralized machine learning system, users and
worker nodes can be co-located, or workers can be placed close to
the users in order to reduce the overall communication latency. In
such a decentralization scheme, considering that edge nodes may have
less memory or computing power, a balance has to be found between
the time spent communicating between the user and the worker nodes
and the processing time, in order to reach an optimal response time
for user requests. Worker nodes located at the edge of the network
may also collaborate with other, more capable nodes located in other
locations. The tail latency of the flows associated with those tasks
should be bounded to avoid degrading the user's experience.
4. Challenges (in Networking for AI)
4.1. Network resource management at regional scale
In (large) model-based machine learning, training, fine-tuning and
inferring from the model are workloads that involve the transfer of
a very large volume of data. In [AITrainingTraffic], the authors
estimate that the first iteration for training a GPT-3B model among
32 GPU nodes generates roughly 85 GB of tensor-parallel data
traffic, 1 GB of pipeline-parallel data traffic and 741 MB of
data-parallel data traffic. The tensor-parallel and pipeline-parallel
data traffic consist of 125 MB messages that are periodically
exchanged during communication phases, while very little traffic is
exchanged during computing phases.
This traffic pattern is a good example of the characteristics of
network traffic flows associated with machine learning workloads.
Indeed, the network traffic generated by distributed machine
learning systems consists of a relatively small number of large
(elephant) flows, starting and stopping roughly simultaneously. Such
synchronous traffic patterns come from the use of collective
communication libraries such as NCCL [NCCL] and associated
collective communication primitives such as All-Reduce or
All-Gather, as mentioned in [Burstiness]. Dealing with this
synchronized traffic
in a stochastic network is challenging, because it generates
periodic traffic bursts that bloat network equipment buffers.
Besides, the network's capacity needs to meet peak requirements at
the cost of an inefficient average utilization of the installed
capacity. Even if the network's capacity is sufficient in theory to
accommodate the peak requirements of machine learning systems, the
transport mechanisms used on the links between the nodes make it
difficult to immediately use the full deployed capacity, resulting
in inefficiencies at the network level [NetworkBottleneck]. Finally,
in such a synchronous communication pattern, the failure of a link
delaying data transmission between two nodes in the system might
delay the whole system after a few iterations.
Mitigating the resource management challenges raised by machine
learning traffic patterns is currently an open research area. At the
application level, machine learning system designers work to develop
hybrid parallelization schemes combining the patterns presented in
Section 3.2 (detailed in Appendix A.3) and orchestration methods
aiming at better utilizing the deployed network capacity and at
avoiding "on-off" behaviors.
At the network level, the main challenge that the research community
is trying to address is the proper balancing of the flows generated
by machine learning workloads in the network. Indeed, to address the
need for bandwidth and avoid bottlenecks, machine learning system
designers are often deploying their systems on top of networks whose
topology presents a large number of equal-cost links between two
nodes. Yet, as mentioned earlier, machine learning traffic consists
of a rather small number of large flows that have very little
entropy. This makes applying classic load balancing techniques
challenging. To address this load balancing issue, most collective
communication libraries use packet spraying strategies, which
require specific tuning due to the mentioned lack of entropy. Yet,
some researchers are questioning the relevance of this approach
[ChallengingSpraying].
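
The following Python sketch illustrates, under simplified
assumptions, why a handful of elephant flows balance poorly over
equal-cost paths. The hash function, addresses and port numbers are
illustrative and do not model any particular device.

   # Minimal sketch: a few large flows with little header entropy can
   # only occupy a few of the available equal-cost paths, and hash
   # collisions may stack several flows on the same path.

   from collections import Counter
   import hashlib

   def ecmp_path(five_tuple: tuple, n_paths: int) -> int:
       """Illustrative flow hash; real devices hash headers in silicon."""
       digest = hashlib.sha256(repr(five_tuple).encode()).digest()
       return int.from_bytes(digest[:4], "big") % n_paths

   # Eight elephant flows between the same two worker nodes, differing
   # only in source port, spread over 16 equal-cost paths.
   flows = [("10.0.0.1", "10.0.0.2", 49152 + i, 4791, "UDP")
            for i in range(8)]
   load = Counter(ecmp_path(f, 16) for f in flows)
   print(dict(load))  # at most 8 of the 16 paths carry any traffic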
4.2. Latency sensitivity of LLM training and inference
Training a (large) model in a distributed setting does not
necessarily require data transfer to be operated at a controlled low
latency, but large tail latencies related to packet losses or
head-of-line blocking can delay the training of models considerably.
Indeed, problems arising from packet losses, link failures or
excessive buffering can have cascading consequences since, in most
parallelization methods, data and model parameters are exchanged
following synchronized patterns (see Section 4.1).
The extent to which the latency effects of a soft failure on a
connection between two nodes are amplified depends on the topology
of the network connecting the nodes. Besides, routing inefficiencies
or failures to properly balance the load on some heavily-used links
can also generate additional latencies. Thus, the topology of the
network supporting machine learning workloads needs specific care.
In a large-scale and decentralized AI context, heterogeneous links
can be used between nodes participating in a decentralized machine
learning system. The specificities of those links need to be taken
into account when orchestrating or distributing AI-related tasks. As
latency is affected by congestion, addressing the mismatch between
links' bandwidth-delay products for efficient congestion management
is an open challenge. In the research community, some projects have
proposed to introduce proxies to investigate new control loops
taking into account link segment characteristics [SiteToSite]. In
the IETF, the CSIG draft (now expired) presented a model to expose a
rich congestion signal [I-D.ravi-ippm-csig].
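
As a simple illustration of this mismatch, the following Python
sketch compares the bandwidth-delay product (BDP) of an intra-DC hop
with that of a long-haul inter-DC link. The bandwidth and RTT
figures are illustrative only.

   # Minimal sketch: the amount of data that must be kept in flight to
   # fill a link grows with both its bandwidth and its round-trip time.

   def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
       """Bytes in flight needed to keep the link full."""
       return bandwidth_bps * rtt_s / 8

   intra_dc = bdp_bytes(400e9, 20e-6)   # 400 Gb/s, 20 us RTT -> ~1 MB
   inter_dc = bdp_bytes(100e9, 30e-3)   # 100 Gb/s, 30 ms RTT -> ~375 MB
   print(f"intra-DC BDP: {intra_dc / 1e6:.0f} MB, "
         f"inter-DC BDP: {inter_dc / 1e6:.0f} MB")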
4.3. Aligning communications patterns to the Internet's constraints
In the development and deployment of distributed machine learning
systems within the realm of an administrator's responsibility, it is
feasible, and advised, to design the system so as to adapt the
network's topology to the parallelization method used and to the
collective communication patterns required to perform the training
or inference task efficiently. As we have seen in Section 3, several
network topologies can be adopted, depending on the model's
architecture and on the design choices made while designing the
training system.
In a decentralized or federated setting, such a co-design offers
less freedom, as adversarial choices can be made by the parties
willing to cooperate on specific machine learning tasks. Besides,
when the nodes involved are deployed at the edge of the network, in
a majority of cases the topology follows the access network's
specificities rather than adapting to the requirements of machine
learning tasks.
As machine learning-related traffic is growing on the Internet, some
challenges associated with the delivery of network flows related to
the realization of decentralized training or inference tasks are
appearing: How to inform a machine learning application about the
topology of the network interconnecting the involved nodes? How to
adapt the parallelization model or collective communication pattern
to maximize efficiency? In the IETF, the ALTO working group [ALTO]
has investigated similar challenges related to the operation of
peer-to-peer traffic. As machine learning workloads have a different
set of requirements, it may be time to revisit this work.
4.4. Managing incast traffic related to AI inference
In a machine learning system, some specific nodes, such as the
orchestrator or the aggregator, play a central role, and thus
concentrate communications. This concentration might be reinforced
by the use of specific collective communication primitives, and by
the synchronicity of traffic patterns. The network traffic
management challenges previously mentioned (packet losses, delays,
congestion) can be amplified by this concentration on a few hotspots
in the network.
Decentralized systems tend to mitigate this functional centrality by
distributing the responsibility for fulfilling those roles across
several nodes. Yet, even in decentralized systems, it is difficult
to completely avoid incast problems given the specificities of
machine learning workloads. For instance, due to the characteristics
of model inference, the first prompt requests addressed to a given
node are bound to generate large incast traffic related to the
personalization and fine-tuning of the model instance run by the
node, which can represent a large amount of data.
The research and operational communities dealing with scaling out
machine learning systems are working on solutions to address incast
traffic management issues. For instance, in the same way as media
codecs are structured as layers of increasing resolution, models
used in training and inference can be layered, with coarse-grained
models using fewer parameters than finer-grained ones.
Coarse-grained models can be distributed more quickly, drafting a
first answer to a request while finer-grained models are retrieved
and then used to build a more precise answer. Besides, as inference
requests towards large models are often made in the form of a
conversation or contextualized exchange between a user and the node
running the inference tasks addressing its requests, service routing
can be used to consistently route requests to the same instance.
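
As a rough sketch of such session-sticky routing, the following
Python example maps a conversation identifier onto a fixed instance
using consistent hashing. The instance names and hashing scheme are
illustrative and do not reflect any particular CATS mechanism.

   # Minimal sketch: requests carrying the same session identifier are
   # routed to the same inference instance, so its cached context and
   # fine-tuned parameters are reused instead of being re-transferred.

   import hashlib
   from bisect import bisect_right

   INSTANCES = ["edge-a", "edge-b", "dc-1", "dc-2"]  # hypothetical

   def _point(key: str) -> int:
       return int.from_bytes(hashlib.sha256(key.encode()).digest()[:4],
                             "big")

   # Hash ring with a few virtual points per instance for better spread.
   RING = sorted((_point(f"{inst}#{v}"), inst)
                 for inst in INSTANCES for v in range(8))

   def pick_instance(session_id: str) -> str:
       """Always map the same session identifier to the same instance."""
       idx = bisect_right(RING, (_point(session_id), "")) % len(RING)
       return RING[idx][1]

   print(pick_instance("user-42/conversation-7"))
   print(pick_instance("user-42/conversation-7"))  # same instance again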
To that extent, the work done in the CATS working group in the IETF
is of particular relevance [CATSWG]. Specific work on the proper
metrics to be applied to CATS in the context of large-scale
decentralized AI will need to be done, taking into consideration the
large body of work on load balancing for machine learning workloads
done in the research community. In particular, one important problem
in the selection of the instance serving a specific inference
request is the trade-off between the need to serve requests with a
bounded latency, which requires using a node located as close to the
requesting user as possible, the necessity to bootstrap this node
with the user's contextual model parameters, and the uncertainty
about the length of the session between the user and the instance,
which puts stress on the memory needed to keep the session's context
at hand.
4.5. Securing and attesting AI-related traffic
The distribution of machine learning workloads at a regional scale
beyond the limits of a single datacenter, and the decentralization
of the control and management of those operations, raise important
security and privacy challenges. Indeed, when those workloads are
operated in a single-tenant datacenter, data can be secured by means
of perimeter security measures, or by adopting encryption at least
for data stored in the datacenter's realm. In the same way, it is
acceptable to exchange data between nodes in an unencrypted manner
provided the network on which data is exchanged is isolated from the
outside environment.
When data, models and their associated parameters are exchanged on
the Internet, or on public connections, data managers need to make
sure that the data is exchanged securely, following policies that
comply both with users' preferences and with local regulations. As
previously mentioned, data exchanges done during model training
phases or directly before performing an inference task need to be
done as quickly as possible, avoiding tail latencies as much as
possible. Even if encryption of large data flows has improved
considerably in recent years, it adds a latency overhead,
considering that the computational overhead related to cryptographic
operations is handled outside the chip (often a GPU or a tensor
processor) performing the machine learning tasks. It would be
interesting and beneficial to contribute to the efforts done in the
IRTF and IETF to develop low-latency, lightweight encryption schemes
in order to reduce this overhead as much as possible.
Furthermore, as private data is involved in data exchanges related
to training and inference, specific care must be taken to respect
regulations steering the way those data can be processed and
transferred. In particular, some data needs to be processed in the
same region or country in which it was generated. To make sure those
regulations are respected, attestation mechanisms need to be put in
place. By using those attestation mechanisms, data owners or
managers can prove to local authorities that the data they are
managing is being exchanged according to the policy they specified.
In the IETF, the RATS working group [RATSWG] and the NASR initiative
[NASR] are developing attestation mechanisms that could be adapted
to address machine learning requirements, even if this work is still
at an early stage.
5. Problem statement
In today's LLM-based machine learning landscape, we observe a strong
concentration of training abilities in the hands of a few
hyperscalers, while several companies tend to propose innovative
approaches to improve inference by making it faster and performed by
machines located closer to the users. Yet, there is a need to
distribute both the training and inference workloads of AI.
In the same way as for the Internet, there are, and will be, several
incentives to decentralize some or all aspects of the AI lifecycle
to address scalability, resilience, personalization or privacy
concerns. This decentralization trend is exemplified by Federated
Learning [FederatedLearning], with different decentralization
models, as presented in Section 3.4.
Given the challenges highlighted in Section 4, and the fact that
multiple stakeholders are involved in properly addressing those
challenges, AI-related network traffic will no longer be operated
only on private infrastructure, but also on an *_open interconnected
network_*. Thus, it is desirable that the IETF community discuss
networking challenges related to large-scale decentralized AI to
avoid the deployment of proprietary solutions, or of solutions
putting the stability of the Internet at risk due to unfair resource
management or competition between AI-related network traffic and
other traffic in the Internet.
6. Next Steps
While some work addressing the challenges highlighted in this
document is done in working groups related to congestion management,
deterministic networking or service routing, there might be interest
in the IETF community at large to aggregate a group of contributors
interested in elaborating specific requirements and corresponding
solutions to the challenges listed in Section 4. In particular, the
goal of such an initiative would be to:
* Formalize AI-related requirements for service routing: Indeed, it
is necessary to define the metrics to take into account for load
balancing AI-related workloads, given the need for instance
stickiness or the importance of latency aspects, while addressing
the specificities of AI-related traffic, consisting of elephant
flows with little entropy.
* Formalize low-latency / limited-latency requirements associated
with AI network traffic: As presented in this document, during the
different phases of the AI lifecycle, it will be necessary to
enforce different sets of requirements in terms of latency and
tolerance to jitter or packet loss on the sustaining network
traffic.
* Formalize congestion control aspects related to the operation of
AI traffic at regional scale: The machine learning community is
already engaged in improvements of the congestion aspects of AI
workloads by working on application-level solutions taking for
granted the underlying behavior of the network. It would be
interesting to determine which possible improvements the network
could benefit from in order to better solve the
congestion and incast management issues related to the way AI
network traffic is managed.
* Formalize coordination aspects of AI distributed systems: This
topic is very important to the realization of decentralized,
large-scale AI systems, and to the emergence of an inter-AI network
of entities collaborating in the execution of end-to-end AI
workloads for end users.
7. Security Considerations
Section 4.5 highlights privacy-related challenges that AI operations
at regional scale will have to address. Beyond this section, no
additional security concern is raised by the elements presented in
the document.
8. IANA Considerations
This document has no IANA actions.
9. Informative References
[AIBackbone]
Sundaresan, J., Gopalan, A., and Meta, "AI impact on
backbone", .
[AIConst] Baduge, S., Thilakarathna, S., Perera, J., Arashpour, M.,
Sharafi, P., Teodosio, B., Shringi, A., and P. Mendis,
"Artificial intelligence and smart vision for building and
construction 4.0: Machine and deep learning methods and
applications", Elsevier BV, Automation in
Construction vol. 141, pp. 104440,
DOI 10.1016/j.autcon.2022.104440, September 2022,
.
[AITrainingTraffic]
Li, W., Liu, X., Li, Y., Jin, Y., Tian, H., Zhong, Z.,
Liu, G., Zhang, Y., and K. Chen, "Understanding
Communication Characteristics of Distributed Training",
ACM, Proceedings of the 8th Asia-Pacific Workshop on
Networking vol. 23, pp. 1-8, DOI 10.1145/3663408.3663409,
August 2024, .
[ALTO] "Application-Layer Traffic Optimization (alto)", n.d.,
.
[Burstiness]
Luangsomboon, N., Fazel, F., Liebeherr, J., Sobhani, A.,
Guan, S., and X. Chu, "On the Burstiness of Distributed
Machine Learning Traffic", arXiv,
DOI 10.48550/ARXIV.2401.00329, 2024,
.
[CATSWG] "Computing-Aware Traffic Steering (cats) Working Group",
n.d., .
[ChallengingSpraying]
Addanki, V., Goyal, P., and I. Marinos, "Challenging the
Need for Packet Spraying in Large-Scale Distributed
Training", arXiv, DOI 10.48550/ARXIV.2407.00550, 2024,
.
[DataParallelism]
Valiant, L., "A bridging model for parallel computation",
Association for Computing Machinery (ACM), Communications
of the ACM vol. 33, no. 8, pp. 103-111,
DOI 10.1145/79173.79181, August 1990,
.
[FederatedLearning]
McMahan, H., Moore, E., Ramage, D., Hampson, S., and B.
Arcas, "Communication-Efficient Learning of Deep Networks
from Decentralized Data", arXiv, arXiv,
DOI 10.48550/ARXIV.1602.05629, 2016,
.
[FEDI] Anaobi, I., Raman, A., Castro, I., Zia, H., Ibosiola, D.,
and G. Tyson, "Will Admins Cope? Decentralized Moderation
in the Fediverse", ACM, Proceedings of the ACM Web
Conference 2023, DOI 10.1145/3543507.3583487, April 2023,
.
[FLOWER] Flower Labs GmbH, "Flower - A Friendly Federated Learning
Framework", n.d., .
[I-D.ravi-ippm-csig]
Ravi, A., Dukkipati, N., Mehta, N., and J. Kumar,
"Congestion Signaling (CSIG)", Work in Progress, Internet-
Draft, draft-ravi-ippm-csig-01, 2 February 2024,
.
[I-D.yao-tsvwg-cco-problem-statement-and-usecases]
Yao, K., Shiping, X., Li, Y., Huang, H., and D. KUTSCHER,
"Collective Communication Optimization: Problem Statement
and Use cases", Work in Progress, Internet-Draft, draft-
yao-tsvwg-cco-problem-statement-and-usecases-00, 23
October 2023, .
[LLMSize] Zhang, B., Liu, Z., Cherry, C., and O. Firat, "When
Scaling Meets LLM Finetuning: The Effect of Data, Model
and Finetuning Method", arXiv,
DOI 10.48550/ARXIV.2402.17193, 2024,
.
[MoEParallelism1]
Jacobs, R., Jordan, M., Nowlan, S., and G. Hinton,
"Adaptive Mixtures of Local Experts", MIT Press -
Journals, Neural Computation vol. 3, no. 1, pp. 79-87,
DOI 10.1162/neco.1991.3.1.79, February 1991,
.
[MoEParallelism2]
Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le,
Q., Hinton, G., and J. Dean, "Outrageously Large Neural
Networks: The Sparsely-Gated Mixture-of-Experts Layer",
arXiv, DOI 10.48550/ARXIV.1701.06538, 2017,
.
[NASR] "Network Attestation for Secure Routing (NASR)", n.d.,
.
[NCCL] Nvidia, "NVIDIA Collective Communications Library (NCCL)",
n.d., .
[NetworkBottleneck]
Zhang, Z., Chang, C., Lin, H., Wang, Y., Arora, R., and X.
Jin, "Is Network the Bottleneck of Distributed Training?",
ACM, Proceedings of the Workshop on Network Meets AI & ML,
DOI 10.1145/3405671.3405810, August 2020,
.
[PipelineParallelism]
Narayanan, D., Harlap, A., Phanishayee, A., Seshadri, V.,
Devanur, N., Ganger, G., Gibbons, P., and M. Zaharia,
"PipeDream: generalized pipeline parallelism for DNN
training", ACM, Proceedings of the 27th ACM Symposium on
Operating Systems Principles, DOI 10.1145/3341301.3359646,
October 2019, .
[RATSWG] "Remote ATtestation ProcedureS (rats) Working Group",
n.d., .
[RCCL] AMD, "AMD ROCm Software", n.d.,
.
[SchedulingSharding]
Chu, W., Choudhury, A., and Meta, "Scheduler and Sharding
Considerations for Network Efficiency",
.
[SiteToSite]
Cangialosi, F., Narayan, A., Goyal, P., Mittal, R.,
Alizadeh, M., and H. Balakrishnan, "Site-to-site internet
traffic control", ACM, Proceedings of the Sixteenth
European Conference on Computer Systems,
DOI 10.1145/3447786.3456260, April 2021,
.
[TensorParallelism]
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z.,
Citro, C., Corrado, G., Davis, A., Dean, J., Devin, M.,
Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard,
M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M.,
Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D.,
Olah, C., Schuster, M., Shlens, J., Steiner, B.,
Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V.,
Vasudevan, V., Viegas, F., Vinyals, O., Warden, P.,
Wattenberg, M., Wicke, M., Yu, Y., and X. Zheng,
"TensorFlow: Large-Scale Machine Learning on Heterogeneous
Distributed Systems", arXiv,
DOI 10.48550/ARXIV.1603.04467, 2016,
.
[xCCL] Weingram, A., Li, Y., Qi, H., Ng, D., Dai, L., and X. Lu,
"xCCL: A Survey of Industry-Led Collective Communication
Libraries for Deep Learning", Springer Science and
Business Media LLC, Journal of Computer Science and
Technology vol. 38, no. 1, pp. 166-195,
DOI 10.1007/s11390-023-2894-6, February 2023,
.
Appendix A. A primer on Machine learning (extended version of
Section 3)
Throughout its development, Machine Learning (ML) has involved the
use of an increasing amount of computing and storage resources to
operate algorithms taking decisions or deriving insights from data.
In recent generative AI algorithms, or in large language models, the
amount of data used to train models of increasing size in terms of
parameters has grown exponentially. Besides, the size of large
models translates into an increasing memory and computing footprint.
Given this evolution, ML algorithms cannot be executed or trained on
a single machine, and thus ML systems are "distributed" by nature,
regardless of the architecture they adopt.
This appendix introduces the lifecycle of ML systems (Appendix A.1),
then explains how ML systems can be split between entities
fulfilling different roles (Appendix A.2), and introduces the
methods designed to parallelize ML jobs (Appendix A.3) and the
communication methods used between instances to exchange data and
parameters (Appendix A.4). It is an extended version of Section 3,
which focuses on the consequences of the elements presented here for
network traffic patterns and on the associated challenges.
A.1. Machine learning model lifecycle
In machine learning, two approaches can be adopted to
algorithmically derive insights from a dataset: supervised learning
and unsupervised learning. In unsupervised learning, algorithms are
developed and used to find patterns, clusters or relationships among
data without previous knowledge or reference. In supervised
learning, algorithms use a reference data set (or training set),
consisting of data labelled with answers or hints to the solution of
the question being asked, to build (or train) a model to which data
is compared later during the inference (or serving) phase.
The model is the cornerstone of supervised machine learning. It is a
representation of a system that is observed and measured with data.
When the model is trained, it is parametrized or modified so that it
is able to model the behaviour of the system from the dataset used
in the training phase. Then, during the inference phase, data is
presented to the trained model in order to derive information or make
predictions about the system.
+----------+ +----------+ +--------+ +------+ +---------+
| Data | | Data | | Model | | Model| |Inference|
| | => | Pre- | => | | => | Fine-| => | / |
|Collection| |Processing| |Training| |Tuning| | Serving |
+----------+ +----------+ +--------+ +------+ +---------+
Figure 4: Supervised machine learning lifecycle
With the renewed interest in deep neural networks and the strong
development of generative AI, a supervised learning approach has
been favoured, using reference models. Supervised machine learning
follows the cycle presented in Figure 4.
To obtain a reference model, data is first collected from a variety
of sources. This data is then gathered and pre-processed, in order
to clean it (align formatting, erase obvious measurement errors,
_etc._) and potentially to label it. In particular, large data sets
are divided into chunks, or tokens, which are the basic data units
used as inputs or outputs of models. For instance, if the data set
consists of a corpus of texts, tokens are words, subwords or short
sequences of words.
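
As a toy illustration of tokenization, the following Python sketch
maps whitespace-separated words to integer identifiers. Real systems
use subword tokenizers (_e.g._, byte-pair encoding); the corpus
shown is invented for the example.

   # Minimal sketch: build a vocabulary and encode a sentence as token
   # identifiers, the basic data units consumed and produced by models.

   def build_vocab(corpus: list) -> dict:
       """Assign an integer id to each distinct whitespace token."""
       vocab = {}
       for text in corpus:
           for token in text.lower().split():
               vocab.setdefault(token, len(vocab))
       return vocab

   corpus = ["the model is trained on tokens",
             "tokens are the model inputs"]
   vocab = build_vocab(corpus)
   print([vocab[t] for t in "the model is trained".split()])  # [0, 1, 2, 3]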
The pre-processed data is used to train the reference model, _i.e._,
to convert the knowledge to be extracted from the pre-processed data
into a set of parameters, including weights, ruling how the model
will answer future requests. Parameters are variables adjusting the
structure underlying the model that are set during the training
phase. Weights are a subset of the parameters and determine how
strong the ties are between some other parameters in the model.
Once the model has been trained, it can be used to infer from a
request, or, in other words, to serve an insight from a request's
data. Inference operations are not necessarily done by the same
nodes that have trained the model. Models can be transferred to
other worker nodes, sometimes placed close to the users making
requests, to perform inference tasks. Besides, those transferred
models can be re-trained or fine-tuned in order to better serve
requests in the context in which they are made. This can be done,
for instance, with private or locally relevant data.
A.2. System model
The machine learning systems built to meet the requirements of the
use cases presented in Section 2 are fundamentally distributed in
order to meet scaling and time-to-answer requirements. Indeed,
modern large language models use from 125 million to 175 billion
parameters, where each parameter can be encoded with 4 bytes down to
4 bits depending on the precision required in the answer provided by
the model. From a memory perspective, this means that storing a
model and its parameters costs from 62.5 MB (for a model using 125
million parameters encoded on 4 bits each) to 700 GB (for a model
using 175 billion parameters encoded on 4 bytes each). The memory
footprint of modern large language models makes those models
difficult both to manipulate on single nodes and to exchange between
nodes participating in a distributed task.
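The figures above can be reproduced with the following short Python
sketch, given here only as an illustration of the arithmetic:

   # Illustrative sketch: parameter storage cost for the two extreme
   # cases mentioned above.
   def model_size_bytes(num_parameters, bits_per_parameter):
       """Storage needed for the model parameters, in bytes."""
       return num_parameters * bits_per_parameter / 8

   small = model_size_bytes(125e6, 4)    # 125 M parameters, 4 bits
   large = model_size_bytes(175e9, 32)   # 175 B parameters, 4 bytes

   print(f"small model: {small / 1e6:.1f} MB")   # ~62.5 MB
   print(f"large model: {large / 1e9:.1f} GB")   # ~700.0 GB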
In distributed machine learning systems, nodes can play specific
roles, depending on their (hardware) capabilities or their location
in the network.
Those roles, presented in Figure 5, are:
* *Data producer / Owner:*
This entity is producing data from which users are willing to
retrieve insights. It can be a sensor in a network, a server
producing logs and interacting with a set of users, etc.
* *Data manager:*
This entity is in charge of managing the data produced by the data
producer. To that end, it can pre-process, filter, add context (or
metadata), store or transfer data to other entities in the system.
In its data management operations, the data manager has to take
specific care of the security and privacy aspects of data
management. In particular, it must make sure that data is stored
securely and transferred to authorized entities, and that users'
consent and privacy preferences are respected. To do so, it may use
specialized encryption methods and protocols, meeting challenges
presented in Section 4.5.
* *Worker:*
This entity is in charge of processing data retrieved from the data
manager in order to gain insights from it. To do so, it can train a
(potentially large) model during the model training phase, and
personalize this model to better serve specific needs or requests
from users during the inference phase. In some specific cases, for
instance if the model from which the inference is done is small, a
worker might operate alone, but given the operational requirements
of new approaches in machine learning, workers usually need to
collaborate with other workers to complete a task together.
* *(Model) Aggregator:*
When the training, personalization or fine-tuning of a model is
done by several collaborating entities, the aggregator is
responsible for gathering the results of the individual tasks
performed by the workers and for synthesizing them into a general
model. This model can be sent back to the workers for subsequent
processing.
* *Inference engine / Server:*
This entity is in charge of producing inferences, _i.e._ deriving
insights from a request done against a trained model. Those
requests might be contextualized from a conversation, _i.e._ a
sequence of requests and their answers.
* *Coordinator / Orchestrator:*
This entity is in charge of orchestrating the execution of
distributed tasks among the workers in charge. This coordination
role might be played by a centralized node (_e.g._ a parameter
server), by a federation of locally responsible nodes, or in a
decentralized fashion using a consensus protocol.
+----------------------------------+
| Orchestrator |
| +------------+ +------+ User
| | Aggregator | |Server| <----
+--------+ +---------------------+------------+ +------+ Req.
| Data | | ||| ^ ^ ^ \ \ |||
|Producer| | v|| / / / \ \ ||v
+--------+ \ v v| / / / \ \ \ |v
... v +-------+ v / / / \ \ v v
+--------+ | | +----+-+/ / \ v+-+-------+
| Data | | Data |--> |Worker+-+/ v+-+ Worker |
|Producer|--> |Manager|--> +-+----+ +-+ +-+ +-------+-+
+--------+ | |--> +-+----+ | | +-------+-+
... ^ +-------+ +------+ +---------+
+--------+ / : :
| Data | : :
|Producer| : <--Training--> : <--- Inference --->
+--------+ : : (with fine tuning)
Figure 5: Machine Learning system model
Those functional roles can be arranged, managed and executed in many
different ways in an ML system. This arrangement determines how
centralized and/or distributed an ML system is.
A.3. Parallelization modes
In model-based machine learning, training and inference phases
follow a pipeline similar to the one presented in Figure 6. As
mentioned in previous sections, given the size of currently used
models and the amount of data required to either train or infer from
a model, machine learning workloads need to be distributed. This
distribution can follow different patterns (or a mix of those
patterns): data parallelism, tensor parallelism, pipeline
parallelism or mixture-of-experts parallelism. Those patterns are
presented in the following subsections.
+------+
+-+----+ | +-+ +-+ +---+
+-+----+ | | |p| |p| | o |
+-+----+ | | | ^|a+ +a+ | u |
| | | | | / |r|\ /|r|\ | t |
| | | | | ==> * |a| * |a| * ==> | p |
| Data | | | | \ |m|/ \|m|/ | u |
| | | +-+ v|s+ +s+ | t |
| | +-+ |.| |.| | |
| +-+ +-+ +-+ +---+
+------+
Figure 6: Model-based AI pipeline
A.3.1. Data parallelism
Data parallelism is the oldest among the parallelization patterns we
highlighted. It was introduced in [DataParallelism]. As presented
in Figure 7, data parallelism consists in partitioning the data used
in a given machine learning task into a set of batches that are used
to train or tune the model. In data parallelism, a node manipulates
a complete model and changes its parameters using the data partition
it has been allocated. The results of the tasks performed in
parallel by the workers are aggregated by an aggregator node at the
end of the pipeline. The model parameters resulting from this
aggregation are transmitted back to the workers for future use of
the model, so that each worker benefits from the work done by the
others in parallel.
+-+ +-+ +---+
+-+----+ |p| |p| |i o|
| | ^|a+ +a+ |n u|
| | / |r|\ /|r|\ |t t|
+------+ | Data | ==> * |a| * |a| * ==> |e p|
+-+----+ | | 01 | \ |m|/ \|m|/ |r u| +---+
+-+----+ | | ^ | | v|s+ +s+ |i t| \ | o |
+-+----+ | | | / | + |.| |.| |m | v | u |
| | | | | +------+ +-+ +-+ +---+ | t |
| | | | | | p |
| Data | | | | +-+ +-+ +---+ | u |
| | | +-+ \ +-+----+ |p| |p| |i o| ^ | t |
| | +-+ v | | ^|a+ +a+ |n u| / | |
| +-+ | | / |r|\ /|r|\ |t t| +---+
+------+ | Data | ==> * |a| * |a| * ==> |e p|
| 02 | \ |m|/ \|m|/ |r u|
| | v|s+ +s+ |i t|
| + |.| |.| |m |
+------+ +-+ +-+ +---+
Figure 7: Data-parallel AI pipeline
In this pattern, the workers are only weakly synchronized, once the
model has been aggregated and its parameters sent back. As a
result, data parallelism can sustain latencies of up to several
seconds in the synchronization traffic. Yet, as parameters are sent
back by the aggregator to all the nodes for the entire model, the
volume of data exchanged between training iterations or inference
tasks can be quite large. Besides, the aggregator is a focal point
in the traffic between nodes, which can raise traffic management
challenges.
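As an illustration, the following Python sketch (framework-agnostic
and purely illustrative) mimics data parallelism on a toy one-
parameter model y = w * x, with two workers computing gradients on
their own data partitions and an aggregator averaging them:

   # Illustrative sketch of data parallelism on a toy model.
   data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
   partitions = [data[:2], data[2:]]   # one partition per worker
   w = 0.0                             # model parameter, replicated

   def worker_gradient(w, partition):
       # Each worker uses its full copy of the model on its own data.
       return sum(2 * (w * x - y) * x for x, y in partition) / len(partition)

   for step in range(100):
       grads = [worker_gradient(w, p) for p in partitions]  # in parallel
       # Aggregator: average the gradients, then send the updated
       # parameter back to every worker.
       w -= 0.01 * sum(grads) / len(grads)

   print(round(w, 3))   # converges towards 2.0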
A.3.2. Model parallelism
Over the last few years, models involved in machine learning tasks
have increased in size, making their use by single nodes complex.
To allow the training of, and inference from, larger models, some
parallelization patterns have been designed to split models among
several worker nodes: *pipeline parallelism* and *tensor
parallelism*.
A.3.2.1. Pipeline parallelism
Pipeline parallelism, described in [PipelineParallelism], takes
advantage of the fact that models used in deep neural networks are
structured in layers. In pipeline parallelism, as shown in Figure 8,
a model is separated into stages, each consisting of a few consecutive
layers of the entire model. Each stage is allocated to a separate
worker node. Each stage is executed using the intermediate results
from the previous stage, and after each iteration, the parameters
from adjacent stages are used to refine the model's stage held by the
worker node.
+------+
+-+----+ | +-+ +---+ +-+ +---+
+-+----+ | | |p| |i o| |p| | o |
+-+----+ | | | ^|a+ |n u| +a+ | u |
| | | | | / |r|\ |t t| /|r|\ | t |
| | | | | ==> * |a| * ==> |e p| ==> * |a| * ==> | p |
| Data | | | | \ |m|/ |r u| \|m|/ | u |
| | | +-+ v|s+ |i t| +s+ | t |
| | +-+ : |.| : |m | : |.| : | |
| +-+ : +-+ : +---+ : +-+ : +---+
+------+ : : : :
: Stage 1 : : Stage 2 :
Figure 8: Pipeline model-parallel AI pipeline
In this pattern, the communication volume between nodes can be
lower than in data parallelism, as each node only communicates with
the nodes holding adjacent layers. Yet, as a stage iteration can be
faster to execute, communications can be more frequent.
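The following Python sketch (purely illustrative) shows the
principle: each simulated worker holds a stage made of a few
consecutive layers and forwards its intermediate activations to the
next stage, one micro-batch at a time:

   # Illustrative sketch of pipeline parallelism with two stages.
   def layer(scale):
       return lambda activations: [scale * v for v in activations]

   stage_1 = [layer(2), layer(3)]   # layers held by worker 1
   stage_2 = [layer(5), layer(7)]   # layers held by worker 2

   def run_stage(layers, activations):
       for f in layers:
           activations = f(activations)
       return activations

   outputs = []
   for batch in [[1.0, 2.0], [3.0, 4.0]]:       # micro-batches
       # In a real deployment, this hand-over is a network transfer
       # between the two worker nodes.
       intermediate = run_stage(stage_1, batch)
       outputs.append(run_stage(stage_2, intermediate))

   print(outputs)   # [[210.0, 420.0], [630.0, 840.0]]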
A.3.2.2. Tensor parallelism
Tensor parallelism, stemming from the work presented in
[TensorParallelism], takes advantage of the fact that training and
inference operations in model-based machine learning rely on
operations on matrices that can be split along several dimensions.
In comparison with pipeline parallelism, tensor parallelism is a
pattern in which layers are split into chunks that can be operated
on by different nodes, as shown in Figure 9. Tensor parallelism can
be naively presented as a split in the model parameter space rather
than along the layer / stage dimension.
+---+ +---+ +---+
|p p+ +p p+ |i o|
|a a|\ /|a a|\ |n u|
+------+ |r r| * |r r| * ==> |t t|
+-+----+ | |a t|/ \|a t|/ |e p| +---+
+-+----+ | | ^ |m 1+ +m 1+ |r .| \ | o |
+-+----+ | | | / +---+ +---+ +---+ v | u |
| | | | | ^ ^ | t |
| | | | | | | | p |
| Data | | | | v v | u |
| | | +-+ \ +---+ +---+ +---+ ^ | t |
| | +-+ v |p p+ +p p+ |i o| / | |
| +-+ |a a|\ /|a a|\ |n u| +---+
+------+ |r r| * |r r| * ==> |t t|
|a t|/ \|a t|/ |e p|
|m 2+ +m 2+ |r .|
+---+ +---+ +---+
Figure 9: Tensor model-parallel AI pipeline
In tensor parallelism, nodes holding chunks of the same model (or
of the same stage) need to communicate during the execution of a
model (or stage) iteration. This imposes even tighter latency
requirements on inter-node communications compared with pipeline
parallelism.
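As a purely illustrative Python sketch, the matrix-vector product
y = W x below is split row-wise between two simulated workers, each
computing part of the output that is then gathered through a
collective communication step:

   # Illustrative sketch of tensor parallelism on a single layer.
   W = [[1, 2],
        [3, 4],
        [5, 6],
        [7, 8]]
   x = [1, 1]

   chunk_1, chunk_2 = W[:2], W[2:]          # one chunk per worker

   def partial_matvec(rows, vector):
       return [sum(w * v for w, v in zip(row, vector)) for row in rows]

   # Each worker computes its share of the layer; the partial outputs
   # are then gathered (a collective communication step).
   y = partial_matvec(chunk_1, x) + partial_matvec(chunk_2, x)
   print(y)   # [3, 7, 11, 15]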
A.3.3. Mixture of Experts parallelism
Mixture of experts (MoE) parallelism stems from an idea introduced
in [MoEParallelism1] and adapted to deep neural networks in
[MoEParallelism2]. In MoE parallelism, nodes hold smaller but
specialized models, trained over a smaller amount of data and
holding fewer parameters. When a request or a specific token is
submitted to an MoE model, a gateway node, the router, decides which
model instance to use for the specific token or request that has
been submitted. After the selected worker
has executed the task on the token or request, its result is
returned and is considered as the result given by the whole model.
+---+ +---+
|p s+ +p s+
+------+ |a e|\ /|a e|\
+-+----+ | |r t| * |r t| * +---+
+-+----+ | | ^ |a .|/ \|a .|/ \ | o |
+-+----+ | | | / |m 1+ +m 1+ v | u |
| | | | | +------+ +---+ +---+ | t |
| | | | | ==> |Router| --------------- | p |
| Data | | | | +------+ +---+ +---+ | u |
| | | +-+ |p s+ +p s| | t |
| | +-+ |a e|\ /|a e| | |
| +-+ |r t| * |r t| +---+
+------+ |a .|/ \|a .|
|m 2+ +m 2|
+---+ +---+
Figure 10: Mixture of Expert parallel AI pipeline
In MoE parallelism, the router plays a major role, and is a
communication focal point.
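The following Python sketch (with a toy, hand-written routing rule
standing in for a learned gating function) illustrates the role of
the router in MoE parallelism:

   # Illustrative sketch of Mixture-of-Experts routing.
   experts = {
       "digits": lambda token: "number: " + token,
       "words":  lambda token: "word: " + token,
   }

   def router(token):
       # Toy routing rule; real routers use a learned gating function.
       return "digits" if token.isdigit() else "words"

   for token in ["42", "cat"]:
       expert = router(token)           # pick one expert per token
       print(experts[expert](token))    # the expert's output stands
                                        # for the whole model's output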
A.4. Collective communication methods
In their development, distributed machine learning systems benefit
from collective communication libraries such as NCCL ([NCCL]) or
RCCL ([RCCL]) to abstract the setup and management of the
connections used to exchange data between worker nodes involved in
distributed training or inference tasks. As presented in [xCCL] or
in [I-D.yao-tsvwg-cco-problem-statement-and-usecases], those
libraries introduce collective communication methods, used in
accordance with the parallelization modes presented in Appendix A.3.
To better explain the main collective communication methods, we
consider a collective consisting of four nodes. Those nodes
exchange pieces of data as part of the execution of a distributed
machine learning task. The topology underlying the connections
between those nodes is not detailed at this stage. The major
collective communication methods are:
A.4.1. Broadcast
This collective communication method is very intuitive to
understand for networking people, as its behaviour is the same as
when a host broadcasts a packet in an IP-based network. In the
broadcast collective communication method, a node (N.1) sends the
same information to all the nodes participating in the collective.
No data transformation is performed during this operation.
+---+
+N.1+
/+---+\
v | v
+---+ v +---+
|N.2| +---+ |N.3|
+---+ |N.4| +---+
+---+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| D.1 | | --- | | --- | | --- | ==> | D.1 | | D.1 | | D.1 | | D.1 |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 11: Broadcast method
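A minimal Python sketch (simulating the four nodes as lists of data
items, without any real network exchange) of the Broadcast semantics
is given below:

   # Illustrative sketch of the Broadcast semantics.
   nodes = [["D.1"], [], [], []]          # N.1 holds D.1

   def broadcast(nodes, src):
       data = list(nodes[src])
       return [list(data) for _ in nodes]

   print(broadcast(nodes, src=0))
   # [['D.1'], ['D.1'], ['D.1'], ['D.1']]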
A.4.2. Reduce
In this collective communication method, the data exchange is
combined with an operation on the received data to produce an output.
During a Reduce operation, the nodes involved in the collective send
their data to an aggregator processing the received inputs _I.1_ ...
_I.4_ using an operator _f_ to obtain _Out_. _Out_ is then provided
to one of the nodes in the collective (N.1). Note that most of the
time, the aggregation is done by one node in the collective,
carrying out the aggregator function.
+---+ +---+ +---+ +---+
|N.1| |N.2| |N.3| |N.4|
+---+ +---+ +---+ +---+
| ^ | | |
| | | | |
v | v v v
+-------------------------+
| Aggregator |
| f(I.1,I.2,I.3,I.4)=Out |
+-------------------------+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| I.1 | | I.2 | | I.3 | | I.4 | ==> | Out | | --- | | --- | | --- |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 12: Reduce method
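The Reduce semantics can be sketched in Python as follows (again a
simulation on plain lists, with summation as an example operator
_f_):

   # Illustrative sketch of the Reduce semantics.
   inputs = [1, 2, 3, 4]                  # I.1 ... I.4, one per node

   def reduce(inputs, op, dst):
       out = op(inputs)                   # aggregation with operator f
       return [out if i == dst else None for i in range(len(inputs))]

   print(reduce(inputs, op=sum, dst=0))
   # [10, None, None, None]               # only N.1 receives Out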
A.4.3. Scatter
In this collective communication method, a node in the collective
splits the data it has into equal-size chunks, and distributes a
different chunk to every other node in the collective, in order to
spread the data evenly (keeping one chunk for itself, as shown in
Figure 13).
+---+
+N.1+
/+---+\
v | v
+---+ v +---+
|N.2| +---+ |N.3|
+---+ |N.4| +---+
+---+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| D.1 | | --- | | --- | | --- | | D.1 | | --- | | --- | | --- |
| D.2 | | --- | | --- | | --- | ==> | --- | | D.2 | | --- | | --- |
| D.3 | | --- | | --- | | --- | | --- | | --- | | D.3 | | --- |
| D.4 | | --- | | --- | | --- | | --- | | --- | | --- | | D.4 |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 13: Scatter method
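A simulation of the Scatter semantics, in the same illustrative
Python style, is shown below:

   # Illustrative sketch of the Scatter semantics.
   source = ["D.1", "D.2", "D.3", "D.4"]  # data held by N.1

   def scatter(source, num_nodes):
       chunk = len(source) // num_nodes
       return [source[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]

   print(scatter(source, num_nodes=4))
   # [['D.1'], ['D.2'], ['D.3'], ['D.4']]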
A.4.4. All-Gather
This collective communication method can be seen as a simultaneous
broadcast done by every node in the collective. Indeed, in an All-
gather operation, every node sends the data it has to every other
node in the collective, so that in the end, every node has a copy of
every piece of data held by the nodes in the collective before the
operation.
+---+
^ |N.1| ^
/ +---+ \
v v
+---+ ^ +---+
|N.2| <--|--> |N.3|
+---+ v +---+
^ ^
\ +---+ /
v |N.4| v
+---+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| D.1 | | --- | | --- | | --- | | D.1 | | D.1 | | D.1 | | D.1 |
| --- | | D.2 | | --- | | --- | ==> | D.2 | | D.2 | | D.2 | | D.2 |
| --- | | --- | | D.3 | | --- | | D.3 | | D.3 | | D.3 | | D.3 |
| --- | | --- | | --- | | D.4 | | D.4 | | D.4 | | D.4 | | D.4 |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 14: All-gather method
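The All-Gather semantics can be simulated in Python as follows
(illustrative only):

   # Illustrative sketch of the All-Gather semantics.
   nodes = [["D.1"], ["D.2"], ["D.3"], ["D.4"]]

   def all_gather(nodes):
       union = [item for node in nodes for item in node]
       return [list(union) for _ in nodes]

   print(all_gather(nodes))
   # every node ends up with ['D.1', 'D.2', 'D.3', 'D.4']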
A.4.5. All-to-All
This collective communication method can be seen as a simultaneous
scatter operation done by every node in the collective. Indeed, in
an All-to-All operation, every node splits the data it holds into
equal-size chunks, and sends one chunk to every other node in the
collective. As a result, the data held by each node after the All-
to-All operation is different, but every node holds roughly the same
amount of data, even if the data volume was not balanced before the
operation.
+---+
^ |N.1| ^
/ +---+ \
v v
+---+ ^ +---+
|N.2| <--|--> |N.3|
+---+ v +---+
^ ^
\ +---+ /
v |N.4| v
+---+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| A.1 | | B.1 | | C.1 | | D.1 | | A.1 | | A.2 | | A.3 | | A.4 |
| A.2 | | B.2 | | C.2 | | D.2 | ==> | B.1 | | B.2 | | B.3 | | B.4 |
| A.3 | | B.3 | | C.3 | | D.3 | | C.1 | | C.2 | | C.3 | | C.4 |
| A.4 | | B.4 | | C.4 | | D.4 | | D.1 | | D.2 | | D.3 | | D.4 |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 15: All-to-All method
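In the same illustrative Python style, the All-to-All semantics
amount to transposing the matrix of chunks held by the nodes:

   # Illustrative sketch of the All-to-All semantics.
   nodes = [["A.1", "A.2", "A.3", "A.4"],
            ["B.1", "B.2", "B.3", "B.4"],
            ["C.1", "C.2", "C.3", "C.4"],
            ["D.1", "D.2", "D.3", "D.4"]]

   def all_to_all(nodes):
       # Chunk j of node i ends up on node j.
       return [[node[j] for node in nodes] for j in range(len(nodes))]

   print(all_to_all(nodes))
   # [['A.1', 'B.1', 'C.1', 'D.1'], ['A.2', 'B.2', 'C.2', 'D.2'], ...]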
A.4.6. All-Reduce
This collective communication method is very similar to the Reduce
method. During an All-Reduce operation, the nodes involved in the
collective send their data to an aggregator processing the received
inputs _I.1_ ... _I.4_ using an operator _f_ to obtain _Out_.
Unlike in the Reduce operation, _Out_ is sent back to every node in
the collective. In the implementation of this method, the
aggregator function can either be carried out by one node in the
collective, or every node can apply the operator _f_ itself on the
full set of inputs it has received, for instance through an
All-Gather operation.
+---+ +---+ +---+ +---+
|N.1| |N.2| |N.3| |N.4|
+---+ +---+ +---+ +---+
| ^ | ^ | ^ | ^
| | | | | | | |
v | v | v | v |
+-------------------------+
| Aggregator |
| f(I.1,I.2,I.3,I.4)=Out |
+-------------------------+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| I.1 | | I.2 | | I.3 | | I.4 | ==> | Out | | Out | | Out | | Out |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 16: All-Reduce method
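An illustrative Python simulation of the All-Reduce semantics, with
summation as the operator _f_, is given below:

   # Illustrative sketch of the All-Reduce semantics.
   inputs = [1, 2, 3, 4]                  # I.1 ... I.4, one per node

   def all_reduce(inputs, op):
       out = op(inputs)                   # aggregation with operator f
       return [out for _ in inputs]       # Out delivered to every node

   print(all_reduce(inputs, op=sum))
   # [10, 10, 10, 10]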
A.4.7. Reduce-Scatter
This collective communication method is a combination of the Reduce
method with the Scatter method. During a Reduce-Scatter operation,
the nodes involved in the collective send their data to an
aggregator processing the received inputs _I.1_ ... _I.4_ using an
operator _f_ to obtain _Out_. Unlike in the All-Reduce operation,
_Out_ is split into equal-size chunks _O.1_ ... _O.4_, and each
chunk is sent to a different node in the collective.
+---+ +---+ +---+ +---+
|N.1| |N.2| |N.3| |N.4|
+---+ +---+ +---+ +---+
| ^ | ^ | ^ | ^
| | | | | | | |
v | v | v | v |
+---------------------------------------+
| Aggregator |
| f(I.1,I.2,I.3,I.4)=(O.1,O.2,O.3,O.4) |
+---------------------------------------+
N.1 N.2 N.3 N.4 N.1 N.2 N.3 N.4
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| I.1 | | I.2 | | I.3 | | I.4 | ==> | O.1 | | O.2 | | O.3 | | O.4 |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Figure 17: Reduce-Scatter method
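Finally, the Reduce-Scatter semantics can be simulated in Python as
follows (here the per-node inputs are vectors, summed element-wise
and then split into per-node chunks):

   # Illustrative sketch of the Reduce-Scatter semantics.
   inputs = [[1, 2, 3, 4],                # vector held by N.1
             [1, 2, 3, 4],                # vector held by N.2
             [1, 2, 3, 4],                # vector held by N.3
             [1, 2, 3, 4]]                # vector held by N.4

   def reduce_scatter(inputs):
       out = [sum(column) for column in zip(*inputs)]   # Reduce (sum)
       return [[out[i]] for i in range(len(inputs))]    # Scatter chunks

   print(reduce_scatter(inputs))
   # [[4], [8], [12], [16]]               # O.1 ... O.4, one per node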
Authors' Addresses
Antoine Fressancourt
Huawei Technologies France S.A.S.U.
18, Quai du Point du Jour
92100 Boulogne-Billancourt
France
Email: antoine.fressancourt@huawei.com
Luigi Iannone
Huawei Technologies France S.A.S.U.
18, Quai du Point du Jour
92100 Boulogne-Billancourt
France
Email: luigi.iannone@huawei.com
David Lou
Huawei Technologies Duesseldorf GmbH
Riesstrasse 25
80992 Munich
Germany
Email: zhe.lou@huawei.com
Dirk Trossen
Huawei Technologies Duesseldorf GmbH
Riesstrasse 25
80992 Munich
Germany
Email: dirk.trossen@huawei.com