Internet-Draft Cloud Resource Abstraction October 2024
Dunbar, et al. Expires 17 April 2025
Workgroup:
NeoTec
Internet-Draft:
draft-dxs-neotec-crossdomain-net-mgnt-dm-00
Updates:
8342 (if approved)
Published:
Intended Status:
Standards Track
Expires:
Authors:
L. Dunbar, Ed.
Futurewei
C. Xie
China Telecom
Q. Sun
China Telecom

Cross-Domain Cloud and Network Resource Management Data Model

Abstract

This document proposes extensions to existing YANG models, as well as new YANG models, to enable the management of cross-domain cloud and network resources. The intent is to provide dynamic resource allocation mechanisms that allow services to scale efficiently across multiple cloud environments and edge computing platforms. By defining unified YANG models for both network and cloud domains, this draft addresses challenges in orchestrating and managing resources in a hybrid environment while maintaining interoperability and dynamic scaling.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 17 April 2025.

1. Introduction

Cloud and edge computing environments are increasingly interconnected with network infrastructure, and modern services require dynamic, cross-domain orchestration to scale efficiently. Services placed in Cloud Data Centers (DCs) change dynamically, often undergoing high-frequency modifications as service requirements evolve. As a result, the network connecting these services must adapt and reconfigure itself in real time to accommodate the service changes.

[Net2Cloud] describes a set of network-related problems that enterprises face when interconnecting their branch offices with dynamic workloads in third-party data centers (Cloud DCs), including the challenges of ensuring reliable, scalable, and efficient network connectivity between enterprise sites and cloud-hosted services. While [Net2Cloud] references mitigation practices, they fall short of addressing the dynamic and rapidly changing nature of services placed in Cloud DCs. More advanced solutions are needed so that the network can adjust in real time to changes in service workloads, resource allocations, bandwidth requirements, and latency constraints driven by cloud-hosted services.

This draft extends existing YANG models or introduces new ones to enable the management of both cloud and network resources in a unified, cross-domain manner. The goal is to optimize dynamic resource allocation, allowing services to scale seamlessly across public clouds, private clouds, and edge computing nodes while ensuring consistency, interoperability, and real-time adaptability of the network to the dynamically changing services placed in Cloud DCs.

2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

3. Problem Statement

Current management models face several limitations:

- Siloed Resource Management: Most current models treat network and cloud resources as separate entities, making cross-domain management inefficient.

- Lack of Dynamic Scaling Support: Many models lack the mechanisms needed to dynamically allocate and reallocate resources across domains based on real-time service demands.

- Inconsistent Interfaces and Data Models: Inconsistent data models across cloud and network platforms hinder seamless integration.

- Limited Support for Edge Environments: Traditional models focus on cloud and core network infrastructure, often overlooking edge computing platforms where latency-sensitive workloads run.

This draft proposes a solution by extending YANG models to facilitate cross-domain resource management and efficient scaling.

4. Data Models Overview

Several existing IETF YANG models, such as ietf-routing [RFC8349], ietf-network-instance [RFC8529], and ietf-l3vpn-svc [RFC8299], offer foundational models for network resource management. However, these models need to be extended to include cloud-specific attributes and edge-related extensions.

The primary design objectives for the extended or new YANG models include:

- Cross-Domain Resource Orchestrator: Provides the high-level orchestration and policies for managing resources across domains, invoking network reconfiguration actions as needed.

- Dynamic Resource Allocation: Handles the overall allocation of resources (compute, storage, network). For 5G networks and beyond, Dynamic Resource Allocation can be used to allocate network resources based on the needs of a federated learning process.

- Dynamic Network Reconfiguration: Reflects the network's dynamic adaptation to cloud services, focusing on real-time network reconfiguration based on cloud workload needs. Extends support for multi-cloud VPNs, multi-segment SD-WAN [MULTI-SEG-SDWAN], and service overlays.

- Edge Node Resource: Edge nodes are computing resources placed at the edge of the network, closer to the end user or data source, to reduce latency and improve performance for time-sensitive or high-bandwidth applications. Edge nodes can be located in a telecom provider's edge data centers, such as edge DCs for 5G or regional micro DCs. This model extends existing models to manage compute and storage resources on edge platforms.

How they work together:

- High Level Orchestration (Cross-Domain Resource Orchestrator): The orchestrator manages the overall allocation of cloud and network resources based on policies and telemetry.

- Resource Requests (Resource Allocation): When the orchestrator detects a need for resource changes (e.g., increased compute or bandwidth), it triggers resource requests. Network resource allocation will adapt based on these requests.

- Real Time Adjustments (Dynamic Network Reconfiguration): As resource demands change (due to dynamic cloud services), the network reconfigures in real time. This includes adjusting bandwidth, latency, or other parameters to ensure that the network supports the new service requirements effectively.

- Edge Node Integration (Edge Node Resource): The network reconfiguration model can dynamically adjust the network to ensure optimal connectivity between edge nodes and cloud services, allowing latency-sensitive or bandwidth-intensive applications to operate efficiently.

Together, these models provide a comprehensive framework for orchestrating, allocating, and dynamically adjusting network and compute resources across cloud, edge, and network domains. The Dynamic Network Reconfiguration model complements the others by ensuring that the network reacts in real time to the dynamic nature of cloud services.

5. Cross-Domain Resource Orchestrator

The following is an example structure of a YANG model for a Cross-Domain Resource Orchestrator. This model enables the orchestration of cloud and network resources, allowing efficient dynamic resource allocation and scaling across multiple cloud and network domains.

module: cross-domain-orchestrator
   +--rw orchestrator
      +--rw policies
      |  +--rw policy* [policy-id]
      |     +--rw policy-id               string
      |     +--rw policy-name             string
      |     +--rw policy-type             enumeration
      |     +--rw status                  enumeration
      |     +--rw conditions
      |        +--rw cpu-utilization-threshold    uint8
      |        +--rw memory-utilization-threshold uint8
      |        +--rw latency-threshold            uint32
      |        +--rw bandwidth-threshold          uint32
      +--rw telemetry
      |  +--rw domain* [domain-id]
      |     +--rw domain-id              string
      |     +--rw domain-type            enumeration
      |     +--rw resources
      |     |  +--rw cpu                decimal64
      |     |  +--rw memory             uint64
      |     |  +--rw storage            uint64
      |     |  +--rw bandwidth          uint64
      |     +--rw utilization
      |        +--rw cpu-utilization          decimal64
      |        +--rw memory-utilization       decimal64
      |        +--rw storage-utilization      decimal64
      |        +--rw bandwidth-utilization    decimal64
      |        +--rw latency                  uint32
      +--rpc allocate-resources
         +--input
         |  +--rw service-id           string
         |  +--rw resource-type        enumeration
         |  +--rw amount               decimal64
         |  +--rw domain-id            string
         +--output
            +--ro allocation-status    enumeration
            +--ro allocated-amount     decimal64

Explanation of the structure

- orchestrator:

The top-level container for managing the orchestration of resources across cloud and network domains.

- policies:

Defines the set of policies that govern resource allocation.

Each policy has a policy-id, a policy-name, a policy-type (the purpose of the policy), a status, and conditions (the CPU, memory, latency, and bandwidth thresholds that trigger the policy).

- telemetry

Collects real-time telemetry data from different domains (e.g., cloud, edge, network).

Each domain contains information about its resources (CPU, memory, storage, bandwidth) and their utilization metrics (percentage of usage, current latency).

- Action: allocate-resources (as an RPC or YANG action):

This defines the action that a service or orchestrator can call to request dynamic allocation of resources in real-time.
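
As a rough illustration of how the policies and telemetry containers relate, the following Python sketch checks one domain's utilization against a policy's conditions. The class and function names are hypothetical, and the units (percent for utilization, milliseconds for latency) are assumptions, since the tree does not state them. The bandwidth-threshold leaf is left out because its unit is not specified.

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    """Subset of the 'conditions' container (units are assumptions)."""
    cpu_utilization_threshold: int     # percent, 0-100
    memory_utilization_threshold: int  # percent, 0-100
    latency_threshold: int             # milliseconds


@dataclass
class Utilization:
    """Subset of a domain's 'utilization' container."""
    cpu_utilization: float     # percent
    memory_utilization: float  # percent
    latency: int               # milliseconds


def breached_conditions(cond: Conditions, util: Utilization) -> list:
    """Return the names of the leaves whose threshold is exceeded."""
    breached = []
    if util.cpu_utilization > cond.cpu_utilization_threshold:
        breached.append("cpu-utilization")
    if util.memory_utilization > cond.memory_utilization_threshold:
        breached.append("memory-utilization")
    if util.latency > cond.latency_threshold:
        breached.append("latency")
    return breached


policy_conditions = Conditions(80, 90, 50)
edge_domain = Utilization(cpu_utilization=85.0, memory_utilization=40.0, latency=60)
print(breached_conditions(policy_conditions, edge_domain))
```

With these sample values, the CPU and latency thresholds are exceeded while memory is not. A real orchestrator would decide what to do with a breach according to the policy-type, which this sketch does not model.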

Example for using the Action:

- A cloud-hosted service detects a spike in user traffic and requests an additional 50 Mbps of network bandwidth. The service submits an allocate-resources request.

- The orchestration system processes the request based on the current telemetry data (bandwidth utilization, network latency) and any active policies (scaling, SLA compliance, etc.). It checks if the additional bandwidth is available in the requested domain.

- If the resources are available, the system returns success. If not, it returns failure.

- If successful, it shows how much bandwidth (e.g., 50 Mbps) was allocated to the service.
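
The request flow above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the behavior this document mandates: the Orchestrator class, the "success" and "failure" status strings, and the domain and service identifiers are all hypothetical, and policy evaluation is omitted so that only the availability check is shown.

```python
from dataclasses import dataclass


@dataclass
class AllocationOutput:
    """Mirrors the RPC output; the status strings are assumed enum values."""
    allocation_status: str
    allocated_amount: float


class Orchestrator:
    def __init__(self, free_capacity):
        # Maps (domain-id, resource-type) to the unallocated amount.
        self.free_capacity = dict(free_capacity)
        # Ledger of granted amounts keyed by (service-id, domain-id, resource-type).
        self.granted = {}

    def allocate_resources(self, service_id, resource_type, amount, domain_id):
        """Grant the full amount if the domain has it free, else fail."""
        key = (domain_id, resource_type)
        free = self.free_capacity.get(key, 0.0)
        if amount <= 0 or amount > free:
            return AllocationOutput("failure", 0.0)
        self.free_capacity[key] = free - amount
        ledger_key = (service_id, domain_id, resource_type)
        self.granted[ledger_key] = self.granted.get(ledger_key, 0.0) + amount
        return AllocationOutput("success", amount)


# 200 Mbps of bandwidth is free in the hypothetical domain "cloud-dc-1".
orch = Orchestrator({("cloud-dc-1", "bandwidth"): 200.0})
first = orch.allocate_resources("video-svc", "bandwidth", 50.0, "cloud-dc-1")
second = orch.allocate_resources("video-svc", "bandwidth", 500.0, "cloud-dc-1")
print(first)
print(second)
```

The first request succeeds and reports 50 Mbps allocated; the second asks for more than remains and fails. Whether a partial grant should be allowed instead of an outright failure is a design choice the tree leaves open.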

6. Dynamic Resource Allocation for Federated Learning

The resource needs for federated learning fluctuate depending on the phase of the training process, model complexity, and number of devices involved. Dynamic Resource Allocation for Federated Learning is a specific type or use case of a Cross-Domain Orchestrator.

module: dynamic-resource-allocation-federated-learning
   +--rw dynamic-allocation
      +--rw federated-learning
      |  +--rw training-job* [job-id]
      |     +--rw job-id                        string
      |     +--rw model-type                    string
      |     +--rw device-type                   enumeration
      |     +--rw required-cpu                  decimal64
      |     +--rw required-memory               uint64
      |     +--rw required-storage              uint64
      |     +--rw required-bandwidth            uint64
      |     +--rw latency-tolerance             uint32
      +--rw policies
      |  +--rw policy* [policy-id]
      |     +--rw policy-id                     string
      |     +--rw policy-name                   string
      |     +--rw policy-type                   enumeration
      |     +--rw conditions
      |        +--rw cpu-utilization-threshold          uint8
      |        +--rw memory-utilization-threshold       uint8
      |        +--rw bandwidth-utilization-threshold    uint8
      |        +--rw latency-threshold                  uint32
      +--rw telemetry
      |  +--rw domain* [domain-id]
      |     +--rw domain-id                     string
      |     +--rw domain-type                   enumeration
      |     +--rw resources
      |     |  +--rw cpu                        decimal64
      |     |  +--rw memory                     uint64
      |     |  +--rw storage                    uint64
      |     |  +--rw bandwidth                  uint64
      |     +--rw utilization
      |        +--rw cpu-utilization            decimal64
      |        +--rw memory-utilization         decimal64
      |        +--rw storage-utilization        decimal64
      |        +--rw bandwidth-utilization      decimal64
      |        +--rw latency                    uint32
      +--rpc allocate-resources
         +--input
         |  +--rw job-id                        string
         |  +--rw resource-type                 enumeration
         |  +--rw amount                        decimal64
         |  +--rw domain-id                     string
         +--output
            +--ro allocation-status             enumeration
            +--ro allocated-amount              decimal64
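
One way to use the training-job and telemetry data together is to filter the domains whose unused capacity covers a job's required-* leaves. The Python sketch below illustrates that idea only; it assumes that utilization values are percentages and that latency and latency-tolerance share a unit, neither of which the tree states. The function names and sample values are hypothetical.

```python
def free_capacity(total, utilization_percent):
    """Unused share of a resource, assuming utilization is a percentage."""
    return total * (1.0 - utilization_percent / 100.0)


def domain_fits(job, domain):
    """True if the domain's unused capacity covers every required-* leaf."""
    resources = domain["resources"]
    utilization = domain["utilization"]
    for name in ("cpu", "memory", "storage", "bandwidth"):
        free = free_capacity(resources[name], utilization[name + "-utilization"])
        if job["required-" + name] > free:
            return False
    # Assumes latency and latency-tolerance use the same unit.
    return utilization["latency"] <= job["latency-tolerance"]


def candidate_domains(job, domains):
    """Return the domain-id of every domain that can host the job."""
    return [d["domain-id"] for d in domains if domain_fits(job, d)]


job = {"job-id": "fl-job-1", "required-cpu": 4.0, "required-memory": 8,
       "required-storage": 50, "required-bandwidth": 100,
       "latency-tolerance": 20}

domains = [
    {"domain-id": "edge-1",
     "resources": {"cpu": 16.0, "memory": 64, "storage": 500, "bandwidth": 1000},
     "utilization": {"cpu-utilization": 50.0, "memory-utilization": 50.0,
                     "storage-utilization": 50.0, "bandwidth-utilization": 50.0,
                     "latency": 10}},
    {"domain-id": "cloud-1",
     "resources": {"cpu": 64.0, "memory": 256, "storage": 2000, "bandwidth": 10000},
     "utilization": {"cpu-utilization": 50.0, "memory-utilization": 50.0,
                     "storage-utilization": 50.0, "bandwidth-utilization": 50.0,
                     "latency": 45}},
]
print(candidate_domains(job, domains))
```

Here only the edge domain qualifies: the cloud domain has ample capacity, but its latency exceeds the job's tolerance.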




7. Dynamic Network Reconfiguration

This section describes a YANG structure for Dynamic Network Reconfiguration, which supports the scenario where services placed in Cloud Data Centers (DCs) undergo frequent changes, requiring the network to dynamically adapt and reconfigure itself in real time. This structure enables the dynamic adjustment of network parameters (such as bandwidth, latency, QoS, and paths) based on evolving service requirements.

module: dynamic-network-reconfiguration
   +--rw network-reconfiguration
      +--rw telemetry
      |  +--rw bandwidth-utilization         decimal64
      |  +--rw latency                       uint32
      |  +--rw packet-loss-rate              decimal64
      |  +--rw jitter                        decimal64
      |  +--rw qos-level                     string
      +--rw policies
      |  +--rw policy* [policy-id]
      |     +--rw policy-id                  string
      |     +--rw policy-name                string
      |     +--rw policy-type                enumeration
      |     +--rw conditions
      |        +--rw bandwidth-utilization-threshold    uint8
      |        +--rw latency-threshold                  uint32
      |        +--rw packet-loss-threshold              decimal64
      |        +--rw qos-threshold                      string
      +--rpc reconfigure-network
         +--input
         |  +--rw service-id                 string
         |  +--rw target-latency             uint32
         |  +--rw target-bandwidth           uint64
         |  +--rw target-qos                 string
         |  +--rw target-packet-loss         decimal64
         |  +--rw target-jitter              decimal64
         +--output
            +--ro reconfiguration-status     enumeration
            +--ro achieved-latency           uint32
            +--ro achieved-bandwidth         uint64
            +--ro achieved-qos               string
            +--ro achieved-packet-loss       decimal64
            +--ro achieved-jitter            decimal64


Explanation of the structure:

The telemetry container collects real-time data about the current state of the network, which is used to determine whether network reconfiguration is needed to accommodate changes in cloud services.

Policies govern how and when the network should be dynamically reconfigured. Each policy has specific conditions that, when met, trigger network reconfiguration.

The reconfigure-network action (or RPC) is the primary mechanism for dynamically reconfiguring the network in real time. When triggered, it adjusts the network settings to meet the new requirements of services running in cloud data centers.

How it works together:

The system continuously monitors network conditions (bandwidth usage, latency, packet loss, jitter) using telemetry data. As services in cloud data centers evolve, this data helps determine whether the network is performing within acceptable limits.

When telemetry data indicates that certain thresholds are being breached (e.g., high latency or packet loss), policies are triggered. For example, if bandwidth usage exceeds 80%, the system may allocate more bandwidth to ensure the services continue to operate smoothly.

The reconfigure-network action is called in real time to adjust the network parameters, including bandwidth, latency, packet loss, and QoS, to accommodate changes in cloud services. This action ensures the network can keep up with the frequent modifications to services hosted in the cloud.
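
The 80% bandwidth example can be sketched in Python as below. The function name, the 25% increase step, and the choice to populate only service-id and target-bandwidth are assumptions made for illustration; this document does not define how the new target is computed.

```python
def build_reconfigure_input(service_id, telemetry, conditions,
                            current_bandwidth, increase_percent=25):
    """Return reconfigure-network input if the bandwidth policy fires, else None."""
    utilization = telemetry["bandwidth-utilization"]
    if utilization <= conditions["bandwidth-utilization-threshold"]:
        return None
    # Integer arithmetic keeps target-bandwidth a whole number (uint64 in the tree).
    target = current_bandwidth + current_bandwidth * increase_percent // 100
    return {"service-id": service_id, "target-bandwidth": target}


conditions = {"bandwidth-utilization-threshold": 80}
busy = {"bandwidth-utilization": 85.0}
quiet = {"bandwidth-utilization": 60.0}

print(build_reconfigure_input("svc-1", busy, conditions, 400))
print(build_reconfigure_input("svc-1", quiet, conditions, 400))
```

At 85% utilization the policy fires and the sketch requests 500 units of bandwidth for a service currently at 400; at 60% no request is built.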

8. Edge Computing Node

Below is the YANG tree structure designed to enable resource allocation close to the end-user or device, specifically optimized for latency-sensitive workloads. It includes support for Mobile Edge Computing (MEC) and integration with 5G edge computing. The structure allows for dynamic allocation of compute, storage, and network resources, with real-time adjustments based on the needs of low-latency applications like IoT, AR/VR, and real-time analytics.

module: mec-5g-resource-allocation
   +--rw edge-resource-allocation
      +--rw telemetry
      |  +--rw latency                       uint32
      |  +--rw bandwidth-utilization         decimal64
      |  +--rw edge-cpu-utilization          decimal64
      |  +--rw edge-memory-utilization       decimal64
      |  +--rw edge-storage-utilization      decimal64
      +--rw policies
      |  +--rw policy* [policy-id]
      |     +--rw policy-id                  string
      |     +--rw policy-name                string
      |     +--rw policy-type                enumeration
      |     +--rw conditions
      |        +--rw latency-threshold                    uint32
      |        +--rw bandwidth-utilization-threshold      uint8
      |        +--rw edge-cpu-utilization-threshold       uint8
      |        +--rw edge-memory-utilization-threshold    uint8
      +--rw resource-allocation
      |  +--rw workload* [workload-id]
      |     +--rw workload-id                string
      |     +--rw workload-type              enumeration
      |     +--rw required-latency           uint32
      |     +--rw required-bandwidth         uint64
      |     +--rw required-edge-cpu          decimal64
      |     +--rw required-edge-memory       uint64
      |     +--rw required-edge-storage      uint64
      +--rpc allocate-edge-resources
         +--input
         |  +--rw workload-id               string
         |  +--rw target-latency            uint32
         |  +--rw target-bandwidth          uint64
         |  +--rw target-edge-cpu           decimal64
         |  +--rw target-edge-memory        uint64
         |  +--rw target-edge-storage       uint64
         +--output
            +--ro allocation-status         enumeration
            +--ro achieved-latency          uint32
            +--ro achieved-bandwidth        uint64
            +--ro allocated-edge-cpu        decimal64
            +--ro allocated-edge-memory     uint64
            +--ro allocated-edge-storage    uint64
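
Because the allocate-edge-resources output reports achieved and allocated values alongside a status, a caller may want to confirm that the result actually meets its targets. The Python sketch below shows one possible check; the function name and the "success" status string are assumptions, as the tree does not list the enumeration values.

```python
def allocation_satisfied(request, response):
    """Compare allocate-edge-resources output against the requested targets."""
    if response["allocation-status"] != "success":  # assumed enum value
        return False
    # Latency must not exceed the target; everything else must reach it.
    if response["achieved-latency"] > request["target-latency"]:
        return False
    if response["achieved-bandwidth"] < request["target-bandwidth"]:
        return False
    for name in ("edge-cpu", "edge-memory", "edge-storage"):
        if response["allocated-" + name] < request["target-" + name]:
            return False
    return True


request = {"workload-id": "ar-app-1", "target-latency": 10,
           "target-bandwidth": 200, "target-edge-cpu": 2.0,
           "target-edge-memory": 4, "target-edge-storage": 20}

met = {"allocation-status": "success", "achieved-latency": 8,
       "achieved-bandwidth": 200, "allocated-edge-cpu": 2.0,
       "allocated-edge-memory": 4, "allocated-edge-storage": 20}

# Same response, but the achieved latency misses the 10-unit target.
too_slow = dict(met, **{"achieved-latency": 15})

print(allocation_satisfied(request, met))
print(allocation_satisfied(request, too_slow))
```

The first response satisfies every target; the second is rejected on latency even though its status reports success.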




9. Security Considerations

Authentication and Authorization: The orchestrator must authenticate requests using secure credentials (e.g., OAuth tokens, X.509 certificates).

Data Encryption: All data exchanged between domains, especially telemetry and resource allocation requests, must be encrypted using protocols like TLS.

Access Control: Role-Based Access Control (RBAC) must be implemented to ensure that only authorized users can request or allocate resources.
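
As a minimal illustration of the access control point, the Python sketch below gates operations on a role lookup. The role names and the role-to-operation mapping are hypothetical and are not defined by this document; only the operation names come from the models above.

```python
# Hypothetical role-to-operation mapping; not defined by this document.
ROLE_PERMISSIONS = {
    "operator": {"allocate-resources", "reconfigure-network",
                 "allocate-edge-resources"},
    "viewer": set(),
}


def is_authorized(role, operation):
    """Return True only if the role is known and permits the operation."""
    return operation in ROLE_PERMISSIONS.get(role, set())


print(is_authorized("operator", "allocate-resources"))
print(is_authorized("viewer", "allocate-resources"))
print(is_authorized("unknown-role", "reconfigure-network"))
```

Unknown roles fall through to an empty permission set, so the check denies by default.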

10. IANA Considerations

TBD

11. References

11.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8299]
Wu, Q., Ed., Litkowski, S., Tomotaki, L., and K. Ogaki, "YANG Data Model for L3VPN Service Delivery", RFC 8299, DOI 10.17487/RFC8299, <https://www.rfc-editor.org/info/rfc8299>.
[RFC8349]
Lhotka, L., Lindem, A., and Y. Qu, "A YANG Data Model for Routing Management (NMDA Version)", RFC 8349, DOI 10.17487/RFC8349, <https://www.rfc-editor.org/info/rfc8349>.
[RFC8529]
Berger, L., Hopps, C., Lindem, A., Bogdanovic, D., and X. Liu, "YANG Data Model for Network Instances", RFC 8529, DOI 10.17487/RFC8529, <https://www.rfc-editor.org/info/rfc8529>.

11.2. Informative References

[Net2Cloud]
Dunbar, L., et al., "Net2Cloud", <https://datatracker.ietf.org/doc/draft-ietf-rtgwg-net2cloud-problem-statement/>.

Acknowledgements

The authors would like to thank the following for discussions and for providing input to this document: xxx.

Contributors

Authors' Addresses

Linda Dunbar (editor)
Futurewei
United States of America
ChongFeng Xie
China Telecom
China
Qiang Sun
China Telecom
China