Improving Datacenter Performance with Network Offloading

Author: Yanfang Le
Release: 2020


There has been a recent emergence of distributed systems in datacenters, such as MapReduce and Spark for data analytics and TensorFlow and PyTorch for machine learning. These frameworks are not only computation- and memory-intensive; they also place high demands on the network for distributing data. Fast-growing Ethernet speeds partially absorb this demand. However, as Ethernet speed outgrows CPU processing power, we must not only rethink existing algorithms across the network layers, but we also gain opportunities to innovate with new application designs, such as datacenter resource disaggregation [3] and in-network computation [4, 5, 6].

Fast network devices come with a programmability feature, which enables offloading computation tasks from the CPU to NICs or switches. Network offloading to programmable hardware is a promising approach to relieve processing pressure on the CPU for computation-intensive applications (e.g., Spark) or to reduce network traffic for network-intensive applications (e.g., TensorFlow). However, leveraging programmable hardware effectively is challenging due to its limited memory capacity and restricted programming model. To understand how to exploit network offloading when developing new network stacks, network protocols, and applications, the following question must be answered: how should functionality be divided judiciously between programmable hardware and software, given limited resources and restricted programming models?

Driven by real application demand while exploring the answer to this question, we first propose RoGUE, a new congestion control and recovery mechanism for RDMA over Converged Ethernet that does not rely on PFC while preserving the benefits of running RDMA: low CPU overhead and low latency. To preserve the low-CPU benefit, RoGUE offloads packet pacing to the NIC.
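Packet pacing spreads a flow's transmissions evenly over time instead of sending them in bursts. The sketch below illustrates the idea in software, assuming a fixed target rate; the function name, `rate_bps` parameter, and `send_fn` hook are hypothetical and not part of RoGUE's actual NIC interface, where this logic runs in hardware.

```python
import time

def paced_send(packets, rate_bps, send_fn):
    """Illustrative software pacing: transmit each packet, then wait
    for the inter-packet gap implied by the target rate.

    In RoGUE this pacing is offloaded to the NIC; this Python version
    only demonstrates the spacing computation.
    """
    for pkt in packets:
        send_fn(pkt)
        # Inter-packet gap = packet size in bits / target rate in bits/sec.
        time.sleep(len(pkt) * 8 / rate_bps)
```

For example, pacing 100-byte packets at 8 Mbps inserts a 100 microsecond gap between sends, smoothing bursts that would otherwise build switch queues.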
Although RoGUE achieves strong performance in extensive testbed evaluations, the ideal architecture for congestion control is a centralized packet scheduler [7], which has global visibility into packet reservation requests from all servers. Since all hosts are connected through switches, and emerging programmable switch hardware supports stateful objects, we design a centralized packet scheduler at the switch, called PL2, that provides stable, near-zero queuing in the network by proactively reserving switch buffers for packet bursts in the appropriate time-slots.

Congestion control is an essential component of the networking stack because application demand for the network exceeds link speed. The fundamental way to eliminate network congestion control is to reduce network traffic until application demand no longer exceeds link speed. We observe that we can reduce network traffic for distributed training systems by offloading a critical function, gradient aggregation, to the programmable switch. Each worker in a distributed training system sends gradients over the network to special components, parameter servers, for aggregation, which is a simple add operation. We therefore propose ATP, a network service for in-network aggregation aimed at modern multi-rack, multi-job distributed training (DT) settings. ATP performs decentralized, dynamic, best-effort aggregation; enables efficient and equitable sharing of limited switch resources across simultaneously running DT jobs; and gracefully accommodates heavy contention for switch resources.
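The add operation at the heart of in-network aggregation can be sketched as a per-job accumulator: the switch sums gradient fragments as they arrive from each worker and releases the result once all workers have contributed, so only the aggregate crosses the rest of the network. The class and method names below are illustrative assumptions, not ATP's actual switch program, which is constrained by the hardware's memory and programming model.

```python
class AggregationSlot:
    """Sketch of one switch-resident aggregation slot (hypothetical
    structure, not ATP's real design).

    Accumulates element-wise sums of gradient fragments from
    n_workers workers and releases the aggregate once every worker
    has reported.
    """

    def __init__(self, n_workers, size):
        self.remaining = n_workers
        self.acc = [0.0] * size  # running element-wise sum

    def add(self, grads):
        """Fold one worker's fragment into the accumulator.

        Returns the full aggregate when the last worker arrives,
        otherwise None (the switch holds the partial sum in state).
        """
        for i, g in enumerate(grads):
            self.acc[i] += g
        self.remaining -= 1
        return self.acc if self.remaining == 0 else None
```

With two workers sending `[1, 2, 3]` and `[4, 5, 6]`, the first `add` returns nothing and the second returns `[5, 7, 9]`, which is the only vector that needs to travel onward; the reduction in traffic is what lets ATP relieve congestion at its source.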