M.Sc Thesis

M.Sc Student: Ben-Arye Sariel
Subject: CEETCP: Congestion Control for Data-center Lossless
Department: Department of Electrical and Computer Engineering
Supervisors: Prof. Isaac Keslassy
Dr. Alexander Shpiner


Today, many Data Center Networks (DCNs) rely on lossless Ethernet to better support storage area networks and loss-intolerant applications. However, traditional congestion control mechanisms that were designed for the wide Internet do not fit such lossless DCNs, since they rely on losses as congestion indications. Likewise, alternative approaches, such as Quantized Congestion Notification (QCN) and Explicit Congestion Notification (ECN), not only require middleware modifications, but also perform poorly in large-scale lossless DCNs.

In this research, we argue that in lossless networks, high Round Trip Times (RTTs) constitute a better congestion indication than losses. We support our claim using a novel analytical model of additive-increase-multiplicative-decrease (AIMD) congestion control mechanisms for lossless environments. This model illustrates why commonly used TCP versions may suffer from a significant latency increase in lossless DCNs. In addition, we observe other issues in lossless networks, involving retransmission timers, duplicate ACKs, and congestion spreading.
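To make the idea of an RTT-driven AIMD mechanism concrete, the following is a minimal sketch of a generic window update in which a high RTT sample, rather than a packet loss, triggers the multiplicative decrease. It is not the CEETCP algorithm or the analytical model of the thesis; the function names, the fixed RTT threshold, the step size, and the decrease factor are all hypothetical values chosen only for illustration.

```python
def aimd_rtt_step(cwnd, rtt, rtt_threshold,
                  add_step=1.0, decrease_factor=0.5, min_cwnd=1.0):
    """One update of a generic AIMD window driven by RTT instead of loss.

    All parameters are illustrative assumptions, not values from the thesis.
    """
    if rtt > rtt_threshold:
        # A high RTT is treated as the congestion signal:
        # multiplicative decrease, never going below min_cwnd.
        return max(min_cwnd, cwnd * decrease_factor)
    # No congestion signal: additive increase.
    return cwnd + add_step


def simulate(rtt_samples, rtt_threshold, cwnd=10.0):
    """Apply the update once per RTT sample and record the window after each step."""
    history = []
    for rtt in rtt_samples:
        cwnd = aimd_rtt_step(cwnd, rtt, rtt_threshold)
        history.append(cwnd)
    return history


if __name__ == "__main__":
    # Hypothetical RTT samples (arbitrary time units) with one congested sample.
    print(simulate([100, 100, 300, 100], rtt_threshold=200))
```

In a lossless network no packets are dropped when queues build up, so a loss-driven AIMD sender would never reach its decrease branch; a delay-driven rule like the one sketched here still reacts because queueing shows up as a higher RTT.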

We introduce CEETCP, a novel RTT-based congestion control mechanism for Converged Enhanced Ethernet DCNs. CEETCP combines the losslessness property, the known structured network topology, and key assumptions that are new to TCP to formulate a novel L4 lossless congestion control mechanism. CEETCP is an easy-to-deploy full-path solution for lossless DCNs. We compare CEETCP in various simulations against other congestion control mechanisms that are commonly used in datacenters today. We show how CEETCP avoids the congestion spreading phenomenon even in extreme scenarios, whereas other congestion control mechanisms fail to prevent it. When testing the convergence time of the various algorithms, we show that CEETCP converges to a steady state the fastest. We simulate a full datacenter scenario using a commercial workload, in which we show that CEETCP provides significantly improved performance and avoids the shortcomings found in other congestion control algorithms.