D2.3 RoCE and IPoverBXI Evaluation Report

Edited by Nikolaos D. Kallimanis (FORTH), Grégoire Pichon (Atos)


Giorgos Saloustros (FORTH), Nikolaos D. Kallimanis (FORTH), Nikolaos Chrysos (FORTH), Jonathan Espié Caullet (Atos), Sylvain Goudeau (Atos), Grégoire Pichon (Atos)

Executive summary

In the RED-SEA project, we study new interconnect technologies built around BXI. When designing new solutions for RDMA interconnects, it is important to know the performance levels of existing solutions. Besides raw performance numbers, it is also important to know how existing interconnects behave under real applications.

In this deliverable, we report the outputs of Tasks T2.1 and T2.2. In Task T2.1, we examine the performance of RoCE and InfiniBand interconnects under relevant workloads. In particular, we evaluate how they perform under Tebis, a distributed persistent key-value store framework. In addition, we evaluate their performance under the GSAS environment. In Task T2.2, we evaluate the performance of IPoverBXI. Supporting IP inside the network allows important applications and frameworks to run unmodified, which is a key offering of HPC interconnects.
In this deliverable, we also report enhancements to IPoverBXI, implemented in the Ptlnet code, which have brought significant performance improvements for unmodified socket-based applications running on top of BXI.

Our key findings are: 1) with regard to GSAS applications, the bandwidth of modern interconnects, i.e., InfiniBand, suffers significantly when the traffic consists of a high number of small packets (i.e., a few hundred bytes at most); and 2) with regard to Tebis and key-value stores, frameworks cannot easily saturate the available network bandwidth with small messages (64 B to 512 B), because the packet rate of the network card saturates first, although higher-level bottlenecks also exist in today's systems for such small packets. Regarding IPoverBXI, our software enhancements have increased the TCP/UDP bandwidth from 10-40 Gbit/s to 70-80 Gbit/s, almost independently of the MTU.
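
The packet-rate effect behind the second finding can be illustrated with a back-of-the-envelope model. The sketch below is not taken from our measurements: the 100 Gbit/s link speed and the 20 million messages/s limit are hypothetical values chosen only for illustration, the function name `achievable_gbps` is ours, and protocol header overhead is ignored.

```python
def achievable_gbps(msg_bytes, max_msgs_per_sec, link_gbps):
    """Upper bound on payload throughput in Gbit/s.

    Throughput is capped by whichever limit is reached first:
    the message rate of the network card or the link bandwidth.
    Header overhead is ignored in this simplified model.
    """
    rate_limited_gbps = msg_bytes * 8 * max_msgs_per_sec / 1e9
    return min(rate_limited_gbps, link_gbps)


# Hypothetical network card, for illustration only (not measured values).
LINK_GBPS = 100.0          # assumed link bandwidth
MAX_MSGS_PER_SEC = 20e6    # assumed message-rate limit of the card

if __name__ == "__main__":
    for size in (64, 128, 256, 512, 4096):
        gbps = achievable_gbps(size, MAX_MSGS_PER_SEC, LINK_GBPS)
        limit = "link" if gbps >= LINK_GBPS else "message rate"
        print(f"{size:5d} B: {gbps:6.2f} Gbit/s (limited by {limit})")
```

Under these assumed parameters, messages of 64 B to 512 B remain limited by the message rate, and only larger messages reach the link bandwidth. The actual crossover point depends on the network card and on the software stack above it.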