D4.7 Optimized MPI and compute-in-network implementations

Edited by Simon Pickartz (ParTec)


Xu Huang (ParTec), Simon Pickartz (ParTec), Carsten Clauss (ParTec), Gilles Moreau (CEA), Hugo Taboada (CEA), Marc Pérache (CEA), Timo Schneider (ETHZ)

Executive summary

This deliverable presents improvements to the support for the BullSequana eXascale Interconnect (BXI) in the two Message-Passing Interface (MPI) implementations ParaStation MPI and Multi-Processor Computing (MPC). The multi-rail support in MPC was further improved, especially with respect to the rendezvous protocol. This protocol defers the transmission of the MPI payload until the target application buffer is known to the communication layer, thereby avoiding intermediate copies of large memory regions. The pscom4portals plugin of the pscom library, which enables BXI support in ParaStation MPI, has been adapted to the new Remote Memory Access (RMA) interface. This interface was developed within the DEEP-SEA project and provides upper software layers with more direct access to the hardware's RMA capabilities. With these adaptations, applications benefit from improved performance of MPI one-sided communication on top of BXI.
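The eager/rendezvous distinction described above can be sketched as a simplified model. This is an illustrative Python sketch, not the actual MPC or pscom implementation; the class and function names and the 64 KiB threshold are hypothetical choices made for the example:

```python
# Simplified model of eager vs. rendezvous protocol selection in an
# MPI library. All names and the threshold below are illustrative.
EAGER_THRESHOLD = 64 * 1024  # hypothetical cutoff in bytes


class Receiver:
    """Models the receive side: a bounce buffer, a pending rendezvous
    payload, and the application's target buffer."""

    def __init__(self):
        self.bounce_buffer = None
        self.pending = None
        self.app_buffer = None

    def recv(self):
        if self.pending is not None:
            # Rendezvous completion: the payload is delivered directly
            # into the application buffer (no intermediate copy).
            self.app_buffer = self.pending
            self.pending = None
        else:
            # Eager completion: copy out of the bounce buffer.
            self.app_buffer = self.bounce_buffer


def send(payload: bytes, receiver: Receiver) -> str:
    if len(payload) <= EAGER_THRESHOLD:
        # Eager: the payload is copied into a pre-posted intermediate
        # (bounce) buffer immediately -- one extra copy.
        receiver.bounce_buffer = bytes(payload)
        return "eager"
    # Rendezvous: only a ready-to-send notice is delivered; the payload
    # moves once the receiver has matched it to an application buffer,
    # so large messages avoid the intermediate copy.
    receiver.pending = payload
    return "rendezvous"
```

The point of the sketch is the trade-off: small messages pay one copy to complete without a handshake, while large messages accept the handshake latency to transfer zero-copy into the target buffer.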

Finally, this deliverable presents streaming Processing in the Network (sPIN), a microarchitecture for network accelerators, as well as FPsPIN, its first full-system prototype implementation in hardware. These works demonstrate how packet-processing tasks that are commonly performed by the Central Processing Unit (CPU) can be offloaded to a smart Network Interface Card (NIC), enabling a better overlap of communication and computation in parallel applications.