Categories
Publications

Canary: Congestion-aware in-network allreduce using dynamic trees

This paper, prepared by ETH Zürich, was accepted in the journal Future Generation Computer Systems. Abstract: The allreduce operation is an essential building block for many distributed applications, ranging from the training of deep learning models to scientific computing. In an allreduce operation, data from multiple hosts is aggregated and then broadcast to each host participating in the operation. Allreduce […]

Proyecto RED-SEA: Resultados Intermedios (RED-SEA Project: Intermediate Results)

This paper, prepared by Universitat Politècnica de València and Universidad de Castilla-La Mancha, was accepted and presented at the XXXIII Jornadas de Paralelismo conference, held from 20 to 22 September 2023 in Ciudad Real (Spain). Note: the proceedings of this conference can only be accessed by conference attendees, but we obtained permission to publish the RED-SEA paper […]

FMI: Fast and Cheap Message Passing for Serverless Functions

This paper was accepted at ICS ’23, the 37th International Conference on Supercomputing, held from 21 to 23 June in Orlando, Florida, USA. It was prepared by ETH Zürich in collaboration with OpenCore. Abstract: Serverless functions provide elastic scaling and a fine-grained billing model, making Function-as-a-Service (FaaS) an attractive programming model. However, for distributed […]

rFaaS: Enabling High Performance Serverless with RDMA and Leases

This paper was accepted at IPDPS 2023, the 37th IEEE International Parallel and Distributed Processing Symposium, held from 15 to 19 May in Saint Petersburg, Florida, USA. It was prepared by ETH Zürich in collaboration with Microsoft. Abstract: High performance is needed in many computing systems, from batch-managed supercomputers to general-purpose cloud platforms. However, […]

RED-SEA: Network Solution for Exascale Architectures

This paper was accepted for the special session “European Projects in Digital Systems Design (EPDSD)” at the 25th Euromicro Conference on Digital System Design (DSD), held from 31 August to 2 September 2022 in Gran Canaria, Spain. It gives an overall presentation of the RED-SEA project’s goals and approach, and was prepared by INFN in collaboration with all project partners. Abstract: In […]

Influence of Network Performance Variability on Application Scalability

This research paper, prepared by ETH Zürich, was published in the journal “Proceedings of the ACM on Measurement and Analysis of Computing Systems”. Abstract: Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and […]

Building blocks for network-accelerated distributed file systems

Research paper by our partner ETHZ, accepted at the SC22 Conference that took place online from 14 to 18 November 2022 in Dallas, TX, USA. This paper was produced in collaboration with DEEP-SEA and was a BEST PAPER FINALIST! Abstract: High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, […]

NeVerMore: Exploiting RDMA Mistakes in NVMe-oF Storage Applications

Research paper by our partner ETHZ, accepted at the ACM Conference on Computer and Communications Security (CCS) that took place from 7 to 11 November 2022 in Los Angeles, CA, USA. Abstract: This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage […]

Lifting C semantics for dataflow optimization

Research paper by our partner ETHZ, accepted at the ICS ’22 International Conference on Supercomputing that took place online from 28 to 30 June 2022. Abstract: C is the lingua franca of programming and almost any device can be programmed using C. However, programming modern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly […]

KafkaDirect: Zero-copy Data Access for Apache Kafka over RDMA Networks

Research paper presented at the ACM SIGMOD/PODS Conference that took place in Philadelphia from 12 to 17 June 2022. Abstract: Apache Kafka is an open-source distributed publish-subscribe system, which is widely used in data centers for messaging between applications, log aggregation, and stream processing. The existing Kafka implementation uses TCP/IP for communication, which has various […]