Collaborations between the IO-SEA and RED-SEA projects

Metrology and monitoring, optimization of data movements, cartography and in-depth knowledge of the network topology, and interoperability: these are the fields identified by the IO-SEA and RED-SEA teams as common points of interest for the two projects.


IO-SEA (955811) and RED-SEA (955776) are two EuroHPC-funded projects, both belonging to the call EuroHPC-2019-1, topic b.

IO-SEA aims to build an IO software stack for the exascale era, capable of coping with the new challenges that will appear in the IO and storage domains as exascale-capable supercomputers replace older petascale systems.

RED-SEA builds upon the European interconnect BXI (BullSequana eXascale Interconnect), upon standard and mature technologies, and upon previous EU-funded initiatives to provide a competitive and efficient network solution for the exascale era and beyond.

The two projects have different ambitions and goals, but they share common aspects and common roots. Both are concrete results of the Mousquetaire Initiative, involving France and Germany, and are part of the “SEA legacy” that gathers DEEP-SEA, IO-SEA and RED-SEA.

The projects within the SEA legacy have referenced each other since their proposals. The projects lasted three years, with regular, scheduled meetings between them. As a result, convergences appeared between IO-SEA and RED-SEA, with effects on the design of the solutions proposed by the projects.

This document gathers those areas of convergence and explains how they could be used to optimize the IO-SEA and RED-SEA solutions, making their integration in the machine room more efficient.

Areas of convergence

Several fields have been identified where IO-SEA and RED-SEA have common points of interest. They are to be investigated as both solutions are integrated into a compute center environment.

The different domains are:

  • metrology and monitoring
  • optimization of data movements
  • cartography and in-depth knowledge of the network topology
  • interoperability

Each of the following sections provides details on one of these domains, as well as tracks for future developments using the outcomes of the IO-SEA and RED-SEA projects.

Metrology and monitoring

The pursuit of measurable performance when moving data is an important concern in both projects.

The IO-SEA project embeds probes at different levels and also uses already existing probes. For example, many pieces of information are extracted from the /sys and /dev pseudo-filesystems of the Linux operating system. Other probes exist, by design, in several of the products used in IO-SEA, for example the Phobos object store and the nfs-ganesha server (which even has a D-BUS interface to expose the data). For applications, IO-SEA provides a dedicated, non-intrusive instrumentation library that makes it possible to capture elements of the IO behavior of every program. This information is then stored in a dedicated database acting as a “metrology repository”. This database is analyzed by AI-based tools that help identify trends in the collected data. This “recommendation system” makes it possible to generate automated optimizations for the components of the storage system.
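As a minimal sketch of the kind of probe that reads the Linux pseudo-filesystems, the following parses the counters exposed by a block device in /sys/block/<dev>/stat (field layout per the kernel's block-layer documentation; the sample line is invented for illustration):

```python
# Minimal sketch of a metrology probe reading Linux block-device counters.
# Field layout follows the kernel's Documentation/block/stat: read I/Os,
# read merges, sectors read, read ticks (ms), write I/Os, write merges,
# sectors written, write ticks (ms), in-flight, io ticks, weighted io ticks.
# Sector counts are in 512-byte units regardless of the device block size.

FIELDS = ("read_ios", "read_merges", "read_sectors", "read_ms",
          "write_ios", "write_merges", "write_sectors", "write_ms",
          "in_flight", "io_ms", "weighted_io_ms")

def parse_block_stat(line: str) -> dict:
    """Parse one /sys/block/<dev>/stat line into a named counter record."""
    values = [int(v) for v in line.split()[:len(FIELDS)]]
    return dict(zip(FIELDS, values))

def bytes_read(stat: dict) -> int:
    return stat["read_sectors"] * 512  # sectors are always 512 bytes

# Example with a sample line; on a real system the probe would read
# open("/sys/block/sda/stat").read() periodically and push the record
# to the metrology repository.
sample = "4120 231 278498 1040 932 410 55810 2300 0 1510 3340"
stat = parse_block_stat(sample)
print(bytes_read(stat))  # 278498 * 512 = 142590976
```

A periodic collector would diff successive records and store the deltas in the metrology repository for the recommendation tools to mine.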

As part of the BXI project, switches have a rich set of counters which can be used to monitor the usage of various resources: the amount of data and the number of packets that went through each port. BXI being by nature architected to handle concurrent use, those counters apply to each VC (QoS class), and BXI also offers a specific packet tagging that allows traffic to be colored and handled by dedicated counters. Thanks to the fabric management software, those counters can be efficiently collected and used as input to almost any analysis tool or database. The amount of traffic, in terms of both bandwidth and packet rate, as well as congestion, can easily be monitored to determine hot spots per traffic class in the fabric.
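The analysis step can be sketched as follows: two snapshots of per-port, per-VC byte counters (as a fabric manager might export them) are turned into bandwidth figures. The snapshot format and counter keys are assumptions made for this illustration, not the actual BXI export format:

```python
# Hedged sketch: deriving per-VC bandwidth from two counter snapshots.
# Keys are (port, vc) tuples mapping to cumulative byte counts; a real
# fabric manager would export richer records (packet rates, congestion).

def bandwidth_per_vc(prev: dict, curr: dict, interval_s: float) -> dict:
    """Return bytes/s per (port, vc) between two counter snapshots."""
    return {key: (curr[key] - prev[key]) / interval_s
            for key in curr if key in prev}

prev = {("port1", 0): 1_000_000, ("port1", 1): 500_000}
curr = {("port1", 0): 9_000_000, ("port1", 1): 700_000}
rates = bandwidth_per_vc(prev, curr, interval_s=2.0)
print(rates[("port1", 0)])  # 4000000.0 bytes/s on VC 0
```

Feeding such per-VC rates into a time-series database is enough to spot hot spots per traffic class, as described above.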

IO-SEA and RED-SEA share common points of interest. From a very high-level point of view, both projects push bytes of data over the network, and they try to do it in the most efficient way. From that perspective, IO-SEA and RED-SEA are closely related, as IO-SEA could naturally make use of the hardware and software produced by RED-SEA.

In such a system, with IO-SEA software running on top of RED-SEA hardware and software, a deeper correlation between the two can be imagined. A very first approach is to gather the metrology data and the results from every probe, at every location, in the same place. From the IO-SEA point of view, it would then be possible to feed the recommendation database with more data. This would help in diagnosing “toxic behaviors” requiring further investigation and fixes.

This approach opens a door to a world where the IO and the network are much less agnostic of each other. In particular, the network and RED-SEA try to avoid network congestion, for it quickly wastes resources. This logic could be formalized by introducing “trigger warnings”: when the network detects congestion, it sends an explicit message to the storage system. This message can be as simple as a kind of timestamped “tag” in the storage system log. The storage system would use that information to mitigate the causes of the congestion. This information is quite important, for it helps in detecting misbehaviors by the storage system's users.

Optimization of data movements

The previous topic is closely related to the optimization of data movements.

IO-SEA has an explicit feature for optimizing data placement. Such placement, if done correctly, naturally reduces data movement. The IO-SEA “hints” are metadata attached to managed objects that describe the use case of each object. The hints may be set by the owner of the data, or they may be set automatically by the recommendation framework.
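To make the mechanism concrete, here is a toy sketch of hint-driven placement: hints attached to an object select a storage tier. The hint keys, values and tier names are invented for this example, not the actual IO-SEA hint vocabulary:

```python
# Illustrative sketch of hint-driven data placement: metadata "hints"
# attached to an object select a storage tier. Hint keys and tier names
# are assumptions for this example only.

def choose_tier(hints: dict) -> str:
    """Map object hints to a storage tier (toy policy)."""
    if hints.get("access_pattern") == "hot":
        return "nvme"        # frequently accessed data stays on fast media
    if hints.get("lifetime") == "campaign":
        return "disk"        # data alive for a compute campaign
    return "tape"            # cold data falls through to archival storage

obj_hints = {"access_pattern": "hot", "lifetime": "campaign"}
print(choose_tier(obj_hints))  # nvme
```

In the same spirit, the recommendation framework can rewrite an object's hints when the observed access pattern diverges from the declared one.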

On the network side, QoS (Quality of Service) is a very important feature. If the application can tell whether one data transfer is more important than others, the network may prioritize the related network frames differently. For example, “premium transfers” may go on very fast routes while asynchronous and less important transfers may use second-class routes with lower performance.

IO and compute traffic are very different and may perturb each other when mixed in a single QoS class. While bandwidth is a finite quantity, congestion isn't, and its effects can degrade the quality of the entire network. Segregating the two families into different QoS classes helps limit unexpected congestion effects across families. At the same time, implementing QoS implies differentiating traffic by tagging each elementary packet in some way, which gives the metrology an opportunity to differentiate flows and account for them independently. Providing an API to simplify QoS assignment would help to benefit from both of these mechanisms, segregation and accounting.
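A QoS assignment API of the kind suggested above could look like the following sketch: callers tag a transfer with a traffic class, the library maps it to a virtual channel and accounts the bytes per class. The class names and VC numbers are assumptions for illustration:

```python
# Hedged sketch of a QoS assignment API: a traffic class chosen by the
# caller is mapped to a virtual channel (VC), and bytes are accounted per
# class, covering both mechanisms named in the text (segregation and
# accounting). Class names and VC ids are invented for this example.

VC_MAP = {"premium": 0, "bulk": 1, "background": 2}

class QosAccountant:
    def __init__(self):
        self.bytes_per_class = {cls: 0 for cls in VC_MAP}

    def send(self, traffic_class: str, nbytes: int) -> int:
        """Record nbytes against a class; return the VC it maps to."""
        self.bytes_per_class[traffic_class] += nbytes
        return VC_MAP[traffic_class]

acc = QosAccountant()
vc = acc.send("premium", 4096)       # latency-sensitive IO on VC 0
acc.send("bulk", 1_000_000)          # background data staging on VC 1
print(vc, acc.bytes_per_class["bulk"])  # 0 1000000
```

The per-class counters are exactly the accounting data the metrology side needs, so the same API serves both purposes.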

Making the network and the storage less “blind” to each other is potentially a very virtuous path. It would be interesting to see how further projects, based on the outcomes of the SEA projects, could set up “bridges” between the concept of network QoS and hints-based data placement.

Cartography and in-depth knowledge of the network topology

It is important for IO-SEA to know more about the network topology. IO-SEA introduces the concept of “ephemeral services”: ephemeral servers are IO servers dedicated to specific compute jobs. They are scheduled together with compute jobs, via IO scheduling methods, on data nodes, i.e., the part of the compute machine dedicated to hosting IO services.

IO servers are spawned on data nodes, close to the compute nodes, which strongly reduces data movement and the use of network resources.

A natural and logical approach is to place the chosen data nodes and the chosen compute nodes as close to each other as possible. If this objective is achieved, the overall system will achieve high performance.

The IO scheduling algorithm would benefit from an in-depth knowledge of the cartography of the network. This cartography, associated with precise measurements of every component's performance, would help in building the most effective data node allocation.
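A toy version of such topology-aware allocation: given hop counts between nodes (as derived from the fabric cartography), pick the candidate data node with the fewest total hops to the job's compute nodes. The hop matrix and node names below are invented for illustration:

```python
# Toy sketch of topology-aware data-node allocation: choose the data node
# minimizing total hop count to a job's compute nodes. A real IO scheduler
# would combine hop counts with measured link performance and load.

def pick_data_node(compute_nodes: list, candidates: list, hops: dict) -> str:
    """Return the candidate data node with minimal total hops to the job."""
    return min(candidates,
               key=lambda d: sum(hops[(d, c)] for c in compute_nodes))

# Hypothetical hop counts between data nodes (d*) and compute nodes (c*).
hops = {("d1", "c1"): 1, ("d1", "c2"): 2,
        ("d2", "c1"): 3, ("d2", "c2"): 1}
best = pick_data_node(["c1", "c2"], ["d1", "d2"], hops)
print(best)  # d1: total 3 hops, versus 4 for d2
```

Weighting each hop by a measured per-link quality metric, instead of counting hops uniformly, is the natural refinement once the cartography carries performance data.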

As data paths on the network may evolve dynamically, thanks to the recent introduction of packet spreading and adaptive routing (which introduces even non-minimal-length paths to increase bandwidth), updates to the traffic distribution can be propagated from the network to the storage, making it more aware of the quality of the data paths it relies on.


Interoperability

The final area of convergence proposes closer interoperability between the different components. This topic covers in some ways elements that were depicted in the previous paragraphs.

Metrology, cartography, and data movements are elements of interoperability, but the idea could be pushed a little further. In particular, the source code of every piece of software involved in each component should be audited in order to identify locations where optimizations could be applied. This would help in setting up empty functions with a callback-routine logic (with related events and triggers). Those empty functions, or “placeholders”, are like a skeleton for more in-depth, but intrusive, features with better performance and smarter use of resources.
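The placeholder idea can be sketched as a registry of hook points that stay no-ops until a callback is registered. The event names below are illustrative:

```python
# Sketch of "placeholder" hook points: call sites inserted at audited
# locations in the code, which remain no-ops until a later feature
# registers a callback. Event names here are invented for illustration.

class Hooks:
    def __init__(self):
        self._callbacks: dict = {}

    def register(self, event: str, fn) -> None:
        self._callbacks.setdefault(event, []).append(fn)

    def fire(self, event: str, **kwargs) -> None:
        """Placeholder call site: a no-op until callbacks are registered."""
        for fn in self._callbacks.get(event, []):
            fn(**kwargs)

hooks = Hooks()
seen = []
hooks.fire("pre_write", obj="x")    # no-op: nothing registered yet
hooks.register("pre_write", lambda obj: seen.append(obj))
hooks.fire("pre_write", obj="obj42")
print(seen)  # ['obj42']
```

The cost of an unregistered hook is a single dictionary lookup, which is what makes it acceptable to scatter such placeholders through performance-sensitive code.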

Such placeholders do not exist for the moment, but it would be quite interesting to investigate this direction in order to open the way to new features.