General

Contribute to other work packages

The control programs will be inserted into the ADMIRE components and applications at control points. Control points are places in the system that implement reconfiguration or scheduling actions [15]. The control points can be used for implementing application-local or system-wide policies.

WP3

3.1 Base mechanisms for malleability

JGU will design malleability interfaces for the ad hoc storage system in three directions:

  1. The ad hoc file system can be expanded or shrunk while it is running. Users can still access the ad hoc file system during this process, albeit with some performance penalty due to data relocation. The entire ad hoc file system can also be relocated to an entirely different set of nodes, possibly with a different node count. This is beneficial when another job's application must access intermediate data but cannot run on the same set of nodes (due to batch scheduling constraints). Further, it avoids storing intermediate results on the back-end parallel file system and rereading them in another job, which would cause unnecessary network traffic.
  2. The ad hoc file system offers an API to throttle file system operations, that is, metadata operations and read/write operations. This can lower the overall CPU utilization of the file system and can benefit application performance when the application and the ad hoc storage system are co-located on the same nodes. Moreover, throttling I/O operations can reduce network interference with other applications.
  3. The ad hoc file system offers a rich configuration interface to enable or disable certain file system features or file system protocols. This can include disabling certain metadata fields, e.g., permission bits, or relaxing file system protocols, e.g., the file creation process, with the goal of increasing performance when a feature is not required by the application.
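
How much data the relocation in item 1 moves depends on the placement rule. As a rough sketch, assuming chunks are placed by hashing the file path and chunk index modulo the node count (an assumption made for illustration; the function names are hypothetical and not taken from the ad hoc file system's code), the chunks that must move during a resize can be counted like this:

```python
import hashlib

def owner(path: str, chunk_id: int, num_nodes: int) -> int:
    """Map a file chunk to a node index by hashing (hypothetical placement rule)."""
    digest = hashlib.sha256(f"{path}:{chunk_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes

def chunks_to_relocate(chunks, old_nodes: int, new_nodes: int):
    """Return the chunks whose owning node changes when the node count changes."""
    return [c for c in chunks if owner(*c, old_nodes) != owner(*c, new_nodes)]

# 1000 chunks of one example file; the path is made up for this sketch.
chunks = [("/scratch/out.dat", i) for i in range(1000)]
moved = chunks_to_relocate(chunks, old_nodes=4, new_nodes=6)
print(f"{len(moved)} of {len(chunks)} chunks move when growing from 4 to 6 nodes")
```

With plain modulo placement, a chunk stays on its node when growing from 4 to 6 nodes only if its hash modulo 12 is below 4, so about two thirds of the chunks are expected to move. Placement schemes such as consistent hashing are a common way to lower that share.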

3.2 Scheduling algorithms and policies

JGU will contribute their knowledge of Quality of Service (QoS) to the decisions of the scheduling algorithms and policies. As the I/O bandwidth to the global parallel file system is limited, and a single application can severely degrade the overall system performance, it is necessary to distribute the bandwidth fairly among the users while taking priorities into account (some users have a higher priority than others and are entitled to more bandwidth). Taking the bandwidth information into account, e.g., via the QoS Lustre extensions developed by JGU and DDN, the scheduler can make informed scheduling decisions based on the current bandwidth usage of the system. Other Slurm plugins, defined and implemented in WP 4, allow users to ask for the required bandwidth when allocating a batch job, which the I/O scheduler can also use for its scheduling policies.
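
One common way to make "fair distribution with priorities" concrete is weighted max-min fairness. The sketch below is only an illustration of that general technique, not the policy the project will adopt; the function name, the user names, and the numbers are assumptions.

```python
def fair_share(capacity: float, demands: dict, weights: dict) -> dict:
    """Weighted max-min fair split of `capacity` among users.

    Each user gets at most their demand; leftover bandwidth is
    redistributed among the remaining users in proportion to their weights.
    """
    alloc = {u: 0.0 for u in demands}
    active = {u for u, d in demands.items() if d > 0}
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[u] for u in active)
        # Users whose demand fits within their current weighted share.
        satisfied = {u for u in active
                     if demands[u] <= remaining * weights[u] / total_w}
        if not satisfied:
            # Nobody can be fully served: split what is left by weight.
            for u in active:
                alloc[u] = remaining * weights[u] / total_w
            break
        for u in satisfied:
            alloc[u] = demands[u]
            remaining -= demands[u]
        active -= satisfied
    return alloc

# Hypothetical example: 10 GiB/s total, user "c" has twice the priority.
shares = fair_share(10.0, {"a": 2.0, "b": 8.0, "c": 8.0},
                    {"a": 1.0, "b": 1.0, "c": 2.0})
print(shares)
```

In this example user "a" asks for less than their share and is fully served; the remaining 8 GiB/s is split 1:2 between "b" and "c".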

3.3

JGU will implement the designed interfaces in the ad hoc storage system and provide a corresponding API as a control point, allowing the malleability features outlined in T3.1 to be controlled.

3.4

JGU will design a new Slurm plugin that extends today's Slurm API so that the malleability features in T3.1 can be controlled through Slurm directly. For instance, the Slurm plugin could allow resizing the ad hoc file system or relocating it to an entirely new set of compute nodes without further user action. Moreover, JGU will consult with UC3M and TUDA with the goal of providing an API that can integrate the malleability protocols/policies with the malleable runtime and the scheduling policies into the Slurm plugin. With the help of PSNC, system- and application-centric metrics will be evaluated in a real-world environment.

WP4

4.1 Definitions of APIs, QoS metrics

JGU will lead this task due to their knowledge of the I/O requirements on their ad hoc file systems.

JGU will define the required APIs for the batch scheduler with the project partners so that users can convey their I/O requirements to the I/O scheduler. Example interfaces in the context of ad hoc file systems are

  1. paths to the input data and paths where the output data should be placed within the PFS;
  2. how long the data should be available on the ad hoc file system, that is, whether the ad hoc file system should be scheduled within or outside the boundaries of the batch job;
  3. whether other jobs need access to the data within the ad hoc file system. In this case, the ad hoc file system can run on a dedicated set of nodes, or the following jobs are scheduled to the same nodes as the node-local ad hoc file system. Nevertheless, I/O requirements can also include information about the data placement and distribution beneficial to the users' application.
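
The three example interfaces above could be carried in a single per-job record. The following is a minimal sketch under that assumption; the class, field, and option names are hypothetical and do not correspond to an existing Slurm or ADMIRE API.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class Lifetime(Enum):
    WITHIN_JOB = "within_job"   # file system lives only inside the batch job
    BEYOND_JOB = "beyond_job"   # file system outlives the batch job

@dataclass
class IORequirements:
    """Hypothetical container for a job's ad hoc file system requirements."""
    input_paths: List[str]                 # item 1: input data on the PFS
    output_path: str                       # item 1: output location on the PFS
    lifetime: Lifetime = Lifetime.WITHIN_JOB                  # item 2
    shared_with_jobs: List[str] = field(default_factory=list)  # item 3

    def placement_options(self) -> List[str]:
        """Placements the scheduler may choose from, following item 3."""
        if self.shared_with_jobs:
            return ["dedicated_nodes", "co_schedule_following_jobs"]
        return ["job_nodes"]

# Example paths and job names are invented for illustration.
private = IORequirements(["/pfs/project/in"], "/pfs/project/out")
shared = IORequirements(["/pfs/project/in"], "/pfs/project/out",
                        lifetime=Lifetime.BEYOND_JOB,
                        shared_with_jobs=["postprocess"])
print(private.placement_options(), shared.placement_options())
```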

Further, we define QoS metrics that give insight into the bandwidth used by a user application. For instance, these can be based on the token-based system of Lustre's QoS extension that JGU and DDN developed in the past, in which a user is allowed x RPCs per second, with each RPC being worth 1 MiB regardless of whether the user's I/O request uses the full mebibyte of each RPC. This gives us insight into an application's behavior and allows users to ask for x amount of required bandwidth through a newly added batch scheduler API. Based on the current usage and priorities (some users have a higher priority than others), the resources of the back-end storage system are fairly distributed. Such QoS metrics or, in general, the workload utilization should extend to the ad hoc file system so that users can make informed decisions about the ad hoc file system's efficiency when running their applications.
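
The token accounting described above can be illustrated with a generic token bucket. This is a sketch of the general technique only and not Lustre's implementation; the class name, the burst parameter, and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
import math

RPC_SIZE = 1 << 20  # each RPC is worth 1 MiB, as stated in the text

def rpcs_needed(request_bytes: int) -> int:
    """A request costs whole RPCs, even if it does not fill the last one."""
    return max(1, math.ceil(request_bytes / RPC_SIZE))

class TokenBucket:
    """Generic token bucket: `rpcs_per_second` tokens are added per second."""

    def __init__(self, rpcs_per_second: float, burst: float):
        self.rate = rpcs_per_second
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def try_consume(self, request_bytes: int, now: float) -> bool:
        # Refill according to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        cost = rpcs_needed(request_bytes)
        if cost > self.tokens:
            return False  # over quota: the request must wait
        self.tokens -= cost
        return True

bucket = TokenBucket(rpcs_per_second=10, burst=10)
# Ten 4 KiB writes use up the same quota as ten full 1 MiB writes.
small_writes = [bucket.try_consume(4096, now=0.0) for _ in range(11)]
print(small_writes)
```

At x RPCs per second the bandwidth ceiling is x MiB/s, but an application issuing small requests reaches the RPC limit long before it moves that much data.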

Lastly, an API is defined for the batch scheduler, allowing users to ask for stage-in/out processes between storage tiers, e.g., PFS and ad hoc file system. In addition, the API should include ways to include custom intermediate code while moving the data between tiers. For instance, the job's output data could be processed and compressed before storing it on the PFS.
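
A stage-out step with user-supplied intermediate code might look like the sketch below. It assumes both tiers are visible as ordinary directories and uses gzip compression as the example transformation; the function names and the directory layout are hypothetical.

```python
import gzip
import shutil
import tempfile
from pathlib import Path
from typing import Callable, List

# A transform receives a source file and a destination directory
# and returns the path of the file it created there.
Transform = Callable[[Path, Path], Path]

def plain_copy(src: Path, dst_dir: Path) -> Path:
    """Default: copy the file unchanged."""
    return Path(shutil.copy2(src, dst_dir))

def gzip_transform(src: Path, dst_dir: Path) -> Path:
    """Example of custom intermediate code: compress while moving."""
    dst = dst_dir / (src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst

def stage_out(src_dir: Path, dst_dir: Path,
              transform: Transform = plain_copy) -> List[Path]:
    """Move every regular file in src_dir to dst_dir through `transform`."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    return [transform(f, dst_dir)
            for f in sorted(src_dir.iterdir()) if f.is_file()]

# Demo with temporary directories standing in for the two storage tiers.
with tempfile.TemporaryDirectory() as tmp:
    adhoc, pfs = Path(tmp) / "adhoc", Path(tmp) / "pfs"
    adhoc.mkdir()
    (adhoc / "result.dat").write_bytes(b"abc" * 1000)
    staged = [p.name for p in stage_out(adhoc, pfs, gzip_transform)]
    print(staged)
```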

4.2 Scheduling algorithms and policies

JGU will offer methods to reduce congestion by coordinating ad-hoc file systems and the storage back-end in three ways:

  1. We will define and implement optimized data movement strategies to minimize reading and writing data from and to the PFS, e.g., when staging data in or out.
  2. We will enforce the QoS requirements defined in 4.1 in the batch scheduler by leveraging the Lustre QoS extensions.
  3. We will implement the interfaces defined in 4.1 so that data can stay within the ad hoc file system across multiple jobs if they operate on the same input data or rely on the intermediate results of the previous job. In cases where the same number of nodes cannot be used, the data should be transferred between the compute nodes instead of being stored on the PFS.

4.3 In-situ and in-transit data transformations

JGU will implement interfaces and tools allowing users to execute their custom code for data processing at the compute node (in-situ). These users can then execute user-defined scripts or significantly extend certain file system I/O operations. For example, a modular file system interface could allow custom code to be executed before the ad hoc file system writes the back-end data to disk. One possible use case is the on-the-fly encryption of sensitive data so that raw data is never stored on node-local storage devices, which other users have access to in later scheduled batch jobs.
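
The modular interface mentioned above can be pictured as a chain of hooks that run before data reaches the storage device. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, a dictionary stands in for node-local storage, and zlib compression stands in for the transformation. An encryption hook built on a vetted cryptography library would be registered the same way.

```python
import zlib
from typing import Callable, Dict, List

WriteHook = Callable[[str, bytes], bytes]

class HookedStore:
    """Toy storage back end that runs user hooks before each write."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}      # stands in for node-local disk
        self.pre_write: List[WriteHook] = []

    def register_pre_write(self, hook: WriteHook) -> None:
        self.pre_write.append(hook)

    def write(self, path: str, data: bytes) -> int:
        # Hooks run in registration order; each sees the previous output.
        for hook in self.pre_write:
            data = hook(path, data)
        self.blobs[path] = data
        return len(data)

def compress_hook(path: str, data: bytes) -> bytes:
    """Example user code: compress the payload before it is stored."""
    return zlib.compress(data)

store = HookedStore()
store.register_pre_write(compress_hook)
stored_size = store.write("/adhoc/raw.dat", b"x" * 1000)
print(stored_size)
```

Because the hook runs before the write, only the transformed bytes ever reach the storage device, which is the property the encryption use case relies on.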

Workpackages overview

Ad hoc storage system

The continuously growing size of HPC systems increases the probability of congestion on the back-end file systems. Ad-hoc storage systems dynamically virtualise on-node storage into a fast storage volume that allows congestion on the back-end storage systems to be reduced and data locality to be improved [9]. ADMIRE will develop two active ad-hoc storage systems with QoS support, addressing I/O performance and scalability. (Objectives O1, O2, O3, and O6; KPIs 1-6). ADMIRE will provide a high-performance ad-hoc file system that is based on the prototype ad-hoc file system GekkoFS developed by JGU and BSC. GekkoFS can be used by applications as a burst buffer to alleviate I/O peaks and checkpointing pressure. It already scales to more than 500 nodes and delivers more than 40,000,000 file creates per second. It ranked number 4 overall in the 10-node challenge of the Nov. 2019 IO500 while using a much smaller storage backend than competing file systems, and it even ranked number 2 for metadata performance in the same challenge. GekkoFS will be extended in ADMIRE to support malleability, allowing the dynamic resizing of resources in coordination with the malleability management module, and to integrate reliability mechanisms that enable long-term usage. Exposing control points will allow balancing the computation and I/O performance by coordinating the ad-hoc file system with the job and I/O scheduler. Since not all applications in the HPC ecosystem rely on traditional POSIX I/O interfaces, ADMIRE will also apply these ideas to BSC’s object store dataClay [10], thus providing more generality to the project’s infrastructure. Sections 1.3.2.5 and 1.4.1 provide more details about the ad-hoc storage system and the advances over the state of the art.

List of work packages

Deliverables:

WP1: Project management (UC3M)

WP2: Ad hoc storage systems (JGU)

WP3: Malleability management (TUDA)

WP4: I/O scheduler (BSC)

WP5: Sensing and profiling (DDN)

WP6: Intelligent controller (INRIA)

WP7: Application co-design (FZJ)

WP8: Dissemination and exploitation (PARATOOLS SAS)

Description of work (overview)

Our involvement

WP1: Project management (UC3M)

WP2: Ad hoc storage systems (JGU)

WP3: Malleability management (TUDA)

WP4: I/O scheduler (BSC)

WP5: Sensing and profiling (DDN)

WP6: Intelligent controller (INRIA)

WP7: Application co-design (FZJ)

JGU not part of

WP8: Dissemination and exploitation (PARATOOLS SAS)

Description of work (detailed)

Our involvement

WP1: Project management (UC3M)

This work package includes the effective management of the project as described in Section 3.2, including monitoring of progress towards milestones and deliverables, evaluation of research results, and the proper dissemination of those results as described in Section 2.2.1. It also provides the overall co-ordination of activities, both financial and technical. Thus, this WP will ensure resource sharing and usage as well as the overall smooth execution of the project activities and the organisation of project meetings. The project coordinator will prepare review meetings, ensure the flow of information to the partner teams, signal any delay in providing the requested contributions, and identify deviations from the work plans.

WP2: Ad hoc storage systems (JGU)

This WP develops ad-hoc storage systems to efficiently use node-internal NVMe and persistent memory technologies to reduce the pressure on back-end storage systems. It simplifies ad-hoc storage development as much as possible by defining minimal storage semantics (Task 2.1) and also provides storage systems without resilience (Task 2.2). Nevertheless, longer-running applications and workflows will also be supported by including additional error-correcting codes (Task 2.3). The storage systems developed in this work package will leverage GekkoFS and the dataClay object storage.

WP3: Malleability management (TUDA)

We will provide base mechanisms for the combined malleability of compute and I/O resources. Those mechanisms will be guided by new scheduling algorithms and policies, integrated into Slurm via a plugin, that are able to maximise throughput of the system by balancing computation and I/O. Moreover, we will add malleability to the ad-hoc storage systems developed in WP 2.

WP4: I/O scheduler (BSC)

Due to the continuous data traffic of ad-hoc storage systems and the complexity of the HPC storage hierarchy, I/O operations involving the shared back-end file system need to be coordinated to limit congestion while minimising batch job waiting times. This WP will develop an I/O Scheduler with control point support that coordinates inputs from the intelligent controller and the resource and malleability managers to provide QoS-aware data scheduling. Functionalities to support in-situ/in-transit data transformations will be provided, and using low-power processors for such tasks will be researched.

WP5: Sensing and profiling (DDN)

This WP will investigate and develop scalable monitoring (T5.2) and profiling tools (T5.3), including low-level instrumentation, data collection, mining and data-centric online performance analysis, able to scale to the exascale level. The WP will put emphasis not only on monitoring performance metrics, but also on modelling application I/O profiles that make it possible to predict the scaling behaviour of applications (T5.3). Starting from historic I/O profiles and user hints, ADMIRE will dynamically generate feedback for the controller and help it to understand the interplay between applications, which is necessary for online optimisation.

WP6: Intelligent controller (INRIA)

This work package integrates and analyses cross-layer system data to dynamically and intelligently steer the system components. It will optimise the data management and the I/O accesses of the running applications at system scale, based on the input provided by the ecosystem (WP5), and will enforce policies (e.g., I/O scheduling (WP4)) through malleability (WP3) and I/O management (WP2). To make decisions, it will rely on machine learning techniques to predict resource usage and application behaviour.

WP7: Application co-design (FZJ)

JGU not part of

WP8: Dissemination and exploitation (PARATOOLS SAS)

This work package is responsible for the dissemination of research results, the creation of an exploitation plan, the promotion of open source releases of the ADMIRE framework, and improving public awareness of the project, as described in Section 2.2. We will set up a project website to include all public documents and publications related to the project. Technical workshops and showcases will be held every year, probably co-located with major conferences, to connect with research groups, industrial researchers, and associations (such as HiPEAC).

Technical objectives