The constantly growing numbers on the Top500 list of the world's largest supercomputers can look appealing to computational scientists and engineers. Multi-petascale and even exascale resources, already available to users today, were only a dream 15 years ago. Many application developers and users who were previously limited by available computing power can now relatively easily get access to sufficient resources, such as LUMI (~0.5 exaflops) or LEONARDO (~0.3 exaflops). This is, however, only part of the journey: as the power of HPC systems grows, new problems arise, namely how to use this power effectively and how to make it easy to access.
Supporting the execution of challenging scenarios on high-end computing resources is one of the key objectives of the HiDALGO2 project. With a set of carefully selected software components, we aim to ensure sufficient performance of parallel computations while offering good flexibility and user experience. Today we would like to highlight QCG-Portal and QCG-PilotJob, which are being integrated into the HiDALGO2 ecosystem to comprehensively support the management and efficient execution of complex computing scenarios on large-scale resources.
The QCG family of tools and services is a proven solution developed by the Poznan Supercomputing and Networking Center for more than 10 years. During this time, it has been applied to a number of projects where it has successfully supported various scientific and engineering use cases.
QCG-Portal is a modern web solution designed to simplify the definition and execution of computing scenarios on remote computing resources. With an intuitive and highly adjustable interface, integrated data management, and compliance with modern authentication and authorisation protocols, it can efficiently support diverse applications and respond to the needs of a variety of target communities accessing multiple resources. In addition, because the system architecture is highly flexible, especially at the level of resource access, QCG is capable of interacting with most current supercomputers.
Figure 1. High-level architecture of QCG-Portal. Two types of QCG agents are available for interacting with the queuing system: one that is deployed on the computing resource and interacts with the queuing system directly, and another that is deployed externally and communicates with the queuing system via SSH.
QCG-Portal makes extensive use of templates, which define how the submission and results presentation interface for a given application should look, as well as how the application should be executed on target resources. Consequently, a new application can be attached to the portal fairly easily just by implementing a new template. For more advanced use cases, the portal supports embedding custom application interfaces written in JavaScript. If even this is not enough, the portal provides an API covering all basic functionality, such as authorisation, task management and data management, which can be used flexibly to build customized solutions wherever computing power is required.
Figure 2. With Custom Application Views, the interface of QCG-Portal can easily be made intuitive and adjusted to the specific needs of a computational scenario.
Efficiently processing a large number of tasks on HPC resources can be a challenge. The natural constraints of schedulers and the policies typically applied in data centers encourage infrequent submission of large jobs. This is usually not a problem for static, monolithic computing jobs, but for scenarios with finer granularity or more dynamic structure, such as ensembles or uncertainty quantification procedures, relying on the system scheduler alone is no longer an option. The problem grows for dynamic and heterogeneous scenarios, where the number of tasks and their requirements are hard to know in advance. Such scenarios are difficult to maintain with hand-written bash scripts, and more sophisticated tools are needed.
One such tool is QCG-PilotJob. It implements the so-called two-level scheduling paradigm, in which a pilot job service, started inside a regular queuing system allocation, plays the role of a second-level scheduler. Like a regular scheduler, QCG-PilotJob can be used to submit and manage tasks, but since it runs exclusively in user space, it overcomes typical limitations of global scheduling. QCG-PilotJob provides multiple useful features, such as a Python API, support for DAG workflows, support for parallel tasks based on the MPI and OpenMP programming interfaces, and a resume mechanism. Its particular strength is its lightweight and self-contained nature: the tool can be installed by a regular user in their home directory and run with regular user privileges. This allows it to be deployed practically anywhere, without having to ask resource administrators for special permissions, e.g. to open network ports.
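To make the two-level scheduling idea more concrete, the following is a minimal toy simulation in plain Python. It is not the QCG-PilotJob API: the names Task and run_in_allocation are hypothetical and introduced only for illustration, and time is simulated rather than measured. The sketch assumes a single allocation with a fixed number of cores, already granted by the system scheduler, and shows how a second-level scheduler could start tasks inside it as soon as their DAG dependencies are satisfied and enough cores are free.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical task description (not a QCG-PilotJob class)."""
    name: str
    cores: int
    duration: int          # simulated run time, arbitrary units
    after: tuple = ()      # names of tasks that must finish first


def run_in_allocation(total_cores, tasks):
    """Simulate second-level scheduling inside one allocation.

    Returns a dict mapping task name to (start, end) in simulated time.
    """
    for task in tasks:
        if task.cores > total_cores:
            raise ValueError(f"{task.name} does not fit in the allocation")

    pending = list(tasks)
    running = []           # min-heap of (end_time, name, cores)
    finished = set()
    schedule = {}
    free = total_cores
    now = 0

    while pending or running:
        # Start every pending task that is ready and fits in the free cores.
        still_pending = []
        for task in pending:
            ready = all(dep in finished for dep in task.after)
            if ready and task.cores <= free:
                end = now + task.duration
                heapq.heappush(running, (end, task.name, task.cores))
                schedule[task.name] = (now, end)
                free -= task.cores
            else:
                still_pending.append(task)
        pending = still_pending

        if not running:
            # Nothing runs and nothing could start: dependencies are broken.
            raise ValueError("unsatisfiable dependencies")

        # Advance simulated time to the next task completion.
        now, name, cores = heapq.heappop(running)
        finished.add(name)
        free += cores

    return schedule


if __name__ == "__main__":
    # An ensemble of four 2-core members followed by a 1-core aggregation step.
    members = [Task(f"sim{i}", cores=2, duration=10) for i in range(4)]
    aggregate = Task("aggregate", cores=1, duration=5,
                     after=tuple(m.name for m in members))
    for name, (start, end) in run_in_allocation(4, members + [aggregate]).items():
        print(f"{name}: start={start} end={end}")
```

In this example the whole ensemble is handled within one 4-core allocation: the members run two at a time, and the aggregation step starts only after the last member has finished. No additional job is submitted to the system queue, which is the practical benefit the pilot job approach aims for.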
Plans for HiDALGO2
By combining the practical requirements of the HiDALGO2 community with fresh and constantly emerging ideas, we foresee several directions for the near-term development of QCG solutions, namely:
- Common queue service for QCG-PilotJob
Within HiDALGO2 we plan to finalize the implementation of the so-called common queue service for QCG-PilotJob, which will allow us to combine resources from many allocations in a single pilot job. This will take the flexibility of QCG-PilotJob to the next level and allow it to support more sophisticated scenarios.
- Close integration of QCG-Portal and QCG-PilotJob
Through this integration, we want to develop new workflow mechanisms, namely in-allocation workflows and hybrid workflows. Both will be based on QCG-PilotJob capabilities and focused on intensive, low-latency processing inside already created allocations.
- Integration with third-party data management systems
Currently QCG uses the IBIS platform as its primary data storage. Our goal is to integrate with other data management systems, such as CKAN, to support a wider spectrum of application scenarios.
- Support for SSH credentials management
Our goal is to enhance the capabilities of the SSH-based QCG-Agent with production-ready solutions for establishing secure connections to the cluster. In this context, we expect support for SSH certificates to be one of the most promising options we plan to add to the QCG stack.
- Advanced monitoring of applications
Long-running computational scenarios often require a way to track their progress. To streamline this process, we are developing a dedicated monitoring service for QCG. With a high level of flexibility and configurability, it will be easily adaptable to the specific needs of various applications.