The HiDALGO2 Centre of Excellence is pleased to announce a new publication by its researchers, focusing on improving the efficiency of machine learning (ML) workloads in High-Performance Computing (HPC) environments.
The paper, titled “Profiling and Optimization of Multicard GPU Machine Learning Jobs,” appears in Concurrency and Computation: Practice and Experience (Wiley), July 2025.
Key Insights from the Study
The research provides a comprehensive analysis of model optimization techniques for multi-GPU workloads, addressing the growing demand for scalable ML training on modern HPC systems.
Among its key findings:
Parallelization strategies for image recognition were tested on various hardware and software configurations, including distributed data parallelism and distributed hardware processing.
Simple yet impactful optimizations, such as switching tensor memory layout from NCHW to NHWC (channels-last) and enabling pin_memory in the PyTorch DataLoader, led to notable performance gains.
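Both optimizations take only a few lines of PyTorch. The sketch below is illustrative (the toy dataset and shapes are not from the paper): pin_memory enables asynchronous host-to-GPU copies, and channels-last reordering changes the tensor's strides to the NHWC layout many cuDNN kernels prefer.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset: 32 RGB images of size 64x64, stored NCHW by default.
images = torch.randn(32, 3, 64, 64)
labels = torch.randint(0, 10, (32,))
dataset = TensorDataset(images, labels)

# pin_memory=True puts batches in page-locked host memory so GPU copies
# can overlap compute (PyTorch silently skips pinning on CPU-only runs).
loader = DataLoader(dataset, batch_size=8, pin_memory=True, num_workers=0)

batch, _ = next(iter(loader))
# Convert the batch to channels-last (NHWC) memory format: the logical
# shape stays (8, 3, 64, 64), but the underlying strides are reordered.
nhwc_batch = batch.to(memory_format=torch.channels_last)
print(nhwc_batch.shape)  # torch.Size([8, 3, 64, 64])
print(nhwc_batch.is_contiguous(memory_format=torch.channels_last))  # True
```

In practice the model is converted the same way (`model.to(memory_format=torch.channels_last)`) so that convolution kernels see matching layouts on both sides.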
Large Language Model (LLM) tuning was evaluated using multiple optimization techniques:
- LoRA enables faster tuning with lower VRAM usage than DPO;
- QLoRA provides memory-efficient adaptation;
- QAT is the most resource-intensive and slowest approach.
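The memory savings behind LoRA come from training only a small low-rank update while the base weights stay frozen. The sketch below illustrates that idea in plain PyTorch (it is not the paper's code; the layer sizes, rank, and scaling are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: frozen base layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # base weights stay frozen
        # Low-rank factors: A is small-random-init, B is zero-init, so the
        # adapted layer starts out exactly equal to the base layer.
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable: {trainable} / {total}")  # trainable: 8192 / 270848
```

Only the two rank-8 factors (about 3% of the parameters here) receive gradients and optimizer state, which is why LoRA-style tuning needs far less VRAM; QLoRA pushes this further by also quantizing the frozen base weights.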
A significant portion of LLM tuning time is spent on kernel initialization and thread synchronization, especially when memory operations are not the dominant cost.
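Overheads of this kind can be surfaced with PyTorch's built-in profiler, which breaks run time down by operator. The toy model below is an illustrative assumption, not the paper's workload:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Hypothetical small model; on a GPU run, ProfilerActivity.CUDA would be
# added to capture kernel launch and synchronization time as well.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
x = torch.randn(64, 256)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        model(x)

# Per-operator summary, sorted by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(table)
```

On multi-GPU jobs, the same table (with CUDA activity enabled) separates time spent in compute kernels from launch and synchronization overhead, which is how findings like the one above are typically observed.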
This study aligns with HiDALGO2’s mission to advance HPC and AI for environmental and societal applications, improving the efficiency of large-scale simulations and machine learning tasks on pre-exascale infrastructure.
Citation:
Lawenda, M., Khloponin, K., Samborski, K., & Szustak, Ł. (2025). Profiling and Optimization of Multicard GPU Machine Learning Jobs. Concurrency and Computation: Practice and Experience, 37(18–20), e70196. https://doi.org/10.1002/cpe.70196