Large-scale parallelization of molecular dynamics simulations faces challenges that seriously affect simulation efficiency, among which the load imbalance problem is the most critical. In this paper, we propose a new molecular dynamics static load balancing method (MDSLB). By analyzing the characteristics of the short-range forces in parallel molecular dynamics programs, we divide the short-range force into three force models, and then package the computation of each force model into many tiny computational units called "cell loads", which provide the basic data structures for our load balancing method. In MDSLB, the spatial region is partitioned into sub-regions called "local domains", and the cell loads of each local domain are allocated to the processors in turn. Compared with dynamic load balancing methods, MDSLB guarantees load balance by executing the algorithm only once at program startup, without migrating loads dynamically. We implemented MDSLB in the OpenFOAM software and tested it on the TianHe-1A supercomputer with 16 to 512 processors. Experimental results show that MDSLB saves 34%–64% of execution time for load-imbalanced cases.
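The allocation step the abstract describes, dealing the cell loads of a local domain out to processors in turn, can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the per-cell cost estimates, and the descending-cost ordering are our assumptions.

```python
# Hedged sketch of an MDSLB-style static allocation (not the paper's code):
# each "cell load" carries an estimated cost for one of the three
# short-range force models, and the cells of one local domain are dealt
# out to the processors in turn so each receives a similar total cost.

def assign_cell_loads(cell_costs, num_procs):
    """Round-robin the cell loads of one local domain over processors.

    cell_costs: list of estimated per-cell computation costs
    num_procs:  number of processors sharing this local domain
    Returns one list of assigned cell indices per processor.
    """
    assignment = [[] for _ in range(num_procs)]
    # Dealing the cells in descending cost order tightens the balance;
    # this greedy refinement is an assumption, not from the paper.
    order = sorted(range(len(cell_costs)), key=lambda i: -cell_costs[i])
    for rank, cell in enumerate(order):
        assignment[rank % num_procs].append(cell)
    return assignment

# Hypothetical costs for the cells of one local domain, split two ways.
costs = [5, 3, 8, 1, 4, 7, 2, 6]
parts = assign_cell_loads(costs, 2)
totals = [sum(costs[i] for i in p) for p in parts]
```

Because the assignment is computed once from static cost estimates, it can run at program startup with no later load migration, as the abstract notes.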
As supercomputers rapidly grow in scale, reliability has become the main constraint on system availability. Existing fault-tolerance mechanisms, including checkpointing and process redundancy, cannot solve this problem effectively. We therefore propose FTRP (fault tolerance framework using process replication and prefetching), a fault-tolerance framework for high-performance computing that combines the advantages of proactive and reactive fault tolerance, introducing a novel cost model and a novel proactive mechanism that effectively improve application efficiency. We propose the novel "work-most" (WM) cost model, which, based on failure-prediction results and the application state, adaptively selects a fault-tolerance action online from a set of mechanisms. Analogous to locality in program execution, we observe for the first time a locality phenomenon in supercomputer failures. Exploiting this failure locality, we propose a new fault-tolerance mechanism that combines process replication with process prefetching, which avoids failure-induced losses regardless of whether a failure is predicted. Simulation experiments based on real failure traces and typical failure-prediction accuracies show that applications using FTRP achieve about a 10% improvement over existing fault-tolerance mechanisms, and that the framework remains effective at petascale and beyond.
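The online decision the abstract attributes to the WM cost model, picking a fault-tolerance action from the prediction result and application state, can be sketched roughly as follows. All names, cost figures, and the expected-work formula here are our illustrative assumptions, not the paper's model.

```python
# Hedged illustration (hypothetical numbers) of a "work-most"-style
# decision: given a predicted failure probability for the next interval,
# choose the action whose expected useful work is largest.

def expected_work(p_fail, overhead, loss_if_fail, interval):
    """Expected useful work in one decision interval under one action.

    p_fail:       predicted probability a failure hits this interval
    overhead:     time the action itself costs (checkpoint, replicate, ...)
    loss_if_fail: work lost if a failure does occur under this action
    interval:     wall-clock length of the decision interval
    """
    return (interval - overhead) - p_fail * loss_if_fail

def choose_action(p_fail, actions, interval):
    """Return the action name with the greatest expected useful work."""
    return max(actions,
               key=lambda a: expected_work(p_fail, actions[a]["overhead"],
                                           actions[a]["loss"], interval))

# Hypothetical per-interval costs for three candidate mechanisms.
actions = {
    "none":       {"overhead": 0.0,  "loss": 100.0},  # lose the whole interval
    "checkpoint": {"overhead": 5.0,  "loss": 50.0},   # lose work since last ckpt
    "replicate":  {"overhead": 20.0, "loss": 1.0},    # near-zero loss, high cost
}
```

Under these invented numbers, a low predicted failure probability favors doing nothing, a moderate one favors checkpointing, and a high one favors replication, which mirrors the adaptive behavior the abstract claims.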
The mismatch between compute performance and I/O performance has long been a stumbling block as supercomputers evolve from petaflops to exaflops. Currently, many parallel applications are I/O intensive, and their overall running times are typically limited by I/O performance. To quantify the I/O performance bottleneck and highlight the significance of achieving scalable performance in peta/exascale supercomputing, in this paper we introduce for the first time a formal definition of the "storage wall" from the perspective of parallel application scalability. We quantify the effects of the storage bottleneck by providing a storage-bounded speedup, defining the storage wall quantitatively, presenting existence theorems for the storage wall, and classifying system architectures according to their I/O performance variation. We analyze and extrapolate the existence of the storage wall through experiments on Tianhe-1A and case studies on Jaguar. These results provide insights into how to alleviate the storage wall bottleneck in system design and achieve hardware/software optimizations in peta/exascale supercomputing.
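A storage-bounded speedup of the kind the abstract describes can be sketched as follows; the notation is ours, not necessarily the paper's. Split the parallel running time into a compute part and an I/O part:

```latex
S_{\mathrm{storage}}(N) \;=\; \frac{T(1)}{T_{\mathrm{comp}}(N) + T_{\mathrm{I/O}}(N)},
```

where $T(1)$ is the single-processor time and $N$ the processor count. If the compute part scales, $T_{\mathrm{comp}}(N) \approx T_{\mathrm{comp}}(1)/N$, but the I/O part is bounded below by some constant $c > 0$ (I/O performance does not improve with $N$), then

```latex
S_{\mathrm{storage}}(N) \;\le\; \frac{T(1)}{c} \quad \text{for all } N,
```

so the speedup saturates at a fixed ceiling regardless of processor count. Such a ceiling is what a "storage wall" existence result would capture.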
Wei HU, Guang-ming LIU, Qiong LI, Yan-huang JIANG, Gui-lin CAI