Simulating CFD applications on parallel computers is widely accepted in both academia and industry. However, the steadily worsening reliability of parallel computers seriously constrains the further development of this approach. Applying traditional fault-tolerance methods to CFD applications exposes a conflict between usability and efficiency. On the one hand, system-level fault-tolerance methods are easy to adopt but introduce large overhead, which is unacceptable in large-scale parallel CFD applications. On the other hand, application-level fault-tolerance methods reduce overhead but place higher demands on programmers, and those demands exceed the expertise of programmers in the CFD application field. To resolve this conflict, this project proposes, for the first time, embedding fault-tolerance methods into a CFD parallel application development framework in order to design highly efficient fault-tolerance methods. The approach exploits the framework's highly abstract organization so that CFD application developers can configure various fault-tolerance methods in a manner close to natural language. At the same time, it uses the program information provided by the framework to guide the design of efficient fault-tolerance methods. We will study the organization, mechanisms, and optimization techniques of fault tolerance for CFD parallel application development frameworks, and ultimately design and implement a practically usable CFD parallel application development framework with embedded fault-tolerance functions, thereby solving or alleviating the reliability problem in parallel CFD simulation.
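To address the reliability problems that arise when parallel Computational Fluid Dynamics (CFD) applications run on parallel computers, this project studied efficient fault-tolerance methods based on a CFD parallel application development framework. Drawing on the characteristics of parallel CFD applications, we designed a fault-tolerance organization for the framework that embeds coordinated checkpointing, uncoordinated checkpointing, error detection, and fault-tolerance foundation modules into the existing CFD parallel application development framework. Exploiting the periodic snapshots taken during CFD simulation, we designed and implemented a coordinated checkpointing technique for the framework. We also designed and implemented a user-level message logging protocol for the framework. In response to the increasingly serious soft errors in parallel computing, and drawing on the computational characteristics of stencil-type CFD applications such as LBM, we designed and implemented a dual modular redundancy soft-error detection method based on grid sampling. The results were published as 35 academic papers in domestic and international journals and conferences, including Parallel Computing, IPDPS, ICA3PP, and RSC Advances, of which 16 are SCI-indexed. The fault-tolerance organization and the fault-tolerance methods were all designed and implemented in the international open-source CFD software OpenFOAM. The project trained 3 doctoral students and 5 master's students.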
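The summary does not describe how the grid-sampling detector works internally, so the following is only a minimal sketch of the general idea, under stated assumptions: a simple 1D three-point averaging stencil stands in for an LBM kernel, a random subset of grid points is recomputed redundantly from the previous state, and any disagreement with the stored result is flagged as a suspected soft error. All function names (`stencil_step`, `sample_indices`, `detect_soft_errors`) and the sampling fraction are hypothetical and are not taken from the project or from OpenFOAM.

```python
import random

def stencil_step(u):
    """One 3-point averaging step; boundary values are kept fixed."""
    new = u[:]  # copy so the boundary points carry over unchanged
    for i in range(1, len(u) - 1):
        new[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return new

def recompute_point(u, i):
    """Redundant (second-copy) computation of one interior point."""
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0

def sample_indices(n, fraction, rng):
    """Choose a random subset of interior points to double-check."""
    k = max(1, int(fraction * (n - 2)))
    return sorted(rng.sample(range(1, n - 1), k))

def detect_soft_errors(u_old, u_new, indices, tol=1e-12):
    """Return the sampled indices whose stored value disagrees with
    the redundant recomputation from the previous state."""
    return [i for i in indices
            if abs(recompute_point(u_old, i) - u_new[i]) > tol]

# Hypothetical demo: one time step, then an injected fault.
rng = random.Random(0)
u_old = [float(i % 7) for i in range(64)]
u_new = stencil_step(u_old)
checked = sample_indices(len(u_old), 0.25, rng)

clean = detect_soft_errors(u_old, u_new, checked)    # no fault yet
u_new[checked[0]] += 1.0                             # simulate a bit flip
flagged = detect_soft_errors(u_old, u_new, checked)  # fault is caught
```

In this sketch only the sampled points are recomputed, so the redundancy cost scales with the sampling fraction rather than with the full grid. The trade-off is that a corrupted value at an unsampled point goes unnoticed in that step; how the project's actual method handles this is not stated in the summary.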
Data last updated: 2023-05-31
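Mismatched disturbance rejection for the DGVSCMG double-gimbal servo system based on ESO
Efficient parallel radiation diffusion algorithm based on multi-block structured grids for inertial confinement fusion implosion
Experimental verification of the transient wave displacement field calculation method in phased-array acoustic field simulation
Thermal analysis and improved thermal network model of wind power converter IGBT modules considering solder layer fatigue
Calculation and analysis of standard pole figures for zirconium metal texture
Research on a high-order, parallel, scalable CFD application development framework
Research on efficient fault-tolerance methods for parallel linear digital signal processing on space platforms
Efficient preconditioned JFNK parallel solution algorithm for CFD on heterogeneous many-core platforms and its applications
Research on hierarchical object-oriented parallel application framework technology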