High-performance computers integrate ever more computing nodes to improve performance, with the result that faults increase exponentially with the number of nodes. In such a situation, fault tolerance is necessary for system dependability. Unfortunately, fault tolerance that relies on node redundancy often adds to system complexity, which in turn provokes more faults. Rollback recovery is a trusted and popular approach to fault tolerance in high-performance computing because it relies on time redundancy and needs no redundant nodes. However, the time overhead of existing rollback recovery schemes increases sharply with the number of nodes, because they save process state at a checkpoint solely as a memory image and replay the logged messages sequentially during fault recovery. This project exploits the asymmetry, in terms of time, between checkpointing a process and reviving it, and proposes a process checkpointing technique based on state distinction. The technique identifies the object components of a process through semantic modeling of program and data, divides them into environment state and application state, and then extracts characteristic values of the environment state to replace its memory image. This should decrease the size of a checkpoint and so reduce checkpointing overhead. The project also exploits the differences between the normal execution of a process and its roll-forward, and proposes a fast process roll-forward technique based on concurrent message replay. The technique identifies the independence of messages by evaluating their effect scopes and breaks their dependencies by logging the outcome of a message, aiming to increase the concurrency of message replay and so lower the overhead of fault recovery.
With these key technologies, a fault-tolerance support library will be developed to significantly relieve the sharp growth of rollback recovery overhead with the number of nodes, and to provide low-overhead fault-tolerance support for large-scale systems.
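The idea of replacing environment state with characteristic values can be illustrated with a minimal sketch. It is not the project's implementation: it assumes, purely for illustration, that the only environment object is an open file, and the names `FileEnv`, `feature`, `rebuild`, `checkpoint`, and `restore` are hypothetical. Application state is saved in full, while the file object is reduced to a small descriptor from which it can be rebuilt at restart.

```python
import pickle


class FileEnv:
    """Hypothetical environment object: an open file handle."""

    def __init__(self, path, mode="rb", offset=0):
        self.path = path
        self.mode = mode
        self.handle = open(path, mode)
        self.handle.seek(offset)

    def feature(self):
        # Compact descriptor stored in place of the object's memory image.
        return {"path": self.path, "mode": self.mode,
                "offset": self.handle.tell()}

    @classmethod
    def rebuild(cls, feat):
        # Recreate the environment object from its descriptor at restart.
        return cls(feat["path"], feat["mode"], feat["offset"])


def checkpoint(app_state, env_objects):
    """Save application state in full and environment state as descriptors."""
    env_features = {name: obj.feature() for name, obj in env_objects.items()}
    return pickle.dumps({"app": app_state, "env": env_features})


def restore(blob):
    """Reload application state and rebuild environment objects."""
    data = pickle.loads(blob)
    env = {name: FileEnv.rebuild(f) for name, f in data["env"].items()}
    return data["app"], env
```

In this sketch the checkpoint holds only the path, mode, and offset of each file rather than any buffered content, which is the sense in which the checkpoint shrinks; how the project actually models and classifies process state is not specified in the abstract.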
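High-performance computing systems improve performance by scaling up the number of computing nodes, which brings a reliability problem: faults grow exponentially with the number of nodes, so corresponding fault-tolerance support is required. Rollback-recovery fault tolerance relies on time redundancy and needs no node redundancy, which suits the needs of high-performance computing. However, existing methods save state data solely as a memory image when setting a process checkpoint and replay logged messages serially during fault recovery, so their overhead rises sharply as the system grows. This project studies the asymmetry between process checkpointing and process revival and proposes a process checkpointing technique based on state distinction: it analyzes the composition of process state through program semantic modeling and replaces an object's memory image with its characteristic values, thereby reducing the amount of checkpoint data and the checkpointing overhead. It also studies the non-equivalence between process roll-forward and normal process execution and proposes a fast process roll-forward technique based on concurrent replay: it determines the independence of messages by estimating their effect scopes and uses result logs to remove dependencies between messages, thereby increasing the concurrency of message replay and reducing fault-recovery overhead. A fault-tolerance support library based on these techniques will be implemented to address the sharp growth of overhead with system scale and to provide low-overhead fault-tolerance support for large-scale high-performance computing.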
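High-performance computing systems improve performance by scaling up the number of computing nodes, which brings a reliability problem: faults grow exponentially with the number of nodes, so corresponding fault-tolerance support is required. Rollback-recovery fault tolerance relies on time redundancy and needs no node redundancy, which suits the needs of high-performance computing. However, existing methods save state data solely as a memory image when setting a process checkpoint and replay logged messages serially during fault recovery, so their overhead rises sharply as the system grows. This project studied the asymmetry between process checkpointing and process revival and proposed a process checkpointing technique based on state distinction: it analyzes the composition of process state through program semantic modeling and replaces the memory image of the process environment with extracted characteristic values, thereby reducing the amount of checkpoint data and the checkpointing overhead. Based on the non-equivalence between process roll-forward and normal process execution, it proposed a fast process roll-forward technique based on concurrent replay: it determines the independence of messages by estimating their effect scopes and uses result logs to remove dependencies between messages, thereby increasing the concurrency of message replay and reducing fault-recovery overhead. These techniques have both theoretical and practical value for easing the sharp growth of fault-tolerance overhead with system scale, and help reduce the fault-tolerance overhead of large-scale high-performance computing. Building on the theoretical work, the project examined the data-evolution characteristics of transaction-processing applications, OpenMP programs, and engineering algorithms such as frequent itemset mining and iterative solution of linear systems, applied these characteristics to optimizing rollback-recovery overhead, and derived an implementation framework for fault-tolerance overhead optimization for this class of applications. The results have been applied to anomaly detection and localization in the Hunan Mobile billing system, with very good results in practice.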
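The independence test behind concurrent replay can be sketched as follows. This is an illustration under stated assumptions, not the project's algorithm: each logged message is assumed to carry an explicit effect scope (the set of state keys it reads or writes), and the names `schedule_waves` and `replay` are hypothetical. Messages whose scopes do not overlap are placed in the same wave and replayed concurrently; messages that share a key keep their logged order. The result-logging step that the abstract uses to break remaining dependencies is not modeled here.

```python
from concurrent.futures import ThreadPoolExecutor


def schedule_waves(messages):
    """Group logged messages into waves of mutually independent messages.

    messages: list of (scope, handler) in logged order, where scope is the
    set of state keys the handler reads or writes (an assumed input).
    """
    waves = []
    last_wave = {}  # state key -> index of the last wave that touches it
    for scope, handler in messages:
        # Place the message one wave after the latest conflicting message.
        w = 1 + max((last_wave.get(k, -1) for k in scope), default=-1)
        if w == len(waves):
            waves.append([])
        waves[w].append((scope, handler))
        for k in scope:
            last_wave[k] = w
    return waves


def replay(state, messages, workers=4):
    """Replay waves in order; messages inside a wave run concurrently."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for wave in schedule_waves(messages):
            # list() waits for the whole wave before the next one starts.
            list(pool.map(lambda m: m[1](state), wave))
    return state
```

Because every handler in a wave touches a disjoint set of keys, the final state matches a sequential replay of the same log, while the number of sequential steps drops from the number of messages to the number of waves.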
Data last updated: 2023-05-31