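Low-Overhead Rollback-Recovery Fault Tolerance for Large-Scale High-Performance Computing

Basic Information

Grant number: 61272401

Project type: General Program

Funding amount: 78.00

Principal investigator: Yang Jinmin (杨金民)

Discipline classification:

Host institution: Hunan University

Year approved: 2012

Year completed: 2016

Project period: 2013-01-01 to 2016-12-31

Project status: Completed

Project participants: Bai Shuren (白树仁), Peng Li (彭黎), Rong Huigui (荣辉桂), Chi Peng (池鹏), Zhou Junhai (周军海), Li Yuanqi (李元旗), Yuan Gongbiao (袁功彪), Lu Jun (卢俊)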

Keywords:

rollback recovery; state distinction; high-performance computing; time overhead; concurrent replay

Final Report Abstract

High-performance computers improve performance by integrating ever more computing nodes, but this raises a reliability problem: faults increase exponentially with the number of nodes, so fault tolerance is needed for system dependability. Fault tolerance based on node redundancy adds complexity to the system and thereby provokes further faults. Rollback recovery, a widely used approach in high-performance computing, relies on time redundancy instead and needs no redundant nodes. However, the time overhead of existing rollback recovery schemes rises sharply with the number of nodes, because they save process state at a checkpoint solely as a memory image and replay logged messages sequentially during fault recovery.

This project studies the non-equivalence, in terms of time, between process checkpointing and process regeneration, and proposes a process checkpointing technique based on state distinction. The technique parses the composition of process state through semantic modeling of programs and data, divides the object components into environment state and application state, and replaces the memory image of the environment state with its characteristic values. This reduces the checkpoint size and hence the checkpointing overhead. The project also studies the differences between normal execution and roll-forward of a process, and proposes a fast roll-forward technique based on concurrent message replay. This technique judges whether messages are independent by estimating their scopes of effect, and removes dependencies between messages by logging message results. This increases the concurrency of message replay and lowers the overhead of fault recovery.

A fault-tolerance support library built on these techniques is to be developed, in order to relieve the sharp growth of rollback recovery overhead with system scale and to provide low-overhead fault tolerance for large-scale high-performance computing.
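
The state-distinction idea described in the abstract can be illustrated with a minimal Python sketch. It is not the project's support library: the `LookupTable` class, the two checkpoint functions, and the use of `pickle` are assumptions made for illustration. The environment state here is a large table that can be rebuilt from one small characteristic value, while the application state has to be saved in full.

```python
import pickle

# Hypothetical environment state: bulky, but fully determined by `size`.
class LookupTable:
    def __init__(self, size):
        self.size = size                              # characteristic value
        self.squares = [i * i for i in range(size)]   # derived bulk data

def full_image_checkpoint(env, app_state):
    # Baseline: save everything, as a memory-image checkpoint would.
    return pickle.dumps({"env": env.squares, "app": app_state})

def distinguished_checkpoint(env, app_state):
    # Save only the characteristic value of the environment state,
    # plus the application state that cannot be regenerated.
    return pickle.dumps({"env_size": env.size, "app": app_state})

def restore(blob):
    saved = pickle.loads(blob)
    env = LookupTable(saved["env_size"])  # rebuild environment from its characteristic value
    return env, saved["app"]

env = LookupTable(10_000)
app_state = {"iteration": 42, "partial_sum": 3.5}

full = full_image_checkpoint(env, app_state)
small = distinguished_checkpoint(env, app_state)
restored_env, restored_app = restore(small)
print(len(full), len(small))  # the second checkpoint is much smaller
```

The trade-off in this sketch is that restoring spends time rebuilding the environment, which is acceptable when checkpoints are taken far more often than recoveries occur.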

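Project Abstract

High-performance computing systems improve performance by scaling up the number of computing nodes, which brings a reliability problem: faults increase exponentially with the number of nodes, so matching fault-tolerance support is required. Rollback recovery achieves fault tolerance through time redundancy and needs no redundant nodes, which suits the needs of high-performance computing. However, existing methods save state data solely as a memory image when taking a process checkpoint and replay logged messages serially during fault recovery, so their overhead grows sharply with system scale.

This project studied the non-equivalence between process checkpointing and process regeneration and proposed a process checkpointing technique based on state distinction. The technique parses the composition of process state through semantic modeling of the program and replaces the memory image of the process environment with extracted characteristic values, which reduces the amount of checkpoint data and lowers checkpointing overhead. Based on the differences between process roll-forward and normal process execution, the project proposed a fast roll-forward technique based on concurrent replay. It judges whether messages are independent by estimating their scopes of effect and removes dependencies between messages by logging results, which raises the concurrency of message replay and lowers fault recovery overhead.

These techniques have theoretical reference value and practical value for relieving the sharp growth of fault-tolerance overhead with system scale, and help reduce the fault-tolerance overhead of large-scale high-performance computing. Building on the theoretical work, the project examined the data evolution characteristics of transaction-processing applications, OpenMP programs, and engineering algorithms such as frequent itemset mining and iterative solution of linear equation systems. It applied these characteristics to optimizing rollback recovery overhead and derived an implementation framework for fault-tolerance overhead optimization for these classes of applications. The research results have been applied to anomaly detection and localization in the Hunan Mobile billing system, with good results in practice.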
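
The concurrent replay idea can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's algorithm: the log format `(id, scope, delta)`, the helper names, and the update rule are hypothetical, and result logging for breaking dependencies is not modeled. Messages whose effect scopes do not overlap are placed in the same wave and replayed in parallel, while messages with overlapping scopes keep their logged order.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical message log: (id, scope, delta), where `scope` is the set of
# state keys the message reads and writes.
LOG = [
    (0, {"a"}, 1),
    (1, {"b"}, 10),
    (2, {"a"}, 2),
    (3, {"c"}, 100),
    (4, {"b", "c"}, 5),
]

def apply_message(state, msg):
    _, scope, delta = msg
    for key in scope:
        state[key] = state[key] * 2 + delta  # order-sensitive update

def replay_sequential(log):
    state = {"a": 0, "b": 0, "c": 0}
    for msg in log:
        apply_message(state, msg)
    return state

def plan_waves(log):
    # A message goes one wave after the latest earlier message whose scope
    # overlaps its own, so messages in the same wave are independent.
    levels = []
    for i, (_, scope, _) in enumerate(log):
        level = 0
        for j in range(i):
            if scope & log[j][1]:
                level = max(level, levels[j] + 1)
        levels.append(level)
    waves = [[] for _ in range(max(levels, default=-1) + 1)]
    for msg, level in zip(log, levels):
        waves[level].append(msg)
    return waves

def replay_concurrent(log):
    state = {"a": 0, "b": 0, "c": 0}
    with ThreadPoolExecutor() as pool:
        for wave in plan_waves(log):
            # Each wave finishes before the next one starts.
            list(pool.map(lambda m: apply_message(state, m), wave))
    return state

print(replay_sequential(LOG) == replay_concurrent(LOG))
```

In this sketch the five logged messages are replayed in two waves instead of five sequential steps, and the final state matches the sequential replay because messages within a wave touch disjoint keys.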

Project Outputs

No outputs recorded.

Data last updated: 2023-05-31

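Other Grants Held by Yang Jinmin (杨金民)

Phenomenological Study of New Physics Beyond the Standard Model

Grant number: 11675242

Year approved: 2016

Funding amount: 68.00

Project type: General Program

Phenomenological Study of the Higgs Boson and New Physics

Grant number: 11275245

Year approved: 2012

Funding amount: 80.00

Project type: General Program

Phenomenological Study of Supersymmetry

Grant number: 10475107

Year approved: 2004

Funding amount: 22.00

Project type: General Program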

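Similar NSFC Grants

Research on Fault-Tolerant Computing Models for Exascale High-Performance Computing Systems

Grant number: 61272142

Year approved: 2012

Principal investigator: Lu Kai (卢凯)

Discipline classification: F0202

Funding amount: 72.00

Project type: General Program

Research on Key Technologies for Byzantine Fault Diagnosis and Fault Tolerance in Cloud Computing

Grant number: 61173017

Year approved: 2011

Principal investigator: Yang Zhen (杨震)

Discipline classification: F0201

Funding amount: 55.00

Project type: General Program

Research on Precision Alignment Technology for Multilayer Roll-to-Roll Printed Electronics Equipment

Grant number: 51475017

Year approved: 2014

Principal investigator: Chen Weihai (陈伟海)

Discipline classification: E0501

Funding amount: 84.00

Project type: General Program

Research on Software-Defined Networking Technology for High-Performance Computing Applications

Grant number: 61402444

Year approved: 2014

Principal investigator: Li Qiang (李强)

Discipline classification: F0204

Funding amount: 26.00

Project type: Young Scientists Fund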
