Iterative data processing is widely used in data analysis tasks such as graph processing and machine learning. Large-scale data analysis generally relies on distributed processing systems to run iterative computations with long execution times. Over such long runs, it is common for some nodes in a distributed system to fail, so the fault-tolerance mechanism is critical to system design. This project aims to study efficient fault tolerance for distributed iterative data processing. It covers: 1) online evaluation of the cost of non-stationary iterative processing; 2) modeling and solving the problem of dynamically setting the checkpoint interval; 3) an adaptive hybrid failure recovery strategy and its availability. In this proposal, we first evaluate the cost of iterative processing based on the properties of non-stationary algorithms and datasets as well as of incremental computation. Then, to model the problem of dynamically setting the checkpoint interval, we introduce the system failure rate with the objective of minimizing the expected total execution cost, analyze the computational complexity of the problem, and design an approximation algorithm accordingly. Finally, we use a Markov chain to describe the process of optimistic recovery without any checkpoint, build a hybrid recovery strategy on top of it, and establish the strategy's availability through convex optimization. In addition, we implement a prototype system to demonstrate dynamic checkpoint-interval setting and hybrid recovery. This project addresses the fault-tolerance requirements of highly available and highly reliable big data analysis, and has broad application prospects if the expected results are achieved.
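To make the trade-off behind the checkpoint interval concrete, the sketch below uses the classic first-order static approximation (Young's formula), not the dynamic model this project proposes. It assumes a constant checkpoint cost and a constant failure rate, expressed as a mean time between failures; the function names and the example numbers are illustrative only. With non-stationary iterations, the per-iteration cost changes over time, which is exactly why a fixed interval of this kind may not be adequate.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Static checkpoint interval from Young's first-order approximation.

    checkpoint_cost: time to write one checkpoint (assumed constant).
    mtbf: mean time between failures, i.e. 1 / failure rate (assumed constant).
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def expected_overhead(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of run time lost to fault tolerance.

    First term: time spent writing checkpoints.
    Second term: expected recomputation after a failure, which on average
    strikes halfway through an interval.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost, mtbf = 60.0, 86400.0  # illustrative: 60 s checkpoint, one failure per day
    best = young_interval(cost, mtbf)
    for interval in (best / 2, best, best * 2):
        overhead = expected_overhead(interval, cost, mtbf)
        print(f"interval={interval:9.1f}s  overhead={overhead:.4f}")
```

Checkpointing too often wastes time on writes, while checkpointing too rarely increases the work lost per failure; the interval returned by `young_interval` balances the two terms under the stated assumptions.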
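Iterative data processing is common in data analysis scenarios such as graph processing and machine learning, and large-scale data analysis typically uses distributed processing systems to run long iterative computations. Node failures during computation are a routine occurrence in distributed systems, so the fault-tolerance mechanism is critical to system design. This project studied efficient fault-tolerance mechanisms for distributed iterative data processing, covering: 1) online evaluation of the cost of non-stationary iterative processing; 2) modeling and solving the problem of dynamically setting the checkpoint interval; 3) an adaptive hybrid failure recovery strategy and its availability. Building on the results in these three areas, the project designed and implemented a prototype system. With the support of this project, 15 papers were published, including 4 at SIGMOD and VLDB; 2 patent applications were filed; 3 software copyrights were obtained; and 5 graduate students were trained.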
Data last updated: 2023-05-31