With the development of large-scale distributed systems, and especially with the rise of cloud computing, failures appear more frequently. The effort spent on failure diagnosis has also increased, since both the types and the root causes of failures have become more diverse and complex. This proposal presents a study on automatic service failure diagnosis in large-scale distributed systems. A service failure is a failure that makes a system perform poorly or run far slower than expected. The proposal has two main research topics: (1) automatic failure model extraction based on adaptive tracing, and (2) automatic fault localization based on derivation and verification. The goal of the first topic is to improve both the accuracy and the scalability of failure detection while keeping the cost of tracing low. To achieve this goal, we plan to study adaptive end-to-end tracing, which adjusts both the sampling rates and the granularities of online tracing. In this study, failure models will first be extracted and refined on the basis of the tracing results, and will then be used in turn to guide the tracing strategies. The goal of the second topic is to improve both the accuracy and the efficiency of fault localization (i.e., locating the root causes of service failures). Our plan for this topic is to design a new representation for failure-related behaviors, along with an algorithm that quantifies a failure contribution rate for each behavior. We will then study automatic fault localization by performing the following two steps iteratively: deriving root cause candidates from the failure contribution rates, and verifying the candidates with an effective divide-and-conquer strategy. The overall goal of this proposal is to provide new approaches and key techniques for diagnosing service failures in large-scale distributed systems after they have been deployed.
These approaches and techniques will help improve both the efficiency and the effectiveness of failure diagnosis and, consequently, raise the QoS of large-scale distributed systems.
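The feedback loop between failure models and tracing strategy can be illustrated with a minimal Python sketch. It is not the proposal's method: the class `AdaptiveSampler`, the `suspicion` score (assumed to come from a failure model, in the range 0 to 1), the multiplicative `step`, and the threshold are all hypothetical names and choices made for illustration. The sketch only shows a sampling rate that can move in both directions; adjusting tracing granularity is omitted.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveSampler:
    """Hypothetical controller: raises or lowers the trace sampling rate."""
    rate: float = 0.01        # current fraction of requests traced
    min_rate: float = 0.001   # floor that bounds tracing cost from below
    max_rate: float = 1.0     # ceiling: trace every request
    step: float = 2.0         # multiplicative adjustment factor (assumption)
    threshold: float = 0.5    # suspicion level that triggers more tracing

    def update(self, suspicion: float) -> float:
        """Adjust the rate from a failure-model suspicion score in [0, 1]."""
        if suspicion > self.threshold:
            # The model suspects a failure: trace more requests.
            self.rate = min(self.max_rate, self.rate * self.step)
        else:
            # Behavior looks normal: back off to keep overhead low.
            self.rate = max(self.min_rate, self.rate / self.step)
        return self.rate

    def should_trace(self, request_id: int) -> bool:
        """Deterministic hash-based sampling decision for one request."""
        bucket = ((request_id * 2654435761) % (2 ** 32)) / (2 ** 32)
        return bucket < self.rate
```

In this sketch the decision in `should_trace` is derived from the request identifier rather than from a random draw, so that every component handling the same request would reach the same decision; whether the proposed tracing system works this way is not stated in the abstract.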
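With the development of large-scale distributed systems, and especially the rise of cloud computing, the manifestations, causes, and propagation patterns of failures have all taken on new characteristics, further increasing the burden of detecting failures and locating their causes. This proposal studies automatic diagnosis methods for failures that involve degraded quality of service in large-scale distributed systems. The research covers: (1) proposing an adaptive tracing strategy in which both the sampling rate and the tracing granularity can be adjusted in either direction and, based on this strategy, studying techniques for automatically extracting and continuously refining failure models, in support of automatic detection of service failures; this work aims to control tracing overhead while improving the accuracy of failure detection and the scalability of the method; (2) studying techniques for automatically locating failure causes: first, studying failure-related factors and a model that quantitatively evaluates each factor's contribution to a failure; then, based on the computed failure contribution rates, studying an automatic fault localization method that alternates iteratively between derivation and divide-and-conquer verification; this work aims to locate failure causes automatically and accurately. The research above will provide methods and key techniques for diagnosing service failures in large-scale distributed systems after deployment, identifying the manifestations and causes of service failures in a timely and accurate manner and improving system reliability and quality of service.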
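The derive-then-verify iteration in topic (2) can be sketched in Python as follows. This is an illustrative sketch under strong assumptions, not the proposed algorithm: it assumes a single root cause, a precomputed `contribution` map from candidate name to failure contribution rate, and a hypothetical oracle `reproduces(subset)` that reports whether the failure still occurs when only the given candidates are active. All of these names are invented for the example.

```python
from typing import Callable, Dict, List, Optional


def localize(
    contribution: Dict[str, float],
    reproduces: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the single candidate that explains the failure, or None."""
    # Derivation step: rank candidates by failure contribution rate.
    candidates = sorted(contribution, key=lambda c: contribution[c], reverse=True)
    if not reproduces(candidates):
        return None  # the failure is not explained by any candidate

    # Verification step: divide and conquer over the ranked list.
    while len(candidates) > 1:
        mid = len(candidates) // 2
        top, rest = candidates[:mid], candidates[mid:]
        # Keep the half that still reproduces the failure; under the
        # single-cause assumption, exactly one half does.
        candidates = top if reproduces(top) else rest
    return candidates[0]
```

Because each round halves the candidate set, the oracle is called a logarithmic number of times in the number of candidates. Handling multiple interacting causes would need a different verification strategy than this bisection.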
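The development of large-scale distributed systems has further increased the burden of detecting failures and locating their causes. This project targets failures that involve degraded quality of service in large-scale distributed systems. Taking the querying and analysis of distributed-system runtime logs as its entry point, and supported by precise dynamic and static program analysis, it studies failure detection and the automatic localization of failure causes. The work was carried out in three areas (log analysis, program analysis, and fault localization); several key techniques and methods were proposed in each, yielding the following five results: (1) to address the limited scalability of failure-cause derivation and verification techniques, we improved a fault localization technique based on the minimum debugging frontier and proposed a sparse symbolic search algorithm; (2) to address redundant computation in log analysis and in the continuous refinement of analysis results, we proposed an incremental optimization technique for recurring queries guided by semantic rules; (3) to address the limited scalability of path-sensitive analysis, we proposed an efficient scenario-sensitive, goal-directed analysis method that narrows the set of candidate failure causes; (4) exploiting the similarities and dependencies among the queries involved in log analysis, we proposed a batch query-plan optimization technique based on inter-query flow analysis, as well as an optimization technique for query-plan transformation; (5) we proposed a two-level graph partitioning optimization method based on articulation points and, using the properties of articulation points in graphs, a betweenness centrality algorithm that eliminates redundant computation, improving the efficiency of identifying important nodes in a network. With the support of this project, 12 papers were published, including papers at well-known international conferences in the field such as CC’17, PPOPP’16, and ICS’15, in well-known international journals such as TPDS, TC, and TSE, and in core domestic journals such as the Chinese Journal of Computers and the Journal of Software. Four patent applications were filed, and a prototype that assists failure diagnosis was built. The research in this project partially solves the cost and automation problems of failure detection and failure-cause localization.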
Data last updated: 2023-05-31
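Research on user-level semi-automatic diagnosis methods for multithreaded program failures
Research on automatic generation methods for large-scale fuzzy systems
Research on online quality-of-service prediction methods for large-scale service systems
Research on failure analysis methods for large-scale computing platforms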