The "big data" era brings a variety of new challenges. A growing number of big data analysis applications have stringent real-time response requirements: whenever new data arrives, the results of analytics, statistics, and calculations must be updated within a given period of time. Distributed stream processing architectures, owing to their flexibility and scalability, are therefore becoming increasingly attractive to both developers and users of real-time big data analysis applications. Because key properties of the input data, including its volume, arrival rate, and value distribution, can fluctuate unpredictably, resources must be provisioned and scheduled dynamically to ensure real-time response. A practical scheduler must adjust resources according to the current workload, so that it neither wastes resources nor fails to deliver correct results on time. Because of its complexity and difficulty, designing such a scheduler remains an open problem. In this project, we propose new scheduler designs for two types of critical resources. For scheduling computation resources among multiple co-existing stream processing applications, we formulate the problem as a convex optimization problem based on analytical performance models built with Generalized Open Queueing Network theory. For network bandwidth resources, we apply Software-Defined Networking (SDN), which can provide rate control and bandwidth allocation at the level of end-to-end connections, in order to minimize network transmission delay.
Finally, we evaluate the proposed resource schedulers through both simulations and real experiments, after implementing them in several practical open-source distributed stream processing systems such as Apache Storm, Twitter Heron, Apache Flink, and Yahoo! S4.
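To make the queueing-based formulation more concrete, the following is a minimal sketch, not the project's actual model or algorithm. It assumes each operator behaves as an independent M/M/k queue (a simplification of a Generalized Open Queueing Network) and that the objective is the arrival-rate-weighted sum of expected sojourn times, which by Little's law is proportional to the mean end-to-end latency in such a network. The function names (`erlang_c`, `sojourn_time`, `allocate`) and the example rates are hypothetical. The greedy step relies on the expected M/M/k sojourn time being convex in the number of servers, which is what makes adding one executor at a time a reasonable strategy for this separable objective.

```python
import math


def erlang_c(k, a):
    """Probability that an arriving tuple has to wait in an M/M/k queue.

    k is the number of servers and a = lam / mu is the offered load (a < k).
    """
    head = sum(a ** n / math.factorial(n) for n in range(k))
    tail = a ** k / math.factorial(k) * k / (k - a)
    return tail / (head + tail)


def sojourn_time(lam, mu, k):
    """Expected time a tuple spends at an operator (waiting plus service)."""
    a = lam / mu
    if k <= a:
        return math.inf  # unstable: the queue grows without bound
    return erlang_c(k, a) / (k * mu - lam) + 1.0 / mu


def allocate(operators, total):
    """Greedily split `total` executors among operators.

    operators is a list of (arrival_rate, service_rate) pairs. Each operator
    first receives the smallest stable allocation; every remaining executor
    goes to the operator whose weighted latency drops the most.
    """
    alloc = [math.floor(lam / mu) + 1 for lam, mu in operators]
    if sum(alloc) > total:
        raise ValueError("not enough executors for a stable allocation")

    def weighted(i, k):
        lam, mu = operators[i]
        return lam * sojourn_time(lam, mu, k)

    for _ in range(total - sum(alloc)):
        gains = [weighted(i, alloc[i]) - weighted(i, alloc[i] + 1)
                 for i in range(len(operators))]
        best = max(range(len(operators)), key=gains.__getitem__)
        alloc[best] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical topology: three operators given as (tuples/s in, tuples/s per executor).
    ops = [(8.0, 10.0), (30.0, 10.0), (5.0, 10.0)]
    print(allocate(ops, total=10))
```

A dynamic scheduler in this spirit would re-estimate the arrival and service rates from runtime measurements and rerun the allocation whenever they drift.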
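The arrival of the big data era has brought a variety of new challenges, and a growing number of big data processing applications require the results of data analysis, statistics, and computation to be produced in real time. As a result, distributed stream processing architectures, with their good scalability and flexibility, are increasingly favored by developers of real-time big data applications. For a distributed stream processing system, dynamic resource scheduling is a key factor affecting performance; because of its difficulty and complexity, it has not yet been solved effectively. This project proposes new approaches and methods for scheduling two important types of resources. For computation resources, we build performance analysis models and combine them with convex optimization theory to design a dynamic scheduling algorithm that maximizes the utilization of computation resources when multiple applications are deployed at the same time. For network resources, we apply software-defined networking (SDN), a new technology and tool, to the scheduling of network bandwidth at the granularity of end-to-end connections, so as to minimize the data processing delay caused by network transmission.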
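As application scenarios that combine mobile Internet, 5G, AI, and IoT multiply, the demand for real-time analysis and processing of complex, massive, multimodal streaming data keeps growing. This places higher demands on the design of elastic distributed stream processing systems whose resources can be flexibly scaled and adjusted. In distributed stream processing, achieving effective real-time elastic resource scheduling has long been difficult and has lacked effective technical solutions. This project addresses that frontier problem. During the project period, the following representative results were obtained: (1) for the computation resources of a single real-time distributed stream processing application, a general and accurate performance analysis model, together with a practically deployable dynamic scheduling algorithm that maximizes resource utilization; (2) for the computational bottleneck caused by uneven key-value partitioning among the parallel tasks of an operator, a lightweight scheduling optimization algorithm that rebalances load in real time; (3) more generally, for any stream data processing system, a resource scheduling scheme that trades off among computation resources, the processing latency of results, and the accuracy of results; (4) an extension of the resource scheduling methods to heterogeneous computation resources, that is, systems that include CPUs, GPUs, and FPGAs; (5) for network bandwidth resources, a cross-layer optimization framework (from the network transport layer to the stream processing application layer) that uses SDN to implement and deploy the optimization algorithms. With the support of this project, 3 papers were published in SCI-indexed international journals, all of them CCF class A journals, along with 6 papers at EI-indexed international conferences, including 2 at CCF class A conferences, 2 at class B conferences, and 1 at a class C conference. These results are at the international forefront of elastic resource scheduling for distributed stream processing systems and have been recognized by peers; the published papers have accumulated more than 50 citations to date.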
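As an illustration of result (2), the sketch below shows one simple way to rebalance skewed key partitions among the parallel tasks of an operator. It is an assumed greedy heuristic, not the algorithm developed in the project: it repeatedly moves one key from the most loaded task to the least loaded task, picking the key whose load is closest to half the gap between them, and it caps the number of migrations to stay lightweight. The function name `rebalance`, its parameters, and the sample loads are all hypothetical.

```python
def rebalance(assignment, key_load, num_tasks, max_moves):
    """Greedy key migration between parallel tasks of one operator.

    assignment maps key -> task index, key_load maps key -> measured load.
    Returns the new assignment and the list of (key, from_task, to_task) moves.
    """
    assignment = dict(assignment)  # work on a copy
    load = [0.0] * num_tasks
    for key, task in assignment.items():
        load[task] += key_load[key]

    moves = []
    for _ in range(max_moves):
        hot = max(range(num_tasks), key=load.__getitem__)
        cold = min(range(num_tasks), key=load.__getitem__)
        gap = load[hot] - load[cold]
        # Only keys lighter than the gap lower the larger of the two task loads.
        candidates = [k for k, t in assignment.items()
                      if t == hot and 0 < key_load[k] < gap]
        if not candidates:
            break
        # Moving a load of gap / 2 would equalize the pair exactly.
        key = min(candidates, key=lambda k: abs(key_load[k] - gap / 2))
        assignment[key] = cold
        load[hot] -= key_load[key]
        load[cold] += key_load[key]
        moves.append((key, hot, cold))
    return assignment, moves


if __name__ == "__main__":
    # Hypothetical skew: three heavy keys hashed to task 0, two light keys to task 1.
    key_load = {"a": 50, "b": 30, "c": 10, "d": 5, "e": 5}
    assignment = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
    new_assignment, moves = rebalance(assignment, key_load, num_tasks=2, max_moves=10)
    print(new_assignment, moves)
```

In a real stream processor each move would also require migrating the state held for that key, which is why a cap such as `max_moves` matters in practice.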
Data last updated: 2023-05-31