Recently, massive graph-structured data with billions of nodes has been adopted in several applications, such as knowledge graphs and social networks, etc. Pattern matching query is one of the most widely used approach to retrieve these graphs. Given a pattern graph Q and a data graph G, pattern matching query is to find a subgraph of G satisfying the pattern constrains of Q. ..However, existing pattern matching techniques have at least four technical challenges to effectively and efficiently retrieve the data graphs. Firstly, these queries require that all nodes of Q must match to nodes of G. But, such complete matching is often unachievable due to the great data incompleteness in G. Secondly, most existing works study pattarn matching query with a certain pattern constrain, which can only support a certain application scenario. They cannot efficiently support the applications with diverse scenarios. Thirdly, there are often many queries submitted to be processed concurrently in real applications, and the queries usually have common expressions. But, most existing works just focus on processing a single query. It is under-optimized for the real applications. Fourthly, most existing distributed pattern matching methods are based on the traditional distributed frameworks, such as MapReduce and Vertex-oriented framework. But, these frameworks do not suit the characteristics of pattern matching computation and hence the methods cannot scale to the ever-growing massive graphs. ..To address these challenges, this project studies the optimal subpattern matching query with multi-query optimization using Spark. The novelties of this research are as follows. Firstly, the optimal subpattern matching query allows incomplete matching and therefore is of higher effectiveness on the data graphs having great data incompleteness. Secondly, this project integrates the processing of multiple pattern constrains. Therefore, our technique can efficiently support the applications with diverse scenarios. Thirdly, this project studies multi-query optimization techniques, which can significantly improve the performance of the query processing in real applications. Fourthly, this project adopts the recently proposed and well-accepted Spark framework. It is more suitable to the characteristics of pattern matching computation and can achieve higher scalability. ..The research results of this project can provide an effective and efficient way for retrieving real massive data graphs, and have great theoretical and practical significance. The results can also guide the practices and applications of pattern matching techniques to massive graph retrieval and mining.
近年来,包含海量节点的大图数据得到了广泛的应用,如知识图谱和社交网络等。模式匹配查询是这些大图数据上应用非常广泛的查询方式。但现有的模式匹配查询方法在查询的查全率、多种模式约束的整合、多查询优化和分布式求解方法的可扩展性方面还存在一些不足。因此本项目提出基于新一代分布式计算框架Spark且支持多种模式约束和多查询优化的最优子模式匹配查询方法。该方法能在保证查准率的同时提高查全率;支持多种模式约束,有较广的应用范围;利用多个查询之间的共性进行优化,在多查询并发时整体的查询处理效率较高;基于更适合模式匹配计算的Spark框架,其分布式求解方法有较高的可扩展性。本项目不仅能为当前的大图数据应用提供一种有效且快速的查询方法,而且能进一步丰富大图数据模式匹配的理论,具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。
本项目研究一种具有较高查全率同时保证查准率、支持多种模式约束、支持多查询优化、基于新一代分布式计算框架Spark具有较高可扩展性的最优子模式匹配查询方法。主要研究内容包括:1)研究单个最优子模式匹配查询的处理方法。定义最优子模式匹配查询,分析其计算复杂性,并研究它的近似算法。该算法不仅要支持多种模式约束,还要有较好的近似度和较低阶的时间复杂度。研究单个最优子模式匹配查询最优执行计划的搜索,包括搜索算法和选择度估计方法。2)研究多个最优子模式匹配查询的优化方法。根据最优子模式匹配查询的性质,研究公共子查询的定义及其提取算法。研究一个成本模型来衡量一个整体执行计划的优劣。基于该成本模型,研究最优整体执行计划的搜索算法。研究将多个查询进行分组并对各组独立地进行多查询优化,以进一步提高多查询优化的效果。3)研究它们在Spark框架下的分布式算法。对前述的最优子模式匹配查询算法及其多查询优化,研究如何使用Spark所支持的操作来设计分布式算法。研究利用Spark的DAG任务调度机制和内存持久化机制来最小化Spark算法的数据混洗。研究将多个分布式计算任务进行合并,以进一步提高计算效率。在本项目的资助下,研究团队取得了11项研究成果。在国内外权威期刊和国际知名学术会议共发表学术论文8篇,其中在SCI 检索的期刊中发表论文3 篇,包括2篇论文发表在国际权威学术期刊《IEEE Transactions on Knowledge and Data Engineering》和《ACM Transactions on Intelligent Systems and Technology》,2篇论文发表在国际权威学术会议ICDE 2016和ICDE 2018中;授权1项美国发明专利、1项中国发明专利、1项软件著作权。培养硕士生2人。各项指标均完成了预期目标。本项目的研究可以为当前的大图数据应用提供一种具有较高查全率和更广的应用范围、支持多查询优化且可扩展性更高的查询方法,能进一步丰富和完善大图数据的模式匹配和分布式计算的基本理论,具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。
{{i.achievement_title}}
数据更新时间:2023-05-31
论大数据环境对情报学发展的影响
基于多模态信息特征融合的犯罪预测算法研究
基于ESO的DGVSCMG双框架伺服系统不匹配 扰动抑制
惯性约束聚变内爆中基于多块结构网格的高效辐射扩散并行算法
基于协同表示的图嵌入鉴别分析在人脸识别中的应用
基于图的个人数据空间模型与查询方法研究
基于图数据库理论的海量RDF数据存储和查询方法研究
基于符号决策图的图数据表示和匹配研究
动态异质大图匹配模型及算法研究