基于Spark的大图数据最优子模式匹配查询方法研究

基本信息

批准号：61502258

项目类别：青年科学基金项目

资助金额：20.00

负责人：彭云

学科分类：

依托单位：齐鲁工业大学

批准年份：2015

结题年份：2018

起止时间：2016-01-01 - 2018-12-31

项目状态：已结题

项目参与者：蔡冠球,赵华伟,张路,葛程程,杨涛

关键词：

Spark框架子模式匹配查询多查询优化大图数据

结项摘要

Recently, massive graph-structured data with billions of nodes has been adopted in several applications, such as knowledge graphs and social networks, etc. Pattern matching query is one of the most widely used approach to retrieve these graphs. Given a pattern graph Q and a data graph G, pattern matching query is to find a subgraph of G satisfying the pattern constrains of Q. ..However, existing pattern matching techniques have at least four technical challenges to effectively and efficiently retrieve the data graphs. Firstly, these queries require that all nodes of Q must match to nodes of G. But, such complete matching is often unachievable due to the great data incompleteness in G. Secondly, most existing works study pattarn matching query with a certain pattern constrain, which can only support a certain application scenario. They cannot efficiently support the applications with diverse scenarios. Thirdly, there are often many queries submitted to be processed concurrently in real applications, and the queries usually have common expressions. But, most existing works just focus on processing a single query. It is under-optimized for the real applications. Fourthly, most existing distributed pattern matching methods are based on the traditional distributed frameworks, such as MapReduce and Vertex-oriented framework. But, these frameworks do not suit the characteristics of pattern matching computation and hence the methods cannot scale to the ever-growing massive graphs. ..To address these challenges, this project studies the optimal subpattern matching query with multi-query optimization using Spark. The novelties of this research are as follows. Firstly, the optimal subpattern matching query allows incomplete matching and therefore is of higher effectiveness on the data graphs having great data incompleteness. Secondly, this project integrates the processing of multiple pattern constrains. Therefore, our technique can efficiently support the applications with diverse scenarios. Thirdly, this project studies multi-query optimization techniques, which can significantly improve the performance of the query processing in real applications. Fourthly, this project adopts the recently proposed and well-accepted Spark framework. It is more suitable to the characteristics of pattern matching computation and can achieve higher scalability. ..The research results of this project can provide an effective and efficient way for retrieving real massive data graphs, and have great theoretical and practical significance. The results can also guide the practices and applications of pattern matching techniques to massive graph retrieval and mining.

近年来，包含海量节点的大图数据得到了广泛的应用，如知识图谱和社交网络等。模式匹配查询是这些大图数据上应用非常广泛的查询方式。但现有的模式匹配查询方法在查询的查全率、多种模式约束的整合、多查询优化和分布式求解方法的可扩展性方面还存在一些不足。因此本项目提出基于新一代分布式计算框架Spark且支持多种模式约束和多查询优化的最优子模式匹配查询方法。该方法能在保证查准率的同时提高查全率；支持多种模式约束，有较广的应用范围；利用多个查询之间的共性进行优化，在多查询并发时整体的查询处理效率较高；基于更适合模式匹配计算的Spark框架，其分布式求解方法有较高的可扩展性。本项目不仅能为当前的大图数据应用提供一种有效且快速的查询方法，而且能进一步丰富大图数据模式匹配的理论，具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。

项目摘要

本项目研究一种具有较高查全率同时保证查准率、支持多种模式约束、支持多查询优化、基于新一代分布式计算框架Spark具有较高可扩展性的最优子模式匹配查询方法。主要研究内容包括：1）研究单个最优子模式匹配查询的处理方法。定义最优子模式匹配查询，分析其计算复杂性，并研究它的近似算法。该算法不仅要支持多种模式约束，还要有较好的近似度和较低阶的时间复杂度。研究单个最优子模式匹配查询最优执行计划的搜索，包括搜索算法和选择度估计方法。2）研究多个最优子模式匹配查询的优化方法。根据最优子模式匹配查询的性质，研究公共子查询的定义及其提取算法。研究一个成本模型来衡量一个整体执行计划的优劣。基于该成本模型，研究最优整体执行计划的搜索算法。研究将多个查询进行分组并对各组独立地进行多查询优化，以进一步提高多查询优化的效果。3）研究它们在Spark框架下的分布式算法。对前述的最优子模式匹配查询算法及其多查询优化，研究如何使用Spark所支持的操作来设计分布式算法。研究利用Spark的DAG任务调度机制和内存持久化机制来最小化Spark算法的数据混洗。研究将多个分布式计算任务进行合并，以进一步提高计算效率。在本项目的资助下，研究团队取得了11项研究成果。在国内外权威期刊和国际知名学术会议共发表学术论文8篇，其中在SCI 检索的期刊中发表论文3 篇，包括2篇论文发表在国际权威学术期刊《IEEE Transactions on Knowledge and Data Engineering》和《ACM Transactions on Intelligent Systems and Technology》，2篇论文发表在国际权威学术会议ICDE 2016和ICDE 2018中；授权1项美国发明专利、1项中国发明专利、1项软件著作权。培养硕士生2人。各项指标均完成了预期目标。本项目的研究可以为当前的大图数据应用提供一种具有较高查全率和更广的应用范围、支持多查询优化且可扩展性更高的查询方法，能进一步丰富和完善大图数据的模式匹配和分布式计算的基本理论，具有重要的理论和现实意义。本项目的研究成果还能为大图数据检索、挖掘等相关领域的实践与应用提供理论指导和方法借鉴。

项目成果

DOI：{{i.doi}}

发表时间：{{i.publish_year}}

暂无此项成果

数据更新时间：2023-05-31

其他相关文献

DOI：

发表时间：2021

DOI：10.1051/jnwpu/20213920292

发表时间：2021

DOI：

发表时间：2020

DOI：10.3788/CJL201946.0801003

发表时间：2019

DOI：10.19328/j.cnki.2096-8655.2022.02.002

发表时间：2022

彭云的其他基金

批准号：61662032

批准年份：2016

资助金额：36.00

项目类别：地区科学基金项目

批准号：68880507

批准年份：1988

资助金额：3.00

项目类别：专项基金项目

批准号：51709037

批准年份：2017

资助金额：25.00

项目类别：青年科学基金项目

相似国自然基金

基于图的个人数据空间模型与查询方法研究

批准号：61170027

批准年份：2011

负责人：李玉坤

学科分类：F0202

资助金额：55.00

项目类别：面上项目

基于图数据库理论的海量RDF数据存储和查询方法研究

批准号：61003009

批准年份：2010

负责人：邹磊

学科分类：F0202

资助金额：19.00

项目类别：青年科学基金项目

基于符号决策图的图数据表示和匹配研究

批准号：61572146

批准年份：2015

负责人：古天龙

学科分类：F0202

资助金额：67.00

项目类别：面上项目

动态异质大图匹配模型及算法研究

批准号：61502349

批准年份：2015

负责人：祝园园

学科分类：F0202

资助金额：22.00

项目类别：青年科学基金项目

基于Spark的大图数据最优子模式匹配查询方法研究

{{i.achievement_title}}

暂无此项成果

其他相关文献

基于铁路客流分配的旅客列车开行方案调整方法

一种基于多层设计空间缩减策略的近似高维优化方法

基于多色集合理论的医院异常工作流处理建模

基于腔内级联变频的0.63μm波段多波长激光器

"多对多"模式下GEO卫星在轨加注任务规划

彭云的其他基金

语义约束主题模型的细粒度商品特征和情感词提取研究

神经网络与符号系统相结合的一般性推理模型

面向海洋能综合利用的港口多能互补与生产调度协同优化

相似国自然基金