Topic analysis, which comprises topic modeling and topic labeling, is widely used in natural language processing, information retrieval, and information extraction. Most existing topic modeling methods rest on the three-layer hierarchical Bayesian structure "document-topic-word", which does not capture semantic factors such as concepts. Moreover, current topic labeling algorithms have high computational complexity, which makes them difficult to apply to large-scale data. This project therefore explores new theory and methods for topic analysis on large-scale data, addressing both effectiveness and efficiency. First, to account for concepts as a semantic factor, we propose a new topic modeling theory based on a four-layer hierarchical assumption, "document-topic-concept-word". Second, building on this theory, we study new unsupervised and supervised topic models in order to evaluate it. To improve the efficiency of topic modeling, we further propose corresponding online learning algorithms based on particle filtering. Third, building on the theoretical framework for topic labeling that the applicant proposed at COLING 2016 and AAAI 2017, we intend to study new algorithms that reduce the complexity of topic labeling along two dimensions, data independence and data dependence. Through this work, we intend to form a comprehensive, systematic framework for topic analysis that improves both the effectiveness and the efficiency of topic analysis on large-scale data.
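Topic analysis, with topic modeling and topic labeling as its two complementary sides, has been widely applied in natural language processing, information retrieval, information extraction, and related fields. Most existing topic modeling methods are based on the three-layer Bayesian assumption "document-topic-word", which does not capture semantic factors such as "concepts"; meanwhile, current topic labeling algorithms are too complex to suit big-data scenarios. This project therefore intends to explore the theory and methods of efficient topic analysis from two angles, effectiveness and efficiency. First, considering the relationship between topics and concepts, it explores a new topic modeling assumption based on a four-layer Bayesian structure, "document-topic-concept-word". Second, it explores new unsupervised and supervised topic models to verify the proposed assumption, and proposes corresponding online learning algorithms, based on particle filter theory and the idea of reducing gradient variance, to address the efficiency of topic modeling. Third, building on the new topic labeling framework that the applicant proposed at COLING 2016 and AAAI 2017, it explores new methods along two dimensions, data independence and data dependence, to address the excessive complexity of topic labeling algorithms. Through this work, the project will form a systematic framework for efficient topic analysis and improve the accuracy and efficiency of topic analysis on big data.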
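Topic analysis, with topic modeling and topic labeling as its two complementary sides, has been widely applied in natural language processing, information retrieval, information extraction, and related fields. Most existing topic modeling methods are based on the three-layer Bayesian assumption "document-topic-word", which does not capture semantic factors such as "concepts"; meanwhile, current topic labeling algorithms are too complex to suit big-data scenarios. This project therefore explored the theory and methods of efficient topic analysis from two angles, effectiveness and efficiency. First, considering the relationship between topics and concepts, it explored a new topic modeling assumption based on a four-layer Bayesian structure, "document-topic-concept-word", as well as the addition of concept elements to the latest pre-trained language models. Second, it explored new unsupervised and supervised topic models to verify the proposed assumption, and proposed corresponding online learning algorithms, based on particle filter theory and the idea of reducing gradient variance, to address the efficiency of topic modeling. Third, building on the new topic labeling framework proposed by the principal investigator, it explored new methods along two dimensions, data independence and data dependence, to address the excessive complexity of topic labeling algorithms. Through this work, the project formed a systematic framework for efficient topic analysis and improved the accuracy and efficiency of topic analysis on big data. The project team carried out the research as originally planned, achieved the expected results, and exceeded the expected assessment targets. To date, the project has published 33 academic papers in international journals and conferences, of which 7 are SCI-indexed and 24 are EI-indexed. These include 10 papers in China Computer Federation (CCF) recommended Class A journals and conferences, 2 in Class B, 5 in Class C, and 4 in SCI Zone 2 journals. Team members gave 24 academic presentations at international conferences, and the project filed 4 national invention patent applications.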
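The four-layer "document-topic-concept-word" assumption can be illustrated with a small generative sketch. The abstract does not specify priors or parameterization, so everything below is an assumption made for illustration in the style of LDA: a Dirichlet prior over a document's topic proportions, a categorical distribution over concepts for each topic, and a categorical distribution over words for each concept. All names (`sample_dirichlet`, `sample_index`, `generate_document`, `topic_concept`, `concept_word`) are hypothetical and are not taken from the project.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(n_words, alpha, topic_concept, concept_word, rng):
    """Sample a document under an assumed document-topic-concept-word process.

    topic_concept[z] is a distribution over concepts for topic z;
    concept_word[c] is a distribution over words for concept c.
    """
    # Document layer: topic proportions for this document (assumed Dirichlet prior).
    theta = sample_dirichlet(alpha, rng)
    tokens = []
    for _ in range(n_words):
        z = sample_index(theta, rng)             # topic layer
        c = sample_index(topic_concept[z], rng)  # concept layer (the added layer)
        w = sample_index(concept_word[c], rng)   # word layer
        tokens.append((z, c, w))
    return theta, tokens


# Toy sizes, chosen arbitrarily: 3 topics, 5 concepts, 20 vocabulary words.
rng = random.Random(0)
K, C, V = 3, 5, 20
topic_concept = [sample_dirichlet([1.0] * C, rng) for _ in range(K)]
concept_word = [sample_dirichlet([1.0] * V, rng) for _ in range(C)]
theta, words = generate_document(50, [1.0] * K, topic_concept, concept_word, rng)
```

Compared with the three-layer "document-topic-word" structure, the only change in this sketch is the intermediate draw of a concept `c` between the topic and the word, so each word is explained by a concept rather than directly by a topic.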
Data last updated: 2023-05-31
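Genome-wide association analysis of leaf orientation value in maize
Research on a rice root system modeling method based on fractal L-systems
CFRP strengthening of rib-to-deck fatigue cracking in orthotropic steel bridge decks
Hardware Trojans: research progress on key issues and new trends
Research on SSVEP-based direct brain control of robot direction and speed
Research on key technologies for text processing based on topic formal concept analysis
Research on key technologies for subject indexing based on phrase information and domain concepts
Research on key technologies for knowledge acquisition and topic modeling
Research on key technologies for focused crawlers based on incremental learning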