Traditional topic analysis techniques struggle to describe the dynamics of text streams and to process them quickly. Designing effective online topic analysis frameworks that capture the inherent characteristics of text streams has therefore become a promising research direction. Existing methods, however, suffer from three limitations: (1) they cannot fully capture the inherent regularities of text streams; (2) their learning algorithms are not efficient enough; and (3) topic labeling is time-consuming. In this project, we will investigate and improve both the accuracy and the efficiency of topic analysis for text streams. For the first limitation, hierarchical Dirichlet processes and Brownian motion will be used to model changes in the number of topics, topic evolution, and changes in the vocabulary, and a generative model will be proposed that combines these dynamic characteristics with the basic components of topic modeling. For the second limitation, a novel online learning algorithm will be proposed that improves efficiency by reducing the variance of the gradient descent direction. For the third limitation, the topic labeling problem will be recast as a K-nearest-neighbor search in the space of probability distributions, and hashing-based similarity will be used to speed up the distance computation between two probability distributions, thereby addressing the high computational complexity of topic labeling. Finally, the three components will be integrated into a unified online topic analysis tool that improves the accuracy and efficiency of online topic analysis for text streams.
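To address the challenges that traditional topic analysis techniques face in describing the dynamics of text streams and processing them quickly, designing fast topic analysis methods that adapt to the inherent characteristics of text streams has become an active research topic in topic modeling. Existing methods, however, describe the dynamic regularities of text streams only one-sidedly, rely on learning algorithms whose efficiency needs improvement, and use topic interpretation algorithms of excessive complexity. This project therefore studies and improves online topic analysis methods for dynamic text streams from two perspectives: accuracy and efficiency. First, mathematical models such as hierarchical Dirichlet processes and Brownian motion are used to describe dynamic characteristics of text streams, including changes in the number of topics, topic evolution, and vocabulary change, and a generative model combines these characteristics with the basic components of topic models so that the inherent regularities of text streams are captured accurately. Second, a method that reduces the variance of the gradient descent direction is designed to improve the efficiency of online learning for topic models. Finally, topic interpretation is recast as a K-nearest-neighbor search in the space of probability distributions, which addresses the excessive complexity of topic interpretation algorithms accurately and efficiently. Together, this work will improve the accuracy and efficiency of topic analysis for text streams.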
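The abstract does not say which variance-reduction method the project uses. As a minimal sketch of the general idea only, the following applies SVRG (stochastic variance-reduced gradient), a well-known stand-in technique, to a toy one-dimensional least-squares problem rather than to a topic model. The function name `svrg`, its parameters, and the toy data are all assumptions made for illustration.

```python
import random


def svrg(xs, ys, step=0.05, epochs=30, inner=None, seed=0):
    """SVRG sketch for 1-D least squares: minimize mean((w * x - y) ** 2) / 2.

    Illustrative stand-in only; not the project's actual algorithm.
    """
    rng = random.Random(seed)
    n = len(xs)
    inner = inner or n

    def grad_i(w, i):
        # Gradient of the i-th sample's loss with respect to w.
        return (w * xs[i] - ys[i]) * xs[i]

    w = 0.0
    for _ in range(epochs):
        # Snapshot the iterate and compute the exact full gradient there.
        w_snap = w
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(inner):
            i = rng.randrange(n)
            # Variance-reduced direction: the noisy sample gradient is
            # corrected by its value at the snapshot plus the full gradient.
            direction = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            w -= step * direction
    return w


# Hypothetical toy data generated by y = 2x, so the optimum is w = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = svrg(xs, ys)
print(w)
```

The corrected direction remains an unbiased estimate of the full gradient, and its variance shrinks as the iterate approaches the snapshot, which is what allows a constant step size here.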
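The main goal of this project is to design fast topic analysis methods that adapt to the inherent characteristics of text streams, in response to the challenges that traditional topic analysis techniques face in describing the dynamics of text streams and processing them quickly. Specifically, the project studies and improves online topic analysis methods for dynamic text streams from two perspectives: accuracy and efficiency. First, mathematical models are used to describe dynamic characteristics of text streams, including changes in the number of topics, topic evolution, and vocabulary change, and a generative model combines these characteristics with the basic components of topic models so that the inherent regularities of text streams are captured accurately. Second, a method that reduces the variance of the gradient descent direction is designed to improve the efficiency of online learning for topic models. Finally, topic interpretation is recast as a K-nearest-neighbor search in the space of probability distributions, which addresses the excessive complexity of topic interpretation algorithms accurately and efficiently. The project team carried out the research as planned, achieved the expected results, and exceeded the expected assessment targets. To date, the project has published 17 academic papers in international journals and conferences, of which 5 are indexed by SCI and 12 by EI. These include 4 papers in journals and conferences rated Class A by the China Computer Federation (CCF), 2 in Class B venues, 5 in Class C venues, and 2 in domestic core journals. Project members gave 15 presentations at international conferences, and 2 national invention patent applications were filed.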
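The recasting of topic interpretation as a K-nearest-neighbor search over probability distributions can be sketched as follows. This is a brute-force illustration under stated assumptions, not the project's method: the abstract names neither the distance nor the hashing scheme, so the Hellinger distance is an assumed choice, and the names `hellinger` and `label_topic` and the toy distributions are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions on one vocabulary."""
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def label_topic(topic, candidates, k=1):
    """Return the k candidate labels whose distributions are nearest to `topic`."""
    ranked = sorted(candidates, key=lambda name: hellinger(topic, candidates[name]))
    return ranked[:k]


# Hypothetical toy data: a 4-word vocabulary and three candidate labels.
candidates = {
    "sports": [0.70, 0.20, 0.05, 0.05],
    "finance": [0.05, 0.10, 0.70, 0.15],
    "politics": [0.10, 0.60, 0.10, 0.20],
}
topic = [0.65, 0.25, 0.05, 0.05]
print(label_topic(topic, candidates, k=2))
```

The Hellinger distance equals the Euclidean distance between the element-wise square roots of the two distributions, scaled by 1/sqrt(2). This is one reason it pairs naturally with hashing-based search: after the square-root transform, locality-sensitive hashing schemes designed for Euclidean distance could replace the linear scan used here.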
Data last updated: 2023-05-31
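Research on a Rice Root System Modeling Method Based on Fractal L-Systems
The Eddy Covariance Technique and Its Application in Terrestrial Ecosystem Flux Research
Small UAV Remote Sensing Image Registration with Inlier Maximization and Redundant Point Control
Research Progress in Acupuncture Treatment of Gastroesophageal Reflux Disease
Effect of Incidence Angle on Tip Leakage Flow in a Compressor Cascade under Endwall Suction Control
Research on Real-Time Distribution Techniques for Massive Data Streams
Research on Topic Modeling for Short Texts
Real-Time Event Detection and Summarization in Social Text Streams
Research on Real-Time Analysis Algorithms for Massive Moving-Object Trajectory Data Streams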