After decades of development, natural language processing has worked its way through characters, words, phrases, and sentences, and is now moving on to discourse. In the age of statistical natural language processing and corpus linguistics, researchers have set about interpreting discourse structure by labeling the relations between discourse units with machine learning methods. This became possible with the publication of a large-scale corpus, the Penn Discourse TreeBank (PDTB). Performance, however, lags behind interest in the task. Discourse phenomena are so intricate that the accuracy of implicit relation analysis, a core problem of discourse parsing, has so far remained below 50 percent. This is why we say that "discourse interpreting is in its infancy", and it is the first problem this project takes on. The situation for Chinese is worse than for English: there is no annotated discourse corpus of adequate size, which makes the relevant research difficult or even impossible. One reason is that annotating discourse resources is complicated and time-consuming. In this project, we explore a shortcut for corpus building: annotating a Chinese discourse corpus by language projection. Whether the resulting corpus is treated as a seed or the proposed method as a framework for discourse annotation, we believe the results of this project will smooth the way for constructing large-scale discourse corpora (not just for Chinese) and hence for broader study in this area.
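After decades of development, natural language processing has moved from characters, words, and phrases to sentences, and has naturally and inevitably reached the level of discourse. Today, with statistical natural language processing and corpus linguistics prevailing and with the release of the Penn Discourse TreeBank, researchers have begun using machine learning methods to explain discourse structure by labeling discourse relations, setting off a wave of work on discourse structure analysis. However, because discourse problems are complex, the accuracy of implicit relation recognition, the core of discourse relation analysis, has not exceeded 50%. This is the clearest evidence that discourse analysis is still at an early stage, and this project first targets that problem. For Chinese, the biggest problem at present is the lack of a large-scale discourse corpus, which severely constrains research on and applications of Chinese discourse. Annotating a discourse corpus is also undoubtedly difficult, time-consuming, and labor-intensive. In this project, we hope to use a Chinese-English bilingual parallel treebank: by analyzing discourse on the English side, we obtain discourse relation labels for the Chinese side. Whether the resulting Chinese discourse corpus is treated as a seed corpus or the approach as a framework for discourse annotation, it will offer a convenient route to building large-scale discourse corpora for Chinese (and even other languages) in the future.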
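The projection idea can be illustrated with a minimal sketch. It assumes that each English discourse relation is stored as a sense label plus two argument spans given as inclusive token indices, and that a word alignment between the English and Chinese sides is available as (English index, Chinese index) pairs. The names `Relation`, `project_span`, and `project_relation` are hypothetical and chosen for illustration only; this is not the project's actual implementation, whose details are not given here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Span = Tuple[int, int]             # inclusive token indices (start, end)
Alignment = List[Tuple[int, int]]  # (english_index, chinese_index) pairs


@dataclass(frozen=True)
class Relation:
    """A PDTB-style relation: a sense label and two argument spans."""
    sense: str
    arg1: Span
    arg2: Span


def project_span(span: Span, alignment: Alignment) -> Optional[Span]:
    """Map an English span to the smallest Chinese span covering its aligned tokens."""
    start, end = span
    targets = [zh for en, zh in alignment if start <= en <= end]
    if not targets:
        return None  # nothing aligned: the span cannot be projected
    return (min(targets), max(targets))


def project_relation(rel: Relation, alignment: Alignment) -> Optional[Relation]:
    """Carry the sense label over and project both arguments; give up if either fails."""
    arg1 = project_span(rel.arg1, alignment)
    arg2 = project_span(rel.arg2, alignment)
    if arg1 is None or arg2 is None:
        return None
    return Relation(rel.sense, arg1, arg2)


# Toy example (invented data): six English tokens aligned to six Chinese tokens,
# with tokens 3 and 4 swapped in the Chinese word order.
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 3), (5, 5)]
english_relation = Relation("Contingency.Cause", (0, 2), (3, 5))
chinese_relation = project_relation(english_relation, alignment)
```

In this sketch a relation is dropped whenever one of its arguments has no aligned tokens, which favors cleaner projected labels over coverage; a real pipeline would likely need additional filtering or manual correction.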
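Exploiting discourse context is one of the bottlenecks of natural language understanding. This project studied discourse structure analysis under the PDTB scheme and built end-to-end discourse structure analysis platforms for English and Chinese. For two difficult problems, argument boundary identification and implicit relation recognition, it proposed a unified analysis method based on semantic dependency that brings argument boundary identification and relation recognition, as well as explicit and implicit relation recognition, under a single analysis model; on this basis, new discourse structure analysis frameworks for English and Chinese were designed and implemented. It also proposed a word-embedding-based method for discourse relation analysis, which improved English implicit relation analysis. To address the scarcity of annotated Chinese discourse structure data, it proposed a projection-based method for building a discourse corpus from English-Chinese bilingual data and used it to annotate 320 Chinese documents. The project also carried out research on applying discourse context to machine translation, on maximal noun phrase identification, and on Chinese spelling error detection and correction, achieving good results in domestic and international evaluations. During the project, 11 academic papers were published or accepted, one invention patent application was filed, and one invention patent was granted.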
Data last updated: 2023-05-31
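Quantifying the Comparability of Bilingual Documents Based on Cross-Lingual Topic Vectors
Multi-Level Joint Analysis of Chinese Discourse Structure Oriented to Discourse Informativeness
Theory and Methods of Discourse-Level Chinese Semantic Analysis
Chinese Discourse Structure Based on Generalized Topics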