Research on Unsupervised Statistical Machine Translation Models Based on Monolingual Corpora

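Basic information

Approval number: 61303181

Project category: Young Scientists Fund

Funding amount: 23.00

Principal investigator: 张家俊

Discipline classification:

Host institution: Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)

Approval year: 2013

Completion year: 2016

Duration: 2014-01-01 to 2016-12-31

Project status: Completed

Project participants: 翟飞飞, 涂眉, 李小青, 杨海彤, 黄国平, 向露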

Keywords:

monolingual corpus; phrase-based statistical machine translation; unsupervised learning

Final report abstract

At present, almost all statistical machine translation models are trained on bilingual parallel corpora. Given enough parallel text for a domain, existing statistical machine translation models achieve relatively satisfactory results. However, parallel corpora are very difficult to collect, so the quality of statistical machine translation drops sharply for a language pair or domain that lacks bilingual resources. In contrast, large-scale monolingual corpora exist on the web for most languages and domains and are easy to obtain. This project therefore aims to take full advantage of large-scale monolingual web data and to develop a phrase-based statistical machine translation model that uses only monolingual corpora. After automatically obtaining large-scale monolingual data in the same domain for the source and target languages, the project focuses on three problems: an unsupervised method for constructing a probabilistic bilingual lexicon from monolingual corpora, a method for learning bilingual phrase translation rules, and probability estimation methods for the translation model and the reordering model. By redesigning the construction process of the translation model, the project seeks to remove the dependence of statistical machine translation on bilingual parallel data and thereby broaden its applicability.

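Project abstract

Data-driven machine translation models depend heavily on the scale and quality of bilingual parallel corpora; the goal of this project is to overcome the constraint that such corpora place on machine translation. Toward this goal, the project team proposed and studied machine translation methods that combine a bilingual dictionary with large-scale monolingual data, methods for semantic representation and similarity measurement of source-language and target-language phrases and sentences, and machine translation models that incorporate large-scale source-language monolingual corpora. The team also built a practical multilingual machine translation platform and released a set of machine translation source code. In terms of academic output, the project produced 9 high-level academic papers and 2 national invention patents, and trained 4 doctoral students and 2 master's students.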

Project results

No results available for this item

Data updated: 2023-05-31

Other grants held by 张家俊

Approval number: 61673380

Approval year: 2016

Funding amount: 62.00

Project category: General Program

Similar NSFC grants

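Research on Chinese-English Discourse-Level Statistical Machine Translation Methods Incorporating Functional Discourse Analysis

Approval number: 61573294

Approval year: 2015

Principal investigator: 陈毅东

Discipline classification: F0606

Funding amount: 66.00

Project category: General Program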

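Research on Pivot-Language Statistical Machine Translation Based on Topic Models

Approval number: 61303082

Approval year: 2013

Principal investigator: 苏劲松

Discipline classification: F0211

Funding amount: 27.00

Project category: Young Scientists Fund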

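An Online English-Chinese Machine Translation System Based on Discourse Segment Processing

Approval number: 60083006

Approval year: 2000

Principal investigator: 姚天顺

Discipline classification: F0211

Funding amount: 13.00

Project category: Special Fund Project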

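Research on Statistical Machine Translation Based on Chinese-English Bidirectional Tree-to-String Models

Approval number: 60872118

Approval year: 2008

Principal investigator: 孙广范

Discipline classification: F0113

Funding amount: 29.00

Project category: General Program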

Other grants held by 张家俊

Research on Neural Machine Translation Models Based on Weak Supervision
