Chinese and Portuguese are the official languages of Macao S.A.R. and the main languages connecting China with Portuguese-speaking countries, which makes research on Chinese-Portuguese machine translation (MT) highly significant. Compared with Chinese-English MT, Chinese-Portuguese MT is challenging in several respects. First, Chinese-Portuguese is a low-resource language pair from the MT point of view; in particular, it lacks a large parallel corpus. Second, unlike an isolating language such as Chinese, Portuguese is morphologically rich. This asymmetry between the two languages causes lexical sparsity, which degrades translation quality. Third, current Chinese-Portuguese MT systems adapt poorly across domains, so quality drops when translating documents outside the training domain. To address these issues, this research project aims to use large monolingual (comparable) data and graph propagation techniques to improve Chinese-Portuguese MT. The main content of the research includes: 1) research on a methodology for constructing a Chinese-Portuguese parallel corpus based on a graph label propagation model and big data technologies; 2) investigation of a model for inducing Portuguese morphological rules (or features), based on a graph model constructed over large monolingual data, and research on an MT model for morphologically rich languages; 3) study of an unsupervised feature extraction model that derives domain-specific features from a large parallel corpus based on a similarity graph, and the proposal of a cross-domain MT model; and 4) based on these research results, implementation of a cross-domain Chinese-Portuguese MT platform, and release of the constructed resources and the platform to the academic research community.
We believe the results of this research will be of great significance for advancing Chinese-Portuguese MT technologies, and of great value for developing MT systems for language pairs in similar situations.
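As the official languages of Macao, Chinese and Portuguese are an important link between China and Portuguese-speaking countries, so research on Chinese-Portuguese machine translation has considerable scientific and social significance. Compared with Chinese-English, Chinese-Portuguese machine translation research faces several major problems: 1) language resources for the pair are very scarce, especially large-scale parallel corpora; 2) Chinese and Portuguese exhibit severe morphological asymmetry, since Portuguese has complex morphological variation, which causes data sparsity and harms translation quality; 3) domain adaptability is poor, and translation quality drops markedly across domains. This project therefore plans to carry out innovative research on these problems: 1) study methods for constructing Chinese-Portuguese bilingual corpus resources based on graph models and the domain features of big data; 2) study algorithms for learning Portuguese morphological information based on graph models and big data, together with translation models for complex morphology; 3) study methods for learning domain information based on graph models and large-scale corpora, together with cross-domain machine translation models; 4) based on the above results, build a cross-domain Chinese-Portuguese machine translation platform and share the resources and results with the academic community. The project will make an important contribution to Chinese-Portuguese machine translation research and provide a reference for machine translation research on other similar language pairs, and it has important scientific significance and application value.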
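This project studied Chinese-Portuguese machine translation. It addressed the data sparsity problems and challenges that arise in developing Chinese-Portuguese machine translation systems from the scarcity of training data, the morphological asymmetry between the two languages, and their very different syntactic structures, especially in cross-domain translation. The project conducted research in three areas: 1) automatically learning and acquiring Chinese-Portuguese parallel data from comparable monolingual corpora; 2) building translation models based on morphological features and syntactic structure, in response to the complex morphology of Portuguese and the very different syntactic structures of Chinese and Portuguese; 3) improving the translation efficiency of cross-domain machine translation from the perspective of model architecture and algorithms. On this basis, a cross-domain Chinese-Portuguese translation platform was built. The research provides a theoretical and technical foundation for machine translation. Related papers were published in top international journals and conferences, and the results offer a reference for future work on low-resource machine translation. In addition, the constructed Chinese-Portuguese parallel corpus resources and the developed Chinese-Portuguese machine translation platform are open for free testing by the academic community, enterprises, and the public.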
Data last updated: 2023-05-31
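Spatiotemporal evolution characteristics and driving factors of water resource utilization in the Yellow River Basin
Channel allocation strategies for low-Earth-orbit satellite communications
Spatiotemporal structure and tectonic evolution of the Shiquanhe-Laguoco-Yongzhu-Jiali ophiolitic mélange belt on the Tibetan Plateau
Linear complexity of a class of quaternary generalized cyclotomic sequences with period 2p^2 over F_q
Application of collaborative-representation-based graph embedding discriminant analysis to face recognition
Research on Chinese-Vietnamese neural machine translation methods for comparable corpora
Patient similarity learning methods and applications for cross-domain heterogeneous data
Theory, methods, and implementation of machine translation oriented to multi-level discourse semantics
Text domain identification for machine translation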