Model comparison is one of the most important problems in statistical machine learning. Previous studies have shown that statistical significance tests constructed on m repetitions of two-fold cross-validation (m×2 CV) outperform other tests on the model comparison problem. On text data sets, however, the variance of the m×2 cross-validated estimator of a model performance metric is easily affected by the random partitions. Concretely, the random data partitioning used in m×2 CV may produce training and validation sets whose distributions diverge substantially; this divergence inflates the variance of the m×2 cross-validated metric estimator and in turn leads to unreliable model comparison conclusions. This proposal therefore aims to construct a novel m×2 cross-validation, called "regularized m×2 cross-validation", that restricts the divergence between the distributions of the training set and the validation set. To this end, we first introduce several measures of this divergence and analyze their effect on the variance of the metric estimator and on the significance tests. We then define constraints (regularization conditions) on these measures that reduce the variance. Furthermore, we integrate the reasonable regularization conditions into an optimization model, from which we derive an efficient algorithm for constructing the partitions of regularized m×2 cross-validation. Finally, we validate the effectiveness of regularized m×2 cross-validation on the model comparison problem on text data sets for several popular natural language processing tasks.
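To make the setup concrete, the sketch below shows how an m×2 CV estimate can be computed while rejecting random partitions whose halves diverge too much. It is a minimal illustration, not the project's method: the divergence measure (a chi-square-style distance between class proportions, assuming integer class labels), the threshold `max_div`, and the user-supplied callable `train_and_score` are all assumptions introduced here for illustration.

```python
# Minimal sketch: m×2 cross-validation with a divergence constraint on the
# random partitions. The divergence measure and threshold are illustrative
# assumptions, not the measures studied in the project.
import numpy as np

def class_divergence(y_a, y_b, n_classes):
    """Chi-square-style distance between class proportions of the two halves."""
    p = np.bincount(y_a, minlength=n_classes) / len(y_a)
    q = np.bincount(y_b, minlength=n_classes) / len(y_b)
    return float(np.sum((p - q) ** 2 / (p + q + 1e-12)))

def regularized_mx2cv(X, y, train_and_score, m=3, max_div=1e-3,
                      seed=0, max_tries=10_000):
    """Mean and variance of the 2m scores from m accepted half/half splits."""
    rng = np.random.default_rng(seed)
    n, n_classes = len(y), int(y.max()) + 1
    scores, accepted = [], 0
    for _ in range(max_tries):
        if accepted == m:
            break
        idx = rng.permutation(n)
        a, b = idx[: n // 2], idx[n // 2:]
        # Regularization step: reject partitions whose halves diverge too much.
        if class_divergence(y[a], y[b], n_classes) > max_div:
            continue
        accepted += 1
        for tr, va in ((a, b), (b, a)):
            scores.append(train_and_score(X[tr], y[tr], X[va], y[va]))
    if accepted < m:
        raise ValueError("no m acceptable partitions found; relax max_div")
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.var(ddof=1)
```

In this rejection-sampling form the constraint merely filters random splits; the proposal instead formulates partition construction as an optimization problem over richer divergence measures, which is what the construction algorithm is meant to solve efficiently.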
When traditional cross-validation methods randomly partition text data, they easily produce training and validation sets with large distributional differences, which makes the estimates of model performance metrics unstable and renders model comparison conclusions unreliable. Focusing on text data, this project first derived the probability distributions of the regularized m×2 cross-validated estimators of performance metrics such as accuracy, precision, recall, and the F1 score, and, based on the Bayesian test, the sequential test, and the McNemar test, designed three novel and effective model comparison methods. Then, for the text chunking task, the project proposed ten candidate measures of distributional difference and, using a signal-to-noise-ratio selection criterion, screened out three effective measures with which to construct regularization conditions that constrain the distributional difference between training and validation sets. Furthermore, the project presented a construction algorithm for regularized m×2 cross-validation that incorporates these distributional regularization conditions. Finally, for the model comparison task, the project introduced a selection criterion for good cross-validation methods and clarified the relationship between the goodness of a cross-validation method and the signal-to-noise ratio of its performance metric estimator. On extensive simulated and real data, the project compared a variety of cross-validation-based model comparison methods and confirmed that regularized m×2 cross-validation effectively reduces false-positive results in model comparison and yields more reliable conclusions. Based on these results, the project published 3 SCI journal papers and 9 papers in Chinese core journals, and trained 3 graduate students.
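Of the three comparison methods, the McNemar-based one is the easiest to illustrate. The sketch below shows the standard continuity-corrected McNemar test, together with one common definition of the signal-to-noise ratio of a score-difference estimator (mean over standard deviation); both are textbook forms, used here only as stand-ins for the project's regularized-m×2-based tests and its exact SNR selection criterion.

```python
# Minimal sketch: standard continuity-corrected McNemar test for comparing two
# classifiers on a shared validation set, plus an illustrative SNR definition.
import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic and p-value (chi-square, df=1)."""
    n01 = int(np.sum((pred_a == y_true) & (pred_b != y_true)))  # A right, B wrong
    n10 = int(np.sum((pred_a != y_true) & (pred_b == y_true)))  # B right, A wrong
    if n01 + n10 == 0:  # no discordant predictions: no evidence of a difference
        return 0.0, 1.0
    stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    return stat, float(chi2.sf(stat, df=1))

def snr(score_diffs):
    """One common SNR convention for a cross-validated difference estimator."""
    d = np.asarray(score_diffs, dtype=float)
    return d.mean() / d.std(ddof=1)
```

The McNemar test looks only at the discordant counts n01 and n10: a large, lopsided discordant total signals a real performance gap. The SNR plays the analogous role for cross-validated score differences, with higher values indicating a comparison less likely to produce a false positive.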