Research on Several Important Problems in Missing Data Analysis

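Basic Information

Grant number: 11871460

Project type: General Program

Funding amount: 55.00

Principal investigator: Wang Qihua (王启华)

Discipline classification:

Host institution: Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Year approved: 2018

Year concluded: 2022

Project period: 2019-01-01 - 2022-12-31

Project status: Concluded

Project participants: Ding Xiaobo (丁晓波), Deng Jianqiu (邓涧秋), Sheng Ying (盛赢), Sun Yifan (孙逸帆), Zhang Jing (张敬), Huang Yunxiang (黄云翔), Wang Ruoyu (王若宇)

Keywords: incomplete data; non-ignorable missingness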

Final Report Abstract

In this project, we intend to investigate some significant and challenging problems with missing data. In the presence of missing covariates, we first extract model information and then incorporate it by developing a pseudo semi-empirical likelihood to improve inference. We develop a model selection method based on the conditional mean score estimator of the Kullback-Leibler (KL) distance with covariates missing at random. To calculate the conditional mean score, one needs to assume a parametric model for the conditional distribution of the missing covariates given the response variable and the observed covariates. In practice, however, this conditional parametric model is usually misspecified, which biases the conditional mean score estimator of the KL distance. Hence, we develop a bias-corrected method that reduces the bias of the mean score estimator of the KL distance to zero in probability, so that the model selection method based on the bias-corrected distance is selection consistent. With non-ignorable missing responses, we intend to develop a dimension reduction method and prove that the estimators of the central subspace are root-n consistent and that the estimators of its structural dimension are consistent. With class labels missing at random, we introduce a reproducing kernel Hilbert space (RKHS) and estimate the selection probability function directly by minimizing the expected squared error. We then extend the method to estimate the conditional probability function of the class label given the covariates using the inverse probability weighted approach. We construct a classifier based on this conditional probability and investigate its asymptotic properties. We intend to prove that the proposed method asymptotically attains the Bayes misclassification error rate under some reasonable conditions, and to derive the rates of convergence.
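A short worked calculation may help show where the bias in the mean score criterion comes from. The notation below is assumed for illustration and does not come from the project record: Y is the response, Z the always-observed covariates, X the covariates that may be missing, δ the indicator that X is observed, ℓ_θ the log-density of the candidate model, π(Y,Z) = P(δ = 1 | Y, Z), m_θ(Y,Z) = E[ℓ_θ(Y,X,Z) | Y, Z] under the true conditional law of X, and m̃_θ the same quantity computed under the working parametric model. The estimator written here is a generic mean score form, not necessarily the exact one used in the project. Under missingness at random, δ is independent of X given (Y, Z), so:

```latex
\begin{align*}
\hat D_n(\theta)
  &= -\frac{1}{n}\sum_{i=1}^{n}\Bigl\{\delta_i\,\ell_\theta(Y_i,X_i,Z_i)
     + (1-\delta_i)\,\tilde m_\theta(Y_i,Z_i)\Bigr\},\\
E\bigl[\hat D_n(\theta)\bigr]
  &= -E\Bigl[\pi(Y,Z)\,m_\theta(Y,Z) + \{1-\pi(Y,Z)\}\,\tilde m_\theta(Y,Z)\Bigr]\\
  &= -E\bigl[\ell_\theta(Y,X,Z)\bigr]
     - E\Bigl[\{1-\pi(Y,Z)\}\,\{\tilde m_\theta(Y,Z)-m_\theta(Y,Z)\}\Bigr].
\end{align*}
```

The first term is the KL distance up to a constant that does not depend on θ. The second term is the bias: it vanishes when the working model is correct (m̃_θ = m_θ) and otherwise can distort the ranking of candidate models, which is what a bias-corrected criterion has to remove.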

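This project studies several important and challenging problems that arise when data are missing. When covariates are missing, we extract the model information contained in the covariates and develop a pseudo semi-empirical likelihood that uses this information to improve inference. Also with missing covariates, we use the mean score estimator of the Kullback-Leibler distance as the distance criterion for model selection. In computing the mean score, however, the conditional distribution of the missing covariates given the response variable and the observed covariates is usually misspecified, which biases the mean score estimator of the distance. We therefore develop a corrected score estimation method for the distance criterion, so that model selection under the corrected criterion remains selection consistent. When the response variable is subject to non-ignorable missingness, we develop dimension reduction techniques and prove that the resulting estimators of the central subspace are root-n consistent and that the estimators of its dimension are consistent. When class labels are missing, we estimate the selection probability function in a reproducing kernel Hilbert space by a penalized integrated mean squared error method, and extend this method through inverse probability weighting to estimate the conditional probability of the class label given the covariates. This yields a classifier based on the estimated conditional probability function, and we prove that the conditional misclassification rate of the proposed method converges asymptotically to the Bayes conditional misclassification rate.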

Project Abstract

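Missing data are ubiquitous in practice: they arise routinely in opinion polls, mail questionnaire surveys, market research, economic and financial studies, medical research, and other scientific experiments. This project studies several important and challenging problems that arise when data are missing. When covariates are missing, we use the mean score estimator of the Kullback-Leibler distance as the distance criterion for model selection. In computing the mean score, however, the conditional distribution of the missing covariates given the response variable and the observed covariates is usually misspecified, which biases the mean score estimator of the distance. We therefore develop a corrected score estimation method for the distance criterion, so that model selection under the corrected criterion remains selection consistent. For ultrahigh-dimensional data under a non-ignorable missingness mechanism, we develop a model-free variable screening method. By borrowing information from the missingness indicator, it allows any variable screening method designed for complete data to be applied when the response variable is subject to non-ignorable missingness, while retaining the sure screening property that holds for complete data. When the response variable is subject to non-ignorable missingness, we develop dimension reduction techniques and prove that the resulting estimators of the central subspace are root-n consistent and that the estimators of its dimension are consistent. When class labels are missing, we estimate the selection probability function in a reproducing kernel Hilbert space by a penalized integrated mean squared error method, and extend this method through inverse probability weighting to estimate the conditional probability of the class label given the covariates. This yields a classifier based on the estimated conditional probability function, and we prove that the conditional misclassification rate of the proposed method converges asymptotically to the Bayes conditional misclassification rate.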
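The two-step idea for missing class labels (estimate the selection probability by penalized least squares in an RKHS, then reweight the labeled cases by its inverse) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the Gaussian kernel, the tuning values `lam` and `gamma`, and the clipping of the estimated selection probability are all choices made here for the example, and labels are assumed binary and missing at random given the covariates.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def fit_weighted_krr(K, target, weights, lam):
    """Minimise (1/n) sum_i w_i (t_i - f(x_i))^2 + lam * ||f||_H^2 over f = K @ alpha."""
    n = K.shape[0]
    # First-order condition: (W K + n*lam*I) alpha = W t, with W = diag(weights).
    return np.linalg.solve(weights[:, None] * K + n * lam * np.eye(n), weights * target)


def fit_ipw_classifier(X, y, delta, lam=1e-3, gamma=1.0, clip=0.05):
    """Fit an inverse-probability-weighted kernel classifier for binary labels.

    X     : (n, d) covariates, always observed
    y     : (n,) labels in {0, 1}; entries with delta == 0 are ignored (may be NaN)
    delta : (n,) 1 if the label is observed, 0 if it is missing
    """
    K = rbf_kernel(X, X, gamma)

    # Step 1: selection probability pi(x) = P(delta = 1 | X = x),
    # estimated by unweighted penalised least squares of delta on X.
    alpha_pi = fit_weighted_krr(K, delta.astype(float), np.ones(len(delta)), lam)
    pi_hat = np.clip(K @ alpha_pi, clip, 1.0)  # clipping keeps the weights bounded

    # Step 2: eta(x) = P(Y = 1 | X = x), estimated from labelled cases only,
    # each weighted by 1 / pi_hat to undo the selection effect.
    weights = delta / pi_hat
    y_filled = np.where(delta == 1, y, 0.0).astype(float)
    alpha_eta = fit_weighted_krr(K, y_filled, weights, lam)

    def classify(X_new):
        """Plug-in rule: predict class 1 when the estimated eta exceeds 1/2."""
        eta_hat = rbf_kernel(X_new, X, gamma) @ alpha_eta
        return (eta_hat > 0.5).astype(int)

    return classify, pi_hat
```

Calling `fit_ipw_classifier(X, y, delta)` returns the plug-in classifier together with the fitted selection probabilities at the training points. Thresholding the estimated conditional probability at 1/2 mirrors the Bayes rule, which is why consistency of the probability estimate is the natural route to a Bayes-rate result of the kind described above.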

Data last updated: 2023-05-31

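Other Grants by Wang Qihua (王启华)

Grant number: 11171331

Year approved: 2011

Funding amount: 40.00

Project type: General Program

Grant number: 11331011

Year approved: 2013

Funding amount: 240.00

Project type: Key Program

Grant number: 10241001

Year approved: 2002

Funding amount: 4.00

Project type: Special Fund Project

Grant number: 10671198

Year approved: 2006

Funding amount: 21.00

Project type: General Program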

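Similar NSFC Grants

Research on Several Important Problems in Machine Learning

Grant number: 60635030

Year approved: 2006

Principal investigator: Zhou Zhihua (周志华)

Discipline classification: F0305

Funding amount: 190.00

Project type: Key Program

Research on Several Important Problems in Stream Ciphers

Grant number: 60473028

Year approved: 2004

Principal investigator: Xiao Guozhen (肖国镇)

Discipline classification: F0206

Funding amount: 25.00

Project type: General Program

Research on Several Important Problems in Normal Families and Their Applications

Grant number: 10771076

Year approved: 2007

Principal investigator: Fang Mingliang (方明亮)

Discipline classification: A0201

Funding amount: 25.00

Project type: General Program

Theoretical Studies of Non-ignorable Missing Data and Their Applications

Grant number: 11871287

Year approved: 2018

Principal investigator: Wang Lei (王磊)

Discipline classification: A0402

Funding amount: 52.00

Project type: General Program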

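Other Grants by Wang Qihua (王启华)

Methods, Theory, and Applications of Dimension Reduction for High-Dimensional Data with Missing Data

Methods, Theory, and Applications of Statistical Analysis of Biomedical Data

Calibration Analysis of Measurement Error Regression Models with the Help of Validation Data

Methods, Theory, and Applications of Regression Analysis of Survival Data with Missing Covariates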
