In this project, we intend to investigate several important and challenging problems arising from missing data. In the presence of missing covariates, we first extract the model information contained in the covariates and then incorporate this information by developing a pseudo semi-empirical likelihood to improve inference. We develop a model selection method based on the conditional mean score estimator of the Kullback-Leibler (KL) distance when covariates are missing at random. To compute the conditional mean score, one needs to assume a parametric model for the conditional distribution of the missing covariates given the response variable and the observed covariates. In practice, however, this conditional parametric model is often misspecified, which biases the conditional mean score estimator of the KL distance. We therefore develop a bias-corrected method that drives the bias of the mean score estimator of the KL distance to zero in probability, so that the model selection procedure based on the bias-corrected distance remains selection consistent. With nonignorably missing responses, we intend to develop a dimension reduction method and to prove that the estimators of the central subspace are root-n consistent and that the corresponding estimators of the structural dimension are consistent. With class labels missing at random, we introduce a reproducing kernel Hilbert space (RKHS) and estimate the selection probability function directly by minimizing the expected squared error. We then extend the method to estimate the conditional probability function of the class label given the covariates via an inverse probability weighted approach, construct a classifier based on this conditional probability, and investigate its asymptotic properties. We intend to prove that the proposed method asymptotically attains the Bayes misclassification error rate under reasonable conditions, and to obtain the corresponding rates of convergence.
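As a concrete anchor for the KL-distance model-selection criterion described above, the following is a minimal sketch in assumed notation, not taken from the project itself: Y is the response, Z the always-observed covariates, X the possibly missing covariates, \delta the observation indicator for X, f(y\mid x,z;\theta) a candidate conditional density, and h(x\mid y,z;\gamma) the working parametric model for the missing covariates given the response and the observed covariates. Up to constants not depending on the candidate model, the conditional mean score estimator of the KL distance can be written as

\[
\widehat{\mathrm{KL}} \;=\; -\frac{1}{n}\sum_{i=1}^{n}\Big\{\delta_i\,\log f\big(Y_i\mid X_i,Z_i;\hat{\theta}\big)
\;+\;(1-\delta_i)\,\widehat{E}\big[\log f\big(Y_i\mid X,Z_i;\hat{\theta}\big)\,\big|\,Y_i,Z_i\big]\Big\},
\]

where \widehat{E}[\,\cdot\mid Y_i,Z_i] denotes the conditional expectation computed under the fitted working model h(x\mid Y_i,Z_i;\hat{\gamma}). If h is misspecified, the imputed term, and hence \widehat{\mathrm{KL}}, is biased; the bias-corrected criterion is designed so that this bias vanishes in probability, which is what preserves selection consistency.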
This project studies several important and challenging problems arising from missing data. When covariates are missing, we extract the model information contained in the covariates and develop a pseudo semi-empirical likelihood to exploit this information and improve inference. When covariates are missing at random, we use the conditional mean score estimator of the Kullback-Leibler (KL) distance as the criterion for model selection; since the assumed conditional distribution of the missing covariates given the response and the observed covariates is usually misspecified, the mean score estimator of the distance is biased, so we develop a bias-corrected estimator of the distance criterion for model selection, under which the procedure remains selection consistent. With nonignorably missing responses, we develop dimension reduction techniques and prove that the estimated central subspace is root-n consistent and that its structural dimension estimator is consistent. With class labels missing at random, we estimate the selection probability function in a reproducing kernel Hilbert space by minimizing a penalized integrated mean squared error, extend this method via inverse probability weighting to estimate the conditional probability of the class label given the covariates, construct a classifier based on this estimated conditional probability, and prove that the conditional misclassification rate of the proposed method converges asymptotically to the Bayes conditional misclassification rate.
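To make the RKHS classification step concrete, here is a purely illustrative sketch, not the project's implementation: the penalized squared-error estimator of the selection probability is approximated by Gaussian-kernel ridge regression of the observation indicator on the covariates, and the conditional class probability is then estimated by inverse-probability-weighted kernel ridge regression on the labeled cases. All names, kernels, and tuning constants below are assumptions.

```python
# Illustrative two-step sketch: (1) estimate pi(x) = P(delta = 1 | X = x) with
# kernel ridge regression of the label-observation indicator on the covariates;
# (2) estimate p(x) = P(Y = 1 | X = x) by inverse-probability-weighted kernel
# ridge regression on the labeled cases; classify by thresholding at 1/2.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Simulated data: covariates X fully observed, class label Y missing at random given X.
n = 500
X = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(X[:, 0] - X[:, 1])))   # P(Y = 1 | X)
Y = rng.binomial(1, p_true)
pi_true = 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))      # P(delta = 1 | X): MAR labels
delta = rng.binomial(1, pi_true)                      # 1 = label observed

# Step 1: selection probability in an RKHS (Gaussian-kernel ridge regression),
# clipped away from zero so the inverse weights stay stable.
pi_fit = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-2).fit(X, delta)
pi_hat = np.clip(pi_fit.predict(X), 0.05, 1.0)

# Step 2: inverse-probability-weighted RKHS regression of Y on X using the
# labeled cases only; the weights 1/pi_hat undo the selection bias under MAR.
obs = delta == 1
p_fit = KernelRidge(kernel="rbf", gamma=0.5, alpha=1e-2).fit(
    X[obs], Y[obs], sample_weight=1.0 / pi_hat[obs]
)
p_hat = np.clip(p_fit.predict(X), 0.0, 1.0)

# Plug-in classifier: predict class 1 when the estimated conditional
# probability exceeds 1/2, mimicking the Bayes rule.
y_pred = (p_hat >= 0.5).astype(int)
print("misclassification rate:", np.mean(y_pred != Y))
```

Under labels missing at random, weighting the squared error of the labeled cases by 1/\hat{\pi}(X_i) recovers, in expectation, the full-data criterion, which is why thresholding the weighted estimate at 1/2 imitates the Bayes rule; this is the heuristic behind the asymptotic optimality claimed above.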
Missing data arise widely in practice, for example in opinion polls, mail surveys, market research, economic and financial studies, medical research, and other scientific experiments. This project studied several important and challenging problems arising from missing data. When covariates are missing at random, we used the conditional mean score estimator of the Kullback-Leibler (KL) distance as the criterion for model selection; since the assumed conditional distribution of the missing covariates given the response and the observed covariates is usually misspecified, the mean score estimator of the distance is biased, so we developed a bias-corrected estimator of the distance criterion for model selection, under which the procedure remains selection consistent. For ultrahigh-dimensional data under a non-ignorable missingness mechanism, we developed a model-free variable screening method: by borrowing the information in the missing indicator, any complete-data screening procedure can be applied to variable screening when the response is nonignorably missing, while retaining the sure screening property it enjoys with complete data. With nonignorably missing responses, we developed dimension reduction techniques and proved that the estimated central subspace is root-n consistent and that its structural dimension estimator is consistent. With class labels missing at random, we estimated the selection probability function in a reproducing kernel Hilbert space by minimizing a penalized integrated mean squared error, extended this method via inverse probability weighting to estimate the conditional probability of the class label given the covariates, constructed a classifier based on this estimated conditional probability, and proved that the conditional misclassification rate of the proposed method converges asymptotically to the Bayes conditional misclassification rate.
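The abstract above does not spell out how the missing indicator is combined with a complete-data screening statistic, so the sketch below is only an assumed illustration: the simplest complete-data utility (marginal Pearson correlation, as in sure independence screening) is applied with the missing indicator as a working response, which is one way its information can be borrowed when the response is nonignorably missing.

```python
# Assumed illustration of "borrowing the missing indicator" for screening:
# under nonignorable missingness, delta depends on the unobserved response,
# so covariates that drive the response also correlate with delta. A
# complete-data utility (marginal correlation) is therefore computed against
# delta instead of the partly missing response. This is not the project's
# actual statistic.
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 2000
X = rng.normal(size=(n, p))
Y = X[:, 0] + 0.8 * X[:, 1] + rng.normal(size=n)      # only X1 and X2 are active
delta = rng.binomial(1, 1.0 / (1.0 + np.exp(-Y)))     # missingness depends on Y itself

# Marginal |correlation| between each covariate and the missing indicator,
# then keep the top n / log(n) covariates (the usual screening cut-off).
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
dc = (delta - delta.mean()) / delta.std()
util = np.abs(Xc.T @ dc) / n
d = int(n / np.log(n))
keep = np.argsort(util)[::-1][:d]
print("active covariates retained:", {0, 1}.issubset(set(keep)))
```

Because the missingness probability is driven by the response, covariates that matter for the response also correlate with the indicator, so the active covariates tend to survive the screen in this toy example; the sure screening property claimed for the project refers to its own, unspecified, statistic.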