With recent advances in digital technology, especially ubiquitous information-sensing mobile devices, datasets keep growing in size and “massive datasets” are becoming increasingly prevalent. Because of their memory requirements and computational complexity, variable selection methods designed for high-dimensional data (such as Lasso or SCAD) may be infeasible on such massive datasets. This project studies variable selection procedures and post-selection inference algorithms for massive datasets. First, building on streamwise regression, we will propose algorithms that test whether a candidate variable significantly improves the predictive performance of the current model. These algorithms account for the correlation between the candidate variable and the variables already selected without adding much computational burden. Second, we will study how the variable selection procedure affects parameter estimation and the predictive performance of the final model. We will then propose post-selection inference methods that account for the randomness of the variable selection procedure and therefore give more reliable results. We will study the proposed methods systematically through theoretical investigation and numerical experiments. The proposed methods will be scalable, easy to implement on parallel and distributed computing platforms, and well suited to practical problems involving massive datasets.
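In recent years, with advances in digital technology and in particular the widespread use of information-sensing mobile devices, data volumes have grown rapidly and “massive data” is becoming increasingly common. When applied to massive data, variable selection algorithms designed for high-dimensional data (for example, Lasso and SCAD) typically consume large amounts of memory and other computing resources and have high computational complexity. This project focuses on variable selection for regression models on massive data and on statistical inference after variable selection. First, building on streaming algorithms, we plan to propose an algorithm that quickly tests whether a candidate predictor significantly improves the model's predictive ability; the method accounts for the correlation between the candidate predictor and the predictors already selected without significantly increasing the computational cost. Second, we will study how the variable selection process affects parameter estimation and predictive stability, and propose a post-selection inference method that accounts for the randomness of the selection process and thereby makes inferential results more reliable. We will study these algorithms systematically through both numerical computation and theoretical analysis. The proposed algorithms scale well, are easy to implement on modern distributed systems, and can effectively address the difficulties of statistical analysis on massive data.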
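This project studied variable selection algorithms, the stability of variable selection, and parameter estimation and statistical inference for the resulting models. Over its three years it largely achieved its intended research goals. The project studied the stability of variable selection and proposed an index for evaluating it; this index can guide the choice of variable selection algorithm. Targeting the characteristics of massive data, the project also proposed a variable selection algorithm based on streamwise regression; numerical experiments show that the method has low computational complexity, runs fast, and selects variables accurately. For several nonlinear models, the project studied parameter estimation, statistical inference, and model selection in the presence of measurement error or heteroscedasticity, and investigated the asymptotic theory and numerical properties of the corresponding methods.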
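To make the streamwise idea concrete, the following is a minimal sketch and not the project's actual algorithm, whose details are not given here. It assumes a linear model with an intercept, a single pass over the candidate variables, and a large-sample normal approximation for the test statistic. Each candidate is first orthogonalized against the variables already selected, so the test measures only what the candidate adds beyond them. The names `streamwise_select`, `make_demo_data`, and the threshold `alpha` are hypothetical and chosen for illustration.

```python
import math
import random


def _dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def streamwise_select(columns, y, alpha=1e-3):
    """Scan candidate columns once; keep those that pass a partial z-test.

    Illustrative sketch only: assumes n is much larger than the number
    of selected variables, so a normal approximation is reasonable.
    """
    n = len(y)
    y_mean = sum(y) / n
    resid = [v - y_mean for v in y]  # residual of y under the current model
    basis = []                       # orthonormal basis of selected columns
    selected = []
    for j, col in enumerate(columns):
        df = n - len(basis) - 2      # intercept + selected + candidate
        if df <= 0:
            break
        c_mean = sum(col) / n
        z = [v - c_mean for v in col]
        norm0 = math.sqrt(_dot(z, z))
        # Remove the part of the candidate explained by selected variables.
        for q in basis:
            coef = _dot(q, z)
            z = [a - coef * b for a, b in zip(z, q)]
        norm = math.sqrt(_dot(z, z))
        if norm <= 1e-8 * norm0:
            continue                 # constant or collinear with the model
        z = [a / norm for a in z]
        gain = _dot(z, resid)        # gain**2 is the drop in RSS if added
        sigma2 = max((_dot(resid, resid) - gain ** 2) / df, 1e-300)
        t_stat = gain / math.sqrt(sigma2)
        # Two-sided p-value under the normal approximation.
        p_value = math.erfc(abs(t_stat) / math.sqrt(2.0))
        if p_value < alpha:
            basis.append(z)
            resid = [r - gain * a for r, a in zip(resid, z)]
            selected.append(j)
    return selected


def make_demo_data(n=400, seed=0):
    """Hypothetical toy data: columns 0 and 3 carry signal."""
    rng = random.Random(seed)
    x0 = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [2.0 * v + 1.0 for v in x0]          # exact linear copy of x0
    x2 = [rng.gauss(0, 1) for _ in range(n)]  # pure noise
    x3 = [rng.gauss(0, 1) for _ in range(n)]
    y = [2.0 * a - 1.5 * b + rng.gauss(0, 1) for a, b in zip(x0, x3)]
    return [x0, x1, x2, x3], y


if __name__ == "__main__":
    columns, y = make_demo_data()
    print(streamwise_select(columns, y))
```

In this sketch each candidate costs a number of inner products proportional to the size of the current model, and every quantity is a sum over observations, which is one reason a procedure of this kind could in principle be accumulated across data chunks on a distributed platform. Note that the result depends on the order in which candidates arrive: a redundant copy of a variable is dropped only if the original was seen first.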
Data last updated: 2023-05-31
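Statistical inference for multi-response regression models with high-dimensional data
Variable selection and statistical diagnostics for modal regression models with complex data
Statistical inference and variable selection for linear mixed-effects models with longitudinal data
Statistical inference for covariate-driven random coefficient autoregressive models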