Given the recent rapid growth in the availability of extremely large datasets, the storage, access, and analysis of such data have become critical. Since data sets are often too large to load into the memory of a single machine, let alone to analyze statistically in one pass, the divide-and-conquer methodology has received significant attention. Conceptually, it involves distributing the data across multiple machines, carrying out standard statistical model fitting on each local machine separately to obtain multiple estimates of the same quantities or parameters of interest, and finally pooling these estimates into a single estimate on a central machine by a simple averaging step. For many models, this simple divide-and-conquer procedure can be shown theoretically to achieve the same estimation performance as analyzing the entire data set on a single machine; this is called the oracle property of the divide-and-conquer method. However, for high-dimensional models, where the number of parameters to estimate can exceed the number of observations, the situation is more complicated. In particular, naive averaging fails because of the bias introduced by the penalty used to make high-dimensional estimation feasible, and this bias propagates through the averaging step; debiasing is therefore critical before aggregation. In this proposal, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classification. The purpose of this study is to propose debiasing methods for these penalized models and to establish rigorously the optimal convergence rate, or even, in some cases, the asymptotic distribution, of the aggregated estimates. Once achieved, this will deepen our understanding of the divide-and-conquer strategy and significantly expand its applicability.
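To fix ideas, one standard construction from the linear-model literature, the de-sparsified (debiased) lasso, illustrates the debias-then-average recipe described above; the notation below (m machines, local data (X_k, y_k) with n_k observations, penalty level \lambda, and \hat{\Theta}_k an estimate of the inverse covariance matrix) is introduced only for this sketch, and the proposal's target models (partially linear, quantile, support vector) would require analogous but model-specific corrections:

\[
\hat{\beta}_k = \arg\min_{\beta}\; \frac{1}{2n_k}\,\lVert y_k - X_k\beta\rVert_2^2 + \lambda\,\lVert\beta\rVert_1,
\qquad
\hat{\beta}_k^{\,d} = \hat{\beta}_k + \frac{1}{n_k}\,\hat{\Theta}_k X_k^{\top}\bigl(y_k - X_k\hat{\beta}_k\bigr),
\qquad
\bar{\beta} = \frac{1}{m}\sum_{k=1}^{m}\hat{\beta}_k^{\,d}.
\]

The correction term \frac{1}{n_k}\hat{\Theta}_k X_k^{\top}(y_k - X_k\hat{\beta}_k) removes, to first order, the shrinkage bias of each local lasso fit, which is what allows the subsequent simple averaging to retain the single-machine (oracle) rate.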
Because large data sets are often too big to load into the memory of a single machine, the divide-and-conquer approach has received wide attention in recent years. That is, disjoint subsets of the data are assigned to multiple machines, a statistical model is fitted separately on each machine, and the resulting estimates are finally collected on a central machine and averaged. For many models, this simple divide-and-conquer procedure can theoretically achieve the same estimation performance as analyzing the full data set on a single machine, which is the so-called oracle property of the divide-and-conquer method. However, in high-dimensional models, where the number of parameters to be estimated may exceed the number of observations, the situation is more complicated. In particular, the penalty used for variable selection introduces estimation bias, so debiasing before aggregation is essential. In this project, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector machine classifiers. The goal of this study is to propose debiasing methods for these LASSO-penalized models, to rigorously establish optimal convergence rates, and to investigate their finite-sample properties through numerical simulations.
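As a minimal, hedged sketch of the same debias-then-average recipe in the simplest setting, the Python code below fits a lasso plus a node-wise-regression debiasing step on each data split and then averages the local estimates. It assumes a plain linear model and scikit-learn's Lasso, so it only illustrates the idea rather than the estimators this project develops for partially linear, quantile, or support-vector models; the function names (nodewise_precision, debiased_lasso, divide_and_conquer) are invented for this example.

import numpy as np
from sklearn.linear_model import Lasso


def nodewise_precision(X, lam):
    """Node-wise lasso estimate of the inverse covariance matrix Theta.

    Each column of X is regressed on the remaining columns with a lasso,
    following the standard de-sparsified-lasso construction.
    """
    n, p = X.shape
    C = np.eye(p)
    tau2 = np.empty(p)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, j])
        resid = X[:, j] - X[:, others] @ fit.coef_
        tau2[j] = X[:, j] @ resid / n          # tau_j^2 in the usual notation
        C[j, others] = -fit.coef_
    return C / tau2[:, None]                   # Theta_hat = diag(1/tau^2) @ C


def debiased_lasso(X, y, lam_beta, lam_node):
    """Debiased (de-sparsified) lasso fit on one machine's local data."""
    n, _ = X.shape
    beta = Lasso(alpha=lam_beta, fit_intercept=False).fit(X, y).coef_
    Theta = nodewise_precision(X, lam_node)
    # One-step bias correction: beta + Theta X'(y - X beta) / n
    return beta + Theta @ X.T @ (y - X @ beta) / n


def divide_and_conquer(splits, lam_beta, lam_node):
    """Average the debiased local estimates from all machines."""
    local = [debiased_lasso(X, y, lam_beta, lam_node) for X, y in splits]
    return np.mean(local, axis=0)


# Toy usage: n = 2000 simulated observations split over m = 10 machines, p = 50 features.
rng = np.random.default_rng(0)
n, p, m = 2000, 50, 10
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]               # sparse truth
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)
splits = list(zip(np.array_split(X, m), np.array_split(y, m)))
beta_bar = divide_and_conquer(splits, lam_beta=0.1, lam_node=0.1)

In a real distributed setting each machine would compute its debiased estimate locally and ship only the resulting p-vector to the central machine, which is what keeps the communication cost of the averaging step low.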
In this project, we studied the statistical properties of, and methods related to, divide-and-conquer strategies for several complex models, including partially linear models, quantile regression models, and nonparametric models. The statistical theory we developed clarifies several important theoretical aspects of these popular methods in big data analysis. In particular, we established the (often optimal) convergence rates of various estimators in different models under reasonable mathematical assumptions. Our results have been published in several top international journals.
Data last updated: 2023-05-31
Research on rice root system modeling methods based on fractal L-systems
Regulatory asymmetry, earnings management mode choice, and the enforcement efficiency of the China Securities Regulatory Commission
Nonlinear analysis and calculation methods for the at-rest earth pressure coefficient of coarse-grained soils
Study on the influence of dominant controlling factors on the semi-penetration depth of shaped-nose projectiles into metal targets
Prediction of urban domestic water demand based on a LASSO-SVMR model
Geometric structure analysis of high-dimensional data
Research on real-time monitoring strategies for complex high-dimensional data streams
Methods, theory, and applications of dimension reduction for high-dimensional data with missing values
Functional data analysis methods for high-dimensional data