In biomedical research, effective variable selection methods are often used to extract key information from data. Previous research shows that, among classical variable selection models, best-subset selection based on information criteria includes or excludes variables according to threshold values that are to some degree subjective, and its results are vulnerable to random error. The LASSO (Least Absolute Shrinkage and Selection Operator), a representative coefficient shrinkage and variable selection model, tends to over-select variables. This project aims to build two improved variable selection models based on the traditional LASSO, the Bootstrap ranking LASSO and the Two-stage hybrid LASSO, to reduce the false positive rate when filtering variables and improve overall variable screening performance. Using Monte Carlo simulation and empirical analysis, we will systematically compare the two proposed models with existing variable selection methods. In addition, to improve the prediction accuracy and stability of the traditional LASSO regression model, we will combine ensemble prediction with multi-index optimization evaluation to construct a novel ensemble LASSO regression model. Finally, the proposed methods will be applied to dengue surveillance data from Guangdong to identify factors related to dengue epidemics and to establish an accurate predictive model of dengue. The results of the empirical analysis will be used to refine the models.
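Biomedical research often relies on effective variable selection methods to uncover key information in data. Studies have shown that, among classical statistical models for variable selection, the threshold criteria used by information-criterion-based best-subset selection are subjective and the selection results are easily affected by random error, while coefficient shrinkage and estimation methods, represented by the LASSO, suffer from over-selection. Building on the traditional LASSO variable selection method, this study will construct two improved models, Bootstrap ranking LASSO and Two-stage hybrid LASSO, to lower the false positive rate of variable screening and overcome the LASSO's tendency to over-select variables, and will systematically compare and evaluate the improved models against existing methods through Monte Carlo simulation and empirical analysis. In addition, to address the prediction instability of regression models built with the LASSO, this project will use model ensemble methods and a multi-index optimization evaluation strategy to build an ensemble LASSO regression model with improved prediction accuracy and stability. Finally, the methods will be applied to identify factors influencing dengue epidemics in Guangdong Province and to build a predictive model, and the models will be revised in light of the empirical results.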
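The traditional LASSO penalized regression model has shortcomings in variable selection and in the accuracy and robustness of its predictions. This study aimed to improve on the traditional approach using strategies such as ensemble statistical modeling, and to build dengue prediction models suited to dengue surveillance data characterized by markedly different epidemic sizes between years and by zero inflation. We developed the Bootstrap ranking LASSO variable selection method to mitigate the inflated false positive rate of LASSO variable screening. To assess the predictive performance of the traditional LASSO regression model, we comprehensively compared and evaluated it against several machine learning algorithms (support vector regression, generalized additive models, boosted regression trees, and others), and on that basis settled on ensemble statistical modeling as the improvement strategy. We then used model ensemble methods and a multi-index optimization evaluation strategy to improve the traditional LASSO regression model, constructing the ensemble penalized regression algorithm (EPRA). This framework integrates the respective strengths of the traditional penalized regression algorithms (LASSO, Ridge, Elastic Net, SCAD, and MCP). Our analyses showed that the EPRA framework predicts dengue with high accuracy and robustness. To further incorporate meteorological, mosquito vector, socioeconomic, and other factors and improve the accuracy of predictions of dengue epidemics and transmission, the project integrated the ensemble statistical modeling strategy with infectious disease dynamic models, building an ensemble prediction framework that combines Kalman filtering with a dynamic model. The results show that the method can accurately predict the peak number of cases in a dengue epidemic season and the timing of that peak, which supports its validity. The methodology proposed in this study helps to build accurate dynamic prediction models of dengue epidemics and transmission, which should strengthen dengue surveillance, support the implementation of prevention and control programs, and ultimately contribute to better prevention and control of dengue epidemics and outbreaks.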
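The abstracts name the Bootstrap ranking LASSO but do not spell out its steps. The sketch below illustrates only the general idea such a method builds on: refit a LASSO on bootstrap resamples, record how often each variable receives a nonzero coefficient, and keep the variables that rank highest by selection frequency. It is not the project's algorithm. The function names, the fixed penalty `lam`, the 0.9 frequency threshold, and the synthetic data are all assumptions made for illustration, and the plain coordinate-descent solver (no intercept, no standardization) stands in for whatever fitting routine the authors used.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Fit a LASSO by cyclic coordinate descent (illustrative, no intercept).

    Minimizes (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.
    X is a list of rows; y is a list of responses.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, starting from beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Covariance of column j with the partial residual (variable j removed)
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


def bootstrap_selection_frequency(X, y, lam, n_boot=40, seed=0):
    """Fraction of bootstrap resamples in which each variable is selected."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample rows with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [c / n_boot for c in counts]


def rank_variables(freq, threshold=0.9):
    """Indices with selection frequency >= threshold, most frequent first."""
    order = sorted(range(len(freq)), key=lambda j: -freq[j])
    return [j for j in order if freq[j] >= threshold]


def make_demo_data(n=80, p=6, seed=1):
    """Synthetic data (hypothetical): only variables 0 and 1 affect y."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [3.0 * row[0] - 3.0 * row[1] + rng.gauss(0.0, 1.0) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    freq = bootstrap_selection_frequency(X, y, lam=0.3)
    print("selection frequencies:", freq)
    print("variables kept:", rank_variables(freq))
```

Ranking by selection frequency rather than trusting a single fit is one way to address the over-selection concern raised above, since a noise variable that enters one LASSO fit by chance is unlikely to enter most of the resampled fits. In practice the penalty would typically be tuned, for example by cross-validation, rather than fixed.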
Data last updated: 2023-05-31