In the era of big data, researchers tend to collect as many variables as possible, and these high-dimensional variables are usually aggregated from multiple sources with potentially different data-generating schemes. This raises the risk of selection bias and measurement error, which induce correlation between the variables and the error term, a phenomenon known in economics as endogeneity. When endogeneity is present, most existing high-dimensional feature selection methods are no longer valid. This project investigates feature selection for the high-dimensional linear interaction regression model under endogeneity. The interaction model contains the usual main-effect features as well as the interaction-effect features that cannot be ignored in practical problems. Characterizing the impact of endogeneity in depth and introducing suitable instrumental variables is the first key step of the project, and because instruments are introduced, the selection procedure necessarily proceeds in two stages. First, the project will propose a new two-stage penalized likelihood variable selection procedure for a linear model with main effects only. Second, because main effects and interaction effects differ greatly in role and in number, the two types of variables will be treated separately so that the procedure extends to high-dimensional interaction models. Finally, the extended two-stage penalized likelihood procedure will be applied to and studied on real data from the biomedical sciences.
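For concreteness, one way to write down an endogenous linear interaction model with instruments is sketched below. The notation ($\beta_j$, $\gamma_{jk}$, $\theta_j$, $z_i$, $u_{ij}$) and the linear first stage are illustrative assumptions, not taken from the project text.

```latex
% Outcome equation: main effects plus two-way interactions (assumed notation)
y_i = \sum_{j=1}^{p} \beta_j x_{ij}
      + \sum_{1 \le j < k \le p} \gamma_{jk}\, x_{ij} x_{ik}
      + \varepsilon_i,
\qquad \mathbb{E}[\varepsilon_i \mid x_i] \neq 0 \quad \text{(endogeneity)}

% First stage: covariates explained by instruments z_i (assumed linear form)
x_{ij} = z_i^{\top} \theta_j + u_{ij},
\qquad \mathbb{E}[\varepsilon_i \mid z_i] = 0, \quad \mathbb{E}[u_{ij} \mid z_i] = 0
```

Under this sketch, endogeneity arises when $u_{ij}$ and $\varepsilon_i$ are correlated, while the instruments $z_i$ remain uncorrelated with the outcome error.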
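High-dimensional variable selection means selecting the important variables and discarding the redundant ones in a feature space whose dimension is far larger than the sample size; it is an effective way to extract information in the era of big data. Compared with traditional data analysis, high-dimensional feature selection not only carries a heavy computational burden but is also prone to noise accumulation, spurious correlation, and endogeneity. Many classical penalized likelihood methods account for noise accumulation and spurious correlation in variable selection but not for endogeneity. This project showed that classical penalized likelihood methods are inconsistent when endogeneity is present. To remove the effect of endogeneity, instrumental variables were introduced, and different penalty functions were chosen for the two stages (estimating the explanatory variables, then substituting the fitted values into the original model for feature selection) according to the different emphasis of each stage. This yielded a new two-stage penalized likelihood method, TSPL, whose consistency was proved. In related work, an asymptotic formula was obtained for the number of solutions of linear equations in several prime variables. The project also studied the screening of cancer-related genes: under generalized linear models, a sequential feature selection algorithm, SRA, was constructed using the profile marginal score function as the selection criterion, with theoretical justification and an application to biomarker screening. Finally, the project studied how to handle the different types of effect features in the two-way interaction setting: at each step one candidate main effect and one candidate interaction effect are selected and a choice is then made between them, and the asymptotic consistency of the corresponding model selection criterion was proved.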
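The two-stage idea described above (first estimate the explanatory variables from the instruments, then plug the fitted values into the original model and select features) can be sketched as follows. This is a minimal illustration, not the project's TSPL method: it assumes centered data, uses an L1 (lasso) penalty in both stages with separate tuning parameters `lam1` and `lam2` in place of the stage-specific penalties mentioned in the text, and covers main effects only. The function names and the toy data generator `simulate` are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by coordinate descent.

    X is a list of n rows with p entries each; data are assumed centered.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta at beta = 0
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # inner product of column j with the partial residual, scaled by 1/n
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def two_stage_lasso(Z, X, y, lam1, lam2):
    """Two-stage penalised selection with instruments Z (illustrative sketch).

    Stage 1: regress each covariate on the instruments and keep fitted values.
    Stage 2: lasso of y on the fitted covariates; nonzero entries are selected.
    """
    n, p = len(X), len(X[0])
    fitted_cols = []
    for j in range(p):
        xj = [X[i][j] for i in range(n)]
        gamma = lasso_cd(Z, xj, lam1)
        fitted_cols.append([sum(z * g for z, g in zip(Z[i], gamma)) for i in range(n)])
    X_hat = [[fitted_cols[j][i] for j in range(p)] for i in range(n)]
    beta = lasso_cd(X_hat, y, lam2)
    return beta, X_hat


def simulate(n=200, p=6, seed=0):
    """Toy endogenous design (assumed setup): x_j = z_j + 0.8*u, and the
    outcome error contains the same u, so X is correlated with the error."""
    rng = random.Random(seed)
    Z, X, y = [], [], []
    for _ in range(n):
        u = rng.gauss(0, 1)  # shared shock that creates endogeneity
        z = [rng.gauss(0, 1) for _ in range(p)]
        x = [zj + 0.8 * u for zj in z]
        Z.append(z)
        X.append(x)
        y.append(2.0 * x[0] - 1.5 * x[1] + u + 0.1 * rng.gauss(0, 1))
    return Z, X, y
```

For example, `beta, X_hat = two_stage_lasso(*simulate(), lam1=0.05, lam2=0.1)` returns the stage-two coefficients on the toy data; the indices of the nonzero entries of `beta` are the selected features. The tuning values here are arbitrary and would normally be chosen by cross-validation or an information criterion.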
Data last updated: 2023-05-31