Regularization, a technique for solving ill-posed problems and for preventing overfitting in parameter estimation when the number of features is large, has been widely applied in high-dimensional big data analysis. Several challenges remain in its application. First, feature structure is difficult to obtain, which affects the construction of the regularization penalty function and leads to unstable parameter estimation and feature selection. Second, a large number of features increases the computing time of parameter estimation. Third, dynamically changing features in high-dimensional data make the results of online learning algorithms unstable. This project addresses these problems as follows: ① we study how to identify feature structure in high-dimensional data using clustering algorithms and graphical models, and then use that structure to construct a regularization penalty function that improves the stability of parameter estimation and feature selection; ② we study how to partition features into classes according to their structure, so that the corresponding unknown parameters can be estimated in a distributed computing environment in less computing time; ③ we study how to identify feature structure dynamically and decide when to update the online learning algorithm according to changes in that structure, so as to improve the algorithm's stability. The research results will enrich the methods available for big data analysis, can be applied to big data from health care, e-commerce, internet finance, and other domains, and will help turn big data into big knowledge and big science.
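Regularization, as a technique for addressing the unstable parameter estimation caused by a large number of feature dimensions, has been widely used in research on high-dimensional big data. Problems remain in its application: ① feature structure that is hard to obtain affects the construction of the regularization constraint function, making model parameter estimation and feature selection unstable; ② a large number of features increases the computing time of estimation algorithms; ③ dynamically changing features reduce the stability of online learning algorithms. To address these problems, this project plans to: ① study feature structure identification methods based on clustering, graphical models, and related techniques, identify the feature structure of high-dimensional big data, and use it to construct the regularization constraint function, improving the stability of parameter estimation and feature selection in statistical models; ② study distributed parameter estimation algorithms for regularized statistical models with feature structure, improving estimation efficiency and reducing computing time; ③ study methods for identifying differences in feature structure, and use the changes in feature structure to determine when to update regularized online learning algorithms, improving their stability. The research results will enrich the methods of big data analysis, can be applied to big data in health care, e-commerce, internet finance, and similar areas, and will support the transition from big data to big knowledge and big science.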
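As an illustration of how a feature structure might enter a regularization penalty, the following minimal sketch combines an L1 term with a graph-Laplacian smoothness term over a feature graph. The abstract does not state the form of the penalty, so this particular form, the function names `graph_laplacian` and `structured_penalty`, and the parameters `lam1` and `lam2` are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: (feature i, feature j, non-negative edge weight)
Edge = Tuple[int, int, float]


def graph_laplacian(n_features: int, edges: List[Edge]) -> List[List[float]]:
    """Build the Laplacian L = D - W of an undirected feature graph."""
    lap = [[0.0] * n_features for _ in range(n_features)]
    for i, j, w in edges:
        lap[i][i] += w
        lap[j][j] += w
        lap[i][j] -= w
        lap[j][i] -= w
    return lap


def structured_penalty(beta: List[float], lap: List[List[float]],
                       lam1: float, lam2: float) -> float:
    """Assumed penalty: lam1 * ||beta||_1 + lam2 * beta^T L beta.

    The quadratic term equals the weighted sum of (beta_i - beta_j)^2 over
    the edges, so coefficients of linked features are pulled together.
    """
    n = len(beta)
    l1 = sum(abs(b) for b in beta)
    smooth = sum(beta[i] * lap[i][j] * beta[j]
                 for i in range(n) for j in range(n))
    return lam1 * l1 + lam2 * smooth


# Features 0 and 1 are linked (e.g. placed in the same cluster); feature 2 is isolated.
lap = graph_laplacian(3, [(0, 1, 1.0)])
print(structured_penalty([1.0, 1.0, 0.0], lap, lam1=0.5, lam2=1.0))   # 1.0
print(structured_penalty([1.0, -1.0, 0.0], lap, lam1=0.5, lam2=1.0))  # 5.0
```

In this sketch, coefficients that agree across a linked pair incur only the sparsity cost, while coefficients that disagree are penalized further. This is one way an identified feature structure could stabilize estimation.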
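This project aims to solve the statistical problems caused by the large number of features in high-dimensional big data analysis, mainly by using the structural relationships among features to improve the performance of statistical model estimation and computation. On this basis, the research team carried out its work starting from practical analysis problems. First, we studied techniques for analyzing multi-source heterogeneous big data and proposed a deep graph convolutional neural network that fuses multiple gene networks; the model performs well and can identify important genes that characterize a disease. Second, we proposed a regularization-based framework for identifying venture capital leaders, which uses the co-investment network to improve identification accuracy. Third, from a management perspective, we demonstrated the intrinsic link between the identified venture capital leaders and their performance, thereby confirming the practical value of the algorithm. In parallel, we proposed a graph neural network that incorporates the co-investment network to further improve the accuracy of venture capital leader identification. Finally, the team also studied feature structure identification and learning models with a given feature structure, work that will be refined step by step in the future.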
Data last updated: 2023-05-31
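A review of research on industrial structure succession and bifurcation from the perspective of evolutionary economic geography
Regulatory asymmetry, the choice of earnings management mode, and the enforcement efficiency of the CSRC?
Theme park evaluation based on public sentiment orientation: the case of Volga Manor in Harbin
A task scheduling method for cloud workflow security
Geographic identification and classification of multidimensional deprivation in the residential environment: the case of the main urban area of Zhengzhou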
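Sparse regularization methods for high-dimensional data and their applications
Mathematical theory of regularization in information-theoretic learning and related methods for high-dimensional data analysis
Robust statistical analysis of high-dimensional data and related problems
Regularized cross-validation methods for comparing models of text data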