With the rapid development of information technology, the volume of data generated across all sectors grows daily. These data also exhibit new characteristics: high dimensionality, complex structure, and large sample sizes with few labeled samples. This project studies semi-supervised manifold learning algorithms, and their mathematical theory, for high-dimensional data with many samples but few labels. We draw on ideas and methods from statistical learning theory, robust statistics, and approximation theory. First, we study the consistency and convergence rates of a distributed semi-supervised manifold least squares regression algorithm, in which distributed learning effectively reduces computational cost. Second, we investigate several notions of statistical robustness of semi-supervised manifold algorithms under data perturbation. Robustness is another important property of machine learning algorithms; it describes whether an algorithm's performance remains stable when the data are disturbed. In particular, we will study the approximation properties and robustness of deep learning algorithms associated with hierarchical Gaussian kernels, providing some theoretical basis for deep learning. Finally, we will consider a family of large-margin classifiers, namely large-margin unified machines (LUMs), which are designed specifically for high-dimensional classification data. The mathematical analysis will be conducted within the framework of statistical learning theory. This research can provide theoretical guidance for data science practitioners and strengthen the mathematical foundations of machine learning algorithms. It may also suggest new theoretical problems in mathematics and inform the design of new algorithms.
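With the rapid development of information technology, the volume of data produced across all sectors grows daily and shows many new features, such as high dimensionality, complex structure, and very large sample sizes with few labeled samples. Drawing on ideas and methods from statistical learning theory, robust statistics, approximation theory, and related fields, this project studies semi-supervised manifold learning algorithms, and their mathematical theory, for analyzing high-dimensional data with few labeled samples. First, we study the consistency and convergence rates of a distributed semi-supervised manifold least squares regression algorithm. Second, we study several notions of statistical robustness of semi-supervised manifold algorithms when the data are perturbed; in particular, we study the approximation properties and robustness of deep learning algorithms based on hierarchical Gaussian kernels, providing some theoretical basis for deep learning. Finally, we study a class of classification algorithms based on the Large-Margin Unified Machine that is suited to high-dimensional classification data, and establish their mathematical theory within the framework of statistical learning theory. This research can provide theoretical guidance for data science practitioners, strengthen the mathematical foundations of machine learning algorithms, and offer leads for posing new problems and designing new algorithms.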
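To make the first research item concrete, the following is a minimal sketch of one way a distributed semi-supervised manifold least squares regression could be organized: each machine fits a Laplacian regularized least squares estimator on its own labeled and unlabeled data, and the local predictions are averaged. This divide-and-conquer scheme, the Gaussian kernel, the dense Gaussian-weight graph, and all names and parameter values (`fit_laprls`, `distributed_laprls_predict`, `gamma_A`, `gamma_I`, `sigma`, `sigma_graph`) are illustrative assumptions, not details taken from the project.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_laprls(X_lab, y_lab, X_unl, gamma_A, gamma_I, sigma, sigma_graph):
    """Laplacian regularized least squares on one machine (assumed local solver).

    Solves (J K + gamma_A * l * I + gamma_I * l / (l + u)^2 * L K) alpha = Y,
    where l and u are the numbers of labeled and unlabeled points.
    """
    l, u = len(X_lab), len(X_unl)
    n = l + u
    X = np.vstack([X_lab, X_unl])
    K = gaussian_kernel(X, X, sigma)          # RKHS kernel matrix
    W = gaussian_kernel(X, X, sigma_graph)    # graph weights (assumed dense graph)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W            # unnormalized graph Laplacian
    J = np.diag(np.r_[np.ones(l), np.zeros(u)])  # selects the labeled points
    Y = np.r_[y_lab, np.zeros(u)]             # labels padded with zeros
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / n**2) * (L @ K)
    alpha = np.linalg.solve(A, Y)
    return X, alpha

def distributed_laprls_predict(parts, X_test, **params):
    """Average the predictions of the local estimators, one per data subset."""
    preds = []
    for X_lab, y_lab, X_unl in parts:
        X, alpha = fit_laprls(X_lab, y_lab, X_unl, **params)
        preds.append(gaussian_kernel(X_test, X, params["sigma"]) @ alpha)
    return np.mean(preds, axis=0)

# Synthetic demo: 3 machines, each with 20 labeled and 200 unlabeled points.
rng = np.random.default_rng(0)

def make_part(l=20, u=200):
    X_lab = rng.uniform(0.0, 1.0, (l, 1))
    y_lab = np.sin(2 * np.pi * X_lab[:, 0]) + 0.1 * rng.standard_normal(l)
    X_unl = rng.uniform(0.0, 1.0, (u, 1))
    return X_lab, y_lab, X_unl

parts = [make_part() for _ in range(3)]
X_test = np.linspace(0.0, 1.0, 50)[:, None]
pred = distributed_laprls_predict(
    parts, X_test, gamma_A=1e-3, gamma_I=1e-2, sigma=0.2, sigma_graph=0.05
)
mse = np.mean((pred - np.sin(2 * np.pi * X_test[:, 0])) ** 2)
print("test MSE:", mse)
```

Each machine only solves a linear system of size l + u for its own subset, which is where the reduction in computational cost comes from; the consistency and convergence-rate questions named in the abstract concern how well the averaged estimator approximates the regression function.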
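Set against the backdrop of big data, and motivated by applications involving high-dimensional small-sample data, unlabeled data, and massive data sets, this project used ideas and methods from approximation theory, advanced probability theory, statistics, Fourier analysis, and Hilbert space theory to analyze mathematically three groups of methods: distributed multi-penalty regularized pairwise learning algorithms and their semi-supervised counterparts, functional linear regression, and large-margin classification algorithms for high-dimensional small-sample data. For the first time, two-sample kernels and operator theory were used to study the distributed multi-penalty regularized pairwise learning algorithm, yielding learning rates that are optimal in the minimax sense. The semi-supervised distributed multi-penalty regularized pairwise learning algorithm was also studied, and unlabeled data were shown to improve the performance of the distributed algorithm. Functional linear regression algorithms based on gradient iteration, on the Huber loss, and on distributed learning were proposed for the first time, and an error theory was established for each. For the first time, the convergence theory of large-margin classification algorithms for high-dimensional small-sample data was studied systematically within the framework of statistical learning theory, revealing the quantitative relationship between large-margin loss functions and the 0-1 loss and giving fast learning rates under two sampling processes: independent and identically distributed sampling, and sampling that is neither independent nor identically distributed. These results further strengthen the mathematical foundations of the algorithms above, help practitioners in applied fields understand the underlying principles of learning algorithms, provide reliable mathematical analysis and interpretation for applications of the algorithms in various fields, and offer leads for designing new algorithms.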
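For reference, the large-margin unified machine (LUM) loss family named in the project abstract is usually written as below. It is given here only as an illustration of what a "large-margin loss" looks like next to the 0-1 loss; that the results above concern exactly this family is an assumption, and the symbols a, c, u, and V are notation introduced for this sketch.

```latex
% LUM loss with parameters a > 0 and c >= 0, evaluated at the margin u = y f(x)
V_{a,c}(u) =
\begin{cases}
1 - u, & u < \dfrac{c}{1+c}, \\[6pt]
\dfrac{1}{1+c}\left(\dfrac{a}{(1+c)\,u - c + a}\right)^{a}, & u \ge \dfrac{c}{1+c}.
\end{cases}
```

The two pieces agree at u = c/(1+c), where both equal 1/(1+c), and the loss tends to the hinge loss max(0, 1 - u) as c grows. Since V is at least 1 whenever u is at most 0, it dominates the 0-1 loss, which is the starting point for bounding the misclassification risk by the risk under the large-margin loss.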
Data last updated: 2023-05-31
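Research on semi-supervised classification based on sparse representation and manifold theory
Research on nonlinear fault diagnosis methods based on semi-supervised manifold learning
Semi-supervised manifold learning methods and their application to environment perception
Research on the theory and algorithms of semi-supervised learning to rank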