Building on a collection of published agricultural high-throughput data and on analysis of high-throughput sequencing and microarray data, the proposed project aims to develop novel feature selection algorithms that incorporate Gene Ontology (GO), pathway, and Quantitative Trait Locus (QTL) information, guided by known mechanistic knowledge.
The main challenge of systems biology is data integration. The combination of few samples and many features in high-throughput data makes classical techniques inefficient, and the extremely small sample sizes typical of agricultural high-throughput data make the feature selection problem harder still. To merge biological information from different stages and aspects of biological processes, we propose a novel multi-stage, multi-agent ensemble algorithm that combines multiple algorithms and then uses appropriate scoring methods to choose consensus results. The project will first present, within this ensemble framework, a novel algorithm based on Support Vector Machine Recursive Feature Elimination (SVM-RFE). This algorithm not only takes into account the kernel width obtained from SVM training, but also adds weights to features that lie in the same pathway. These algorithms are designed for the small sample sizes characteristic of agricultural high-throughput data, and are intended to select, from massive and heterogeneous data sets, the lists or sets of genes, RNAs, and proteins most closely related to specific agricultural traits.
The proposed project also studies mixed-data modeling methods that combine feature selection with network information, and then constructs complex molecular networks linking agricultural genes and agricultural traits. Traditional QTL data carry abundant genetic and trait information, which can help validate the modeling results.
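The pathway-weighted SVM-RFE idea described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a linear SVM trained by plain subgradient descent (so no kernel width is involved), and it assumes that pathway knowledge enters as a per-feature multiplier on the usual squared-weight ranking criterion. The names `train_linear_svm`, `pathway_weighted_rfe`, and `pathway_weight` are hypothetical.

```python
def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularised hinge loss.

    X is a list of sample rows, y a list of labels in {-1, +1}.
    Returns the weight vector and the bias.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # sample violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def pathway_weighted_rfe(X, y, pathway_weight, n_keep):
    """Recursive feature elimination with a pathway-weighted criterion.

    pathway_weight[j] is an assumed multiplier (> 1 for features that share
    a pathway of interest, 1 otherwise). Each round retrains the SVM on the
    surviving features and drops the one with the smallest
    pathway_weight[j] * w_j ** 2.
    Returns (kept feature indices, eliminated indices in removal order).
    """
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        X_active = [[row[j] for j in active] for row in X]
        w, _ = train_linear_svm(X_active, y)
        scores = [pathway_weight[j] * wj * wj for j, wj in zip(active, w)]
        worst = min(range(len(active)), key=scores.__getitem__)
        eliminated.append(active.pop(worst))
    return active, eliminated
```

With two equally informative features, only the pathway multiplier separates them, which is the effect the sketch is meant to show; a kernelised variant would need a different ranking criterion.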
To evaluate the influence of QTL abundance, the project uses the hypergeometric distribution to assess QTL information across the whole genome. The project will then introduce and model known QTL data to assess and validate the constructed molecular networks. The project will propose and develop a pipeline and platform for agricultural high-throughput expression data. With it, researchers can carry out mechanism analysis for a specific agricultural trait, construct the related molecular network, and assess the constructed mechanism network against multiple data sources. By combining and analyzing mechanistic information and high-throughput data, the project could reveal molecular genetic and metabolic mechanisms related to specific agricultural traits, and could provide examples for molecular breeding design and for the molecular genetic improvement of complex traits in complex genomes.
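One common way to use the hypergeometric distribution for this kind of assessment is an over-representation test: given a genome of N genes, K of which fall inside known QTL intervals, how surprising is it that k of the n selected genes fall inside QTL intervals? The sketch below assumes that framing; the abstract does not specify the exact test, and the function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(genome_size, qtl_genes, selected, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    genome_size: total number of genes considered (N)
    qtl_genes:   genes located inside known QTL intervals (K)
    selected:    genes returned by feature selection (n)
    overlap:     selected genes that lie inside QTL intervals (k)
    """
    if not 0 <= qtl_genes <= genome_size:
        raise ValueError("qtl_genes must be between 0 and genome_size")
    if not 0 <= selected <= genome_size:
        raise ValueError("selected must be between 0 and genome_size")
    if not 0 <= overlap <= min(qtl_genes, selected):
        raise ValueError("overlap must be between 0 and min(qtl_genes, selected)")

    total = comb(genome_size, selected)        # all ways to draw the selection
    upper = min(qtl_genes, selected)
    tail = sum(
        comb(qtl_genes, k) * comb(genome_size - qtl_genes, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```

For a toy genome of 10 genes with 3 inside QTL intervals, selecting 4 genes of which 2 overlap gives (3·21 + 1·7) / 210 = 1/3, so that overlap would not count as enrichment. A small p-value would instead suggest that the selected genes are concentrated in QTL regions.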
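Building on a collection of existing crop high-throughput expression profile data, and through analysis of high-throughput sequencing and microarray data, this project develops feature selection algorithms that integrate multiple kinds of information such as GO, pathway, and QTL, targeting the small sample sizes characteristic of crop high-throughput data. Using the proposed multi-stage, multi-agent ensemble learning method to integrate information from different biological processes and stages, it selects from massive, high-dimensional, heterogeneous data the sets of genes, RNAs, and proteins most relevant to specific agronomic traits. It further studies mixed-data coupled modeling methods that combine feature selection with network information, constructs complex molecular networks relating crop genes to agronomic traits, and, by introducing and modeling known QTL data, evaluates and validates the network construction results by multiple means. The project will propose and develop a pipeline and platform based on crop high-throughput expression profile data, with which researchers can analyze the mechanisms of specific agronomic traits, construct the related molecular networks, and evaluate the constructed mechanism networks against multiple data sources. By combining and analyzing mechanisms and data, the project aims to reveal the molecular genetic and metabolic mechanisms associated with the corresponding agronomic traits, and to provide examples for molecular breeding design and for the molecular genetic improvement of complex traits in complex genomes.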
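In gene expression profile data processing, the project team successively proposed an overall feature selection method based on maximum relevance minimum redundancy (mRMR) for paired microarray expression data, a gene feature selection method based on sample localization, a local SVM-RFE feature selection method for gene expression profile data, and a multi-stage feature selection algorithm for microarray data. Extensive numerical experiments show that the proposed feature selection methods for expression profile data are more efficient than existing conventional methods. In addition, for time-series gene expression profile data, the team proposed a hybrid clustering model based on multi-resolution shape. Based on gene chip data from 12 crops, the team also built PMTED, a plant miRNA target gene expression database.
In crop molecular network construction and analysis, the team built a transcriptome-based model of the salt tolerance mechanism in rice, proposed an algorithm for identifying key proteins in protein interaction networks, constructed a model that identifies essential proteins by ranking edge weights in protein-protein interaction networks, and built a more efficient and robust network of microRNA and target gene interactions.
In genomic data research, the team proposed a phylogenetic tree inference method based on conserved gene clusters, presented a genomic island prediction method based on k-mer frequencies, and built an online service platform for lncRNA identification as well as an online service platform for alignment-free comparison of RNA secondary structures.
Completing the project involved a large number of biomedical text mining problems. To address them, the team proposed text mining and association network analysis algorithms for biomedical literature, studied biomedical named entity recognition algorithms, and built a text classification model that combines a deep belief network with a Softmax model.
At the level of algorithm theory, the team studied machine learning and pattern recognition algorithms, proposed a resampling ensemble classification algorithm for imbalanced data sets, constructed a single-data-cluster support vector model, proposed a decorrelated separable affinity propagation clustering algorithm, and proposed a hybrid algorithm based on extreme learning and sparse autoencoders.
The project published 29 academic papers in total, 18 of them SCI-indexed, applied for 1 invention patent, obtained 2 software copyrights, and received 1 second prize in natural science of the Higher Education Outstanding Scientific Research Achievement Awards. It organized 3 meetings of an international academic conference series and 4 Dragon Star Program courses, and trained 9 doctoral students and 12 master's students.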
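As an illustration of the maximum relevance minimum redundancy (mRMR) criterion named in the report, the following is a minimal greedy sketch. It is not the team's paired-microarray method: it assumes absolute Pearson correlation as a stand-in for the mutual information used in the usual mRMR formulation, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def mrmr_select(columns, labels, k):
    """Greedy mRMR: repeatedly pick the feature with the largest
    relevance minus mean redundancy against already selected features.

    columns: list of feature columns (one list of values per feature)
    labels:  numeric class labels, one per sample
    k:       number of features to select
    Returns the selected feature indices in selection order.
    """
    relevance = [abs(pearson(col, labels)) for col in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            abs(pearson(columns[j], columns[s])) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a scaled copy of an already selected gene is penalised by its redundancy, so a weaker but complementary gene is preferred over it.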
Data last updated: 2023-05-31
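On the influence of the big data environment on the development of information science
Regulatory asymmetry, the choice of earnings management mode, and the enforcement efficiency of the CSRC?
A survey of user alignment techniques across social networks
Hardware Trojans: research progress on key issues and new trends
Evaluation of passenger evacuation capacity in urban rail transit stations under fire conditions
Research on GPU-based feature selection strategies for gene expression profile data
Dynamic molecular network construction and key feature research based on high-throughput data from influenza virus infection of different species
Construction and analysis of a composite expression regulatory network for human liver cancer, and identification of key modules and molecules, based on parallel miRNA and mRNA expression profile data from cancer and adjacent tissue
Differential analysis of high-throughput expression profiles based on biological network modules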