Biomedical big data (BBD) are generated from a variety of sources and span multiple layers, including individual-level exposure data, population-level environmental exposure information, high-resolution medical images, electronic health records, and data from high-throughput genomic platforms such as DNA sequencing, DNA methylation, and gene expression. Most previous studies focused on a dataset from a single layer and ignored the associations among the multiple layers of BBD. In this study, we aim to develop more effective statistical methods for integrating BBD, in order to improve understanding of these data and the insights they provide. The study will follow a four-step strategy: a) preliminary fast screening of risk factors; b) fine evaluation of risk factors; c) building a risk prediction model; d) validation in independent populations. To better understand the complex associations between these factors and cancer risk, we will propose an entropy-based weighted information gain (WIG) method to efficiently enrich for genes that carry main effects, interactions within a single layer, interactions across multiple layers, or interactions with the environment. A major advantage of the WIG method is that it incorporates prior biological information, such as molecular processes and regulatory relationships, into the subsequent analysis. We will also propose a Bayesian sequential method that integrates data across layers to better predict cancer risk. In addition, we will use an improved causal mediation analysis to explore potential causal pathways. The proposed methods will be applied to lung cancer and gastric cancer, and the risk factors and prediction models will be explored and validated in large-scale cohorts.
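Biomedical data come from a wide range of sources and span multiple layers, including individual-level and population-level environmental exposure, genetic variation, DNA methylation, and gene expression. Conventional studies usually analyze a single complete dataset from one layer and overlook the relationships among layers. This project will follow a "preliminary screening → secondary screening → fine modeling → population validation" analysis strategy and, drawing on big-data thinking, carry out an integrative analysis of population-based multi-layer biomedical data to explore the complex factors associated with common cancers such as lung cancer and gastric cancer, and to build risk prediction models with improved predictive accuracy. Taking full account of prior biological information such as the structural and regulatory relationships among layers, we will propose a weighted information entropy method to rapidly enrich for genes that carry main effects or within-layer or cross-layer gene-gene and gene-environment interaction information; propose a Bayesian sequential analysis method that integrates data layer by layer to screen predictors more efficiently; and improve the causal mediation analysis model to explore how, and how strongly, factors from multiple layers act. The methods developed will be applied on a trial basis to association analysis and risk prediction modeling for lung cancer and gastric cancer, and validated in large-scale population cohorts.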
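The entropy-based screening idea described above can be illustrated with a small sketch. The Python code below computes the plain information gain between a discrete feature (for example, a genotype) and a binary outcome, and scores an interaction by treating a pair of features as one joint feature. The function names and the `prior_weight` multiplier are illustrative assumptions; the abstract does not specify how WIG derives or applies its weights.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, outcome):
    """IG(outcome; feature) = H(outcome) - H(outcome | feature)."""
    n = len(outcome)
    groups = {}
    for x, y in zip(feature, outcome):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(outcome) - conditional


def weighted_information_gain(feature, outcome, prior_weight=1.0):
    # ASSUMPTION: a simple multiplicative prior weight stands in for the
    # biological prior information; the real WIG weighting is not given.
    return prior_weight * information_gain(feature, outcome)


# Toy data (made up): the outcome depends on gene_a XOR gene_b, so neither
# gene is informative alone, but the pair taken jointly is.
gene_a = [0, 0, 1, 1]
gene_b = [0, 1, 0, 1]
outcome = [0, 1, 1, 0]

ig_a = information_gain(gene_a, outcome)
ig_pair = information_gain(list(zip(gene_a, gene_b)), outcome)
wig_pair = weighted_information_gain(list(zip(gene_a, gene_b)), outcome, prior_weight=0.5)
```

In this toy case a main-effect screen would discard both genes, while the joint score retains the pair, which is the kind of interaction signal the proposal aims to enrich for.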
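Complex diseases arise from the combined effects of external environmental exposure and internal environmental imbalance. Searching for the causes of disease onset and progression across multiple dimensions, from the external to the internal, is key to disease prevention, diagnosis, and treatment, and is of scientific importance for achieving "Healthy China". Integrative analysis of multi-omics data can systematically and thoroughly identify disease-related biomarkers and reveal the complex association patterns that drive disease, including causal chains of disease, interactions between and within genes and the environment, and models for predicting disease risk and prognosis. However, features of multi-omics data such as block-wise missing structure, the curse of dimensionality, and complex association patterns pose major technical challenges for data mining. We therefore carried out methodological and clinical research on multi-omics data in five areas:
i. Missing-data handling. In real studies, multi-omics data typically have a block-wise missing structure. We proposed two kinds of solutions, "imputation" and "bridging". Compared with traditional methods, the TOBMI imputation algorithm we developed achieves high imputation accuracy and effectively preserves the structure of the original data. In addition, the two "bridging" algorithms, full-information maximum likelihood and pairwise deletion, also give more accurate estimates than traditional methods.
ii. Dimension-reduction strategy. High-dimensional multi-omics data have a low signal-to-noise ratio and take a long time to analyze. We proposed the ERB dimension-reduction strategy: extract features based on information entropy (Entropy); screen biomarkers by importance based on random forest (Random forest); and use prior information to screen important targets in large-scale parallel fashion based on Bayesian methods (Bayes). Simulation experiments and real-data studies show that this strategy effectively reduces data dimensionality and focuses attention on important markers.
iii. Fine mining. Complex diseases are driven by complex association patterns among factors. On the one hand, from the perspective of causal inference, we developed and applied Mendelian randomization and mediation analysis methods to control for unknown confounders, estimate true association effects, explore causal relationships, and identify pathogenic factors. On the other hand, from the perspective of interactions, we explored the complex association patterns between and within genes and the environment.
iv. Prediction models. Complex diseases are determined by factors at multiple levels, both macroscopic and microscopic. We integrated multi-dimensional indicators and, following the "preliminary screening → secondary screening → fine modeling → population validation" analysis strategy, built several high-accuracy models for predicting cancer prognosis.
v. Platform development. Five software copyrights were granted by the National Copyright Administration, and two interactive visualization platforms were developed, making the complex integrative analysis strategies and methods convenient to operate and easy to carry out.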
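To make the "bridging" idea under missing-data handling (item i) more concrete, here is a minimal sketch of pairwise deletion. It assumes the term carries its standard statistical meaning: each pairwise statistic is computed from the samples observed in both variables, without imputing anything. The function name, the use of `None` for missing values, and the toy data are illustrative assumptions, not the project's implementation.

```python
import math


def pairwise_correlation(x, y):
    """Pearson correlation using only samples observed (not None) in both x and y."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return float("nan")  # not enough overlapping samples
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")  # correlation undefined for a constant variable
    return sxy / math.sqrt(sxx * syy)


# Toy block-missing layout (made up): expression is measured on samples 0-3,
# methylation on samples 1-4, so only samples 1-3 contribute to the estimate.
expression = [1.0, 2.0, 3.0, 4.0, None, None]
methylation = [None, 2.0, 4.0, 6.0, 8.0, None]

r = pairwise_correlation(expression, methylation)
```

The appeal of this approach is that no sample is discarded entirely because one omics layer is missing; the trade-off is that different pairs of variables may be estimated from different subsets of samples.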
Data last updated: 2023-05-31
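Genome-wide association analysis of leaf orientation value in maize
On the impact of the big data environment on the development of information science
CFRP strengthening of rib-to-deck fatigue cracking in orthotropic steel bridge decks
Hardware Trojans: research progress on key issues and new trends
Prediction of urban domestic water demand based on the LASSO-SVMR model
Densification and phase transformation mechanisms of Ti-1.5Al-4.5Fe-6.8Mo alloy during hydrogen sintering and phase transformation (HSPT)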
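Quantitative assessment and prediction of population health risks under climate change using integrated models
Construction and application of a big-data-based risk prediction model for cardiovascular disease in the population
Drug combination prediction methods based on multi-source data integration
Disease gene prediction methods based on multi-omics data integration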