Predicting tumor purity and ploidy is critical for the effective analysis of DNA variations and the detection of disease genes. The central problem is to infer, from the DNA signals of a tumor sample, the number of heterogeneous tumor cell populations, the fraction of each population, and its associated ploidy. Most existing approaches assume either that tumor cells are homogeneous or that they are heterogeneous with a known, fixed number of cell populations, and they ignore the correlation between tumor heterogeneity and tumor ploidy. Although these assumptions simplify computation, they do not hold in the more general situation, where the number of heterogeneous cell populations is unknown and uncertain and the degree of tumor heterogeneity varies from sample to sample. This seriously limits their predictive power. In this project, we study methods for accurately predicting the number of heterogeneous tumor cell populations, the fraction of each population, and its associated ploidy from next-generation DNA sequencing data, aiming at high prediction accuracy in the general situation. Building on in-depth exploration of sequencing-read features, extraction of valid observable somatic mutations and copy number alterations, and suitable clustering of these signals, we establish a model of the inherent relation between tumor purity and ploidy, guided by somatic copy number alterations. Based on this model, we construct a dynamic prediction scheme driven by the uncertainty of tumor heterogeneity: starting from the tumor cells as a whole, the tumor cell populations are progressively decomposed until a stable state is reached, and that stable state yields the answer: the number of heterogeneous tumor cell populations, the fraction of each population, and its associated ploidy.
The project will provide a new method and platform for tumor DNA variation analysis and cancer ploidy detection, and will further support the understanding of tumor evolution and the prediction of cancer.
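To make the coupling between purity and ploidy concrete, the sketch below uses the standard two-component mixture relation for read-depth ratios. It is not the project's own model, which the abstract does not spell out. The function names, the `normal_cn=2` default, and the single-tumor-population assumption are all illustrative assumptions.

```python
def expected_copy_ratio(purity, tumor_cn, tumor_ploidy, normal_cn=2):
    """Expected tumor/normal depth ratio of a segment (assumed mixture model).

    purity       -- fraction of tumor cells in the sample (0 < purity <= 1)
    tumor_cn     -- absolute copy number of the segment in tumor cells
    tumor_ploidy -- average copy number of the tumor genome
    """
    # Average copies of this segment per cell in the mixed sample.
    sample_cn = purity * tumor_cn + (1 - purity) * normal_cn
    # Average copies per cell over the whole genome in the mixed sample.
    sample_ploidy = purity * tumor_ploidy + (1 - purity) * normal_cn
    return sample_cn / sample_ploidy


def absolute_copy_number(ratio, purity, tumor_ploidy, normal_cn=2):
    """Invert the relation above to recover the tumor copy number."""
    sample_ploidy = purity * tumor_ploidy + (1 - purity) * normal_cn
    return (ratio * sample_ploidy - (1 - purity) * normal_cn) / purity


if __name__ == "__main__":
    r = expected_copy_ratio(purity=0.5, tumor_cn=4, tumor_ploidy=2)
    print(r, absolute_copy_number(r, purity=0.5, tumor_ploidy=2))
```

Because the observed ratio depends on purity and ploidy jointly, different (purity, ploidy) pairs can explain the same ratio on a single segment, which is one reason the two quantities are usually estimated together.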
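Predicting tumor purity and ploidy is key to effectively analyzing DNA variation and detecting disease-causing genes. The core problem is how to use DNA signals to infer the number of heterogeneous tumor components, their fractions, and their associated ploidy. Existing methods mostly assume that the tumor cell composition is homogeneous or that the number of components is fixed, and they do not consider the relationship between heterogeneous tumor components and ploidy. Although this simplifies computation, it does not match the general case, in which the number of tumor components is unknown and uncertain and the degree of heterogeneity differs between samples, and it greatly limits the ability to predict purity and ploidy for highly heterogeneous tumors. Using next-generation DNA sequencing data, this project studies high-accuracy prediction of heterogeneous tumor components and ploidy in the general case. On the basis of exploring sequencing-read features, extracting valid observable somatic mutations and copy number variations, and stratifying them appropriately, we establish a model relating tumor purity and ploidy that is guided by somatic copy number variation, and we construct a dynamic prediction method driven by the uncertainty of the heterogeneous components: starting from the tumor cells as a whole, the tumor cell population is decomposed repeatedly until a stable state is reached, which automatically yields accurate predictions of the number, fractions, and ploidy of the heterogeneous tumor components. This provides a new method and platform for tumor DNA variation analysis and cancer ploidy detection, and supports the understanding of tumor evolution and the prediction of cancer.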
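A sequenced tumor sample is a mixture of tumor cells and normal cells, and the tumor cells themselves are highly heterogeneous and inconsistent. Using such samples directly in tumor research without processing degrades the quality of genomic variation analysis and can even mislead the understanding of cancer mechanisms, the identification of biological targets for cancer, and the development of biological drugs. Purifying and de-heterogenizing tumor samples, so that the genomic information obtained reflects high-purity tumor components and ploidy, is therefore important for putting cancer mechanism research on an accurate footing from the outset. Based on next-generation DNA sequencing data, this project studies high-accuracy methods for predicting heterogeneous tumor components and ploidy. On the basis of exploring sequencing-read features, extracting valid observable somatic mutations and copy number variations, and stratifying them appropriately, we establish a model relating tumor purity and ploidy that is guided by somatic copy number variation, and we construct a dynamic prediction method driven by the uncertainty of the heterogeneous components: starting from the tumor cells as a whole, the tumor cell population is decomposed repeatedly until a stable state is reached, thereby achieving integrated prediction of heterogeneous tumor components and ploidy. The main results of the project are: (1) a comprehensive simulation theory, algorithm, and software platform for next-generation sequencing technology; (2) an algorithm for detecting copy number variation patterns across multiple samples in next-generation sequencing data; (3) a single-sample method for detecting tumor copy number variations and deletion types; (4) a method for detecting copy number variation and tumor purity based on isolation forest and total variation; (5) a dynamic-model-based method for inferring the purity and absolute copy number of tumor samples, which achieves integrated prediction of tumor purity and ploidy; (6) a platform covering the path from raw sequencing data to the detection of genomic variation types and the localization of targeted drugs. The results provide technical support for cancer mechanism research and have potential application value for the precise diagnosis of cancer.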
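The idea of decomposing the tumor cell population step by step until a stable state is reached can be illustrated with a toy example. The sketch below is not the project's algorithm: it swaps in 1D k-means with a BIC-style penalty as the stopping rule. The input (one estimated tumor-cell fraction per segment), the names `kmeans_1d` and `decompose`, and the `var_floor` noise floor are hypothetical choices made for illustration only.

```python
import math


def kmeans_1d(values, k, iters=100):
    """Lloyd's algorithm in 1D with deterministic quantile initialisation."""
    xs = sorted(values)
    n = len(xs)
    centers = [xs[int((i + 0.5) * n / k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[nearest].append(x)
        # An empty group keeps its previous center.
        updated = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    sse = sum(min((x - c) ** 2 for c in centers) for x in xs)
    return sorted(centers), sse


def decompose(values, max_k=6, var_floor=1e-4):
    """Start from one population and keep splitting while the score improves.

    The score is a BIC-style trade-off (assumed here, not from the source):
    fit term n*log(variance) plus a penalty of log(n) per population.
    """
    n = len(values)
    best = None
    for k in range(1, min(max_k, n) + 1):
        centers, sse = kmeans_1d(values, k)
        score = n * math.log(sse / n + var_floor) + k * math.log(n)
        if best is not None and score >= best[0]:
            break  # adding a population no longer helps: stable state
        best = (score, k, centers)
    return best[1], best[2]


if __name__ == "__main__":
    fractions = [0.29, 0.30, 0.31, 0.30, 0.69, 0.70, 0.71, 0.70]
    print(decompose(fractions))
```

The number of populations is an output of the procedure rather than an input, which is the property the dynamic scheme is designed to provide. A real method would also have to model ploidy and sequencing noise, which this toy omits.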
Data updated: 2023-05-31
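Prediction and analysis of cis-regulatory motifs based on next-generation sequencing data
Driver pathway discovery and integrative analysis methods based on next-generation tumor sequencing data
Several statistical methodology studies based on next-generation high-throughput sequencing data
Construction and analysis methods for complex-disease-specific co-regulatory networks based on next-generation sequencing data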