This proposal aims to develop methods for modeling sequencing bias and identifying differentially expressed genes in RNA-Seq datasets produced by next-generation sequencing. Unlike traditional pipelines, all analysis is performed at the level of individual nucleotide bases rather than exons or genes, in order to make full use of the high-resolution information in RNA-Seq data; current exon-level and gene-level processing can then be represented as an integral over nucleotide bases.
System identification techniques are introduced into RNA-Seq data analysis. The potential factors causing sequencing bias are treated as independent variables, and the observed read count at each nucleotide base is the response variable. The bias tendency of each single factor is evaluated with statistical sampling techniques to obtain the correct model structure. The complete bias model can be expressed as a linear or nonlinear model. A two-step scheme is proposed for optimization. The least squares method combined with weight functions, and the EM algorithm, are applied to estimate the undetermined parameters. Based on the corrected per-base read counts, regression, spline fitting, and the L2 error norm are integrated to estimate the significance of the difference between the read counts of the same nucleotide sequence under two conditions, and thereby to identify differentially expressed genes. A proper choice of the integration interval in the L2 error norm covers the current exon-level and gene-level processing methods, and the spline technique handles the discontinuous read distribution between different exons. With this approach, conditions without technical or biological replicates can be compared more accurately. Furthermore, the bias caused by gene length and sequencing depth can be effectively avoided.
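As a rough illustration of the weighted least squares step, the sketch below fits a single bias factor against per-base read counts and then removes the fitted effect. It is only a minimal sketch under stated assumptions: the factor (a local GC fraction), the straight-line model form, the weights, and the function names `weighted_linear_fit` and `correct_counts` are all hypothetical and are not taken from the proposal, which also allows nonlinear models and an EM step that are not shown here.

```python
def weighted_linear_fit(x, y, w):
    """Return (intercept, slope) minimizing sum_i w[i] * (y[i] - a - b * x[i]) ** 2."""
    total_w = sum(w)
    x_mean = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxx = sum(wi * (xi - x_mean) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_mean) * (yi - y_mean) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def correct_counts(x, y, slope, x_ref):
    """Subtract the fitted factor effect, measured relative to a reference value x_ref."""
    return [yi - slope * (xi - x_ref) for xi, yi in zip(x, y)]


# Hypothetical data: GC fraction near each base (factor) and observed read counts (response).
gc = [0.3, 0.4, 0.5, 0.6]
counts = [16.0, 18.0, 20.0, 22.0]
weights = [1.0, 2.0, 2.0, 1.0]  # assumed per-base weights

intercept, slope = weighted_linear_fit(gc, counts, weights)
corrected = correct_counts(gc, counts, slope, x_ref=0.5)
```

In this toy input the counts depend on the factor exactly linearly, so the corrected counts come out flat; real data would leave residual variation that carries the expression signal.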
With this approach, the identification results of the base-level, exon-level, and gene-level methods, together with the potential relationships and internal mechanisms among them, will be analyzed.
Overall, by introducing system identification techniques into RNA-Seq data analysis, novel modeling and optimization ideas are explored to make the most of the high-resolution information in RNA-Seq. On the biology side, we aim to obtain a valid and accurate sequencing-bias correction model and a method for identifying differentially expressed genes. On the informatics side, building on system identification, we seek to explore suitable research ideas for RNA-Seq and further bioinformatics analysis, in order to achieve modeling and optimization approaches that are effective, valid, and biologically interpretable.
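This project studies bias modeling and differentially expressed gene identification for high-throughput RNA-Seq data. The nucleotide base is treated as the basic unit of information processing, while exons, genes, and similar units can be viewed as some integral over base units. For bias analysis, the possible bias factors serve as explanatory variables and the observed number of short reads mapped at each base serves as the response variable; the trend of each factor's influence on the read distribution is obtained by sampling, which yields the correct model structure and allows model structures suited to different sequencing protocols and platforms to be built. A two-step optimization method is proposed, using a hybrid estimation method that combines weight coefficients with least squares, together with the EM algorithm, to optimize the bias weights of the resulting linear or nonlinear model and to correct the mapped read counts at each base position. Based on the corrected results, a base-level method for identifying differentially expressed genes is proposed. Using the positional correspondence of bases and the mapped read counts, linear fitting, spline regression, the L2 error norm, and related techniques are combined to assess the significance of differences in mapped counts along the base sequence under different conditions, and thus to identify differentially expressed genes. This line of work introduces ideas from system identification into statistical methods and, taking the base as the unit, makes full use of the high-resolution information in RNA-Seq data for downstream analysis.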
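The relationship between base-level and unit-level comparison can be sketched with a discrete L2 norm, where the integral over an interval becomes a sum over integer base positions. This is a simplified illustration only: the spline smoothing and significance estimation described in the abstract are deliberately left out, and the profiles and the function names `l2_difference` and `unit_total` are hypothetical.

```python
import math


def l2_difference(profile_a, profile_b, start, end):
    """Discrete L2 norm of the per-base difference over positions [start, end)."""
    pairs = zip(profile_a[start:end], profile_b[start:end])
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs))


def unit_total(profile, start, end):
    """Exon- or gene-level count as the sum (discrete integral) of per-base values."""
    return sum(profile[start:end])


# Hypothetical corrected per-base read counts for one region under two conditions.
condition_a = [4.0, 4.0, 0.0, 0.0]
condition_b = [2.0, 2.0, 2.0, 2.0]

total_a = unit_total(condition_a, 0, 4)
total_b = unit_total(condition_b, 0, 4)
distance = l2_difference(condition_a, condition_b, 0, 4)
```

Here the two unit-level totals are equal while the base-level distance is nonzero, which is the kind of difference a purely exon-level or gene-level count cannot see. Setting `start` and `end` to an exon's or a gene's boundaries corresponds to choosing the integration interval.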
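Next-generation high-throughput sequencing has attracted great attention. Using metagenomic and metatranscriptomic high-throughput sequencing data to compare differences between microbial communities has become an important scientific problem. These differences involve not only species abundance but also species composition. Based on high-throughput sequencing data, this project studies and explores methods for analyzing differences between samples, particularly between microbial communities; it establishes the following models and platforms and applies them to different types of high-throughput sequencing data:
① A statistical model and platform for sequence significance based on k-tuple frequencies, with a method for estimating frequency transition probabilities based on a fixed-order Markov model. It is alignment-free and requires no reference information on the species present or their genome sequences; differences between samples and communities are analyzed from the data alone. The model was applied to metatranscriptomic data from microbial communities in 99 marine water regions and to 16 metagenomic datasets, in order to analyze the degree of dissimilarity between communities and the influence of environmental gradients.
② An RNA-Seq-based evaluation model for genome annotation databases: based on the alignment of RNA-Seq reads to annotated reference sequences, specificity and sensitivity metrics are proposed at the gene, transcript, exon, splice-site, and base levels, and are used to assess the completeness and accuracy of genome annotation databases. Five representative human genome annotation databases were evaluated and a comprehensive, accurate human annotation database was constructed; in addition, evaluation of the existing rhesus macaque genome annotation database revealed shortcomings in its completeness and a gap between its annotation accuracy and that of the human databases. This evaluation framework enables comprehensive, fast, and efficient assessment and validation of genome annotation for any species, and provides a useful reference for choosing an appropriate annotation database for differential expression analysis.
③ A dynamic programming model for whole-genome annotation based on sequence alignment: exploiting the similarity between species, the genome annotation of an annotated species is used to annotate the genome of an unannotated species. Through sequence alignment, a dynamic programming model is built on alignment quality, the positional relationships of alignments, and their order and distance relationships. Without collecting reference databases or performing biological experiments, it quickly produces species annotation with sufficient accuracy and completeness, providing important reference information.
④ A preliminary analysis of the information significance of long k-tuples: earlier studies concentrated on tuples of 2-10 bp and mainly examined the overall statistical properties of the tuple distribution. Using long k-tuples (k ≥ 30), a preliminary exploration based on text-mining-style information clustering found advantages unique to long k-tuples.
The study reached the following conclusions:
① Statistical models based on 2-10 bp tuples measure the degree of difference between samples well, and for microbial communities they reflect the gradient of influence that the external environment has on the community.
② RNA-Seq high-throughput sequencing data can be used to verify whole-genome annotation quickly and effectively, providing reference information for evaluating and improving annotations.
③ As k-tuples become longer, the amount of information grows, which is a good exploration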
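The k-tuple frequency model in item ① above can be illustrated with a small alignment-free sketch: count k-tuples directly from a sequence, then compare an observed count with the count expected under a Markov model. The choice of an order-(k-2) model and the function names `count_ktuples` and `expected_count` are assumptions made for illustration; the summary only states that a fixed-order Markov model is used and does not give the order or the significance statistic.

```python
from collections import Counter


def count_ktuples(seq, k):
    """Count every overlapping k-tuple in seq; no alignment or reference is needed."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def expected_count(word, seq):
    """Expected count of word under an order-(k-2) Markov model, for k = len(word) >= 3.

    Uses the estimate N(prefix) * N(suffix) / N(middle), where prefix and suffix
    are the two (k-1)-tuples of word and middle is their shared (k-2)-tuple.
    """
    k = len(word)
    shorter = count_ktuples(seq, k - 1)
    middle = count_ktuples(seq, k - 2)[word[1:-1]]
    if middle == 0:
        return 0.0
    return shorter[word[:-1]] * shorter[word[1:]] / middle


# Hypothetical toy read; real input would be pooled reads from one sample.
read = "ACGTACGTAC"
observed = count_ktuples(read, 3)["ACG"]
expected = expected_count("ACG", read)
```

Comparing `observed` with `expected` for every k-tuple gives a per-sample profile, and two samples can then be compared through their profiles without aligning reads to any genome.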
Data updated: 2023-05-31
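Research on a rice root system modeling method based on fractal L-systems
The eddy covariance technique and its application in terrestrial ecosystem flux research
The DeoR-family transcription factor PsrB regulates prodigiosin synthesis in Serratia marcescens
Hardware Trojans: research progress on key issues and new trends
Combined transcriptomic and metabolic analysis of the mechanism of anthocyanin changes in red maple leaves
A new mechanism by which LncRNA RPL37AP1 promotes invasion and migration of gastric cardia adenocarcinoma by regulating the HNF4A/CEBPA/RPSA axis
Modeling of gene expression levels from high-throughput RNA-Seq sequencing data
Modeling for dynamic analysis of gene expression based on time-series RNA-Seq sequencing data
Study of gene expression differences among Dendrocalamus latiflorus of different ploidy levels based on RNA-Seq
Research on models and methods for constructing gene regulatory networks based on high-throughput data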