Extracting the intrinsic features of protein-coding gene sequences is the core of gene prediction algorithms and one of the most important foundations of genome annotation. As genome research has advanced, a growing number of studies indicate that the description of the specific features of protein-coding genes is still incomplete and that the predictions produced by different gene-finding methods vary greatly, which has led to serious errors in protein-coding gene annotation in microbial genomes. As such errors accumulate in public databases they can propagate, greatly reducing the value of those databases and even leading to false scientific conclusions. Improving the quality of protein-coding gene annotation is therefore an important task. To address this problem, the main work of this project includes:
(1) Recent research shows that some genomes with different G+C content exhibit highly universal properties of protein-coding genes. To reveal the biological mechanisms behind these common properties, we design a filtering algorithm and construct a data set of genomes with highly shared protein-coding properties, on which we perform gene composition analysis. To display the universal sequence features of protein-coding genes, we propose graphical representations and statistical methods that capture specific sequence features such as polynucleotide composition, sequence order, and long-range correlations across multiple sites.
(2) Based on comprehensive mining of the common sequence features of protein-coding genes, we will outline universal algorithms for detecting missing genes. In addition, by combining the physicochemical properties of amino acids with graphical representation, we will develop novel approaches for protein sequence analysis, from which we can derive efficient feature parameters for designing function prediction algorithms for hypothetical genes.
(3) To objectively evaluate the efficiency and reliability of the reannotation algorithms proposed in this project, we perform genome reannotation on several model microorganisms and use molecular biology experiments and bioinformatics resources to analyze the predictions. The project can thus provide practical methods for improving the annotation quality of microbial genomes.
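As a rough illustration of the kind of composition features mentioned in item (1), the sketch below computes overall G+C content, k-mer (polynucleotide) frequencies, and G+C content at each codon position. It is a minimal example and not the project's own algorithm: the function names are hypothetical, and it assumes the input is a DNA string whose reading frame starts at the first base.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def kmer_frequencies(seq: str, k: int = 3) -> dict:
    """Relative frequency of each overlapping k-mer, skipping ambiguous bases."""
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def codon_position_gc(seq: str) -> list:
    """G+C fraction at codon positions 1, 2 and 3.

    Assumption: the reading frame starts at index 0 of the sequence.
    """
    seq = seq.upper()
    result = []
    for pos in range(3):
        bases = [b for b in seq[pos::3] if b in VALID_BASES]
        gc = sum(b in "GC" for b in bases)
        result.append(gc / len(bases) if bases else 0.0)
    return result
```

Features like these are often compared between annotated genes and candidate open reading frames. Which features the project actually uses is not specified in the abstract.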
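Mining the informative content of protein-coding gene sequences is the core of gene prediction algorithms and the foundation of genome annotation. Recent studies show that current descriptions of gene sequence information remain incomplete and that different gene prediction algorithms produce substantially different results, causing misannotated genes to accumulate in microbial genomes and undermining both the quality of databases and the accuracy of research results. To address this problem, the project focuses on: 1. Taking as a starting point the observation that some genomes with different G+C content show highly shared features of protein-coding genes, we design a genome screening algorithm, construct a data set, and develop geometric models and mathematical-statistical methods that effectively characterize important features such as sequence oligomer composition, arrangement, and multi-site long-range correlations, in order to reveal the biological mechanisms and information-encoding rules behind the shared gene features and to offer new ideas for future gene prediction. 2. Building on thorough mining of shared gene information, and integrating geometric analysis models with the physicochemical properties of amino acids for protein sequences, we extract feature parameters and develop accurate, reliable algorithms for predicting under-annotated genes and gene function. 3. We apply the reannotation algorithms to important model microorganisms and use molecular biology experiments to validate the reliability of the predictions, providing strong technical support for improving genome annotation quality.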
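Rapidly developing genome sequencing technology has brought great opportunities and challenges to the life sciences. More than 90% of the genomes sequenced to date are prokaryotic, so accurate annotation of prokaryotic genomes has become an important topic in the life sciences. Addressing the misannotation of prokaryotic genomes that is widespread in current databases, this project studied in depth the similarities and differences in protein-coding gene sequence features across different types of prokaryotic genomes. It focused systematically on the shared and distinct sequence evolution features of protein-coding genes among large chromosomes, small chromosomes, and plasmids in different types of prokaryotes, and on the composition of multi-copy genes and the underlying biological mechanisms. On this basis, we developed a series of new methods for nucleic acid and protein sequence analysis and feature mining to quantitatively characterize intrinsic features, such as composition and arrangement, of protein-coding genes and their corresponding protein sequences, and then proposed a complete set of reannotation algorithms for protein-coding genes in prokaryotic genomes. The project also combined theoretical prediction with experimental research, carrying out molecular biology experiments and transcriptome sequencing. It discovered a set of new genes in several important bacterial genomes and, making full use of bioinformatics tools, completed a preliminary functional analysis of these genes and their products from several perspectives. The project can provide reliable theoretical and methodological support for future genome analysis and annotation.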
Data last updated: 2023-05-31
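Genome-wide association study of leaf orientation value in maize
The DeoR-family transcription factor PsrB regulates prodigiosin synthesis in Serratia marcescens
CFRP strengthening of rib-to-deck fatigue cracks in orthotropic steel bridge decks
Hardware Trojans: research progress on key issues and new trends
Impact of environmental NIMBY facilities on housing prices in Beijing: the case of large waste treatment facilities
A microbial genome annotation platform related to gene patent information
Proteogenomics-based genome reannotation of Aspergillus flavus and global identification of protein post-translational modifications
Proteogenomic study of yeast genome annotation based on ORF-level quality control
Genome-wide selection for growth traits in Simmental cattle integrating genome annotation information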