High-throughput, cost-effective genome sequencing has led to the completion of over one thousand genome sequences, with thousands of additional species being sequenced, which lays the foundation for investigating the genetic blueprint of genomes and decoding the diversity of biology. However, current genome annotation depends mainly on gene structure prediction, DNA or RNA sequence analysis, or RNA sequencing. These approaches have difficulty identifying species-specific genes or genes with unusual structures, and they cannot accurately predict gene start sites or variable transcriptional and translational products, which leads to a considerable number of false or missing annotations. Proteins constitute the ultimate evidence for annotated coding genes. Proteogenomics, which combines proteomic data with genomic and transcriptomic data, directly measures peptides arising from expressed proteins. It therefore allows direct verification of annotated coding regions and also helps to correct over-predicted genes and identify missed genes. Like other omics approaches, current proteogenomics faces the challenges of high false-positive rates and insufficient validation procedures. In this study, taking the well-studied model organism yeast as an example, we focus on developing a proteogenomics method based on ORF-level quality control and a systematic validation strategy, in order to correctly and efficiently identify peptides of novel gene products from massive mass spectrometry data. By combining synthetic peptide validation, gene transcription evidence, prediction of gene structure and regulatory sequences, and phylogenetic analysis, we aim to build an automated, high-throughput workflow for novel gene identification. This workflow should improve the precision and accuracy of genome annotation, verify a set of novel genes or novel gene structures in yeast, achieve the re-annotation of the yeast genome, and provide new genetic material for biological research.
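The rapid development of sequencing technology places higher demands on genome annotation and functional genomics research. Existing genome annotation relies on annotation software and nucleic acid sequence information, which makes it difficult to discover species-specific genes or new gene families and to determine gene boundaries and translational regulation, resulting in omissions or errors in genome annotation. Proteins are the ultimate criterion for whether a coding gene exists. Proteogenomics can not only verify an annotated genome and correct existing annotations, but also discover genes missed by annotation and even species-specific novel genes, thereby achieving genome re-annotation. However, current proteogenomics methods suffer from high false-positive rates and insufficient evidence for predicted genes. Taking the eukaryotic model organism yeast as its subject, this study plans to use an ORF-level quality control strategy to correctly and efficiently identify peptides encoded by novel genes in massive mass spectrometry data. Combined with a complete gene validation framework of synthetic peptide validation, gene transcription, prediction of structure and regulatory sequences, and phylogenetic analysis, we will develop an automated, high-throughput workflow for novel gene discovery and validation, improve the precision and accuracy of genome annotation, identify a set of verifiable novel yeast genes or novel gene structures, achieve the re-annotation of the yeast genome, and provide entirely new genetic material for biological research.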
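With genome sequencing technology developing rapidly, proteogenomics, which uses proteomic results to further annotate and correct genomes, plays an increasingly important role in genome annotation. However, current proteogenomics suffers from high false-positive rates and insufficient evidence for predicted genes. In this study, taking the eukaryotic model organism yeast as the subject, we used a QE-HF mass spectrometer to acquire tandem mass spectrometry data for the whole yeast proteome. Searching against a six-frame translation database, we identified 4,652 annotated yeast genes supported by unique peptides, a new record. This added protein-level evidence for the newly identified genes, and we estimated that the yeast genome annotation is 99.7% complete, with no more than 20 novel genes remaining in unannotated regions. For the novel peptide candidates found by the conventional approach, we developed a more accurate method for estimating the FDR of novel peptides, as well as a systematic synthetic peptide validation method, and ultimately obtained 12 reasonably reliable novel peptides. We also developed two small-protein enrichment methods and successfully identified three novel small proteins, further supplementing the yeast genome annotation. We then combined RNA-seq, RT-PCR, phylogenetic analysis, and prediction of characteristic gene regulatory sequences to systematically validate the novel genes or novel phenomena corresponding to these novel peptides and to explain why they had been missed by annotation. In the end, six novel genes were validated (including the three small proteins); notably, one of them is a species-specific gene that could serve as a marker for species identification. In addition, we found that three annotated genes have novel N-termini and that another four annotated genes exhibit unusual translation phenomena. The methods established here should help form a standard workflow for precision proteogenomics.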
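The six-frame translation database mentioned above can be illustrated with a minimal sketch. It assumes the standard nuclear genetic code and plain string input; the function names (`six_frame_translate`, `stop_free_segments`) are hypothetical and chosen for illustration only, and this is not the project's actual pipeline. Each strand is translated in three reading frames, and the stop-free segments are the candidate sequences that a search engine could match spectra against.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Translate both strands in all three reading frames.

    Keys are '+1'..'+3' for the forward strand and '-1'..'-3' for the
    reverse-complement strand.
    """
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def stop_free_segments(protein, min_len=2):
    """Split a translated frame at stop codons ('*') and keep segments
    of at least min_len residues (min_len is an illustrative threshold)."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


if __name__ == "__main__":
    for name, protein in six_frame_translate("ATGGCCTAA").items():
        print(name, protein, stop_free_segments(protein))
```

In practice a length threshold much larger than the default here would be used to keep the database size manageable, since a larger search space tends to raise the number of false matches among novel peptide candidates.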
Data last updated: 2023-05-31
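Effect of incidence angle on tip leakage flow in a compressor cascade under endwall suction control
ESO-based mismatched disturbance rejection for the double-gimbal servo system of a DGVSCMG
Seed quality evaluation and primary processing of the medicinal material Cistanche deserticola
Effects of malondialdehyde-induced oxidative modification on the structural properties of silver carp myofibrillar protein
Macro-level gap analysis of Chinese and international academic papers and journals, with suggestions for improvement
Proteogenomics-based genome re-annotation of Aspergillus flavus and global identification of protein post-translational modifications
Annotation of unknown transposons in the silkworm genome and comparative genomics
Impact of horizontal gene transfer on the genome evolution and phenotypic traits of Saccharomyces cerevisiae
Evolutionary genomics of domesticated populations of Saccharomyces cerevisiae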