Whether a transcript encodes protein is the gold standard for distinguishing protein-coding genes from non-coding RNA (ncRNA), but the recent detection of peptide-coding small open reading frames (sORFs) in lncRNA has challenged this standard. A growing number of studies show that peptide-coding sORFs occur in different regions of both eukaryotic and prokaryotic genomes and play important roles in biological activities. Because of their low expression level, low abundance, and short sequence length, few effective computational or experimental methods and data resources exist for peptide-coding sORFs, and their study is still at an early stage. At present, most studies concentrate on a few model eukaryotes and little is known about the intrinsic features of peptide-coding sORFs, which therefore pose additional challenges for genome annotation in the era of precision medicine. This project is proposed to comprehensively understand the intrinsic biological mechanisms of peptide-coding sORFs in prokaryotic genomes and to identify them accurately. First, the intrinsic sequence and structural features of peptide-coding sORFs will be analyzed in depth at the nucleic acid and protein levels by integrating multi-omics data with bioinformatics methods. Second, based on this analysis, mathematical and statistical models will be developed to describe their specific properties numerically, yielding efficient numerical descriptors on which the peptide-coding sORF prediction algorithm will be built. Third, the credibility of the predictions will be assessed with multi-omics sequencing and other biological techniques, and the prediction algorithm will be refined according to the feedback from this assessment.
Finally, a web server for the completed prediction algorithm will be developed, the algorithm will be applied to more prokaryotic genomes, and an online database of sORFs and the peptides they encode will be constructed. In summary, this project will provide a solid theoretical basis and useful tools for future genome annotation, ncRNA research, and peptide design studies.
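The ability to encode protein is the theoretical basis for distinguishing mRNA from non-coding RNA, but the recent discovery of peptide-coding small open reading frames (sORFs) in non-coding RNA has challenged it. Numerous studies show that peptide-coding sORFs are widespread across all regions of eukaryotic and prokaryotic genomes and have important biological functions. Owing to low expression level and abundance, short sequences, and other factors, effective research methods and data resources for peptide-coding sORFs are still lacking. Existing studies focus on only a few eukaryotic model organisms, and the intrinsic biological features of these sORFs are poorly understood. Peptide-coding sORFs therefore place higher demands on genome annotation in the "precision" era. The project focuses on: deeply integrating multi-omics sequencing (transcriptome, ribosome profiling, proteome, and others) with bioinformatics techniques; developing effective methods for analyzing sequence and structural features; mining, at the nucleic acid and protein levels, the multidimensional intrinsic features of peptide-coding sORFs in prokaryotic genomes and of the peptides translated from them; developing a generally applicable prediction algorithm for identifying prokaryotic peptide-coding sORFs; and building an online prediction platform and related database resources for peptide-coding sORFs. This will provide a solid theoretical and technical foundation for future research on genome annotation, non-coding RNA, and peptide-based drug design.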
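Peptide-coding sORFs are a long-overlooked form of genomic "dark matter". With advances in sequencing technology, a large body of research has confirmed that peptide-coding sORFs are widespread and perform important biological functions. Compared with conventional protein-coding genes (>300 bases), they have low expression levels and abundance, short sequences, and few applicable experimental techniques, so sORFs are difficult to study. Most work has concentrated on a few eukaryotic model organisms such as human, mouse, and Arabidopsis, with little on prokaryotes. Bioinformatics algorithms that effectively identify sORFs, and related database resources, are lacking, and understanding of their biological features, such as sequence and structural mechanisms, urgently needs to deepen. This project therefore deeply integrated multi-omics data, including transcriptome and proteome data, and combined bioinformatics techniques with molecular dynamics simulation to systematically study, at the nucleic acid and protein levels, the mining of multidimensional features of peptide-coding sORFs in prokaryotic genomes, including sequence, structure, function, and structural disorder. It also developed mathematical methods that effectively capture sequence and structural features, proposed a complete prediction algorithm for peptide-coding sORFs that is generally applicable to prokaryotic genomes, and developed a series of sORF data repositories and prediction platforms. In addition, the analysis methods developed in this project were applied to omics analysis of ecologically and agriculturally relevant microorganisms such as methanogens, with good results. The project thus provides a solid theoretical and methodological foundation for future sORF research.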
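To make the object of study concrete, the following is a minimal sketch of how candidate sORFs might be enumerated in a nucleotide sequence before any scoring takes place. It is not the project's prediction algorithm, which the abstracts do not describe in detail. The 300-base upper bound comes from the comparison with conventional protein-coding genes above. The start codon set, the 30-base lower bound, and all function names are assumptions chosen for illustration.

```python
# Hypothetical sketch: enumerate candidate sORFs on both strands of a DNA string.
# This is only a candidate-generation step, not a predictor of peptide coding.

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common prokaryotic start codons
STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_NT = 300  # from the text: conventional protein-coding genes are >300 bases
MIN_SORF_NT = 30   # assumption: arbitrary lower bound to drop trivial hits

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _scan_strand(seq):
    """Yield (start, end, orf) per stop codon, using the most upstream in-frame start."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in START_CODONS:
                start = i
            elif codon in STOP_CODONS:
                if start is not None:
                    length = i + 3 - start  # length includes the stop codon
                    if MIN_SORF_NT <= length <= MAX_SORF_NT:
                        yield start, i + 3, seq[start:i + 3]
                start = None  # reset after every stop codon


def find_sorfs(seq):
    """Return candidate sORFs as dicts with 0-based, end-exclusive forward coordinates."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for start, end, orf in _scan_strand(seq):
        hits.append({"strand": "+", "start": start, "end": end, "orf": orf})
    for start, end, orf in _scan_strand(reverse_complement(seq)):
        # Map reverse-strand coordinates back onto the forward strand.
        hits.append({"strand": "-", "start": n - end, "end": n - start, "orf": orf})
    return hits
```

Calling `find_sorfs` on a genome fragment returns every in-frame start-to-stop stretch within the length bounds. In a pipeline of the kind the abstract outlines, such candidates would then be filtered using sequence and structural descriptors and checked against transcriptome, ribosome profiling, or proteome evidence.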
Data last updated: 2023-05-31
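Genome-wide association analysis of leaf orientation value in maize
The DeoR-family transcription factor PsrB regulates prodigiosin synthesis in Serratia marcescens
CFRP strengthening of longitudinal rib-to-deck fatigue cracking in orthotropic steel bridge decks
Hardware Trojans: research progress on key issues and new trends
The impact of environmental NIMBY facilities on housing prices in Beijing: the case of large waste treatment facilities
Predicting potential driver genes of neuroblastoma through multi-omics data fusion and clinical data validation
Pan-cancer non-coding RNA crosstalk patterns based on multi-omics data fusion
Travel characteristic mining and demand forecasting modeling based on multi-source data fusion
Optimized screening of epigenetically dysregulated non-coding RNAs in malignant tumors by integrating multi-omics data, and study of their functions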