Title: Using Scoring Card Method for Predicting Polyketide Synthases and Nonribosomal Peptide Synthetases from Protein Sequences
Authors: 賴仕鈞
Lai, Shih-Chung
何信瑩
Ho, Shinn-Ying
Institute of Bioinformatics and Systems Biology
Keywords: polyketide synthases; nonribosomal peptide synthetases; scoring card method; intelligent genetic algorithm
Date of publication: 2013
Abstract: Polyketide synthases (PKS) and nonribosomal peptide synthetases (NRPS) are two classes of enzymes that catalyze the biosynthesis of pharmacologically active natural products. The biosynthesis of polyketides and nonribosomal peptides resembles an assembly line: these large megaenzymes can be divided into several modules, each containing a set of functional domains and active sites. Each module carries out one complete cycle of polyketide or polypeptide chain elongation and the associated functional-group modifications. By shuffling and modifying the modules, scientists can investigate the feasibility of producing artificial peptides and design customized polyketides and nonribosomal peptides. Because laboratory experiments demand a large investment of money and time, computational tools can help reduce the cost and search time of discovering new functional proteins and their mechanisms.

Most existing PKS/NRPS tools predict substrate specificity and rely on recognizing the adenylation (A) and acyltransferase (AT) domains. However, some newly discovered PKS and NRPS deviate from the canonical organization: they lack A and AT domains and are built from unusual, rare domains. Tools based on A and AT domains cannot handle such sequences. Moreover, few studies aim at predicting, from the protein sequence alone, whether a query protein is a PKS/NRPS.

This work proposes a novel method based on the scoring card method, SCM-PKS/NRPS, to solve this problem. The scoring card method is simple and easy to interpret; it uses dipeptide composition to score sequences. SCM-PKS/NRPS estimates the propensity scores of the 400 individual dipeptides from the statistical difference between the PKS/NRPS and non-PKS/NRPS proteins of a training data set, and determines a threshold score that separates the two classes. The propensity scores of all dipeptides are further optimized using an intelligent genetic algorithm (IGA), with the area under the ROC curve as the fitness function. The score of a sequence is the sum of the dipeptide propensity scores weighted by the dipeptide composition of that sequence. SCM-PKS/NRPS achieves an accuracy of 89.52% in 10-fold cross-validation on the training set and an accuracy of 82.84% on an independent test set. Additionally, the estimated propensity scores of the 20 amino acids are combined with the 531 physicochemical properties in AAindex to identify informative properties that characterize PKS/NRPS proteins and support biological interpretation.
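The scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the min-max scaling of the initial scores to [0, 1], and the toy sequences are all assumptions made for illustration, and the intelligent genetic algorithm that refines the scores is omitted. The sketch only shows how a dipeptide composition, an initial scorecard derived from the difference between the two classes, and a composition-weighted sequence score fit together.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 entries


def dipeptide_composition(seq):
    """Return the fraction of each of the 400 dipeptides in seq."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    return {d: (c / total if total else 0.0) for d, c in counts.items()}


def mean_composition(seqs):
    """Average the dipeptide composition over a set of sequences."""
    comps = [dipeptide_composition(s) for s in seqs]
    return {d: sum(c[d] for c in comps) / len(comps) for d in DIPEPTIDES}


def initial_scorecard(positives, negatives):
    """Initial propensity scores: class difference in mean composition.

    Assumption: scores are min-max scaled to [0, 1]; the thesis may use
    a different normalization, and it refines the scores further with IGA.
    """
    pos = mean_composition(positives)
    neg = mean_composition(negatives)
    diff = {d: pos[d] - neg[d] for d in DIPEPTIDES}
    lo, hi = min(diff.values()), max(diff.values())
    span = (hi - lo) or 1.0  # avoid division by zero if all differences match
    return {d: (v - lo) / span for d, v in diff.items()}


def sequence_score(seq, scorecard):
    """Sum of propensity scores weighted by the sequence's composition."""
    comp = dipeptide_composition(seq)
    return sum(comp[d] * scorecard[d] for d in DIPEPTIDES)


def classify(seq, scorecard, threshold):
    """Predict the positive class when the score exceeds the threshold."""
    return sequence_score(seq, scorecard) > threshold


# Toy, made-up sequences (not real PKS/NRPS data), for illustration only.
toy_positives = ["GGSGGSGG", "SGGSGGS"]
toy_negatives = ["KLKLKLKL", "LKLKLK"]
scorecard = initial_scorecard(toy_positives, toy_negatives)
```

In this sketch the threshold passed to `classify` is left to the caller. According to the abstract, the thesis derives the threshold from the training data and tunes the dipeptide scores with IGA, using the area under the ROC curve as the fitness function.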
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070157209
http://hdl.handle.net/11536/76087
Appears in Collections: Thesis