Title: Prediction of viable circular permutation sites from protein sequences
Authors: 王儷芬
黃鎮剛
羅惟正
Institute of Bioinformatics and Systems Biology
Keywords: circular permutation; protein sequences; machine learning
Publication date: 2010
Abstract: Circular permutation (CP) is an alternative to conventional mutagenesis for studying protein folding and engineering protein structures. It can be viewed as relocating the amino- and carboxyl-termini of a protein from their native positions to another position while the overall structure remains largely unchanged. Previous studies have shown that CP may enhance the stability, activity, or functional diversity of a protein; however, CP can also result in variants that fail to fold. Not every position in a protein sequence can be used to generate a viable, i.e., stable and correctly folded, circular permutant. Because CP is an expensive and complicated technique whose success depends on selecting a suitable permutation site, an accurate CP viability prediction method would be highly beneficial to protein research and engineering wherever CP can be applied. We previously developed a structure-based system for predicting CP viability, named CPred. Although CPred performed best among related methods, it cannot be applied when the structure of a protein is unavailable, and the number of known protein sequences far exceeds the number of solved structures. To facilitate the wider application of CP, this work aimed to develop a sequence-based method for predicting viable CP sites. Many sequence and structural preferences of known CP sites, as well as dynamic properties of viable CP sites, were reported previously or identified as we developed CPred. In this study, we hypothesized that if the secondary structure and several key tertiary structural and dynamic properties of a protein can be accurately predicted or simulated from its sequence, a prediction system with performance similar to CPred's can be built on the existing CPred framework using sequence as the only input. Our sequence-based CP viability prediction system, named CPred-seq, therefore integrates the same four machine learning methods as CPred: an artificial neural network, a support vector machine, a random forest, and a hierarchical feature integration procedure.
Assessed with the standard dihydrofolate reductase dataset, CPred-seq achieved an AUC of 0.80. Although lower than that of CPred (AUC = 0.91), this clearly exceeds the AUCs of previous structure-based CP viability prediction methods (AUC ≤ 0.7). CPred-seq is the first method that relies only on a protein sequence to predict viable CP sites.
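As a rough illustration of two ideas in the abstract, the sketch below shows how the sequence of a circular permutant can be derived from a chosen cut site, and how an AUC can be computed from per-site viability scores. It is not the CPred-seq implementation: the function names, the optional `linker` argument, the 1-based cut-site convention, and the toy inputs are all assumptions made for this example.

```python
from typing import Sequence


def circular_permutant(sequence: str, cut_site: int, linker: str = "") -> str:
    """Return the sequence of a circular permutant (illustrative only).

    cut_site is 1-based (an assumed convention): the residue at that
    position becomes the new N-terminus. The original C-terminus is
    joined to the original N-terminus through an optional linker.
    """
    if not 1 <= cut_site <= len(sequence):
        raise ValueError("cut_site must lie within the sequence")
    i = cut_site - 1
    return sequence[i:] + linker + sequence[:i]


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """AUC as the probability that a randomly chosen viable site scores
    higher than a randomly chosen non-viable site (ties count as 0.5)."""
    viable = [s for s, y in zip(scores, labels) if y]
    non_viable = [s for s, y in zip(scores, labels) if not y]
    if not viable or not non_viable:
        raise ValueError("both viable and non-viable sites are required")
    wins = sum(
        1.0 if v > n else 0.5 if v == n else 0.0
        for v in viable
        for n in non_viable
    )
    return wins / (len(viable) * len(non_viable))
```

For example, `circular_permutant("ABCDEFGH", 4)` moves the new N-terminus to the fourth residue of a made-up eight-residue sequence, and `auc` can be applied to any list of predicted scores paired with experimentally determined viability labels.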
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079851519
http://hdl.handle.net/11536/48210
Appears in Collections: Thesis