Title: 遠距離同源蛋白之序列比對在蛋白質功能與三級結構預測之應用
Protein sequence alignment methods for remote homologs detection and their applications in protein functional assignment and tertiary structure prediction
Authors: 陳志杰
Chen, Chih-Chieh
黃鎮剛
楊進木
Hwang, Jenn-Kang
Yang, Jinn-Moon
Institute of Bioinformatics and Systems Biology
Keywords: protein sequence alignment; substitution matrix; secondary structure; database search; fold recognition; remote homology; protein structure prediction; homology modeling
Issue date: 2009
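Abstract: Detecting and aligning homologous sequences is a key step in predicting protein function and structure and in evolutionary analysis. Substitution matrices are widely used to measure the similarity between pairs of sequences. The matrices in common use consider only the substitution relationships among the 20 amino acid types, which makes them unsuitable for detecting remote homologs. Secondary structure prediction now reaches about 80% accuracy, and structural information is usually more conserved than sequence information. We therefore developed a new substitution matrix, S2A2. S2A2 is a 60 × 60 substitution matrix computed over all possible pairings of the 20 amino acids with 3 secondary structure types. As expected, S2A2 differs markedly from BLOSUM62, which does not consider structural information. ROC and sequence alignment analyses show that our matrix outperforms BLOSUM62 and PSSM. We therefore believe that the S2A2 matrix can replace the sequence-based and profile-based scoring currently used for alignment and search.
To detect remote homologs more effectively, we also developed a new alignment scoring system, ProS2A2, which combines the information in S2A2 and PSSM and retains the advantages of each. We used the Lindahl, PFAM, and ProSup benchmarks to evaluate the accuracy of ProS2A2 in fold recognition and sequence alignment, and applied it to sequence database search. The evaluations consistently show that ProS2A2 outperforms sequence-based and profile-based alignment methods and is not inferior to structure-based methods. This study shows that the S2A2 matrix and the ProS2A2 scoring system are very helpful for detecting and aligning remote homologs.
Predicting protein tertiary structure gives deeper insight into protein function. Comparative modeling is among the more reliable methods for tertiary structure prediction; selecting the correct template and aligning the target sequence to the template are the two steps that decide whether a prediction succeeds. In this study, for template selection and sequence alignment we used an effective consensus strategy that combines three methods, PSI-BLAST, IMPALA, and T-Coffee, and developed an automated tertiary structure prediction system, (PS)2. Assessed by GDT_TS score on the 47 CASP6 targets classified as comparative modeling targets, our method outperformed 10 other automated methods. Because our method relies only on sequence consensus, it is much faster than methods that additionally consider structural consensus. The results show that a suitable consensus strategy combined with a new similarity score can significantly increase the success rate of structure prediction.
Template selection in the twilight zone (sequence similarity between 15% and 25%) remains a difficult problem in protein structure prediction. To address it, we modified our existing framework to improve the reliability and applicability of our structure predictions. First, we used the new ProS2A2 scoring system to detect remote homologs and to align the target sequence with the template. In addition, multiple-template and multiple-model strategies were used to build and assess the final predictions. On the 154 CASP8 targets classified as template-based modeling targets, our method ranked 6th among 72 automated methods and was faster than the five methods ranked above it. This study shows that the S2A2 matrix and the ProS2A2 scoring system improve the accuracy of template selection and sequence alignment for remote homologs, and that the multiple-template and multiple-model strategies significantly improve the accuracy of structure prediction. Finally, we have made the method available as a web tool, (PS)2-v2, at http://ps2v2.life.nctu.edu.tw.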
Homology detection and sequence alignment are critical steps in protein function prediction, structure prediction, and evolutionary analysis. Substitution matrices are widely used to measure the degree of similarity between two sequences. The substitution matrices in popular use today are usually constructed by considering only the types of the amino acids being substituted, and they are unable to detect homologous protein pairs with remote similarity. The accuracy of secondary structure prediction has reached ~80%, and secondary structure is often more conserved than amino acid sequence. Here, we developed a BLOSUM-like substitution matrix, S2A2, a 60 × 60 matrix that covers all possible pairings of the 20 amino acids with 3 secondary structure types. As expected, the matrix showed striking differences from the popular BLOSUM62 matrix, which does not include structural information. The S2A2 matrix outperformed the BLOSUM62 and PSSM matrices as assessed by alignment accuracy and by ROC curve analyses of the number of true and false hits. It can replace the substitution matrices in sequence-based and profile-based alignment and search methods for protein sequences. To detect remote homologs more effectively, we also developed a new scoring system, named ProS2A2, which combines the S2A2 matrix with the position-specific scoring matrix (PSSM) generated by PSI-BLAST. Our method was evaluated on the ProSup benchmark for alignment accuracy, and on the Lindahl and PFAM benchmarks for fold recognition and functional assignment, respectively. We also applied the method to search a large sequence database. The evaluation results consistently indicated that our method outperformed sequence-profile and profile-profile approaches and performed comparably to structure-based methods on these benchmarks. These results demonstrated that the S2A2 matrix and the ProS2A2 scoring system were very useful for remote homology detection and sequence alignment.
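As an illustration of the 60 × 60 layout described above, the following minimal sketch indexes a matrix by (amino acid, secondary structure) pairs and blends a matrix lookup with a PSSM score. Everything in it is an assumption made for illustration: the three-state alphabet H/E/C, the index order, the placeholder values in `toy_matrix` (which are not the real S2A2 entries), and the linear weighting in `combined_score`. The abstract states only that ProS2A2 combines S2A2 with the PSSM, not how.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues
SS_STATES = "HEC"  # assumed 3-state alphabet: helix, strand, coil


def state_index(aa, ss):
    """Map an (amino acid, secondary structure) pair to an index in 0..59."""
    return AMINO_ACIDS.index(aa) * len(SS_STATES) + SS_STATES.index(ss)


def toy_matrix():
    """Build a placeholder 60 x 60 matrix (NOT the real S2A2 values).

    Base score -1, plus 2 if the residues match, plus 1 if the
    secondary structure states match.
    """
    k = len(SS_STATES)
    n = len(AMINO_ACIDS) * k
    matrix = []
    for i in range(n):
        row = []
        for j in range(n):
            same_aa = int(i // k == j // k)
            same_ss = int(i % k == j % k)
            row.append(-1 + 2 * same_aa + same_ss)
        matrix.append(row)
    return matrix


def combined_score(matrix, q_aa, q_ss, t_aa, t_ss, pssm_score, weight=0.5):
    """Hypothetical blend of a 60 x 60 matrix lookup with a PSSM score.

    The linear weighting is an assumption, not the published ProS2A2 formula.
    """
    lookup = matrix[state_index(q_aa, q_ss)][state_index(t_aa, t_ss)]
    return weight * lookup + (1.0 - weight) * pssm_score


if __name__ == "__main__":
    m = toy_matrix()
    # Alanine in a helix aligned to alanine in a helix, with a PSSM score of 4.0
    print(combined_score(m, "A", "H", "A", "H", pssm_score=4.0))
```

Keying the matrix on residue and predicted structure together means that the same residue pair can score differently in a helix than in a strand, which is the extra signal a 20 × 20 matrix cannot express.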
Protein structure prediction provides valuable insights into function, and comparative modeling is one of the most reliable methods for predicting 3D structures directly from amino acid sequences. However, critical problems arise in selecting the correct templates and in aligning query sequences to them. Here, we developed an automatic protein structure prediction server, (PS)2, which uses an effective consensus strategy that combines PSI-BLAST, IMPALA, and T-Coffee in both template selection and target-template alignment. (PS)2 was evaluated on 47 comparative modeling targets in CASP6 (Critical Assessment of Techniques for Protein Structure Prediction). On this benchmark dataset, the predictive performance of (PS)2, based on the mean GDT_TS score, was superior to that of ten other automatic servers. Our method is based solely on sequence consensus and is thus considerably faster than methods that rely on the additional structural consensus of templates. Our results show that (PS)2, coupled with suitable consensus strategies and a new similarity score, can significantly improve structure prediction and modeling.
Identifying a template in the twilight zone of 15~25% sequence similarity between target and template is still difficult for template-based protein structure prediction. Here, starting from our original server and adding numerous enhancements and modifications, we developed a new server, (PS)2-v2, to improve reliability and applicability. First, the ProS2A2 alignment method was used to detect homologous proteins with remote similarity and to perform the target-template alignment. Then, multiple-template and multiple-model strategies were used to build and assess the final models. We tested our method on 154 template-based modeling (TBM) targets from the CASP8 dataset. Experimental results show that (PS)2-v2 ranked 6th among 72 servers and is faster than the five top-ranked servers, which utilize ab initio methods. The results also demonstrate that (PS)2-v2, with the S2A2 matrix and the ProS2A2 alignment method, is useful for template selection and target-template alignment because it blends amino acid and structural propensities. The multiple-template and multiple-model strategies significantly improve the accuracy of target-template alignments in the twilight zone. We believe that this server is useful in structure prediction and modeling, especially in detecting homologous templates with remote sequence similarity. (PS)2-v2 is available at http://ps2v2.life.nctu.edu.tw/.
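The GDT_TS score used in the CASP evaluations above is conventionally the mean, over distance cutoffs of 1, 2, 4, and 8 Å, of the percentage of target residues whose Cα atoms lie within the cutoff of their position in the experimental structure. The sketch below is a simplification under stated assumptions: it takes per-residue distances from a single, already computed superposition, whereas the standard calculation searches over superpositions to maximize each percentage. The function name and arguments are illustrative and are not part of (PS)2.

```python
def gdt_ts(distances, target_length=None):
    """Approximate GDT_TS from per-residue C-alpha distances in angstroms.

    Assumes the model has already been superimposed on the reference
    structure once; the standard score instead maximizes each cutoff's
    coverage over many superpositions.
    """
    cutoffs = (1.0, 2.0, 4.0, 8.0)
    # Residues missing from the model count against it when the full
    # target length is supplied.
    n = target_length if target_length is not None else len(distances)
    if n <= 0:
        raise ValueError("target length must be positive")
    fractions = [sum(1 for d in distances if d <= c) / n for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


if __name__ == "__main__":
    # Five aligned residues; one lies farther than 8 angstroms from its
    # reference position and so counts toward none of the cutoffs.
    print(gdt_ts([0.5, 1.5, 3.0, 7.0, 10.0]))
```

Passing `target_length` matters when a model covers only part of the target: dividing by the number of modeled residues alone would reward short, partial models.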
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079451807
http://hdl.handle.net/11536/40917
Appears in Collections: Thesis