Title: | PiSA-BLAST: A New Tool for Protein Structure Alignment and Database Search |
Authors: | Chi-hua Tung; Jinn-moon Yang; Institute of Bioinformatics and Systems Biology |
Keywords: | Protein structure alignment; Structure database search; Kappa and alpha angles; Substitution matrix; Real-time web services |
Issue Date: | 2004 |
Abstract: | As the number of known protein structures grows, searching structure databases efficiently has become increasingly important. This growth was near exponential in the early 1990s and has become linear over the past several years. When a new protein crystal structure is solved, researchers want to know whether it resembles any protein of known structure, and how closely. Given the large number of crystal structures now available, there is strong demand for a fast and accurate method of finding structures similar to a query structure. In this thesis we develop a new tool, PiSA-BLAST, for protein structure database search. It returns accurate alignments and greatly reduces search time, and it does not require superimposing two 3D structures.
The method transforms 3D structures into 1D sequences. It represents each protein structure by its kappa and alpha angles, as defined by the DSSP program. Using segment information and a clustering method, we derive a table of conversion rules that maps kappa and alpha angles to coded regions. With this table, every protein of known structure in the structure database is converted into a 1D sequence over 23 new codes, and the encoded sequences are collected into a sequence database. From this database we also derive a new substitution matrix, which serves as the scoring matrix when the encoded sequences are aligned. Finally, we combine these with BLAST, a well-known sequence alignment tool: given a query structure, PiSA-BLAST searches the encoded structure database in a short time and returns a list of structurally similar proteins.
We evaluated PiSA-BLAST on five diverse data sets selected from SCOP and the Protein Data Bank (PDB). On the SCOP 95 data set, with 108 query structures searched against 9,354 protein domains, the average precisions of PiSA-BLAST and CE are 78.2% and 82.1%, respectively, and the total execution times are 34 seconds for PiSA-BLAST and about 1,169,832 seconds for CE. PSI-BLAST achieves an average precision of 69.8% in 18.3 seconds. From these experiments we draw five observations. (1) PiSA-BLAST searches structure databases at close to the speed of BLAST and is about 34,000 times faster than CE on SCOP 95. (2) The accuracy of PiSA-BLAST is close to that of CE and much better than that of BLAST and PSI-BLAST, which are based on amino-acid sequences. These results imply that our new structural codes and substitution matrix are useful for protein structure alignment. (3) Like BLAST, PiSA-BLAST reports an e-value for each hit. An e-value of e-15 in structure database search plays the role that an e-value of e-3 plays in BLAST sequence database search: when the e-value is below e-15, PiSA-BLAST achieves about 90% accuracy for a query. (4) PiSA-BLAST is a useful fast filter: a quick first pass produces candidate hits that can then be analyzed with slower but more detailed and reliable tools such as CE and DALI. (5) PiSA-BLAST is available as a real-time web service for protein structure database search, as BLAST is for protein sequence search. We believe this work contributes to structural genomics and proteomics. |
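The abstract's core idea, turning per-residue kappa and alpha angles into a letter sequence that a sequence search tool can read, can be illustrated with a minimal sketch. This is not the thesis's encoding: the actual conversion table comes from clustering and uses 23 codes, neither of which is given in this record. The sketch below swaps in a uniform grid over the angle ranges with a hypothetical 20-letter alphabet, and it assumes kappa lies in [0, 180] and alpha in [-180, 180], with any value outside those ranges treated as undefined. All names (`ALPHABET`, `encode_residue`, `encode_structure`) are illustrative.

```python
from typing import Iterable, Tuple

# Hypothetical alphabet: 4 kappa bins x 5 alpha bins = 20 letters.
# The thesis uses 23 codes from a clustering-derived table (not shown here).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
UNDEFINED = "X"          # assumed placeholder for residues without valid angles
KAPPA_BINS = 4           # kappa assumed in [0, 180]   -> 45-degree bins
ALPHA_BINS = 5           # alpha assumed in [-180, 180] -> 72-degree bins


def encode_residue(kappa: float, alpha: float) -> str:
    """Map one (kappa, alpha) pair to a single letter via a uniform grid."""
    if not (0.0 <= kappa <= 180.0 and -180.0 <= alpha <= 180.0):
        return UNDEFINED
    # Clamp the upper edge (180 degrees) into the last bin.
    k = min(int(kappa // 45.0), KAPPA_BINS - 1)
    a = min(int((alpha + 180.0) // 72.0), ALPHA_BINS - 1)
    return ALPHABET[k * ALPHA_BINS + a]


def encode_structure(angles: Iterable[Tuple[float, float]]) -> str:
    """Encode a chain, given as (kappa, alpha) pairs, into a 1D string."""
    return "".join(encode_residue(k, a) for k, a in angles)


if __name__ == "__main__":
    chain = [(0.0, -180.0), (100.0, 50.0), (360.0, 360.0)]
    print(encode_structure(chain))
```

Strings produced this way could be collected into a sequence database and searched with a sequence alignment tool using a custom scoring matrix, which is the role the abstract assigns to BLAST and the new substitution matrix.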
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009151503 http://hdl.handle.net/11536/61413 |
Appears in Collections: | Theses |