Title: | PiSA-BLAST: A New Tool for Protein Structure Alignment and Database Search |
Authors: | Chi-hua Tung (董其桦); Jinn-moon Yang (杨进木); Institute of Bioinformatics and Systems Biology |
Keywords: | Protein structure alignment; Structure database search; Kappa and alpha angle; Substitution matrix; Real-time web services |
Issue Date: | 2004 |
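Abstract: | With the rapid growth in the number of protein structures in recent years, effective methods for searching structure databases have become increasingly important. When a new protein crystal structure is determined, researchers want to know whether it resembles proteins of known structure, and to what degree. Because the number of protein crystal structures is large, researchers need an accurate and efficient tool for finding similar structures. In this thesis we develop a new tool, PiSA-BLAST, which provides accurate alignment results and also greatly speeds up structure searches. The tool relies on two structural descriptors defined by the DSSP program, the kappa and alpha angles, which are analyzed with a clustering algorithm to obtain a conversion rule table. Using this table, every protein of known structure in the structure database is converted into a one-dimensional sequence, and these sequences are assembled into a sequence database. Based on this sequence database we also develop a new scoring matrix for computing alignment scores during sequence comparison. We then combine these components with the well-known sequence alignment tool BLAST: given a query protein structure, the tool quickly searches and aligns against a structure database containing a large number of sequences, without actually superimposing two tertiary structures, and returns a list of similar proteins. We selected five test sets from the SCOP and PDB databases to evaluate the performance of PiSA-BLAST. Taking as an example the search with 108 query structures against SCOP 95, a database of 9,354 protein structures, the average precision over the 108 queries was 78.2% for PiSA-BLAST and 82.1% for CE; the total search time was only 34 seconds for PiSA-BLAST, far less than the 1,169,832 seconds needed by CE. PSI-BLAST reached an average precision of 69.8% in 18.3 seconds. The results of this thesis support the following conclusions: (1) PiSA-BLAST searches a structure database at a speed close to that of BLAST and is about 34,000 times faster than CE. (2) PiSA-BLAST attains a precision close to that of CE and gives more accurate search results than amino-acid-based sequence alignment tools such as BLAST and PSI-BLAST. These results show that the structural encoding and scoring matrix we developed are sound and usable for structure alignment. (3) Just as BLAST reports an e-value for a sequence alignment, PiSA-BLAST reports one for a structure search. In our tests, PiSA-BLAST reached 90% precision when the e-value was below the threshold e-15. (4) PiSA-BLAST can serve as a fast screening tool for structure alignment: a quick first pass produces a set of candidates, which can then be analyzed a second time with slower but more thorough and reliable tools such as CE and DALI. (5) PiSA-BLAST is available as a web service, so users can search the structure database online in real time. Taken together, this work should contribute substantially to structural genomics and proteomics. Searching structure databases has become increasingly important as the number of known protein structures grows. This growth was nearly exponential in the early 1990s and has become linear over the past several years. As more protein crystal structures become available, demand is high for a fast and accurate method of finding structures similar to a query structure. In this thesis we have developed a new tool, termed PiSA-BLAST, for protein structure database search that does not require the alignment of two 3D structures. We have developed a new method for protein structure alignment that transforms 3D structures into 1D sequences. The method uses the kappa and alpha angles, derived from the DSSP program, to represent the protein 3D structure. 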
Based on segment information and a clustering method, we transform the structural information carried by the kappa and alpha angles into coded regions. Each protein 3D structure can then be converted into a 1D sequence over 23 new codes, and we developed a new substitution matrix to serve as the scoring matrix for aligning these sequences. The encoded sequences are collected into a structure database. Running BLAST, a well-known sequence alignment tool, against this database quickly yields a list of proteins that are similar in structure. We evaluated PiSA-BLAST on five diverse data sets from SCOP and the Protein Data Bank. For the SCOP 95 data set, with 108 queries against 9,354 protein domains, the average precisions of PiSA-BLAST and CE are 78.2% and 82.1%, respectively, and the total execution times are 34 seconds for PiSA-BLAST and about 1,169,832 seconds for CE. PSI-BLAST reaches an average precision of 69.8% in 18.3 seconds. From these experiments we draw several observations: (1) PiSA-BLAST is about as fast as BLAST for protein structure database search and is about 34,000 times faster than CE on SCOP 95. (2) The accuracy of PiSA-BLAST is close to that of CE and much better than that of BLAST and PSI-BLAST, which are based on amino-acid sequences. These results imply that our new structural codes and substitution matrix are useful for protein structure alignment. (3) PiSA-BLAST provides an e-value for structure database search, with e-15 serving as the significance threshold much as e-3 does in BLAST for sequence database search. PiSA-BLAST achieved about 90% accuracy for a query when the e-value is less than e-15. (4) PiSA-BLAST is a useful filter to run before a detailed database search with tools such as CE and DALI. (5) PiSA-BLAST provides real-time web services for protein structure database search, as BLAST does for protein sequence search. We believe that this work is important for structural genomics and proteomics. |
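The encoding step described in the abstract, in which each residue's kappa and alpha angles are mapped to a letter through a clustering-derived rule table, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the four-letter `CENTROIDS` table and its angle values are invented (the thesis uses 23 codes whose table is not reproduced in this record), the nearest-centroid rule stands in for whatever assignment rule the thesis actually applies, and the function names are hypothetical. Kappa is treated as a plain angle and alpha as a periodic dihedral angle, both in degrees.

```python
import math

# Hypothetical rule table: code letter -> (kappa, alpha) centroid in degrees.
# These values are invented for illustration only; the thesis derives its own
# table by clustering and uses 23 codes rather than 4.
CENTROIDS = {
    "A": (110.0, 50.0),
    "B": (30.0, -170.0),
    "C": (80.0, -60.0),
    "D": (60.0, 120.0),
}


def circular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def encode_residue(kappa, alpha):
    """Return the code letter whose centroid is nearest to (kappa, alpha)."""
    def dist(item):
        centroid_kappa, centroid_alpha = item[1]
        # alpha is periodic, so compare it with a wrap-around difference.
        return math.hypot(kappa - centroid_kappa,
                          circular_diff(alpha, centroid_alpha))
    return min(CENTROIDS.items(), key=dist)[0]


def encode_structure(angles):
    """Turn a list of per-residue (kappa, alpha) pairs into a 1D string."""
    return "".join(encode_residue(k, a) for k, a in angles)
```

A call such as `encode_structure([(108.0, 52.0), (35.0, 175.0)])` returns one letter per residue. Strings of this kind are what a sequence search tool would then compare, scored with a substitution matrix defined over the same alphabet.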
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009151503 http://hdl.handle.net/11536/61413 |
Appears in Collections: | Thesis |
Files in This Item:
If it is a zip file, please download the file and unzip it, then open index.html in a browser to view the full text content.