Title: On the sequence-structure-dynamics relationship of proteins
Authors: Chih-Hao Lu (陆志豪)
Jenn-Kang Hwang (黄镇刚)
Institute of Bioinformatics and Systems Biology
Keywords: disulfide bond; disulfide connectivity; disulfide pattern; local structure alignment; structural motif; protein dynamics; thermal fluctuations; genetic algorithm; feature selection; support vector machine
Issue Date: 2007
Abstract:
Disulfide bonds play important roles in both stabilizing protein conformations and regulating protein functions. The ability to infer disulfide connectivity directly from protein sequences would be useful in structural modeling and functional analysis. However, predicting disulfide connectivity from sequence remains a major challenge for computational biologists because disulfide connectivity is nonlocal: the two cysteines that form a disulfide bond must be close in space but need not be close in sequence. Recently, Chen and Hwang developed an approach that defines each distinct disulfide pattern as a class and treats the problem as a multi-class classification solved with a support vector machine. Their method significantly improved the prediction accuracy of disulfide connectivity on a standard benchmark dataset whose proteins share less than 30% sequence identity. However, the number of possible disulfide patterns grows rapidly with the number of disulfide bonds, so the performance of the method drops off quickly as the number of bonds increases. In this work, we represent disulfide patterns in terms of cysteine pairs. We predict the bonding states of the cysteine pairs using a support vector machine together with feature selection by a genetic algorithm. Because the number of bonding states of a cysteine pair remains constant regardless of the number of disulfide bonds, we avoid the class explosion that occurs as the number of bonds grows. We then construct the connectivity matrix from the bonding states of the cysteine pairs to predict the complete disulfide pattern. Our approach outperforms other current approaches and may provide a useful tool for the study of disulfide proteins.
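The class explosion and the final pattern-selection step can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not say how the pattern is read off the connectivity matrix, so the sketch assumes an exhaustive search for the pairing with the highest summed pair score. The function names and the `score` matrix are hypothetical, and an even number of bonded cysteines is assumed.

```python
def count_patterns(n_bonds):
    """Number of ways to pair 2*n_bonds cysteines: 1 * 3 * 5 * ... * (2n - 1)."""
    total = 1
    for k in range(1, 2 * n_bonds, 2):
        total *= k
    return total


def enumerate_patterns(cysteines):
    """Yield every complete pairing of the given (sorted) cysteine indices."""
    if not cysteines:
        yield []
        return
    first, rest = cysteines[0], cysteines[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in enumerate_patterns(remaining):
            yield [(first, partner)] + tail


def best_pattern(score):
    """Return the pairing with the highest summed pair score.

    score[i][j] is a hypothetical bonding score for cysteines i and j,
    standing in for one entry of the connectivity matrix.
    Assumes len(score) is even.
    """
    indices = list(range(len(score)))
    return max(enumerate_patterns(indices),
               key=lambda pattern: sum(score[i][j] for i, j in pattern))
```

For example, `count_patterns(4)` gives 105 candidate patterns for four bonds, whereas each cysteine pair only ever has two bonding states. Because the exhaustive search grows at the same rate as the number of patterns, it is only practical for a few bonds; a maximum-weight matching algorithm would be the usual substitute for larger cases.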
Identifying functional structural motifs in protein structures of unknown function has become increasingly important in recent years owing to the progress of structural genomics projects. Although some structural patterns, such as the Asp-His-Ser catalytic triad, are easy to detect because of their conserved residues and stringently constrained geometry, it is usually more challenging to detect general structural motifs such as the ββα-metal binding motif, which has a much more variable conformation and sequence. At present, the identification of these motifs usually relies on manual procedures based on different structure and sequence analysis tools. In this study, we developed a structural alignment algorithm that combines structural and sequence information to identify local structural motifs. We applied our method to two test cases: the ββα-metal binding motif and the treble clef motif. The ββα-metal binding motif plays an important role in non-specific DNA interactions and cleavage in host defense and apoptosis. The treble clef motif is a zinc-binding motif adaptable to diverse functions such as the binding of nucleic acids and the hydrolysis of phosphodiester bonds. Our results are encouraging, indicating that we can effectively identify these structural motifs in an automatic fashion. Our method may provide a useful means for automatic functional annotation by detecting structural motifs associated with particular functions.
Recently, Shih et al. developed a method (Shih et al. Proteins: Structure, Function, and Bioinformatics 2007) to compute the relationship between protein structure and thermal fluctuations. This method, referred to as the protein fixed-point model, is based on the positional vectors of atoms issuing from the fixed point, which is the point of least fluctuation in the protein. One corollary of this model is that atoms lying on the same shell centered at the fixed point have the same thermal fluctuations. In practice, this model provides a convenient way to compute the average dynamical properties of proteins directly from their geometrical shapes without any mechanical model, so no trajectory integration or sophisticated matrix operations are needed. As a result, it is more efficient than molecular dynamics simulation or normal mode analysis. Although the protein fixed-point model was successfully applied to a number of proteins of various folds in the previous study, it is not clear to what extent the model can be applied. In this report, we carried out a comprehensive analysis of the protein fixed-point model on a dataset of high-resolution X-ray structures with pairwise sequence identity below 25%. We found that in most cases the protein fixed-point model works well. However, for proteins comprising multiple domains, each domain should be treated separately as an independent dynamical module with its own fixed point; and for a protein complex comprising a number of subunits that functions as a biological unit, the whole complex should be considered a single dynamical module with one fixed point. Under these considerations, the resulting correlation coefficient between the computed and the X-ray B-factors for the dataset is 0.59, and 75% (727/972) of the proteins have a correlation coefficient of at least 0.5. Our results show that the fixed-point model is indeed quite general and will be a useful tool for high-throughput analysis of the dynamical properties of proteins.
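A minimal Python sketch of how such a comparison could be set up is given below. The abstract does not state how the fixed point is located or how distance maps to fluctuation, so the sketch assumes that the fixed point is the centroid of the supplied atoms and that the fluctuation profile is the squared distance from it. Both are labeled assumptions, and the function names are hypothetical.

```python
import math


def fixed_point_profile(coords):
    """Squared distance of each atom from the assumed fixed point.

    Assumption: the fixed point is taken to be the centroid of the atoms.
    Atoms at equal distance from it receive equal profile values.
    """
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    return [sum((c[k] - centroid[k]) ** 2 for k in range(3)) for c in coords]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

Given per-atom coordinates and experimental B-factors for one dynamical module (a single domain, or a whole complex that functions as a biological unit), `pearson(fixed_point_profile(coords), b_factors)` would give the kind of correlation coefficient reported above. For a multi-domain protein, the profile would be computed separately for each domain.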
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009151509
http://hdl.handle.net/11536/61458
Appears in Collections: Thesis