Title: On the sequence-structure-dynamics relationship of proteins
Authors: Chih-Hao Lu (陸志豪)
Jenn-Kang Hwang (黃鎮剛)
Institute of Bioinformatics and Systems Biology
Keywords: disulfide bond; disulfide connectivity; disulfide pattern; local structure alignment; structural motif; protein dynamics; thermal fluctuations; genetic algorithm; feature selection; support vector machine
Issue date: 2007
Abstract:
Disulfide bonds play important roles in both stabilizing protein conformations and regulating protein functions. The ability to infer disulfide connectivity directly from protein sequences would be useful in structural modeling and functional analysis. However, predicting disulfide connectivity from sequence remains a major challenge for computational biologists because disulfide connectivity is nonlocal: the two cysteines that form a disulfide bond must be close in space but need not be close in sequence. Recently, Chen and Hwang developed an approach that defines each distinct disulfide pattern as a class and treats the problem as a multi-class classification solved with the support vector machine technique. Their method significantly improved the prediction accuracy of disulfide connectivity on a standard benchmark dataset whose proteins share less than 30% sequence identity. However, the number of possible disulfide patterns grows rapidly with the number of disulfide bonds, so the performance of the method drops off quickly as the number of bonds increases. In this work, we represent disulfide patterns in terms of cysteine pairs and predict the bonding state of each cysteine pair using a support vector machine together with feature selection by a genetic algorithm. Because the number of bonding states of a cysteine pair stays constant regardless of the number of disulfide bonds, we avoid the class explosion that occurs as the number of bonds grows. Finally, we construct the connectivity matrix from the bonding states of the cysteine pairs to predict the complete disulfide pattern. Our approach outperforms other current approaches and may provide a useful tool for the study of disulfide proteins.
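The class explosion described above can be made concrete with a short sketch. It counts the possible disulfide patterns for a given number of bonds, counts the candidate cysteine pairs, and then turns a matrix of pair scores into a full pattern. The decoding step here is an assumption for illustration only: it picks the perfect pairing with the highest total score by brute-force enumeration, which the abstract does not specify. The function names and the score values are hypothetical.

```python
from math import factorial


def n_patterns(n_bonds: int) -> int:
    """Ways to pair 2n cysteines into n disulfide bonds: (2n)! / (2^n * n!)."""
    return factorial(2 * n_bonds) // (2 ** n_bonds * factorial(n_bonds))


def n_pairs(n_bonds: int) -> int:
    """Candidate cysteine pairs among 2n cysteines: C(2n, 2)."""
    m = 2 * n_bonds
    return m * (m - 1) // 2


def pairings(items):
    """Yield every perfect pairing of an even-length list of cysteine indices."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def best_pattern(score):
    """Assumed decoding step: choose the pairing with the highest total score."""
    indices = list(range(len(score)))
    return max(pairings(indices), key=lambda p: sum(score[i][j] for i, j in p))


# Hypothetical bonded-state scores for four cysteines (symmetric, made up).
score = [
    [0.0, 0.2, 0.9, 0.1],
    [0.2, 0.0, 0.3, 0.8],
    [0.9, 0.3, 0.0, 0.2],
    [0.1, 0.8, 0.2, 0.0],
]
pattern = best_pattern(score)  # pairs cysteine 0 with 2, and 1 with 3
```

For five disulfide bonds, `n_patterns(5)` gives 945 pattern classes, whereas the pair-based formulation only ever asks a two-state question (bonded or not) about each of the `n_pairs(5)` = 45 candidate pairs.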
Identifying functional structural motifs in protein structures of unknown function has become increasingly important in recent years with the progress of structural genomics projects. Although some structural patterns, such as the Asp-His-Ser catalytic triad, are easy to detect because of their conserved residues and stringently constrained geometry, it is usually more challenging to detect general structural motifs such as the ββα-metal binding motif, which has a much more variable conformation and sequence. At present, the identification of these motifs usually relies on manual procedures based on different structure and sequence analysis tools. In this study, we developed a structural alignment algorithm that combines structural and sequence information to identify local structural motifs. We applied our method to two test cases: the ββα-metal binding motif and the treble clef motif. The ββα-metal binding motif plays an important role in non-specific DNA interactions and cleavage in host defense and apoptosis. The treble clef motif is a zinc-binding motif adaptable to diverse functions such as the binding of nucleic acids and the hydrolysis of phosphodiester bonds. Our results are encouraging, indicating that we can effectively identify these structural motifs in an automatic fashion. Our method may provide a useful means of automatic functional annotation by detecting structural motifs associated with particular functions.
Recently, Shih et al. developed a method (Shih et al. Proteins: Structure, Function, and Bioinformatics 2007) to relate protein structure to atomic fluctuations. This method, referred to as the protein fixed-point model, is based on the positional vectors of atoms issuing from the fixed point, which is the point of least fluctuation in the protein. One corollary of this model is that atoms lying on the same shell centered at the fixed point have the same thermal fluctuations.
In practice, this model provides a convenient way to compute the average dynamical properties of proteins directly from their geometrical shapes without any mechanical model, so no trajectory integration or sophisticated matrix operations are needed. As a result, it is more efficient than molecular dynamics simulation or normal mode analysis. Although the protein fixed-point model was successfully applied to a number of proteins of various folds in the previous study, it is not clear to what extent the model can be applied. In this report, we carried out a comprehensive analysis of the protein fixed-point model on a dataset of high-resolution X-ray structures with pairwise sequence identity below 25%. We found that in most cases the protein fixed-point model works well. However, for proteins comprising multiple domains, each domain should be treated separately as an independent dynamical module with its own fixed point; and for a protein complex comprising a number of subunits that functions as a biological unit, the whole complex should be considered as a single dynamical module with one fixed point. Under these considerations, the correlation coefficient between the computed and the X-ray B-factors for the dataset is 0.59, and 75% (727/972) of the proteins have a correlation coefficient of at least 0.5. Our results show that the fixed-point model is indeed quite general and will be a useful tool for high-throughput analysis of the dynamical properties of proteins.
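The evaluation described above, comparing a purely geometric fluctuation profile with crystallographic B-factors, can be sketched as follows. Two points are assumptions that the abstract does not state: the fixed point is approximated by the centroid of the atoms, and the fluctuation of each atom is taken to scale with its squared distance from that point. The coordinates and B-factors are made-up toy values, and all function names are hypothetical.

```python
from math import sqrt


def centroid(coords):
    """Assumed stand-in for the fixed point: the mean position of all atoms."""
    n = len(coords)
    return tuple(sum(c[k] for c in coords) / n for k in range(3))


def squared_distances(coords, point):
    """Squared distance of each atom from the given point (assumed profile)."""
    return [sum((c[k] - point[k]) ** 2 for k in range(3)) for c in coords]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Toy chain of five atoms on a line, with made-up B-factors.
coords = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
b_factors = [30.0, 15.0, 10.0, 16.0, 28.0]

fixed_point = centroid(coords)
profile = squared_distances(coords, fixed_point)
r = pearson(profile, b_factors)
```

Atoms at equal distance from the fixed point receive the same profile value, which mirrors the shell corollary. For a multi-domain protein, the same functions would be applied to each domain's atoms separately, and for a complex acting as one biological unit they would be applied to all subunits together.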
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009151509
http://hdl.handle.net/11536/61458
Appears in Collections: Thesis