Title: 蛋白質於細胞位置之預測
Prediction of Protein Subcellular Localization
Authors: 游景盛
Chin-Sheng Yu
黃鎮剛
Jenn-Kang Hwang
Department of Biological Science and Technology
Keywords: subcellular localization; support vector machine; sequence identity
Issue Date: 2006
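Abstract: The subcellular localization of a protein is closely tied to its physiological function, and it is generally believed that this information helps researchers analyze a protein's function more quickly and effectively. This need has driven the continual development of prediction tools in recent years. With the number of amino acid and nucleotide sequences growing rapidly, computational tools that predict and analyze directly from sequence are especially important. The methods built on different algorithms differ greatly, and their prediction results also vary widely with the source organism of the sequences and with the localization category. In this thesis, I first used support vector machines (SVMs) based on a variety of n-peptide compositions and tested them on several established benchmark data sets; in each case they gave better predictions than existing methods. Next, to keep improving the method and to examine its strengths, weaknesses, and limitations in depth, I checked these benchmark data sets by sequence alignment and found that the widely used data sets all contain a very high proportion of highly homologous sequences, which leads to overestimated prediction results. In earlier work, Rost and Nair (Protein Sci, 11:2836-47 (2002)) examined the relationship between protein sequence similarity and subcellular localization, namely that proteins in the same subcellular location carry relatively conserved amino acid sequences, and clearly defined a similarity threshold above which sequence alignment can identify localization. I also developed a two-level support vector machine system: the first level builds several effective SVM classifiers from different feature vectors derived from the amino acid sequence, and the second level combines the predictions of those classifiers with an SVM, yielding a probability distribution over the possible subcellular locations of a protein. Sequence alignment and this method were then each applied to two very commonly used benchmark data sets, with all-against-all pairwise comparison of sequence identity for the former. Consistent with earlier studies, when amino acid sequence identity falls below 30%, the ability of sequence alignment to determine a protein's subcellular location drops sharply, whereas the predictive power of the two-level SVM is unaffected and far exceeds that of identity-based assignment. Exploiting this property, the two were combined, and the hybrid method gave excellent prediction results. We built the resulting tool into a web server, named CELLO (subCELlular Localization predictive system), for researchers to use; it should be of considerable help in analyzing large volumes of high-throughput proteomic and genomic data.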
Because a protein's function is usually related to its subcellular localization, the ability to predict localization directly from protein sequences helps biologists infer protein function. Recent years have seen a surge of interest in developing computational tools for this task, and with the rapid increase in sequenced genomic data, the need for an automated and accurate predictor is increasingly important. Existing approaches, based on a wide range of algorithms, have achieved varying degrees of success for specific organisms and for certain localization categories. In this thesis, I used a support vector machine (SVM) method based on n-peptide composition to predict the subcellular locations of proteins. For an unbiased assessment, we first applied our approach to several independent data sets, on which it outperformed other approaches. A number of authors have noted that sequence similarity is useful in predicting subcellular localization. For example, Rost and Nair (Protein Sci, 11:2836-47 (2002)) carried out an extensive analysis of the relation between sequence similarity and identity of subcellular localization and found a close relationship between them above a certain similarity threshold. However, many existing benchmark data sets used to assess prediction accuracy contain highly homologous sequences; some include sequences with up to 80-90% sequence identity. Using such benchmark data will lead to overestimation of the performance of the methods considered.
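The n-peptide composition features mentioned above can be illustrated with a minimal sketch. It assumes the simplest form of the idea: the normalized frequency of every contiguous n-residue window over the 20 standard amino acids. The function name `n_peptide_composition` and the handling of non-standard residues are assumptions made for illustration; the thesis may use other variants of these features.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def n_peptide_composition(sequence, n=2):
    """Return the frequency of every possible n-peptide as a fixed-length list.

    The list has 20**n entries, ordered like itertools.product over AMINO_ACIDS.
    """
    sequence = sequence.upper()
    # Slide a window of length n across the sequence.
    windows = [sequence[i:i + n] for i in range(len(sequence) - n + 1)]
    # Assumption: windows containing non-standard symbols (e.g. X) are skipped.
    windows = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(windows)
    total = len(windows)
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    return [counts[k] / total if total else 0.0 for k in keys]


dipeptide_vector = n_peptide_composition("ACAC", n=2)
```

A vector like `dipeptide_vector` (400 entries for n = 2) could then serve as the input to an SVM classifier; each choice of n gives a different feature type.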
Here, we developed an approach based on a two-level SVM system: the first level comprises a number of SVM classifiers, each based on a specific type of feature vector derived from sequences; the second-level SVM classifier functions as a jury machine that generates the probability distribution of decisions over the possible localizations. We compared our approach with a global sequence alignment approach and with other existing approaches on two often-used benchmark data sets, one comprising prokaryotic sequences and the other eukaryotic sequences. Furthermore, we carried out all-against-all sequence alignment for several data sets to check the relationship between sequence homology and localization. Our results, which are consistent with previous studies, indicate that the homology search approach performs surprisingly well for sequences sharing sequence identity as low as 30%, but its performance deteriorates considerably below that level. A data set with high homology levels will therefore lead to a biased assessment of predictive approaches, especially those relying on homology search or sequence annotations. Since our two-level SVM classification system does not rely on homology search, its performance remains relatively unaffected by sequence homology. Our approach outperformed other existing approaches, even though some of them use homology search as part of their algorithms. For practical use, we also developed a hybrid method that pipelines the two-level SVM classifier and the homology search method in sequential order as a general tool for annotating sequences with subcellular localization. Our approaches should be valuable in the high-throughput analysis of genomic and proteomic data.
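The hybrid idea described above can be sketched as a simple decision rule. This is an illustration under stated assumptions, not the thesis implementation: it assumes the homology search result is available as a (percent identity, annotated location) pair for the closest annotated sequence, that the second-level SVM output is a dictionary of location probabilities, and that the homolog's annotation is used only at or above the 30% identity level mentioned in the abstract. The name `hybrid_predict` and the ordering of the two steps are hypothetical.

```python
def hybrid_predict(best_hit, svm_probabilities, identity_threshold=30.0):
    """Pick a localization from a homology hit or from SVM probabilities.

    best_hit: (percent_identity, location) of the closest annotated
        sequence, or None when the search found nothing (assumed format).
    svm_probabilities: dict mapping location -> probability, standing in
        for the output of the second-level (jury) SVM.
    Returns (location, source), where source is "homology" or "svm".
    """
    if best_hit is not None:
        identity, location = best_hit
        # Assumption: trust the homolog's annotation only above the threshold.
        if identity >= identity_threshold:
            return location, "homology"
    # Otherwise fall back to the most probable SVM prediction.
    location = max(svm_probabilities, key=svm_probabilities.get)
    return location, "svm"


probs = {"cytoplasm": 0.7, "nucleus": 0.3}
close_homolog = hybrid_predict((85.0, "nucleus"), probs)
remote_homolog = hybrid_predict((22.0, "nucleus"), probs)
```

With these inputs, the close homolog's annotation is used directly, while the remote hit at 22% identity is ignored in favor of the classifier's most probable location.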
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009128802
http://hdl.handle.net/11536/56079
Appears in Collections: Thesis


Files in This Item:

  1. 880201.pdf
