A synonymous dictionary based approach for protein sequence analysis and classification

Title:	A synonymous dictionary based approach for protein sequence analysis and classification
Authors:	Lin, Hsin-Nan (林信男); Hsu, Wen-Lian (許聞廉); Ho, Shinn-Ying (何信瑩); Institute of Bioinformatics and Systems Biology
Keywords:	protein sequence analysis; structure prediction; subcellular localization; remote homology detection; synonymous dictionary
Issue Date:	2010
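Abstract:	As the number of protein sequences keeps growing, protein sequence analysis and classification has become a very important topic in bioinformatics. Many studies show that protein secondary structure greatly helps in understanding protein function and tertiary structure, and that predicting where a protein localizes in the cell helps in analyzing protein function and discovering drug targets. Finding homologous protein sequences is another important problem: detecting homologous proteins makes it possible to infer the likely functions and properties of an unknown protein more quickly. In this study we therefore propose a synonymous dictionary based approach for protein sequence analysis and classification, and apply it to protein secondary structure prediction, protein subcellular localization prediction, and homologous protein detection. For sequence analysis we adopt concepts from natural language processing and use synonymous words to capture local similarities among a group of homologous proteins. A synonymous word is an n-character amino acid fragment, and a set of synonymous words shows the sequence variations that may have occurred during a protein's evolution. Using PSI-BLAST, we generate a protein-dependent synonym dictionary from a set of protein sequences and use this dictionary as the reference for protein sequence analysis and classification.

For protein secondary structure prediction, we developed SymPred and SymPsiPred on top of the synonym dictionary. Tested on a set of protein sequences with sequence similarity below 25%, SymPred and SymPsiPred achieve average Q3 values of 81.0% and 83.9%, respectively. On two public EVA test sets, the average Q3 of SymPred is 78.8% and 79.2%, which is 1.4% to 5.4% higher than existing methods. Our analysis finds that the accuracy of SymPred is positively correlated with the number of known protein sequences, which suggests that the accuracy of SymPred and SymPsiPred will keep improving as the number of protein sequences increases.

For protein subcellular localization prediction, we developed KnowPredsite, an automatic prediction method based on the synonym dictionary. KnowPredsite predicts both single-localized and multi-localized proteins. The public test dataset contains 25,887 single-localized proteins and 2,169 multi-localized proteins taken from 1,923 different species. The experimental results show that KnowPredsite is more accurate than many existing subcellular localization predictors. For single-localized proteins, the accuracy of KnowPredsite is 91.7%, compared with 88.8% for ngLOC. For multi-localized proteins, the accuracy of KnowPredsite is 72.1%, compared with 59.7% for ngLOC. In addition, the predictions of KnowPredsite are explainable: it can list the sources behind each prediction. The results show that even when sequence similarity is low, the synonym dictionary still captures meaningful local sequence similarities that help prediction.

For homologous protein detection, we developed SymDetector, based on the synonym dictionary, to detect homologous proteins with very low sequence similarity. We downloaded a public benchmark dataset containing 2,476 protein sequences with very low mutual similarity. When one false positive pair is allowed, SymDetector detects 5,308 true positive pairs, whereas the existing methods ConSequenceS and PSI-BLAST detect fewer than 1,000. When the number of allowed false positive pairs rises to 100 and 1,000, SymDetector detects 6,906 and 7,666 true positive pairs, respectively; under the same conditions ConSequenceS detects only 2,000 and 3,500, and PSI-BLAST detects only about half as many as ConSequenceS.

With the increasing number of protein sequences, protein sequence analysis and classification is an important issue in bioinformatics. Many studies show that protein secondary structure plays an important role in analyzing and modeling protein structures and in characterizing the structural topology of proteins, because secondary structure represents the local conformation of amino acids into regular structures.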
The study of protein subcellular localization (PSL) is important for elucidating protein functions involved in various cellular processes. Most PSL prediction systems are designed for single-localized proteins. However, a significant number of eukaryotic proteins are known to localize to multiple subcellular organelles. Many studies have shown that proteins may simultaneously reside in, or move between, different cellular compartments and play different roles in different biological processes.

The analysis of novel proteins usually starts with a search for homologous proteins in annotated databases. Homologous proteins share a common ancestor and thus often have similar functions and structures. Based on pairwise identities and specific thresholds, sequence search tools retrieve annotated homologous sequences in order to infer annotations for the novel sequences. As the number of protein sequences grows, sensitive homology detection strategies that use sequence information alone remain in demand and are of great importance in the post-genomic era. Sequence similarity is a simple and frequently used metric for homology detection and other annotation transfers. However, sequence alone provides incomplete and noisy information about protein homology, and many improvements to homology searching and sequence comparison have been developed to overcome this limitation.

Based on these observations, we propose a general approach based on a synonymous dictionary for protein sequence analysis and classification. We apply it to the problems of protein secondary structure prediction, protein subcellular localization, and remote homology detection. We adopt techniques from natural language processing and use synonymous words to capture local sequence similarities in a group of similar proteins. A synonymous word is an n-gram pattern of amino acids that reflects the sequence variation in a protein's evolution.
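The idea of synonymous words as n-grams can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the dictionary layout, and the input format (hit sequences already aligned to the query, same length, with '-' marking gaps) are all assumptions made for illustration, and the parsing of real PSI-BLAST output is omitted.

```python
from collections import defaultdict


def ngrams(sequence, n):
    """Return (start index, fragment) for every overlapping n-character fragment."""
    return [(i, sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def build_synonym_dictionary(query, aligned_hits, n):
    """Map each n-gram seen in the alignment to the query words it aligns with.

    Hypothetical simplification: every hit is a string of the same length as
    the query, already aligned column by column, with '-' marking a gap.
    """
    dictionary = defaultdict(set)
    for start, word in ngrams(query, n):
        dictionary[word].add(word)  # a word is trivially its own synonym
        for hit in aligned_hits:
            synonym = hit[start:start + n]
            if "-" not in synonym:  # skip fragments that contain a gap
                dictionary[synonym].add(word)
    return dict(dictionary)


if __name__ == "__main__":
    # Toy sequences, not real proteins.
    synonyms = build_synonym_dictionary("MKVLA", ["MRVLA", "MK-LA"], n=3)
    for synonym, sources in sorted(synonyms.items()):
        print(synonym, sorted(sources))
```

In this toy alignment, the fragment MRV from the first hit is recorded as a synonym of the query word MKV, while every fragment of the second hit is skipped because it overlaps the gap.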
We generate a protein-dependent synonym dictionary from a set of protein sequences.

Protein secondary structure prediction: On a large non-redundant dataset of 8,297 protein chains (DsspNr-25), the average Q3 values of SymPred and SymPsiPred are 81.0% and 83.9%, respectively. On the two latest independent test sets (EVA_Set1 and EVA_Set2), the average Q3 of SymPred is 78.8% and 79.2%, respectively. SymPred outperforms other existing methods by 1.4% to 5.4%. We study two factors that may affect the performance of SymPred and find that it is very sensitive to the number of proteins of both known and unknown structures. This finding implies that SymPred and SymPsiPred have the potential to achieve higher accuracy as the number of protein sequences in the NCBInr and PDB databases increases.

Protein subcellular localization: We downloaded the dataset from ngLOC, which covers ten distinct subcellular organelles and 1,923 species, and performed ten-fold cross-validation experiments to evaluate the performance of KnowPredsite. The experimental results show that KnowPredsite achieves higher prediction accuracy than ngLOC and the Blast-hit method. For single-localized proteins, the overall accuracy of KnowPredsite is 91.7%. For multi-localized proteins, the overall accuracy of KnowPredsite is 72.1%, which is 12.4 percentage points higher than that of ngLOC. Notably, half of the proteins in the dataset for which no Blast hit above a specified threshold can be found are still correctly predicted by KnowPredsite.

Remote homology detection: We propose a two-stage method called SymDetector for the problem of remote homology detection. We downloaded a benchmark dataset that contains 2,476 protein sequences with mutual sequence identity below 25%. When only one false positive is allowed, SymDetector achieves 5,308 true positive pairs, while ConSequenceS and PSI-BLAST report fewer than 1,000 true homologous pairs.
As the number of permitted false positives grows to 100 and 1,000, SymDetector identifies 6,906 and 7,666 sequence pairs, respectively. Under the same settings, ConSequenceS reports only about 2,000 and 3,500 pairs in the same fold, which improves on PSI-BLAST by 50% on average.
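The evaluation described above, counting true positive pairs while a fixed number of false positives is allowed, can be sketched as follows. This is an illustration under stated assumptions rather than the benchmark's actual protocol: the function name and the (score, is_homolog) input format are hypothetical, and the sketch assumes that counting stops as soon as the number of false positives would exceed the allowance.

```python
def true_positives_at(ranked_pairs, max_false_positives):
    """Count true positive pairs found before the false positive allowance is exceeded.

    ranked_pairs: iterable of (score, is_homolog) tuples, one per sequence pair
    (hypothetical format). Pairs are scanned from highest to lowest score.
    """
    true_positives = 0
    false_positives = 0
    ordered = sorted(ranked_pairs, key=lambda pair: pair[0], reverse=True)
    for _score, is_homolog in ordered:
        if is_homolog:
            true_positives += 1
        else:
            false_positives += 1
            if false_positives > max_false_positives:
                break  # one false positive too many: stop counting
    return true_positives


if __name__ == "__main__":
    # Toy scores, not benchmark data.
    pairs = [(0.9, True), (0.8, True), (0.7, False),
             (0.6, True), (0.5, False), (0.4, True)]
    for allowed in (0, 1, 2):
        print(allowed, true_positives_at(pairs, allowed))
```

A more sensitive detector ranks more true homologous pairs ahead of its first few false positives, which is why the count at a small allowance such as one false positive is an informative comparison.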
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT079451810 http://hdl.handle.net/11536/40918
Appears in Collections:	Thesis

Files in This Item:

181001.pdf
