Title: | Scoring Card of dipeptides for predicting solubility of recombinant proteins in E. coli expression system |
Authors: | 高德芬 何信瑩 Institute of Bioinformatics and Systems Biology |
Keywords: | Scoring Card;E. coli expression system;Protein solubility;Inclusion body |
Issue Date: | 2010 |
Abstract: | Protein expression systems are a practical biotechnology that is widely used in protein research. Escherichia coli (E. coli) is the most commonly used host because expression in E. coli is simple, fast, and inexpensive. However, this system sometimes produces a serious problem that is hard to solve: during expression, some proteins form inclusion bodies, insoluble aggregates of misfolded protein that lack proper biological function and cannot be used in subsequent experiments. Researchers therefore try to obtain as much soluble protein as possible, since soluble protein indicates a correct structure with biological function. They adjust various experimental conditions to turn inclusion bodies into soluble protein, but these adjustments are still made by repeated trial and error, which costs a great deal of material, money, and time.
In the existing literature, many studies use machine learning methods such as the support vector machine (SVM) to predict, from primary structure, the solubility of proteins expressed in E. coli. These methods apply a wide variety of sequence-derived features, including physicochemical properties and amino acid, dipeptide, and tripeptide composition. However, the classifiers used are almost always black boxes from a biologist's point of view, so the models and their results are not easily interpretable. This study investigated several feature types, selected the dipeptide features that appeared effective for this classification, and proposes a scoring card method built from dipeptide values to predict the solubility of expressed proteins.
The proposed scoring card is a simple, intuitive prediction method that uses dipeptide statistics to construct a scoring matrix. Every protein to be tested receives a score from the scoring card, and a cut-off value that separates the two classes of protein by score is chosen from the validation data. To further improve the scoring card, an intelligent genetic algorithm (IGA) is then used to tune the statistically derived dipeptide scoring matrix, with the area under the ROC curve serving as the fitness function of the genetic algorithm. On the same dataset, the IGA scoring card yields an accuracy of 81.7%, higher than the 76.9% obtained with an SVM. A comparison of the SVM, the scoring card, and the IGA-tuned scoring card shows that the genetic algorithm improves the accuracy of this classification problem. |
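The scoring card idea described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the dipeptide scores are derived from the statistics, so the rule used here (mean dipeptide frequency in soluble proteins minus mean frequency in insoluble ones) is an assumption, as are all function names and the toy sequences. The `auc` function computes the kind of quantity the abstract names as the fitness function. The intelligent genetic algorithm itself is not sketched.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 dipeptides


def dipeptide_composition(seq):
    """Relative frequency of each of the 400 dipeptides in a sequence."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return {dp: counts.get(dp, 0) / total for dp in DIPEPTIDES}


def build_scoring_card(soluble, insoluble):
    """Assumed rule: score = mean frequency in soluble - mean frequency in insoluble."""
    def mean_composition(seqs):
        comps = [dipeptide_composition(s) for s in seqs]
        return {dp: sum(c[dp] for c in comps) / len(comps) for dp in DIPEPTIDES}

    sol = mean_composition(soluble)
    insol = mean_composition(insoluble)
    return {dp: sol[dp] - insol[dp] for dp in DIPEPTIDES}


def score_protein(seq, card):
    """Score a protein as its dipeptide composition weighted by the card."""
    comp = dipeptide_composition(seq)
    return sum(comp[dp] * card[dp] for dp in DIPEPTIDES)


def best_cutoff(scores, labels):
    """Choose the cut-off that maximises accuracy (soluble if score >= cut-off)."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        acc = sum((s >= t) == bool(y) for s, y in zip(scores, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy sequences, not data from the thesis (1 = soluble, 0 = insoluble).
soluble_train = ["MKKEEKDEKK", "MEEKKDEEKE"]
insoluble_train = ["MLLVVIILLF", "MVVLLFFIIL"]
card = build_scoring_card(soluble_train, insoluble_train)

val_seqs = ["MKEKEDKK", "MLVIFLLV", "MEEDKKEK", "MFILVVLI"]
val_labels = [1, 0, 1, 0]
val_scores = [score_protein(s, card) for s in val_seqs]
cutoff, accuracy = best_cutoff(val_scores, val_labels)
print(cutoff, accuracy, auc(val_scores, val_labels))
```

Because every prediction reduces to a sum of per-dipeptide scores, the card can be inspected directly to see which dipeptides push a protein toward the soluble or the insoluble class, which is the interpretability argument the abstract makes against black-box classifiers.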
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT079851504 http://hdl.handle.net/11536/48200 |
Appears in Collections: | Thesis |