Biomarker Identification by Constructing Gene Networks from Microarray Data

Title:	Biomarker Identification by Constructing Gene Networks from Microarray Data
Authors:	黃慧玲 Huang Hui-Ling, Department of Biological Science and Technology, National Chiao Tung University
Keywords:	Bioinformatics; Computational Biology; Biomarker; Protein; Gene Network; Gene Expression; Microarray; Optimization Method
Issue Date:	2008
Abstract:	Biomarkers are indicators of disease, defined as alterations of patterns at the cellular, molecular, or genetic level. This two-year project investigates the identification of protein biomarkers of cancer to enable early detection. A protein's function is determined not only by its own structure but also by its surroundings and the proteins it interacts with, so protein interaction networks play an important role in finding the biological signaling pathways of diseases. With the rapid advancement of microarray techniques, constructing gene networks from gene expression data has become an important technique for revealing the functions of genes and proteins and for understanding the complex relations and interactions among genes. In the first year, the project aims to propose a novel method for constructing gene networks from time-series microarray gene expression data. The major difficulty in inferring gene networks is the "curse of dimensionality": the number of genes in the expression data is much larger than the number of sampled time points, so a unique correct network cannot be determined effectively. A recently developed method based on evolutionary algorithms, iTEA, can infer S-system models that describe dynamic gene networks from multiple sets of simulated gene expression data. Because only one or a few sets of real microarray expression data are usually available, the proposed method (named iTEAP) applies iTEA to multiple noise-perturbed duplicates of the data to cope with the curse of dimensionality. A preliminary study shows that iTEAP performs well: on SOS DNA microarray data from E. coli, iTEAP with one set of expression data achieves the network quality of iTEA with two sets. The goal is to construct gene networks that are as accurate as possible from a limited amount of real microarray gene expression data, for further experimental verification by biologists.
In the second year, the project aims to develop a reliable method for identifying biomarkers from cancer and non-cancer gene microarray datasets by constructing gene networks. Because the large number of genes and the small number of samples create a search problem with a high degree of freedom, multiple sets of cancer-related genes may fit the known training data with the same high accuracy, a situation called model uncertainty. The investigated method combines two state-of-the-art approaches to address this uncertainty: classification-performance-based and gene-network-difference-based biomarker identification. In addition, to preselect a small set of the most promising candidate genes, the project will use ESVM, a method recently published by the investigator. ESVM is a one-stage gene selection method that combines SVM classification with multivariate feature selection, and it outperforms the commonly used two-stage univariate approach. By cross-checking the two sets of biomarkers obtained from ESVM and from a dependence-network construction method, the project expects to obtain a set of high-confidence biomarkers. The method will be further extended with the results of the first-year project to use time-series gene expression profiles of cancer and non-cancer samples for biomarker identification, providing new insight into early and accurate cancer detection.
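The abstract names two ingredients of the first-year work: an S-system model of gene dynamics and the use of noisy duplicates of a single expression data set. The Python sketch below illustrates both under stated assumptions and is not the project's implementation. The S-system right-hand side follows the standard form dX_i/dt = alpha_i * prod_j X_j^g_ij - beta_i * prod_j X_j^h_ij. The function names `s_system_rhs` and `noisy_duplicates`, the multiplicative Gaussian noise model, and the `noise_level` parameter are hypothetical; the abstract does not specify how iTEAP perturbs the data.

```python
import random


def s_system_rhs(x, alpha, beta, g, h):
    """Evaluate the standard S-system right-hand side for every gene.

    dX_i/dt = alpha[i] * prod_j x[j]**g[i][j] - beta[i] * prod_j x[j]**h[i][j]
    """
    n = len(x)
    dx = []
    for i in range(n):
        production = 1.0
        degradation = 1.0
        for j in range(n):
            production *= x[j] ** g[i][j]
            degradation *= x[j] ** h[i][j]
        dx.append(alpha[i] * production - beta[i] * degradation)
    return dx


def noisy_duplicates(profile, n_copies, noise_level, seed=0):
    """Return n_copies perturbed copies of a genes-by-timepoints profile.

    Assumption (not from the abstract): each value v is replaced by
    v * (1 + e) with e drawn from Normal(0, noise_level).
    """
    rng = random.Random(seed)
    return [
        [[v * (1.0 + rng.gauss(0.0, noise_level)) for v in gene]
         for gene in profile]
        for _ in range(n_copies)
    ]


if __name__ == "__main__":
    # Toy profile: 2 genes, 3 time points (made-up values for illustration).
    profile = [[1.0, 2.0, 3.0], [0.5, 0.4, 0.3]]
    copies = noisy_duplicates(profile, n_copies=3, noise_level=0.05)
    print(len(copies), "perturbed copies generated")
```

In a workflow of this kind, the original profile and its perturbed copies would be passed together to the network-inference step as if they were several independent data sets; how iTEAP actually weights or combines them is not stated in the abstract.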
Official Document #:	NSC97-2221-E009-187
URI:	http://hdl.handle.net/11536/102373 https://www.grb.gov.tw/search/planDetail?id=1691965&docId=292007
Appears in Collections:	Research Plans