Title: CoMI: Consensus Mutual Information for discovering biomarkers and analyzing genome-wide expression profiling
Authors: Chen, Yi-Chun (陳怡君)
Yang, Jinn-Moon (楊進木)
Institute of Bioinformatics and Systems Biology
Keywords: Mutual Information;Entropy;Genome-wide expression profiling;Biomarker
Issue date: 2015
Abstract: The World Health Organization (WHO) has noted that early detection of cancer greatly increases the chance of successful treatment, and genetic testing plays an integral role in clinical diagnosis. In breast cancer, for example, mammography, ultrasound, and magnetic resonance imaging (MRI) are used for initial screening. When breast cancer is suspected, a biopsy determines whether the tumor is benign or malignant, and genetic testing then serves as a molecular prognostic tool that identifies the breast cancer subtype so that treatment can be tailored to it. In general, a biomarker should have three properties: (1) readability: it is readily quantifiable in accessible biological samples; (2) consistency: its expression is consistent among samples in the same state; and (3) significant difference: its expression is significantly increased in the disease condition.
Genome-wide expression profiling (e.g., microarray and NGS) reveals patterns of coordinately regulated genes and has recently been widely used for early detection of diseases. The t-test is currently the most commonly used method for identifying biomarkers from genome-wide expression profiles, but it often reports many significant genes that do not satisfy the biomarker properties.
Here, we propose a new method, Consensus Mutual Information (CoMI), for analyzing genome-wide expression profiles and discovering biomarkers that fit these properties. First, to satisfy readability, we filter out genes whose expression is low relative to the whole expression profile and keep genes that are highly expressed in disease. Then, based on the second and third properties, we developed a new scoring function, SCoMI, to identify genes that are consistently expressed within the normal and disease groups and significantly differentially expressed between them. SCoMI consists of mutual information (SMI), entropy (Scon), and the difference of group rank (Sdist). Mutual information identifies genes that are differentially expressed between normal and disease samples, entropy measures how consistently a gene is expressed within the normal or disease group, and Sdist evaluates whether the difference in expression between the normal and disease states is significant.
To evaluate the method and the scoring function, we applied them to two microarray data sets, one for breast cancer and one for Alzheimer's disease, analyzing the genome-wide expression profiles to discover biomarkers. For the selected genes, we used Gene Ontology enrichment (biological process and cellular component terms) and gene clustering to check their biological meaning and their biomarker properties. Experimental results indicate that the selected genes are highly correlated with breast cancer and Alzheimer's disease. In addition, we integrated the t-test with our method and applied the combination to both data sets. In the breast cancer data set, we selected 115 genes that not only discriminate normal from tumor samples but also identify basal-like patients. Interestingly, applying these 115 genes to an independent TCGA breast cancer data set gave similar results. We believe that CoMI and its scoring function SCoMI provide a useful way to discover a set of biomarkers for early detection of diseases.
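The abstract names the three SCoMI components but gives no formulas, so the following Python sketch is only an illustration of how such a score could be assembled, not the thesis's actual definition. Everything specific in it is an assumption: the equal-width binning into three levels, the global-quantile filter for readability, the normalization of each term, the product used to combine them, and the function names `filter_low_expression`, `discretize`, and `comi_score`.

```python
import numpy as np

def _entropy_from_counts(counts):
    """Shannon entropy in bits from an array of category counts."""
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def entropy(values):
    """Shannon entropy in bits of a 1-D array of discrete values."""
    _, counts = np.unique(values, return_counts=True)
    return _entropy_from_counts(counts)

def mutual_information(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete integer arrays."""
    pairs = np.stack([x, y], axis=1)
    _, joint_counts = np.unique(pairs, axis=0, return_counts=True)
    return entropy(x) + entropy(y) - _entropy_from_counts(joint_counts)

def filter_low_expression(matrix, is_disease, quantile=0.5):
    """Readability step (assumed form): keep genes (rows) whose mean
    expression in disease samples exceeds a quantile of the whole matrix."""
    cutoff = np.quantile(matrix, quantile)
    return matrix[:, is_disease].mean(axis=1) > cutoff

def discretize(expr, n_bins=3):
    """Equal-width binning of one gene's expression into n_bins levels."""
    edges = np.linspace(expr.min(), expr.max(), n_bins + 1)[1:-1]
    return np.digitize(expr, edges)

def comi_score(expr, is_disease, n_bins=3):
    """Hypothetical combined score for one gene (higher is better)."""
    levels = discretize(expr, n_bins)
    labels = is_disease.astype(int)

    # S_MI stand-in: information the expression level carries about the class.
    s_mi = mutual_information(levels, labels)

    # S_con stand-in: 1 minus the mean normalized within-group entropy,
    # so a gene with one stable level per group scores 1.
    max_h = np.log2(n_bins)
    h_normal = entropy(levels[labels == 0]) / max_h
    h_disease = entropy(levels[labels == 1]) / max_h
    s_con = 1.0 - 0.5 * (h_normal + h_disease)

    # S_dist stand-in: difference of mean ranks (disease minus normal),
    # divided by n/2, the largest possible difference, giving [-1, 1].
    ranks = np.argsort(np.argsort(expr))
    s_dist = (ranks[labels == 1].mean() - ranks[labels == 0].mean()) / (len(expr) / 2)

    # Only up-regulation in disease counts, per the third biomarker property.
    return s_mi * s_con * max(s_dist, 0.0)

if __name__ == "__main__":
    is_disease = np.array([False] * 4 + [True] * 4)
    separated = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0])
    noisy = np.array([1.0, 5.0, 1.0, 5.0, 1.0, 5.0, 1.0, 5.0])
    print("separated gene:", comi_score(separated, is_disease))
    print("noisy gene:", comi_score(noisy, is_disease))
```

In this toy input, the gene that is stable within each group and higher in disease receives a larger score than the gene whose values are spread identically across both groups. A t-test would rank on mean difference and variance alone, whereas the three terms here separately reward class information, within-group consistency, and direction of change.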
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070257208
http://hdl.handle.net/11536/126971
Appears in collections: Thesis