Title: Comparison of Several RNA-Seq Expression Analytic Tools (多種核糖核酸序列分析工具間之比較)
Authors: Chang, Yao-Sheng (張耀升)
Huang, Guan-Hua (黃冠華)
Liu, Yu-Li (劉玉麗)
Institute of Statistics
Keywords: RNA-Seq; alternative splicing; raw-count method; isoform deconvolution method
Publication Date: 2015
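Abstract: With the rapid development of next-generation sequencing technology, many RNA-Seq analytic tools for detecting gene expression levels have been developed. However, the results obtained from these tools are often inconsistent, because different tools are built on different statistical ideas and models. Methodologically, RNA-Seq analytic tools fall into two main types. The first considers only the total count of all reads on a gene and uses that total to judge the gene's expression under different conditions; this is called the raw-count method, and examples include edgeR and DESeq. Tools that refine this approach have also been developed, such as DEXSeq, which differs in that it considers the read counts and usage of each exon within a gene. The second type goes further and considers the abundance of reads on each transcript (isoform) of a gene to judge the gene's expression; this is called the isoform deconvolution method, and an example is Cuffdiff 2. Previous studies have indicated that considering expression under the different transcript structures produced by alternative splicing can effectively reveal differences in gene expression between certain cancer cell lines and normal cell lines, differences that cannot be distinguished from overall gene-level expression alone. This may be an important turning point for certain diseases. In this study, we applied several RNA-Seq analytic tools to real data and compared the total number of genes with significantly different expression under different conditions. We also simulated the four most commonly occurring transcript structures, let the proportion of reads allocated to them change monotonically (both increasing and decreasing), and compared each pairwise against an assumed baseline transcript structure. Through this simulation, we assessed how the tools recommended here perform overall under these conditions. In the real data analysis, we found that the total number of differentially expressed genes detected by edgeR and DESeq was very close to that detected by DEXSeq. In the simulation, however, the traditional raw-count methods, edgeR and DESeq, could not detect monotonic increases or decreases in the proportion of reads allocated across a gene's transcripts, whereas the refined method DEXSeq and the isoform deconvolution method Cuffdiff 2 detected such changes effectively. We therefore suggest that, for convenience, users analysing differential gene expression with RNA-Seq first apply edgeR and DESeq to obtain a rough overall result, and then use DEXSeq and Cuffdiff 2 for transcript-level analysis. Keywords: RNA-Seq analysis of gene expression, alternative splicing, raw-count method, isoform deconvolution method.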
ABSTRACT With the advance of next-generation sequencing (NGS) techniques, several RNA-Seq analytic tools have been developed to detect gene expression levels. However, the results from these tools are sometimes inconsistent, because different tools are built on different statistical ideas and models. RNA-Seq analytic tools adopt two types of approaches: the raw-count method and the isoform deconvolution method. The traditional raw-count method considers only the abundance of reads for each gene. The isoform deconvolution method further considers the abundance of reads in each transcript of each gene, which can be useful because the expression of cancer cell lines can sometimes be discriminated from that of non-cancer lines at the isoform level but not at the gene level. This may be an important mechanism for some diseases. Our study ran various RNA-Seq analytic tools on real data and compared the total number of differentially expressed genes that each detected. Furthermore, in a simulation study, we imitated four possible alternative splicing structures with monotonic changes in transcript proportions and examined the performance of these tools via pairwise comparison. We found that, in the real data study, the total numbers of differentially expressed genes detected by the unmodified raw-count methods edgeR and DESeq and by the modified raw-count method DEXSeq were very similar. However, once the read proportions change across isoforms (transcripts), the modified raw-count method DEXSeq and the isoform deconvolution method Cuffdiff 2 can detect the changes in isoform expression, whereas the raw-count methods edgeR and DESeq cannot. For convenience, users can first apply the simpler raw-count-based tools, such as edgeR and DESeq, to obtain a rough overall count of differentially expressed genes, and then analyse gene expression at the isoform level with more effective tools such as DEXSeq or Cuffdiff 2. Key words: RNA-Seq, alternative splicing, raw-count method, isoform deconvolution method
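The contrast between gene-level and exon-level counting can be sketched with a toy calculation. The gene layout below (four equal-length exons, with exons 2 and 3 mutually exclusive), the read depth, and the isoform proportions are all made-up assumptions for illustration; they are not the simulation design of the thesis, and no statistical test from edgeR, DESeq, DEXSeq, or Cuffdiff 2 is run here.

```python
# Toy illustration only: gene layout, read depth, and proportions are hypothetical.
# Exons 2 and 3 are mutually exclusive; all exons are assumed equal in length.
ISOFORM_EXONS = {
    "iso_A": [1, 1, 0, 1],  # uses exons 1, 2, 4
    "iso_B": [1, 0, 1, 1],  # uses exons 1, 3, 4
}
READS_PER_EXON_COPY = 100  # assumed reads contributed per isoform unit per exon


def exon_counts(proportions, total_units=10):
    """Expected read count per exon for a given isoform mix."""
    counts = [0, 0, 0, 0]
    for name, share in proportions.items():
        reads = round(share * total_units * READS_PER_EXON_COPY)
        for i, used in enumerate(ISOFORM_EXONS[name]):
            counts[i] += reads * used
    return counts


# Same overall expression, but the isoform proportions are swapped.
condition_1 = exon_counts({"iso_A": 0.8, "iso_B": 0.2})
condition_2 = exon_counts({"iso_A": 0.2, "iso_B": 0.8})

# Gene-level view: one total per gene, as a raw-count method would see it.
gene_total_1 = sum(condition_1)
gene_total_2 = sum(condition_2)

print("exon counts, condition 1:", condition_1)
print("exon counts, condition 2:", condition_2)
print("gene totals:", gene_total_1, gene_total_2)
```

In this toy case the gene totals are identical across the two conditions while the counts on exons 2 and 3 swap, which is the kind of change a per-gene total cannot reveal but per-exon or per-isoform quantities can.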
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070252606
http://hdl.handle.net/11536/126303
Appears in Collections: Thesis