Title: 新一代全基因定序分析流程:以人類基因為例
The Workflow for Next-Generation Sequencing Data Analysis of Human Genome
Authors: 廖子慧
Liao, Tzu-Huei
黃冠華
Huang, Guan-Hua
Institute of Statistics
Keywords: whole genome sequencing; whole exome sequencing; next-generation sequencing; BWA; Bowtie2; Samtools; Picard; GATK
Issue Date: 2012
Abstract: Next-generation sequencing (NGS) technology is fast and economical, and it offers high throughput, high resolution, and a low error rate. Whole genome sequencing (WGS) and whole exome sequencing (WES) data produced by NGS help us understand the evolutionary relationships among organisms and compare the differences between individuals, and WGS and WES analyses have become a popular way to study associations between disease and the genome. As the technology matures it is receiving more and more attention, but WGS and WES data are difficult to process because they are too large to store and analyze easily. Few domestic references describe NGS data analysis, and none gives a complete account of the analysis workflow and the software involved from beginning to end. This thesis therefore presents a complete reference workflow for analyzing WGS and WES data, together with an introduction to the software used. The workflow first aligns raw sequence reads to a reference genome with BWA or Bowtie2. It then processes and converts the intermediate files with Picard or Samtools so that PCR duplicates can be marked, filters on quality, and performs a second, local realignment around candidate insertions and deletions (indels). Finally, it uses GATK for variant discovery and genotyping, depth of coverage analysis, and somatic indel detection. Following this workflow yields many files that are useful for subsequent research. The workflow is illustrated with the WGS and WES data of sample NA12878, a European Caucasian female, provided on the 1000 Genomes website.
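The order of the stages described in the abstract can be sketched as a list of command lines. This is only an illustration under several assumptions, not the thesis's actual commands: the file names, jar paths, and the `build_plan` helper are hypothetical; the GATK calls use the older `-T <tool>` syntax of the releases that still shipped indel realignment; the reference is assumed to be indexed already; and the somatic indel step is omitted because the abstract gives no details about it. The sketch only builds and prints the commands, it does not run them.

```python
import shlex
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    name: str                     # short label for the stage
    argv: List[str]               # command and arguments
    stdout: Optional[str] = None  # file that stdout would be redirected to


def build_plan(ref: str, fq1: str, fq2: str, sample: str,
               picard_jar: str = "picard.jar",
               gatk_jar: str = "GenomeAnalysisTK.jar") -> List[Step]:
    """Return the stages in order. All file names here are hypothetical."""
    sam = f"{sample}.sam"
    sorted_bam = f"{sample}.sorted.bam"
    dedup_bam = f"{sample}.dedup.bam"
    intervals = f"{sample}.intervals"
    realigned_bam = f"{sample}.realigned.bam"
    vcf = f"{sample}.vcf"
    # Read group header; bwa expands the literal "\t" sequences itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    gatk = ["java", "-jar", gatk_jar]
    return [
        # 1. Align paired-end reads to the reference (SAM goes to stdout).
        Step("align", ["bwa", "mem", "-R", read_group, ref, fq1, fq2], sam),
        # 2. Convert to a coordinate-sorted BAM.
        Step("sort", ["samtools", "sort", "-o", sorted_bam, sam]),
        # 3. Mark PCR duplicates.
        Step("mark_duplicates",
             ["java", "-jar", picard_jar, "MarkDuplicates",
              f"INPUT={sorted_bam}", f"OUTPUT={dedup_bam}",
              f"METRICS_FILE={sample}.dup_metrics.txt"]),
        Step("index", ["samtools", "index", dedup_bam]),
        # 4. Local realignment around candidate indels (two GATK passes).
        Step("realign_targets",
             gatk + ["-T", "RealignerTargetCreator", "-R", ref,
                     "-I", dedup_bam, "-o", intervals]),
        Step("realign",
             gatk + ["-T", "IndelRealigner", "-R", ref, "-I", dedup_bam,
                     "-targetIntervals", intervals, "-o", realigned_bam]),
        # 5. Variant discovery and depth of coverage.
        Step("call_variants",
             gatk + ["-T", "UnifiedGenotyper", "-R", ref,
                     "-I", realigned_bam, "-o", vcf]),
        Step("depth_of_coverage",
             gatk + ["-T", "DepthOfCoverage", "-R", ref,
                     "-I", realigned_bam, "-o", f"{sample}.coverage"]),
    ]


# Print the plan for the sample named in the abstract (input paths are made up).
for step in build_plan("ref.fa", "NA12878_1.fastq", "NA12878_2.fastq", "NA12878"):
    line = shlex.join(step.argv)
    if step.stdout:
        line += f" > {shlex.quote(step.stdout)}"
    print(f"[{step.name}] {line}")
```

Keeping each stage as data rather than executing it directly makes it easy to swap the aligner (for example Bowtie2 in place of BWA) or to inspect the plan before committing hours of compute to a whole-genome run.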
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT079926501
http://hdl.handle.net/11536/71816
Appears in Collections: Thesis