Title: Innovating the Next Generation Sequencing Cloud Analysis Platform --- Using the Cloud-Based On-Demand Self-Service and the Compute Unified Device Architecture GPU Computing Technology
Author: Hung Jui-Hung (洪瑞鴻)
Department of Biological Science and Technology, National Chiao Tung University
Keywords: Next Generation Sequencing
Issue Date: 2015
Abstract: Next generation sequencing (NGS) technology greatly reduces the cost of genome-scale sequencing, allowing researchers to study biological mechanisms systematically at the molecular level. The large volume of data that NGS generates, however, also raises the cost of bioinformatics analysis. It is not rare for an NGS project to generate more than 100 GB of data, and storing, compressing, and analyzing data at that scale overloads many conventional bioinformatics methods, which were originally designed to run on desktop workstations. The NGS community increasingly recognizes the need to use virtualized computation and storage units in the cloud for these analyses. Illumina, the world's largest sequencer manufacturer, launched its cloud sequencing data storage service, BaseSpace, in December 2012. Each user receives 1 TB of free space for sequencing data, and more than 40,000 NGS datasets have been stored in BaseSpace to date. This work will combine cloud infrastructure (Amazon EC2) with cloud NGS data storage (Illumina BaseSpace) to make up for the shortcomings of conventional platforms and to give biologists a more efficient on-demand, self-service NGS cloud analysis platform that lowers the barrier and cost of analyzing NGS data. The major aims are listed below:
(1) Constructing on-demand, dynamically configurable cloud services: Both BaseSpace and Amazon EC2 provide application programming interfaces (APIs) through which external cloud services can communicate with them. The proposed platform will use these APIs to provision virtual machines and to control their initiation, data input and output, flow control, error reporting, progress monitoring, and termination, minimizing the time and cost spent.
(2) Designing parallel and GPU computing algorithms and components: We will use multithreading, MapReduce, MPI, and GPU computing technologies, combined with novel algorithms, to accelerate performance bottlenecks in common NGS pipelines.
(3) Developing dynamically configurable cloud pipelines: Based on our laboratory's cloud service assembly line technology (C-Salt), we will develop analysis pipelines, such as an RNA sequencing pipeline, that assemble their software automatically on demand for a variety of NGS applications.
(4) Devising an interactive visualization analysis interface: We will use data-driven document object model and dynamic hypertext markup language (DHTML) technologies to systematically transform the data generated by each analysis pipeline into an interactive plotting and browsing system.
The novel algorithms, analysis pipelines, and cloud analysis systems developed in this project will greatly reduce the cost of analyzing NGS data. We expect the platform to promote NGS-based research on a wide range of biological questions, and we will apply these algorithms and pipelines in collaboration with experimental biologists in Taiwan and abroad on advanced research topics. We anticipate that the completed cloud platform will contribute considerably to molecular biology and bioinformatics. On the industrial side, we will seek, through industry-academia collaboration, to patent the platform's construction and service model and to transfer the technology.
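The map/reduce pattern named in aim (2) can be illustrated with a small sketch. The abstract does not say which pipeline bottlenecks are targeted, so k-mer counting is used here purely as an assumed stand-in workload, and the names `count_kmers`, `merge_counts`, and `parallel_kmer_count` are hypothetical rather than part of the project's code. A thread pool keeps the example self-contained; in CPython it will not speed up CPU-bound counting, and a process pool, MPI ranks, or GPU kernels would take the place of the map step in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_kmers(reads, k):
    """Map step: count every k-mer in one chunk of reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k yield an empty range and contribute nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def merge_counts(a, b):
    """Reduce step: add one partial k-mer table into another."""
    a.update(b)  # Counter.update sums counts instead of overwriting them
    return a


def parallel_kmer_count(reads, k, n_chunks=4):
    """Split reads into chunks, count each chunk concurrently, then merge."""
    chunks = [reads[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        partials = list(pool.map(lambda chunk: count_kmers(chunk, k), chunks))
    return reduce(merge_counts, partials, Counter())


# Toy reads, invented for illustration only.
reads = ["ACGTAC", "CGTACG", "TTACGT"]
table = parallel_kmer_count(reads, k=3)
```

Because the reduce step only sums independent partial tables, the result does not depend on how the reads are split, which is what lets the map step be distributed across workers.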
Government Document #: MOST103-2221-E009-128-MY2
URI: http://hdl.handle.net/11536/130474
https://www.grb.gov.tw/search/planDetail?id=11281640&docId=458010
Appears in Collections: Research Plans