Title: Innovating the Next Generation Sequencing Cloud Analysis Platform: Using Cloud-Based On-Demand Self-Service and Compute Unified Device Architecture (CUDA) GPU Computing Technology
Original title: 革新次世代定序雲端分析平台---使用雲端隨需自助服務與統一計算架構GPU運算
Author: Hung Jui-Hung (洪瑞鴻)
Department of Biological Science and Technology, National Chiao Tung University
Keywords: Next Generation Sequencing (次世代定序)
Issue date: 2015
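Abstract (translated from the Chinese): Next generation sequencing (NGS) technology has greatly reduced the cost of genome-scale sequencing, allowing researchers to study biological mechanisms at the molecular level systematically. However, the massive data that NGS produces also raises the cost of bioinformatics analysis. Storing, compressing, and analyzing datasets that routinely exceed 100 GB is beyond what many conventional bioinformatics methods can handle. Performing bioinformatics analysis on virtualized compute and storage units in the cloud has therefore been gaining attention in the academic community worldwide. In December 2012, Illumina, the world's largest sequencer manufacturer, formally introduced its cloud sequencing data warehouse service, BaseSpace, which gives each user 1 TB of free space for sequencing data and has stored more than 40,000 datasets to date. This project will combine cloud software and hardware infrastructure (Amazon EC2) with a cloud sequencing data warehouse (Illumina BaseSpace) to make up for the shortcomings of conventional platforms and give biologists a more efficient on-demand, self-service cloud platform for NGS analysis, substantially lowering the barrier to and cost of analyzing NGS data. The main objectives of this project are listed below: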
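(1) On-demand, dynamically configurable cloud service system: Both BaseSpace and Amazon EC2 provide application programming interfaces (APIs) through which external cloud services can communicate with them. Through these APIs, the NGS platform will provision virtual machines and dynamically control their startup, data input and output, flow control, error reporting, and progress monitoring, gaining the convenience and low cost of on-demand self-service.
(2) Parallel and GPU programming components and algorithm design: This project will make full use of parallel techniques such as multithreading, MapReduce, MPI, and GPU computing, combined with novel algorithms, to resolve performance bottlenecks.
(3) Dynamically configurable cloud pipelines: Building on our laboratory's existing cloud service assembly line technology (C-Salt), we will develop RNA sequencing cloud analysis pipelines whose software is assembled automatically according to need.
(4) Interactive visual analysis interface: Dynamic web technologies will be used to systematically transform the data produced by each analysis pipeline into interactive visual charts and a browsing system.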
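As an illustration of the map-and-reduce pattern that aim (2) names, the following is a minimal sketch rather than the project's algorithm. It counts k-mers in a handful of reads by splitting them into chunks, counting each chunk in a worker thread (map), and summing the partial counts (reduce). The function names, the toy reads, and the choice of k-mer counting as the workload are all assumptions made for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def count_kmers(reads, k):
    """Map step: count every k-mer in one chunk of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_kmer_count(reads, k, chunk_size=2, workers=4):
    """Run the map step on each chunk in a thread pool, then merge the results."""
    chunks = chunked(reads, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda chunk: count_kmers(chunk, k), chunks))
    # Reduce step: Counter addition sums the per-chunk counts.
    return reduce(lambda a, b: a + b, partials, Counter())


# Toy reads, made up for illustration only.
reads = ["ACGTAC", "CGTACG", "TTACGT"]
totals = parallel_kmer_count(reads, k=3)
```

The thread pool here only shows the structure of the computation. In CPython, CPU-bound work of this kind would need separate processes, MPI ranks, or GPU kernels to gain real speed, which is where the other techniques listed in aim (2) would come in.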
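The algorithms, analysis pipelines, and cloud analysis system developed in this project will substantially reduce the cost of analyzing NGS data. We expect the platform to promote the use of NGS across a wide range of biological research questions and, through collaboration with experimental biologists at home and abroad, to support cutting-edge academic research using the newly developed algorithms and pipelines, making a significant contribution to research in molecular biology and bioinformatics. On the industrial side, we will seek, through industry-academia collaboration, to patent and transfer the technology behind this cloud platform's construction and service model.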
English abstract: Next generation sequencing (NGS) technology greatly reduces the cost of genome-scale sequencing, which allows researchers to systematically understand biological mechanisms at the molecular level. On the other hand, the tremendous amount of data generated by NGS also greatly increases the cost of bioinformatics analyses. It is not rare for an NGS project to generate over 100 GB of data, and conventional bioinformatics methods, which were originally designed to run on desktop workstations, are overloaded by storing, compressing, and analyzing data at such a scale. It is widely recognized in the NGS community that we have to embrace cloud computing and storage to efficiently utilize virtualized computation and storage units in the cloud. Illumina, the world's biggest sequencer manufacturer, has introduced a cloud storage service, BaseSpace, for storing NGS data generated by its sequencers. Each user has 1 TB of free space, and more than 40,000 NGS datasets are currently stored in BaseSpace. This work will combine a cloud infrastructure service (Amazon EC2) and cloud NGS data storage (Illumina BaseSpace) to provide biologists with a more efficient on-demand, self-service NGS cloud computing platform. The major aims are listed below:
(1) Constructing on-demand, configurable cloud services: The proposed platform will use the APIs provided by BaseSpace and Amazon EC2 to manage initiation, I/O, flow control, error reporting, progress monitoring, and termination, minimizing the time and cost spent.
(2) Designing parallel and GPU computing algorithms and components: We will use multithreading, MapReduce, MPI, and GPU computing technologies to resolve some performance bottlenecks in common NGS pipelines.
(3) Dynamic assembly of analysis pipelines: Based on our cloud computing service assembly line technology, we will develop self-assembling analysis services for a variety of NGS applications.
(4) Devising an interactive visualization and analysis interface: We will use data-driven document object model (DOM) and dynamic hypertext markup language (DHTML) technologies to systematically transform the data generated by each analytical pipeline into an interactive plotting and browsing system.
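The idea behind aim (3), assembling a pipeline from the steps a request actually needs, can be sketched as dependency resolution over a catalogue of steps. The sketch below is an assumption-laden illustration and not the project's assembly line technology: the step names, the data labels, and the `assemble` helper are all hypothetical. It relies only on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical step catalogue: each step lists the data it needs and produces.
STEPS = {
    "trim": {"needs": {"raw_reads"}, "makes": {"clean_reads"}},
    "align": {"needs": {"clean_reads"}, "makes": {"alignments"}},
    "quantify": {"needs": {"alignments"}, "makes": {"expression"}},
    "qc": {"needs": {"raw_reads"}, "makes": {"qc_report"}},
}


def assemble(target, available, steps=STEPS):
    """Return the ordered steps needed to produce `target` from `available` data."""
    # Map each data label to the step that produces it.
    producers = {out: name for name, spec in steps.items() for out in spec["makes"]}
    graph = {}  # step name -> set of steps that must run before it

    def require(data):
        """Register the step that makes `data`, unless the data is already at hand."""
        if data in available:
            return None
        if data not in producers:
            raise ValueError(f"no step produces {data!r}")
        name = producers[data]
        if name not in graph:
            graph[name] = set()
            for need in steps[name]["needs"]:
                dependency = require(need)
                if dependency is not None:
                    graph[name].add(dependency)
        return name

    require(target)
    # Predecessors come first in the static order.
    return list(TopologicalSorter(graph).static_order())


plan = assemble("expression", available={"raw_reads"})
print(plan)
```

Only the steps on the path to the requested output are selected, so the hypothetical `qc` step is left out of the plan for `expression`, and a request that starts from already trimmed reads would skip `trim` as well.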
The novel algorithms, analytical pipelines, and cloud analysis systems developed in this project will greatly reduce the cost of analyzing NGS data. We will use these algorithms and pipelines in collaboration with international experimental biologists on the most advanced research topics. We anticipate that the completed cloud platform will contribute considerably to the fields of molecular biology and bioinformatics.
Official document no.: MOST103-2221-E009-128-MY2
URI: http://hdl.handle.net/11536/130474
https://www.grb.gov.tw/search/planDetail?id=11281640&docId=458010
Appears in collections: Research Projects