標題: | 高度可平行化的長片段生物序列之FM-index變體 A highly parallelizable FM-index variant for long biological sequences |
作者: | 李昀龍 Li, Yun-Lung 洪瑞鴻 Hung, Jui-Hung 生物資訊及系統生物研究所 |
關鍵字: | 次世代定序;FM-index;Ferragina-Manzini index;sort transform;next generation sequencing |
公開日期: | 2015 |
摘要: | 基於Burrows-Wheeler轉換之Ferragina-Manzini index(FM-index)目前已經被廣泛地應用在針對基因體序列進行快速字串搜尋的應用中。在實際應用FM-index時,我們會進入建表階段,於其中針對目標字串建立index資料結構,據此我們便可在之後的字串搜尋階段中,利用index資料結構達到在目標字串中進行快速字串搜尋的目的。在現有技術中,FM-index的建表操作需執行大量的運算,而且整個建表實作亦十分複雜,且整體的執行效率不彰。而de novo assembly一類利用FM-index來進行overlap graph資料結構建立的應用來說,其需動態地(On-the-fly)進行FM-index建表操作,而 FM-index低落的建表效率亦使得此類應用的整體運算效率不彰。在本論文中,我們提出了k-ordered FM-index,其可透過簡化的演算法來完成其資料結構的建表,同時仍可針對高度重複(Highly Repetitive)的目標字串支援快速的字串搜尋操作。此外,簡化的k-ordered FM-index建表演算法中的桶排序(Bucket sort)階段更具有可輕易利用硬體平行來加速的特性,這使得k-ordered FM-index整體的建表階段可以在未來,進一步地利用硬體平行的方,式來進行加速。 The Ferragina-Manzini index (FM-index) derived from the Burrows-Wheeler transform (BWT) is broadly used for fast string searching in genomes. An indexing phase, where the index structure of a target text is constructed, is entered beforehand, so that efficient string searching among the indexed target text can be facilitated in the following string searching phase. Conventionally, FM-index construction takes a great deal of computation, and the implementation of FM-index construction is also extremely complex and handled in an ineffective manner. On-the-fly indexing demanding applications, such as constructing overlap graphs for de novo assembly, implemented with current FM-index schemes also suffers greatly from low performance. In this work, we provide a k-ordered FM-index that can be built with significantly simplified algorithms, and still suffices fast string searching and matching in highly repetitive references. Besides, the proposed k-ordered FM-index is constructed with a much-simplified algorithm with a bucket sort stage that is highly parallelizable, so that performance of the indexing phase of the proposed k-ordered FM-index can be further enhanced with future parallelization implementation. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT070157219 http://hdl.handle.net/11536/125690 |
Appears in Collections: | Thesis |