標題: 雜訊資料之連續一性質與區間圖辨識演算法─實體圖譜與序列組合之應用
Testing for the Consecutive Ones Property and Interval Graphs on Noisy Data─ Application to Physical Mapping and Sequence Assembly
作者: 呂威甫
Wei-Fu Lu
張瑞川
許聞廉
Dr. Ruei-Chuan Chang
Dr. Wen-Lian Hsu
資訊科學與工程研究所
關鍵字: 連續一性質;區間圖;物理圖譜建構;DNA序列組合;探針雜合;分群演算法;consecutive ones property;interval graphs;physical mapping;DNA sequence assembly;probe hybridization;clustering algorithm
公開日期: 2002
摘要: 「連續一性質」與「區間圖」是實體圖譜建構與克隆組合的兩種基本數學模式。一個(0,1)-矩陣,如果存在一種行排列,使得排列後的矩陣之中每個列上面的1都是連續的,則其滿足列的「連續一性質」。區間圖是由區間所形成的相交圖。Booth與Lueker在1976年使用PQ-trees完成了連續一性質以及區間圖的線性時間測試。他們的線性時間演算法有著嚴重的缺點:輸入的資料必須是正確無誤的。然而,實驗室的工作無可避免的會包含著錯誤。因為即使是一個單一的錯誤都可能會造成圖譜建構的失敗,因此傳統的辨識演算法很難被直接應用在具有雜訊的資料上,同時,傳統演算法也沒有直接的推廣而能夠克服這樣的缺點。為了解決這些問題,演算法的設計需要新的想法。在這篇論文裡面,我們利用分群技巧維持連續一矩陣與區間圖的穩定性局部結構以面對資料之中含有雜訊的問題。我們沒有設定任何最佳化的「整體性」目的函數,而是試著去維持一個局部的「單調性」結構,也就是盡可能的去減少與局部單調性性質之間的差距。在合理的假設之下,我們的演算法可以處理下面四種錯誤:偽負、偽正、非唯一探針以及非真克隆。如果局部資料的雜訊過大,我們的演算法將會發現它們並建議生物學家進行進一步的生物實驗,以減少這個部分所造成的錯誤。我們方法的特色是,我們會去除某些雜訊過大的資料,而在最後排出的結果裡面並不會包含所有的探針和克隆。 因此,演算法將會產生不只一個contig,其中間隙的產生大部分都是由於雜訊資訊所致。
The consecutive ones property and interval graph are two fundamental mathematical models for physical mapping and clone assembly. A (0,1)-matrix satisfies the consecutive ones property (COP) for the rows if there exists a column permutation such that the ones in each row of the resultant matrix are consecutive. An interval graph is the intersection graph of a collection of intervals. Booth and Lueker (1976) used PQ-trees to test the consecutive ones property and recognize interval graphs in linear time. The linear time algorithm by Booth and Lueker (1976) has a serious drawback: the data must be error-free. However, laboratory work is never flawless. Because a single error might cause map construction to fail, traditional recognition algorithms can hardly be applied on noisy data. Moreover, no straightforward extension of traditional algorithm can overcome the drawbacks. To solve these problems, a different philosophy toward algorithm design is necessary. In this thesis, we opt to maintain a stable local structure of consecutive ones matrices and interval graphs through clustering techniques to deal with errors. We do not set any “global” objective to optimize. Rather, our algorithms try to maintain the local monotone structure, namely, to minimize the deviation from the local monotone property as much as possible. Under moderate assumptions, the algorithm can accommodate the following four types of errors: false negatives, false positives, non-unique probes and chimeric clones. In case some local data is too noisy, our algorithm could likely discover that and suggest additional lab work to reduce the degree of ambiguity in that part. A unique feature of our algorithm is that, rather than forcing all probes or clones to be included and ordered in the final arrangement, our algorithm would delete some noisy information. Thus, it could produce more than one contig. The gaps are created mostly by noisy data.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT910394030
http://hdl.handle.net/11536/70203
顯示於類別:畢業論文