标题: 杂讯资料之连续一性质与区间图辨识演算法─实体图谱与序列组合之应用
Testing for the Consecutive Ones Property and Interval Graphs on Noisy Data─ Application to Physical Mapping and Sequence Assembly
作者: 吕威甫
Wei-Fu Lu
张瑞川
许闻廉
Dr. Ruei-Chuan Chang
Dr. Wen-Lian Hsu
资讯科学与工程研究所
关键字: 连续一性质;区间图;物理图谱建构;DNA序列组合;探针杂合;分群演算法;consecutive ones property;interval graphs;physical mapping;DNA sequence assembly;probe hybridization;clustering algorithm
公开日期: 2002
摘要: “连续一性质”与“区间图”是实体图谱建构与克隆组合的两种基本数学模式。一个(0,1)-矩阵,如果存在一种行排列,使得排列后的矩阵之中每个列上面的1都是连续的,则其满足列的“连续一性质”。区间图是由区间所形成的相交图。Booth与Lueker在1976年使用PQ-trees完成了连续一性质以及区间图的线性时间测试。他们的线性时间演算法有着严重的缺点:输入的资料必须是正确无误的。然而,实验室的工作无可避免的会包含着错误。因为即使是一个单一的错误都可能会造成图谱建构的失败,因此传统的辨识演算法很难被直接应用在具有杂讯的资料上,同时,传统演算法也没有直接的推广而能够克服这样的缺点。为了解决这些问题,演算法的设计需要新的想法。在这篇论文里面,我们利用分群技巧维持连续一矩阵与区间图的稳定性局部结构以面对资料之中含有杂讯的问题。我们没有设定任何最佳化的“整体性”目的函数,而是试着去维持一个局部的“单调性”结构,也就是尽可能的去减少与局部单调性性质之间的差距。在合理的假设之下,我们的演算法可以处理下面四种错误:伪负、伪正、非唯一探针以及非真克隆。如果局部资料的杂讯过大,我们的演算法将会发现它们并建议生物学家进行进一步的生物实验,以减少这个部分所造成的错误。我们方法的特色是,我们会去除某些杂讯过大的资料,而在最后排出的结果里面并不会包含所有的探针和克隆。 因此,演算法将会产生不只一个contig,其中间隙的产生大部分都是由于杂讯资讯所致。
The consecutive ones property and interval graph are two fundamental mathematical models for physical mapping and clone assembly. A (0,1)-matrix satisfies the consecutive ones property (COP) for the rows if there exists a column permutation such that the ones in each row of the resultant matrix are consecutive. An interval graph is the intersection graph of a collection of intervals. Booth and Lueker (1976) used PQ-trees to test the consecutive ones property and recognize interval graphs in linear time. The linear time algorithm by Booth and Lueker (1976) has a serious drawback: the data must be error-free. However, laboratory work is never flawless. Because a single error might cause map construction to fail, traditional recognition algorithms can hardly be applied on noisy data. Moreover, no straightforward extension of traditional algorithm can overcome the drawbacks. To solve these problems, a different philosophy toward algorithm design is necessary. In this thesis, we opt to maintain a stable local structure of consecutive ones matrices and interval graphs through clustering techniques to deal with errors. We do not set any “global” objective to optimize. Rather, our algorithms try to maintain the local monotone structure, namely, to minimize the deviation from the local monotone property as much as possible. Under moderate assumptions, the algorithm can accommodate the following four types of errors: false negatives, false positives, non-unique probes and chimeric clones. In case some local data is too noisy, our algorithm could likely discover that and suggest additional lab work to reduce the degree of ambiguity in that part. A unique feature of our algorithm is that, rather than forcing all probes or clones to be included and ordered in the final arrangement, our algorithm would delete some noisy information. Thus, it could produce more than one contig. The gaps are created mostly by noisy data.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT910394030
http://hdl.handle.net/11536/70203
显示于类别:Thesis