Title: Applying Pattern Similarity with Word Proximities of Common Subsequences in Information Retrieval
Authors: Hsing-Yung Wang (王兴涌)
Dr. Min-Wen Du (杜敏文)
Institute of Computer Science and Engineering
Keywords: information retrieval; word proximity; common subsequence; approximate text matching
Issue Date: 1999
Abstract: Information Retrieval (IR) is concerned with finding the documents relevant to a given query. Indexing is an important step in the IR process, and it is usually at indexing time that the importance of a term to a document is decided. Current practice is to collect statistics over the whole document collection on how frequently and how widely each term occurs, and to derive the term's weight from those statistics. An IR system, however, has not only a collection waiting to be searched but also the query entered by the user. So, in addition to the summary information obtained from the collection, the positions at which the query terms occur in a document can serve as another relevance clue. We believe that common subsequences and the word proximities within them are clues to similarity, and we treat similarity as relevance. We match a query sequence against a document sequence to find the intermediate that is most similar to both, and then use this intermediate to adjust the term weights. The intermediate is the best of the many common subsequences of the two. A good common subsequence should satisfy two conditions: it should contain many words, and those words should lie close together. These two conditions often conflict, much like wanting a horse that runs fast yet eats no grass, and it is hard to say which matters more. We therefore enumerated as many candidate scoring formulas as we could, applied them to real test data, and looked for the formula that performed best on that data. Our goal is to develop an algorithm that finds similar patterns and, when applied to IR, improves retrieval accuracy.
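The matching step described in the abstract can be illustrated with a minimal sketch. This record does not give the scoring formulas that the thesis evaluated, so everything below is an assumption made for illustration only: the function name `best_common_subsequence`, the `gap_penalty` parameter, and the score itself (one point per matched word, minus a penalty for every document word skipped between consecutive matches). It shows one possible way to trade subsequence length against word proximity, not the author's method.

```python
def best_common_subsequence(query, doc, gap_penalty=0.25):
    """Find a common subsequence of `query` and `doc` (lists of words)
    that is long and whose words lie close together in `doc`.

    Hypothetical score: +1 per matched word, minus `gap_penalty` for each
    document word skipped between two consecutive matches.
    Returns (score, [(query_index, doc_index), ...]).
    """
    # best[(i, j)] = (score, previous match) for the best chain
    # that ends with the match query[i] == doc[j].
    best = {}
    for i, q_word in enumerate(query):
        for j, d_word in enumerate(doc):
            if q_word != d_word:
                continue
            score, prev = 1.0, None  # a chain consisting of this match only
            for (pi, pj), (p_score, _) in best.items():
                # A predecessor must come earlier in both sequences.
                if pi < i and pj < j:
                    cand = p_score + 1.0 - gap_penalty * (j - pj - 1)
                    if cand > score:
                        score, prev = cand, (pi, pj)
            best[(i, j)] = (score, prev)

    if not best:
        return 0.0, []

    # Walk back from the highest-scoring end point to recover the chain.
    node = max(best, key=lambda key: best[key][0])
    total = best[node][0]
    chain = []
    while node is not None:
        chain.append(node)
        node = best[node][1]
    chain.reverse()
    return total, chain


query = "information retrieval system".split()
near = "an information retrieval system".split()
far = "information about the retrieval of a system".split()

print(best_common_subsequence(query, near)[0])  # 3.0
print(best_common_subsequence(query, far)[0])   # 2.0
```

Both documents contain all three query words in order, but the second spreads them out, so under this assumed score it ranks lower. A score of this kind could then be used to adjust term weights, as the abstract describes; which formula works best is exactly what the thesis set out to test.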
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880394036
http://hdl.handle.net/11536/65531
Appears in Collections: Thesis