Title: Applying Pattern Similarity with Word Proximities of Common Subsequences in Information Retrieval
Original title: 利用共同子序列及其字距特性來做為文句相似的依據,並應用在資訊擷取中
Author: Hsing-Yung Wang (王興湧)
Dr. Min-Wen Du
Keywords: information retrieval; word proximity; common subsequence; approximate text matching
Publication date: 1999
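Abstract: Information retrieval (IR) seeks to find the documents relevant to a given query. Indexing is an important step in the IR process, and the importance of a term to a document is usually determined at indexing time. Current practice mainly gathers statistics on how frequently and how widely each term occurs across the document collection, and then derives the term's importance from those statistics. Besides the document collection waiting to be searched, an IR system also has the query entered by the user, so in addition to summarizing the collection we can analyze where the query terms occur in a document as another clue to relevance. We believe that common subsequences and their word proximities are clues to similarity, and we treat similarity as relevance. From a query and a document we look for the intermediary most similar to both, and then use it to adjust term importance; this intermediary is the best one chosen from the many common subsequences. The ideal common subsequence meets two conditions: it contains many words, and those words lie as close together as possible. Often, however, one cannot find the horse that runs fast yet eats no grass, and it is sometimes hard to decide whether running fast or eating no grass matters more. Our approach is therefore to list as many evaluation formulas as possible, apply them to real test data, and identify the formula that performs best on that data. Our ultimate goal is an algorithm that finds similar text passages and, applied to IR, improves retrieval accuracy.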
The central notion of Information Retrieval (IR) is the relevance between a given query and the documents. Indexing is an important step in the IR process, and traditional term-weighting systems compute term weights while building the index, drawing their relevance clues from the document corpus. Another step of the process, retrieving the data that correspond to a given query, provides word sequence and proximity information as further relevance clues. We seek patterns similar to the query in the documents, using the word proximities of common subsequences as the criterion. We match a query sequence against a document sequence to find the best common subsequence, which serves as an intermediate between the two. For a relevant document, this common subsequence is expected to be long and to have close word proximities. We then compute term weights according to the quality of the common subsequence found. We tested many formulas to find the best way to evaluate that quality. Our goal is to develop an algorithm that finds similar patterns and improves retrieval performance when applied to IR.
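The trade-off between the length of a common subsequence and the proximity of its words can be illustrated with a minimal Python sketch. The abstract does not state which formulas the thesis tested, so the scoring rule here (one point per matched word, minus a penalty for every word skipped between consecutive matches) is purely an assumption. The names `best_common_subsequence` and `gap_penalty` are likewise hypothetical.

```python
def best_common_subsequence(query, doc, gap_penalty=0.5):
    """Return (score, pairs) for the best-scoring common subsequence.

    Assumed score (not the thesis's formula): +1 per matched word, minus
    gap_penalty for each word skipped between consecutive matches, counted
    in both the query and the document. pairs holds (query_idx, doc_idx).
    """
    # Every position pair where a query word equals a document word,
    # ordered by query index and then by document index.
    matches = [(i, j)
               for i, q in enumerate(query)
               for j, d in enumerate(doc)
               if q == d]
    if not matches:
        return 0.0, []

    best = []   # best[k]: best score of a chain ending at matches[k]
    prev = []   # prev[k]: index of the previous match in that chain
    for k, (i, j) in enumerate(matches):
        score, back = 1.0, None          # a chain may start at this match
        for m in range(k):
            pi, pj = matches[m]
            if pi < i and pj < j:        # keep word order in both sequences
                skipped = (i - pi - 1) + (j - pj - 1)
                cand = best[m] + 1.0 - gap_penalty * skipped
                if cand > score:
                    score, back = cand, m
        best.append(score)
        prev.append(back)

    # Walk back from the best chain end to recover the matched pairs.
    k = max(range(len(matches)), key=best.__getitem__)
    total = best[k]
    chain = []
    while k is not None:
        chain.append(matches[k])
        k = prev[k]
    chain.reverse()
    return total, chain


query = ["information", "retrieval", "system"]
doc_near = ["an", "information", "retrieval", "system", "demo"]
doc_far = ["information", "about", "the", "old", "retrieval", "system"]

score_near, chain_near = best_common_subsequence(query, doc_near)
score_far, chain_far = best_common_subsequence(query, doc_far)
```

Under this assumed rule, `doc_near` matches all three query words contiguously, while in `doc_far` the distant occurrence of "information" costs more than it contributes and is dropped from the best subsequence. The resulting score could then feed into a term-weight adjustment, which this sketch does not attempt.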