Title: The Study of Character-based Signature Methods in Chinese Text Retrieval (中文文件擷取中以字為基礎的特徵法之研究)
Authors: Tyne Liang (梁婷)
Suh-Yin Lee (李素瑛), Wei-Pang Yang (楊維邦)
Keywords: Chinese, indexing, false hits, text, signature
Issue date: 1994
Abstract: Many Chinese text access methods use characters rather than words as the basic search units and treat polysyllabic queries as conjunctive combinations of their constituent characters. Consequently, if the search algorithm incorporates no character sequence information, it may retrieve an adjacency false hit: a document that contains all the characters of a polysyllabic query but not in the exact sequence of the query itself. In search of a good character-based Chinese text retrieval method, this thesis first examines the relation between adjacency false hits and the construction of polysyllabic words in Chinese. It then estimates the extra storage overhead and processing time needed to eliminate adjacency false hits in the commonly used character-based text access methods (inversion and signature). The estimates indicate that the signature method is more promising than the inversion method because of its lower space overhead and its easy support for the adjacency operation in Chinese text retrieval. However, signature-based access may retrieve documents that do not contain all the keys of the search term; these are random false hits. In this thesis, the origin of random false hits is investigated, and a more realistic estimate of the random false hit probability is derived for Chinese disyllabic and trisyllabic terms.
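The notion of an adjacency false hit can be illustrated with a minimal sketch. The function names and the sample strings below are hypothetical and are not taken from the thesis; the sketch only contrasts a conjunctive character match with an exact sequence match.

```python
def conjunctive_match(query: str, document: str) -> bool:
    """Character-based match: every query character occurs somewhere in the document."""
    return all(ch in document for ch in query)


def exact_match(query: str, document: str) -> bool:
    """Sequence-aware match: the query occurs as a contiguous substring."""
    return query in document


def is_adjacency_false_hit(query: str, document: str) -> bool:
    """True when the document has all query characters but not in query order."""
    return conjunctive_match(query, document) and not exact_match(query, document)


# Hypothetical example: the query "蜜蜂" (bee) against a document about "蜂蜜" (honey).
# Both characters are present, but the order is reversed.
print(is_adjacency_false_hit("蜜蜂", "蜂蜜很甜"))
print(is_adjacency_false_hit("蜜蜂", "蜜蜂採蜜"))
```

The first call reports a false hit because the conjunctive test passes while the sequence test fails; the second document really contains the query, so it is a true hit.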
To construct a Chinese signature file, a special scheme (the combined scheme) is proposed in which every character (monogram) and every pair of consecutive characters (bigram) in the document is hashed into a signature code and superimposed onto the document signature. For disyllabic queries, an analytical expression for the false hit rate is derived. With this expression, the optimal monogram and bigram weight assignments are obtained in terms of the signature length, the storage overhead, and the occurrence frequency and association value of the query.
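The combined scheme described above might be sketched as follows. This is an illustration under stated assumptions, not the thesis's implementation: the SHA-256-based hash, the function names, and the default values for the signature length and the monogram and bigram weights are all hypothetical, and the thesis derives its optimal weights analytically rather than fixing them as done here.

```python
import hashlib


def key_bits(key: str, weight: int, length: int) -> int:
    """Map a key to a bit pattern of `length` bits with at most `weight` bits set.

    Assumption: SHA-256 stands in for the unspecified transformation function.
    """
    pattern = 0
    for i in range(weight):
        digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
        pattern |= 1 << (int.from_bytes(digest[:8], "big") % length)
    return pattern


def combined_signature(text: str, length: int = 256,
                       mono_weight: int = 2, bi_weight: int = 3) -> int:
    """Superimpose (bitwise OR) the codes of every monogram and consecutive bigram."""
    sig = 0
    for ch in text:                      # monograms: single characters
        sig |= key_bits(ch, mono_weight, length)
    for a, b in zip(text, text[1:]):     # bigrams: consecutive character pairs
        sig |= key_bits(a + b, bi_weight, length)
    return sig


def may_contain(query: str, doc_sig: int, length: int = 256,
                mono_weight: int = 2, bi_weight: int = 3) -> bool:
    """Signature test: every bit set in the query signature is set in the document's."""
    q = combined_signature(query, length, mono_weight, bi_weight)
    return (q & doc_sig) == q


doc_sig = combined_signature("蜜蜂採蜜")
print(may_contain("蜜蜂", doc_sig))  # a document that contains the query always passes
```

A document that contains the query always passes the test, since its signature includes every bit of the query signature. A document that lacks the bigram can still pass when other keys happen to set the same bits, which is the false hit the weight assignment aims to make rare.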