标题: 改進轉置檔壓縮之文件辨識碼重編技術
Document Identification Reassignment for Inverted File Compression
作者: 戴憲文
單智君
資訊科學與工程研究所
关键字: 轉置檔;叢集;資訊存取系統;網際網路搜尋;壓縮;d-gap技術;Inverted File;Cluster;Information Retrieval System;Internet Searching;Compression;d-gap technique
公开日期: 1999
摘要: 目前的網際網路資訊存取系統普遍採用轉置檔(inverted file)來加快文件的搜尋。但是,整個轉置檔的體積通常相當龐大。因此,壓縮轉置檔便成為減少儲存成本及處理運算所需資料量最簡潔的方式之一。傳統的轉置檔壓縮通常是透過d-gap技術壓縮轉置檔中的文件辨識碼,再經過某種簡單的編碼方式來達到縮小轉置檔體積的目的。在本篇論文中,我們提出兩個改進的方法以提昇轉置檔的壓縮率。第一,利用資料庫文件的叢集特性,透過重新編排每個文件的辨識碼,使文件間之辨識碼的差距縮小,以提昇壓縮率。第二,針對編碼方式,提出二種以除法為基礎的改進技巧,重新表示每個差距,再經由簡單的編碼方式進行編碼。模擬結果顯示,透過重新編排文件的辨識碼,轉置檔的壓縮率可以提昇約15%;此外,我們提出的編碼改進技巧也能更進一步提昇約3%的整體壓縮率。
The inverted file is one of the most popular mechanisms to speedup the document search in an Information Retrieval System (IRS). However, the size of the inverted file might is usually enormous. Therefore, compressing the inverted file become one of the most compact ways to reduce the space cost and the amount of data required to be processed. Traditionally, the d-gap technique proposed by Moffat is applied to an inverted file to replace document identifications (document IDs) by some smaller numbers. Then, these smaller numbers can be effectively encoded by a prefix-code to reduce the size of the inverted file. In this thesis, we propose two improvement of the compression procedure to increase the compression rate of the inverted file. First, owing to the clustering property in documents, a document ID reassignment procedure is proposed to reduce the gaps in the original inverted file. In this procedure, a relation graph is constructed from original inverted lists to represent the relation among documents. Then, a heuristic TSP algorithm is used to travel the graph to get a new order of document IDs. After document ID reassignment, a prefix-code is applied to the new inverted file. Second, for encoding techniques, we propose two division-based notation to represent a gap as a quotation and a remainder, and then encode these gaps by prefix-code. The simulation results show that the compression rate can be improved about 15% by the document ID reassignment. Besides, the division-based encoding technique can further increase about 3% of the compression rate.
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT880392045
http://hdl.handle.net/11536/65442
显示于类别:Thesis