完整後設資料紀錄
DC 欄位語言
dc.contributor.author戴憲文en_US
dc.contributor.author單智君en_US
dc.date.accessioned2014-12-12T02:22:51Z-
dc.date.available2014-12-12T02:22:51Z-
dc.date.issued1999en_US
dc.identifier.urihttp://140.113.39.130/cdrfb3/record/nctu/#NT880392045en_US
dc.identifier.urihttp://hdl.handle.net/11536/65442-
dc.description.abstract目前的網際網路資訊存取系統普遍採用轉置檔(inverted file)來加快文件的搜尋。但是,整個轉置檔的體積通常相當龐大。因此,壓縮轉置檔便成為減少儲存成本及處理運算所需資料量最簡潔的方式之一。傳統的轉置檔壓縮通常是透過d-gap技術壓縮轉置檔中的文件辨識碼,再經過某種簡單的編碼方式來達到縮小轉置檔體積的目的。在本篇論文中,我們提出兩個改進的方法以提昇轉置檔的壓縮率。第一,利用資料庫文件的叢集特性,透過重新編排每個文件的辨識碼,使文件間之辨識碼的差距縮小,以提昇壓縮率。第二,針對編碼方式,提出二種以除法為基礎的改進技巧,重新表示每個差距,再經由簡單的編碼方式進行編碼。模擬結果顯示,透過重新編排文件的辨識碼,轉置檔的壓縮率可以提昇約15%;此外,我們提出的編碼改進技巧也能更進一步提昇約3%的整體壓縮率。zh_TW
dc.description.abstractThe inverted file is one of the most popular mechanisms to speedup the document search in an Information Retrieval System (IRS). However, the size of the inverted file might is usually enormous. Therefore, compressing the inverted file become one of the most compact ways to reduce the space cost and the amount of data required to be processed. Traditionally, the d-gap technique proposed by Moffat is applied to an inverted file to replace document identifications (document IDs) by some smaller numbers. Then, these smaller numbers can be effectively encoded by a prefix-code to reduce the size of the inverted file. In this thesis, we propose two improvement of the compression procedure to increase the compression rate of the inverted file. First, owing to the clustering property in documents, a document ID reassignment procedure is proposed to reduce the gaps in the original inverted file. In this procedure, a relation graph is constructed from original inverted lists to represent the relation among documents. Then, a heuristic TSP algorithm is used to travel the graph to get a new order of document IDs. After document ID reassignment, a prefix-code is applied to the new inverted file. Second, for encoding techniques, we propose two division-based notation to represent a gap as a quotation and a remainder, and then encode these gaps by prefix-code. The simulation results show that the compression rate can be improved about 15% by the document ID reassignment. Besides, the division-based encoding technique can further increase about 3% of the compression rate.en_US
dc.language.isoen_USen_US
dc.subject轉置檔zh_TW
dc.subject叢集zh_TW
dc.subject資訊存取系統zh_TW
dc.subject網際網路搜尋zh_TW
dc.subject壓縮zh_TW
dc.subjectd-gap技術zh_TW
dc.subjectInverted Fileen_US
dc.subjectClusteren_US
dc.subjectInformation Retrieval Systemen_US
dc.subjectInternet Searchingen_US
dc.subjectCompressionen_US
dc.subjectd-gap techniqueen_US
dc.title改進轉置檔壓縮之文件辨識碼重編技術zh_TW
dc.titleDocument Identification Reassignment for Inverted File Compressionen_US
dc.typeThesisen_US
dc.contributor.department資訊科學與工程研究所zh_TW
顯示於類別:畢業論文