Title: 大型資訊檢索系統之轉置檔案設計
Inverted File Design for Large-Scale Information Retrieval System
Author: 鄭哲聖
Cher-Sheng Cheng
單智君
Jean Jyh-Jiun Shann
Institute of Computer Science and Engineering
Keywords: Information Retrieval; Inverted File; Inverted File Compression; Search Engine; Query Evaluation; Parallel IR
Date of Issue: 2004
Abstract:
This dissertation investigates a variety of techniques for improving the efficiency of information retrieval (IR). IR systems (IRSs) are widely used in many applications, such as search engines, digital libraries, and genomic sequence analysis. To search vast amounts of data efficiently, an IRS uses a compressed inverted file to locate the desired data quickly. An inverted file contains, for each distinct term in the collection, a posting list of the identifiers of the documents containing that term. The query processing time of a large-scale IRS is dominated by the time needed to read and decompress the posting list of each query term. Moreover, adding a document to the collection appends one document identifier to the posting list of each term appearing in that document, so the length of a posting list grows in proportion to the size of the document collection. This implies that the time needed to process posting lists also increases as the collection grows. Efficient approaches to reducing the time needed to read, decompress, and merge posting lists are therefore key to designing a large-scale IRS.

The research topics studied in this dissertation are:

(1) Efficient coding method for inverted file size reduction

The first topic is a novel size reduction method for compressing inverted files. Compressing an inverted file can greatly improve query performance by reducing disk I/O, but it adds decompression time. The objective of this topic is to develop a method that combines a high compression ratio with fast decompression. Our approach is as follows. The foundation is interpolative coding, which compresses the document identifiers with a recursive process that exploits their clustering property and yields superior compression. However, interpolative coding is computationally expensive due to the stack required in its implementation.
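As background, the recursive idea of interpolative coding can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it emits a list of integer offsets rather than an actual bitstream, and the function names are ours.

```python
def interpolative_encode(ids, lo, hi, out):
    # Recursively encode a sorted list of docids known to lie in [lo, hi].
    # A real coder would write each offset in ceil(log2(range + 1)) bits,
    # and zero bits when the bounds pin the value down completely.
    if not ids:
        return
    m = len(ids) // 2
    x = ids[m]
    # With m entries to its left and len(ids) - 1 - m to its right,
    # x is confined to [lo + m, hi - (len(ids) - 1 - m)].
    out.append(x - (lo + m))
    interpolative_encode(ids[:m], lo, x - 1, out)
    interpolative_encode(ids[m + 1:], x + 1, hi, out)

def interpolative_decode(codes, n, lo, hi):
    # Rebuild n docids from the offsets, consuming them in encoding order.
    it = iter(codes)
    def rec(n, lo, hi):
        if n == 0:
            return []
        m = n // 2
        x = lo + m + next(it)
        return rec(m, lo, x - 1) + [x] + rec(n - m - 1, x + 1, hi)
    return rec(n, lo, hi)
```

The two recursive calls on both sides of the midpoint are exactly what the dissertation's recursion elimination and loop unwinding aim to speed up.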
The key idea of our proposed method is to speed up the coding and decoding processes of interpolative coding through recursion elimination and loop unwinding. Experimental results show that our method provides fast decoding and excellent compression.

(2) Two-level skipped inverted file for redundant decoding elimination

The second topic is a two-level skipped inverted file, in which a two-level skipped index is created on each compressed posting list to reduce decompression time. A two-level skipped index can greatly reduce decompression time by skipping over unnecessary portions of a list. However, well-known skipping mechanisms cannot implement the two-level skipped index efficiently because of their high storage overheads. The objective of this topic is to develop a space-economical two-level skipped inverted file that eliminates redundant decoding and allows fast query evaluation. For this purpose, we propose a novel skipping mechanism based on block size calculation, which can create a skipped index on each compressed posting list with very little or no storage overhead, particularly when the posting list is divided into very small blocks. Combining our skipping mechanism with well-known skipping mechanisms implements a two-level skipped index with very little storage overhead. Experimental results show that such a two-level skipped index allows extremely fast processing of both conjunctive Boolean queries and ranked queries.

(3) Document identifier assignment algorithm design for inverted file optimization

The third topic is a document identifier assignment (DIA) algorithm for fast query evaluation. We observe that a good DIA makes the document identifiers in the posting lists more clustered, resulting in better compression as well as shorter query processing time.
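To make the role of skipping concrete, here is a minimal sketch of conjunctive intersection with a one-level skipped index over uncompressed posting lists. The block length and layout are illustrative assumptions; in a real compressed index, bypassing a block means leaving it undecoded, which is where the decompression savings come from.

```python
def intersect_with_skips(short_list, long_list, block_len=4):
    # Conjunctive (AND) intersection of two sorted posting lists.
    # Entries long_list[0], long_list[block_len], ... act as skip pointers:
    # while the next block starts at a docid <= the candidate, the whole
    # current block can be bypassed without examining its entries.
    result = []
    pos = 0
    for doc in short_list:
        # Skip whole blocks via the skip pointers.
        while pos + block_len < len(long_list) and long_list[pos + block_len] <= doc:
            pos += block_len
        # Linear scan only within the final block.
        while pos < len(long_list) and long_list[pos] < doc:
            pos += 1
        if pos < len(long_list) and long_list[pos] == doc:
            result.append(doc)
    return result
```

Because both lists are sorted, `pos` only moves forward across candidates, so each entry of the long list is examined at most once.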
The objective of this topic is to develop a fast algorithm that finds a good DIA to minimize the average query processing time of an IRS. In a typical IRS, the distribution of query terms is skewed. Based on this fact, we propose a partition-based DIA (PBDIA) algorithm, which can efficiently assign consecutive document identifiers to the documents containing frequently used query terms. The posting lists of frequently used query terms can therefore be compressed better without increasing the complexity of the decoding process, which results in reduced query processing time.

(4) Inverted file partitioning for parallel IR

The fourth topic is an inverted file partitioning approach for parallel IR. In an IRS that runs on a cluster of workstations, the inverted file is generally partitioned into disjoint sub-files, one per workstation. When processing a query, each workstation consults only its own sub-file, in parallel with the others. The objective of this topic is to develop a partitioning approach that minimizes the average time of parallel query processing. Our approach is as follows. The foundation is the interleaved partitioning scheme, which generates a partitioned inverted file under an interleaved mapping rule and produces near-ideal speedup. The key idea of our approach is to use the PBDIA algorithm to enhance the clustering property of the posting lists of frequently used query terms before applying the interleaved partitioning scheme. This helps the interleaved partitioning scheme produce superior query performance.

The results of this dissertation include:

• For inverted file size reduction, the proposed coding method allows a query throughput rate approximately 30% higher than the widely used Golomb coding while still providing superior compression.
• For redundant decoding elimination, the proposed two-level skipped inverted file improves query speed by up to 16% for conjunctive Boolean queries and by up to 44% for ranked queries, compared with the conventional one-level skipped inverted file.
• For inverted file optimization, the PBDIA algorithm takes only a few seconds to generate a DIA for a 1 GB collection, and improves query speed by up to 25%.
• For parallel IR, the proposed approach further improves parallel query speed over the interleaved partitioning scheme alone by 14% to 17%, regardless of how many workstations are in the cluster.
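The compression effect that PBDIA exploits can be illustrated with d-gaps. The Elias gamma bit count below is only an illustrative cost model, not the coder used in the dissertation, and the example lists are invented:

```python
import math

def dgaps(postings):
    # Store the first docid as-is and every later one as the difference
    # (d-gap) from its predecessor; clustered docids yield small gaps.
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]

def gamma_bits(n):
    # Bit cost of an Elias gamma code for n >= 1: 2*floor(log2 n) + 1.
    return 2 * int(math.log2(n)) + 1

def list_bits(postings):
    # Total illustrative bit cost of a d-gap-compressed posting list.
    return sum(gamma_bits(g) for g in dgaps(postings))

scattered = [7, 120, 355, 900]  # a term's documents spread over the collection
clustered = [7, 8, 9, 10]       # the same documents after consecutive renumbering
# list_bits(clustered) is far smaller than list_bits(scattered),
# because every d-gap after the first collapses to 1.
```

Assigning consecutive identifiers to documents sharing frequent query terms shrinks the d-gaps of exactly those posting lists that queries touch most often, which is why the renumbering pays off at query time.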
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT008517802
http://hdl.handle.net/11536/66778
Appears in Collections: Thesis


Files in This Item:

  1. 780201.pdf

If a file is a zip archive, please download and unzip it, then open index.html in the extracted folder with a browser to view the full text.