Fast query evaluation through document identifier assignment for inverted file-based information retrieval systems

doi:10.1016/j.ipm.2005.05.003

Full metadata record

DC Field	Value	Language
dc.contributor.author	Cheng, CS	en_US
dc.contributor.author	Chung, CP	en_US
dc.contributor.author	Shann, JJJ	en_US
dc.date.accessioned	2014-12-08T15:16:46Z	-
dc.date.available	2014-12-08T15:16:46Z	-
dc.date.issued	2006-05-01	en_US
dc.identifier.issn	0306-4573	en_US
dc.identifier.uri	http://dx.doi.org/10.1016/j.ipm.2005.05.003	en_US
dc.identifier.uri	http://hdl.handle.net/11536/12338	-
dc.description.abstract	Compressing an inverted file can greatly improve query performance of an information retrieval system (IRS) by reducing disk I/Os. We observe that a good document identifier assignment (DIA) can make the document identifiers in the posting lists more clustered, and result in better compression as well as shorter query processing time. In this paper, we tackle the NP-complete problem of finding an optimal DIA to minimize the average query processing time in an IRS when the probability distribution of query terms is given. We indicate that the greedy nearest neighbor (Greedy-NN) algorithm can provide excellent performance for this problem. However, the Greedy-NN algorithm is inappropriate if used in large-scale IRSs, due to its high complexity O(N^2 × n), where N denotes the number of documents and n denotes the number of distinct terms. In real-world IRSs, the distribution of query terms is skewed. Based on this fact, we propose a fast O(N × n) heuristic, called partition-based document identifier assignment (PBDIA) algorithm, which can efficiently assign consecutive document identifiers to those documents containing frequently used query terms, and improve compression efficiency of the posting lists for those terms. This can result in reduced query processing time. The experimental results show that the PBDIA algorithm can yield a competitive performance versus the Greedy-NN for the DIA problem, and that this optimization problem has significant advantages for both long queries and parallel information retrieval (IR). © 2005 Elsevier Ltd. All rights reserved.	en_US
dc.language.iso	en_US	en_US
dc.subject	inverted index	en_US
dc.subject	inverted file compression	en_US
dc.subject	query evaluation	en_US
dc.subject	document identifier assignment	en_US
dc.subject	d-gap technique	en_US
dc.title	Fast query evaluation through document identifier assignment for inverted file-based information retrieval systems	en_US
dc.type	Article	en_US
dc.identifier.doi	10.1016/j.ipm.2005.05.003	en_US
dc.identifier.journal	INFORMATION PROCESSING & MANAGEMENT	en_US
dc.citation.volume	42	en_US
dc.citation.issue	3	en_US
dc.citation.spage	729	en_US
dc.citation.epage	750	en_US
dc.contributor.department	資訊工程學系	zh_TW
dc.contributor.department	Department of Computer Science	en_US
dc.identifier.wosnumber	WOS:000233552400010	-
dc.citation.woscount	1	-
Appears in Collections:	Journal Articles

Files in This Item:

000233552400010.pdf

If the file is a zip archive, download and extract it, then open index.html in the extracted folder with a browser to view the full text.
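
The abstract's central observation, that clustered document identifiers compress better under the d-gap technique, can be illustrated with a small sketch. This is not the paper's PBDIA or Greedy-NN algorithm. It only compares the stored size of one posting list under two identifier assignments. The variable-byte code, the function names, and the sample identifiers are assumptions chosen for illustration; the abstract does not say which gap code the authors use.

```python
def d_gaps(doc_ids):
    """Turn a posting list of document IDs into d-gaps.

    The first entry is kept as-is; every later entry is the difference
    from its predecessor in sorted order.
    """
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def vbyte_len(n):
    """Bytes needed to store n with variable-byte coding (7 payload bits per byte).

    Assumption: variable-byte coding stands in for whatever gap code
    a real system would use.
    """
    length = 1
    while n >= 128:
        n >>= 7
        length += 1
    return length


def compressed_size(doc_ids):
    """Total bytes for a posting list stored as variable-byte coded d-gaps."""
    return sum(vbyte_len(gap) for gap in d_gaps(doc_ids))


# Hypothetical posting list for one frequently queried term.
scattered = [3, 700, 20000, 3000000]

# Hypothetical reassignment that gives the same documents consecutive IDs.
mapping = {3: 1, 700: 2, 20000: 3, 3000000: 4}
clustered = [mapping[d] for d in scattered]

print(compressed_size(scattered), compressed_size(clustered))
```

Smaller gaps need fewer bytes, so the reassigned list is cheaper to read from disk. A real assignment has to trade this off across all posting lists at once, which is what makes the problem hard.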