Title: A Concept Extraction Approach for Document Clustering and Visualization (以概念萃取為基礎之文件分群與視覺化)
Authors: Chia-Ning Chang (張家寧)
Hao-Ren Ke (柯皓仁)
Wei-Pang Yang (楊維邦)
Institute of Computer Science and Engineering
Keywords: Document Clustering; Keyword Clustering; Concept Extraction; Topic Keyword; Visualization; Citation
Issue Date: 2005
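Abstract: In recent years, the Internet has become the most convenient channel for obtaining information, and entering keywords into a search engine is the most common way of doing so. However, search engines usually do not filter or screen their results, and the excess of data makes judging relevance more complex. How to separate the useful from the useless in the retrieved data, build a model that users can easily understand, and thereby turn the data efficiently into knowledge that users can readily absorb is currently an important research topic. A clustering algorithm analyzes the data and groups similar items according to their similarity; different clusters carry different meanings and concepts, and automatically extracting the meaning of each cluster and assigning it a concept is one of the main goals of this study. This study proposes keyword clustering as the means of concept extraction; after each document is described by multiple concepts, documents are clustered on the basis of those concepts. Concept extraction consists of the following main steps: feature selection, construction of relations between features, and feature clustering; the result of feature clustering is the set of concepts contained in all the documents. In addition, the similarity of the articles cited within documents (citing articles) is used to establish citation relations between documents, and from these the citation relations between clusters, which in turn establishes the relatedness between concepts. Finally, instead of the traditional itemized list, the clustering results and the relatedness between concepts are presented visually. This study uses papers from the CiteSeer database as its corpus, taking titles, abstracts, and citations as data sources. The abstract portion contains only about 1000 characters of text, an amount comparable to the result data returned by a keyword search in a search engine. According to the analysis of the experimental results, the concepts extracted in this study can suitably express the overall concepts of the documents, and document clustering also reaches a reasonable level of accuracy, up to 80%.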
The World Wide Web (WWW) contains a vast amount of information, but finding relevant information on the WWW is a great challenge. Keyword-based querying usually returns many documents; however, they are neither strongly related nor presented in a comprehensible order. Clustering can address this problem by grouping relevant documents, so that users can find relevant documents through groups containing documents with similar concepts. This thesis attempts to extract concepts from a corpus, each of which is defined as a collection of keywords in documents, and to conduct document clustering on the basis of the extracted concepts. The overall process is as follows. First, a clustering algorithm groups similar keywords to create concepts. Second, each document is represented by a vector, each element of which indicates the similarity between the document and a concept. Then, documents are clustered according to these vectors. Furthermore, citations between documents are used to construct document connections, which are in turn used to discover group relations and concept relations. In addition to extracting concepts and clustering documents, this thesis uses visualization techniques to present clustering results and show the relationships between concepts. Several experiments with CiteSeer documents show that the concepts extracted by our method not only clearly represent each group but also achieve good clustering accuracy, about 80%.
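The three steps named in the abstract (keywords grouped into concepts, documents represented as document-to-concept similarity vectors, documents clustered on those vectors) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the toy corpus, the binary occurrence vectors, the single-pass threshold clustering in `greedy_cluster`, the threshold values, and the overlap-fraction similarity. The citation-based relations between groups are not covered.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def greedy_cluster(vectors, threshold):
    """Single-pass clustering (an assumed stand-in algorithm): each item joins
    the first cluster whose seed vector is similar enough, else starts a new one."""
    clusters = []  # lists of item indices; the first index is the cluster seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical toy corpus: each document is a list of keywords.
docs = [
    ["cluster", "document", "similarity"],
    ["cluster", "document", "keyword"],
    ["citation", "graph", "link"],
    ["citation", "link", "visualization"],
]

# Step 1: cluster keywords by their document-occurrence vectors to form concepts.
keywords = sorted({w for d in docs for w in d})
kw_vectors = [[1.0 if w in d else 0.0 for d in docs] for w in keywords]
concepts = [[keywords[i] for i in c] for c in greedy_cluster(kw_vectors, 0.5)]

# Step 2: represent each document by its similarity to every concept
# (here: the fraction of the concept's keywords that occur in the document).
doc_vectors = [[sum(w in d for w in c) / len(c) for c in concepts] for d in docs]

# Step 3: cluster the documents on their concept vectors.
doc_clusters = greedy_cluster(doc_vectors, 0.8)

print(concepts)
print(doc_clusters)
```

On this toy corpus the sketch yields two concepts and places documents 0 and 1 in one group and documents 2 and 3 in another. A real system would replace the greedy pass and the overlap measure with whatever feature selection and clustering method is actually chosen.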
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009323591
http://hdl.handle.net/11536/79120
Appears in Collections: Thesis


Files in This Item:

  1. 359101.pdf
  2. 359102.pdf
  3. 359103.pdf
