Title: 為中文新聞瀏覽建構概念式階層架構
Building Concept Hierarchy for Chinese News Browsing
Authors: 曾廣華
Kuang-Hua Tseng
Dr. Suh-Yin Lee
Keywords: Concept Hierarchy; Maximum Spanning Tree; News; Information Retrieval; Chinese Word Processing
Issue Date: 2000
Abstract: Concept hierarchies give users an interactive way to find information by association of ideas. Yahoo!, a well-known portal website, is one example of a site that organizes websites into concept hierarchies. Constructing such hierarchies by hand, however, is expensive and laborious, so constructing them automatically from content is an important issue for Internet news content providers. This thesis focuses on the automatic construction of concept hierarchies for Chinese news content.

Chinese news articles are used as a training set to build a word-net from the terms they contain and the frequency with which those terms co-occur. Concept hierarchies are then derived from the word-net by computing a maximum spanning tree and using the document frequency of each Chinese word to decide which term in a pair is the parent and which is the child. A browsing interface built on the resulting hierarchies lets users query and browse news content.

Evaluating a concept hierarchy objectively is difficult, because judging whether it is correct is largely subjective. To gauge user satisfaction, we designed a set of questions and asked users to evaluate the system. The experimental results show that in more than 60% of queries, the concept hierarchies we built helped users find the news they wanted to read. In addition, in more than half of the queries, users spent less time than they did with a common search engine. The results also show that the first paragraph of a news article carries most of the article's important information. This matters because processing only the first paragraph takes less computing time than processing the full text. Based on this observation, we recommend building concept hierarchies from the first paragraph rather than the full text to save computing time.
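The pipeline described in the abstract (word-net from term co-occurrence, maximum spanning tree, parent-child direction from document frequency) can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names, the toy documents, the use of Kruskal's algorithm, and the rule of rooting the tree at the term with the highest document frequency are all assumptions made for illustration. The input is assumed to be already segmented into words, with English placeholders standing in for Chinese terms.

```python
from collections import Counter, deque
from itertools import combinations


def build_word_net(docs):
    """Count document frequency and pairwise co-occurrence over tokenized docs."""
    df = Counter()
    cooc = Counter()
    for tokens in docs:
        terms = sorted(set(tokens))          # each term counts once per document
        df.update(terms)
        cooc.update(combinations(terms, 2))  # undirected edges keyed as (a, b), a < b
    return df, cooc


def maximum_spanning_tree(terms, cooc):
    """Kruskal's algorithm, taking the heaviest co-occurrence edges first."""
    parent = {t: t for t in terms}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]    # path halving
            t = parent[t]
        return t

    tree = []
    for (a, b), w in sorted(cooc.items(), key=lambda kv: (-kv[1], kv[0])):
        ra, rb = find(a), find(b)
        if ra != rb:                         # skip edges that would close a cycle
            parent[ra] = rb
            tree.append((a, b, w))
    return tree


def orient_by_document_frequency(tree, df):
    """Assumed rule: root each component at its highest-DF term, then walk outward."""
    adj = {}
    for a, b, _ in tree:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    by_df = lambda t: (-df[t], t)            # frequent terms first, ties alphabetical
    roots, children, seen = [], {}, set()
    for root in sorted(df, key=by_df):
        if root in seen:
            continue
        roots.append(root)
        seen.add(root)
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for nb in sorted(adj.get(node, []), key=by_df):
                if nb not in seen:
                    seen.add(nb)
                    children.setdefault(node, []).append(nb)
                    queue.append(nb)
    return roots, children


# Hypothetical, already-segmented documents (placeholders for Chinese terms).
docs = [
    ["election", "president", "vote"],
    ["election", "president"],
    ["election", "vote", "poll"],
    ["election", "economy"],
    ["economy", "stock"],
    ["economy", "stock", "market"],
]

df, cooc = build_word_net(docs)
tree = maximum_spanning_tree(df.keys(), cooc)
roots, children = orient_by_document_frequency(tree, df)
print(roots)
print(children)
```

Following the abstract's recommendation, each entry in `docs` could hold only the words of an article's first paragraph instead of its full text, which shrinks the co-occurrence table and the sorting step.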
Appears in Collections: Thesis