Title: 为中文新闻浏览建构概念式阶层架构
Building Concept Hierarchy for Chinese News Browsing
Authors: 曾广华
Kuang-Hua Tseng
李素瑛
Dr. Suh-Yin Lee
Institute of Computer Science and Engineering
Keywords: Concept Hierarchy; Maximum Spanning Tree; News; Information Retrieval; Chinese Word Processing
Issue date: 2000
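Abstract: A concept hierarchy lets users browse related information by association; the well-known portal Yahoo! is one application of the idea. Building a concept hierarchy by hand, however, is laborious. In this thesis we therefore attempt to construct concept hierarchies automatically from the content of Chinese news.
We build a word-net from the words that appear in Chinese news content and the frequency with which they occur together, then apply a maximum spanning tree, together with a word-frequency rule for deciding parent-child relations, to construct the concept hierarchy. The result is then exposed through a browsing interface that users can query and use.
Judging whether a system-built concept hierarchy is correct is highly subjective. To evaluate the system's performance, we asked a number of users to assess it. The experimental results show that in more than 60% of queries, the concept hierarchies we built helped users find the news they wanted to read. In addition, in more than half of the queries users spent less time than with a general search engine. The results also show that the first paragraph of a news article contains most of the information in the whole article, so using it as the data source saves computation time. Based on this observation, we recommend building concept hierarchies from the first paragraph of each article.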
Concept hierarchies provide an interactive interface that lets users find the information they want by association of ideas. Yahoo!, the well-known portal website, is one example of a site that provides concept hierarchies of websites. However, constructing concept hierarchies manually is an expensive and laborious task. How to construct concept hierarchies automatically from content is therefore an important issue for Internet news content providers. In this thesis, we focus on the automatic construction of concept hierarchies for Chinese news content.
Chinese news content is used as a training set to build a word-net from Chinese term co-occurrence. From the word-net, concept hierarchies are derived using a maximum spanning tree and the document frequency of each Chinese word. Once the concept hierarchies are constructed, they serve as the basis for an interface through which users can query and browse news content.
It is difficult to evaluate a concept hierarchy objectively. To measure user satisfaction, we designed a set of questions for users. The experimental results show that in more than 60% of queries, the concept hierarchies we built helped users find the news content they wanted. In addition, in more than half of the queries users spent less time than with common search engines. The results also reveal that the first paragraph of a news article contains most of its important information. This matters because processing only the first paragraph takes less computing time than processing the full text. Based on this observation, concept hierarchies can be built from the first paragraph instead of the full text to save computing time.
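The construction outlined in the abstract (a word-net of term co-occurrence, a maximum spanning tree over it, and document frequency to decide which term is the parent) can be illustrated with a minimal sketch. Several details here are assumptions, since the abstract does not specify them: documents are taken to be already segmented into terms, the edge weight is the number of documents in which two terms co-occur, the tree is built with Kruskal's algorithm, and each tree is rooted at its term with the highest document frequency. The function name `build_hierarchy` and the English stand-in terms are hypothetical.

```python
from collections import Counter, defaultdict, deque
from itertools import combinations


def build_hierarchy(docs):
    """Return {term: parent_term or None} for a list of segmented documents.

    Assumption: `docs` is a list of term lists; word segmentation of the
    Chinese text is done elsewhere.
    """
    doc_freq = Counter()  # number of documents containing each term
    cooc = Counter()      # number of documents containing both terms of a pair
    for terms in docs:
        unique = sorted(set(terms))
        doc_freq.update(unique)
        cooc.update(combinations(unique, 2))

    # Kruskal's algorithm with edges taken heaviest first gives a maximum
    # spanning tree (a forest if the word-net is disconnected).
    leader = {term: term for term in doc_freq}

    def find(term):
        while leader[term] != term:
            leader[term] = leader[leader[term]]  # path halving
            term = leader[term]
        return term

    adjacency = defaultdict(list)
    for (a, b), _weight in sorted(cooc.items(), key=lambda kv: (-kv[1], kv[0])):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            leader[root_a] = root_b
            adjacency[a].append(b)
            adjacency[b].append(a)

    # Assumed orientation rule: root each tree at its most frequent term and
    # direct edges away from the root with a breadth-first traversal.
    hierarchy = {}
    for root in sorted(doc_freq, key=lambda t: (-doc_freq[t], t)):
        if root in hierarchy:
            continue
        hierarchy[root] = None
        queue = deque([root])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in hierarchy:
                    hierarchy[neighbour] = node
                    queue.append(neighbour)
    return hierarchy


# English stand-ins for segmented Chinese news terms (hypothetical data).
docs = [
    ["sports", "baseball", "pitcher"],
    ["sports", "baseball"],
    ["sports", "basketball"],
    ["sports", "basketball", "playoffs"],
    ["sports", "baseball", "pitcher"],
]
hierarchy = build_hierarchy(docs)
for term, parent in hierarchy.items():
    print(term, "<-", parent)
```

On this toy input the most frequent term, "sports", becomes the root, with "baseball" and "basketball" beneath it. To follow the first-paragraph recommendation, pass only the terms from each article's first paragraph as `docs`, which shrinks the number of co-occurring pairs to count.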
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT890392036
http://hdl.handle.net/11536/66828
Appears in Collections: Thesis