Title: | GIDL: General Interface Definition Language for Web Extraction |
Authors: | Jun-Chi Chen (陈俊琪); I-Chen Wu (吴毅成); Institute of Computer Science and Engineering |
Keywords: | Web data extraction; GIDL; PBP; Multiple PBP; GIDLet |
Issue Date: | 2000 |
Abstract: | With the spread of the World Wide Web, a vast and varied body of information is available online to millions of users. The growing volume of data, however, makes it harder and less efficient for users to find what they want. An automatic web data extraction system addresses this problem by acting as the user's agent and extracting the desired data from the web. This thesis analyzes the current state of the World Wide Web and of related web extraction techniques, and proposes a new general interface definition language for web extraction, GIDL (General Interface Definition Language). GIDL specifically addresses the Multiple PBP (Page-By-Page) extraction model and introduces the GIDLet concept, a plug-in extension mechanism, to extend the functionality of an extraction system. PBP extraction proceeds one web page after another; Multiple PBP extraction, after extracting a page, continues by extracting several of the pages it links to at the same time. Finally, we implement an automatic web data extraction system with GIDL as its core technology, aiming at the most general web extraction functionality. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#NT890392012 http://hdl.handle.net/11536/66805 |
Appears in Collections: | Thesis |