Title: | 瀏覽器導向資料萃取系統之資料模式 Data Model for Browser Oriented Data Extraction System |
Author: | 邱一芳 Yi-Fang Chiu 吳毅成 I-Chen Wu College of Computer Science, Computer Science Program |
Keywords: | Browser Oriented Data Extraction System; Data Model |
Publication Date: | 2005 |
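Abstract: | The Internet holds data of many kinds and in enormous volume. Searching for it by hand through a web browser, page after page, and recording what is found often takes considerable manpower and time. We therefore developed BODE, a data extraction system for the Internet, to give users a convenient and efficient tool.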
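BODE defines an Internet data extraction language, BODED Script, which describes how to simulate a manual data search and the paths to the data. Following the BODED Script description, BODE navigates web pages and extracts data. Once extraction is complete, BODE can write the data to a file. At present, however, users must still preprocess the data manually before they can use what BODE has extracted.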
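To simplify the preprocessing that users perform on the extracted data and to make the system more flexible, BODE will define several definition languages. Users only need to define the data classes of their data clearly, and the system can then produce structured output according to those definitions.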
This thesis proposes a data class for BODE that defines the structural model of the data, so that the data BODE extracts is represented in structured form and can easily be stored in a database. Besides improving the readability and reusability of the data, this makes the data easier and more efficient to maintain.
With the rapid growth of the World Wide Web, it has become very important to extract information from such a huge body of data. For example, a price comparison system needs to extract information such as product names, prices, and purchase methods from related e-commerce sites. Other examples include extracting news from news providers and collecting categories of publishers. In our past projects, we developed a browser oriented data extraction system, defined a data extraction language named BODED (Browser Oriented Data Extraction Description), and designed a system for it to solve the data extraction problem; the technique has been transferred to industry. However, how to manage the extracted data is the next important research topic. Basic applications include data storage, data query, Web site reengineering, and Web site integration. BODE is a powerful tool for extracting Web data. To handle the extracted data, we introduce a data model and develop a generic method for storing the extracted data in a database properly. Based on BODE and BODELET, we define a description language for the structure and relations of the extracted data, and we developed a software component named BODELETDB. This component automatically stores the extracted data directly and cleanly in a database. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009167576 http://hdl.handle.net/11536/63902 |
Appears in Collections: | Thesis |
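
The idea in the abstract, declaring the structure of the extracted data once and letting the system turn raw extraction output into structured rows in a database, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not show the thesis's description language or the BODELETDB interface, so the `PRODUCT_CLASS` mapping, the `to_row` and `store` helpers, the `product` table, and the sample records are hypothetical. The fields follow the price comparison example mentioned in the abstract, and SQLite stands in for whatever database the real component targets.

```python
import sqlite3

# Hypothetical data-class definition: field name -> (SQL type, converter).
# This is NOT the thesis's description language, only a stand-in for it.
PRODUCT_CLASS = {
    "name": ("TEXT", str.strip),
    "price": ("REAL", lambda s: float(s.replace(",", "").lstrip("$"))),
    "purchase_method": ("TEXT", str.strip),
}


def to_row(raw, data_class):
    """Convert one raw extracted record (all strings) into a typed tuple."""
    return tuple(convert(raw[field]) for field, (_, convert) in data_class.items())


def store(conn, table, data_class, raw_records):
    """Create the table from the data class and insert the typed rows."""
    columns = ", ".join(
        f"{name} {sql_type}" for name, (sql_type, _) in data_class.items()
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({columns})")
    placeholders = ", ".join("?" for _ in data_class)
    conn.executemany(
        f"INSERT INTO {table} VALUES ({placeholders})",
        [to_row(raw, data_class) for raw in raw_records],
    )
    conn.commit()


# Made-up records standing in for the extractor's raw output.
raw_records = [
    {"name": " Widget A ", "price": "$1,299.00", "purchase_method": "online"},
    {"name": "Widget B", "price": "45.5", "purchase_method": " in store "},
]

conn = sqlite3.connect(":memory:")
store(conn, "product", PRODUCT_CLASS, raw_records)
rows = conn.execute(
    "SELECT name, price, purchase_method FROM product ORDER BY price"
).fetchall()
print(rows)
```

Keeping the field types and converters in one declaration means the manual preprocessing step described in the abstract is expressed once per data class rather than repeated for every extraction run.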