Title: The Study on the Development of the DESDL System for Data Extraction Services
Authors: 錢文祺 (Wen-Chi Chien); 吳毅成 (I-Chen Wu)
College of Computer Science, Degree Program of Computer Science
Keywords: GIDL; DESDL; service-based languages
Issue date: 2002
Abstract: With the rapid growth of the Internet and e-commerce, a huge amount of data is now available on the Web, and extracting exactly the information we need is not easy. Automatically extracting data from the Web and storing it in databases is therefore an important problem, and an extraction language is needed to help computers extract the desired information quickly and systematically. This thesis presents a new XML-based description language, the Data Extraction Service Description Language (DESDL), for expressing how to extract information from Web pages automatically. DESDL improves on the General Interface Definition Language (GIDL), developed earlier in our laboratory, in two ways: (1) document queries are written in XPath, a standard W3C specification, in place of GIDL's own home-grown query notation; (2) a "click" action is added to simulate the user invoking the next service, which resolves the extraction failures GIDL encountered on Web pages containing JavaScript. We also study the problems that arise when following links between Web pages and provide solutions. Finally, we developed a visual extraction tool based on the DESDL syntax that automatically executes the extraction scripts defined by users. This tool speeds up the extraction of Web page data and reduces the manpower and time that extraction requires.
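The XPath-based querying mentioned in the abstract can be illustrated with a minimal sketch. This is not DESDL syntax and not the thesis's implementation; it only shows the general idea of selecting page data with path expressions. The page fragment, the `products` table id, the `name` and `price` classes, and the `extract_rows` helper are all hypothetical, and Python's standard `xml.etree.ElementTree` module supports only a subset of XPath and requires well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (not taken from the thesis).
PAGE = """
<html>
  <body>
    <table id="products">
      <tr><td class="name">Widget</td><td class="price">120</td></tr>
      <tr><td class="name">Gadget</td><td class="price">85</td></tr>
    </table>
  </body>
</html>
"""


def extract_rows(page_source):
    """Return (name, price) pairs selected with XPath-style path queries."""
    root = ET.fromstring(page_source)
    rows = []
    # ElementTree implements a limited XPath subset; this path stays within it.
    for tr in root.findall(".//table[@id='products']/tr"):
        name = tr.find("td[@class='name']").text
        price = int(tr.find("td[@class='price']").text)
        rows.append((name, price))
    return rows


records = extract_rows(PAGE)
print(records)
```

The extracted tuples could then be written to a database table, which corresponds to the storage step the abstract describes. Real Web pages are often not well-formed XML, so a practical extractor would need an HTML-tolerant parser in front of the XPath step.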
URI: http://140.113.39.130/cdrfb3/record/nctu/#NT911706015
http://hdl.handle.net/11536/71310
Appears in collections: Theses