Title: | 網站內網頁之區塊等級分析 Block-level Ranking for Intra-Website Pages |
Authors: | 姚文鋒 Wen-Feng Yao; 吳毅成 I-Chen Wu; Computer Science Program, College of Computer Science |
Keywords: | Intra-WebSite; Block-level; Link Analysis |
Issue Date: | 2006 |
Abstract: | According to statistics, there were more than 14 billion web pages worldwide as of June 2007. Using such a huge body of data efficiently is an important problem. For information whose location is unknown, we usually rely on search engines to locate it; for information whose location is known, data extraction techniques are used to improve efficiency.
BODE (Browser Oriented Data Extraction), developed by our laboratory, is such a web data extraction system. Through its user-friendly interface, users select the data they want to extract; the system then generates the script needed for extraction (the BODE script) and performs the extraction.
However, building a BODE script requires some understanding of BODE script syntax, XPath, and HTML tags. To lower the barrier to using BODE, this thesis proposes an algorithm that automatically identifies the useful data blocks within a single website, as a step toward automatically generating BODE scripts. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009367594 http://hdl.handle.net/11536/80117 |
Appears in Collections: | Thesis |