Title: 網站內網頁之區塊等級分析
Block-level Ranking for Intra-Website Pages
Authors: 姚文鋒
Wen-Feng Yao
吳毅成
I-Chen Wu
College of Computer Science, Degree Program in Computer Science
Keywords: Intra-WebSite; Block-level; Link Analysis
Issue Date: 2006
Abstract: According to published statistics, there were more than 14 billion web pages worldwide as of June 2007. Using such a huge body of data efficiently is an important problem. For information whose location is unknown, we usually rely on search engines to find it. For information whose location is known, we use data extraction techniques to improve efficiency. BODE (Browser Oriented Data Extraction), developed by our laboratory, is one such web data extraction system. Through its user-friendly interface, users select the data they want to extract, and the system generates the script needed for extraction (the BODE script) and then performs the extraction. However, building a BODE script requires a working knowledge of BODE script syntax, XPath, and HTML tags. To lower the barrier to using the BODE system, this thesis proposes an algorithm that automatically identifies the useful data blocks within a single website, as a step toward generating BODE scripts automatically.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009367594
http://hdl.handle.net/11536/80117
Appears in Collections: Thesis

