Title: 基於HTML文件佈局之網頁分割演算法
A HTML Rendering-Based Page Segmentation Algorithm (HRPS)
Author: 余提梵
Yu, Ti-fan
Wu, I-Chen
Keywords: Page Segmentation; Separator; Vision-based page segmentation
Issue date: 2009
Abstract: According to statistics, as of 2010 a total of 113 million websites had existed, 99.9% of which were established within the past 15 years. With web data this large and this frequently replaced, using it effectively is an important problem. For large volumes of changing data, users typically rely on search engines to locate the information they need; for data at a known address, data extraction techniques are used to improve efficiency. Whether the tool is a search engine or a data extractor, the first step in analyzing a complex web page is to classify or label its blocks, so that noise blocks can be filtered out and the main-content blocks of each topic region can be extracted. This task is page segmentation.
After a Microsoft team published the Vision-based Page Segmentation algorithm (VIPS) in 2003, much page segmentation research drew on visual segmentation techniques. In recent years, however, as more and more page layouts have come to rely mainly on DHTML, the original VIPS method has shown small defects that its design did not anticipate. Later studies proposed many hybrid page segmentation algorithms to make up for these shortcomings, but because they compensate for VIPS with algorithms based on other features, the blocks segmented this way lose the characteristics of visual segmentation.
This paper proposes a method that builds on visual segmentation and incorporates HTML document layout features (HTML Rendering-Based) to solve the problem that vision-based block segmentation may fail to find a visual separator on DHTML pages.


