Title: Automatic Web Data Extraction Based on Pattern Generalization
Original title: 基於模式泛化之自動化網頁資料萃取
Authors: Yao-Hsing Ko (柯燿興)
I-Chen Wu (吳毅成)
Lung-Pin Chen (陳隆彬)
College of Computer Science, Computer Science Program (資訊學院資訊學程)
Keywords: web data extraction; pattern; pattern generalization; equivalence classes
Issue date: 2005
Abstract: With the rapid growth of the Internet, the demand for extracting and collecting web data keeps increasing, and extracting the data a user needs efficiently and correctly has become an important problem. This thesis analyzes a user's web browsing behavior to find its regularities, infers which web data the user needs, and generates a wrapper so that the data can be extracted automatically. On the user side, a user browses and clicks through web pages every day and views the pages and data of interest. The browsing behavior of a single user over a period of time exhibits certain regularities, which can be regarded as a pattern, namely a personalized web pattern. On the web structure side, this thesis focuses on dynamic web pages, whose typical characteristic is that one template can generate several instances and that a page is composed of multiple instances of the same template. The thesis analyzes the patterns that a period of user browsing produces on many instances and develops a new pattern generalization algorithm that generalizes these patterns to the other instances. Finally, the wrapper required by the web extraction program is generated, which achieves automatic extraction.
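The idea of generalizing patterns from browsed instances to the remaining instances of a template can be illustrated with a small sketch. This is not the algorithm developed in the thesis, which the record does not describe in detail. It assumes, purely for illustration, that each item the user viewed is identified by a simplified XPath-like node path, that instances of one template differ only in their positional indices, and that a "wrapper" is just the generalized path. The names `parse`, `generalize`, `matches`, and `extract` are hypothetical.

```python
import re

# One path step such as "tr[2]" or "table" (index is optional).
STEP = re.compile(r"^(\w+)(?:\[(\d+)\])?$")


def parse(path):
    """Split '/table/tr[2]/td[3]' into [('table', None), ('tr', 2), ('td', 3)]."""
    steps = []
    for part in path.strip("/").split("/"):
        m = STEP.match(part)
        if m is None:
            raise ValueError(f"unsupported step: {part!r}")
        tag, idx = m.group(1), m.group(2)
        steps.append((tag, int(idx) if idx else None))
    return steps


def generalize(paths):
    """Merge the paths the user visited into one pattern.

    Assumption: all paths come from the same template, so they share
    the same tag sequence. A step whose index differs between paths is
    replaced by the wildcard '*'; a step whose index agrees is kept.
    """
    if not paths:
        raise ValueError("need at least one observed path")
    parsed = [parse(p) for p in paths]
    tags = [tag for tag, _ in parsed[0]]
    for other in parsed[1:]:
        if [tag for tag, _ in other] != tags:
            raise ValueError("paths do not share one template shape")
    pattern = []
    for i, (tag, idx) in enumerate(parsed[0]):
        indices = {p[i][1] for p in parsed}
        pattern.append((tag, idx if len(indices) == 1 else "*"))
    return pattern


def matches(pattern, path):
    """Return True if a concrete path is covered by the pattern."""
    steps = parse(path)
    if len(steps) != len(pattern):
        return False
    return all(
        tag == ptag and (pidx == "*" or pidx == idx)
        for (ptag, pidx), (tag, idx) in zip(pattern, steps)
    )


def extract(pattern, page):
    """Use the pattern as a toy wrapper: page maps node path -> text."""
    return [text for path, text in page.items() if matches(pattern, path)]


# The user looked at the third cell of rows 2 and 5 only.
clicked = ["/table/tr[2]/td[3]", "/table/tr[5]/td[3]"]
pattern = generalize(clicked)

# A page with more rows (instances) than the user ever visited.
page = {
    "/table/tr[1]/td[3]": "A",
    "/table/tr[2]/td[3]": "B",
    "/table/tr[2]/td[1]": "x",
    "/table/tr[5]/td[3]": "C",
}
result = extract(pattern, page)
```

Here the row index varies between the two observed paths and becomes a wildcard, while the column index stays fixed, so the toy wrapper also picks up the third cell of row 1, which the user never visited.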
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009167594
http://hdl.handle.net/11536/64013
Appears in Collections: Thesis