A Fast Accurate Proxy for Multi-Language Text Webpage Classification

Title:	A Fast Accurate Proxy for Multi-Language Text Webpage Classification (一個針對多語系網頁內容過濾的快速精確之代理伺服器)
Authors:	Fu-Hsiang Huang (黃福祥); Ying-Dar Lin (林盈達); Institute of Computer Science and Engineering
Keywords:	content filtering; text classification; N-gram; early blocking; early bypassing
Date issued:	2003
Abstract:	Real-time content analysis is an important technique for Web content filtering because it has low maintenance cost and low storage requirements. However, it also suffers from lower accuracy and longer processing time. Because Web pages in multiple languages reduce accuracy, we use the N-gram algorithm to extract keywords from training samples, add them to the content filter, and evaluate how much the added keywords affect accuracy. To shorten the processing time, we propose an early decision algorithm with two parts: early blocking and early bypassing. Early blocking makes the blocking decision as soon as there is enough evidence during classification that the Web page belongs to a forbidden category, while early bypassing makes the bypassing decision as soon as the Web page is found to be a normal one. Experiments on a Pentium III 1 GHz CPU running NetBSD 1.6 show that our approach improves throughput about sixfold over the original approach and reduces latency by two thirds. Furthermore, the blocking ratio rises from 70% to 99%.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT009123562 http://hdl.handle.net/11536/53179
Appears in Collections:	Thesis
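
The two ideas in the abstract, N-gram keyword extraction and early decision, can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the frequency-difference scoring, the weights, and every threshold (`block_score`, `bypass_score`, `bypass_after_chars`) are hypothetical choices made for illustration only. Character N-grams are used here because they need no word segmentation, which suits mixed-language pages.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def char_ngrams(text: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def extract_keywords(forbidden_docs: List[str], normal_docs: List[str],
                     n: int = 2, top_k: int = 5) -> List[str]:
    """Pick n-grams that occur more often in forbidden samples than in normal ones.

    The scoring rule (count difference) is an assumption, not the thesis' rule.
    """
    forbidden = Counter(g for d in forbidden_docs for g in char_ngrams(d, n))
    normal = Counter(g for d in normal_docs for g in char_ngrams(d, n))
    scored = {g: c - normal[g] for g, c in forbidden.items()}
    # Sort by descending score, then alphabetically so the result is deterministic.
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return [g for g, s in ranked[:top_k] if s > 0]


def early_decision(chunks: Iterable[str],
                   keyword_weights: Dict[str, float],
                   block_score: float = 10.0,
                   bypass_score: float = 1.0,
                   bypass_after_chars: int = 2000) -> Tuple[str, int]:
    """Classify a page that arrives as a stream of text chunks.

    Returns (decision, chunks_read). All thresholds are hypothetical.
    Keywords that straddle a chunk boundary are not counted in this sketch.
    """
    weights = {kw.lower(): w for kw, w in keyword_weights.items()}
    score = 0.0
    seen_chars = 0
    chunks_read = 0
    for chunk in chunks:
        chunks_read += 1
        text = chunk.lower()
        for kw, w in weights.items():
            score += w * text.count(kw)
        seen_chars += len(chunk)
        # Early blocking: enough evidence that the page is forbidden.
        if score >= block_score:
            return "block", chunks_read
        # Early bypassing: enough text seen with almost no evidence.
        if seen_chars >= bypass_after_chars and score <= bypass_score:
            return "pass", chunks_read
    # Whole page consumed without an early decision; it stayed below block_score.
    return "pass", chunks_read


if __name__ == "__main__":
    print(extract_keywords(["賭博網站", "線上賭博"], ["新聞網站"], n=2, top_k=1))
    weights = {"casino": 5.0, "bet": 2.0}
    print(early_decision(["Welcome to the casino", " best casino bet", "unread"], weights))
```

In this sketch the remaining chunks of a page are never inspected once either early condition fires, which is where the processing-time saving described in the abstract would come from.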

Files in This Item:

356201.pdf
