標題: 剖析與加速三種具有字串比對之內容安全應用
Profiling and Accelerating String Matching Algorithms in Three Content Security Applications
作者: 李志祥
Zhi-Xiang Li
林盈達
Ying-Dar Lin
資訊科學與工程研究所
關鍵字: 字串比對;演算法;內容安全;string matching;pattern matching;algorithm;content security
公開日期: 2004
摘要: 網路內容安全已經成為網際網路中重要的議題。在內容處理中,字串比對演算法效率的必要性也漸漸被證實。字串比對演算法的效能主要受到樣本的個數、最短特徵值的長度與特徵值所組成的字元集而有所影響。這份研究將復審與剖析一些典型的演算法來了解各種演算法適合在什麼情況下使用。Aho-Corasick演算法適合使用在特徵值最短的長度為1的時候,Modified-WM演算法適合使用在特徵值最短的長度為2且樣本個數小於1,000的情況下,FNPw2則適合使用在特徵值最短的長度為2且樣本個數大於1,000的情況下,特徵值最短的長度為3的時候則適合使用Modified-WM演算法,特徵值大於等於4的時候則適合使用2-gram BG+演算法。接著,各種有效率的演算法皆實作到開放原始碼套件中,以便觀察在實際應用上效能的差異。如果字串比對處理在總執行時間的比例重的話,效能上則有很大的改善。如在ClamAV的實驗中,新的方法在效能上提升了5倍以上。另外,將會餵真實的資料與人造的資料到這些套件中,讓它們處理。以便觀察這些套件處理真實資料與人造資料效能上的差異。由於字元集分布的影響,它們處理真實資料所需的時間會比處理人造資料來的長。最後,這份研究亦觀察出在字串比對演算法中的一些實際設計議題。
Network content security has become a critical issue of the Internet. It is shown that the efficiency of string matching algorithms is essential to content processing. The performance of a string matching algorithm is sensitive to the number of patterns, the minimum length of the signature and the character set that the signatures are composed of. This work reviews and profiles some typical algorithms to understand which algorithm is suitable in which situation. The AC algorithm is suitable for LSP=1, the Modified-WM algorithm is suitable for LSP =2 when the pattern set size is smaller than 1,000, the FNPw2 algorithm is suitable for LSP=2 when the pattern set size is larger than 1,000, the Modified-WM algorithm is suitable for LSP=3 and the BG+ algorithm is suitable for LSP 4. Then, these algorithms are implemented on open-source content security applications to observe the performance in practice. The performance improvement is significant if the percentage of string matching processing on total execution time is great. For example, the novel method is five times faster than the original method on the experiment of ClamAV. In addition, these applications are fed with the real and synthetic data. The differences of performance between the real and synthetic data are also compared. The execution time for processing the real data is longer than that for processing the synthetic data due to the character set distribution. Finally, the practical design issues for string matching are also observed in this work.
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009223516
http://hdl.handle.net/11536/76565
Appears in Collections:Thesis


Files in This Item:

  1. 351601.pdf

If it is a zip file, please download the file and unzip it, then open index.html in a browser to view the full text content.