狀態性預先過濾器及高效能驗證引擎構成之字串比對系統

Title:	狀態性預先過濾器及高效能驗證引擎構成之字串比對系統 String matching system with stateful pre-filter and efficient verification engine
Authors:	黃迺倫 Huang, Nai-Lun 李程輝 Lee, Tsern-Huei 電信工程研究所
Keywords:	字串比對演算法;預先過濾器;驗證引擎;封包檢驗;入侵偵測/預防;惡意程式/惡意軟體;電腦病毒;搜尋引擎;string matching algorithm;pre-filter;verification engine;deep packet inspection;intrusion detection/prevention;malware;virus;search engine
Issue Date:	2012
Abstract:	本論文研究旨在處理字串比對議題。字串比對為在文字串列中搜尋特定關鍵字（組）的程序，其中，文字串列及關鍵字（組）皆由已知的字符表中的字符所構成。知名且被廣為採用的Aho-Corasick（AC）演算法可同時比對多個關鍵字，並在任何情況下都確保線性時間效能。　　本論文中，我們首先提出一種可有效節省資料結構儲存空間的AC字串比對機實現方式，稱之為compressed AC。相較於原始的AC字串比對機，compressed AC達到98.90%的資料結構儲存空間縮減率，此外，在我們習知的所有AC-based壓縮機制中，compressed AC所需的儲存空間是最少的，在比對效能方面，它也具有最佳表現。　　其後，我們進而提出一套由預先過濾器（pre-filter）及驗證引擎（verification engine）所組成的字串比對架構。預先過濾器旨在迅速過濾掉不可能含有關鍵字（組）的文字串列片段，能過濾掉愈大量的片段愈好，它藉由詢問事先根據關鍵字（組）所建造的資料結構，在文字串列中尋找可疑的關鍵字出現位置，驗證引擎則負責驗證這些可疑的位置是否真有關鍵字（組）的存在。　　我們的預先過濾器可累計詢問結果，稱之為狀態性預先過濾器（stateful pre-filter）。我們提出了極易實現的實作方式，僅需一個存放位元映象（bitmap）的暫存器及簡單的位元運算即可。這個實作方式等效於採用所有過去習得的詢問結果來做過濾，因此在此層面上為最佳解。我們的驗證引擎是compressed AC的變型，可以同時驗證所有關鍵字，此外，它所需的資料結構儲存空間比compressed AC更少。我們稱這整套字串比對架構為Pre-filter+AC，無論在比對效能或儲存空間上，它的表現都比compressed AC為佳。　　達成了利用所有過去的詢問結果（及現階段的詢問結果）來決定預先過濾器現階段可以過濾掉的片段之後，我們持續致力於在現階段的詢問結果獲知之前就推測得知可以過濾掉的片段，提出stateful pre-filter的進化型，稱之為speculation pre-filter。speculation pre-filter能推測有多少字符可能可以被過濾掉，若其在所推測可能可以過濾掉的字符數中，採用絕對安全的策略，僅選擇最少量的過濾數，我們稱其為保守（conservative）。無論是在平均上或是最糟情況下，保守的（conservative）speculation pre-filter都具有比stateful pre-filter更好的比對效能，值得一提的是，最糟情況下的效能提升為雙倍。　　當然，本論文研究亦存在可以改進之處，我們的預先過濾器對於處理中等長度以上的關鍵字（組），具有良好的吞吐效能，然而，處理長度短的關鍵字（組）時，效能則會降低，事實上，所有無需檢驗文字串列中每一字符的預先過濾器都具有這種特性。對於短的關鍵字（組），預先過濾機制可能無法提升（甚或會阻礙）系統效能，慶幸的是，我們的狀態性設計僅需一個位元映象的暫存器及簡單的位元運算之代價，即可等效於採用所有已習得的詢問結果來做過濾，提升吞吐效能。再者，我們相信此特性是能夠根除的，信念來自此觀察：驗證引擎除了驗證功能之外，其實可以具備擔保過濾字符數量的能力，這著實也是一個引人入勝的研究主題啊。 This dissertation addresses the problem of string matching, which can be defined as a process of searching for all occurrences of given keyword(s) in a text string. Here, the text string and keyword(s) are stings of symbols over a given alphabet. The Aho-Corasick (AC) algorithm, which can match multiple keywords simultaneously and guarantee linear-time deterministic performance under all circumstances, is a well-known string matching algorithm widely adopted. We first provided a memory-efficient implementation of the AC string matching machine. The implementation, called compressed AC, achieves 98.90% reduction of memory requirement compared with the original AC string matching machine. Among all the investigated AC-based compression schemes, our implementation requires the least memory space. As for the throughput performance, it also performs the best. Then, we proposed a string matching architecture consisting of a pre-filter and a verification engine. The pre-filter queries data structures built from keywords to find the starting positions of suspicious keyword occurrences. It aims to quickly skip as much as possible the pieces not in a match, while the verification engine is responsible for confirming whether the suspicious occurrences are true matches. Our pre-filter, named stateful pre-filter, can accumulate query results. We realize this with a bitmap and simple bitwise-AND and shift operations. The proposed stateful realization is proved to be optimal in the sense that it is equivalent to utilizing all previous query results. Our verification engine is designed based on the AC algorithm so that all candidate keywords can be verified simultaneously. It is a variant of our proposed compressed AC. Because of the adoption of a pre-filter, its required data structures can be further reduced. The resulting string matching architecture, named Pre-filter+AC, performs better than compressed AC regarding both throughput performance and memory requirements. After making use of all previous query results (and the current one) to determine current skipped distance of the pre-filter, we kept devoting to speculate skipped distances before the current query result is obtained. The proposed pre-filter, named speculation pre-filter, can speculate the number of symbols that are likely to be skipped. It is called conservative if it safely speculates the minimum possible skipped distance. The conservative speculation pre-filter outperforms our previously proposed stateful pre-filter on average. Moreover, the worst case performance of it doubles that of the stateful pre-filter. It should be pointed out that our proposed pre-filters are suitable for keywords of moderate or large lengths. For short keywords, the throughput performance degrades. This is a common drawback of schemes using skip-based pre-filters. Nevertheless, the stateful design improves throughput with little cost of a bitmap and simple operations. Although pre-filtering may not help (and possibly even hurts) system performance if there are short keywords, there exist one possible remedy, which is another interesting topic. Conceptually, one can utilize the verification results to help determine the skipped distances of a pre-filter.
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT079513802 http://hdl.handle.net/11536/41106
Appears in Collections:	Thesis

APA	黃., Huang, N., 李., & Lee, T. (2012). 狀態性預先過濾器及高效能驗證引擎構成之字串比對系統. http://hdl.handle.net/11536/41106.
Bibtex	@article{黃迺倫 and Huang2012, title={狀態性預先過濾器及高效能驗證引擎構成之字串比對系統}, author={黃迺倫 and Huang, Nai-Lun and 李程輝 and Lee, Tsern-Huei}, journal={http://hdl.handle.net/11536/41106}, year={2012}, url={https://ir.lib.nycu.edu.tw/handle/11536/41106}, }