完整后设资料纪录
DC 栏位语言
dc.contributor.author黄乃伦en_US
dc.contributor.authorHuang, Nai-Lunen_US
dc.contributor.author李程辉en_US
dc.contributor.authorLee, Tsern-Hueien_US
dc.date.accessioned2014-12-12T01:24:48Z-
dc.date.available2014-12-12T01:24:48Z-
dc.date.issued2012en_US
dc.identifier.urihttp://140.113.39.130/cdrfb3/record/nctu/#GT079513802en_US
dc.identifier.urihttp://hdl.handle.net/11536/41106-
dc.description.abstract  本论文研究旨在处理字串比对议题。字串比对为在文字串列中搜寻特定关键字(组)的程序,其中,文字串列及关键字(组)皆由已知的字符表中的字符所构成。知名且被广为采用的Aho-Corasick(AC)演算法可同时比对多个关键字,并在任何情况下都确保线性时间效能。
  本论文中,我们首先提出一种可有效节省资料结构储存空间的AC字串比对机实现方式,称之为compressed AC。相较于原始的AC字串比对机,compressed AC达到98.90%的资料结构储存空间缩减率,此外,在我们习知的所有AC-based压缩机制中,compressed AC所需的储存空间是最少的,在比对效能方面,它也具有最佳表现。
  其后,我们进而提出一套由预先过滤器(pre-filter)及验证引擎(verification engine)所组成的字串比对架构。预先过滤器旨在迅速过滤掉不可能含有关键字(组)的文字串列片段,能过滤掉愈大量的片段愈好,它藉由询问事先根据关键字(组)所建造的资料结构,在文字串列中寻找可疑的关键字出现位置,验证引擎则负责验证这些可疑的位置是否真有关键字(组)的存在。
  我们的预先过滤器可累计询问结果,称之为状态性预先过滤器(stateful pre-filter)。我们提出了极易实现的实作方式,仅需一个存放位元映象(bitmap)的暂存器及简单的位元运算即可。这个实作方式等效于采用所有过去习得的询问结果来做过滤,因此在此层面上为最佳解。我们的验证引擎是compressed AC的变型,可以同时验证所有关键字,此外,它所需的资料结构储存空间比compressed AC更少。我们称这整套字串比对架构为Pre-filter+AC,无论在比对效能或储存空间上,它的表现都比compressed AC为佳。
  达成了利用所有过去的询问结果(及现阶段的询问结果)来决定预先过滤器现阶段可以过滤掉的片段之后,我们持续致力于在现阶段的询问结果获知之前就推测得知可以过滤掉的片段,提出stateful pre-filter的进化型,称之为speculation pre-filter。speculation pre-filter能推测有多少字符可能可以被过滤掉,若其在所推测可能可以过滤掉的字符数中,采用绝对安全的策略,仅选择最少量的过滤数,我们称其为保守(conservative)。无论是在平均上或是最糟情况下,保守的(conservative)speculation pre-filter都具有比stateful pre-filter更好的比对效能,值得一提的是,最糟情况下的效能提升为双倍。
  当然,本论文研究亦存在可以改进之处,我们的预先过滤器对于处理中等长度以上的关键字(组),具有良好的吞吐效能,然而,处理长度短的关键字(组)时,效能则会降低,事实上,所有无需检验文字串列中每一字符的预先过滤器都具有这种特性。对于短的关键字(组),预先过滤机制可能无法提升(甚或会阻碍)系统效能,庆幸的是,我们的状态性设计仅需一个位元映象的暂存器及简单的位元运算之代价,即可等效于采用所有已习得的询问结果来做过滤,提升吞吐效能。再者,我们相信此特性是能够根除的,信念来自此观察:验证引擎除了验证功能之外,其实可以具备担保过滤字符数量的能力,这着实也是一个引人入胜的研究主题啊。
zh_TW
dc.description.abstractThis dissertation addresses the problem of string matching, which can be defined as a process of searching for all occurrences of given keyword(s) in a text string. Here, the text string and keyword(s) are stings of symbols over a given alphabet. The Aho-Corasick (AC) algorithm, which can match multiple keywords simultaneously and guarantee linear-time deterministic performance under all circumstances, is a well-known string matching algorithm widely adopted.
We first provided a memory-efficient implementation of the AC string matching machine. The implementation, called compressed AC, achieves 98.90% reduction of memory requirement compared with the original AC string matching machine. Among all the investigated AC-based compression schemes, our implementation requires the least memory space. As for the throughput performance, it also performs the best.
Then, we proposed a string matching architecture consisting of a pre-filter and a verification engine. The pre-filter queries data structures built from keywords to find the starting positions of suspicious keyword occurrences. It aims to quickly skip as much as possible the pieces not in a match, while the verification engine is responsible for confirming whether the suspicious occurrences are true matches.
Our pre-filter, named stateful pre-filter, can accumulate query results. We realize this with a bitmap and simple bitwise-AND and shift operations. The proposed stateful realization is proved to be optimal in the sense that it is equivalent to utilizing all previous query results. Our verification engine is designed based on the AC algorithm so that all candidate keywords can be verified simultaneously. It is a variant of our proposed compressed AC. Because of the adoption of a pre-filter, its required data structures can be further reduced. The resulting string matching architecture, named Pre-filter+AC, performs better than compressed AC regarding both throughput performance and memory requirements.
After making use of all previous query results (and the current one) to determine current skipped distance of the pre-filter, we kept devoting to speculate skipped distances before the current query result is obtained. The proposed pre-filter, named speculation pre-filter, can speculate the number of symbols that are likely to be skipped. It is called conservative if it safely speculates the minimum possible skipped distance. The conservative speculation pre-filter outperforms our previously proposed stateful pre-filter on average. Moreover, the worst case performance of it doubles that of the stateful pre-filter.
It should be pointed out that our proposed pre-filters are suitable for keywords of moderate or large lengths. For short keywords, the throughput performance degrades. This is a common drawback of schemes using skip-based pre-filters. Nevertheless, the stateful design improves throughput with little cost of a bitmap and simple operations. Although pre-filtering may not help (and possibly even hurts) system performance if there are short keywords, there exist one possible remedy, which is another interesting topic. Conceptually, one can utilize the verification results to help determine the skipped distances of a pre-filter.
en_US
dc.language.isoen_USen_US
dc.subject字串比对演算法zh_TW
dc.subject预先过滤器zh_TW
dc.subject验证引擎zh_TW
dc.subject封包检验zh_TW
dc.subject入侵侦测/预防zh_TW
dc.subject恶意程式/恶意软体zh_TW
dc.subject电脑病毒zh_TW
dc.subject搜寻引擎zh_TW
dc.subjectstring matching algorithmen_US
dc.subjectpre-filteren_US
dc.subjectverification engineen_US
dc.subjectdeep packet inspectionen_US
dc.subjectintrusion detection/preventionen_US
dc.subjectmalwareen_US
dc.subjectvirusen_US
dc.subjectsearch engineen_US
dc.title状态性预先过滤器及高效能验证引擎构成之字串比对系统zh_TW
dc.titleString matching system with stateful pre-filter and efficient verification engineen_US
dc.typeThesisen_US
dc.contributor.department电信工程研究所zh_TW
显示于类别:Thesis