Title: 基於主旨輔以自然語言特徵之線上垃圾郵件偵測系統
An Online Spam E-mail Detection System Using Natural Language Features Based on E-mail Subject
Authors: 李致寧
曾文貴
Lee, Chih-Ning
Tzeng, Wen-Guey
Institute of Computer Science and Engineering
Keywords: spam e-mail detection; e-mail subject; naive Bayesian classifier; natural language features; incremental scheme; online learning
Publication date: 2016
Abstract: Electronic mail (e-mail) is one of today's most widely used communication tools. Because sending e-mail is fast, cheap, and convenient, it has also become a channel for delivering spam. The large volume of spam wears users down and consumes network resources: users must pick the useful messages out of a flood of e-mail, and spam that carries malicious attachments threatens the security of their computers. Classifying spam effectively is therefore an important network security problem.
Most current detection methods need the complete e-mail message to produce a classification, yet the subject line alone already carries a great deal of information. Classifying by subject is more efficient than classifying by the whole message, and it still works when the plaintext of the message body is not available. We therefore propose a spam e-mail detection system that uses only the subject field. To match real-world conditions, the system adopts an online learning algorithm and an incremental training scheme, updating the classifier with new e-mails so that it adapts to spam patterns that change over time. In addition, we add statistical features and two kinds of natural language features, part-of-speech and hypernym features, to improve performance. The experimental results show that the weighted multinomial naïve Bayesian classifier gives the best outcomes, and that using the union of bag-of-words and part-of-speech features as the feature set increases classification accuracy.
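The incremental, subject-only classification described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it is a plain (unweighted) multinomial naive Bayes with Laplace smoothing over bag-of-words tokens only, and the class name `IncrementalSubjectNB`, the tokenizer, the two labels `"spam"`/`"ham"`, and the sample subjects are all assumptions made for illustration. The weighting scheme, statistical features, and part-of-speech and hypernym features mentioned in the abstract are omitted.

```python
import math
import re
from collections import Counter


class IncrementalSubjectNB:
    """Hypothetical sketch: multinomial naive Bayes over subject-line tokens,
    updated one labelled message at a time (no retraining from scratch)."""

    LABELS = ("spam", "ham")

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing constant
        self.class_counts = Counter()           # messages seen per label
        self.token_counts = {label: Counter() for label in self.LABELS}
        self.vocab = set()                      # all tokens seen so far

    @staticmethod
    def tokenize(subject):
        # Assumed tokenizer: lowercase alphanumeric runs.
        return re.findall(r"[a-z0-9]+", subject.lower())

    def update(self, subject, label):
        """Online step: fold one new labelled subject into the counts."""
        tokens = self.tokenize(subject)
        self.class_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def log_score(self, subject, label):
        """Smoothed log prior plus smoothed log likelihood of each token."""
        total = sum(self.class_counts.values())
        score = math.log((self.class_counts[label] + self.alpha)
                         / (total + len(self.LABELS) * self.alpha))
        counts = self.token_counts[label]
        denom = sum(counts.values()) + self.alpha * max(len(self.vocab), 1)
        for token in self.tokenize(subject):
            score += math.log((counts[token] + self.alpha) / denom)
        return score

    def predict(self, subject):
        return max(self.LABELS, key=lambda label: self.log_score(subject, label))


# Made-up training subjects, fed to the classifier one at a time.
clf = IncrementalSubjectNB()
for subject, label in [
    ("Win a free prize now", "spam"),
    ("Free money offer", "spam"),
    ("Cheap pills free shipping", "spam"),
    ("Meeting agenda for Monday", "ham"),
    ("Project report draft", "ham"),
    ("Lunch on Monday", "ham"),
]:
    clf.update(subject, label)

print(clf.predict("free prize offer"))
print(clf.predict("Monday project meeting"))
```

Because `update` only increments counts, a newly labelled e-mail can be folded in at any time, which is the property that lets a classifier of this kind follow spam patterns as they drift.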
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070356057
http://hdl.handle.net/11536/140676
Appears in collections: Theses