Title: | 自動化文章敵意分級系統之初探研究 A Pilot of Automatic Sorting System with Hostile Articles |
Authors: | Jyh-Horng Lin (林志鴻); Dr. Sunny S. J. Lin (林珊如); Dr. Eric Zhi-Feng Liu (劉旨峰); College of Science, Technology and Digital Learning Program |
Keywords: | information retrieval; vector space model; flame; hostility; document classification |
Issue Date: | 2003 |
Abstract: | As the number of documents on the Internet grows geometrically, finding exactly the documents one needs has become an important problem. Drawing on prior research in automatic document classification, this thesis proposes a procedure and method for classifying hostile Chinese-language documents with the vector space model. We sampled 5000 articles from the hardware discussion board (tw.bbs.comp.hardware) of the academic network BBS and first classified them manually by degree of hostility. We then ran automatic classification experiments: several articles are entered as training input, the system extracts their key terms and computes term weights to build a central vector of hostile articles, and each subsequent input article is scored by its similarity to that vector. Articles whose similarity exceeds a threshold are classified as hostile; all others are classified as non-hostile. The study found:
(1) When articles on the same topic are used for training, the similarity to the hostile central vector differs markedly between hostile and non-hostile articles.
(2) When the topics of the training articles differ, the computed similarities also differ.
(3) Classifying with the optimal threshold of 0.17 found in the threshold experiment gives good accuracy for non-hostile articles, about 0.98, but poor accuracy for hostile articles, about 0.25.
(4) Lowering the threshold to 0.136 raises the HR value to 0.72. |
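The classification procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: it assumes articles are already segmented into term lists, uses plain term frequency as the term weight (the abstract does not say which weighting scheme was used), and measures similarity with cosine similarity. The function names and the toy English tokens are hypothetical; only the 0.17 threshold comes from the abstract.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Turn a tokenized article into a unit-length term-frequency vector.
    Assumption: raw term frequency stands in for the unspecified weighting."""
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}

def centroid(vectors):
    """Average sparse vectors into a single central vector."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def is_hostile(tokens, hostile_center, threshold=0.17):
    """Label an article hostile when its similarity exceeds the threshold."""
    return cosine(tf_vector(tokens), hostile_center) > threshold

# Toy training set of hostile articles (hypothetical, pre-tokenized).
training = [["idiot", "stupid", "card"], ["stupid", "troll", "idiot"]]
center = centroid([tf_vector(doc) for doc in training])

print(is_hostile(["you", "stupid", "idiot"], center))
print(is_hostile(["which", "drive", "is", "faster"], center))
```

Lowering `threshold` in this sketch labels more articles as hostile, which is the trade-off behind findings (3) and (4).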
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT009073539 http://hdl.handle.net/11536/42613 |
Appears in Collections: | Thesis |