Title: Plagiarism Detection using N-gram Co-occurrence Statistics Based on ROUGE and WordNet
Authors: 陳建穎
Chen-Ying Chen
柯皓仁
Hao-Ren Ke
Institute of Information Management
Keywords: Plagiarism Detection; ROUGE; WordNet; N-gram Co-occurrence Statistics
Issue Date: 2007
Abstract: With the arrival of the digital era and the growth of the Internet, controlling the flow of information has become nearly impossible. Without such control, Internet and computer users can freely copy and reuse any content available to them. When they do so without crediting the source and the owner of the content, the act constitutes plagiarism and infringes intellectual property rights.

Most current plagiarism detection methods follow one of two approaches: fingerprinting and term occurrence. Although both have produced considerable results, they share a weakness: deliberate modification of the original text degrades their detection performance, and fingerprinting is affected most severely. This thesis therefore proposes plagiarism detection algorithms that adopt ROUGE and WordNet. ROUGE includes n-gram co-occurrence statistics, skip-bigram, and longest common subsequence (LCS), while WordNet serves as a thesaurus that also provides semantic information. N-gram co-occurrence statistics can detect verbatim copying and certain changes in sentence structure; skip-bigram and LCS are unaffected by the simple addition or deletion of words; and WordNet can handle the substitution of words with synonyms.

The proposed methods were tested on two manually created corpora, the abstract set and the paraphrased set. For each method, an empirically derived threshold and preprocessing setting are recommended based on the observed performance. Finally, examples of different types of plagiarism are presented to support the earlier statements about the strengths and weaknesses of each method.
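The three ROUGE measures named in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the recall-only scoring, the whitespace tokenization, and the tiny SYNONYMS table (a hypothetical stand-in for a real WordNet lookup) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical synonym table standing in for a WordNet lookup (assumption).
SYNONYMS = {"feline": "cat", "rug": "mat"}


def normalize(tokens):
    """Map each token to a canonical synonym so substituted words still match."""
    return [SYNONYMS.get(tok, tok) for tok in tokens]


def ngrams(tokens, n):
    """Return the list of contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_recall(src_items, sus_items):
    """Clipped overlap count divided by the number of source items."""
    src, sus = Counter(src_items), Counter(sus_items)
    total = sum(src.values())
    overlap = sum(min(count, sus[item]) for item, count in src.items())
    return overlap / total if total else 0.0


def ngram_recall(source, suspect, n=2):
    """ROUGE-N-style recall: share of source n-grams found in the suspect text."""
    return overlap_recall(ngrams(source, n), ngrams(suspect, n))


def skip_bigram_recall(source, suspect):
    """ROUGE-S-style recall over all in-order word pairs, with arbitrary gaps."""
    return overlap_recall(combinations(source, 2), combinations(suspect, 2))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_recall(source, suspect):
    """ROUGE-L-style recall: LCS length divided by the source length."""
    return lcs_length(source, suspect) / len(source) if source else 0.0


source = "the cat sat on the mat".split()
padded = "the black cat quietly sat on the old mat".split()      # words inserted
swapped = normalize("the feline sat on the rug".split())         # synonyms replaced

print(ngram_recall(source, padded, 2))     # 0.4
print(skip_bigram_recall(source, padded))  # 1.0
print(lcs_recall(source, padded))          # 1.0
print(ngram_recall(source, swapped, 2))    # 1.0
```

In this toy example, inserting words breaks most contiguous bigrams (recall 0.4), while skip-bigram and LCS recall stay at 1.0 because the source words still appear in order. Normalizing synonyms before scoring restores a full bigram match for the substituted sentence, which is the role the abstract assigns to WordNet.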
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009534505
http://hdl.handle.net/11536/39188
Appears in Collections: Thesis


Files in This Item:

  1. 450501.pdf
