Title: From CopeOpi Scores to CopeOpi Vectors: Word Vectors for Multiclass Text Classification
Authors: Tsai, Pei-Shan (蔡佩珊)
Chen, Ying-Ping (陳穎平)
Institute of Computer Science and Engineering
Keywords: Text classification; Vector space model; Word vector
Publication date: 2017
Abstract: In the era of information explosion, millions of digital texts are generated every day. To derive useful information from these textual data, text mining has become a popular area in both research and business, and text classification is one of its most important tasks. In this thesis, we propose a vector space model for multiclass text classification based on word vectors called CopeOpi vectors. We extend CopeOpi scores, which were designed for Chinese sentiment analysis, to CopeOpi vectors, which can be used for multiclass text classification without language restrictions. We evaluate CopeOpi vectors on a series of text classification problems, including sentiment analysis and topic categorization, in both English and Chinese. We compare them with several commonly used features for text classification and test all of these features with different types of machine learning algorithms. The results show that CopeOpi vectors achieve classification performance comparable to the other features with a smaller vector size and shorter training time. CopeOpi vectors are thus effective and efficient features for multiclass text classification.
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070456024
http://hdl.handle.net/11536/142320
Appears in collections: Theses