Title: | 新聞分群方法之比較研究及應用 Study of Comparing News Clustering Methods and Application |
Author: | 黃郁豪 (Huang, Yu-Hau); 張芳仁; Institute of Business and Management (經營管理研究所) |
Keywords: | News Clustering; Hierarchical Clustering; News Recommendation; Word2vec; Doc2vec; TF-IDF |
Issue Date: | 2017 |
Abstract: | This study uses the Word2vec method to train word vectors from news data and combines them with TF-IDF term weights; the weighted word vectors are then summed to form a news vector. It analyzes how the proportion of keywords extracted from each article affects the quality of the clustering result, and compares this approach against document vectors built with traditional TF-IDF and with Doc2vec. The results show that Word2vec-based news vectors outperform the other methods in semantic analysis, and the model can be applied to practical semantic-analysis tasks.
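As a hedged illustration of the vector-construction step just described (not the thesis's own code), the sketch below builds TF-IDF-weighted Word2vec news vectors with jieba, gensim, and scikit-learn. The sample texts, hyperparameters, and the `news_vector` helper are illustrative assumptions.

```python
import jieba
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the Sina news articles.
news_texts = [
    "央行今日宣布調降利率以刺激經濟成長",
    "股市收盤上漲百分之二 投資人信心回升",
    "職籃季後賽開打 衛冕軍首戰告捷",
]

# 1. Word segmentation with jieba.
tokenized = [list(jieba.cut(text)) for text in news_texts]

# 2. TF-IDF weight of every word in every article (tokens are already split).
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf_matrix = tfidf.fit_transform(tokenized)
vocab = tfidf.vocabulary_

# 3. Word2vec word vectors trained on the same segmented corpus.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, seed=1)

# 4. News vector = sum of (TF-IDF weight x word vector) over the article's words.
#    The TF-IDF weight already encodes term frequency, so each unique token is
#    added once. To mimic the 3%/5%/7% keyword variants, keep only the tokens
#    with the highest TF-IDF weights before summing.
def news_vector(doc_idx, tokens):
    vec = np.zeros(w2v.vector_size)
    for tok in set(tokens):
        if tok in vocab and tok in w2v.wv:
            vec += tfidf_matrix[doc_idx, vocab[tok]] * w2v.wv[tok]
    return vec

news_vectors = np.vstack([news_vector(i, toks) for i, toks in enumerate(tokenized)])
print(news_vectors.shape)  # (3, 100)
```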
This study uses the jieba system to segment Sina news articles provided by Internet company A and to compute TF-IDF weights, and uses Google's recently released open-source Word2vec and Doc2vec toolkits to vectorize news words and news documents. For the Word2vec approach, each word vector is multiplied by its TF-IDF weight, and the weighted vectors of the top 3%, 5%, or 7% of keywords, or of all words in the article, are summed to represent the news item; together with the traditional TF-IDF method and Doc2vec news vectors, six ways of forming news vectors are compared. Cosine similarity between news vectors is computed and hierarchical agglomerative clustering (HAC) is applied, and the resulting clusters are evaluated by the purity and information entropy of the news categories within each cluster. The experiments show that the Word2vec method using TF-IDF-weighted vectors of all words ("Word2vec_Allwords") performs best among the six methods, while the Doc2vec news vectors perform worst in every setting. The purity and entropy of the 3%, 5%, and 7% keyword variants also improve as the amount of training data grows, suggesting that Word2vec's semantic-analysis ability increases with more training data. A further experiment examines what proportion of an article's words should be used to form the news vector for the best clustering result: judging by the slope of the entropy decrease, extracting 10%–15% of the words is the most efficient, while extracting at most 70% of the words already yields the lowest clustering entropy and therefore the most precise semantic-analysis result.
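The clustering and evaluation step described above can be sketched as follows, as a minimal illustration rather than the thesis's exact configuration: SciPy's average-linkage HAC on cosine distances, plus a `purity_and_entropy` helper, run here on placeholder vectors and category labels.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_news(news_vectors, n_clusters):
    """Average-linkage HAC on pairwise cosine distances between news vectors."""
    dist = pdist(news_vectors, metric="cosine")
    tree = linkage(dist, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

def purity_and_entropy(cluster_ids, categories):
    """Purity and size-weighted information entropy of true categories per cluster."""
    cluster_ids, categories = np.asarray(cluster_ids), np.asarray(categories)
    n = len(categories)
    purity, entropy = 0.0, 0.0
    for c in np.unique(cluster_ids):
        counts = np.unique(categories[cluster_ids == c], return_counts=True)[1]
        p = counts / counts.sum()
        purity += counts.max() / n                      # share of majority category
        entropy += (counts.sum() / n) * -(p * np.log2(p)).sum()
    return purity, entropy

# Placeholder data standing in for the real news vectors and category labels.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(60, 100))
labels = rng.integers(0, 3, size=60)
clusters = cluster_news(vectors, n_clusters=3)
print(purity_and_entropy(clusters, labels))
```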
Finally, the study proposes a workflow for applying these results to practical news recommendation: the same method is used to build a personal news vector for each reader, from which personalized, relevant news can be recommended. The approach can also be applied to other article-oriented websites, with the goal of keeping readers on the platform longer and letting them absorb the information they are interested in more conveniently. |
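A minimal sketch of the recommendation idea, under the assumption that a reader's personal news vector is the mean of the vectors of the articles they have read and that unread articles are ranked by cosine similarity to it; the `recommend` function and its parameters are hypothetical and not part of the thesis.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def recommend(news_vectors, read_indices, top_n=5):
    """Rank unread news by cosine similarity to the reader's profile vector."""
    profile = news_vectors[read_indices].mean(axis=0)   # personal news vector
    unread = [i for i in range(len(news_vectors)) if i not in set(read_indices)]
    return sorted(unread,
                  key=lambda i: cosine_similarity(profile, news_vectors[i]),
                  reverse=True)[:top_n]

# Example: a reader who has read articles 0 and 4 gets the five most similar unread items.
rng = np.random.default_rng(1)
vectors = rng.normal(size=(20, 100))
print(recommend(vectors, read_indices=[0, 4]))
```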
URI: | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070453728 http://hdl.handle.net/11536/140864 |
Appears in Collections: | Thesis |