Title: An Empirical Analysis of Risk Alert with the Text Mining of Internet Financial News (網路財務新聞的文字分析於風險預警應用之實證研究)
Authors: Yeh, Hung-Ching (葉鴻青)
Dai, Tian-Shyr (戴天時)
College of Management, Degree Program of Finance
Keywords: Supervised machine learning;Excess return;Chinese word segmentation;Logit regression;Naïve Bayes classifier;Support vector classifier;Support vector regression
Issue Date: 2016
Abstract: As Chinese word segmentation tools and machine learning methods have matured, a growing number of domestic studies analyze material company announcements, the text of financial statements, financial news reports, and real-time social media commentary. Using a variety of analytical approaches, they extract useful information from the semantics of Chinese text to predict individual stock price movements or to issue risk alerts for investment positions. This thesis applies natural language processing and supervised machine learning to real-time news in order to predict the excess return of individual stocks, where excess return is defined as the stock's return minus the market return. For Chinese word segmentation, we adopt the Chinese financial sentiment dictionary provided by Professor Chuan-Ju Wang, so that the words used in financial news reflect the tendency of the news content. Using a Chinese word segmentation tool such as Jieba, we quantify each news item by the TF-IDF weights of its segmented terms and build a matrix relating excess returns to the weights of the feature terms. We then build two types of models to predict future stock price movements. The first type treats the dependent variable as categorical, that is, it classifies the excess return as positive or negative; it comprises logit regression, the Naïve Bayes classifier, and the support vector classifier. The second type treats the dependent variable as continuous, that is, it predicts the magnitude of the excess return; it comprises linear regression and support vector regression. In our empirical results, the average accuracy of all three categorical models under 50-fold cross-validation is about 60% or higher, and the AUC (area under the curve) indicates at least moderate discriminatory power. For the two continuous models, the coefficient of determination and the mean squared error are acceptable, and the predictions are of reference value. Taken together, the five models indicate that segmented Chinese terms from stock-specific news do have predictive power for excess returns, which supports the view that the Taiwan stock market does not conform to the strong-form efficient market hypothesis.
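The feature construction described in the abstract (excess return as stock return minus market return, plus a matrix of term weights per news item) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy news items and returns are invented, the tokens are taken as already segmented (for example by Jieba), the function names are hypothetical, and the TF-IDF variant shown (relative term frequency times ln(N/df)) may differ from the one used in the thesis.

```python
import math
from collections import Counter


def excess_return(stock_return, market_return):
    """Excess return as defined in the abstract: stock return minus market return."""
    return stock_return - market_return


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists, assumed to be already segmented.
    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy data: (segmented news tokens, stock return, market return).
news = [
    (["營收", "成長", "獲利"], 0.021, 0.005),
    (["虧損", "下修", "營收"], -0.018, 0.002),
    (["獲利", "創新高"], 0.010, -0.003),
]

features = tf_idf([tokens for tokens, _, _ in news])

# Categorical target: 1 if the excess return is positive, else 0.
labels = [1 if excess_return(r, m) > 0 else 0 for _, r, m in news]

# Dense matrix: one row per news item, one column per vocabulary term.
vocabulary = sorted({t for tokens, _, _ in news for t in tokens})
matrix = [[row.get(term, 0.0) for term in vocabulary] for row in features]
```

In a setup like this, `matrix` and `labels` would be the inputs to the categorical models, while the raw excess returns would replace `labels` as the target for the continuous models.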
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070363919
http://hdl.handle.net/11536/138604
Appears in Collections: Thesis