Title: Corpus-based Topic Derivation and Timestamp-based Popular Hashtag Prediction in Twitter
Authors: Sharath Kumar B R (尚庫柏)
Wang, Kuochen (王國禎)
Institute of Computer Science and Engineering
Keywords: corpus; popular hashtag prediction; timestamp; topic derivation; Twitter
Issue Date: 2016
Abstract: With the growth of the Internet, mobile platforms, online commerce, and social media services, traces of human behavior are easily recorded in the digital world, generating data on an extremely large scale. Twitter, as a big-data social network, has become one of the most important sources for capturing up-to-date events around the world. Deriving topics from Twitter is important for various applications, such as situation awareness, market analysis, content filtering, and recommendation. However, topic derivation with high purity is hard to achieve on Twitter because tweets are limited to 140 characters, and previous work on topic derivation in Twitter suffers from low purity. In this thesis, we propose a corpus-based topic derivation (CTD) approach that combines a Twitter corpus with Latent Feature LDA (LF-LDA), a text processing model, to identify topics and clusters of similar hashtags. We use asymmetric topic LF-LDA to obtain better topic purity. Compared to intJNMF, a representative related work, the proposed CTD improves purity by 5.26% to 11.32% and F-measure by 27.81% to 34.28% for 20 to 100 topics. We also propose a timestamp-based popular hashtag prediction (TPHP) approach that uses tweet timestamps to build trending hashtags lists (THLs), which are lists of the hashtags used by many users within a given period. We use the edit distance to measure the difference between consecutive THLs. This difference is then used to calculate volatility, which indicates how people react to real-world events. Compared to Hybrid+, a representative related work, the mean average precision of TPHP increases by 19.45% (week-day), 15.08% (week-week), and 16.95% (month-week).
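The THL and edit-distance steps named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The function names `build_thls` and `edit_distance`, the tuple layout `(user, epoch_seconds, hashtags)`, the fixed-width time windows, and the ranking by number of distinct users are all assumptions made for illustration. The abstract does not give the volatility formula, so the sketch stops at the distance between consecutive lists.

```python
from collections import defaultdict


def build_thls(tweets, window_seconds, top_k):
    """Build one trending hashtags list (THL) per time window.

    Assumption: `tweets` is an iterable of (user, epoch_seconds, hashtags)
    and a hashtag's popularity is the number of distinct users who used it
    inside the window. The thesis may rank hashtags differently.
    """
    users_per_tag = defaultdict(lambda: defaultdict(set))
    for user, epoch_seconds, hashtags in tweets:
        window = epoch_seconds // window_seconds
        for tag in hashtags:
            users_per_tag[window][tag.lower()].add(user)

    thls = []
    for window in sorted(users_per_tag):
        counts = users_per_tag[window]
        # Most distinct users first; ties broken alphabetically for stability.
        ranked = sorted(counts, key=lambda t: (-len(counts[t]), t))
        thls.append(ranked[:top_k])
    return thls


def edit_distance(a, b):
    """Levenshtein distance between two sequences (here, lists of hashtags)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical toy data: two 60-second windows.
tweets = [
    ("u1", 0, ["#a", "#b"]),
    ("u2", 10, ["#a"]),
    ("u3", 20, ["#c"]),
    ("u1", 100, ["#b"]),
    ("u2", 110, ["#b", "#c"]),
    ("u3", 115, ["#d"]),
]
thls = build_thls(tweets, window_seconds=60, top_k=3)
distances = [edit_distance(p, q) for p, q in zip(thls, thls[1:])]
print(thls)
print(distances)
```

A larger distance between consecutive lists means the set or order of trending hashtags changed more between the two periods, which is the quantity the abstract says feeds into volatility.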
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070256161
http://hdl.handle.net/11536/139744
Appears in Collections: Thesis