Title: Online Learning Method Based on the Chinese Restaurant Process
Online Chinese Restaurant Process
Authors: 蔡宗勳
Tsai, Tsung-Hsun
李嘉晃
Lee, Chia-Hoang
Institute of Computer Science and Engineering
Keywords: Online Learning; Chinese Restaurant Process; Classification; Non-parametric
Issue Date: 2013
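Abstract: Data in many fields has gradually grown into big data, and many traditional machine learning methods can no longer handle it. Online learning methods update the model dynamically and need to load only one instance into memory at a time, so they can process large volumes of data in real time and are therefore one way to address big data. In addition, when handling big data it is difficult to fix parameters before training the model; model parameters can often be obtained only through expert experience or experimental testing. Bayesian nonparametric models provide a way for the number-of-clusters parameter to be determined by the characteristics of the data, which makes them suitable for big data. The Chinese Restaurant Process originated in probability theory as a stochastic process describing the distribution of a set of partitions of a space. When it is mapped to a procedure for sampling from a Dirichlet Process, multiple sets of parameters can be sampled from one distribution, and each set of parameters in turn represents a distribution. The method proposed in this thesis extends the concept of online learning to the Chinese Restaurant Process and uses each training instance seen during online learning to influence the estimation of the parameters in the probabilistic model, thereby building the whole model. In the experiments, when the amount of data is large, the proposed Online CRP not only reaches the standard of supervised learning methods in classification performance but also runs faster than many methods, verifying that it can handle big data problems accurately and efficiently.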
The rise of big data gives enterprises an opportunity to use data analytics to gain a competitive advantage, but it also brings challenges in processing, managing, and analyzing large data sets. One typical challenge is processing large volumes of streaming data in real time. Online machine learning allows the model to learn one instance at a time, with the model updated according to the prediction result and the true label of the instance. Compared with batch machine learning algorithms, online machine learning is better suited to processing streaming data, and it can adjust the learning model as more new, unseen data arrive. Besides online processing, parameter selection is an important part of model selection in machine learning, but it is generally done with heuristic rules or with cross-validation on a validation set. In big data processing, parameters should adapt to the data rather than stay fixed. Nonparametric Bayesian models provide a means for the model to adapt its parameters to the data. This study proposes an online Chinese Restaurant Process algorithm, which extends the Chinese Restaurant Process (CRP). The proposed algorithm is both online and nonparametric, so it can process streaming data efficiently while its parameters adapt to the data. Unlike CRP, the proposed algorithm is an online algorithm, in which we use regret theory to design a new prior and likelihood function based on the consistency between the true label and the prediction result. In the experiments, the proposed algorithm works well on large data sets and generally outperforms the other online machine learning algorithms.
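For readers unfamiliar with the CRP that the abstract builds on, the following is a minimal sketch of the standard (textbook) seating process only. It is not the thesis's Online CRP, whose regret-based prior and likelihood are not specified in this record. The function name crp_assignments and the parameter alpha (the usual concentration parameter) are illustrative assumptions. Each new customer joins an existing table with probability proportional to the number of customers already seated there, or opens a new table with probability proportional to alpha, so the number of tables (clusters) grows with the data instead of being fixed in advance.

```python
import random


def crp_assignments(n_customers, alpha, seed=None):
    """Seat customers one at a time under the standard CRP.

    Illustrative sketch only: `alpha` (> 0) is the concentration
    parameter; names here are assumptions, not the thesis's notation.
    """
    rng = random.Random(seed)
    table_sizes = []   # table_sizes[k] = customers currently at table k
    assignments = []   # assignments[i] = table index chosen by customer i
    for _ in range(n_customers):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[k] += 1     # join an existing table
        assignments.append(k)
    return assignments, table_sizes


if __name__ == "__main__":
    assignments, sizes = crp_assignments(1000, alpha=1.0, seed=0)
    print("tables opened:", len(sizes))
    print("largest tables:", sorted(sizes, reverse=True)[:5])
```

Because customers arrive one at a time and only the table counts are kept, the process is naturally sequential, which is the property an online variant can exploit.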
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT070056109
http://hdl.handle.net/11536/73258
Appears in Collections: Thesis


Files in This Item:

  1. 610901.pdf
