Title: 運用分散式架構與資料探勘之細懸浮微粒預測 (PM2.5 Prediction Using a Distributed Architecture and Data Mining)
Data Mining for PM2.5 Prediction using Apache Spark
Authors: 劉康熙
劉敦仁
Liu, Kang-Hsi
Liu, Duen-Ren
College of Management, Information Management Program
Keywords: fine particulate matter; PM2.5; data mining; distributed system; time series; nonlinear regression
Issue Date: 2017
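Abstract: In recent years, health hazards caused by environmental pollution have risen sharply, and fine particulate matter (hereafter PM2.5) is an especially serious case. Beyond searching for solutions to PM2.5, using data mining to predict PM2.5 values is an important research topic. PM2.5 values are expressed as a time series. Most current predictions of time-series values use linear regression, but that approach cannot judge accurately when values change suddenly and sharply. As the volume of collected data grows, traditional centralized database systems and data analysis software are no longer adequate. In view of these two points, this study proposes using a distributed data analytics platform built on Apache Hadoop and Apache Spark to apply nonlinear regression (decision tree, random forest, and gradient-boosted tree) to PM2.5 values and predict the data for the next 24 hours. The results are compared with traditional linear regression to learn whether nonlinear regression can predict sharply changing data. The experimental results show that nonlinear regression predicts more accurately than linear regression when the data change suddenly and sharply.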
Environmental pollution poses growing health hazards, and particulate matter 2.5 (PM2.5) is one of the major pollutants. Accurately predicting pollution levels so that protective measures can be established is an important way to prevent those hazards. PM2.5 values are expressed as time series, and most predictions for time-series data rely on linear regression (LR). However, LR may not be effective for time series that contain abrupt changes in value. Moreover, as data volume grows, a distributed system for data analytics is more reasonable and affordable than a traditional centralized system. This research analyzes historical PM2.5 data to predict PM2.5 values for the next 24 hours using the machine learning methods provided by the Apache Spark platform. Data pre-processing, feature selection, and input parameter tuning are conducted to build prediction models with nonlinear regression (NLR) methods, including Decision Tree (DT), Random Forest (RF), and Gradient Boosting Tree (GBT). The research evaluates the performance of each model and compares the results with LR. The experimental results show that the NLR methods outperform LR when the PM2.5 value changes dramatically.
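The abstract does not say how the historical readings are turned into training samples. One common framing, shown here only as a minimal sketch and not as the thesis's method, is a sliding window: each sample holds the most recent hourly readings, and the target is the reading 24 hours later. The function name `make_supervised` and the parameters `n_lags` and `horizon` are hypothetical, and the choice of 24 lagged hours is an assumption.

```python
from typing import List, Tuple


def make_supervised(
    series: List[float], n_lags: int = 24, horizon: int = 24
) -> Tuple[List[List[float]], List[float]]:
    """Turn an hourly series into (features, targets) pairs.

    Each feature row holds n_lags consecutive readings. The target is the
    reading `horizon` hours after the last reading in that row.
    Assumption: this windowing is illustrative, not taken from the thesis.
    """
    features: List[List[float]] = []
    targets: List[float] = []
    # Last window start for which the target index is still inside the series.
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        window = series[start:start + n_lags]
        target = series[start + n_lags - 1 + horizon]
        features.append(window)
        targets.append(target)
    return features, targets


if __name__ == "__main__":
    # Synthetic hourly values, used only to show the shapes involved.
    hourly = [float(i) for i in range(60)]
    X, y = make_supervised(hourly)
    print(len(X), len(X[0]), y[0])
```

Rows built this way could then be passed to any regressor, linear or tree-based, so the same samples support the LR versus NLR comparison the abstract describes.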
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070463408
http://hdl.handle.net/11536/141410
Appears in Collections: Thesis