Full metadata record
DC Field | Value | Language
dc.contributor.author | 劉康熙 | zh_TW
dc.contributor.author | 劉敦仁 | zh_TW
dc.contributor.author | Liu, Kang-Hsi | en_US
dc.contributor.author | Liu, Duen-Ren | en_US
dc.date.accessioned | 2018-01-24T07:40:38Z | -
dc.date.available | 2018-01-24T07:40:38Z | -
dc.date.issued | 2017 | en_US
dc.identifier.uri | http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070463408 | en_US
dc.identifier.uri | http://hdl.handle.net/11536/141410 | -
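dc.description.abstract | In recent years, health hazards caused by environmental pollution have risen sharply, and fine particulate matter (hereafter PM2.5) is especially serious. Besides seeking solutions to PM2.5, using data mining to predict PM2.5 values is an important research topic. PM2.5 values are expressed as a time series. Most current time-series prediction uses linear regression, but this approach cannot make accurate judgments about values that change suddenly and sharply. As the volume of collected data grows, traditional centralized database systems and data analysis software are no longer adequate. In view of these two points, this study proposes using a distributed data analysis platform built with Apache Hadoop and Apache Spark to apply nonlinear regression (decision tree, random forest, and gradient boosting tree) to PM2.5 values to predict the data for the next 24 hours, and compares the results with the traditional linear regression approach to determine whether nonlinear regression can predict sharply changing data. The experimental results show that nonlinear regression predicts more accurately than linear regression when the data change suddenly and sharply. | zh_TW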
dc.description.abstract | We face growing health hazards from environmental pollution, and particulate matter 2.5 (PM2.5) is one of the major pollutants. Accurately predicting pollution levels so that protective measures can be put in place is an important way to prevent these hazards. PM2.5 values are expressed as a time series, and most time-series prediction relies on linear regression (LR). However, LR may not be effective for time-series data that contain abrupt changes in value. Moreover, as data volumes grow, a distributed system for data analytics is more reasonable and affordable than a traditional centralized system. This research analyzes historical PM2.5 data to predict PM2.5 values for the next 24 hours using the machine learning methods provided by the Apache Spark platform. Data pre-processing, feature selection, and input parameter tuning are conducted to build prediction models with non-linear regression (NLR) methods, including Decision Tree (DT), Random Forest (RF), and Gradient Boosting Tree (GBT). The research evaluates the performance of each model and compares the results with LR. The experimental results show that the NLR methods perform better than LR when the PM2.5 value changes dramatically. | en_US
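dc.language.iso | zh_TW | en_US
dc.subject | 細懸浮微粒 (fine particulate matter) | zh_TW
dc.subject | 資料探勘 (data mining) | zh_TW
dc.subject | 分散式系統 (distributed system) | zh_TW
dc.subject | 時間序列 (time series) | zh_TW
dc.subject | 非線性迴歸 (nonlinear regression) | zh_TW
dc.subject | PM2.5 | en_US
dc.subject | data mining | en_US
dc.subject | distributed system | en_US
dc.subject | time series | en_US
dc.subject | nonlinear regression | en_US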
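dc.title | 運用分散式架構與資料探勘之細懸浮微粒預測 (Fine Particulate Matter Prediction Using a Distributed Architecture and Data Mining) | zh_TW
dc.title | Data Mining for PM2.5 Prediction using Apache Spark | en_US
dc.type | Thesis | en_US
dc.contributor.department | 管理學院資訊管理學程 (Information Management Program, College of Management) | zh_TW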
Appears in Collections: Thesis
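
The abstract in this record describes predicting PM2.5 values 24 hours ahead from historical readings, and reports that tree-based regression copes better than linear regression with abrupt changes. The following is a minimal pure-Python sketch of those two ideas, not the thesis's Spark pipeline: the function names (`make_windows`, `fit_line`, `fit_stump`, `rmse`), the lag count, and the synthetic step-shaped series are all assumptions made for illustration. A depth-1 regression tree (a "stump") stands in for the decision tree, random forest, and gradient boosting tree models named in the abstract.

```python
from math import sqrt


def make_windows(series, lags, horizon):
    """Turn an hourly series into (features, target) pairs.

    features: the `lags` most recent readings up to hour t
    target:   the reading `horizon` hours after t
    """
    rows = []
    for t in range(lags - 1, len(series) - horizon):
        rows.append((series[t - lags + 1 : t + 1], series[t + horizon]))
    return rows


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b with a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    b = my - a * mx
    return lambda x: a * x + b


def fit_stump(xs, ys):
    """Depth-1 regression tree: one threshold, one mean per side."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, ml, mr)
    if best is None:  # every x is identical, so predict the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, thr, ml, mr = best
    return lambda x: ml if x <= thr else mr


def rmse(model, xs, ys):
    """Root-mean-square error of `model` on the given points."""
    return sqrt(sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Windowing: 3 lagged readings as features, the value 24 hours later as target.
hourly = list(range(30))  # placeholder readings, not real PM2.5 data
windows = make_windows(hourly, lags=3, horizon=24)

# Abrupt change: a synthetic series that jumps from 10 to 80 at hour 24.
hours = list(range(48))
values = [10.0] * 24 + [80.0] * 24
line_rmse = rmse(fit_line(hours, values), hours, values)
stump_rmse = rmse(fit_stump(hours, values), hours, values)
print(f"linear RMSE: {line_rmse:.2f}, stump RMSE: {stump_rmse:.2f}")
```

On this toy series the stump can place its threshold at the jump, while a single straight line cannot follow it, which is the kind of behaviour the abstract attributes to non-linear methods. In Spark itself, estimators such as `LinearRegression`, `DecisionTreeRegressor`, `RandomForestRegressor`, and `GBTRegressor` in `pyspark.ml.regression` would take the place of the hand-written fits; whether the thesis used that particular API is not stated in this record.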