Big Data Platform Configuration Using Machine Learning

doi:10.6688/JISE.202005_36(3).0001

Title:	Big Data Platform Configuration Using Machine Learning
Authors:	Yeh, Chao-Chun; Lu, Han-Lin; Zhou, Jiazheng; Chang, Sheng-An; Lin, Xuan-Yi; Sun, Yi-Chiao; Huang, Shih-Kun
Affiliations:	Department of Computer Science; Information Technology Services Center
Keywords:	big data platform; machine learning; configuration optimization; learning by design; algorithms
Publication date:	1-May-2020
Abstract:	By maintaining well-developed, complex big data platform architectures, data engineers provide data scientists and analysts with the computational and storage infrastructure they need for their research. With this support, data scientists can focus on their domain problems and design the required intelligent modules (i.e., prepare the data; select, train, and tune the machine-learning modules; and validate the results). However, gaps remain between system engineering teams and data science/engineering teams. System engineers generally have limited knowledge of the application domain and the purpose of an analytical program. Conversely, data scientists and data engineers are usually unfamiliar with the configuration of the computational system, file system, and database. Yet an application's performance can depend on the system's configuration, and data scientists and engineers have little knowledge of which system properties affect it. For example, in Internet-scale applications with thousands of computing nodes or billions of Internet of Things devices, even a slight improvement can have an enormous effect on energy management and environmental protection. To bridge this gap, we proposed the concept of a configuration layer based on the big data platform Hadoop. We built a configuration tuner, BigExplorer, to collect and preprocess data, and we created golden configurations for performance improvement. Based on the processed data, we used a semi-automatic feature engineering technique to provide more features for data engineers and developed the performance model using three different machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector machine). On the commonly used Word-Count, TeraSort, and Pig benchmark workloads, our configuration tuner improved performance by 28%-51%, depending on the workload, compared with the rule-of-thumb configuration.
URI:	http://dx.doi.org/10.6688/JISE.202005_36(3).0001 http://hdl.handle.net/11536/154631
ISSN:	1016-2364
DOI:	10.6688/JISE.202005_36(3).0001
Journal:	JOURNAL OF INFORMATION SCIENCE AND ENGINEERING
Volume:	36
Issue:	3
Start page:	469
End page:	493
Appears in Collections:	Articles