Full metadata record
DC Field | Value | Language
dc.contributor.author | Yeh, Chao-Chun | en_US
dc.contributor.author | Lu, Han-Lin | en_US
dc.contributor.author | Zhou, Jiazheng | en_US
dc.contributor.author | Chang, Sheng-An | en_US
dc.contributor.author | Lin, Xuan-Yi | en_US
dc.contributor.author | Sun, Yi-Chiao | en_US
dc.contributor.author | Huang, Shih-Kun | en_US
dc.date.accessioned | 2020-07-01T05:22:13Z | -
dc.date.available | 2020-07-01T05:22:13Z | -
dc.date.issued | 2020-05-01 | en_US
dc.identifier.issn | 1016-2364 | en_US
dc.identifier.uri | http://dx.doi.org/10.6688/JISE.202005_36(3).0001 | en_US
dc.identifier.uri | http://hdl.handle.net/11536/154631 | -
dc.description.abstract | By ensuring well-developed complex big data platform architectures, data engineers provide data scientists and analysts with infrastructure offering the computational and storage resources needed to perform their research. With this support, data scientists can focus on their domain problems and design the required intelligent modules (i.e., prepare the data; select, train, and tune the machine-learning modules; and validate the results). However, there are still gaps between system engineering and data scientist/engineering teams. Generally, system engineers have limited knowledge of the application domains and the purposes of an analytical program. Conversely, both data scientists and engineers are usually unfamiliar with the configuration of a computational system, file system, and database. Yet the performance of an application can be affected by a system's configuration, and data scientists and engineers have little information about which of the system's properties affect the application's performance. For example, for Internet-scale applications that have thousands of computing nodes or billions of Internet of Things devices, even a slight improvement may have an enormous influence on energy management and environmental protection issues. To bridge the gap between system engineering and data scientist/engineering teams, we proposed the concept of a configuration layer based on a big data platform, Hadoop. We built a configuration tuner, BigExplorer, to collect and preprocess data. Furthermore, we also created golden configurations for performance improvement. Based on the processed data, we used a semi-automatic feature engineering technique to provide more features for data engineers and developed the performance model using three different machine learning algorithms (i.e., random forest, gradient boosting machine, and support vector machine). On the commonly used Word-Count, TeraSort, and Pig benchmark workloads, our configuration tuner achieved a significant performance improvement of 28%-51%, depending on the workload, over the rule-of-thumb configuration. | en_US
dc.language.iso | en_US | en_US
dc.subject | big data platform | en_US
dc.subject | machine learning | en_US
dc.subject | configuration optimization | en_US
dc.subject | learning by design | en_US
dc.subject | algorithms | en_US
dc.title | Big Data Platform Configuration Using Machine Learning | en_US
dc.type | Article | en_US
dc.identifier.doi | 10.6688/JISE.202005_36(3).0001 | en_US
dc.identifier.journal | JOURNAL OF INFORMATION SCIENCE AND ENGINEERING | en_US
dc.citation.volume | 36 | en_US
dc.citation.issue | 3 | en_US
dc.citation.spage | 469 | en_US
dc.citation.epage | 493 | en_US
dc.contributor.department | 資訊工程學系 | zh_TW
dc.contributor.department | 資訊技術服務中心 | zh_TW
dc.contributor.department | Department of Computer Science | en_US
dc.contributor.department | Information Technology Services Center | en_US
dc.identifier.wosnumber | WOS:000537594300001 | en_US
dc.citation.woscount | 0 | en_US
Appears in Collections: Journal Articles