Title: Duplicate Advertisement Detection on the Apache Spark Platform
Authors: Huang, Yu-Shan
Yang, Chyan
Chen, An-Pin
Information Management Program, College of Management
Keywords: binary classification; duplicate advertisement detection; distributed machine learning
Issue date: 2017
Abstract: In this thesis, we use machine-learning binary classification to detect duplicate advertisements. On a shopping website, duplicate advertisements for the same product cause harm in several ways, including a worse user experience for buyers and higher operating costs for the site. The goal of duplicate advertisement detection is to decide, given a pair of advertisements, whether both advertise the same product, which makes it a binary classification problem.
We obtained our dataset from the public machine learning competition website Kaggle.com, where the Russian company Avito released a set of training data. On a distributed Spark framework that we set up, we applied decision trees, random forests, naive Bayes, logistic regression, support vector machines, and artificial neural networks to the problem. Features were extracted with one-hot encoding and word2vec. Evaluation by the area under the ROC curve (AUC) supports the effectiveness of these methods.
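As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of how the area under the ROC curve can be computed for a set of advertisement pairs. It is not the thesis's own code: the function name `roc_auc`, the label convention (1 = duplicate pair, 0 = distinct pair), and the sample scores are assumptions made for this example. It uses the rank-sum (Mann-Whitney U) identity in plain Python rather than any Spark API.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for a duplicate pair, 0 for a distinct pair (assumed convention).
    scores: classifier scores, higher meaning "more likely duplicate".
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n

    # Assign 1-based ranks, giving tied scores their average rank.
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    # Sum of ranks of the positive examples, converted to the U statistic.
    rank_sum = sum(rank for rank, (_, label) in zip(ranks, pairs) if label == 1)
    u_statistic = rank_sum - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Hypothetical scores for four advertisement pairs.
example_labels = [0, 0, 1, 1]
example_scores = [0.1, 0.4, 0.35, 0.8]
example_auc = roc_auc(example_labels, example_scores)
```

The returned value is the probability that a randomly chosen duplicate pair receives a higher score than a randomly chosen distinct pair, so 0.5 corresponds to chance and 1.0 to perfect separation.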
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070463423
http://hdl.handle.net/11536/140763
Appears in collections: Thesis