Title: 以Apache Spark平台為基礎之重複廣告分析
Duplicate Advertisements Detection on Apache Spark Platform
Authors: 黃郁珊
楊千
陳安斌
Huang, Yu-Shan
Yang, Chyan
Chen, An-Pin
College of Management, Information Management Program
Keywords: Binary classification; duplicate advertisement detection; distributed machine learning
Issue date: 2017
Abstract: In this thesis, we apply machine-learning binary classification to detect duplicate advertisements. On shopping websites, duplicate advertisements for the same product cause harm on several fronts, including a worse user experience for buyers and higher operating costs for the site. The goal of duplicate advertisement detection is to decide, given a pair of advertisements, whether both advertise the same product, which is a binary classification problem. Our data set is the training data released by the Russian company Avito on Kaggle.com, a public machine learning competition website. On a distributed Spark framework that we set up, we apply decision trees, random forests, naive Bayes, logistic regression, support vector machines, and artificial neural networks to this problem. Features are extracted using one-hot encoding and word2vec. We evaluate the results by the area under the ROC (receiver operating characteristic) curve, which indicates that these methods are effective.
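The evaluation metric named in the abstract, area under the ROC curve, can be illustrated with a minimal sketch. This is not the thesis's Spark implementation. It is a plain-Python illustration that uses the rank-sum (Mann-Whitney U) formulation, and the function name `roc_auc` and the toy labels and scores are assumptions made for this example only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 (1 = the pair is a duplicate; hypothetical encoding)
    scores: iterable of classifier scores, higher = more likely duplicate
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    # U statistic: rank sum of positives minus its minimum possible value.
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy data (made up for illustration): four advertisement pairs.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

A value of 1.0 means every duplicate pair is scored above every non-duplicate pair, and 0.5 corresponds to a classifier that ranks them no better than chance.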
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070463423
http://hdl.handle.net/11536/140763
Appears in Collections: Thesis