Title: | 產品比對的研究 The Study of Product Matching |
Authors: | 楊博宇 Po-Yu Yang; 吳毅成 I-Chen Wu; College of Computer Science, Computer Science Program |
Keywords: | Longest Common Subsequence; LCS; Sequence Comparison; String Matching; Data Extraction; Matching |
Issue Date: | 2004 |
Abstract: | Product information on the web is rich and varied, but finding the information one actually needs is not easy. The usual approach is to visit each relevant website and collect the data by hand, which is time-consuming and inconvenient. A more convenient approach is to store the product data in a database and then search it through a query interface. Even so, a query often returns a large set of similar products, and identical products still have to be picked out manually. The goal of this thesis is therefore to identify identical products automatically by matching product data.
A product's name and serial number are the key criteria for deciding whether two products are the same, and comparing the names and serial numbers of two products amounts to comparing two strings. Building on the concept of the Longest Common Subsequence (LCS), we propose the Longest and Most Common Segments (LMCS) algorithm, which computes a score for every pair of products; the higher the score between two products, the more similar they are. We tune the weights used in the LMCS computation and then apply a matching strategy to find the most similar products. After tuning, recall, precision, and similarity can all reach above 85%. |
URI: | http://140.113.39.130/cdrfb3/record/nctu/#GT008967587 http://hdl.handle.net/11536/80158 |
Appears in Collections: | Thesis |
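The abstract describes scoring product names by string comparison, with LCS as the starting point. As a rough illustration only, the sketch below computes a plain LCS-based similarity between two product names and picks the highest-scoring candidate. It is not the thesis's LMCS algorithm: the abstract does not give LMCS's segment scoring or weights, so none are reproduced here. The function names, the lowercasing step, and the normalization by the longer string's length are all assumptions made for this example.

```python
from typing import Iterable


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """Hypothetical score in [0, 1]: LCS length over the longer name's length."""
    a, b = a.lower(), b.lower()  # assumption: matching is case-insensitive
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def best_match(query: str, candidates: Iterable[str]) -> str:
    """Return the candidate name with the highest LCS similarity to query."""
    return max(candidates, key=lambda c: lcs_similarity(query, c))


if __name__ == "__main__":
    names = ["iPod mini 4 GB", "iPod shuffle 512MB"]
    print(best_match("iPod mini 4GB", names))
```

A plain LCS score of this kind treats every matched character equally, so two model numbers that differ by a single digit still score very high. That limitation is one plausible motivation for weighting longer contiguous segments, as the LMCS name suggests, though the abstract does not spell out the details.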