Title: 分群與合併的多元尺度分析法之最佳分群決策與遺失值問題的討論
Optimal Grouping and Missing Data Handling for Split-and-Combine Multidimensional Scaling
Authors: 陳珮琦
Pei-Chi Chen
盧鴻興
Henry Horng-Shing Lu
Institute of Statistics
Keywords: multidimensional scaling; MDS; large sample size; missing value
Issue Date: 2007
Abstract: Multidimensional scaling (MDS) is an important method in data mining. It has two main purposes: (1) to represent data as point coordinates in space, based on the given pairwise distances, without losing the dissimilarities between them; and (2) to reduce the dimensionality of the data so that they can be displayed visually, which makes hidden features easier to find. Because there are many types of MDS, this thesis focuses on classical multidimensional scaling (CMDS). CMDS is computationally expensive, which makes it hard to apply to data with a large sample size. Tzeng, Lu and Li (2008) therefore proposed split-and-combine multidimensional scaling (SC-MDS) to address this problem. SC-MDS has two important parameters: (1) the number of points each group shares with its neighboring group, NI, and (2) the size of each group, Ng. Both strongly affect the performance of SC-MDS, so how to choose them is one of the main topics of this thesis. We suggest that NI be at least the dimensionality of the data plus one, and that Ng be about 1.51 times NI, for SC-MDS to perform best. We also discuss the situations that can arise when SC-MDS splits the data, and we revise the method for combining groups: rather than fixing one particular group as the center and aligning all other groups to it, the dimensions of the two groups being combined should be compared, and the group with lower dimensionality aligned to the group with higher dimensionality. Consequently, as long as any one group has the same dimension as the whole data set, the CMDS and SC-MDS configurations span the same space; a proof and discussion are given in the thesis. The other main topic is using the basic idea of SC-MDS to handle missing values. We do not impute the missing values; instead, we reorder the objects so that no subgroup contains a missing value when its MDS is computed. In simulations, this raises the tolerable proportion of missing values from 20% to 30%.
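The split-and-combine idea summarized in the abstract can be illustrated with a short Python sketch. It is not the thesis's code, and it rests on several assumptions. The function names (classical_mds, align, sc_mds, suggested_parameters) are hypothetical. Groups are taken as consecutive index blocks. Each new group is chained onto the previously placed one by an orthogonal Procrustes fit on the shared points, which stands in for the thesis's combination step and does not implement its rule of aligning the lower-dimensional group to the higher-dimensional one. Rounding Ng up to an integer is also an assumption.

```python
import numpy as np

def classical_mds(D, dim):
    """Classical MDS: coordinates from a matrix of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    top = np.argsort(w)[::-1][:dim]           # indices of the largest eigenvalues
    return V[:, top] * np.sqrt(np.clip(w[top], 0.0, None))

def align(source, target):
    """Orthogonal Q and shift t such that source @ Q + t best matches target."""
    mu_s, mu_t = source.mean(axis=0), target.mean(axis=0)
    U, _, Vt = np.linalg.svd((source - mu_s).T @ (target - mu_t))
    Q = U @ Vt                                # orthogonal Procrustes solution
    return Q, mu_t - mu_s @ Q

def sc_mds(D, dim, n_overlap, group_size):
    """Sketch of split-and-combine MDS over consecutive, overlapping groups."""
    n = D.shape[0]
    X = np.zeros((n, dim))
    start, first = 0, True
    while True:
        end = min(start + group_size, n)
        idx = np.arange(start, end)
        Y = classical_mds(D[np.ix_(idx, idx)], dim)   # embed this group alone
        if first:
            X[idx], first = Y, False
        else:
            # The first n_overlap points of this group were already placed.
            Q, t = align(Y[:n_overlap], X[idx[:n_overlap]])
            X[idx] = Y @ Q + t
        if end == n:
            return X
        start += group_size - n_overlap

def suggested_parameters(dim):
    """Rule of thumb from the abstract: NI >= dim + 1, Ng about 1.51 * NI."""
    n_overlap = dim + 1
    group_size = int(np.ceil(1.51 * n_overlap))  # rounding up is an assumption
    return n_overlap, group_size

# Usage: compare recovered and original distances for random 2-D points.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
coords = sc_mds(dist, dim=2, n_overlap=5, group_size=12)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print("max distance error:", np.max(np.abs(dist - recovered)))
```

Each group costs one eigendecomposition of a small matrix instead of one decomposition of the full distance matrix, which is where the savings over CMDS come from. The usage example uses a larger overlap than the minimum so that the alignment stays well conditioned; with the minimum of dim + 1 shared points, nearly degenerate overlaps can amplify numerical error.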
URI: http://140.113.39.130/cdrfb3/record/nctu/#GT009526507
http://hdl.handle.net/11536/38989
Appears in Collections: Thesis


Files in This Item:

  1. 650701.pdf
