A Study of Statistically Significant Difference Pattern Detection (關於統計上顯著性差異模式探索之研究)

Title:	關於統計上顯著性差異模式探索之研究 A Study of Statistically Significant Difference Pattern Detection
Authors:	羅仁杰 (Ren-Jei Luo); 曾憲雄 (Dr. Shian-Shyoung Tseng); Institute of Computer Science and Engineering (資訊科學與工程研究所)
Keywords:	資料探勘; 顯著性差異; 資料倉儲; 線上分析處理; 數位落差; Data Mining; Significant Difference; Data Warehousing; On-Line Analytic Processing; OLAP; Digital Divide
Issue Date:	2005
Abstract:	Traditional questionnaire analysis relies heavily on the analyst's experience and on repeated trial and error, so significant differences are easily overlooked or never recognized, which in turn affects the results of the analysis. For example, digital divide researchers may focus on how gender affects grades while missing, or dismissing as unimportant, other possible causes of grade differences (e.g., parents’ education, living location, parents’ occupation). We call this the “Significant Difference Unawareness Problem”. To address it, we propose replacing the traditional experience-driven, hypothesis-based style of analysis with a semi-automatic, discovery-based one. This requires richer data and a more flexible analysis method, so we apply data warehousing techniques. A data warehouse not only supports thorough data processing but also provides a convenient online analysis tool, OLAP. OLAP itself, however, was not designed to solve the unawareness problem, so we study how to detect, within this multi-dimensional, multi-level structure, all statistically significant patterns that point to the causes of a difference. We give a complete definition of this problem and call it Significant Difference Pattern Detection (SDPD). Introducing a data warehouse raises two problems of its own: the data volume is huge and the combinations of dimensions are very complex. We therefore propose a greedy algorithm, WISDOM (Wisely Imaginable Significant Difference Observation Mechanism), to solve the SDPD problem.
WISDOM consists of two major processes: (1) the Data Reduction Process, whose “sensitive-less data filtering” heuristic reduces the data size and organizes the data; and (2) the SD Pattern Mining Process, whose significant difference pattern determination heuristic determines whether a significant difference exists for a single dimension (e.g., gender) versus a single measure (e.g., grades). The discovered patterns are then handed to domain experts for reference and use.
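The idea of testing a single dimension against a single measure can be illustrated with a small sketch. This is not the WISDOM algorithm, whose heuristics the abstract does not specify. It is a minimal stand-in that assumes a one-way ANOVA F statistic with a permutation test as the significance check. The function names, the `alpha` threshold, and the toy survey rows are all hypothetical and serve only as illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    values = [v for g in groups for v in g]
    k, n = len(groups), len(values)
    if k < 2 or n <= k:
        return 0.0
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def group_values(labels, values):
    """Group measure values by their dimension label."""
    groups = defaultdict(list)
    for label, value in zip(labels, values):
        groups[label].append(value)
    return list(groups.values())


def permutation_p_value(rows, dimension, measure, trials=2000, seed=0):
    """Estimate how often shuffled data gives an F at least as large."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    labels = [row[dimension] for row in rows]
    values = [row[measure] for row in rows]
    observed = f_statistic(group_values(labels, values))
    hits = 0
    for _ in range(trials):
        rng.shuffle(values)  # break any link between label and value
        if f_statistic(group_values(labels, values)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)


def significant_dimensions(rows, dimensions, measure, alpha=0.05):
    """Return {dimension: p-value} for dimensions that pass the test."""
    found = {}
    for dimension in dimensions:
        p = permutation_p_value(rows, dimension, measure)
        if p < alpha:
            found[dimension] = p
    return found


# Hypothetical toy survey: grades differ by parents' education, not gender.
high = [88, 90, 85, 92, 87, 91]
low = [62, 65, 60, 68, 63, 66]
rows = [
    {"gender": "F" if i % 2 == 0 else "M", "parent_education": level, "grade": g}
    for level, grades in (("high", high), ("low", low))
    for i, g in enumerate(grades)
]

patterns = significant_dimensions(rows, ["gender", "parent_education"], "grade")
print(sorted(patterns))
```

Scanning every candidate dimension, rather than only the one the analyst already suspects, is what lets a dimension such as parents' education surface even when the analyst was looking at gender. A real system working on a data warehouse would also need to handle dimension hierarchies and multiple-comparison effects, which this sketch ignores.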
URI:	http://140.113.39.130/cdrfb3/record/nctu/#GT009323573 http://hdl.handle.net/11536/79100
Appears in Collections:	Thesis

Files in This Item:

357301.pdf