Title: 文字數字相混型資料與分散式資料庫之分群
Clustering of Mixed-Types Data and Data from Distributed Databases
Author: 林志青
LIN JA-CHEN
Department of Computer Science, National Chiao Tung University
Keywords: mixed-type data (categorical vs. numeric); distributed data; fast clustering of data; data mining; data classification
Issue date: 2015
Abstract: This three-year project designs clustering techniques for different types of data and for different ways of storing data. Year 1 covers the acceleration of k-means clustering and a progressive adaptive classifier. Year 2 designs a clustering method for mixed-type data containing both categorical and numeric attributes; the method does not require the user to supply the number of clusters. Year 3 designs a distributed clustering method in which the data reside at multiple sites, each site clusters its local data with its own clustering method, and a central site integrates the local results. The integration is designed separately for homogeneous (same-property) and heterogeneous (distinct-property) data. The k-means acceleration in Year 1 uses, among other tools, suitably modified forms of Hölder's inequality and Minkowski's inequality, combined with preprocessing, to make this widely known method faster. The second Year 1 topic, the progressive adaptive classifier, reduces the time needed to classify new data after clustering. The Year 2 method is motivated by the growing number of organizations whose data contain both numbers and text, such as banks, insurance companies, government agencies, social networks, dating agencies, and hospitals; such a method should help them with data analysis, data mining, and customer acquisition. The Year 3 method allows data to be stored at, or collected from, different places, and the local clustering methods at different sites may differ. Such a design is often needed because data sets keep growing and much data is originally created locally: the data held by a central government is largely the union of data sets collected by local governments in cities throughout the country.
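The abstract does not say how the project modifies Hölder's and Minkowski's inequalities, so the following is only a minimal sketch of the general idea behind inequality-based k-means acceleration, not the project's method. It assumes Euclidean distance and uses the plain triangle inequality (the p = 2 case of Minkowski's inequality) to skip distance computations in the assignment step: if d(c_best, c_j) >= 2 d(x, c_best), then center c_j cannot be closer to x than c_best. The function names `assign_with_pruning` and `assign_brute_force` are hypothetical and chosen for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center.

    Returns (labels, skipped), where skipped counts the point-to-center
    distance computations avoided by the triangle-inequality test.
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    cc = [[dist(centers[i], centers[j]) for j in range(k)] for i in range(k)]
    labels, skipped = [], 0
    for p in points:
        best, best_d = 0, dist(p, centers[0])
        for j in range(1, k):
            # Triangle inequality: d(p, c_j) >= d(c_best, c_j) - d(p, c_best).
            # If d(c_best, c_j) >= 2 * d(p, c_best), c_j cannot be closer.
            if cc[best][j] >= 2.0 * best_d:
                skipped += 1
                continue
            d = dist(p, centers[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centers):
    """Reference assignment that computes every distance."""
    return [min(range(len(centers)), key=lambda j: dist(p, centers[j]))
            for p in points]


if __name__ == "__main__":
    # Toy data, invented for this illustration only.
    points = [(0.0, 0.0), (0.1, 0.2), (10.0, 10.0), (10.2, 9.9), (0.0, 10.1)]
    centers = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    labels, skipped = assign_with_pruning(points, centers)
    print(labels, "skipped distance computations:", skipped)
```

The pruned assignment returns the same labels as the brute-force version; the saving comes only from the skipped distance evaluations, which grows when clusters are well separated.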
Official document no.: MOST103-2221-E009-119-MY3
URI: http://hdl.handle.net/11536/130502
https://www.grb.gov.tw/search/planDetail?id=11269553&docId=454777
Appears in collections: Research Plans