Title: 小樣本多變數下選取重要變數之研究
Variables Selection with Small Sample Size and Large Number of Parameters
Author: 洪慧念
HUNG HUI-NIEN
Institute of Statistics, National Chiao Tung University
Date issued: 2009
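Abstract: Over the past decade or so, the invention of the gene chip has produced large amounts of high-density cDNA array data. These data share a common feature: the number of samples is small, but the number of genes is large. Problems of this kind can be addressed in two steps: first, how to select the important genes, and then how to use those genes in the analysis. In this project, we hope to carry out systematic theoretical research on both aspects. Few theoretical results exist for these problems. In recent years, Fan et al. proposed a theoretical basis for choosing an appropriate number of genes; their criterion is to make the classification success rate as high as possible. Some years ago, Bickel et al. proved that selecting too many genes does not lead to good classification results. A common assumption in the theoretical results is that as the number of measured genes grows, the number of genes affecting a particular disease also grows rapidly at a certain rate, and the number of observed samples likewise grows at a certain rate. In this project, we plan to consider the case where the sample size is fixed while the number of measurable genes grows quickly. If the number of genes affecting a disease is also fixed (or grows very slowly), how many genes should we select for the data analysis? Another focus of this project is to examine in more depth, and improve on, the paper published this year by Tibshirani and Hastie. Their study assumes that an important disease-causing gene is not abnormally expressed in every patient; roughly 20% to 100% of patients show stronger expression of the gene than normal individuals. We believe there is still considerable room to discuss and improve their method. We also hope to use mixture models to estimate the number of people with abnormal gene expression. Beyond these two focuses, we will analyze a recent breast cancer data set provided by the Moffitt Cancer Center & Research Institute, University of South Florida, and we hope our methods can be applied effectively to these data.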
With advances in biological technology, high-throughput data such as microarray data are frequently seen in research. These data sets usually contain only a few samples but a large number of variables. To analyze this kind of data, we first need to rank the importance of the variables (genes), and then choose an important subset of variables (genes) with which to analyze the microarray data (a classification problem). In this two-year project, we will try to solve these two problems systematically and obtain some theoretical results. Few theoretical results exist for these problems. In recent years, some researchers have obtained good theoretical results on finding a good subset of important genes. Some years ago, Bickel showed that if too many genes are used in a classification problem, the Fisher discriminant performs poorly. All of these theoretical results are large-sample results: they assume that as the number of variables (genes) goes to infinity, the numbers of samples in the normal group and the disease group both go to infinity, and that the number of important variables (genes) also goes to infinity. In this project, we will discuss the situation in which the sample size is fixed and the number of variables (genes) goes to infinity. We will also assume that the number of important genes is fixed (or goes to infinity slowly). Under these assumptions, we will try to find a good subset of genes for our data analysis. Another purpose of this project is to extend the result of Tibshirani and Hastie (2007). In their paper, they assume that only part of the people (20%~100%) in the disease group have abnormal gene expression. We hope to extend their method and find a better statistic for ranking the importance of the variables (genes). Finally, we will use our results to analyze a breast cancer data set provided by the Moffitt Cancer Center & Research Institute, University of South Florida. We hope that we can learn something from this new data set.
Official document #: NSC97-2118-M009-001-MY2
URI: http://hdl.handle.net/11536/101141
https://www.grb.gov.tw/search/planDetail?id=1755349&docId=299413
Appears in collections: Research Projects


Files in this item:

  1. 972118M009001MY2(第1年).PDF
  2. 972118M009001MY2(第2年).PDF
