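Title: Development of a General Analysis Framework for Cross-Linking Immunoprecipitation High-Throughput Sequencing Based on the K-means-pHMM Machine Learning Algorithm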
A General CLIP-Seq data analysis framework based on a K-means-pHMM learning and clustering algorithm
Authors: Hsiao, Chiung-Po (蕭瓊柏)
Hung, Jui-Hung (洪瑞鴻)
Institute of Bioinformatics and Systems Biology
Keywords: High-throughput sequencing; CLIP-Seq; RNA-binding protein; microRNA; Unsupervised machine learning; Profile hidden Markov model; NGS; HITS-CLIP; PAR-CLIP; iCLIP; Profile HMM; Machine learning
Issue Date: 2016
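Abstract: RNA-binding proteins (RBPs) play important roles in living organisms, and the post-transcriptional regulation of RNAs depends on them. In recent years, the experimental technique CLIP-Seq (cross-linking immunoprecipitation high-throughput sequencing) has been developed to study the relationship between RBPs and RNAs. In CLIP-Seq, cells are irradiated with ultraviolet light to cross-link RNAs to the RBPs bound to them, the RBPs are captured by immunoprecipitation (IP), and the RNAs attached to the captured RBPs are extracted for high-throughput sequencing. When the captured RBP is Argonaute (AGO), the experiment yields not only the sequences of the bound RNAs but also many microRNAs (miRNAs), because AGO forms the RNA-induced silencing complex (RISC) with miRNAs. Several kinds of CLIP-Seq experiments now exist: HITS-CLIP, PAR-CLIP, and iCLIP. What is currently lacking is a general analysis framework that locates the binding sites of RBPs on RNAs, supports the prediction of miRNA-RNA binding relationships, and supports all existing types of CLIP-Seq technology. In this thesis, we propose a CLIP analysis pipeline built around K-means-pHMM. The pipeline is highly general and can analyze next-generation sequencing data from all three CLIP variants: HITS-CLIP, PAR-CLIP, and iCLIP. Simulation tests show that our unsupervised machine learning algorithm converges rapidly. Finally, we collected multiple CLIP-Seq datasets from NCBI, re-analyzed them, and observed molecular-biological phenomena consistent with previous studies.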
RNAs are regulated by RNA-binding proteins (RBPs) that bind to single- or double-stranded RNAs in cells. RBPs bind RNAs, function as ribonucleoprotein complexes, and are involved in splicing (e.g., U1 snRNP), RNA editing (e.g., ADAR), polyadenylation (e.g., CPSF), mRNA localization (e.g., ZBP1), post-transcriptional regulation (e.g., miRNA-RISC), etc. To understand the relationship between RBPs and RNAs, cross-linking immunoprecipitation followed by next-generation sequencing (CLIP-Seq) was developed. There are currently three major variants of CLIP-Seq: HITS-CLIP, PAR-CLIP, and iCLIP. Many algorithms have been proposed to define the binding sites; nevertheless, each of these methods can be applied to only one or a few CLIP-Seq variants, and their results are hard to integrate and compare. In this work, we propose a universal algorithm, GLIP, that can be applied to all three CLIP-Seq variants with strong performance and efficiency.
URI: http://etd.lib.nctu.edu.tw/cdrfb3/record/nctu/#GT070357201
http://hdl.handle.net/11536/139477
Appears in Collections: Thesis