Phylogenetic tree construction using trinucleotide usage profile (TUP)

doi:10.1186/s12859-016-1222-3

Full metadata record

DC Field	Value	Language
dc.contributor.author	Chen, Si	en_US
dc.contributor.author	Deng, Lih-Yuan	en_US
dc.contributor.author	Bowman, Dale	en_US
dc.contributor.author	Shiau, Jyh-Jen Horng	en_US
dc.contributor.author	Wong, Tit-Yee	en_US
dc.contributor.author	Madahian, Behrouz	en_US
dc.contributor.author	Lu, Henry Horng-Shing	en_US
dc.date.accessioned	2019-04-03T06:44:15Z	-
dc.date.available	2019-04-03T06:44:15Z	-
dc.date.issued	2016-10-06	en_US
dc.identifier.issn	1471-2105	en_US
dc.identifier.uri	http://dx.doi.org/10.1186/s12859-016-1222-3	en_US
dc.identifier.uri	http://hdl.handle.net/11536/145549	-
dc.description.abstract	Background: Building a genome-wide phylogenetic tree is challenging for a large group of species whose genomes contain many genes with long nucleotide sequences. The most popular method, called feature frequency profile (FFP-k), finds the frequency distribution of all words of a given length k over the whole genome sequence using (overlapping) windows of the same length. For a satisfactory result, the recommended word length (k) ranges from 6 to 15, and it may not be a multiple of 3 (the codon length). The total number of possible words needed for FFP-k can range from 4^6 = 4096 to 4^15. Results: We propose a simple improvement over the popular FFP method using only a typical word length of 3. The new method, called Trinucleotide Usage Profile (TUP), is based only on the (relative) frequency distribution using non-overlapping windows of length 3. The total number of possible words needed for TUP is 4^3 = 64, which is much smaller than the total count for the recommended optimal "resolution" for FFP. To build a phylogenetic tree, we propose first representing each species by a TUP vector and then using an appropriate distance measure between pairs of TUP vectors for the tree construction. In particular, we propose summarizing a DNA sequence by a matrix of three rows corresponding to the three reading frames, recording the frequency distribution of the non-overlapping words of length 3 in each reading frame. We also provide a numerical measure for comparing trees constructed with various methods. Conclusions: In our empirical study, the proposed TUP method built phylogenetic trees with stronger biological support than the FFP method. We further provide some justification for this from the information theory viewpoint. Unlike the FFP method, the TUP method takes advantage of the fact that the start of the first reading frame is (usually) known. Without this information, the FFP method can rely only on the frequency distribution of overlapping words, which is the average (or mixture) of the frequency distributions of the three possible reading frames. Consequently, we show (from the entropy viewpoint) that the FFP procedure could dilute important gene information and therefore provide less accurate classification.	en_US
dc.language.iso	en_US	en_US
dc.subject	Feature frequency profile (FFP)	en_US
dc.subject	Reading frame	en_US
dc.subject	Summary statistics	en_US
dc.subject	Phylogenetic tree construction	en_US
dc.subject	Tree comparison	en_US
dc.title	Phylogenetic tree construction using trinucleotide usage profile (TUP)	en_US
dc.type	Article	en_US
dc.identifier.doi	10.1186/s12859-016-1222-3	en_US
dc.identifier.journal	BMC BIOINFORMATICS	en_US
dc.citation.volume	17	en_US
dc.citation.spage	0	en_US
dc.citation.epage	0	en_US
dc.contributor.department	統計學研究所	zh_TW
dc.contributor.department	Institute of Statistics	en_US
dc.identifier.wosnumber	WOS:000402048800013	en_US
dc.citation.woscount	2	en_US
Appears in Collections:	Articles
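
The abstract describes summarizing a DNA sequence by a matrix of three rows, one per reading frame, each holding the relative frequencies of non-overlapping words of length 3. The following is a minimal Python sketch of that idea, not the authors' implementation. The function names (tup_matrix, overlapping_profile, entropy), the lexicographic ordering of the 64 trinucleotides, and the choice to skip words containing characters other than A, C, G, T are all assumptions made for illustration. The sketch also computes the overlapping 3-word profile so that the "average (or mixture)" remark in the abstract can be checked numerically.

```python
from collections import Counter
from itertools import product
import math

# All 4^3 = 64 trinucleotides in lexicographic order (ordering is an assumption).
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def _profile(words):
    """Relative frequencies over the 64 trinucleotides; other words are skipped."""
    counts = Counter(words)
    total = sum(counts[w] for w in TRINUCLEOTIDES)
    return [counts[w] / total if total else 0.0 for w in TRINUCLEOTIDES]


def tup_matrix(seq):
    """3 x 64 matrix: row f is the profile of non-overlapping 3-mers in frame f."""
    seq = seq.upper()
    return [
        _profile(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        for frame in range(3)
    ]


def overlapping_profile(seq):
    """Profile of all overlapping 3-mers (a window sliding by one base)."""
    seq = seq.upper()
    return _profile(seq[i:i + 3] for i in range(len(seq) - 2))


def entropy(profile):
    """Shannon entropy in bits of a relative-frequency vector."""
    return -sum(p * math.log2(p) for p in profile if p > 0)


if __name__ == "__main__":
    # Toy sequence, not data from the paper.
    seq = "ATGGCCATGGCCAT"
    rows = tup_matrix(seq)
    mixed = overlapping_profile(seq)
    for frame, row in enumerate(rows):
        print("frame", frame, "entropy:", round(entropy(row), 3))
    print("overlapping entropy:", round(entropy(mixed), 3))
```

When the three frames contain the same number of words, the overlapping profile equals the plain average of the three rows; otherwise it is a mixture weighted by the word count of each frame. Because Shannon entropy is concave, the entropy of such a mixture is never smaller than the correspondingly weighted average of the per-frame entropies, which is one way to read the abstract's statement that pooling the frames can dilute information.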

Files in This Item:

730f598ee88cc8d37c23f17ef0f17092.pdf
