PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis

doi:10.1002/prot.21944

Full metadata record

DC Field	Value	Language
dc.contributor.author	Chang, Jia-Ming	en_US
dc.contributor.author	Su, Emily Chia-Yu	en_US
dc.contributor.author	Lo, Allan	en_US
dc.contributor.author	Chiu, Hua-Sheng	en_US
dc.contributor.author	Sung, Ting-Yi	en_US
dc.contributor.author	Hsu, Wen-Lian	en_US
dc.date.accessioned	2014-12-08T15:11:04Z	-
dc.date.available	2014-12-08T15:11:04Z	-
dc.date.issued	2008-08-01	en_US
dc.identifier.issn	0887-3585	en_US
dc.identifier.uri	http://dx.doi.org/10.1002/prot.21944	en_US
dc.identifier.uri	http://hdl.handle.net/11536/8489	-
dc.description.abstract	Prediction of protein subcellular localization (PSL) is important for genome annotation, protein function prediction, and drug discovery. Many computational approaches for PSL prediction based on protein sequences have been proposed in recent years for Gram-negative bacteria. We present PSLDoc, a method based on gapped-dipeptides and probabilistic latent semantic analysis (PLSA) to solve this problem. A protein is considered as a term string composed by gapped-dipeptides, which are defined as any two residues separated by one or more positions. The weighting scheme of gapped-dipeptides is calculated according to a position specific score matrix, which includes sequence evolutionary information. Then, PLSA is applied for feature reduction, and reduced vectors are input to five one-versus-rest support vector machine classifiers. The localization site with the highest probability is assigned as the final prediction. It has been reported that there is a strong correlation between sequence homology and subcellular localization (Nair and Rost, Protein Sci 2002;11:2836-2847; Yu et al., Proteins 2006;64:643-651). To properly evaluate the performance of PSLDoc, a target protein can be classified into low- or high-homology data sets. PSLDoc's overall accuracy of low- and high-homology data sets reaches 86.84% and 98.219%, respectively, and it compares favorably with that of CELLO II (Yu et al., Proteins 2006;64:643-651). In addition, we set a confidence threshold to achieve a high precision at specified levels of recall rates. When the confidence threshold is set at 0.7, PSLDoc achieves 97.89% in precision, which is considerably better than that of PSORTb v.2.0 (Gardy et al., Bioinformatics 2005;21:617-623). Our approach demonstrates that the specific feature representation for proteins can be successfully applied to the prediction of protein subcellular localization and improves prediction accuracy. Besides, because of the generality of the representation, our method can be extended to eukaryotic proteomes in the future. The web server of PSLDoc is publicly available at http://bio-cluster.iis.sinica.edu.tw/~bioapp/PSLDoc/.	en_US
dc.language.iso	en_US	en_US
dc.subject	protein subcellular localization	en_US
dc.subject	document classification	en_US
dc.subject	vector space model	en_US
dc.subject	gapped-dipeptides	en_US
dc.subject	probabilistic latent semantic analysis	en_US
dc.subject	support vector machines	en_US
dc.title	PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis	en_US
dc.type	Article	en_US
dc.identifier.doi	10.1002/prot.21944	en_US
dc.identifier.journal	PROTEINS-STRUCTURE FUNCTION AND BIOINFORMATICS	en_US
dc.citation.volume	72	en_US
dc.citation.issue	2	en_US
dc.citation.spage	693	en_US
dc.citation.epage	710	en_US
dc.contributor.department	生物資訊及系統生物研究所	zh_TW
dc.contributor.department	Institute of Bioinformatics and Systems Biology	en_US
dc.identifier.wosnumber	WOS:000257156500014	-
dc.citation.woscount	22	-
Appears in Collections:	Articles

Files in This Item:

000257156500014.pdf
