Efficient alternatives to PSI-BLAST

Open access


In this paper we present two algorithms that may serve as efficient alternatives to the well-known PSI-BLAST tool: SeedBLAST and CTX-PSI BLAST. Both can exploit knowledge of the amino acid composition specific to a given protein family: SeedBLAST uses a deliberately designed seed, while CTX-PSI BLAST extends PSI-BLAST with a context-specific substitution model.

Seeding has become a central technique in sequence alignment. Several efficient tools apply seeds to DNA homology search, but not to protein homology search. In this paper we fill this gap. We advocate the use of multiple subset seeds derived from a hierarchical tree of amino acid residues. Our method uses an evolutionary algorithm to compute seeds that are specifically designed for a given protein family. The seeds are represented by deterministic finite automata (DFAs) and built into the NCBI-BLAST software. This extended tool, named SeedBLAST, is compared to the original BLAST and PSI-BLAST on several protein families. Our results demonstrate the superiority of SeedBLAST in terms of efficiency, especially for twilight-zone hits.

The contextual substitution model has been shown to increase the sensitivity of protein alignment. Here we take the next step in the contextual alignment program: we present a contextual version of PSI-BLAST, the iterative version of the NCBI-BLAST tool. Experimental evaluation demonstrates significantly higher sensitivity than the ordinary PSI-BLAST algorithm.
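The subset-seed idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the SeedBLAST implementation: the seed letters `#`, `@` and `_`, the residue grouping behind `@`, and the function names are assumptions made for illustration, and a dictionary index stands in for the DFA used in the actual tool. Each seed letter partitions the 20 standard amino acids into classes, and two words form a hit when their residues fall into the same class at every seed position.

```python
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues only


def make_letter(groups):
    """Build a residue -> class-id map for one seed letter.

    Residues not listed in any group become singleton classes.
    """
    mapping = {}
    for class_id, group in enumerate(groups):
        for residue in group:
            mapping[residue] = class_id
    next_id = len(groups)
    for residue in AMINO_ACIDS:
        if residue not in mapping:
            mapping[residue] = next_id
            next_id += 1
    return mapping


# Hypothetical seed alphabet, ordered from strict to permissive.
# The '@' grouping is illustrative and not taken from the paper.
SEED_ALPHABET = {
    "#": make_letter([]),  # exact match: every residue is its own class
    "@": make_letter(["LVIM", "ST", "FYW", "EDNQ", "KR"]),
    "_": make_letter([AMINO_ACIDS]),  # don't care: one class for all
}


def seed_key(word, seed):
    """Project a word onto the seed: one class id per seed position."""
    return tuple(SEED_ALPHABET[s][r] for s, r in zip(seed, word))


def find_hits(query, subject, seed):
    """Return (query_pos, subject_pos) pairs whose words match under the seed."""
    span = len(seed)
    index = defaultdict(list)
    for j in range(len(subject) - span + 1):
        index[seed_key(subject[j:j + span], seed)].append(j)
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(query[i:i + span], seed), []):
            hits.append((i, j))
    return hits
```

With the seed `_@@#`, the words `ALKW` and `GVRW` form a hit, because L/V and K/R share a class under `@` and the final W matches exactly. Under the exact seed `####` they do not. Searching with several seeds at once, as the abstract advocates, would amount to collecting the hits of each seed in turn.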


Bulletin of the Polish Academy of Sciences Technical Sciences

The Journal of Polish Academy of Sciences
