Matrix


A key element in evaluating the quality of a pairwise sequence alignment is the "substitution matrix", which assigns a score to each possible pair of aligned residues. The theory of amino acid substitution matrices is described in [1], and is applied to DNA sequence comparison in [2]. In general, different substitution matrices are tailored to detecting similarities among sequences that have diverged to differing degrees [1-3]. A single matrix may nevertheless be reasonably efficient over a relatively broad range of evolutionary change [1-3]. Experimentation has shown that the BLOSUM-62 matrix [4] is among the best for detecting most weak protein similarities. For particularly long and weak alignments, the BLOSUM-45 matrix may prove superior. A detailed statistical theory for gapped alignments has not been developed, and the best gap costs to use with a given substitution matrix are determined empirically.

Short alignments need to be relatively strong (i.e. have a higher percentage of matching residues) to rise above background noise. Such short but strong alignments are more easily detected using a matrix with a higher "relative entropy" [1] than that of BLOSUM-62. In particular, short query sequences can produce only short alignments, so database searches with short queries should use an appropriately tailored matrix. The BLOSUM series does not include any matrices with relative entropies suitable for the shortest queries, so the older PAM matrices [5,6] may be used instead. For proteins, a provisional table of recommended substitution matrices for various query lengths is:

     Query length     Substitution matrix
     ------------     -------------------
     <35              PAM-30            
     35-50            PAM-70            
     50-85            BLOSUM-80         
     >85              BLOSUM-62         
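
The table above can be expressed as a small lookup, sketched here in Python. The function name recommend_matrix is hypothetical and is not part of any BLAST program. The table lists 50 and 85 in two rows each; this sketch assumes that a query of exactly 50 residues falls in the PAM-70 row and a query of exactly 85 residues falls in the BLOSUM-80 row, which the table itself does not specify.

```python
def recommend_matrix(query_length: int) -> str:
    """Return the provisionally recommended matrix for a protein query.

    The thresholds mirror the table above. Treating the shared boundary
    values (50 and 85) as belonging to the lower row is an assumption.
    """
    if query_length < 1:
        raise ValueError("query_length must be a positive integer")
    if query_length < 35:
        return "PAM-30"
    if query_length <= 50:   # assumption: 50 belongs to the 35-50 row
        return "PAM-70"
    if query_length <= 85:   # assumption: 85 belongs to the 50-85 row
        return "BLOSUM-80"
    return "BLOSUM-62"
```

For example, recommend_matrix(42) returns "PAM-70". The returned name would still need to be passed to the search program by whatever mechanism that program provides, and suitable gap costs for the chosen matrix remain an empirical choice.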

[1] Altschul, S.F. (1991) "Amino acid substitution matrices from an information
    theoretic perspective." J. Mol. Biol. 219:555-565.
[2] States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity of
    nucleic acid database searches using application-specific scoring matrices."
    Methods 3:66-70.
[3] Altschul, S.F. (1993) "A protein alignment scoring system sensitive at all
    evolutionary distances." J. Mol. Evol. 36:290-300.
[4] Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from
    protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919.
[5] Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of evolutionary
    change in proteins." In "Atlas of Protein Sequence and Structure, vol. 5,
    suppl. 3," M.O. Dayhoff (ed.), pp. 345-352, Natl. Biomed. Res. Found.,
    Washington, DC.
[6] Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant
    relationships." In "Atlas of Protein Sequence and Structure, vol. 5,
    suppl. 3," M.O. Dayhoff (ed.), pp. 353-358, Natl. Biomed. Res. Found.,
    Washington, DC.