To avoid the overestimation of the predictive performance, the redundant training sequences should be discarded. After the construction of non-redundant training set of kinase-specific phosphorylation sites, two features, i.e. sequence of surrounding catalytic sites and coupling pattern of surrounding catalytic sites, are extracted. As to sequence surrounding catalytic sites, 9-mer sequences (−4 ∼ +4) of kinase-specific phosphorylation sites are encoded in three ways: BLOSUM62 profile encoding (the corresponding row number of amino acids in BLOSUM62 matrix), reduced alphabet (sparse encoding with fewer letters) (18) and 20-dimensional vector (each amino acid is mapped to a 20-dimensional vector), as given in Table S2. It was found that amino acids have a great variety of properties such as mass, polarity, hydrophobicity, so many groupings are possible (19). With the hydrophobicity (20), for instance, the 20 amino acids are reduced into three classes, such as polar (R,K,E,D,Q,N), neutral (G,A,S,T,P,H,Y) and hydrophobic (C,V,L,I,M,F,W).