We first evaluated the overall precision and recall of all models on the test benchmark. As shown in Table 2, the Poisson GLM achieves the highest recall, whereas the ZINB GLM has the highest precision. The F1 score, the harmonic mean of precision and recall, is used to evaluate overall performance. The conclusion drawn from the F1 score is consistent with that of Vuong’s test: ZINB performs best, followed by NB, ZIP and the Poisson GLM. However, the precision values listed in Table 2 are lower than those reported previously [7, 14, 15]. There are two major reasons. First, the Ion Proton test benchmark dataset is designed to be enriched with low-frequency SNVs: 68.9 % of all SNVs have an allele frequency ≤ 3 %, with 17.3 % at 0.5 % frequency and 19.8 % at 1 % frequency, whereas the majority of previous studies focused on SNVs with ≥ 5 % allele frequency. Second, one popular paradigm of SNV calling is a two-step procedure, first generating SNV candidates and then applying multiple sequencing quality filters to