For all benchmark studies, we defined a set of values for each of the parameters mentioned above and generated r datasets for every possible parameter combination. We then applied scCODA to each synthetic dataset, with the last cell type chosen as the reference. For the model comparison benchmark (Methods, “Model comparison”), we analyzed the results at FDR levels of 0.05 and 0.2. The overall benchmark (Methods, “Power analysis”), the heterogeneous response benchmark (Methods, “Analysis of heterogeneous response groups”) and the runtime analysis (Methods, “Runtime analysis”) were carried out at an expected FDR level of 0.05.
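
The full-factorial design described here, with r replicates per parameter combination, could be enumerated roughly as in the following sketch. The parameter names (`n_cell_types`, `n_samples_per_group`, `effect_size`), their values, the replicate count and the helper `enumerate_runs` are illustrative placeholders and not the settings or code used in the benchmarks. Data generation and the scCODA call itself are deliberately left out.

```python
from itertools import product

# Hypothetical parameter grid: names and values are placeholders,
# not the values used in the benchmark studies.
param_grid = {
    "n_cell_types": [5, 10],
    "n_samples_per_group": [2, 5],
    "effect_size": [0.5, 1.0, 2.0],
}
r = 3  # replicates per parameter combination (placeholder value)


def enumerate_runs(grid, n_replicates):
    """Yield one configuration dict per (parameter combination, replicate)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        for replicate in range(n_replicates):
            run = dict(zip(names, values))
            run["replicate"] = replicate
            # The last cell type serves as reference (zero-based index).
            run["reference_index"] = run["n_cell_types"] - 1
            yield run


runs = list(enumerate_runs(param_grid, r))
print(len(runs))  # 2 * 2 * 3 combinations times 3 replicates = 36
```

Each resulting dictionary would then parameterize the generation of one synthetic dataset and one model fit, so the number of fits grows as the product of the grid sizes times r.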