Chunk #16 — Methodological issues — Potential sources of bias

Source: Gene set analysis of genome-wide association studies: methodological issues and perspectives.
Embedded: yes

Text

gene, gene set significance may be driven by only a few of these SNPs, because a significant SNP mapped to multiple genes could be counted multiple times. For example, in our analysis of the GAIN schizophrenia dataset [11], the "starch and sucrose metabolism" gene set (HSA00500) included several genes located close together on the chromosome (e.g., UGT1A1, UGT1A3, UGT1A4, UGT1A5, UGT1A6, UGT1A7, UGT1A8, UGT1A9, UGT1A10). When the most significant SNP was used to represent the association signal of each gene, most of the genes in the cluster were represented by the same SNP, which had a P-value of 6.502×10⁻⁴. Therefore, when such a SNP has a small P-value, the gene set is likely to be identified as significant, even though the apparent significance of multiple genes in the set is in fact driven by a single highly significant SNP mapped to multiple genes.

Gene set size and gene length. Finally, as mentioned above, in order to score gene sets in an unbiased manner, all selection processes (e.g., selecting the most significant SNPs to represent each gene and selecting the most significant genes to represent each gene set) need to be accounted for in the final gene set analysis. For example, when a gene