Chunk #27 — Methods — Data. — TF ChIP–seq data.

Source: Improving the trans-ancestry portability of polygenic risk scores by prioritizing variants in predicted cell-type-specific regulatory elements.
Embedded: yes

Text

the IMPACT method, we selected TFs with a known sequence motif, as recorded in the MEME database. Of the 442 TFs represented by the 3,181 TF ChIP–seq datasets, only 142 matched a known sequence motif, narrowing the number of datasets considered to 1,542. No datasets were removed on the basis of cell-type classification. The 1,542 datasets (each characterized by a TF-cell-type pair) comprised 728 unique TF-cell-type pairs, meaning that many pairs had been assayed more than once. We took the union of peaks across different experiments of the same TF-cell-type pair. The number of consolidated TF ChIP–seq datasets (n = 728) is therefore smaller than the number of original datasets (n = 1,542). Then, for each of the 728 consolidated datasets, we scanned TF ChIP–seq peaks for the corresponding TF motifs, using HOMER (v.4.8.3; ref. 60), to identify matches exceeding the empirically determined motif detection threshold. Similarly, we identified motif sites not bound by a TF by using HOMER to scan the entire genome for sequence matches. We removed consolidated datasets with fewer than 7 peaks containing TF motifs, the lower bound at which the logistic regression could converge, resulting in 707 consolidated datasets. In terms of the corresponding GEO accessions, this removal reduced the 1,542 accessions used to 1,511. The 1,511 datasets
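The consolidation and filtering steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the tuple-based peak representation, and the `motif_sites` lookup (standing in for the output of a HOMER motif scan) are all hypothetical, and "union of peaks" is interpreted here as merging overlapping or touching intervals. Only the threshold of 7 motif-containing peaks comes from the text.

```python
from collections import defaultdict

MIN_MOTIF_PEAKS = 7  # lower bound stated in the text


def merge_intervals(peaks):
    """Merge overlapping or touching (chrom, start, end) peaks into a union."""
    merged = []
    for chrom, start, end in sorted(peaks):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last = merged[-1]
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def consolidate(datasets):
    """Pool peaks from all experiments sharing a (TF, cell type) pair.

    `datasets` is an iterable of (tf, cell_type, peaks) records; this
    layout is an assumption made for the sketch.
    """
    pooled = defaultdict(list)
    for tf, cell_type, peaks in datasets:
        pooled[(tf, cell_type)].extend(peaks)
    return {pair: merge_intervals(peaks) for pair, peaks in pooled.items()}


def filter_by_motif_peaks(consolidated, motif_sites, min_peaks=MIN_MOTIF_PEAKS):
    """Keep pairs with at least `min_peaks` peaks containing a motif site.

    `motif_sites` maps a TF name to a list of (chrom, position) motif
    matches; it is a hypothetical stand-in for motif-scan output.
    """
    kept = {}
    for (tf, cell_type), peaks in consolidated.items():
        sites = motif_sites.get(tf, [])
        n_with_motif = sum(
            any(c == chrom and start <= pos < end for c, pos in sites)
            for chrom, start, end in peaks
        )
        if n_with_motif >= min_peaks:
            kept[(tf, cell_type)] = peaks
    return kept
```

Calling `consolidate` on the per-experiment records and then `filter_by_motif_peaks` on its result mirrors the two reductions in the text (1,542 datasets to 728 consolidated pairs, then 728 to 707). The nested scan over motif sites is quadratic and is kept only for readability; a real analysis would use an interval index or a dedicated tool.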