Modern x86 processors are designed to operate on data in (usually 64-bit) machine word or (≥ 128-bit) vector chunks. The PLINK 1 binary file format supports this well: the format’s packed 2-bit data elements can, with the use of bit arithmetic, easily be processed 32 or 64 at a time. However, most existing programs fail to exploit opportunities for bit-level parallelism; instead their loops painstakingly extract and operate on a single data element at a time. Replacement of these loops with bit-parallel logic is, by itself, enough to speed up numerous operations by more than one order of magnitude.
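To make the contrast concrete, the following minimal C sketch tallies how often each of the four possible 2-bit values occurs in one 64-bit word holding 32 packed elements. It does this first with an element-at-a-time loop and then with masks and population counts. The function names (`popcount64`, `count_naive`, `count_parallel`, `check_word`) and the choice of a per-value tally are illustrative assumptions, not code taken from PLINK, and the sketch does not assign any genotype meaning to the four values.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 64-bit population count (standard SWAR bit trick).
 * A hardware popcount instruction could be substituted where available. */
static unsigned popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);
}

/* Element-at-a-time: extract each 2-bit value and bump its counter.
 * counts[v] is incremented once per element equal to v (v = 0..3). */
static void count_naive(uint64_t w, unsigned counts[4]) {
    for (int i = 0; i < 32; i++) {
        counts[(w >> (2 * i)) & 3]++;
    }
}

/* Bit-parallel: split the word into the low and high bits of all 32
 * elements at once, then derive the four tallies from three popcounts. */
static void count_parallel(uint64_t w, unsigned counts[4]) {
    const uint64_t mask_lo = 0x5555555555555555ULL; /* low bit of each pair */
    uint64_t lo = w & mask_lo;        /* set where an element's low bit is 1 */
    uint64_t hi = (w >> 1) & mask_lo; /* set where an element's high bit is 1 */
    unsigned n_lo = popcount64(lo);
    unsigned n_hi = popcount64(hi);
    unsigned n_11 = popcount64(lo & hi); /* both bits set */

    counts[3] += n_11;
    counts[1] += n_lo - n_11;             /* low bit only  (binary 01) */
    counts[2] += n_hi - n_11;             /* high bit only (binary 10) */
    counts[0] += 32 - n_lo - n_hi + n_11; /* neither bit   (binary 00) */
}

/* Sanity check: both methods must agree on any input word. */
static void check_word(uint64_t w) {
    unsigned a[4] = {0, 0, 0, 0};
    unsigned b[4] = {0, 0, 0, 0};
    count_naive(w, a);
    count_parallel(w, b);
    for (int v = 0; v < 4; v++) {
        assert(a[v] == b[v]);
    }
}
```

The bit-parallel version replaces 32 shift-mask-increment steps with a handful of word-wide operations. The same idea extends to wider vector registers, under the assumption that the data are read in whole-word chunks.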