The following code writes a 16×16 bitmap to a framebuffer using instructions up to AVX2. I’m sure it can be improved with AVX-512, but I’m only interested in improvements using AVX2 and below. Specifically, I want to know whether there is a more efficient way of doing it. Each bit in the bitmap represents one pixel and requires a dword attribute (the foreground or background color) to be written. Each 2-byte word of the 32-byte bitmap covers one 16-pixel row, for a total of 16 rows. The code below does 8 loads and 32 stores for each 16×16 bitmap written.

    FColor dd 0x00ffffff ; white
    BColor dd 0x000000ff ; blue

    ALIGN 32
    bitshift_to_MSB dd 24, 25, 26, 27, 28, 29, 30, 31

    ; 95 * 32 byte bitmaps (SPACE to '~')
    CH_ARIAL db 0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,0x00,
    . . .
    0x00,0x00,0x00,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x02,0x00,0x00,0x00,0x02,0x00,0x00,0x00,0x00,0x00,

    ; rdi = framebuffer address to write to
    ; edx = printable ascii character (assumed to be between 32 and 126)
            sub edx, 32
            shl edx, 5
            mov ecx, 8
            vpbroadcastd ymm4, [FColor]
            vpbroadcastd ymm3, [BColor]
            vmovdqa ymm2, yword [bitshift_to_MSB]
    @@:
            vpbroadcastd ymm0, dword [CH_ARIAL + edx] ; broadcasts dword (32 bits, for 2 x 16 pixel rows)
            vpsllvd ymm1, ymm0, ymm2        ; shifts left according to ymm2 to set dword MSB based on 1st byte
            vblendvps ymm1, ymm3, ymm4, ymm1 ; set dwords in ymm1 = (each dword MSB)?ymm4:ymm3
            vmovdqa [rdi], ymm1
            vpsrld ymm0, ymm0, 8            ; shift 2nd byte of each dword to first position
            vpsllvd ymm1, ymm0, ymm2
            vblendvps ymm1, ymm3, ymm4, ymm1
            vmovdqa [rdi+32], ymm1
            add rdi, 1920*4                 ; mov down to next row of pixels
            vpsrld ymm0, ymm0, 8
            vpsllvd ymm1, ymm0, ymm2
            vblendvps ymm1, ymm3, ymm4, ymm1
            vmovdqa [rdi], ymm1
            vpsrld ymm0, ymm0, 8
            vpsllvd ymm1, ymm0, ymm2
            vblendvps ymm1, ymm3, ymm4, ymm1
            vmovdqa [rdi+32], ymm1
            add edx, 4
            add rdi, 1920*4
            dec ecx
            jnz @b
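
For anyone comparing alternatives, the bit-to-pixel mapping the assembly implements can be written as a plain scalar C model. This is only a sketch for checking output, not a faster version. The function name `draw_glyph_ref` and its parameters are hypothetical. The bit order is inferred from the shift table (24 to 31): lane 0 shifts bit 7 into the sign position, so the most significant bit of each byte is taken as the leftmost pixel. The stride is given in pixels (1920 in the listing).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { GLYPH_W = 16, GLYPH_H = 16 };

/* Scalar reference model (hypothetical helper, not part of the original code).
 *   fb     - destination pixel for the glyph's top-left corner
 *   stride - framebuffer width in pixels (1920 in the assembly listing)
 *   glyph  - 32 bytes, two per row; bit 7 of the first byte is assumed
 *            to be the leftmost pixel, as implied by the 24..31 shifts
 *   fg, bg - dword written for a set bit / a clear bit
 */
void draw_glyph_ref(uint32_t *fb, size_t stride, const uint8_t glyph[32],
                    uint32_t fg, uint32_t bg)
{
    assert(fb != NULL && glyph != NULL);

    for (int y = 0; y < GLYPH_H; ++y) {
        for (int x = 0; x < GLYPH_W; ++x) {
            /* Row y uses bytes 2*y (pixels 0..7) and 2*y + 1 (pixels 8..15). */
            uint8_t byte = glyph[2 * y + x / 8];
            int bit = (byte >> (7 - (x % 8))) & 1;
            fb[(size_t)y * stride + (size_t)x] = bit ? fg : bg;
        }
    }
}
```

Running a candidate SIMD routine and this model on the same glyph and comparing the two 16×16 regions dword by dword is a cheap way to confirm that a rewrite still produces identical pixels.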