Can someone guide me on optimizing the convolution of a filter with an image using ARM NEON intrinsics in C? I have already implemented this in plain C, but I need to reduce its run time for faster image processing on an ARM core with NEON support. Resources on implementing this kind of algorithm with NEON intrinsics in C are very limited online.
I need to convolve a 3×3 filter with the image. The main problem, I think, is structuring the loops to access each 3×3 window of the image. NEON intrinsics let us load 8 bytes at a time, but how can I take advantage of this when accessing a 3×3 window?
For now, I’m accessing the 3×3 image matrix like this:
    for(i=1;i<width;i++) // i = rows
    {
        if(i!=1) fseek(fp, 1078+(width*(i-1)), SEEK_SET);
        for(j=1;j<height-1;j++) // j = columns
        {
            if(j!=1) fseek(fp, 1077 + (i*width) + j , SEEK_SET);
            for(k=0;k<9;k+=3)
            {
                data[k] = getc(fp);
                data[k+1] = getc(fp);
                data[k+2] = getc(fp);
                //fread(buf, sizeof(char), width - 3, fp);
                fseek(fp, width - 3, SEEK_CUR);
            }
            pixel = vld1_u8(&data);
            pixel_last = data[8];
            result = vmul_u8(kernel,pixel);
            for(k=0;k<8;k++) sum += result[k];
            sum += pixel_last * kernel_last;
            sum = sum/9;
            sum = sum > 255 ? 255 : sum;
            imageData[i*width + j]= sum;
        }
    }
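One direction that seems to fit the 8-byte loads is to read the pixel data into memory once and compute 8 neighbouring output pixels per iteration, rather than gathering one 3×3 window at a time. Each of the 9 kernel taps then becomes one `vld1_u8` from a shifted address plus one widening multiply-accumulate. The following is only a minimal sketch under assumptions, not a tuned implementation: `convolve3x3_u8` is a hypothetical helper name, the image is assumed to be an 8-bit grayscale buffer already in memory with no row padding, the kernel weights are assumed non-negative with a sum of at most 257 (so the 16-bit accumulators cannot overflow), and border pixels are left unwritten. A scalar loop handles the row tail and builds without NEON.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: 3x3 filter over an in-memory 8-bit grayscale image.
 * Assumes: row stride == width, kernel weights sum to <= 257,
 * result = min(255, sum / 9) as in the loop above, borders untouched. */
void convolve3x3_u8(const uint8_t *src, uint8_t *dst,
                    int width, int height, const uint8_t kernel[9])
{
    for (int y = 1; y < height - 1; y++) {
        int x = 1;
#if defined(__ARM_NEON)
        /* Vector path: 8 output pixels per iteration. */
        for (; x + 8 <= width - 1; x += 8) {
            uint16x8_t acc = vdupq_n_u16(0);
            for (int ky = 0; ky < 3; ky++) {
                const uint8_t *row = src + (size_t)(y + ky - 1) * width + (x - 1);
                for (int kx = 0; kx < 3; kx++) {
                    /* 8 neighbours for this tap, widened to 16 bits on multiply. */
                    uint8x8_t px = vld1_u8(row + kx);
                    acc = vmlal_u8(acc, px, vdup_n_u8(kernel[ky * 3 + kx]));
                }
            }
            /* Divide by 9 via reciprocal: 58255 = ceil(2^19 / 9),
             * which is exact for every 16-bit input. */
            uint32x4_t lo = vshrq_n_u32(vmull_n_u16(vget_low_u16(acc), 58255), 19);
            uint32x4_t hi = vshrq_n_u32(vmull_n_u16(vget_high_u16(acc), 58255), 19);
            uint16x8_t q = vcombine_u16(vmovn_u32(lo), vmovn_u32(hi));
            /* Saturating narrow clamps each lane to 255. */
            vst1_u8(dst + (size_t)y * width + x, vqmovn_u16(q));
        }
#endif
        /* Scalar path: row tail, or the whole row when NEON is unavailable. */
        for (; x < width - 1; x++) {
            unsigned sum = 0;
            for (int ky = 0; ky < 3; ky++)
                for (int kx = 0; kx < 3; kx++)
                    sum += (unsigned)src[(size_t)(y + ky - 1) * width + (x + kx - 1)]
                           * kernel[ky * 3 + kx];
            sum /= 9;
            dst[(size_t)y * width + x] = (uint8_t)(sum > 255 ? 255 : sum);
        }
    }
}
```

The widening `vmlal_u8` is used instead of `vmul_u8` because `vmul_u8` keeps only the low 8 bits of each product, so any product above 255 wraps around before it can be summed.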