To me, spot finding is often the go-to problem for assessing a new approach to software performance optimisation, as it is robustly defined and performance critical for some of what I do as a day job. Some of this optimisation has been proof-of-concept, e.g. porting the basic DIALS spot finding to microcontrollers, which is a useful exercise but not immediately useful for actual work.
More recently I spent some effort documenting exactly how the dispersion extended spot finding in DIALS works: this was important to me as the algorithm gives excellent results but is not documented anywhere. It was also preliminary work for a GPU port of the algorithm.
As documented above, the simple dispersion spot finding depends primarily on computing the mean and variance of pixel values within a compact region of the image (typically a 7x7 kernel centred on the pixel), which in the main DIALS code is computed with an integral image / summed area table. On a GPU it is realistic to consider computing this with a brute force calculation instead, e.g. having every thread compute the result for a given pixel:
#include <cstdint>

__global__ void kernel_filter(uint16_t *image_in, uint16_t *mask_out,
                              int height, int width) {
  const int32_t knl = 3;
  const float sigma_b = 6.0f, sigma_s = 3.0f;

  const int i = threadIdx.y + blockDim.y * blockIdx.y;
  const int j = threadIdx.x + blockDim.x * blockIdx.x;

  if (i >= height)
    return;
  if (j >= width)
    return;

  const int k = i * width + j;

  // accumulate the count, sum and sum of squares of unmasked pixels in the
  // (2 * knl + 1) x (2 * knl + 1) window centred on this pixel - values
  // above 0xfffd mark masked pixels
  uint32_t m_sum = 0;
  uint32_t i_sum = 0;
  uint32_t i2_sum = 0;

  for (int32_t di = -knl; di <= knl; di++) {
    for (int32_t dj = -knl; dj <= knl; dj++) {
      int32_t _i = i + di;
      int32_t _j = j + dj;
      if ((_i >= 0) && (_i < height) && (_j >= 0) && (_j < width)) {
        uint32_t _k = _i * width + _j;
        uint32_t m = image_in[_k] > 0xfffd ? 0 : 1;
        uint32_t p = m * image_in[_k];
        m_sum += m;
        i_sum += p;
        i2_sum += p * p;
      }
    }
  }

  // N.B. for photon counting detectors the threshold is usually zero, and
  // masked pixels have value zero so cannot be signal - if the pixel is
  // > 0 then the sums must also be > 0
  uint16_t signal = 0;
  uint32_t p = image_in[k] > 0xfffd ? 0 : image_in[k];

  if (p > 0 && m_sum >= 2) {
    float bg_lhs = (float)m_sum * i2_sum - (float)i_sum * i_sum -
                   (float)i_sum * (m_sum - 1);
    float bg_rhs = i_sum * sigma_b * sqrtf(2.0f * (m_sum - 1));
    uint16_t background = bg_lhs > bg_rhs;

    float fg_lhs = (float)m_sum * p - (float)i_sum;
    float fg_rhs = sigma_s * sqrtf((float)i_sum * m_sum);
    uint16_t foreground = fg_lhs > fg_rhs;

    signal = background && foreground;
  }

  // save the pixel value for later use in connected component labelling
  mask_out[k] = signal * p;
}
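For reference, the two comparisons in the kernel are (as I read the algorithm) rearrangements of the usual dispersion criteria, written to avoid divisions. With m the count of unmasked pixels in the window, S the sum of their values and S2 the sum of their squares, the window fails the "looks like flat Poisson background" test when

m * S2 - S * S - S * (m - 1) > sigma_b * S * sqrt(2 * (m - 1))

which is the index of dispersion test, variance / mean > 1 + sigma_b * sqrt(2 / (m - 1)), multiplied through by (m - 1) * S, and the pixel counts as strong when

m * p - S > sigma_s * sqrt(m * S)

i.e. p > mean + sigma_s * sqrt(mean), taking the Poisson assumption that the variance of the background equals its mean.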
Wrapping this in a host-side function which can be called from a CPU thread with a dedicated stream is relatively trivial:
void kernel_finder(uint16_t *in, uint16_t *out, uint16_t *d_in, uint16_t *d_out,
                   cudaStream_t s) {
  // hard coded: NX x NY image dimensions, 32 x 32 thread blocks
  const dim3 block(32, 32);
  const dim3 grid((NX + 31) / 32, (NY + 31) / 32);
  cudaMemcpyAsync(d_in, in, NX * NY * sizeof(uint16_t), cudaMemcpyHostToDevice,
                  s);
  kernel_filter<<<grid, block, 0, s>>>(d_in, d_out, NY, NX);
  cudaMemcpyAsync(out, d_out, NX * NY * sizeof(uint16_t),
                  cudaMemcpyDeviceToHost, s);
  cudaStreamSynchronize(s);
}
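As a minimal, illustrative sketch (not the real code) of driving this for a single image - assuming NX and NY are the hard coded image dimensions from above, with pinned host memory so the asynchronous copies behave as expected, and with error checking omitted:

#include <cstdint>
#include <cuda_runtime.h>

// kernel_filter, kernel_finder, NX and NY are assumed to be defined
// earlier in the same file, as above

int main() {
  uint16_t *in, *out, *d_in, *d_out;
  cudaStream_t stream;

  // pinned host buffers so cudaMemcpyAsync can actually run asynchronously
  cudaMallocHost((void **)&in, NX * NY * sizeof(uint16_t));
  cudaMallocHost((void **)&out, NX * NY * sizeof(uint16_t));
  cudaMalloc((void **)&d_in, NX * NY * sizeof(uint16_t));
  cudaMalloc((void **)&d_out, NX * NY * sizeof(uint16_t));
  cudaStreamCreate(&stream);

  // ... read or generate image data into in[] here ...

  kernel_finder(in, out, d_in, d_out, stream);

  // out[k] now holds the pixel value for signal pixels and zero elsewhere,
  // ready for connected component labelling

  cudaStreamDestroy(stream);
  cudaFree(d_in);
  cudaFree(d_out);
  cudaFreeHost(in);
  cudaFreeHost(out);
  return 0;
}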
Using a summed area table with this approach is inefficient on a GPU as the cost of building an integral image exceeds the time savings.
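To make the comparison concrete, building the table needs at least two dependent passes over the full image before any thresholding can start - for example a prefix sum along rows followed by one down columns, as in the naive sketch below (illustrative only, not the DIALS code):

__global__ void row_prefix_sum(const uint16_t *image_in, uint32_t *table,
                               int height, int width) {
  // one thread per row: running sum left to right
  const int i = threadIdx.x + blockDim.x * blockIdx.x;
  if (i >= height)
    return;
  uint32_t sum = 0;
  for (int j = 0; j < width; j++) {
    sum += image_in[i * width + j];
    table[i * width + j] = sum;
  }
}

__global__ void col_prefix_sum(uint32_t *table, int height, int width) {
  // one thread per column: running sum top to bottom - this can only run
  // once row_prefix_sum has completed for the whole image
  const int j = threadIdx.x + blockDim.x * blockIdx.x;
  if (j >= width)
    return;
  uint32_t sum = 0;
  for (int i = 0; i < height; i++) {
    sum += table[i * width + j];
    table[i * width + j] = sum;
  }
}

Both passes stream the whole image through global memory, the second cannot start until the first has finished, and the real calculation would need similar tables for the mask count and the squared values too. The brute force kernel above reads a small neighbourhood per thread, largely from cache, and needs no intermediate tables at all.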
As described above, the dispersion extended algorithm is more complex, so naturally the GPU implementation is too. That will come in a subsequent blog post once I have decided how to write it up: the performance optimisation in particular was a thorny problem.