CUDA is not my go-to go-fast tool -- with a modern, big CPU, it is pretty great how much performance you can trivially get with
#pragma omp parallel for. The last couple of times I have reached for CUDA, my first cut at a kernel was slower than the C++ running on 128 threads.
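
For a sense of how little the CPU route asks of you, here is a minimal sketch. The `saxpy` function and its signature are purely illustrative, not taken from any real project, and it assumes a compiler with OpenMP enabled (for example `-fopenmp` on GCC or Clang). How much speedup you actually see depends on the core count and on memory bandwidth, so treat it as a shape, not a benchmark.

```cpp
// Build with OpenMP enabled, e.g.: g++ -O2 -fopenmp saxpy.cpp
// Without the flag the pragma is ignored and the loop just runs serially.
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example kernel: y[i] = a * x[i] + y[i].
// Every iteration is independent, so the loop can be split across
// threads without any locks or atomics.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    // Signed index keeps this valid even for older OpenMP versions.
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(x.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The one pragma line is the whole parallelization. The same loop as a CUDA kernel would also need device allocations, host-to-device copies, and a launch configuration before it could even start competing.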