How I Made My SIMD Code 1700x Faster Without Writing Any Intrinsics
Background
Modern game development often means balancing workloads between the CPU and the GPU, but writing optimized code for both remains challenging. GPUs excel at massively parallel processing, while CPUs offer more deterministic execution, lower latency, and easier debugging. This raises an interesting question: could we write compute shaders that efficiently target both architectures from a single source?
The Forge…