SIMD Instructions Considered Harmful

www.sigarch.org
8 min read
fairly easy
In the process of writing a short introduction to RISC-V, we compared RISC-V vector code to SIMD. We were struck by the insidiousness of the SIMD instruction extensions of ARM, MIPS, and x86. We de…
In the process of writing a short introduction to RISC-V, we compared RISC-V vector code to SIMD. We were struck by the insidiousness of the SIMD instruction extensions of ARM, MIPS, and x86. We decided to share those insights in this blog, based on Chapter 8 of our book.

Like an opioid, SIMD starts off innocently enough. An architect partitions the existing 64-bit registers and ALU into many 8-, 16-, or 32-bit pieces and then computes on them in parallel. The opcode supplies the data width and the operation. Data transfers are simply loads and stores of a single 64-bit register. How could anyone be against that?

To accelerate SIMD, architects subsequently double the width of the registers to compute more partitions concurrently. Because ISAs traditionally embrace backwards binary compatibility, and the opcode specifies the data width, expanding the SIMD registers also expands the SIMD instruction set. Each subsequent step of doubling the width of SIMD registers and the number of SIMD instructions leads ISAs down the path of escalating complexity, which is borne by processor designers, compiler writers, and assembly language programmers. The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.

An older and, in our opinion, more elegant alternative to exploit data-level parallelism is vector architectures. Vector computers gather objects from main memory and put them into long, sequential vector registers. Pipelined execution units compute very efficiently on these vector registers. Vector architectures then scatter the results back from the vector registers to main memory.

While a simple vector processor might execute one vector element at a time, element operations are independent by definition, and so a processor could theoretically compute all of them simultaneously. The widest data for RISC-V is 64 bits, and today's vector processors typically execute two, four, or eight 64-bit elements per clock cycle.…
David Patterson, Andrew Waterman On Sep
Read full article