Loop alignment in .NET 6 | .NET Blog

devblogs.microsoft.com
Learn some great ideas on how to implement loop alignment in .NET 6 and get maximum benefit without affecting the performance adversely.
Loop alignment in .NET 6

Kunal

April 19th, 2021

When writing software, developers try their best to maximize the performance they can get from the code they have baked into the product. Often, there are various tools available to developers to find that last change they can squeeze into their code to make their software run faster. But sometimes, they might notice slowness in the product because of a totally unrelated change. Even worse, when the performance of a feature is measured in a lab, it might show unstable results that look like the following BubbleSort graph¹. What could possibly be introducing such flakiness in the performance?

To understand this behavior, we first need to understand how the machine code generated by the compiler is executed by the CPU. The CPU fetches the machine code (also known as the instruction stream) that it needs to execute. The instruction stream is represented as a series of bytes known as opcodes. Modern CPUs fetch the opcodes of instructions in chunks of 16 bytes (16B), 32 bytes (32B) or 64 bytes (64B). CISC architectures use variable-length encoding, meaning the opcode representing each instruction in the instruction stream can have a different length. So, when the Fetcher fetches a single chunk, it does not yet know where each instruction starts and ends. From the instruction stream chunk, the CPU's Pre-decoder identifies the boundaries and lengths of the instructions, while the Decoder decodes the meaning of the opcodes of those individual instructions and produces micro-operations (μops) for each instruction. These μops are fed to the Decoder Stream Buffer (DSB), which is a cache that indexes μops by the address from which the actual instruction was fetched. Before doing a fetch, the CPU first checks whether the DSB contains the μops of the instruction it wants to fetch. If they are already present, there is no need to go through a cycle of instruction fetching, pre-decoding and decoding. Further, there also exists a Loop Stream Detector (LSD) that…
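To make the chunked fetch concrete, here is a minimal sketch in Python (used only for illustration; it is not code from the .NET runtime). It counts how many fetch chunks a block of machine code overlaps, given its start address and size. The function name `chunks_touched`, the 24-byte loop body and the two addresses are hypothetical values chosen for the example, and the model ignores everything a real CPU does beyond splitting the instruction stream into fixed-size chunks.

```python
def chunks_touched(start_address: int, size_in_bytes: int, chunk_size: int = 32) -> int:
    """Count the fixed-size fetch chunks overlapped by a block of code.

    Simplified model (an assumption for illustration): the instruction
    stream is divided into chunks of `chunk_size` bytes, each starting at
    an address that is a multiple of `chunk_size`.
    """
    if size_in_bytes <= 0:
        return 0
    first_chunk = start_address // chunk_size
    last_chunk = (start_address + size_in_bytes - 1) // chunk_size
    return last_chunk - first_chunk + 1


# A hypothetical 24-byte loop body placed at two different addresses.
aligned = chunks_touched(0x1000, 24)     # starts exactly on a 32B boundary
misaligned = chunks_touched(0x1010, 24)  # starts 16 bytes into a 32B chunk

print(aligned, misaligned)  # prints "1 2"
```

Under this simplified model, the same 24 bytes of code fit in one 32B chunk at one address but straddle two chunks at another, so the start address alone changes how much fetching the code needs.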