Difference between a Register-based and a Stack-based CPU

January 23, 2008

This is just one small little aspect of how a stack and a register based CPU differs. It is not by any means a good comparison of the intricate differences, but rather just a little trivia that I’ve remembered in an older CPU era. Back in those days, where people would try to make use of every single CPU cycle that’s available, the ‘optimal’ way of clearing a register on for example, old 8086 machines, a programmer may code their application in C to make use of XOR negation to clear a register. In code, it looks like this:

int a = 8;      // just assigning a random value to clear to 0
a = 0;          // setting a variable to 0 normally

Versus the ‘efficient’ way:

int a = 8;
a ^= a;         // faster on register CPU, slower on stack CPU

If you translate that to x86 assembly code, it becomes:

MOV AX, 8
MOV AX, 0

Versus:

MOV AX, 8
XOR AX, AX      ; same number of instructions, but faster

The arithmetic operation of XOR vs MOV is just faster by 2-3 bus cycles. I might be wrong on the number but the key point is that the savings are really trivial.

Here’s a snippet of code generated for a stack based architecture, for example Java:

BIPUSH 8;
ISTORE_0;
ICONST_0;
ISTORE_0;       // 4 instructions

Versus:

BIPUSH 8;
ISTORE_0;
ILOAD_0;
ILOAD_0;
IXOR;
ISTORE_0;       // 6 instructions!

And you’ve just added an unnecessary overhead increase of 50%. That’s sometimes why people do mumble about stack based machine architectures like Java and C#, which probably have some kernels of truth. But with trivial savings like that, it’s probably relevant for old machine architectures, but for modern day RISC machine architectures, the stack slots are mapped onto registers directly, and with compiler optimizations, the differences are probably immaterial.

As a comparison, if the programmer fires up his profiler and optimize on even just the trivial-est of loops, he’ll probably end up having a faster application just by doing that, than having to manually tweak every single instance of the given example, with negligible gain (or even a loss of) performance. So if you’re ever looking for a good example of an unnecessary and counterproductive ‘premature optimization’, this is it!