This weekend I thought it would be a good idea to run the Eressea server both with and without optimizations enabled and compare the output. In theory, I thought, optimization should not change the results, and different results would hint at bugs like uninitialized variables or illegal memory access.

Needless to say, the output wasn’t the same. It was slightly different, and it looked like a small error snowballing towards the end. I’ll spare you the tale of a day trying to narrow down the exact location, and cut right to the chase:

There are more optimization options than you can shake a gnu at, and it takes time to find out which one is breaking your code. Like Visual C++ (/fp), GCC has optimization options that change the behavior of floating-point operations. And since they change the results of your program and its compliance with the IEEE and ANSI standards, they are disabled unless explicitly requested. For example, the manual says about -funsafe-math-optimizations:

    Allow optimizations for floating-point arithmetic that (a) assume that
    arguments and results are valid and (b) may violate IEEE or ANSI
    standards. When used at link-time, it may include libraries or startup
    files that change the default FPU control word or other similar
    optimizations.

    This option should never be turned on by any -O option since it can result
    in incorrect output for programs which depend on an exact implementation
    of IEEE or ISO rules/specifications for math functions.

    The default is -fno-unsafe-math-optimizations. 

Now, we all know floating-point math is a dark art. Processors have different register sizes, some numbers (like 0.7) cannot be represented exactly in binary, and rounding is a science in itself. This is why there are IEEE standards defining exactly what should happen. Of course, following those standards clashes with optimization. So after a lot of poking around and searching in the wrong places, I began to get suspicious of floating-point math, and I finally narrowed my bug down to this little test program:

#include <stdio.h>

int main(int argc, char **argv) {
  float f = 0.7F;
  printf("%d\n", (int)(f * 100));
  return 0;
}
I know, I know. Casting is evil. But knowing that 0.7 is one of those numbers we can’t represent exactly in a floating-point register, I wasn’t surprised to see that this prints out 69. I was, however, surprised that with -Os optimizations it prints 70. And here’s why:

        pushl   %ebp
        movl    %esp, %ebp
        pushl   $70 ; WTF?
        pushl   $.LC2
        call    printf

WTF is the point of disabling all the other math optimizations by default if you’re going to pull this kind of thing and make reproducibility of results moot anyway? It turns out there’s one such optimization GCC does enable by default, keeping floating-point values in registers, and the option that turns it off is -ffloat-store:

    Do not store floating point variables in registers, and inhibit other
    options that might change whether a floating point value is taken from
    a register or memory.

    This option prevents undesirable excess precision on machines such as
    the 68000 where the floating registers (of the 68881) keep more precision
    than a double is supposed to have. Similarly for the x86 architecture. For
    most programs, the excess precision does only good, but a few programs
    rely on the precise definition of IEEE floating point. Use -ffloat-store
    for such programs, after modifying them to store all pertinent
    intermediate computations into variables.

And when I compiled with -ffloat-store, my output was the same with and without optimization, as it should be. Lesson learned. I have to say, though, things are a lot easier in Visual C++: there are just three options for math, /fp:{precise,strict,fast}, and they are a lot more intuitive.