Categories
assembly compiler-optimization fast-math floating-point gcc

Why doesn’t GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?

2259

I am doing some numerical optimization on a scientific application. One thing I noticed is that GCC will optimize the call pow(a,2) by compiling it into a*a, but the call pow(a,6) is not optimized and will actually call the library function pow, which greatly slows down the performance. (In contrast, Intel C++ Compiler, executable icc, will eliminate the library call for pow(a,6).)

What I am curious about is this: when I replace pow(a,6) with a*a*a*a*a*a using GCC 4.5.1 and the options “-O3 -lm -funroll-loops -msse4”, it uses 5 mulsd instructions:

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13

while if I write (a*a*a)*(a*a*a), it will produce

movapd  %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm14, %xmm13
mulsd   %xmm13, %xmm13

which reduces the number of multiply instructions to 3. icc has similar behavior.

Why do compilers not recognize this optimization trick?
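
For anyone who wants to reproduce this, a minimal test case along these lines can be compiled with gcc -O3 -S test.c and the resulting assembly inspected (the file name and function names here are my own):

#include <math.h>

double p6_pow(double a)     { return pow(a, 6); }        /* expected: a call to the library pow           */
double p6_naive(double a)   { return a*a*a*a*a*a; }      /* expected: 5 mulsd instructions                */
double p6_grouped(double a) { return (a*a*a)*(a*a*a); }  /* expected: 3 mulsd (cube once, then square it) */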

11

  • 15

    What does “recognizing pow(a,6)” mean?

    Jun 21, 2011 at 18:56

  • 697

Um… you know that a*a*a*a*a*a and (a*a*a)*(a*a*a) are not the same with floating point numbers, don’t you? You’ll have to use -funsafe-math-optimizations or -ffast-math or something for that.

    – Damon

    Jun 21, 2011 at 18:57

  • 124

    I suggest you read “What Every Computer Scientist Should Know About Floating Point Arithmetic” by David Goldberg: download.oracle.com/docs/cd/E19957-01/806-3568/… after which you’ll have a more complete understanding of the tar pit that you’ve just walked into!

    Jun 22, 2011 at 9:20

  • 202

    A perfectly reasonable question. 20 yrs ago I asked the same general question, and by crushing that single bottleneck, reduced the execution time of a Monte Carlo simulation from 21 hours to 7 hours. The code in the inner loop was executed 13 trillion times in the process, but it got the simulation into an over-night window. (see answer below)

    – user1899861

    Dec 21, 2012 at 3:47

  • 30

    Maybe throw (a*a)*(a*a)*(a*a) into the mix, too. Same number of multiplications, but probably more accurate.

    – Rok Kralj

    Aug 11, 2015 at 17:18

2871

Because Floating Point Math is not Associative. The way you group the operands in floating point multiplication has an effect on the numerical accuracy of the answer.

As a result, most compilers are very conservative about reordering floating point calculations unless they can be sure that the answer will stay the same, or unless you tell them you don’t care about numerical accuracy. For example, gcc’s -fassociative-math option allows gcc to reassociate floating point operations, and the -ffast-math option allows even more aggressive tradeoffs of accuracy against speed.
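
As a rough illustration (a sketch of my own, not taken from the answer), the two groupings really can round differently for some inputs, which is exactly what the compiler is not allowed to change behind your back by default:

#include <stdio.h>

int main(void) {
    int mismatches = 0;
    /* Compare the two groupings over a range of values. Without
       -ffast-math the compiler must evaluate each grouping as written,
       and for some inputs the results differ in the last bit. */
    for (double a = 1.0; a < 2.0; a += 1e-6) {
        double left  = a*a*a*a*a*a;       /* evaluated as ((((a*a)*a)*a)*a)*a */
        double right = (a*a*a)*(a*a*a);
        if (left != right)
            mismatches++;
    }
    printf("%d of the sampled values differ\n", mismatches);
    return 0;
}

Compiled with -fassociative-math or -ffast-math, the compiler is free to rewrite one form into the other, and the count may well drop to zero.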

8

  • 13

Yes. With -ffast-math it does perform that optimization. Good idea! But since our code is more concerned with accuracy than speed, it might be better not to pass that flag.

    – xis

    Jun 21, 2011 at 19:09

  • 24

    IIRC C99 allows the compiler to do such “unsafe” FP optimizations, but GCC (on anything other than the x87) makes a reasonable attempt at following IEEE 754 – it’s not “error bounds”; there is only one correct answer.

    – tc.

    Jun 22, 2011 at 2:19

  • 15

    The implementation details of pow are neither here nor there; this answer doesn’t even reference pow.

    Jan 3, 2013 at 2:19

  • 18

    @nedR: ICC defaults to allowing re-association. If you want to get standard-conforming behavior, you need to set -fp-model precise with ICC. clang and gcc default to strict conformance w.r.t. reassociation.

    Mar 27, 2014 at 18:19

  • 60

@xis, it’s not really that -fassociative-math is inaccurate; it’s just that a*a*a*a*a*a and (a*a*a)*(a*a*a) are different. It’s not about accuracy; it’s about standards conformance and strictly repeatable results, e.g. the same results on any compiler. Floating point numbers are already not exact. It is seldom inappropriate to compile with -fassociative-math.

    Aug 24, 2014 at 16:11

694

Lambdageek correctly points out that because associativity does not hold for floating-point numbers, the “optimization” of a*a*a*a*a*a to (a*a*a)*(a*a*a) may change the value. This is why it is disallowed by C99 (unless specifically allowed by the user, via compiler flag or pragma). Generally, the assumption is that the programmer wrote what she did for a reason, and the compiler should respect that. If you want (a*a*a)*(a*a*a), write that.

That can be a pain to write, though; why can’t the compiler just do [what you consider to be] the right thing when you use pow(a,6)? Because it would be the wrong thing to do. On a platform with a good math library, pow(a,6) is significantly more accurate than either a*a*a*a*a*a or (a*a*a)*(a*a*a). Just to provide some data, I ran a small experiment on my Mac Pro, measuring the worst error in evaluating a^6 for all single-precision floating-point numbers in [1,2):

worst relative error using    powf(a, 6.f): 5.96e-08
worst relative error using (a*a*a)*(a*a*a): 2.94e-07
worst relative error using     a*a*a*a*a*a: 2.58e-07

Using pow instead of a multiplication tree reduces the error bound by a factor of 4. Compilers should not (and generally do not) make “optimizations” that increase error unless licensed to do so by the user (e.g. via -ffast-math).
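
A sketch of how such a measurement can be reproduced (this is my own reconstruction, not the original test program): walk every float in [1,2), use a double-precision evaluation as the reference, and record the worst relative error. Link with -lm.

#include <math.h>
#include <stdio.h>

int main(void) {
    double worst_pow = 0, worst_grouped = 0, worst_naive = 0;
    for (float a = 1.0f; a < 2.0f; a = nextafterf(a, 2.0f)) {
        /* Double-precision reference; the single-precision candidates
           below are evaluated in float (typical x86-64 behavior). */
        double ref = pow((double)a, 6);
        double e_pow     = fabs(powf(a, 6.f)    - ref) / ref;
        double e_grouped = fabs((a*a*a)*(a*a*a) - ref) / ref;
        double e_naive   = fabs(a*a*a*a*a*a     - ref) / ref;
        if (e_pow     > worst_pow)     worst_pow     = e_pow;
        if (e_grouped > worst_grouped) worst_grouped = e_grouped;
        if (e_naive   > worst_naive)   worst_naive   = e_naive;
    }
    printf("powf(a, 6.f):    %.2e\n", worst_pow);
    printf("(a*a*a)*(a*a*a): %.2e\n", worst_grouped);
    printf("a*a*a*a*a*a:     %.2e\n", worst_naive);
    return 0;
}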

Note that GCC provides __builtin_powi(x,n) as an alternative to pow( ), which should generate an inline multiplication tree. Use that if you want to trade off accuracy for performance, but do not want to enable fast-math.
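
For example (a minimal sketch; the wrapper name is my own), the exponent is an int, and with optimization GCC should expand the builtin into inline multiplications rather than a call to pow:

/* gcc -O2 -S powi_example.c   (no -ffast-math required) */
double sixth_power(double x) {
    return __builtin_powi(x, 6);   /* expanded as a small multiplication tree */
}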

13

  • 31

Note also that Visual C++ provides an ‘enhanced’ version of pow(). By calling _set_SSE2_enable(<flag>) with flag=1, it will use SSE2 if possible. This reduces accuracy by a bit, but improves speed (in some cases). MSDN: _set_SSE2_enable() and pow()

    – TkTech

    Jun 22, 2011 at 17:04


  • 21

    @TkTech: Any reduced accuracy is due to Microsoft’s implementation, not the size of the registers used. It’s possible to deliver a correctly-rounded pow using only 32-bit registers, if the library writer is so motivated. There are SSE-based pow implementations that are more accurate than most x87-based implementations, and there are also implementations that trade off some accuracy for speed.

    Jun 22, 2011 at 17:37


  • 11

    @TkTech: Of course, I just wanted to make clear that the reduction in accuracy is due to the choices made by the library writers, not intrinsic to the use of SSE.

    Jun 22, 2011 at 17:56

  • 9

    I’m interested to know what you used as the “gold standard” here for calculating relative errors — I would normally have expected it would be a*a*a*a*a*a, but that is apparently not the case! 🙂

    Sep 24, 2013 at 22:44

  • 11

@j_random_hacker: since I was comparing single-precision results, double-precision suffices for a gold standard — the error from a*a*a*a*a*a computed in double is vastly smaller than the error of any of the single-precision computations.

    Sep 24, 2013 at 22:47

184

Another similar case: most compilers won’t optimize a + b + c + d to (a + b) + (c + d) (this is an optimization, since the second expression can be pipelined better) and will instead evaluate the expression as written (i.e. as (((a + b) + c) + d)). This too is because of corner cases:

float a = 1e35, b = 1e-5, c = -1e35, d = 1e-5;
printf("%e %e\n", a + b + c + d, (a + b) + (c + d));

This outputs 1.000000e-05 0.000000e+00

5

  • 13

This is not exactly the same. Changing the order of multiplications/divisions (excluding division by 0) is safer than changing the order of additions/subtractions. In my humble opinion, the compiler should try to associate mults./divs. because doing that reduces the total number of operations, and besides the performance gain there’s also a precision gain.

    Jul 7, 2014 at 8:22

  • 6

    @DarioOO: It is no safer. Multiply and divide are the same as addition and subtraction of the exponent, and changing the order can easily cause temporaries to exceed the possible range of the exponent. (Not exactly the same, because the exponent doesn’t suffer loss of precision… but the representation is still quite limited, and reordering can lead to unrepresentable values)

    – Ben Voigt

    Mar 4, 2015 at 17:47

  • 12

I think you are missing some calculus background. Multiplying or dividing two numbers introduces the same amount of error either way, while adding or subtracting two numbers may introduce a bigger error, especially when the two numbers are orders of magnitude apart; hence it is safer to rearrange mul/div than add/sub, because it introduces a smaller change in the final error.

    Mar 5, 2015 at 8:49


  • 10

@DarioOO: the risk is different with mul/div: reordering either makes a negligible change in the final result, or the exponent overflows at some point (where it wouldn’t have before) and the result is massively different (potentially +inf or 0). A concrete example of this is sketched after these comments.

    Jul 30, 2015 at 4:37

  • @GameDeveloper Imposing a precision gain in unpredictable ways is hugely problematic.

    Sep 29, 2019 at 21:09
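
A concrete illustration of the overflow risk mentioned in the comments above (my own example, not from the answer): both groupings below are mathematically equal to 1e300, but one of them overflows in an intermediate product.

#include <stdio.h>

int main(void) {
    double a = 1e300, b = 1e300, c = 1e-300;
    printf("%g\n", (a * b) * c);   /* a*b overflows to +inf, so this prints inf     */
    printf("%g\n", a * (b * c));   /* b*c is about 1, so this prints roughly 1e+300 */
    return 0;
}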