floating point - CUDA float addition gives wrong answer (compared to CPU float ops)
I am new to CUDA. I was using CUDA to find the dot product of float vectors and came across a floating-point addition issue. In essence, it comes down to the following simple kernel. I'm compiling with -arch=sm_50. The basic idea is for thread 0 to add up the values of vector a.

    __global__ void temp(float *a, float *b, float *c)
    {
        // Only thread 0 of block (0, 0) performs the sequential sum.
        if (0 == threadIdx.x && blockIdx.x == 0 && blockIdx.y == 0)
        {
            float xx = 0.0f;
            for (int i = 0; i < length; i++) {
                xx += a[i];
            }
            *c = xx;
        }
    }

When I initialize 'a' with 1000 elements of 1.0, I get the desired result of 1000.00, but when I initialize 'a' with 1.1, I should get 1100.00xx; instead, I am getting 1099.989014. The CPU implementation simply yields 1100.000024.
I am trying to understand what the issue is here! :-(
I even tried counting the number of 1.1 elements in the vector, and that yields 1000, as expected. I also used atomicAdd and still have the same issue.
I would be very grateful if someone could help me out here!
Best
Edit: My biggest concern here is the disparity between the CPU result and the GPU result! I understand that floats can be off by some decimal places, but the GPU error is very significant! :-(
It is not possible to represent 1.1 exactly using the IEEE-754 floating-point representation. As @RobertCrovella mentioned in his comment, the computation performed on the CPU does not use the same IEEE-754 settings as the GPU one.
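Indeed, 1.1 in single precision is stored as 0x3F8CCCCD = 1.10000002384185. When performing the sum over 1000 elements, the last bits of each addend get lost in rounding as the running sum grows: 1 bit after the first addition, 2 bits once the sum passes four, and so on, until about 10 bits are lost by the time the sum is near 1000. Depending on the rounding mode, you may effectively truncate those 10 bits for the last half of the operations, hence ending up summing 0x3F8CCC00, which is about 1.09997558.
The CUDA result divided by 1000 is 0x3F8CCC71, which is consistent with a calculation carried out in 32 bits.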
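To check the bit patterns quoted above, here is a minimal C sketch that reinterprets a float as its 32-bit encoding and back. The helper names `float_to_bits` and `bits_to_float` are made up for this illustration, and the sketch assumes that `float` is the 32-bit IEEE-754 binary32 type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return the raw 32-bit encoding of a float.
   Assumes float is IEEE-754 binary32. */
uint32_t float_to_bits(float x) {
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* well-defined way to reinterpret the bytes */
    return u;
}

/* Hypothetical helper: build a float from a raw 32-bit encoding. */
float bits_to_float(uint32_t u) {
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

With these helpers, `float_to_bits(1.1f)` should come out as 0x3F8CCCCD, and `bits_to_float(0x3F8CCC00u)` should be approximately 1.0999756, the truncated addend mentioned above.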
When compiling for the CPU, depending on the optimization flags, you may be using fast math, which uses the internal register precision. If you are not targeting the vector registers, that can be the x87 FPU with its 80 bits of precision. In that case, the computation reads 1.1 as the float 1.10000002384185 and adds it 1000 times using the higher precision, hence not losing any bits to rounding. The result is 1100.00002384185, which is displayed as 1100.000024 after rounding to the nearest value at display precision.
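The difference between the two accumulation precisions can be reproduced on the CPU alone. The sketch below is only an illustration: the function names are hypothetical, and `double` stands in for the 80-bit x87 accumulator, on the assumption that both are wide enough to hold every partial sum of this particular input exactly.

```c
#include <stddef.h>

/* Accumulate in float: every partial sum is rounded to 24 significant bits.
   The volatile qualifier forces a store to a 32-bit float after each add,
   which limits the influence of any wider intermediate registers. */
float sum_in_float(const float *a, size_t n) {
    volatile float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = acc + a[i];
    }
    return acc;
}

/* Accumulate in double: a stand-in for a wider accumulator such as x87.
   The inputs are still floats, so 1.1 is still read as 1.10000002384185. */
double sum_in_double(const float *a, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += (double)a[i];
    }
    return acc;
}
```

For an array of 1000 elements set to `1.1f`, `sum_in_float` is expected to land near 1099.989, like the GPU result, while `sum_in_double` printed with `"%.6f"` should show 1100.000024, like the CPU result in the question.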
Depending on the compilation flags, obtaining the equivalent computation on the CPU may require enforcing 32-bit floating-point arithmetic, which can be done using the scalar SSE instruction addss, for example.
You can play with the /fp: option (MSVC) or the -mfpmath option (GCC) of your compiler and look at the instructions that are emitted. In this case, the assembly instruction fadd is the x87 addition with 80-bit precision.
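Besides reading the generated assembly, a C99 compiler reports how it evaluates floating-point expressions through the `FLT_EVAL_METHOD` macro in `<float.h>`. The sketch below is one possible way to inspect it; the helper names and the description strings are hypothetical and chosen only for this example.

```c
#include <float.h>

/* Hypothetical helper: describe a FLT_EVAL_METHOD value as defined by C99. */
const char *eval_method_name(int method) {
    switch (method) {
    case 0:  return "float operations are evaluated in float";
    case 1:  return "float operations are evaluated in double";
    case 2:  return "float operations are evaluated in long double";
    default: return "evaluation precision is indeterminable or implementation-defined";
    }
}

/* Report the setting selected by the current compiler and flags. */
const char *current_eval_method(void) {
    return eval_method_name((int)FLT_EVAL_METHOD);
}
```

With GCC, a 32-bit x86 build that uses the x87 unit typically reports 2, while an x86-64 build using SSE typically reports 0. A value of 0 means the float sum on the CPU should behave like the GPU one.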
All of this has nothing to do with GPU floating-point precision. It is rather a misunderstanding of the IEEE-754 standard and of the legacy x87 FPU behaviour.