Histogram: The Razor's Edge

QUESTION: So I've been reading up on HISTOGRAM and how it is optimized to be wickedly fast. I've also discovered the fancy things one can do with reverse indices. I've even read JD Smith's exegesis on the topic: HISTOGRAM: The Breathless Horror and Disgust.

So, just when I thought it was safe to start using HISTOGRAM with the frequency that Californians use "like", I was brought this scary result by the guy on the other side of my wall (his name is Tiberius).

   IDL> print, histogram([-5.50,-5.45,-5.40,-5.35,-5.30,-5.25],min=-5.50,binsize=0.05)
        1           2           0           2           0           1

Wait a minute, this should be a uniform distribution!

Now, I admit that Tiberius contrived this example with the intent to cause harm to HISTOGRAM by placing the values at the exact boundaries of each bin. But, we didn't expect it to break! Honest. After scratching our heads for a bit we realized that HISTOGRAM is conducting its internal affairs by subtracting MIN from the data and then dividing by BINSIZE:

   IDL> print, ([-5.50,-5.45,-5.40,-5.35,-5.30,-5.25]-(-5.50))/0.05, format='(f21.19)'
   0.0000000000000000000
   1.0000038146972656250
   1.9999980926513671875
   3.0000019073486328125
   3.9999961853027343750
   5.0000000000000000000

Well, there ya go. It's a round-off problem that results from trying to balance the values on a razor's edge. The subtraction and division knock a few values off balance. But, the result is still wrong and I just don't know enough about roundoff error to know if this is an insurmountable problem (but my educated guess is "yes".) Is there some clever way to make HISTOGRAM behave properly in such situations?

ANSWER: The answer is supplied by a number of IDL users from the IDL newsgroup, where this question was originally asked. First, and concisely, here are the solutions by Craig Markwardt.

Here are five partial solutions to the problem:

  1. Use double precision values.
  2. Multiply all values by (1 + eps), where eps= (MACHAR()).EPS. (This assumes you always want to round "up" to the next bin.)
  3. Add a random deviate to each value. (This doesn't solve the razor's edge problem per se, but reduces the chance that a human-entered quantity will land on a bin-edge, and also reduces the bias of always rounding "up" to the next bin.)
  4. Work in powers of 2 instead of multiples of 0.05.
  5. Learn to live with it.

The real problem, of course, is that computers cannot represent real numbers as easily as we think they can, as Ed Meinel points out.

The technical answer is that computers represent numbers as base 2; humans represent numbers as base 10. Mapping numbers from base 10 to base 2 is exact, mapping numbers from base 10 to base 2 ain't.

For example, mapping 5+5/10 = 5+1/2 is an exact correspondence. Mapping 5+45/100 = 5+1/4+1/8+1/16+1/128+1/256+1/2048+... is not exact, even in double precision values.

So, depending on the machine precision, 5.45 on the computer is either slightly greater than or less than an exact representation of 5.45. On top of that, neither is 0.05 exact on the computer, so your bin size is also slightly different than what you are expecting.

The bottom line is this: if you want the results to be exact, think like a machine.

Bob Stockwell elaborates the proper solution: round to integer values.

The histogram gives the correct result. Check out what you put into the histogram function.

   IDL> print, [-5.50,-5.45,-5.40,-5.35,-5.30,-5.25],format='(f50.25)'
   -5.5000000000000000000000000
   -5.4499998092651367000000000
   -5.4000000953674316000000000
   -5.3499999046325684000000000
   -5.3000001907348633000000000
   -5.2500000000000000000000000

So, if such razor edge stuff is a concern of yours, pre-process the data and use integers (i.e. round to integers). This also applies to any conditional tests of a float ( for instance, for i = 0.1,10,0.001 do begin... etc). Here is your example, rounded to integers. You get the result you expect.

   IDL> Print, Histogram(Round([-5.50,-5.45,-5.40,-5.35,-5.30,-5.25]/0.05), $
   IDL     Min=-5.50/0.05, Binsize=1)
   1 1 1 1 1 1

Paul van Delst finishes the discusion with a suggestion.

The subtraction and division don't knock a few values "off balance", the values are simply not representable to the precision you require - such is the nature of float point numberology. As other posters suggested, you need to preprocess the values somehow to make them integers, or take into account the precision when you do the comparisons required. It's worth writing a separate function to do floating point number comparisons. In Fortran I do the following:

!         ABS( x - y ) < ( ULP * SPACING( MAX(ABS(x),ABS(y)) ) )
!
!       If the result is .TRUE., the numbers are considered equal.
!
!       The intrinsic function SPACING(x) returns the absolute spacing of numbers
!       near the value of x,
!
!                      {     EXPONENT(x)-DIGITS(x)
!                      {  2.0                        for x /= 0
!         SPACING(x) = {
!                      {  
!                      {  TINY(x)                    for x == 0
!
!       The ULP optional argument scales the comparison.

where

!       ULP:  Unit of data precision. The acronym stands for "unit in
!             the last place," the smallest possible increment or decrement
!             that can be made using a machine's floating point arithmetic.
!             A 0.5 ulp maximum error is the best you could hope for, since
!             this corresponds to always rounding to the nearest representable
!             floating-point number. Value must be positive - if a negative
!             negative value is supplied, the absolute value is used.
!             If not specified, the default value is 1.

I don't know off the top of my head how to translate all that to IDL, but it would be a useful thing to do.

After re-reading this article a month or so later, Paul added these thoughts for those of you brave enough to attempt his suggestion.

Hmm. Reading that razor's edge article made me dig out a little f90 program I wrote to determine the real number spacing.

Given some number, A, one can do:

   exponent=ceil(alog(abs(A))/alog(2.0d))
   radix=2.0d  ; (machar(/double)).ibeta ??
   epsilon=(machar(/double)).eps
   spacing=epsilon * (radix^(exponent-1))

So for, say, A = 1.234568d+16:

   EXPONENT        LONG      =           54
   SPACING         DOUBLE    =        2.0000000

For A = 1.234568d+01:

   EXPONENT        LONG      =            4
   SPACING         DOUBLE    =    1.7763568e-15

For A = 1.234568d-01:

   EXPONENT        LONG      =           -3
   SPACING         DOUBLE    =    1.3877788e-17

and for A = 1.234568d-16:

EXPONENT        LONG      =          -52
SPACING         DOUBLE    =    2.4651903e-32

which agree with the outputs of my f90 code using the intrinsic functions EXPONENT and SPACING. The problem is, of course, what to do when A = 0.0. I would just use spacing = epsilon/2.0. I think that, in this case, doing something like this would work.

   two=2.0d
   radix=two
   IF ( A .eq. 0.0 ) THEN $
     spacing = epsilon/two $
   ELSE BEGIN
     exponent=ceil(alog(abs(A))/alog(two))
     spacing=epsilon * (radix^(exponent-1))
   ENDELSE

where you are counting on the equality check .EQ. not to work for numbers that aren't represented as exactly zero (since exact zero can be represented). But I'm not sure. You may also need a check to limit the calculated exponent to the range allowed (the minexp and maxexp fields of the output from Machar).

Anyway.... back to work....

And, recently, JD Smith had a couple of thoughs about comparing floats in IDL expressions such as IF A EQ B THEN ...

I know a lot has been said about the issue of floating point representation, but I thought I'd have a stab.

There's nothing special about float comparison, other than the fact that it's hard to write down a floating point number uniquely, by hand as a literal in your code, as ASCII in a text file, etc. This issue just doesn't exist for integers, for which there is a one-to-one mapping between integer representation in ASCII text and binary representation on disk (as long as the number is within the integer range).

A float is equal to another float only when their exact representation in memory is equal. Note that the following is not an issue:

   IDL> print,array_equal(replicate(4.923,100),4.923)
      1

It will *always* be true for any floating number specified like this, even ones which cannot be represented at the given precision:

   IDL> print,array_equal(replicate(4.92309879079079087908790879087,100), $
   		       4.92309879079079087908790879087)
      1

Problems only begin to occur when you start to rely on your (incorrect) intuitive expectation of how a computer *should* handle a floating point number:

   IDL> print,array_equal(replicate(4.923,100),1.00 + 1.02 + 1.045 + 1.858)
      0

But, why not, you ask, since:

   IDL> print, 1.00 + 1.02 + 1.045 + 1.858
         4.92300

Well, not all of those numbers (if considered as exact values drawn from the inifinite set of all real numbers) can actually be represented uniquely as a float. The classic example of this issue is the decimal number 0.9, which in binary representation is:

   0.111001100110011001100110011........

and repeating so on forever. Unfortunately, you can only allocate so many bits for a float, so:

IDL> print, 0.9 eq (0.6 + 0.3)
   0
   IDL> print,FORMAT='(F0.8)',0.9, 0.6 + 0.3
   0.89999998
   0.90000004

Note that this doesn't mean that 0.9 and numbers like it will *never* equal to the "sum of their parts". Sometimes you get lucky and the bit truncation works out the same:

   IDL> print, 0.9 eq (0.5 + 0.4)
      1

Here's a nice example in C to get a better handle on this, which will let you peek behind the curtain and see the real bit representation of a given floating point number:

   #include <stdio.h>
   int main() {
     float f=0.9,f1;

     printf("Literal -- %f: %lx\n",f,*(unsigned int *) &f);

     f1=(float) 0.3 + (float) 0.6;
     printf("0.6+0.3 -- %f: %lx\n",f1,*(unsigned int *) &f1);

     printf("Floats are %s\n",f1==f?"Equal":"Not Equal");

   }

For convenience it prints the floats as 4 bytes of hexadecimal instead of 32 bits of binary, and it uses a de-referencing trick to get at the actual binary data for the float. Here are the results:

   % ./compare_float
   Literal -- 0.900000: 3f666666
   0.6+0.3 -- 0.900000: 3f666667
   Floats are Not Equal

Note how I had to cast the literal numbers 0.3 and 0.6 to floats, since by default most C compilers promote literals to double in arithmetic before assigning to float. IDL does this casting natively as well (which is why you aren't guaranteed to get the same answer with IDL and C).

Note also how the bit representation of the floats is not equal. Here they are in binary:

   3f666666: 111111011001100110011001100110
   3f666667: 111111011001100110011001100111

Just a single bit difference, but what a difference it is!

If you really want to test floats which arrived in a given array from various sources for "equality", you need to specify what "equality" means (if something other than the computer's quite naive definition of equality as "exact same representation in binary format"). You might try:

   ARRAY_EQUAL(abs(a-a[0]) lt epsilon,1b)

where epsilon is your maximum allowed difference within which floats should be considered "equal". A good number might be (machar()).eps, i.e.:

   IDL> print,abs(0.9 - (0.6+0.3)) lt (machar()).eps
      1

Ahh, all is right with the world.

Google
 
Web Coyote's Guide to IDL Programming