Pentium II Math Bug?
UPDATE: Intel fixed this bug in November 1998.
It would appear that there may be a bug in the floating point
unit of the new Pentium II Processor, as well as the current
Pentium Pro Processor. Is it real? Is it serious? It appears to
be real. The observed behavior contradicts the IEEE Floating
Point Specifications, and Intel's printed documentation. However,
I'm not a numerical analyst, and therefore I'm not qualified to
comment on its seriousness or its implications. Instead, I'll
present the facts herein, and leave the determination to you.
The Facts
I received email from "Dan" who asked
if I could reproduce what he thought was a bug in the Pentium Pro
processor. I wrote an assembly language program that checked into
the problem. I also ran the test on a Pentium-II processor that I
had recently bought at Fry's Electronics, an Intel Pentium Processor (P54C), Intel Pentium
Processor with MMX Technology (P55C), and an AMD K6. Sure enough,
I came to the same conclusion as Dan: it looks like a bug to me.
What do we call this bug?
These days, astronomers name new stars and comets by combining
the discoverer's name and some number. Why should microprocessor
bugs be any different? In this case, "Dan" is the
discoverer of the bug, and 04-11 (1997) is the date on which I
got my first email about it. So I've named the bug
"Dan-0411" after its discoverer and the date he first
reported it to me.
What is the bug, and what does it affect?
The bug relates to operations that convert floating point
numbers into integer numbers. Floating point numbers are stored
inside of the microprocessor in an 80-bit format. Integer numbers
are stored in two different sizes. A short integer is stored in
16-bits, and a long integer is stored in 32-bits. It is often
desirable to store the 80-bit floating point numbers as integer
numbers. Sometimes the converted number won't fit into the
smaller integer format. This is when the bug occurs.
The host software is supposed to be warned by the
microprocessor when such a floating point conversion error
occurs; a specific error flag is supposed to be set in a floating
point status register. If the microprocessor fails to set this
flag, it would not be in compliance with the IEEE Floating Point
Standards which mandate such behavior. For the Dan-0411 bug, the
Pentium II and Pentium Pro processors fail to set this error flag
in many cases.
When storing 16-bit integers, the chance of randomly hitting
the bug is 2^47/2^80, or 1 in 8,589,934,592 (1 in 8.6 billion).
When storing 32-bit integers, the chance is 2^31/2^80, or
1 in 562,949,953,421,312 (1 in 562,950 billion). Together, that's
approximately 140,739,635,839,000 different floating point
numbers that result in the incorrect behavior. The Pentium,
Pentium with MMX Technology, and AMD K6 microprocessors do not
appear to have this problem.
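The odds quoted above follow directly from the counts of failing bit patterns. Here is a quick check in Python. The variable names are mine, and the calculation assumes, as the text does, that all 2^80 bit patterns are equally likely:

```python
from fractions import Fraction

TOTAL_PATTERNS = 2**80      # every possible 80-bit pattern

failing_16 = 2**47          # failing operands when storing 16-bit integers
failing_32 = 2**31          # failing operands when storing 32-bit integers

# Fraction reduces automatically, so the denominator is the "1 in N" figure.
odds_16 = Fraction(failing_16, TOTAL_PATTERNS)
odds_32 = Fraction(failing_32, TOTAL_PATTERNS)

print(f"16-bit store: 1 in {odds_16.denominator:,}")
print(f"32-bit store: 1 in {odds_32.denominator:,}")
print(f"failing operands in total: {failing_16 + failing_32:,}")
```

The exact total is 140,739,635,838,976, which rounds to the figure given above.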
It might be interesting to note that a launch failure of the
Ariane 5 rocket, which happened less than a minute into the
launch, was traced to behavior around an overflow condition (in
this case, it was software, not hardware, that was the problem).
One of the computers on board had a floating point to integer
conversion that overflowed, but because the overflow was not
handled by the software the computer did a dump of its memory.
Unfortunately, this memory dump was interpreted by the rocket as
instructions to its rocket nozzles. Result--boom!
There is a stuffy but complete description of this story
(which is actually quite interesting) at http://www.math.ufl.edu/~cws/3114/ariane-siam.html
Why wasn't this bug detected before?
I'm not exactly sure why this bug wasn't detected sooner, but
there are a few clues that could help provide an explanation.
There appears to be a bug in a popular floating point test
program. If Intel relied on this program, its bug may have
inadvertently allowed the Dan-0411 bug to slip by undetected.
Professor William Kahan of Berkeley has written a suite of
floating point test programs in the FORTRAN programming language.
(Please refer to Dr. Kahan's home page at http://http.cs.berkeley.edu/~wkahan.)
These programs are commonly used to test the Float-to-Integer
Store instructions (FIST and FISTP). FORTRAN compilers may have
differences in how they handle bit-wise expressions. These
compiler differences could make this test behave differently as
well. Technically, it looks like Dr. Kahan's original intent was
to use a bit-wise AND instead of a logical AND in his FORTRAN
source code; this is a potential non-portability issue, as I'm
not sure how AND is defined by the FORTRAN standard. This
"non-portable" code was
discovered when Dan tried to convert Dr. Kahan's FORTRAN source
code to the C programming language -- which has separate bit-wise
and logical AND operators. Dan recognized Dr. Kahan's original
intent and used the proper bit-wise AND operator in his C source
code. This is when the bug appeared in the chip. So in the end,
either a bug in the test software, or in a FORTRAN compiler, may
have hidden a bug in the chip.
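To see how the choice of AND operator could hide the bug, here is a small Python sketch. It is not Dr. Kahan's code or Dan's. The mask values come from the standard x87 status word layout (IE is bit 0, PE is bit 5). The "logical" version rests on an assumption of mine: that a compiler treats any nonzero operand as true and produces 1. I don't know what any particular FORTRAN compiler actually did.

```python
IE = 0x01   # Invalid operation flag, bit 0 of the x87 status word
PE = 0x20   # Precision flag, bit 5 of the x87 status word


def bitwise_check_fails(status_word: int) -> bool:
    """Report a failure when IE is missing, using a bit-wise AND."""
    return (status_word & IE) != IE


def logical_check_fails(status_word: int) -> bool:
    """The same test with AND treated as a logical operator.

    Assumption: nonzero means true, and a true result is the integer 1.
    """
    logical_and = 1 if (status_word != 0 and IE != 0) else 0
    return logical_and != IE


# The status the Pentium Pro reports for a failing operand: PE set, IE clear.
buggy_status = PE

print(bitwise_check_fails(buggy_status))   # True: the missing IE flag is caught
print(logical_check_fails(buggy_status))   # False: the missing IE flag slips by
```

Under that assumption, any nonzero status word satisfies the logical test, so a status word with only PE set looks the same as one with IE set.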
That's the end of the non-technical discussion. For further
technical details, continue reading.
How did I get involved?
"Dan," who does not want his full name disclosed, sent
me the following email on April 11, 1997 (reprinted with
permission):
Robert,
There seems to be a bug in the FIST[P] m16int and FIST[P] m32int
instructions for the P6 (Pentium Pro). Some (perhaps all) values
in the following ranges fail to set the IE (Invalid operation Exception)
flag as required for integer overflow.
FIST[P] m32int: [ c05e80000000000000001, c05e8000000080000000 ] (~-2^95)
FIST[P] m16int: [ c06e80000000000000001, c06e8000800000000000 ] (~-2^111)
(Number of failing mantissas = 2^31 + 2^47)
Example on P6 (Pentium Pro):
fcw = 0x37f
FIST[P] m16int c06e80000000000000001 -> 8000 (stored in memory)
FPU status word: B C3 TOP C2 C1 C0 ES SF PE UE OE ZE DE IE
                 0  0 000  0  0  0  0  0  1  0  0  0  0  0
***FAIL***
Example on P5 (Pentium):
fcw = 0x37f
FIST[P] m16int c06e80000000000000001 -> 8000 (stored in memory)
FPU status word: B C3 TOP C2 C1 C0 ES SF PE UE OE ZE DE IE
                 0  0 000  0  0  0  0  0  0  0  0  0  0  1
Prof. William Kahan at U.C. Berkeley wrote the following FORTRAN programs
to test floating-point to integer conversions:
http://HTTP.CS.Berkeley.EDU/~wkahan/tests/fistest2.lst
http://HTTP.CS.Berkeley.EDU/~wkahan/tests/fistest4.lst
The following line in the "fistest" programs is non-portable FORTRAN
and could prevent the P6 bug from being detected:
199 Li = ((kflag.AND.Invalid) .NE. Invalid) .OR. Li
-- Dan
Dan wanted to make sure that there wasn't a bug in his C
source code, or his C compiler. That's when he contacted me. Dan
wanted me to write assembly language source code on his behalf.
By writing in assembly language, the floating point hardware may
be tested directly and queried directly for its response without
the possible influence of compiler bugs and such.
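The operand ranges in Dan's email are raw 80-bit extended-precision bit patterns. The Python sketch below (the helper names are mine) splits such a pattern into sign, unbiased exponent, and significand, and counts the patterns in each range. It assumes each operand is written as exactly 20 hex digits, and it uses the standard x87 layout: 1 sign bit, a 15-bit exponent with a bias of 16383, and a 64-bit significand.

```python
def decode_extended(hex_digits: str):
    """Split a 20-hex-digit x87 extended-real pattern into its three fields."""
    if len(hex_digits) != 20:
        raise ValueError("expected 20 hex digits (80 bits)")
    raw = int(hex_digits, 16)
    sign = raw >> 79                              # top bit
    exponent = ((raw >> 64) & 0x7FFF) - 16383     # remove the exponent bias
    significand = raw & 0xFFFFFFFFFFFFFFFF        # includes the explicit integer bit
    return sign, exponent, significand


def count_operands(low: str, high: str) -> int:
    """Number of bit patterns in the inclusive range [low, high]."""
    return int(high, 16) - int(low, 16) + 1


print(decode_extended("c05e8000000000000001"))   # negative, exponent 95
print(decode_extended("c06e8000000000000001"))   # negative, exponent 111
print(count_operands("c05e8000000000000001", "c05e8000000080000000") == 2**31)
print(count_operands("c06e8000000000000001", "c06e8000800000000000") == 2**47)
```

The decoded exponents of 95 and 111 match the approximate magnitudes in the email, and the two range sizes match the count of failing mantissas.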
Normally I don't get involved in
debugging other people's problems or writing source code on their
behalf. But Dan was persistent. Within a day or two, Dan had come
up with some very concrete examples of the bug and instructions
which I could use as guidelines for reproducing it. I still
wasn't convinced that I wanted to be involved (not being a
floating point expert). But after 10 days or so, I finally became
convinced, and that's when I wrote the first piece of assembly
language source code to detect the Dan-0411 bug.
The Nature of the Bug
This bug occurs when a large negative floating point number is
stored to memory in an integer format. Under normal operation,
the largest negative integer is stored in memory when a floating
point number is too large to fit in the integer format. The FPU
Status Word indicates that an Invalid operand Exception (IE)
occurred (FSW.IE = 1).
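The normal behavior described above can be modeled in a few lines of Python. This is only a sketch of the rule, not a simulation of the FPU: the function name is mine, it uses Python doubles rather than 80-bit values, it handles finite inputs only, and it assumes the default round-to-nearest mode.

```python
INT_LIMITS = {16: (-2**15, 2**15 - 1), 32: (-2**31, 2**31 - 1)}


def fist_model(value: float, bits: int):
    """Return (stored_integer, ie_flag) for a store that follows the rule above."""
    low, high = INT_LIMITS[bits]
    rounded = round(value)        # Python's round() rounds halves to even
    if low <= rounded <= high:
        return rounded, False     # the value fits: store it, IE stays clear
    return low, True              # too large: largest negative integer, IE set


print(fist_model(1234.5, 16))       # fits in 16 bits
print(fist_model(-2.0**111, 16))    # far too large for 16 bits
print(fist_model(-2.0**95, 32))     # far too large for 32 bits
```

On an affected processor, the stored value for the failing operands is still the largest negative integer (8000 hex for a 16-bit store), but the IE flag that this model returns as True is left clear.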
Values that overflow the "real number" format are supposed to
behave differently from values that overflow the "integer
number" format. A real-format overflow sets the overflow flag
(FSW.OE = 1), not the Invalid operand Exception flag (FSW.IE);
an integer-format overflow is supposed to set FSW.IE. With the
Dan-0411 bug, the processor sets neither flag for the failing
operands. Instead, it sets the Precision Exception flag
(FSW.PE = 1). The Pentium Pro Family Developer's Manual,
Volume 2, section 7.8.4 makes this difference quite clear:
The FPU reports a
floating-point numeric overflow exception (#O) whenever the
rounded result of an arithmetic instruction exceeds the
largest allowable finite value that will fit into the real
format of the destination operand. For example, if the
destination format is extended-real (80 bits), overflow
occurs when the rounded result falls outside the unbiased
range of -1.0 * 2^16384 to 1.0 * 2^16384
(exclusive). Numeric overflow can occur on arithmetic
operations where the result is stored in an FPU data
register. It can also occur on store-real operations (with
the FST and FSTP instructions), where a within-range value in
a data register is stored in memory in a single- or
double-real format. The overflow threshold range for the
single-real format is -1.0 * 2^128 to 1.0 * 2^128;
the range for the double-real format is -1.0 * 2^1024
to 1.0 * 2^1024.
That explains how float-to-real overflows are supposed to be
handled. But the Pentium Pro manual is very specific by making a
distinction between float-to-real overflows and float-to-integer
overflows. In fact, the very next paragraph in the Pentium Pro
manual describes the behavior for the exact conditions exposed by
Dan-0411.
The numeric overflow exception
cannot occur when overflow occurs when storing values in an
integer or BCD integer format. Instead, the
invalid-arithmetic-operand exception is signaled.
As I said, this is the precise condition which is not being
met by the Pentium Pro and Pentium II microprocessors. The
programs that demonstrate Dan-0411 will set up these conditions
and test whether or not the proper error condition codes are set
by the microprocessor.
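The pass/fail decision comes down to which exception flags are set in the FPU status word after the store. The Python sketch below is not the logic of FISTBUG.ASM; it is just an illustration, with names of my own, applied to the two status words from Dan's email. The bit positions are those of the low byte of the x87 status word.

```python
# Exception and summary flags in the low byte of the x87 status word.
FLAG_BITS = {"IE": 0, "DE": 1, "ZE": 2, "OE": 3, "UE": 4, "PE": 5, "SF": 6, "ES": 7}


def set_flags(status_word: int):
    """Names of the flags that are set in the low byte of the status word."""
    return [name for name, bit in FLAG_BITS.items() if (status_word >> bit) & 1]


def classify(status_word: int) -> str:
    """An overflowing integer store passes only if IE was signaled."""
    return "pass" if "IE" in set_flags(status_word) else "fail"


pentium_pro_status = 0x0020   # from the email: only PE is set
pentium_status = 0x0001       # from the email: only IE is set

print(set_flags(pentium_pro_status), classify(pentium_pro_status))
print(set_flags(pentium_status), classify(pentium_status))
```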
Is this already a known bug?
Part of the process of disclosing this bug was ensuring that
it hadn't already been reported in any of Intel's errata
documents. Because Intel provides electronic versions of its
errata for the Pentium and Pentium Pro microprocessors, it's
very easy to perform an electronic search to see if this bug
has been previously reported. Using
this technique, I could not find any documentation disclosing the
Dan-0411 bug on either the Pentium or Pentium Pro
microprocessors.
The Source Code & Programs
I have provided one source code file and two executable
programs. Both executables are built from the stand-alone
assembly language source code. The first
program, FISTBUG.EXE
demonstrates the bug in a very simple manner. All that appears on
the screen is the simple message:
*** Dan-0411 bug found. ***
- or -
Dan-0411 not found.
The second program, FISTBUGV.EXE, runs exactly the same tests as the first, but is
much more verbose. This program shows the microprocessor stepping
information and itemized results. Each operand under test is
printed to the screen, along with pass/fail status for four
different testing methods.
The Results
I ran this test on various Pentia and other microprocessors.
For the purposes of this article, I will show the results for
the Intel 486, Pentium (P54C), Pentium with MMX Technology
(P55C), AMD K6, Pentium Pro, and Pentium II microprocessors.
These results demonstrate that the bug is only present on the
Pentium Pro and Pentium II microprocessors. None of the other
processors I tested demonstrated the Dan-0411 bug.
Conclusion
After reading this, I'm sure that many people will work
vigorously to verify or refute my test results. For this reason,
I've provided the source code along with executable binaries that
can be run in DOS or Windows. Since I'm not a numerical analyst,
you should draw your own conclusions or rely on the conclusions
of a qualified expert as to the significance of the Dan-0411 bug.
One thing I can say conclusively: the Pentium Pro and Pentium II
processors behave differently than their predecessors.
View results of FISTBUG
ftp://ftp.x86.org/source/fistbug/fistbug.res
Source Code Availability
View source code for FISTBUG.EXE and FISTBUGV.EXE
ftp://ftp.x86.org/source/fistbug/fistbug.asm
ftp://ftp.x86.org/source/fistbug/makefile
Executable Programs
Download FISTBUG.EXE and FISTBUGV.EXE binary executables.
ftp://ftp.x86.org/source/fistbug/fistbug.exe
ftp://ftp.x86.org/source/fistbug/fistbugv.exe
ftp://ftp.x86.org/source/fistbug/Dan0411x.ZIP
The Entire FISTBUG Archive
Download FISTBUG.ZIP archive. Archive contains source code,
binary executables, and my results.
ftp://ftp.x86.org/dloads/FISTBUG.ZIP