Pentium Bug



 In considering and discussing the possibility of serious errors
 arising from the Pentium division bug, it is important to be clear
 about the distinction between absolute and relative errors, the nature
 of the bug itself, and the nature of the calculations involved.
 About the bug:
 1.  It bites infrequently, once every several billion divides if the
     numbers involved in the calculation are chosen by a truly random
     process.
 2.  The error (bite) is usually relatively small but can be relatively
     large.  The largest relative error reported thus far is about
     6.1*10^-5 (see the sketch just below).
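 For concreteness, here is the arithmetic behind that figure, sketched in
 a few lines of Python using the widely publicized failing operands
 4195835/3145727; the "flawed" value is the result quoted in published
 accounts of the bug, not something I have measured myself:

     # Relative error of the commonly cited failing divide (values quoted
     # from published reports; treat them as illustrative).
     correct = 4195835.0 / 3145727.0     # correct quotient, ~1.333820449136
     flawed  = 1.333739068902            # result reported from a flawed FPU
     rel_err = abs(correct - flawed) / abs(correct)
     print(rel_err)                      # ~6.1e-05, i.e. about 4 good digits left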
 About ab initio calculations:
 1.  Very large quantities of numbers are calculated and processed, and
     these numbers vary widely in absolute size.
 2.  Different types of calculations (CI, MP, analytical or numerical energy
     derivatives, SCF, etc.) involve different kinds of numerically intensive
     work.  Some calculations approach numerical instability at the precision
     used for integrals and the like (approaching *numerical* linear
     dependence with large basis sets, for example).
 Some of the discussion on this list has focused on the large number of
 quantities (two-electron integrals) which have small absolute size.  It
 is true that relative errors, even ones large enough to leave only 4
 significant digits, are unimportant if the absolute value of the quantity
 under consideration is near the numerical threshold of the calculation.
 However, even after we discount the operations involving numbers within
 a factor of 10^4 of the threshold, we are left with a large number of
 operations which might hit the bug and *might* lead to a significant error.
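 To put rough numbers on that distinction, a tiny illustration in Python
 (the threshold and both quantities are invented purely for the example):

     # Same relative error, very different consequences (invented values).
     threshold = 1.0e-10                 # assumed integral-neglect threshold
     small_q   = 5.0e-9                  # within ~10^4 of the threshold
     large_q   = 2.7                     # a quantity far above the threshold
     rel_err   = 1.0e-4                  # a worst-case-scale relative error

     print(small_q * rel_err)            # ~5e-13: far below the threshold, harmless
     print(large_q * rel_err)            # ~2.7e-4: easily large enough to matter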
 The analogy of the computations to weighing a captain by weighing the
 ship and captain together and subtracting the weight of the ship is a
 little oversimplified for this discussion, unless you consider that one
 of the two weights may be in error by a relative amount as large as
 1*10^-4, i.e. multiplied by a factor of (1 - 1*10^-4).  (How many *TONS*
 did you say the captain weighs?)
 To spruce up the analogy you might consider a ship composed of nearly
 equal amounts of positive and negative mass with the NET mass of
 a supertanker.  Instead of weighing the ship, you must imagine cutting
 up the ship two different ways and weighing the parts, including the
 filings resulting from the cutting, independently. Include the captain
 in one of the weighings.  The (signed) weights are summed twice, once
 with the captain included and once without.  If some fraction of the
 filings' weights is off by a relative 10^-4 or less, the two sums will
 still be accurate enough for their difference to be a good approximation
 to the captain's weight; but if the value for just one of the large
 pieces is significantly in error, all bets are off.
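 A toy numerical version of the analogy, with every number invented purely
 for illustration, shows how a single bad large piece swamps the difference:

     # Many large signed "piece weights" nearly cancel; the captain is
     # recovered as the difference of two big sums (invented values).
     import random

     random.seed(1)
     pieces = [random.uniform(1e6, 1e7) * random.choice([1, -1])
               for _ in range(10000)]        # signed piece weights, in kg
     captain = 80.0                          # kg

     base = sum(pieces)
     print((base + captain) - base)          # ~80: fine when every piece is right

     bad = list(pieces)
     bad[0] *= (1.0 - 1e-4)                  # one large piece off by a relative 1e-4
     print((sum(bad) + captain) - base)      # off by |pieces[0]|*1e-4: hundreds of kg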
 A few years back, I saw reports indicating that about 20% of the
 cycles at the NSF supercomputer sites were being used for calculations
 on molecules and solids.  Today, we most certainly have a much larger
 volume of calculations being carried out on small (but powerful)
 computers.  Some of these calculations take days, weeks, and even months
 of cpu time.  It is inevitable that, if enough calculations are being
 carried out on Pentium cpus, some of them will contain significant
 errors; a rough estimate of the scale is sketched below.
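 A back-of-envelope sketch, in which every rate is an assumption chosen
 for illustration rather than a measurement, suggests the scale:

     # Expected number of buggy divides in one long, division-heavy run.
     divides_per_second = 1.0e5          # assumed sustained divide rate in hot code
     run_seconds        = 30 * 86400     # roughly a month of cpu time
     hit_probability    = 1.0 / 9.0e9    # "once every several billion divides"

     print(divides_per_second * run_seconds * hit_probability)   # ~29 expected hits

 Change the assumed rates as you like; the point is only that the expected
 number of hits grows with the length of the runs and the number of machines.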
 The fact that one or more particular sets of calculations of particular
 types were carried out without detecting an error, or only detecting
 inconsequential errors, has limited relevance to the issue.  In
 order to consider this error inconsequential, we would have to know
 that the probability of an error of a given size is on the order of
 the probability of an error of the same size occurring due to multiple-bit
 memory errors.  I don't believe that the available evidence supports such a
 conclusion at this time.
 I know of three cases in which powerful computers, operating in production
 environments, had broken hardware producing bad values for weeks or months.
 These cases involved a hot bit in a disk buffer on a Cray, a broken
 vector multiplier on a Cray, and a broken vector adder on a Convex.  In
 each case only a couple of users realized anything was wrong. Getting
 the machine out of production for repair required that a user *prove*
 that the machine had a problem.  In each case, dozens or hundreds of
 persons were getting questionable results from the machines
 without realizing it.  If Intel continues to require that users prove
 that their applications have a problem with the Pentium divide error
 in order to obtain a "repair", there are going to be a lot of machines
 producing questionable results for a long time.
 Don