I'm attending Bryan Lewis's High-Performance Computing class in Chicago today, and he just related another example of an improbable event that becomes expected in certain situations.
It's possible, although quite unlikely, that your computer may give the wrong answer for a calculation one day. This isn't anything to do with a bug in the hardware or software: randomly, a bit in the memory may suddenly flip from a 1 to a 0 (or vice versa). This can be caused by, of all things, high-energy particles (cosmic rays) that constantly bombard the planet from outer space. Relatively few such particles make it through the atmosphere to the surface, and of those, relatively few will interact with any matter (like your CPU or RAM). But it can happen.
Some CPUs have automatic error-correction capabilities built in that can detect when a random bit-flip occurs, and fix the problem without user intervention. But most consumer-level CPUs don't. It's such a rare event that most people will never notice: even if a bit does flip, it's probably going to be in RAM that isn't being used, or doesn't hold a critical calculation. (If a pixel on your screen flips from blue to red momentarily, you'd never notice it.)
But in 2001, Bryan set up a cluster of several PCs with Pentium-800 chips, each with 8 Gb of RAM. To test the cluster, he ran a large Gaussian elimination problem that ran for several hours and continually used practically all of the RAM in the cluster. The motherboards allowed you to load either error-correcting or ordinary (non-error-correcting) RAM, so he ran the computation with both types of memory. Amazingly, every time he ran the computation with the non-error-correcting RAM, he got the wrong answer. Even though a bit-flip is a rare event, if you run a computation for enough time using enough RAM, eventually you're almost certain to see it happen. High-performance computing infrastructures use error-correcting RAM for exactly this reason.
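To put rough numbers on "rare event, but almost certain eventually", here is a minimal Python sketch. It assumes bit-flips are independent and arrive at a constant rate per bit (a Poisson model). The rate, the machine count, the run length, and the reading of the memory size as 8 gigabytes per machine are all made-up illustrative values, not measurements from Bryan's cluster.

```python
import math


def prob_at_least_one_flip(flips_per_bit_hour: float, bits: float, hours: float) -> float:
    """Probability of one or more bit-flips, assuming independent flips
    at a constant per-bit rate (a Poisson model)."""
    expected_flips = flips_per_bit_hour * bits * hours
    # 1 - exp(-x), computed with expm1 so tiny probabilities stay accurate
    return -math.expm1(-expected_flips)


if __name__ == "__main__":
    RATE = 1e-12                     # hypothetical flips per bit per hour
    BITS_PER_MACHINE = 8 * 2**30 * 8  # assumes 8 gigabytes of RAM per machine

    # One machine, one hour: a flip is unlikely.
    print(prob_at_least_one_flip(RATE, BITS_PER_MACHINE, 1))

    # Hypothetical cluster of 8 machines running for 6 hours: a flip is very likely.
    print(prob_at_least_one_flip(RATE, 8 * BITS_PER_MACHINE, 6))
```

The point is the shape of the curve, not the particular values: the chance of at least one flip depends only on the product of rate, bits, and hours, so scaling up memory or run time pushes it toward 1 even when the per-bit rate is tiny.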
I guess the BOFH was right, after all.