[openib-general] IB: I don't like what I'm seeing.
Troy Benjegerdes
Wed Mar 31 23:43:38 PST 2004
On Thu, Apr 01, 2004 at 12:01:53AM -0700, Clauser, Milton wrote:
> Steve, I'm not convinced that it's possible to check for all possible bit
> errors in hardware or otherwise. Perhaps I'm showing my ignorance, but can
> you check for a bit error caused, for example, by a cosmic ray hitting a
> processor during an arithmetic operation?
>
> So, again I ask the hard question (not rhetorically) how do you decide
> what's worth checking, in order to give you adequate confidence in the
> result?
Here's the real issue:
Vendors deal well enough with bit errors in the CPU, memory subsystem,
PCI bus, etc., since those errors are well characterized and tested for.
Memory in particular is pretty much all ECC now. Excessive bit error
rates result in customers RMA'ing a machine or components.
The probability of an error scales linearly with the number of nodes.
This is probably manageable.
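
To put a rough number on "scales linearly": if each node independently has
some small probability p of an error in a given period, the chance that at
least one of N nodes sees one is 1 - (1 - p)^N, which stays close to N*p as
long as N*p is small. A minimal Python sketch, where the per-node probability
is a made-up value chosen only for illustration:

```python
def p_any_error(p_node: float, n_nodes: int) -> float:
    """Probability that at least one of n_nodes independent nodes has an error."""
    return 1.0 - (1.0 - p_node) ** n_nodes

# P_NODE is a hypothetical per-node error probability, not a measured rate.
P_NODE = 1e-6

for n in (1, 10, 100, 1000):
    exact = p_any_error(P_NODE, n)
    linear = n * P_NODE  # the "scales linearly" approximation
    print(f"{n:5d} nodes: exact={exact:.3e}  linear approx={linear:.3e}")
```

The two columns only start to drift apart once N*p is no longer small, which
is the sense in which the node-side problem stays manageable.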
The specific problem we get into with HPC is that the number of network
links scales well beyond anything network hardware vendors realistically
have the capability to test, and simulation cannot take into account
effects like ground loop noise introduced across 1000 nodes connected by
copper cable.
The network is really the 'weakest link', since any problems are
magnified by having something on the order of N*log(N) switches and
network links. If there is a lot of design margin in the network, it's
not a problem. But if the signal-to-noise margins are tight, the scaling
factors quickly make it completely unmanageable. And the *only* people
that have these problems are the 5 customers with 1000+ node clusters.
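
As a back-of-the-envelope illustration of the N*log(N) estimate above, the
sketch below multiplies an assumed link count by an assumed per-link bit
error rate and line rate. The log base (2), the BER of 1e-12, and the
10 Gbit/s rate are all assumptions picked for the example, and every link is
assumed to run flat out, so treat the output as an order-of-magnitude guide
only:

```python
import math

def approx_links(n_nodes: int) -> float:
    """Rough link count using N*log2(N); the base and constant factor are assumptions."""
    return n_nodes * math.log2(n_nodes)

def expected_errors_per_day(n_nodes: int, ber: float, bits_per_sec: float) -> float:
    """Expected bit errors per day summed over all links, each fully loaded."""
    seconds_per_day = 86_400
    return approx_links(n_nodes) * ber * bits_per_sec * seconds_per_day

# Hypothetical figures, not measurements from any real fabric.
BER = 1e-12   # assumed per-link bit error rate
RATE = 10e9   # assumed per-link rate in bits per second

for n in (16, 128, 1024):
    links = approx_links(n)
    errors = expected_errors_per_day(n, BER, RATE)
    print(f"{n:5d} nodes, ~{links:6.0f} links: {errors:12.1f} bit errors/day")
```

A link running with tighter margin than assumed raises its BER by orders of
magnitude, and the total scales with it, which is why the margin matters so
much more at 1000+ nodes than on a small cluster.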
Note to vendors: If you make your InfiniBand network work in this
environment, you'll have fewer hardware returns from regular customers
;)
Personally, I think InfiniBand has a pretty good chance of working, if
for no other reason than we can change out the switches and cards until
something works. Maybe one vendor has better signal margin on the switch
or card for some reason.
But it seems pretty clear people aren't going to be happy until we've
got something up and running to exercise a network with more than 2000
links and report end-to-end error rates.
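
One way to picture an end-to-end error report: the sender computes a checksum
over each message, the receiver recomputes it, and the mismatches are counted.
The sketch below is a simulation only. The noisy link, its bit-flip
probability, and the function names are all hypothetical, the checksum is
assumed to arrive intact, and nothing here touches real InfiniBand hardware
or any InfiniBand API:

```python
import random
import zlib

def send_over_noisy_link(payload: bytes, flip_prob: float, rng: random.Random) -> bytes:
    """Simulated link: flips each bit independently with probability flip_prob."""
    out = bytearray(payload)
    for i in range(len(out)):
        for bit in range(8):
            if rng.random() < flip_prob:
                out[i] ^= 1 << bit
    return bytes(out)

def end_to_end_error_rate(n_msgs: int, msg_len: int, flip_prob: float, seed: int = 0) -> float:
    """Fraction of messages whose sender-side CRC32 does not match at the receiver."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(n_msgs):
        payload = bytes(rng.getrandbits(8) for _ in range(msg_len))
        sent_crc = zlib.crc32(payload)  # computed before the data hits the link
        received = send_over_noisy_link(payload, flip_prob, rng)
        if zlib.crc32(received) != sent_crc:
            bad += 1
    return bad / n_msgs

if __name__ == "__main__":
    # Hypothetical bit-flip probabilities, far higher than any sane real link.
    for p in (0.0, 1e-5, 1e-4):
        rate = end_to_end_error_rate(n_msgs=200, msg_len=256, flip_prob=p)
        print(f"flip_prob={p:g}: {rate:.3f} of messages failed the check")
```

On a real cluster the same counting would run between application endpoints
over every link, so that errors the link-level checks miss still show up in
the totals.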