[openib-general] IB: I don't like what I'm seeing.
Stephen Poole
Wed Mar 31 18:52:47 PST 2004
> ron> So, we are on our 5th (6th?) network which guarantees
> ron> end-to-end reliability, even though it is really
> ron> card-to-card. I no longer buy the arguments -- we have to
> ron> measure them. The BER sounds too high for us to assume
> ron> end-to-end will work.
>
>Fair enough -- you obviously have much more experience with giant
>installations than I do. Were you ever able to track down where these
>previous networks fell down? Was it HCA chip bugs, random glitches
>(cosmic rays :) inside the chips, data getting corrupted going across
>PCI or something else?
Yes. You laugh when you mention cosmic rays. Well, don't. At 7,300 ft
we are 13X more likely to get memory errors than someone at sea
level. We have seen errors like this on anything that ends up being
unprotected. Not to mention the sheer size of the "computer". If you
look at Q, for instance, it is in a room that is approximately
45,000 sq ft. Think of the particle paths, and then figure that we have
very little atmosphere to "filter" out the bad ones. We therefore get
more than most. *IF* we do not have a reliable way of sending messages,
poof, errors. We have had to make sure things work.
HSPI
HIPPI
Dolphin
GSN
Quadrics
Myrinet
With the odd FDDI, FCS... thrown in for good measure. The issues have
been growing because our "computer" is growing and the network really
is part of the computer.
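
To make the card-to-card versus end-to-end distinction concrete, here is
a minimal sketch, assuming a plain CRC32 trailer. The sender computes the
checksum over the payload while it is still in host memory, and the
receiver rechecks it only after the data has landed in host memory on the
far side, so corruption on PCI or in an unprotected buffer is caught even
if the link-level check passed. The names wrap, unwrap and flip_bit are
hypothetical helpers for illustration, not part of any IB stack.

```python
import zlib


def wrap(payload: bytes) -> bytes:
    """Sender: append a CRC32 computed over the payload in host memory."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unwrap(message: bytes) -> bytes:
    """Receiver: recheck the CRC32 after the message is in host memory."""
    if len(message) < 4:
        raise ValueError("message too short to carry a checksum")
    payload, trailer = message[:-4], message[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        raise ValueError("end-to-end checksum mismatch")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit upset somewhere between the two hosts."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = wrap(b"halo exchange block")
    assert unwrap(message) == b"halo exchange block"
    try:
        unwrap(flip_bit(message, 5))
    except ValueError as err:
        print("detected:", err)
```

A 32-bit CRC is only an assumption here; a large installation may want a
stronger check, and the check only helps if the application also has a
way to ask for the data again.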
>
> - Roland
>
>--
>To unsubscribe send an email with subject unsubscribe to
>openib-general at openib.org.
>Please contact moderator at openib.org for questions.
--
Steve Poole (spoole at lanl.gov)
Los Alamos National Laboratory
CCN - Special Projects / Advanced Development