[openib-general] IB: I don't like what I'm seeing.

Andrey Slepuhin
Wed Mar 31 23:25:47 PST 2004


On Thu, 2004-04-01 at 06:02, Roland Dreier wrote:
>     ron> So, we are on our 5th (6th?) network which guarantees
>     ron> end-to-end reliability, even though it is really
>     ron> card-to-card. I no longer buy the arguments -- we have to
>     ron> measure them. The BER sounds too high for us to assume
>     ron> end-to-end will work.
> 
> Fair enough -- you obviously have much more experience with giant
> installations than I do.  Were you ever able to track down where these
> previous networks fell down?  Was it HCA chip bugs, random glitches
> (cosmic rays :) inside the chips, data getting corrupted going across
> PCI or something else?

I'm reading this discussion with great interest, because I'm quite
worried about the current state of interconnect reliability. Let me
share a real story from one of our solutions. We built a 64-node
SCI-based cluster. All went well, but after installing the whole
cluster we found that Linpack failed randomly. After investigating the
problem we found that it was caused by data corruption somewhere. We
also found that lowering the PCI bus frequency to 33 MHz made the
problem go away. After some discussions with Dolphin and the
motherboard manufacturer, they set up a test platform and reported that
the problem was due to noise on the PCI bus. The motherboard
manufacturer made some modifications to the board and said that
everything worked fine (and I verified this by running stress tests via
remote access to the test platform). After we received these modified
(exactly these!!!) motherboards here in Moscow, we found that random,
though rare, failures still occurred. The end of this story is that we
completely replaced all the motherboards with boards from a different
manufacturer.
And I'm sure this story is not unique. Remember, for example, the issue
with 3ware RAID controllers...

Finally, I want to state the following:

1) Ron is absolutely right that full software checking is the only way
to be sure that programs running for days (or even months) on large
installations are working correctly. I do not like this, but we should
accept it as a fact. (A rough sketch of what I mean follows this list.)
2) Ron is right that the current IB stacks are not well designed.
3) It is possible that there are bugs in interconnect hardware, but we
cannot be sure that all scalability errors are caused by interconnect
bugs.
4) While implementing reliability in software is a good idea, we should
not completely ignore the hardware. Otherwise hardware manufacturers
will offer more and more unreliable solutions, because nobody will have
an interest in hardware reliability anymore.

All of the above is my personal opinion, so please don't blame me for it.

Best regards,
Andrey Slepuhin,
Head of Cluster Solutions Center,
T-Platforms

-- 
A right thing should be simple (tm)

