[openib-general] IB: I don't like what I'm seeing.

Thu Apr 1 00:41:00 PST 2004

Ron,

As for InfiniBand scalability with RC connections, take the
Virginia Tech cluster of 1100 2-way SMPs (2200 connections/node), which has
been running heavy computational tasks for days. Maybe this is the example
you are looking for.

Also, please keep in mind that InfiniBand was designed to work in a very
different environment than TCP/IP, which assumes that the network is
inherently unstable. So yes, if you want Reliable Connection to work
reasonably, you have to expect a low number of failures per unit of time.
Real high-end applications can use advanced mechanisms such as alternate path
failover (also built into the HW) to achieve reliability under network
failures.

Edward

-----Original Message-----
From: ron minnich [mailto:rminnich at lanl.gov]
Sent: Wednesday, March 31, 2004 9:58 PM
To: openib-general at openib.org
Subject: Re: [openib-general] IB: I don't like what I'm seeing.

we don't believe in HCA reliability here. It has not worked once in all 
the years of delivered networks. We're going to assume, unless we can see 
BER of 10^-21 app-to-app, that the network is unreliable. So, yes, toss and 
start over is not inconceivable.

On the other hand, if we do get a perfect network, app to app, nobody's 
going to complain, but until we see it at scale 1024+, I am not sure we 
can really count on it. 

Sorry if I upset anyone on this list with my comments -- forgot it was 
this open and it was early morning. But the code still worries me.

ron
