[openib-general] IB: I don't like what I'm seeing.
Clauser, Milton
Wed Mar 31 23:01:53 PST 2004
Steve, I'm not convinced that it's possible to check for all possible bit
errors in hardware or otherwise. Perhaps I'm showing my ignorance, but can
you check for a bit error caused, for example, by a cosmic ray hitting a
processor during an arithmetic operation?
So, again I ask the hard question (not rhetorically): how do you decide
what's worth checking, in order to give you adequate confidence in the
result?
Milt Clauser
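
One rough way to put numbers on this question: if bit errors are assumed to be independent with a fixed rate p, the chance that a job moving N bits sees at least one error is 1 - (1 - p)^N, which is about N*p while N*p is small. The sketch below is only an illustration under that idealized model. The node counts, the 10 Gbit/s per-node rate, the job durations and the first two BER values are hypothetical; the 1e-21 value is taken as the app-to-app figure quoted further down the thread.

```python
import math


def p_at_least_one_error(ber: float, bits: float) -> float:
    """Probability of one or more bit errors among `bits` bits.

    Assumes independent errors at a fixed rate `ber` (an idealized model).
    """
    # 1 - (1 - ber)**bits, written with log1p/expm1 so tiny rates do not
    # vanish in floating-point rounding.
    return -math.expm1(bits * math.log1p(-ber))


def bits_moved(nodes: int, gbit_per_s: float, hours: float) -> float:
    """Total bits moved if every node sends at `gbit_per_s` for `hours`."""
    return nodes * gbit_per_s * 1e9 * hours * 3600


if __name__ == "__main__":
    # Hypothetical job shapes, chosen only to contrast small and large runs.
    jobs = [("small job", 16, 1.0), ("large job", 1024, 100.0)]
    for label, nodes, hours in jobs:
        n = bits_moved(nodes, 10.0, hours)
        for ber in (1e-12, 1e-15, 1e-21):
            p = p_at_least_one_error(ber, n)
            print(f"{label}: {n:.2e} bits at BER {ber:.0e} -> P(error) = {p:.3g}")
```

Under this model the risk depends only on the product of BER and bits moved, which is one way to read the remark that small jobs should not need as much checking as large ones.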
-----Original Message-----
From: Stephen Poole
To: openib-general at openib.org
Sent: 3/31/2004 11:22 PM
Subject: RE: [openib-general] IB: I don't like what I'm seeing.
>>>There is going to be a nasty tradeoff between BER and performance
>>>that I really don't think the vendors have thought about much yet.
>
>>But, infinitely fast with an infinitely large BER is a bad thing. :-)
>
>We're not interested in generating garbage rapidly, so it's not really a
>tradeoff. High performance is relevant only if we have confidence there are
>no errors.
My point.
>
>But I submit that it's not possible to do truly end-to-end error checking,
>by which I mean checking for all possible errors from beginning to end of a
>job. So where do you want to draw the line? What/how much do you have to
>check to give you an acceptable level of confidence? And it shouldn't be
>necessary to do as much checking on small jobs as on large jobs.
*IF* you check at each point in the link, then you can come close to
guaranteeing that the traffic is safe. But your point is well taken:
that is a drastic measure and can really only be done in HW.
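
To illustrate the difference between checking at each point in the link and checking app-to-app, here is a minimal sketch. It assumes a toy model in which every link has its own CRC that the receiving node verifies, and the sending application also computes one CRC that only the receiving application verifies. `zlib.crc32` is a stand-in, not the CRC that InfiniBand or GSN hardware uses, and `flip_bit`, `deliver` and the fault parameters are hypothetical names for this example. A failed link check is only recorded here; real hardware would drop or retransmit.

```python
import zlib
from typing import List, Optional, Tuple


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of `data` with one bit inverted."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


def deliver(message: bytes, hops: int,
            wire_fault_at: Optional[int] = None,
            buffer_fault_at: Optional[int] = None) -> Tuple[List[bool], bool]:
    """Push `message` across `hops` links and report what each check saw.

    Returns (per-link check results, end-to-end check result).
    """
    e2e_crc = zlib.crc32(message)      # computed once by the sending app
    data = message
    link_checks = []
    for hop in range(hops):
        link_crc = zlib.crc32(data)    # sender side of this link
        if hop == wire_fault_at:
            data = flip_bit(data, 0)   # corruption on the wire
        link_checks.append(zlib.crc32(data) == link_crc)  # receiver side
        if hop == buffer_fault_at:
            data = flip_bit(data, 0)   # corruption inside the node,
                                       # after the link check has passed
    return link_checks, zlib.crc32(data) == e2e_crc


if __name__ == "__main__":
    msg = b"payload moving through three links"
    print("clean path:  ", deliver(msg, 3))
    print("wire fault:  ", deliver(msg, 3, wire_fault_at=1))
    print("buffer fault:", deliver(msg, 3, buffer_fault_at=1))
```

In this toy model a bit flipped on the wire is caught by that link's check, while a bit flipped inside a node between links passes every link check and shows up only in the app-to-app comparison.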
>
>Milt Clauser
>
>-----Original Message-----
>From: Stephen Poole
>To: openib-general at openib.org
>Sent: 3/31/2004 8:03 PM
>Subject: Re: [openib-general] IB: I don't like what I'm seeing.
>
>>On Wed, Mar 31, 2004 at 12:58:08PM -0700, ron minnich wrote:
>>> we don't believe in HCA reliability here. It has not worked once in all
>>> the years of delivered networks. We're going to assume, unless we can see
>>> BER of 10^-21 app-to-app, that the network is unreliable. So, yes, toss and
>>> start over is not inconceivable.
>>>
>>> On the other hand, if we do get a perfect network, app to app, nobody's
>>> going to complain, but until we see it at scale 1024+, I am not sure we
>>> can really count on it.
>>>
>>> Sorry if I upset anyone on this list with my comments -- forgot it was
>>> this open and it was early morning. But the code still worries me.
>>
>>This is open-source, peer review development, if nobody is getting upset
>>we're not doing it right ;)
>>
>>This is the first mention of bit error rates I've seen in an infiniband
>>discussion. Does anyone have end-to-end BER numbers for any deployed
>>infiniband installations?
>
>Quite difficult to determine. It would be nice to see actual numbers
>when they are available. The question is, *IF* it is an undetected
>error, how do you know you got one if you are not looking for them?
>:-) There was some nice work when we were working on GSN (remember,
>the predecessor to IB) on potential error rates based on the two
>CRCs that GSN used. I will try to dig it up.
>
>>
>>There is going to be a nasty tradeoff between BER and performance that I
>>really don't think the vendors have thought about much yet.
>
>But, infinitely fast with an infinitely large BER is a bad thing. :-)
>
>>
>>--
>>To unsubscribe send an email with subject unsubscribe to
>>openib-general at openib.org.
>>Please contact moderator at openib.org for questions.
>
>
>--
>Steve Poole (spoole at lanl.gov)
>Los Alamos National Laboratory
>CCN - Special Projects / Advanced Development
--
Steve Poole (spoole at lanl.gov)
Los Alamos National Laboratory
CCN - Special Projects / Advanced Development