[openib-general] IB: I don't like what I'm seeing.

Stephen Poole
Wed Mar 31 22:22:10 PST 2004


>>>There is going to be a nasty tradeoff between BER and performance
>>>that I really don't think the vendors have thought about much yet.
>
>>But, infinitely fast with an infinitely large BER is a bad thing. :-)
>
>We're not interested in generating garbage rapidly, so it's not really a
>tradeoff. High performance is relevant only if we have confidence there are
>no errors.

My point.
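
To put rough numbers on "garbage rapidly", here is a back-of-the-envelope sketch of how many bit errors to expect for a given BER and traffic volume. The link count, the 10 Gb/s rate, and the one-day window are hypothetical figures picked for illustration, not measurements from any deployed fabric, and I am reading the app-to-app target quoted later in this thread as 10^-21.

```python
SECONDS_PER_DAY = 86_400


def expected_bit_errors(ber, bits_per_second, seconds, links=1):
    """Expected number of bit errors: BER times total bits moved.

    Assumes errors are independent and every link runs at full rate
    for the whole interval (an optimistic, simplified model).
    """
    return ber * bits_per_second * seconds * links


if __name__ == "__main__":
    # Hypothetical cluster: 1024 links at 10 Gb/s, one day of traffic.
    for ber in (1e-12, 1e-15, 1e-21):
        errors = expected_bit_errors(ber, 10e9, SECONDS_PER_DAY, links=1024)
        print(f"BER {ber:g}: ~{errors:.3g} expected bit errors per day")
```

Under those assumptions a BER of 1e-12 works out to several hundred thousand bit errors a day across the machine, while 1e-21 works out to well under one. How many of those go undetected depends on the checks in the path, which this sketch does not model.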

>
>But I submit that it's not possible to do truly end-to-end error checking,
>by which I mean checking for all possible errors from beginning to end of a
>job.  So where do you want to draw the line?  What/how much do you have to
>check to give you an acceptable level of confidence?  And it shouldn't be
>necessary to do as much checking on small jobs as on large jobs.

*IF* you check at each point in the link, then you can come close to
guaranteeing that the traffic is safe. But your point is well taken:
that is a drastic measure and can really only be done in HW.
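
For comparison, the cheap software alternative is a check at the two ends only: the sender appends a checksum and the receiving application verifies it. The sketch below uses CRC-32 from Python's zlib purely as an illustration. The function names `frame`, `check`, and `flip_bit` are made up for this example, and none of this is what an HCA does in hardware or what any IB stack implements.

```python
import struct
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 of the payload."""
    return payload + struct.pack(">I", zlib.crc32(payload) & 0xFFFFFFFF)


def check(framed: bytes) -> bytes:
    """Return the payload if its CRC matches, else raise ValueError."""
    if len(framed) < 4:
        raise ValueError("frame too short to hold a CRC")
    payload = framed[:-4]
    (crc,) = struct.unpack(">I", framed[-4:])
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch: data corrupted between the endpoints")
    return payload


def flip_bit(data: bytes, bit: int) -> bytes:
    """Simulate a single-bit error at the given bit offset."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


if __name__ == "__main__":
    msg = frame(b"halo exchange, step 42")
    assert check(msg) == b"halo exchange, step 42"
    try:
        check(flip_bit(msg, 13))
    except ValueError as err:
        print("detected:", err)
```

A check like this only tells you that something between the two endpoints went wrong, not which hop did it, and it costs CPU time on every message. That is the performance side of the tradeoff.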

>
>Milt Clauser
>
>-----Original Message-----
>From: Stephen Poole
>To: openib-general at openib.org
>Sent: 3/31/2004 8:03 PM
>Subject: Re: [openib-general] IB: I don't like what I'm seeing.
>
>>On Wed, Mar 31, 2004 at 12:58:08PM -0700, ron minnich wrote:
>>>   we don't believe in HCA reliability here. It has not worked once in all
>>>   the years of delivered networks. We're going to assume, unless we can see
>>>   BER of 10^-21 app-to-app, that the network is unreliable. So, yes, toss and
>>>   start over is not inconceivable.
>>>
>>>   On the other hand, if we do get a perfect network, app to app, nobody's
>>>   going to complain, but until we see it at scale 1024+, I am not sure we
>>>   can really count on it.
>>>
>>>   Sorry if I upset anyone on this list with my comments -- forgot it was
>>>   this open and it was early morning. But the code still worries me.
>>
>>This is open-source, peer review development, if nobody is getting upset
>>we're not doing it right ;)
>>
>>This is the first mention of bit error rates I've seen in an infiniband
>>discussion. Does anyone have end-to-end BER numbers for any deployed
>>infiniband installations?
>
>Quite difficult to determine. It would be nice to see actual numbers
>when they are available. The question is, *IF* it is an undetected
>error, how do you know you got one if you are not looking for them?
>:-) There was some nice work when we were working on GSN (remember,
>the predecessor to IB) on potential error rates based on the two
>CRCs that GSN used. I will try and dig it up.
>
>>
>>There is going to be a nasty tradeoff between BER and performance that I
>>really don't think the vendors have thought about much yet.
>
>But, infinitely fast with an infinitely large BER is a bad thing. :-)
>
>>
>>--
>>To unsubscribe send an email with subject unsubscribe to
>>openib-general at openib.org.
>>Please contact moderator at openib.org for questions.
>
>
>--
>Steve Poole (spoole at lanl.gov)
>Los Alamos National Laboratory
>CCN - Special Projects / Advanced Development


-- 
Steve Poole (spoole at lanl.gov)
Los Alamos National Laboratory
CCN - Special Projects / Advanced Development




