[openib-general] IB: I don't like what I'm seeing.
ron minnich
Wed Mar 31 15:12:18 PST 2004
On 31 Mar 2004, Roland Dreier wrote:
> I guess I don't understand your requirements. It seems like HPC apps
> would not want to burn CPU checksumming all the data they transfer.
They don't want to, but they do today. They do so because, every time so
far, the HCAs that guaranteed reliability could not deliver on that promise.
> And even if you do checksum it once how do you know your memory
> controller or cache controller doesn't corrupt it before the next time
> the app uses it?
This is a quantitative issue. The bit error rate (BER) of memory and cache
controllers is a tad better than what I've been quoted for IB.
> Certainly storage people seem to trust the reliability features in
> SCSI over fibre channel. I've never heard of anyone trying to run an
> end-to-end reliability protocol between their disk and their
> application.
>
This is a repeat of all the arguments I had with DEC/Quadrics in 2000, SGI
before that, Dolphin before that, and Myricom before that. They could
argue me into the ground in all the same ways (they even used the SCSI
argument). And yet, when push came to shove, we found uncorrected data
errors in data moved over all these networks. The fix has always been the
same: software checksums. We hate them, but we have to have them.
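
For illustration only, here is a minimal sketch of the kind of software
checksum described above: the sender computes a checksum over the buffer in
host memory, and the receiver recomputes it after the data has landed in
its own memory, so the check spans more than the card-to-card link. The
choice of CRC32 and the names wrap, unwrap and flip_bit are assumptions
made for this example, not anything taken from a particular MPI or IB
stack.

```python
import struct
import zlib

TRAILER_LEN = 4  # size of the CRC32 trailer in bytes


def wrap(payload: bytes) -> bytes:
    """Sender side: append a CRC32 computed over the payload in host memory."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + struct.pack("!I", crc)


def unwrap(message: bytes) -> bytes:
    """Receiver side: recompute the CRC32 and compare it with the trailer.

    Raises ValueError if the message is too short or the checksums differ.
    """
    if len(message) < TRAILER_LEN:
        raise ValueError("message shorter than checksum trailer")
    payload, trailer = message[:-TRAILER_LEN], message[-TRAILER_LEN:]
    (expected,) = struct.unpack("!I", trailer)
    actual = zlib.crc32(payload) & 0xFFFFFFFF
    if actual != expected:
        raise ValueError("checksum mismatch: data corrupted in transit")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate an uncorrected single-bit error somewhere along the path."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    message = wrap(b"simulation results")
    print(unwrap(message))  # clean transfer: payload comes back unchanged
    try:
        unwrap(flip_bit(message, 5))
    except ValueError as err:
        print("detected:", err)
```

The cost is one extra pass over every buffer on each side, which is exactly
the CPU time nobody wants to burn.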
So, we are on our 5th (6th?) network that guarantees end-to-end
reliability, even though the guarantee is really only card-to-card. I no
longer buy the arguments -- we have to measure the error rates ourselves.
The BER I have been quoted sounds too high for us to assume end-to-end
will work.
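
To make the quantitative point concrete, here is a rough sketch of the
arithmetic, assuming independent bit errors at a fixed rate. The BER and
transfer size in the example are hypothetical placeholders, not figures
quoted in this thread, and the function names are made up for the example.

```python
import math


def expected_bit_errors(ber: float, bytes_transferred: int) -> float:
    """Expected number of flipped bits, assuming independent errors."""
    return ber * bytes_transferred * 8


def prob_at_least_one_error(ber: float, bytes_transferred: int) -> float:
    """Probability that at least one bit is flipped in the whole transfer.

    Computes 1 - (1 - ber)**bits using log1p/expm1 so that very small
    BER values do not lose precision.
    """
    bits = bytes_transferred * 8
    return -math.expm1(bits * math.log1p(-ber))


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the calculation.
    ber = 1e-12
    volume = 10**12  # one terabyte moved over the link
    print(expected_bit_errors(ber, volume))
    print(prob_at_least_one_error(ber, volume))
```

What matters is how many of those errors get through uncorrected, and that
is the number that has to come from measurement.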
ron