[openib-general] IB: I don't like what I'm seeing.

ron minnich
Wed Mar 31 20:10:52 PST 2004


On 31 Mar 2004, Roland Dreier wrote:

> Fair enough -- you obviously have much more experience with giant
> installations than I do.  Were you ever able to track down where these
> previous networks fell down?  Was it HCA chip bugs, random glitches
> (cosmic rays :) inside the chips, data getting corrupted going across
> PCI or something else?

It varies with (not kidding) altitude, day of month that mainboards were
made, which lot of lasers were used, date/location (e.g. we found one
problem with a Philippines chipset not found on a Malaysian chipset --
same Intel part, but ...) during which northbridge or other chipsets were
made, power supply used in the nodes, day of month that NIC boards were
made, version of board software, version of board firmware, version of
motherboard bios, version of compiler used to build the OS, version of
compiler used to build board firmware, lot # of parts used in the VRM
modules, who made the cables, who made the connectors ...

and those are the easy ones.

I've got great stories from the days when Ethernet pretended to do 
end-to-end data integrity. It really did. Best story is about an HP card 
that, it turned out, was unable to be used to ftp a particular file ... 
note, however, that this "giant installation" was two PCs at HP's Avondale
Division!

Note that Myricom in 1994 claimed to do end-to-end data integrity, and
Myricom in 2000 threw in the towel and told people to turn on software
checksums!

Which is why I don't trust any HCA/network to do reliable transport. It's
not always the HCA's fault, but that's not really important. The only
integrity you can get is end-to-end, and currently the only way to buy
that is with software checksums. Which I find regrettable but so it goes.
I've done a NIC design or two and once made the same mistake myself of
thinking I could guarantee data integrity -- turned out I could not. It 
was good to know I was in such good company.
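
For illustration, here is a minimal sketch of what a software end-to-end
checksum can look like. It is not the scheme any particular vendor or
stack used: the names frame() and unframe() are hypothetical, and CRC-32
from Python's zlib module is only an assumed choice of checksum. The
point is that the sender computes the checksum in host memory before the
data reaches the NIC, and the receiver recomputes it after the data has
landed in its own memory, so everything in between is covered.

```python
import zlib

TRAILER_LEN = 4  # CRC-32 is 32 bits


def frame(payload: bytes) -> bytes:
    """Sender side: append a CRC-32 of the payload as a 4-byte trailer."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(TRAILER_LEN, "big")


def unframe(message: bytes) -> bytes:
    """Receiver side: recompute the CRC-32 and compare it to the trailer.

    Raises ValueError if the message is too short or the checksum differs.
    """
    if len(message) < TRAILER_LEN:
        raise ValueError("message shorter than checksum trailer")
    payload, trailer = message[:-TRAILER_LEN], message[-TRAILER_LEN:]
    expected = int.from_bytes(trailer, "big")
    actual = zlib.crc32(payload) & 0xFFFFFFFF
    if actual != expected:
        raise ValueError("checksum mismatch: data corrupted in transit")
    return payload
```

On a mismatch the receiver would normally ask for a retransmit; that
part is left out here.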

ron


