[openib-general] IB: I don't like what I'm seeing.
Matt L. Leininger
Wed Mar 31 13:05:31 PST 2004
On Wed, 2004-03-31 at 12:34, Roland Dreier wrote:
> ron> it's an old rule of networking that the only error
> ron> detection/correction that works is end-to-end. Because IB
> ron> puts that support in the card, it can not by definition do
> ron> end-to-end error detection/correction.
>
> I guess I don't understand your requirements. It seems like HPC apps
> would not want to burn CPU checksumming all the data they transfer.
> And even if you do checksum it once how do you know your memory
> controller or cache controller doesn't corrupt it before the next time
> the app uses it?
Our requirements are to maintain the high performance and
manageability features of an interconnect as we scale out to 1000 nodes
and beyond (10,000 or more). The point Ron is making is that lots of
network interconnects that claimed to work for HPC at scale really
don't. Scaling out to large node counts exposes new and unforeseen
problems that may never occur at the 16, 128, or even 512 node levels,
and potentially problems that would never show up in simulation. For
those earlier networks, data corruption and performance problems at
large scale have always been serious. This is one of the main reasons
Ron and others started writing the code that has now become LA-MPI.
LA-MPI is capable of end-to-end error detection/correction for MPI
apps. We don't want to do the checksumming in software, but we've had
to at times because the underlying hardware has proven unreliable at
large scale.
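For anyone who hasn't seen what software end-to-end checksumming looks
like in practice, here's a rough sketch in C of the general idea (this
is not LA-MPI's actual code; the checksum, header layout, and function
names are purely illustrative): the sender computes a checksum over the
payload right before handing the buffer to the network and carries it
with the message, and the receiver recomputes it over the bytes that
actually landed in the destination buffer before the application
touches them.

/* Illustrative sketch of software end-to-end checksumming, in the
 * spirit of what LA-MPI does for MPI payloads.  Not LA-MPI code. */
#include <stddef.h>
#include <stdint.h>

/* Simple 32-bit additive checksum over the payload bytes. */
static uint32_t payload_csum(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += buf[i];
    return sum;
}

struct msg_hdr {
    uint32_t len;   /* payload length in bytes */
    uint32_t csum;  /* checksum computed by the sender */
};

/* Sender side: fill in the header just before posting the send. */
static void sender_prepare(struct msg_hdr *hdr,
                           const unsigned char *payload, uint32_t len)
{
    hdr->len  = len;
    hdr->csum = payload_csum(payload, len);
}

/* Receiver side: verify over the bytes that were actually delivered.
 * A mismatch means something between the two application buffers
 * (NIC, bus, switch, memory) corrupted the data, and the message has
 * to be retransmitted. */
static int receiver_verify(const struct msg_hdr *hdr,
                           const unsigned char *payload)
{
    return payload_csum(payload, hdr->len) == hdr->csum;
}

The point is that the check happens in the application's own buffers at
both ends, so it covers every hop in between, which is exactly the
coverage a CRC computed on the card can't give you.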
I wouldn't bring up the topic of cache corruption with the Los Alamos
HPC folks. :) They have lots of stories about that particular issue.
It can happen. As an example: you think things are reliable, and then
you build a machine at 7,000 ft (Los Alamos), where soft error rates
are noticeably higher than at 300 ft (Livermore, CA). Yes, there are
solid technical reasons for Ron's concerns.
>
> Certainly storage people seem to trust the reliability features in
> SCSI over fibre channel. I've never heard of anyone trying to run an
> end-to-end reliability protocol between their disk and their
> application.
I don't think the SCSI over FC folks are linking thousands of ports
together to maintain full bisection bandwidth and then running a single
parallel application that requires vast amounts of internode
communication across the entire fabric. HPC scalability requirements
are unique to HPC.
>
> In any case the UC transport of IB might work better for you. However
> the best network adapter for your application might be a multiport
> gigabit ethernet NIC with all the checksum offload features turned
> off.
Why is ethernet not right for HPC? High latency, low bandwidth, high
price, high CPU overhead, and less than full bisection bandwidth. 4X
InfiniBand ports are cheaper than GigE for full bisection bandwidth
switches over 100 ports. 10GigE is what, $10k per port, while 4X IB is
$300 per port. Ethernet doesn't scale well either; just look at
www.top500.org.
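To put rough numbers on that (taking the per-port figures above at
face value; actual pricing varies by vendor and switch size), even a
modest 128-port full-bisection fabric works out to roughly:

    128 ports x $10,000/port = $1,280,000  (10GigE)
    128 ports x    $300/port =     $38,400 (4X IB)

and the gap only gets worse as you head toward the thousands of ports
an HPC fabric needs.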
>
> - Roland
--
Matt L. Leininger, Ph.D.
Sandia National Laboratory, Livermore CA
High Performance Computing and Networking
E-mail: mlleini at ca.sandia.gov