[openib-general] IB: I don't like what I'm seeing.

Stephen Poole
Wed Mar 31 13:30:18 PST 2004


>On Wed, 2004-03-31 at 12:34, Roland Dreier wrote:
>>      ron> it's an old rule of networking that the only error
>>      ron> detection/correction that works is end-to-end. Because IB
>>      ron> puts that support in the card, it cannot, by definition, do
>>      ron> end-to-end error detection/correction.
>>
>>  I guess I don't understand your requirements.  It seems like HPC apps
>>  would not want to burn CPU checksumming all the data they transfer.
>>  And even if you do checksum it once, how do you know your memory
>>  controller or cache controller doesn't corrupt it before the next time
>>  the app uses it?
>
>    Our requirements are to maintain the high performance and
>manageability features of an interconnect as we scale out to 1000 nodes
>and beyond (10,000 or more).  The point Ron is making is that lots of
>network interconnects that claimed to work for HPC at scale really
>don't.  Scaling out to large node counts exposes new and unforeseen
>problems that may never occur at the 16, 128, or even 512 node levels.
>And potentially problems that would not occur under simulation.  For
>these previous networks, data corruption and performance problems at
>large scale have always been serious.  This is one of the main
>reasons Ron and others started writing code that has now become LA-MPI.
>LA-MPI is capable of end-to-end error detection/correction for MPI
>apps.  We don't want to do the checksumming in software, but we've had
>to at times because the underlying hardware has proven unreliable at
>large scale.
>
>    I wouldn't bring up the topic of cache corruption to the Los Alamos
>HPC folks.  :) They have lots of stories about that particular issue.
>It can happen.  As an example, you think things are reliable, then you
>build a machine at 7,000 ft (Los Alamos).  Error rates are higher at
>7,000 ft than at 300 ft (Livermore, CA).  Yes, there are solid technical
>reasons for Ron's concerns.

With said code, we have actually detected errors on "reliable" 
networks. Do we like the performance hit? No. But do we potentially 
have to live with it? Yes. Our business does not tolerate "wrong 
answers".

>>
>>  Certainly storage people seem to trust the reliability features in
>>  SCSI over fibre channel.  I've never heard of anyone trying to run an
>>  end-to-end reliability protocol between their disk and their
>>  application.
>
>    I don't think the SCSI over FC folks are linking thousands of ports
>together to maintain full bisection bandwidth and then running a single
>parallel application across the entire fabric that requires vast amounts
>of internode communication.  HPC scalability requirements are unique to
>HPC.

Do a trivial back-of-the-envelope calculation. Say 4096 nodes doing 
all-to-all communication of 8K messages once every 5-10 seconds, 
running 24x7 for, say, five months, while also having to do I/O, 
potentially to the same fabric, to the tune of 50TB per day. These 
things stress networks, and we do not have the luxury of saying, "Hey, 
it is only a few bad numbers".
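
Rough numbers, taking the aggressive end of that range (one all-to-all 
every 5 seconds) and 150 days for five months; the constants in this 
sketch are illustrative assumptions, not measurements:

# Back-of-the-envelope sketch of the traffic described above.
NODES      = 4096
MSG_BYTES  = 8 * 1024                  # 8K messages
PERIOD_S   = 5                         # one all-to-all every 5-10 s
DAYS       = 150                       # ~five months, 24x7
IO_PER_DAY = 50e12                     # 50TB/day of I/O, possibly same fabric

msgs_per_exchange = NODES * (NODES - 1)            # pairwise all-to-all
exchanges         = DAYS * 24 * 3600 // PERIOD_S
total_msgs        = msgs_per_exchange * exchanges
total_bytes       = total_msgs * MSG_BYTES + DAYS * IO_PER_DAY

print(f"messages per exchange : {msgs_per_exchange:,}")        # ~16.8 million
print(f"total exchanges       : {exchanges:,}")                # ~2.6 million
print(f"total messages        : {total_msgs:.2e}")             # ~4.3e13
print(f"bytes across fabric   : {total_bytes / 1e15:.0f} PB")  # ~360 PB
# Even at an undetected-error rate of 1e-15 per bit, ~360 PB of traffic
# means on the order of a few thousand corrupted bits silently reaching
# the application over the course of the run.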

>
>
>>
>>  In any case the UC transport of IB might work better for you.  However
>>  the best network adapter for your application might be a multiport
>>  gigabit ethernet NIC with all the checksum offload features turned
>>  off.

If we thought Ethernet was the right answer, we would have already 
done it. I cannot think of a network technology that we have not 
tried.

>   
>    Why is Ethernet not right for HPC?  High latency, low bandwidth, high
>price, high CPU overhead, and less than full bisection bandwidth.  4X
>InfiniBand ports are cheaper than GigE for full-bisection-bandwidth
>switches over 100 ports.  10GigE is, what, $10k per port, and 4X IB is
>$300.  Ethernet doesn't scale well either; just look at www.top500.org.
>>
>>   - Roland
>--
>Matt L. Leininger, Ph.D.
>Sandia National Laboratory, Livermore CA
>High Performance Computing and Networking
>E-mail:  mlleini at ca.sandia.gov
>World Wide Web: 
>Office phone: 
>Office fax:   
>


-- 
Steve Poole (spoole at lanl.gov)                             Office: 
Los Alamos National Laboratory                            Fax:    
CCN - Special Projects/Advanced Development               Office: 



