Tuesday, August 29, 2006

Life as a Debugger

What a day!

A DFS (distributed file system) suffered from a network problem, and my mentor asked me to solve it, which took most of my afternoon. The problem was actually an old one: when sending or receiving large blocks of data (10M – 70M, varying by machine), calls to send/recv may fail with error code WSAENOBUFS (10055). I believe the problem is caused by running out of kernel memory for sending or receiving data. The more powerful the machine, the larger the threshold buffer size seems to be.
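
A minimal sketch of the failing pattern, assuming a connected blocking SOCKET s and a prior WSAStartup call (the 64M size is just illustrative; link against Ws2_32.lib):

    #include <winsock2.h>
    #include <vector>
    #include <cstdio>

    // Try to push one large block through a single blocking send call.
    void send_one_big_block(SOCKET s)
    {
        std::vector<char> block(64 * 1024 * 1024);   // one large block, e.g. 64M
        int rc = send(s, &block[0], (int)block.size(), 0);
        if (rc == SOCKET_ERROR && WSAGetLastError() == WSAENOBUFS) {
            std::printf("send failed with WSAENOBUFS (10055)\n");
        }
    }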

As KB201213 suggests, the following workarounds can be applied:
  • Use the socket in non-blocking or asynchronous mode.
  • Break large data blocks into smaller ones and pass a relatively small buffer to send for blocking sockets, preferably no larger than 64K.
  • Set the SO_SNDBUF socket option to 0 (zero) to allow the stack to send from your application buffer directly (see the sketch right after this list).
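
A minimal sketch of that third workaround, assuming a valid connected SOCKET s (error handling elided):

    // Shrink the send buffer to zero so the stack sends directly from the
    // application's buffer instead of copying into kernel buffers first.
    int zero = 0;
    if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                   (const char*)&zero, sizeof(zero)) == SOCKET_ERROR) {
        // inspect WSAGetLastError()
    }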
However, asynchronous I/O does not always work for this problem, nor did setting the SO_SNDBUF option to zero get around it.

In my opinion, overlapped I/O locks the memory to be sent or received so that it cannot be paged out. When the amount of locked memory grows too large, the kernel decides there is not enough memory left and returns WSAENOBUFS.

The basic idea is to break large blocks into smaller ones. Calling send/recv in a for-loop over smaller buffers may be the simplest solution. An alternative is scatter/gather I/O with WSASend/WSARecv: the large buffer is still allocated in one piece as usual, but an array of WSABUF structures is passed in, each pointing to a different section of the buffer with a smaller size (16K - 1M, for example). This approach might be faster, though it is more bug-prone. Sketches of both variants follow.
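
Minimal sketches of the two variants on the send side, assuming a connected SOCKET s; the 64K and 1M chunk sizes, the function names, and the blocking, non-overlapped WSASend call are illustrative choices, not the exact code from the project:

    #include <winsock2.h>
    #include <vector>

    // Variant 1: plain for-loop, handing at most 64K to each blocking send call.
    bool send_in_chunks(SOCKET s, const char* data, size_t total_len)
    {
        const size_t chunk = 64 * 1024;                  // <= 64K per call
        size_t off = 0;
        while (off < total_len) {
            size_t n = (total_len - off < chunk) ? (total_len - off) : chunk;
            int rc = send(s, data + off, (int)n, 0);
            if (rc == SOCKET_ERROR)
                return false;                            // check WSAGetLastError()
            off += (size_t)rc;                           // send may accept less than n
        }
        return true;
    }

    // Variant 2: scatter/gather -- one WSASend call over an array of WSABUFs,
    // each pointing at a consecutive section of the same large buffer.
    bool send_in_sections(SOCKET s, char* data, size_t total_len)
    {
        const size_t section = 1024 * 1024;              // 16K - 1M per section
        std::vector<WSABUF> bufs;
        for (size_t off = 0; off < total_len; off += section) {
            WSABUF wb;
            wb.buf = data + off;
            wb.len = (ULONG)((total_len - off < section) ? (total_len - off)
                                                         : section);
            bufs.push_back(wb);
        }
        DWORD sent = 0;
        int rc = WSASend(s, &bufs[0], (DWORD)bufs.size(), &sent,
                         0 /*flags*/, NULL /*overlapped*/, NULL /*completion*/);
        return rc != SOCKET_ERROR;
    }

The receive side is symmetric with recv and WSARecv.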

I reproduced the problem on a desktop and solved it using the proposed approach. Further tests will be run on the original source code to verify the fix. By the way, note that the relatively new TransmitPackets API is likely to fail without reporting any error code.

Later I turned to a memory-exception problem when using IT++ to do SVD on a 30,000 × 30,000 matrix on a server with an AMD Opteron 254 (dual-core) and 16G of physical memory. Such a matrix takes slightly less than 8G of memory (30,000 × 30,000 doubles at 8 bytes each is about 7.2G), and an SVD would consume about 24G for the three matrices involved. However, someone had set the maximum virtual memory size to only 2G, which was bound to trigger the memory exception. After adjusting it, everything was OK.
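
For the record, a sketch of the memory arithmetic and of the call, assuming the ordinary dense-matrix svd overload in IT++ (the random matrix is just a stand-in for the real data):

    #include <itpp/itbase.h>
    using namespace itpp;

    // 30,000 x 30,000 doubles: 30000 * 30000 * 8 bytes ~= 7.2 GB per matrix.
    // Keeping A, U and V of that size (plus singular values and workspace) is
    // in the ballpark of the ~24 GB mentioned above -- far beyond a 2 GB
    // virtual-memory cap.
    int main()
    {
        const int n = 30000;
        mat A = randn(n, n);    // stand-in data; the real matrix came from elsewhere
        mat U, V;
        vec S;
        svd(A, U, S, V);        // A = U * diag(S) * V^T
        return 0;
    }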

After supper, I got a message from the physics guy on GTalk asking me to help compile another two libraries...
