Comment 6 for bug 199037

In , Thom (thom-redhat-bugs) wrote :

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11

Description of problem:
CommuniGate Systems is reporting this case on behalf of two CommuniGate/RedHat customers running RedHat Enterprise Server 5.1 and seeing some problems related to file integrity on this platform. We have two customers who upgraded their CommuniGate Pro cluster nodes to RedHat 5.1, from an earlier RHES 4.1 version. In both these cases, the kernel reportedly in use is this: 2.6.18-53.el5

We also have reports of a possibly identical problem with a customer running this kernel version, though we don't have the specifics of the Linux OS version: kernel version 2.6.18-8.1.8

In both of these cases, the customers began to get what appear to be
"null bytes" in mailboxes. I will attach a PNG screenshot of one of
these mailboxes, as seen with vi.

The mount options used are:
tcp,rsize=32768,wsize=32768,hard,intr,timeo=600,bg,retrans=2,noatime

The output of "mount -v" for one of these customers showed the following:
>>>172.30.35.5:/vol/CGPweb on /CGPweb type nfs
>>>(rw,nfsvers=3,proto=tcp,rsize=32768,wsize=32768,timeo=600,hard,intr,bg,acregmax=6,addr=172.30.35.5)

The exact operating system in use was:
>>>Linux MSA 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux Red Hat Enterprise Linux 5.

When I last researched this in detail, it appeared that the byte offsets
and total sizes were still correct after the null bytes were inserted;
only the contents of those bytes were replaced with null characters. So,
it appeared at first glance that it was a 1:1 replacement of valid data
with "corruption"-type data of some sort.
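
As a rough aid for inspecting an affected mailbox, the sketch below scans a file's bytes and reports each run of null bytes as an (offset, length) pair. The function name find_null_runs and the min_len parameter are illustrative assumptions, not part of CommuniGate Pro or of any tool used in the original investigation.

```python
def find_null_runs(data: bytes, min_len: int = 1):
    """Return a list of (offset, length) pairs, one per run of null bytes.

    Runs shorter than min_len are ignored. The helper name and the
    min_len parameter are hypothetical, chosen for this illustration.
    """
    runs = []
    start = None  # offset where the current null run began, if any
    for i, b in enumerate(data):
        if b == 0:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i - start))
            start = None
    # Close a run that extends to the end of the data.
    if start is not None and len(data) - start >= min_len:
        runs.append((start, len(data) - start))
    return runs
```

To use it on a mailbox, read the file in binary mode (for example open(path, "rb").read()) and pass the resulting bytes to the function.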

When analyzing CommuniGate Pro logs (which report the file sizes and offsets of all messages), we found two types of symptoms:

1. a missing message with null bytes in its place (1:1 replacement of character bytes with null bytes)
2. no missing message, but null bytes between or within two messages, with some indication that parts of some messages are missing (and replaced with null bytes)

A key question that is not 100% clearly answered is whether there is any indication of additional bytes ever being added, or if the null bytes are simply byte-replacement of data. From the available evidence, it appears that there is just 1:1 byte replacement.
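
One way to test the 1:1 replacement hypothesis against logged offsets and sizes is sketched below. It assumes the logged records are available as (offset, size) pairs and that together they cover the mailbox through its end. Both are assumptions for illustration, and the names classify_messages and size_matches_records are hypothetical. A message that is entirely null corresponds to symptom 1 above; a partially null message corresponds to symptom 2.

```python
def classify_messages(data: bytes, records):
    """Classify each logged (offset, size) record against the file bytes.

    Returns one status per record: "intact", "all_null", "partial_null",
    or "truncated" (the file ends before the record does).
    """
    statuses = []
    for offset, size in records:
        chunk = data[offset:offset + size]
        nulls = chunk.count(b"\x00")
        if len(chunk) < size:
            statuses.append("truncated")
        elif nulls == 0:
            statuses.append("intact")
        elif nulls == size:
            statuses.append("all_null")
        else:
            statuses.append("partial_null")
    return statuses


def size_matches_records(data: bytes, records) -> bool:
    """True if the file ends exactly where the last logged record ends.

    Under the stated assumption that the records cover the whole file,
    a match is consistent with byte-for-byte replacement, while a longer
    file would point to extra bytes having been inserted.
    """
    expected_end = max((offset + size for offset, size in records), default=0)
    return len(data) == expected_end
```

If every record is either intact or nulled and the size check passes, the evidence favors replacement over insertion, though it does not prove it.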

Also, while this is not 100% confirmed, CommuniGate Pro appears to be getting the correct byte offsets from the file system, as noted in the "math" parts of this document. This would suggest a problem that is different from the previous pre-2.6.13 Linux NFS kernel problem. Also, the time interval between some of the events that insert null bytes is rather large, often 10 minutes or more.

Both of these customers went back to RedHat 4.5, and the problem
immediately disappeared. We have many customers running RedHat ES 4.5 successfully. RedHat versions earlier than this still have a different NFS kernel bug which can cause serious problems in an NFS-based CommuniGate Pro Dynamic Cluster. (A few years ago, we discovered a Linux
kernel bug related to NFS client handling in the kernel (specifically
related to filesize caching), which was fixed by Trond Myklebust at
NetApp, and these fixes were put into the 2.6.13 and 2.6.14 kernels. If
interested, you can read more about this requirement here:
https://support.communigate.com/tickets/kb_article.php?ref=2908-TIOL-4737)

Duplicating this problem will be challenging, though we believe it is possible
using a Dynamic Cluster on RedHat 5 under relatively high load. We would be glad to work with RedHat to try to replicate the issue. Because our customers have
since rolled back to RedHat 4.5, we don't have any customers actively
using RedHat 5 within an NFS-based cluster, to our knowledge. If we could get access to RHES 5 with the latest patches, we would also be glad to begin trying to replicate this problem in-house. We would need two RedHat 5 servers running with an NFS-based storage backend - we have the equipment available, but would need to get the latest RedHat 5 software.

We realize that reporting this bug with only partial evidence is difficult. However, we felt it would be better to report the possible bug and discover if there were possibly known causes, or if other RedHat customers are experiencing anything similar. We are not aware of any other CommuniGate customers running Linux-based NFS-based Dynamic Clusters having this problem, including quite a few who run more recent kernels.

Version-Release number of selected component (if applicable):
kernel-2.6.18-53.el5

How reproducible:
Didn't try

Steps to Reproduce:
The basic file access flow used here would be

1. Have two or more NFS clients mount the same logical volume.
2. Have one NFS client modify a non-binary text file, using calls such as lseek(), write(), and fsync() from C++ (all filehandles are properly fsync()'d when closed by an NFS client)
3. No less than 6 seconds later, have a second NFS client open the same file and modify it (lseek/write/fsync).
4. Repeat steps 2-3.

At some point in this file access pattern, null bytes may be inserted into these files.
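
The per-client part of the flow above can be sketched roughly as follows. This is not CommuniGate Pro's code: the function names, the record format, and the choice to append at end of file are assumptions made for illustration. Python's os.lseek, os.write, and os.fsync are thin wrappers over the corresponding system calls. An actual reproduction attempt would run run_client on two or more NFS clients against the same file on the shared mount, staggered so that the clients alternate.

```python
import os
import time


def append_message(path: str, payload: bytes) -> int:
    """Append payload at end of file via lseek/write/fsync.

    Returns the offset at which the payload was written.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested; loop until done.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)  # flush to the server before the handle is closed
    finally:
        os.close(fd)
    return offset


def run_client(path: str, client_id: str, rounds: int,
               delay_seconds: float = 6.0) -> None:
    """Repeat the modify step, pausing between rounds.

    The 6 second default mirrors the interval given in step 3; the
    record format written here is purely illustrative.
    """
    for i in range(rounds):
        line = f"client={client_id} round={i}\n".encode("ascii")
        append_message(path, line)
        time.sleep(delay_seconds)
```

Run on a local file system, this only demonstrates the call sequence. Any null bytes would be expected to appear, if at all, only when the file lives on an NFS mount shared between clients.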

Actual Results:
We will attempt to do so, though we would like to request temporary access to the latest appropriate versions of RedHat Enterprise 5 in order to test.

Expected Results:
Files should be written without null bytes inserted.

Additional info: