mksh exits with status 0 upon error reading script
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
mksh |
New
|
Undecided
|
Unassigned |
Bug Description
A customer reported that a large, long-running script exited early, but with status 0. This should have been impossible. We eventually figured out that the cluster filesystem containing the script had gone offline while the script was executing. The shell, reading a block of the script itself, got ESTALE from read(2). The bug is that the shell swallowed the error, treating it like EOF. It should have exited with an error status.
To reproduce the bug easily, you need a modern enough Linux that strace can inject errors. In the following session, we create a trivial script and run it under strace twice, the first time to see which read(2) calls read the script, and the second time to inject the error.
$ echo 'echo hello' > s.sh
$ strace -e trace=read mksh s.sh
read(3, "\177ELF\
read(11, "echo hello\n", 512) = 11
hello
read(11, "", 512) = 0
+++ exited with 0 +++
$ strace -e trace=read -e inject=
read(3, "\177ELF\
read(11, 0x55629ae2aee0, 512) = -1 ESTALE (Stale file handle) (INJECTED)
+++ exited with 0 +++
$
Variations on that last command are to substitute EIO (a more severe error) for ESTALE, or when=3 for when=2. These variations also display the buggy behavior. The latter variation models something more like the original report we got from our customer, where the script was larger and the error occurred not in reading the first block of the file, but in reading a later block.
The OS is Ubuntu 20.04.5 LTS and KSH_VERSION is @(#)MIRBSD KSH R58 2020/03/27. We also see the bug in ksh93 on RHEL, in fact in every version of mksh and ksh93 we've tested; and in some but not all versions of bash. The bash bug report is https:/
It is common sense that this issue is a bug: one cannot build reliable software if an I/O error might cause a program to exit and report success. POSIX (https:/
How likely is the bug to reoccur without fault injection? Unlikely in a script smaller than mksh's read block size, 512 bytes. The bug becomes likely if the script lives on an unreliable filesystem, and is larger than one block, and runs for a long time in a portion of the script before the last block. These are not unusual conditions.
Good catch!
And thanks for including a reproducer. I didn’t know Linux strace could even do that.
I wonder why this wasn’t caught in so many shells.
In the sense of cross-shell coordination, what errorlevel do you suggest? I have a slight dislike for the small numbers here, but arguments could be held that the large numbers are more likely to have special meaning for the scripts, too.
In fact, interrupting by being unable to read the next part of the script is a rather severe issue. I’m tempted to argue that killing myself with a suitable signal (PIPE? BUS? HUP?) and, if ignored, simulating the corresponding 128+signo exit would be better.