BlockingIOError: [Errno 11] The file descriptor fd refers to a file other than a socket and has been marked nonblocking

Bug #1402421 reported by Sylvain Pineau
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
PlainBox (Toolkit)
Won't Fix
High
Unassigned

Bug Description

I got this error when I was running the dbus service from trunk. I hit CTRL-C several times to stop it.

For information this error happened when checkbox-gui was requesting the prerun of local jobs.

The same job (2013.com.canonical.certification::__resource__) ran successfully though with:

plainbox run -i 2013.com.canonical.certification::__resource__

The error (the glibc runner is started by the dbus servicce, this is the only difference):

$ checkbox-gui-service
WARNING plainbox.runner: execution_ctrl_list not passed to JobRunner
^C^Cexception calling callback for <Future at 0x7fc2a08d31d0 state=finished raised BlockingIOError>
Traceback (most recent call last):
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 297, in _invoke_callbacks
    callback(self)
  File "/home/sylvain/git_lp/checkbox/checkbox-ng/checkbox_ng/service.py", line 1509, in _result_ready
    self._result = result_future.result()
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 395, in result
    return self.__get_result()
  File "/usr/lib/python3.4/concurrent/futures/_base.py", line 354, in __get_result
    raise self._exception
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/highlevel.py", line 459, in _really_run
    update=False)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/applogic.py", line 72, in run_job_if_possible
    job_result = runner.run_job(job, config)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/runner.py", line 412, in run_job
    return runner(job, config)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/runner.py", line 515, in run_local_job
    return self._just_run_command(job, config)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/runner.py", line 710, in _just_run_command
    return_code, record_path = self._run_command(job, config, ctrl)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/runner.py", line 806, in _run_command
    return_code = self._run_extcmd(job, config, extcmd_popen, ctrl)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/runner.py", line 817, in _run_extcmd
    job, config, self._session_dir, extcmd_popen)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/impl/ctrl.py", line 468, in execute_job
    return extcmd_popen.call(cmd, env=env, cwd=cwd_dir)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/vendor/extcmd/glibc.py", line 186, in call
    return self._loop(selector, pid)
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/vendor/extcmd/glibc.py", line 268, in _loop
    for fdsi in key.fileobj.read():
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/vendor/pyglibc/_signalfd.py", line 226, in read
    num_read = read(self._sfd, byref(info_list), sizeof(info_list))
  File "/home/sylvain/git_lp/checkbox/plainbox/plainbox/vendor/glibc.py", line 969, in std_errcheck
    raise OSError(errno, errno_map.get(errno, os.strerror(errno)))
BlockingIOError: [Errno 11] The file descriptor fd refers to a file other than a socket and has been marked nonblocking (O_NONBLOCK), and the read would block.
The file descriptor fd refers to a socket and has been marked nonblocking (O_NONBLOCK), and the read would block. POSIX.1-2001 allows either EAGAIN or EWOULDBLOCK error to be returned for this case, and does not require these constants to have the same value, so a portable application should check for both possibilities.

Tags: extcmd

Related branches

description: updated
Changed in plainbox:
assignee: nobody → Zygmunt Krynicki (zkrynicki)
Revision history for this message
Zygmunt Krynicki (zyga) wrote :

09:32 < spineau> zyga: I remmber the Friday one but the bug I founf this we with dbus is still reproducible in trunk
09:32 < zyga> spineau: yes, that's something else, I haven't seen that
09:33 < spineau> zyga: I'm wondering who get the sigchld
09:33 < spineau> zyga: I traced this module to just see that we're deplating both stderr and stdout and then nothing
09:34 < zyga> spineau: that's a weird thing to happen
09:34 < zyga> spineau: I think I know what happened though
09:34 < spineau> zyga: running a ps shows a [cat] <defunct>
09:34 < zyga> spineau: sigchld got delivered to another thread in the process
09:34 < zyga> spineau: and plainbox, by the time we got to read from signalfd, found that signal is gone now
09:34 < spineau> zyga: that's also what I 'm thinking
09:35 < zyga> spineau: ok, the hard way out then is to block it entirely, I'll do that
09:35 < zyga> spineau: thanks

Changed in plainbox:
status: New → In Progress
milestone: none → future
milestone: future → 0.18
Revision history for this message
Zygmunt Krynicki (zyga) wrote :
Download full text (3.7 KiB)

With some extra debugging I know that we're just not seeing any activity on the file descriptor associated with signalfd.

In the log below, signalfd is not marked as read-ready when SIGCHLD is intercepted (but passed to the process) by gdb. Still investigating further.

DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Obtained: <plainbox.vendor.pyglibc.selectors.EpollSelector object at 0x7ffff026fcf0>
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Obtained: <signalfd fileno():11 signals:frozenset({17})>
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Obtained: <sigprocmask signals:frozenset({17}) mode:SIG_BLOCK active:no>
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Obtained: (12, 13)
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Obtained: (14, 15)
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Registered key with selector: SelectorKey(fileobj=12, fd=12, events=1, data='stdout')
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Registered key with selector: SelectorKey(fileobj=14, fd=14, events=1, data='stderr')
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Registered key with selector: SelectorKey(fileobj=<signalfd fileno():11 signals:frozenset({17})>, fd=11, events=1, data='sfd')
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.glibc: sigprocmask(0, struct sigset_t at 0x7ffff02694a8
  __val: <plainbox.vendor.glibc.c_ulong_Array_16 object at 0x7ffff0269560>, None)
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Forked child process: 21264
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: waiting for, {'stdout', 'stderr', 'proc'}, sigmask is <sigprocmask signals:frozenset({17}) mode:SIG_BLOCK active:yes>
DEBUG[pid:21264, thread:Thread-1, plainbox.vendor.glibc: sigprocmask(1, struct sigset_t at 0x7ffff02694a8
  __val: <plainbox.vendor.glibc.c_ulong_Array_16 object at 0x7ffff0269560>, None)
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: processing event list: [(SelectorKey(fileobj=12, fd=12, events=1, data='stdout'), EVENT_READ)]
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Reading at most 65536 bytes of data from stdout pipe
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Read 268 bytes of data from stdout
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: waiting for, {'stdout', 'stderr', 'proc'}, sigmask is <sigprocmask signals:frozenset({17}) mode:SIG_BLOCK active:yes>
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: processing event list: [(SelectorKey(fileobj=12, fd=12, events=1, data='stdout'), EVENT_READ)]
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Reading at most 65536 bytes of data from stdout pipe
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: Read 0 bytes of data from stdout
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: pipe stdout depleted, unregistering and closing
DEBUG[pid:21246, thread:Thread-1, plainbox.vendor.extcmd: waiting for, {'stderr', 'proc'}, sigmask is <sigprocmask signals:frozenset({17}) mode:SIG_BLOCK active:yes>

Program received signal SIGCHLD, Child status changed.
DEBUG[pid:21246, thread:Thread-1, pl...

Read more...

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

With some more debugging I can confirm that the dbus service process has three threads:

(gdb) info threads
  Id Target Id Frame
  3 Thread 0x7ffff0d08700 (LWP 21353) "python" 0x00007ffff6b9ad53 in epoll_wait () at ../sysdeps/unix/syscall-template.S:81
  2 Thread 0x7ffff1d85700 (LWP 21336) "gmain" 0x00007ffff6b902cd in poll () at ../sysdeps/unix/syscall-template.S:81
* 1 Thread 0x7ffff7fd4740 (LWP 21335) "python" 0x00007ffff6b902cd in poll () at ../sysdeps/unix/syscall-template.S:81

So we have one running main checkbox-gui-service executable, one running gmain (glib main loop, this is where dbus bits happen) and lastly one where we run the command (it may or may not be recycled per command). This is thread "3" here.

Inside that thread, sigprocmaks is enabled, blocking signal 17 (I've added checks for that on every loop iteration).

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

OK, I understand the problem now. I need to check linux sources to confirm. I've built the runner on an incorrect assumption.

Revision history for this message
Zygmunt Krynicki (zyga) wrote :

21:38 <@zyga-afk> kissiel: I came to a sad reality check
21:38 <@zyga-afk> kissiel: I have three options now
21:38 <@zyga-afk> kissiel: now sure which to pick
21:38 <@zyga-afk> kissiel: I can wake up (busy sleeping loop)
21:39 <@zyga-afk> kissiel: I can change the api to require a hack in the main thread
21:39 <@zyga-afk> kissiel: or I can fork off a separate process that can be blockingly read forever and mourned, unlike test jobs
21:39 <@zyga-afk> kissiel: which all suck
21:39 <@zyga-afk> kissiel: as in, I wasn't expecting that :(
21:39 <@zyga-afk> kissiel: two event loops don't play nice
21:39 <@zyga-afk> kissiel: and I made a mistake in earlier research
21:40 <@zyga-afk> kissiel: the main thread always has priority over other threads so we never see sigchld in our small little event loop on the side
21:42 <@zyga-afk> kissiel: I think there's one more option to consider
21:42 <@zyga-afk> kissiel: to rewrite extcmd on top of glib's event loop (when in glib mode)
21:42 <@zyga-afk> kissiel: I'll check that option out
21:42 <@zyga-afk> kissiel: though I hate different modes in each case

Revision history for this message
Sylvain Pineau (sylvain-pineau) wrote :

Do we know if checkbox-touch will have the same problem? or better a description of the use cases where the pyglibc runner won't work? If checkbox-gui by nature is the only case where SIGCHLD is lost I'd keep the pyglibc runner as a way to provide features not possible with the "old" extcmd (such as crash detection/pause/kill). Otherwise I do not think it is worth the effort to keep it as a vendorized module.

Changed in plainbox:
milestone: 0.18 → none
Revision history for this message
Zygmunt Krynicki (zyga) wrote : Re: [Bug 1402421] Re: BlockingIOError: [Errno 11] The file descriptor fd refers to a file other than a socket and has been marked nonblocking

Yes, checkbox-touch will be affected the same way.

Zygmunt Krynicki (zyga)
Changed in plainbox:
assignee: Zygmunt Krynicki (zyga) → nobody
Zygmunt Krynicki (zyga)
Changed in plainbox:
status: In Progress → Confirmed
milestone: none → future
Zygmunt Krynicki (zyga)
Changed in plainbox:
importance: Critical → High
tags: added: extcmd
Changed in plainbox:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.