zuul service untrackable if it tries to start gearman when port 4730 is already open

Bug #1359001 reported by samwan
This bug affects 2 people
Affects: Zuul
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

If port 4730 is already held open by another process such as gearmand, and 'start=true' is set under [gearman_server] in /etc/zuul/zuul.conf, the zuul service becomes untrackable: you can’t use 'service' to get the status of zuul, because the pid file does not get created.

In this situation, whenever we check the zuul processes, we see a defunct zuul-server process.
--------------------------------------------------------------------------------------------------------------------------------
root@master2:/var/run/zuul# ps -ef|grep -i zuul
zuul 4451 1 0 18:20 ? 00:00:00 /usr/bin/python /usr/local/bin/zuul-server
zuul 4454 4451 0 18:20 ? 00:00:00 [zuul-server] <defunct>
root 4483 28616 0 18:22 pts/5 00:00:00 grep --color=auto -i zuul
zuul 14487 1 0 Aug14 ? 00:00:00 /usr/bin/python /usr/local/bin/zuul-merger
root@master2:/var/run/zuul#
---------------------------------------------------------------------------------------------------------------------------------

   If you try to restart zuul-server, you just start another zuul-server process instead of restarting the old one.
---------------------------------------------------------------------------------------------------------------------------------
root@master2:/var/run/zuul# service zuul restart
* Restarting Zuul zuul
cat: /var/run/zuul/zuul.pid: No such file or directory
/etc/init.d/zuul: 82: kill: Usage: kill [-s sigspec | -signum | -sigspec] [pid | job]... or
kill -l [exitstatus]
---------------------------------------------------------------------------------------------------------------------------------
   The reason is that /var/run/zuul/zuul.pid does not exist; it is supposed to be created once the zuul service is started.
---------------------------------------------------------------------------------------------------------------------------------
root@master2:/var/run/zuul# ls -l /var/run/zuul
total 0
root@master2:/var/run/zuul#
---------------------------------------------------------------------------------------------------------------------------------

  And each time you start or restart the zuul service, you get one more defunct process.
---------------------------------------------------------------------------------------------------------------------------------
root@master2:/var/run/zuul# ps -ef|grep -i zuul-server
zuul 4451 1 0 18:20 ? 00:00:00 /usr/bin/python /usr/local/bin/zuul-server
zuul 4454 4451 0 18:20 ? 00:00:00 [zuul-server] <defunct>
zuul 4501 1 2 18:28 ? 00:00:00 /usr/bin/python /usr/local/bin/zuul-server
zuul 4504 4501 0 18:28 ? 00:00:00 [zuul-server] <defunct>
---------------------------------------------------------------------------------------------------------------------------------

  You won’t be able to stop the zuul service either, because you don’t have the tracking pidfile.
---------------------------------------------------------------------------------------------------------------------------------
root@master2:/var/run/zuul# service zuul stop
No process in pidfile '/var/run/zuul/zuul.pid' found running; none killed.
root@master2:/var/run/zuul#
---------------------------------------------------------------------------------------------------------------------------------

A process becomes defunct (i.e. a zombie) when it has exited but its parent has not yet waited for it.
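
To make the zombie mechanism concrete, here is a minimal sketch (not Zuul code) of a parent that polls a forked child with os.waitpid and os.WNOHANG. Once the child has exited, the call reaps it, so it no longer shows up as <defunct>. The helper names spawn_short_lived_child, poll_child and wait_for_exit are made up for this illustration.

```python
import os
import time


def spawn_short_lived_child(exit_code):
    """Fork a child that exits at once (hypothetical stand-in for a child
    whose gear server failed to start)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    return pid


def poll_child(pid):
    """Return None while the child is still running; otherwise reap it and
    return its exit code (or the negated signal number if it was killed)."""
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    if reaped_pid == 0:
        return None  # child has not exited yet
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)


def wait_for_exit(pid, timeout=5.0, interval=0.05):
    """Poll until the child exits or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        code = poll_child(pid)
        if code is not None:
            return code
        time.sleep(interval)
    return None
```

A parent that called something like poll_child periodically could both avoid the zombie and notice that the child died earlier than expected.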

In our case, the defunct process is caused by the start_gear_server function:
---------------------------------------------------------------------------------------------------------------------------------
    def start_gear_server(self):
        pipe_read, pipe_write = os.pipe()
        child_pid = os.fork()
        if child_pid == 0:
            os.close(pipe_write)
            self.setup_logging('gearman_server', 'log_config')
            import gear
            statsd_host = os.environ.get('STATSD_HOST')
            statsd_port = int(os.environ.get('STATSD_PORT', 8125))
            # Try to start the gear server; gearmand is already running,
            # so this will fail.
            gear.Server(4730,
                        statsd_host=statsd_host,
                        statsd_port=statsd_port,
                        statsd_prefix='zuul.geard')

            # Keep running until the parent dies:
            # (It's supposed to keep running, but it actually dies before
            # the parent, so we get defunct processes.)
            pipe_read = os.fdopen(pipe_read)
            pipe_read.read()
            os._exit(0)
        else:
            os.close(pipe_read)
            self.gear_server_pid = child_pid
            self.gear_pipe_write = pipe_write
            # The parent never calls waitpid (the child is supposed to keep
            # running as long as the parent does).
---------------------------------------------------------------------------------------------------------------------------------
   And when the child dies, the pidfile gets removed, perhaps because the child and the parent are in the same DaemonContext?
---------------------------------------------------------------------------------------------------------------------------------
    if server.config.has_option('zuul', 'pidfile'):
        pid_fn = os.path.expanduser(server.config.get('zuul', 'pidfile'))
    else:
        pid_fn = '/var/run/zuul/zuul.pid'
    pid = pid_file_module.TimeoutPIDLockFile(pid_fn, 10)

    if server.args.nodaemon:
        server.main()
    else:
        with daemon.DaemonContext(pidfile=pid):
            server.main()
---------------------------------------------------------------------------
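
If the guess above is right and the failing child unwinds out of the same context that owns the pidfile, one possible mitigation is to make sure the forked child can only leave through os._exit. The sketch below is not Zuul code and does not use python-daemon: FakePidFile is a hypothetical stand-in for the pidfile lock, and run_in_child and demo are made-up names.

```python
import os
import tempfile


class FakePidFile(object):
    """Hypothetical stand-in for a pidfile lock: the file is created on
    enter and removed on exit."""

    def __init__(self, path):
        self.path = path

    def __enter__(self):
        with open(self.path, "w") as f:
            f.write(str(os.getpid()))
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        os.remove(self.path)
        return False


def run_in_child(target):
    """Fork and run target() in the child. The child always leaves via
    os._exit, so it never unwinds into the parent's context managers."""
    pid = os.fork()
    if pid == 0:
        exit_code = 1
        try:
            target()
            exit_code = 0
        except Exception:
            pass  # a real implementation would log the failure here
        finally:
            os._exit(exit_code)
    return pid


def demo():
    """Return (child exit code, whether the pidfile survived the child)."""
    path = os.path.join(tempfile.mkdtemp(), "demo.pid")

    def failing_start():
        raise RuntimeError("port already in use")

    with FakePidFile(path):
        pid = run_in_child(failing_start)
        _, status = os.waitpid(pid, 0)
        survived = os.path.exists(path)
    return os.WEXITSTATUS(status), survived
```

In this sketch the child's failure is reported through its exit code, and the pidfile is still present while the parent remains inside the context.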
   Is there any way to improve the code so that it logs a warning message when port 4730 is already open, and keeps the pidfile with the parent so that we can still use 'service zuul ..' to control zuul? According to https://wiki.jenkins-ci.org/display/JENKINS/Gearman+Plugin, zuul is supposed to work with gearmand.
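
As a rough illustration of the warning asked for above, a check like the following could run before forking the gear server. It is only a sketch: port_in_use and warn_if_gearman_port_taken are hypothetical helpers, the logger name is arbitrary, and the probe assumes the other server is reachable on 127.0.0.1. The check is advisory, since another process could still grab the port between the probe and the real bind.

```python
import logging
import socket

log = logging.getLogger("zuul.example")  # arbitrary logger name for this sketch


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
    finally:
        sock.close()


def warn_if_gearman_port_taken(port=4730):
    """Log a warning and return True if the gearman port is already taken."""
    if port_in_use(port):
        log.warning("Port %d is already in use; not starting the "
                    "internal gearman server", port)
        return True
    return False
```

The caller could skip start_gear_server when warn_if_gearman_port_taken() returns True and rely on the gearman server that is already running.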

samwan (wan-sam)
description: updated
Clark Boylan (cboylan)
no longer affects: openstack-ci