Ubuntu
upstart package

[SRU] segfault in log.c code causes phone reboot loops

Bug #1447756 reported by Oliver Grawert on 2015-04-23

This bug affects 3 people

	Status	Importance	Assigned to	Milestone
Canonical System Image	Fix Released	Critical	Ondrej Kubik	Canonical System Image ww22-2015 "ota4"
upstart	Fix Committed	Undecided	James Hunt
upstart (Ubuntu)	Fix Released	Critical	James Hunt
Utopic	Won't Fix	Critical	Unassigned
Vivid	Won't Fix	Critical	James Hunt
Wily	Fix Released	Critical	James Hunt
upstart (Ubuntu RTM)	Fix Released	Critical	Unassigned

Bug Description

= Summary =

The version of Upstart in vivid is affected by a coule of bugs relating
to the flushing data from early-boot jobs to disk which can both result
in a crash:

== Problem 1 ==

An internal list is mishandled meaning a crash could occur randomly.

== Problem 2 ==

Jobs which spawn processes in the background then themselves exit can
cause a crash due.

= Explanation of how Upstart flushes early job output =

If an Upstart job starts *and ends* early in the boot sequence (before
the log partition is mounted and writable) and produces output to its
stdout/stderr, Upstart will cache the output for later flushing by
adding the 'Log' object associated with the 'Job' to a list.

When the log partition is mounted writable, the
/etc/init/flush-early-job-log.conf job is run which calls "initctl
notify-disk-writeable". This is a signal to Upstart to flush its cache
of early-boot job output which takes the form of iterating the
'log_unflushed_files' list and flushing all the 'Log' entries to disk.

= Code Specifics =

There are 2 issues (note that the numbers used below match those used in
the Summary).

== Problem 1 detail ==

Due to a bug in the way the 'log_unflushed_files' list is handled (the
'Log' cannot be added to the list directly, so is added via an
intermediary ('NihListElem') node), a crash can result when iterating
the list since the 'Log' is freed, but NOT the intermediary node. The
implication is that it is possible for the intermediary node to be
attempt to dereference already-freed data, resulting in a crash.

== Problem 2 detail ==

If a job spawns a process in the background, then itself exits, that
jobs 'Log' entry will be added to the 'log_unflushed_files' list. But,
if the background process produces output and then exits before Upstart
attempts to flush the original jobs data to disk, the 'NihIo'
corresponding to the log will be serviced automatically and the data
flushed to disk. The problem comes when Upstart receives the
notification to flush the 'log_unflushed_files' list, since that list
now contains an entry which has already been freed (since all its data
has already been flushed). The result is an assertion failure.

= Fix =

== Problem 1 fix ==

Correct the 'log_unflushed_files' list handling by freeing the
'NihListElem' (which will automatically free the 'Log' object), not by
simply freeing the 'Log' object itself.

* Branch: lp:~jamesodhunt/ubuntu/vivid/upstart/bug-1447756/
* New Upstart test added to avoid regression?: Yes.

== Problem 2 fix ==

Correct the assumption that the only entries in the
'log_unflushed_files' list will always have data to flush by checking if
there is in fact any data to flush; if not, remove the entry from the
'log_unflushed_files' list since it has already been handled
automatically by the 'NihIo'.

* Branch: lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix
* New Upstart test added to avoid regression?: Yes.

= Workarounds =

If a system is affected by this bug, it will be manifested by a crash early in the boot sequence.

To overcome the issue, either:

a) Boot by adding "--no-log" to the kernel command-line.

b) Disable the flush-early-job-log job (assuming the machine is bootable) by running the following:

$ echo manual | sudo tee -a /etc/init/flush-early-job-log.override

= Impact =

The issue has been present in Upstart since logging was introduced but
no known instances of crashes relating to these problems have been
reported prior to this bug being reported (which relates the the issue
being seen on a very small subset of specific Ubuntu Touch phone
hardware where Upstart is used as the system init daemon).

Note that vivid still uses Upstart for managing the graphical session,
but now uses systemd by default for the system init daemon. Since the session (Upstart) init does not even require
a flush-early-job-log, the exposure to both the bug and the updated fix codepath is extremely limited.

= Test Case =

This bug is extremely hard to surface so the approach is simply to check that the internal list can be iterated correctly by:

1) Booting the system with upstart
(select the Upstart option from the grub menu or add "init=/sbin/upstart" to the kernel command-line).

2) Running the following on a system booted with Upstart:

$ for i in $(seq 17); do sudo start flush-early-job-log; done

= Regression Potential =

None expected:

- As noted in Impact, the problems fixed by this version of Upstart have not been observed on server/desktop systems before.
- The fix is already in wily and no problems have been reported.
- See Impact.

= Original Description =

We recently started getting reprots from phone users that their devices go into a reboot loop after changing the language or getting an OTA upgrade (either of both end with a reboot of the phone)

after a bit of research we collected the log at http://pastebin.ubuntu.com/10872934/

this shows a segfault of upstarts init binary in the log.c code:

[ 6.999083]init: log.c:819: Assertion failed in log_clear_unflushed: log->unflushed->len
[ 7.000279]init: Caught abort, core dumped
[ 7.467176]Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000600

See original description

Tags:

Related branches

lp://staging/~jamesodhunt/upstart/bug-1447756 (Merged)

lp://staging/~jamesodhunt/ubuntu/vivid/upstart/bug-1447756

lp://staging/~jamesodhunt/upstart/bug-1447756-the-actual-fix

Merged into lp://staging/upstart at revision 1666

Upstart Reviewers: Pending requested 2015-05-13

lp://staging/ubuntu/wily/upstart

lp://staging/~jamesodhunt/ubuntu/vivid/upstart/sru-bug-1447756

Pat McGowan (pat-mcgowan) on 2015-04-23

Changed in upstart (Ubuntu):
importance:	Undecided → Critical
status:	New → Confirmed
Changed in canonical-devices-system-image:
importance:	Undecided → Critical
milestone:	none → ww17-2015
status:	New → Confirmed
tags:	added: hotfix

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-04-23:

Note for the record that this bug has so far only been reported on the ubuntu-rtm branch, not the ubuntu branch, of the upstart package. However, the differences between these branches are negligible and include no changes to the upstream code.

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-04-23:

I think I see a potential problem if 'initctl notify-disk-writeable' is called multiple times. The log_clear_unflushed() function walks the log_unflushed_files list, attempting to flush each of the logs and freeing them when done with nih_free(). But as far as I know, nih_free() will not cause the element to be removed from the log_unflushed_files list, so if this function is ever called a second time it looks to me that it will re-process the list, traversing freed memory with undefined results.

We can't see anything in the official Ubuntu rootfs that would account for this function being called more than once. On the other hand, we also haven't seen this reproduced on any devices other than those installed in the factory, so it could be that an error has crept in there.

We can try to produce a patch to upstart to fix this bug, to see if it fixes the segfault for those who are seeing it.

James Hunt (jamesodhunt) on 2015-04-24

Changed in upstart (Ubuntu):
assignee:	nobody → James Hunt (jamesodhunt)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-04-24:

I think I understand what's happening now - it's not the log that isn't being freed, it's the list entry the log is attached to that is not freed. The effect is the same though - calling log_clear_unflushed() multiple times could trigger this issue since the still-valid (but incorectly so) list entry is now pointing at potentially garbage memory where a Log used to be.

James Hunt (jamesodhunt) on 2015-04-24

Changed in upstart:
assignee:	nobody → James Hunt (jamesodhunt)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-04-24:

Test binaries are available here: http://people.canonical.com/~jhunt/upstart/bugs/bug-1447756/armhf/

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-04-24:

i386+amd64 are packaged here: https://launchpad.net/~jamesodhunt/+archive/ubuntu/bug-1447756/+packages

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-04-24:

I've tried the fix on my bq device and it appears to be in a reboot loop (like the one the fix was supposed to resolve).

As such, I'd recommend testing the binaries only for the session init initially (/usr/bin/ubuntu-touch-session / /usr/share/lightdm/sessions/ubuntu-touch.desktop).

Also, for both system and session scenarios, I'd recommend adding '--debug' to get as much detail as possible.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-04-24:

Something else to try - disable /etc/init/flush-early-job-log.conf on boot...

$ sudo mount -oremount,rw /
$ echo manual | sudo tee /etc/init/flush-early-job-log.override
$ sudo reboot

... and post-boot do the following:

$ for i in $(seq 17); do sudo initctl notify-disk-writeable; done

As Steve has mentioned, we do need to trace down why 'initctl notify-disk-writeable' seems to be being called multiple times on boot though - it should only really be called once when (in the case of Touch) /var/log/ is writable.

Pat McGowan (pat-mcgowan) on 2015-04-30

Changed in canonical-devices-system-image:
milestone:	ww17-2015 → ww19-ota

Pat McGowan (pat-mcgowan) on 2015-04-30

Changed in canonical-devices-system-image:
assignee:	nobody → Ondrej Kubik (w-ondra)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-01:

Hi ondra/ogra - Can you comment on my suggestions in #6 and #7? My device is still bricked so if you have any suggestions on how to perform a full reset, that'd be great as udf is unable to recover it.

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-05:

hi James
So one way is to disable upstart logging all together with "--no-log" kern command option. We are going with this option for next OTA, till we can crack actual root cause of this issue.
As for restore, best bet is with MTK flash_tool

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-05:

#10

I've rebuild the fix in a clean environment and the init binary below now boots fine for me on a bq aquaris E4.5:

http://people.canonical.com/~jhunt/upstart/bugs/bug-1447756/armhf/

Pat McGowan (pat-mcgowan) on 2015-05-06

Changed in canonical-devices-system-image:
status:	Confirmed → Fix Committed

Revision history for this message

Oliver Grawert (ogra) wrote on 2015-05-06:

#11

i dont think we should close this one, the bu still persists and need to urgently be fixed, the workaround we ship can not stay for log (since it removes all logging for system jobs)

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-06:

#12

Agreed, either we open new bug to track this, or we don't mark this as fixed. This is the time one could use state "workaround"

Revision history for this message

John McAleely (john.mcaleely) wrote on 2015-05-07:

#13

DIsabling as a workaround is bug #1452663

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-11:

#14

ondra and I have been hammering away at this, but progress is painfully slow given that:

a) the problem is not seen on every boot.
b) we can only view the end of kmsg log.
c) rebuild times are relatively slow.

From what ondra says he's seen today, it sounds as though we might be hitting a stack corruption issue - the debug I've given him is not being displayed as expected. Even with the initial fix I created based on code inspection alone, ondra is still seeing exactly the same assertion failure which we thought would have been fixed by the initial fix for this bug.

I've tried various ways to recreate the issue (both on a device and on a normal system, code review / code analysis tools, runtime checkers), but have so far been unsuccessful.

Current work-arounds:

1) Keep using '--no-log' in the kernel command-line.
Pros: reliable.
Cons: means that no system jobs get their output logged.

2) Disable the /etc/init/flush-early-job-log.conf job.
Pros: seems to be reliable (? but needs further testing).
Cons: means early job output is not logged (however, on the device in question the only output seems to be from /etc/init/container-detect.conf (and that output is not even required).

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-12:

#15

I think I have nailed it down now, here is brief description what is happening (if I read code right)
There seems to be race, when we get new log data for one of the jobs after job has been terminated, and while processing it we call log_io_reader and eventually log_file_write which will try to flash unflashed buffer to drive.
This succeeds, mind this is before we got disk writable signal.
Since it succeeds, unflashed->len becomes 0, but we don't remove that log instance from list of logs which needs to be flashed (log_unflushed_files).
So next time when we get signal that disk is writable, we try again to flash that log and BOOM it panics on assert checking that log has something to be flashed, but it was already flashed.

Actual change of unfleshed log len changes on line 562, that's where we shrink unflashed buffer by amount we managed to write to disk, which in our case if full len, making buffer after shrinking zero length.
So I can see at least three fixes:
1) we should after calling nih_io_buffer_shrink (log->unflushed, (size_t)wlen); try to check and if log->unflashed->len is 0, and if so then remove it from the log_unflushed_files list.
2) we need to make log_clear_unflushed more tolerant to logs which has been already flushed successfully before reaching to this point.
3) we don't try to flash unfleshed buffers till we get disk writable signal

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-12:

#16

Sorry in previous comment, replace all "flash" with "flush"
Actually one more issue I can see there is this:
Job dies -> it's added to log_unflushed_files when it has unflushed data
but if job is restarted before we get disk writable signal, it will mess up the things, job will still remain in log_unflushed_files
and once we call log_file_write it will try to flush unflushed data, without removing job from log_unflushed_files list

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-12:

#17

Hi Ondrej,

Regarding #15, I'm not sure this is correct. As you say, when the job process terminates, job_process_terminated() gets called. This calls log_handle_unflushed() and that function calls log_read_watch(), which ultimately calls write(2). However, even if the write is successful before 'initctl notify-disk-writeable' gets called, if you look at log_handle_unflushed()...

743 log_handle_unflushed (void *parent, Log *log)
744 {
745 NihListEntry *elem;
746
747 nih_assert (log);
748 nih_assert (log->detached == 0);
749
750 log_read_watch (log);
751
752 if (! log->unflushed->len)
753 return 1;

So, if the write is successful and log->unflushed->len becomes zero, the function returns 1 (meaning "log does not need to be added to the unflushed list") and crucially the log is not added to the unflushed list.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-12:

#18

Hi Ondrej,

Regarding #16, there shouldn't be a problem in that scenario since when the log gets added to the unflushed list, it is totally detached from its parent job. Hence, the job can be destroyed but the log lives on as an element of the unflushed list. If that job gets recreated, it will get a new set of log objects associated with it.

Did you manage to get the full log_clear_unflushed() debug output in the end?

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-12:

#19

So when job gets terminated we don't succeed to write to to the disk and it is added to the unflushed list.
Problem is another call to write function later on, but before we get writable disk signal:
[ 7.460627]init: log_handle_unflushed:778:len=32673, path='/var/log/upstart/ureadahead-touch.log'
[ 7.460682]init: unflushed->len=32673
[ 7.460736]init: job_process_terminated:1877:adding job ureadahead-touch to unflushed list

but then later one we get:
[ 7.968094]init: log_io_reader:368:
[ 7.968158]init: log_file_write:552:unflashed:61817, new_len:150, path:/var/log/upstart/ureadahead-touch.log
[ 7.975087]init: log_file_write:579:unflashed->len:61817, written_len:61817
calling "nih_io_buffer_shrink (log->unflushed, (size_t)wlen);"
[ 7.975167]init: log_file_write:581:unflashed->len:0

and eventually we get:
[ 8.369166]init: control_notify_disk_writeable<<
[ 8.369239]init: log_clear_unflushed<<
[ 8.369280]init: OK checking assert:unflashed->len = 46
[ 8.369435]init: log_clear_unflushed:888:
[ 8.369492]init: log_file_write:552:unflashed->len=46, new:0, path:/var/log/upstart/container-detect.log
[ 8.370305]init: log_file_write:579:len:46, wlen:46
[ 8.370359]init: log_file_write:581:len:0
[ 8.370450]init: OK checking assert:unflashed->len = 0
[ 8.370509]init: log.c:874: Assertion failed in log_clear_unflushed: log->unflushed->len

As you can see call to write in the middle is one causing the issue, since it flushes log buffer, but does not remove it from unflushed list.

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-12:

#20

So one thing which I still cannot track down is how is that unidentified log_io_reader called. Or who is calling it.
I have put traces to log_read_watch under condition "if (io->recv_buf->len) {" where we call "log_io_reader" but call is not coming from there.
So what are other options for log_io_reader to be called?

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-12:

#21

Sorry for spamming, but I guess that log_io_reader is called by nih_io_watcher which has been initialised by "nih_io_reopen"
which I suppose is called when job starts?

So that would go back to my original finding, job dies and is restarted before we get signal about disk being writable, but at the time job has been restarted it succeeds in writing unflushed logs to the disk while calling log_io_reader
Makes sense?

Revision history for this message

Oliver Grawert (ogra) wrote on 2015-05-12:

#22

the ureadahead-touch upstart job does definitely not need to log anything .... we should add "console none" to it so it does not attempt to. (logging there only makes sense when actively debugging ureadahead anyway)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-12:

#23

Ondrej - aha! with the debug, this is making more sense now. Yes, since the ureadahead-touch job spawns a process in the background (ureadahead) and then the job itself exits, the log associated with the main job process gets added to the unflushed list. ureadahead then writes output and the NihIo is automatically serviced to flush the log to disk and set log->unflushed->len to zero. Later, 'initctl notify-disk-writable' gets called and the assert fires as all the output from both /etc/init/ureadahead-touch.conf and from the ureadahead daemon itself has already been logged.

Could you try the following which should be a minimal fix: http://paste.ubuntu.com/11095313/

Another point - the fact that ureadahead-touch is triggering the problem does point to an issue with that job - as can be seen, even when the script section ends (and that associated shell dies), 'ureadahead --daemon' continues and if it produces output, that gets logged. A simple work-around is to remove --daemon so that the job process and the log never become detached.

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-12:

#24

I can confirm fix: http://paste.ubuntu.com/11095313/ does the job.
It will safely ignore entries of log_unflushed_files list which have log->unflushed->len set to zero. Since we know how this state is reached it seems like previous nih_assert (log->unflushed->len); was too aggressive.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-12:

#25

\o/. Yes, Upstart uses asserts extremely agressively. It's unfortunate that we've never hit this issue in testing but I'm currently working on new tests for this slightly unusual scenario.

James Hunt (jamesodhunt) on 2015-05-12

Changed in upstart (Ubuntu):
status:	Confirmed → In Progress
Changed in upstart:
status:	New → In Progress

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-12:

#26

lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix contains the fix and a new test (which correctly fails with the current lp:upstart but passes with the fix in that branch).

The code has been tested on a failing device and a server system. I am currently testing on a non-failing krillin device. I understand Ondra is going to test on a couple of other devices that exhibit this bug to further sanity check the code.

Once all these tests pass I'll raise an MP for the branch above to get it merged to lp:upstart.

We'll then need to cherry pick the fix into both vivid and wily.

Pat McGowan (pat-mcgowan) on 2015-05-12

Changed in canonical-devices-system-image:
milestone:	ww19-ota → ww22-2015
status:	Fix Committed → In Progress

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-13:

#27

bug-1447756-both-fixes.diff Edit (9.9 KiB, text/plain)

Applying the top 2 commits (r1665 and r1666) from lp:~jamesodhunt/upstart/bug-1447756-the-actual-fix [1] to https://bugs.launchpad.net/ubuntu-rtm/+source/upstart/1.13.2-0ubuntu1rtm1 is now working for me.

I've tested this by building on the device itself and also by building in a ARCH=armhf utopic chroot [2].

__________

[1] - Equivalent to the attached patch.
[2] - Interestingly, building in an vivid ARCH=armhf chroot generates a binary that fails at runtime.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-13:

#28

MP raised on lp:upstart to start the trickle-down to the rtm package.

Ubuntu Foundations Team Bug Bot (crichton) on 2015-05-13

tags:

added: patch

James Hunt (jamesodhunt) on 2015-05-15

Changed in upstart:
status:	In Progress → Fix Committed

Revision history for this message

Ondrej Kubik (ondrak) wrote on 2015-05-15:

#29

Just to confirm. I just received another device exhibiting boot loop issue and after quick investigation it was same problem, race of chid process continuing logging after parent died but before writable disk signal.
Once I used patched "upstart" binary, device booted normally, so all good.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-18:

#30

Looks like kernel bug 1429756 could block 1.13.2-0ubuntu14 landing for wily (being a pre-req to getting this fix into vivid and utopic):

https://jenkins.qa.ubuntu.com/view/Wily/view/AutoPkgTest/job/wily-adt-upstart/ARCH=amd64,label=adt/9/

Revision history for this message

Pat McGowan (pat-mcgowan) wrote on 2015-05-18:

#31

@james we can land in vivid first if we need to, and the fix does not need to be made to utopic
We do need the fix to land by May 21

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-05-18:

#32

Erroneous test failures can (and should) be overridden in proposed-migration. Also the requirement is not that the package *reach* the development release before being SRUed, only that it be *uploaded* to the development series.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-05-19:

#33

This bug was fixed in the package upstart - 1.13.2-0ubuntu14

---------------
upstart (1.13.2-0ubuntu14) wily; urgency=medium

  * Cherry-pick upstream fix for LP: #1447756, fixing broken handling
    when flushing logs to disk.
  * Cherry-pick follow-on upstream fix for LP: #1447756.

-- James Hunt <email address hidden> Mon, 18 May 2015 11:09:10 +0100

Changed in upstart (Ubuntu Wily):
status:	In Progress → Fix Released

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-19:

#34

Thanks for nudging -proposed Steve.

I've silo 021 now includes upstart version 1.13.2-0ubuntu13.1 (which sil2100 synced from wily).

Basic test plan is here: https://wiki.ubuntu.com/Process/TestPlans/upstart-bug-1447756

I've tested this as follows:

$ wget http://people.canonical.com/~jhm/barajas/recovery.img
$ ubuntu-device-flash --verbose touch --bootstrap --channel=ubuntu-touch/rc-proposed/bq-aquaris.en --device krillin --recovery-image ./recovery.img
$ citrain device-upgrade 021

I then reflashed the boot.img using one from ondra which drops the current '--no-log' option from /proc/cmdline and rebooted.

As far as I'm concerned, the fix is good. All that now remains is for ondra or one of the folk in London who have the "failing" krillin devices to test the above.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-20:

#35

Unfortunately, ondra no longer has the failing phones so we may need to take a decision to just land this if QA are happy the change has not regressed the behaviour for non-failing phones.

I've tried to raise QA but they are sprinting in the US this week so no direct response.

I believe that sil2100 is doing what he can atm, but it's unclear whether this fix will make it into the next OTA at this stage.

Revision history for this message

Pat McGowan (pat-mcgowan) wrote on 2015-05-20:

#36

Per Ondrej on private bug https://bugs.launchpad.net/barajas/+bug/1439778/comments/30
"I can confirm that phone which we received from BQ had same issue and problem has been resolved once applied fix developed in bug #1447756 was applied.
We should now revert workaround committed as part of Bug #1452663"

Revision history for this message

Alex Kaluzhny (alex-kaluzhny) wrote on 2015-05-21:

#37

Is the fix landing in vivid?

Steve Langasek (vorlon) on 2015-05-21

Changed in upstart (Ubuntu Utopic):
status:	New → Won't Fix

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-05-21: Re: [Bug 1447756] Re: segfault in log.c code causes phone reboot loops

#38

On Thu, May 21, 2015 at 05:05:16PM -0000, Alex Kaluzhny wrote:
> Is the fix landing in vivid?

The fix has landed in the stable phone overlay ppa (upstart
1.13.2-0ubuntu13.1).

James, can you please follow through on SRUing this to vivid? I've copied
the package into the vivid-proposed queue from the stable-phone-overlay ppa.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Changed in upstart (Ubuntu RTM):
status:	New → Fix Released

Pat McGowan (pat-mcgowan) on 2015-05-21

Changed in canonical-devices-system-image:
status:	In Progress → Fix Released

James Hunt (jamesodhunt) on 2015-05-22

Changed in upstart (Ubuntu Vivid):
status:	New → In Progress
assignee:	nobody → James Hunt (jamesodhunt)

James Hunt (jamesodhunt) on 2015-05-22

description:	updated
summary:	- segfault in log.c code causes phone reboot loops + [SRU] segfault in log.c code causes phone reboot loops

Revision history for this message

James Hunt (jamesodhunt) wrote on 2015-05-22:

#39

Hi Steve - ah, sorry - hadn't registered that you'd already copied the package into the queue (we now have 2 :-) Could mine be deleted please?

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-05-22: Re: [Bug 1447756] Re: [SRU] segfault in log.c code causes phone reboot loops

#40

On Fri, May 22, 2015 at 01:34:52PM -0000, James Hunt wrote:
> Hi Steve - ah, sorry - hadn't registered that you'd already copied the
> package into the queue (we now have 2 :-) Could mine be deleted please?

Done.

Revision history for this message

Alexanrd (w-a-2) wrote on 2015-06-24:

#41

What about the fix for Trusty? I have similar problems with upstart=1.12.1-0ubuntu4.2.
Workrounds with kernel option "--no-log" and disabling flush-early-job-log job works for me.

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-06-24:

#42

There was no intention of fixing this bug for trusty, because this bug was never reported with trusty. It has only been reported with the Ubuntu phone, which is utopic or later. If you are experiencing crashes with upstart on trusty, can you please attach a tarball of the /etc/init directory from the affected system?

Revision history for this message

Steve Langasek (vorlon) wrote on 2015-06-26: Please test proposed package

#43

Hello Oliver, or anyone else affected,

Accepted upstart into vivid-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/upstart/1.13.2-0ubuntu13.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in upstart (Ubuntu Vivid):
status:	In Progress → Fix Committed
tags:	added: verification-needed

Revision history for this message

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2015-07-02: [upstart/vivid] possible regression found

#44

As a part of the Stable Release Updates quality process a search for Launchpad bug reports using the version of upstart from vivid-proposed was performed and bug 1470745 was found. Please investigate this bug report to ensure that a regression will not be created by this SRU. In the event that this is not a regression remove the "verification-failed" tag from this bug report and tag 1470745 "bot-stop-nagging". Thanks!

tags:

added: verification-failed

Adam Conrad (adconrad) on 2015-07-21

tags:

removed: verification-failed

Mathew Hodson (mhodson) on 2015-09-18

Changed in upstart (Ubuntu Vivid):
importance:	Undecided → Critical
Changed in upstart (Ubuntu RTM):
importance:	Undecided → Critical
Changed in upstart (Ubuntu Utopic):
importance:	Undecided → Critical

Mathew Hodson (mhodson) on 2016-04-24

Changed in upstart (Ubuntu Vivid):
status:	Fix Committed → Won't Fix

Revision history for this message

Martin Pitt (pitti) wrote on 2016-05-12: Proposed package removed from archive

#45

The version of upstart in the proposed pocket of Vivid that was purported to fix this bug report has been removed because the bugs that were to be fixed by the upload were not verified in a timely (105 days) fashion.

tags:

removed: verification-needed

Revision history for this message

Daniel Palffy (dpalffy) wrote on 2016-09-01:

#49

Please backport this to trusty, it hit me there, too. Standard amd64 desktop.

Revision history for this message

Justin King-Lacroix (justinkl) wrote on 2016-09-23:

#50

Yep, this is definitely affecting Trusty as well. Is there any chance the fix will be backported?

Revision history for this message

Justin King-Lacroix (justinkl) wrote on 2016-11-17:

#51

Ping?

Revision history for this message

Steve Langasek (vorlon) wrote on 2016-11-17:

#52

Justin, please see comment #42. Can you provide a reproducer for this on trusty?

That this has not been widely reported on 14.04 means it has not been a priority for fixing; less so now that a newer LTS release is available.

Revision history for this message

Justin King-Lacroix (justinkl) wrote on 2016-11-17:

#53

Not really. The bug is hugely intermittent as it stands.

However, I can confirm we're experiencing this on our fleet.

Revision history for this message

Justin King-Lacroix (justinkl) wrote on 2016-11-29:

#54

Guys, this really is affecting us. Help?

Revision history for this message

Steve Langasek (vorlon) wrote on 2016-11-29:

#55

This won't make progress as an SRU without a reproducer for the bug. An intermittent reproducer would be better than no reproducer.

Revision history for this message

Justin King-Lacroix (justinkl) wrote on 2016-11-29:

#56

I understand. However, our reproducer is: boot.
Removing --daemon from the ureadahead initscripts does appear to reduce the frequency with which this occurs.

Revision history for this message

Justin King-Lacroix (justinkl) wrote on 2016-12-12:

#57

Guys, please help us with this. This is a real problem, which is low-probability but high-frustration.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Patches

bug-1447756-both-fixes.diff Edit

Add patch

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntuupstart package

[SRU] segfault in log.c code causes phone reboot loops

Bug Description

Related branches

Other bug subscribers

Patches

Remote bug watches

Ubuntu
upstart package