Mir

[regression] [BufferQueue] BufferQueueTest.compositor_never_owns_client_buffers is very slow (up to 30 seconds) on fast machines

Bug #1317370 reported by Daniel van Vugt
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Mir
Fix Released
Medium
Alan Griffiths
mir (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

New unit test "compositor_never_owns_client_buffers" is very slow (10-20 seconds).

[ RUN ] BufferQueueTest.compositor_never_owns_client_buffers
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (19478 ms)

It appears to have just been introduced recently in r1598

See also bug 1317801.

Related branches

summary: - New unit test "compositor_never_owns_client_buffers" is very slow (10-20
- seconds)
+ New unit test "BufferQueueTest.compositor_never_owns_client_buffers" is
+ very slow (10-20 seconds)
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: New unit test "BufferQueueTest.compositor_never_owns_client_buffers" is very slow (10-20 seconds)

Even worse:

[ RUN ] BufferQueueTest.compositor_never_owns_client_buffers
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (25404 ms)

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

I've seen very variable times for BufferQueueTest.* (as run by ctest) but I've not been able to reproduce this specific problem.

E.g.

$ bin/mir_unit_tests --gtest_filter=BufferQueueTest.compositor_never_owns_client_buffers --gtest_repeat=1000 | grep OK | sort --reverse --key=5.2 -b -n | tee tmp.txt | head
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (126 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (124 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (115 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (110 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (110 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (108 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (108 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (107 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (107 ms)
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (105 ms)

OTOH I've seen

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<std::logic_error> >'
what(): unexpected release: buffer was not given to compositor

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Given two of us now have been able to reproduce this bug, and it's a new test in new code there seems a reasonable chance it indicates a new race. So I'm going to separate this from bug 1262689 because this one may prove to be of higher importance.

summary: New unit test "BufferQueueTest.compositor_never_owns_client_buffers" is
- very slow (10-20 seconds)
+ very slow (10-25 seconds) on very fast machines
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: New unit test "BufferQueueTest.compositor_never_owns_client_buffers" is very slow (10-25 seconds) on very fast machines

Weird. This appears to be solved by lp:~vanvugt/mir/simplify-BufferQueue and I don't know why yet.

Changed in mir:
status: Triaged → In Progress
assignee: nobody → Daniel van Vugt (vanvugt)
milestone: none → 0.2.0
Changed in mir:
status: In Progress → Triaged
assignee: Daniel van Vugt (vanvugt) → nobody
milestone: 0.2.0 → none
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

OK, I'm going to ignore comment #4 for now in case that's just masking the problem...

When the bug does occur, the CPU usage of mir_unit_tests is around 150% the whole time...

$ time bin/mir_unit_tests --gtest_filter="BufferQueueTest.compositor_never_owns_client_buffers"
Note: Google Test filter = BufferQueueTest.compositor_never_owns_client_buffers
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BufferQueueTest
[ RUN ] BufferQueueTest.compositor_never_owns_client_buffers
[ OK ] BufferQueueTest.compositor_never_owns_client_buffers (26077 ms)
[----------] 1 test from BufferQueueTest (26077 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (26077 ms total)
[ PASSED ] 1 test.

real 0m26.102s
user 0m26.322s
sys 0m12.151s

So it sounds like a simple case of racing.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Alright, found the problem. In this particular test case, with the compositor thread running much faster (1000x) than the client thread, the list current_buffer_users was regularly growing to several thousand elements in size on every frame. This is obviously unnecessary and explains why lp:~vanvugt/mir/simplify-BufferQueue is in fact a /correct/ fix for the bug :)

Changed in mir:
status: Triaged → In Progress
assignee: nobody → Daniel van Vugt (vanvugt)
milestone: none → 0.2.0
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Ignore comment #6 too. Although memory bloat is a very bad side-effect of the current BufferQueue implementation, it turns out not to be the real root cause of this bug. Avoiding the memory bloat in other (more efficient) ways I find this bug remains.

Changed in mir:
importance: Low → Medium
summary: - New unit test "BufferQueueTest.compositor_never_owns_client_buffers" is
+ [BufferQueue] BufferQueueTest.compositor_never_owns_client_buffers is
very slow (10-25 seconds) on very fast machines
summary: - [BufferQueue] BufferQueueTest.compositor_never_owns_client_buffers is
- very slow (10-25 seconds) on very fast machines
+ [regression] [BufferQueue]
+ BufferQueueTest.compositor_never_owns_client_buffers is very slow (10-25
+ seconds) on very fast machines
Changed in mir:
status: In Progress → Triaged
assignee: Daniel van Vugt (vanvugt) → nobody
tags: added: regression
tags: added: bufferqueue
description: updated
Revision history for this message
Daniel van Vugt (vanvugt) wrote : Re: [regression] [BufferQueue] BufferQueueTest.compositor_never_owns_client_buffers is very slow (10-25 seconds) on very fast machines

Approaching 30 seconds today to run BufferQueueTest.compositor_never_owns_client_buffers.

Remember this bug only occurs on fast systems, reported on:
  i7-4750
  i7-4770

Adding any kind of debugging into the mix (slows it down and) makes it go away and the test finishes almost instantly.

summary: [regression] [BufferQueue]
- BufferQueueTest.compositor_never_owns_client_buffers is very slow (10-25
- seconds) on very fast machines
+ BufferQueueTest.compositor_never_owns_client_buffers is very slow (up to
+ 30 seconds) on fast machines
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Ah, it appears there is a code path where BufferQueue might lose track of a buffer:

void mc::BufferQueue::compositor_release(std::shared_ptr<graphics::Buffer> const& buffer)
{
    ......

    if (current_compositor_buffer != buffer.get() && nbuffers > 1)
        release(buffer.get(), std::move(lock));
    // else ... what happens to buffer? It's not in any queue any more, is it?
}

I'm experiencing this in another branch and it results in my client freezing in some circumstances (because the client never receives a callback).

Changed in mir:
importance: Medium → High
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

I chanced to catch this happening:

$ ctest | grep "#" | sort --key=7 -n -b --reverse | head
 74/192 Test #74: mir_unit_tests.BufferQueueTest.* ........................................... Passed 25.55 sec
138/192 Test #138: mir_unit_tests.MultiThreadedCompositor.* ................................... Passed 3.33 sec
 70/192 Test #70: mir_unit_tests.AsioMainLoopAlarmTest.* ..................................... Passed 1.09 sec
 16/192 Test #16: mir_acceptance_tests.ServerShutdown/OnSignal.* ............................. Passed 0.50 sec
182/192 Test #182: mir_unit_tests.UdevWrapperDeathTest.* ...................................... Passed 0.46 sec
  9/192 Test #9: mir_acceptance_tests.DisplayConfigurationTest.* ............................ Passed 0.29 sec
 20/192 Test #20: mir_acceptance_tests.TestClientInput.* ..................................... Passed 0.22 sec
127/192 Test #127: mir_unit_tests.MesaGraphicsPlatform.* ...................................... Passed 0.19 sec
 13/192 Test #13: mir_acceptance_tests.MirSurfaceSwapBuffersTest.* ........................... Passed 0.18 sec
  3/192 Test #3: mir_acceptance_tests.BespokeDisplayServerTestFixture.* ..................... Passed 0.17 sec

Revision history for this message
kevin gunn (kgunn72) wrote :

just documenting that i'm lower priority on this one.
As it only effects "fast" machines, this isn't something I don't want this bug to hold up promotion of 0.2.0
we've a handful of items that are important to touch coming through that we will likely want to promote soon and i am suspecting this won't be addressed before then.

Changed in mir:
importance: High → Medium
Changed in mir:
milestone: 0.2.0 → 0.3.0
Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Following an observation of Alexandros[1], changing the end of BufferQueue::compositor_release(...) to the following has the same effect.

    if (current_compositor_buffer != buffer.get())
        release(buffer.get(), std::move(lock));
    else
    {
        lock.unlock();
        std::this_thread::yield(); // EXPERIMENTAL
    }

This has the feel of a synchronization issue in BufferQueue. I'll look deeper tomorrow.

[1] https://code.launchpad.net/~kdub/mir/composition-signalling/+merge/220974/comments/531615

Changed in mir:
assignee: nobody → Alan Griffiths (alan-griffiths)
Changed in mir:
status: Triaged → In Progress
Revision history for this message
PS Jenkins bot (ps-jenkins) wrote :

Fix committed into lp:mir/devel at revision None, scheduled for release in mir, milestone Unknown

Changed in mir:
status: In Progress → Fix Committed
Changed in mir:
status: Fix Committed → Fix Released
Changed in mir (Ubuntu):
status: New → Fix Released
importance: Undecided → Medium
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.