Documenting this here, as the bug number was dropped from the mail thread:
On 02/10/19 13:05, Jan Glauber wrote:
> The arm64 code generated for the
> atomic_[add|sub] accesses of ctx->notify_me doesn't contain any
> memory barriers. It is just plain ldaxr/stlxr.
>
> From my understanding this is not sufficient for SMP sync.
>
>>> If I read this comment correctly:
>>>
>>> void aio_notify(AioContext *ctx)
>>> {
>>>     /* Write e.g. bh->scheduled before reading ctx->notify_me.  Pairs
>>>      * with atomic_or in aio_ctx_prepare or atomic_add in aio_poll.
>>>      */
>>>     smp_mb();
>>>     if (ctx->notify_me) {
>>>
>>> it points out that the smp_mb() should be paired. But as
>>> I said the used atomics don't generate any barriers at all.
>>
>> Awesome! That would be a compiler bug though, as atomic_add and atomic_sub
>> are defined as sequentially consistent:
>>
>> #define atomic_add(ptr, n) ((void) __atomic_fetch_add(ptr, n, __ATOMIC_SEQ_CST))
>> #define atomic_sub(ptr, n) ((void) __atomic_fetch_sub(ptr, n, __ATOMIC_SEQ_CST))
>
> Compiler bug sounds kind of unlikely...
Indeed the assembly produced by the compiler matches, for example, the
mappings at https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html. A
small testcase is as follows:
int ctx_notify_me;
int bh_scheduled;

int x()
{
    int one = 1;
    int ret;
    __atomic_store(&bh_scheduled, &one, __ATOMIC_RELEASE);    // x1
    __atomic_thread_fence(__ATOMIC_SEQ_CST);                  // x2
    __atomic_load(&ctx_notify_me, &ret, __ATOMIC_RELAXED);    // x3
    return ret;
}

int y()
{
    int ret;
    __atomic_fetch_add(&ctx_notify_me, 2, __ATOMIC_SEQ_CST);  // y1
    __atomic_load(&bh_scheduled, &ret, __ATOMIC_RELAXED);     // y2
    return ret;
}
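
The codegen difference is easy to check in isolation. A minimal sketch,
assuming GCC targeting armv8.0-a without LSE atomics (file and function
names made up for the test):

    /* barrier-check.c: build with
     *     aarch64-linux-gnu-gcc -O2 -S barrier-check.c
     * and compare the assembly of the two functions. */
    int counter;

    void add_c11(void)
    {
        /* C11 seq-cst RMW: an ldaxr/stlxr loop with no trailing dmb,
         * i.e. exactly what the report above describes. */
        __atomic_fetch_add(&counter, 2, __ATOMIC_SEQ_CST);
    }

    void add_sync(void)
    {
        /* Legacy __sync builtin, specified as a full barrier: the same
         * kind of loop followed by dmb ish. */
        __sync_fetch_and_add(&counter, 2);
    }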
Here y (which is aio_poll) wants to order the write to ctx->notify_me
before reads of bh->scheduled. However, the processor can speculate the
load of bh->scheduled between the load-acquire and store-release of
ctx->notify_me. So you can have something like:
   thread 0 (y)                       thread 1 (x)
   ---------------------------------  ---------------------------------
   y1: load-acq ctx->notify_me
   y2: load-rlx bh->scheduled
                                      x1: store-rel bh->scheduled <-- 1
                                      x2: memory barrier
                                      x3: load-rlx ctx->notify_me
   y1: store-rel ctx->notify_me <-- 2

Being very puzzled, I tried to put this into cppmem:

    int main() {
      atomic_int ctx_notify_me = 0;
      atomic_int bh_scheduled = 0;
      {{{ {
            bh_scheduled.store(1, mo_release);
            atomic_thread_fence(mo_seq_cst);
            // must be zero since the bug report shows no notification
            ctx_notify_me.load(mo_relaxed).readsvalue(0);
          }
      ||| {
            ctx_notify_me.store(2, mo_seq_cst);
            r2=bh_scheduled.load(mo_relaxed);
          }
      }}};
      return 0;
    }

and much to my surprise, the tool said r2 *can* be 0.  Same if I put a
CAS in place of the seq-cst store, which resembles the code in the test
case a bit more.
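
This corresponds to the atomic_thread_fence experiment described below. A
sketch of the second cppmem thread with the fence added (its placement,
between the RMW and the relaxed load, is my assumption):

    ctx_notify_me.store(2, mo_seq_cst);
    atomic_thread_fence(mo_seq_cst);   // pairs with the fence in the first thread
    r2=bh_scheduled.load(mo_relaxed);

With this change only the execution where r2 == 1 remains, matching the
finding below.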
I then found a discussion about using the C11 memory model in Linux
(https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html) which contains the
following statement, which is a bit disheartening even though it is
about a different test:
   My first gut feeling was that the assertion should never fire, but
   that was wrong because (as I seem to usually forget) the seq-cst
   total order is just a constraint but doesn't itself contribute
   to synchronizes-with -- but this is different for seq-cst fences.
and later in the thread:
   Use of C11 atomics to implement Linux kernel atomic operations
   requires knowledge of the underlying architecture and the compiler's
   implementation, as was noted earlier in this thread.
Indeed if I add an atomic_thread_fence I get only one valid execution,
where r2 must be 1.  This is similar to GCC's bug
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697, and we can fix it in
QEMU by using __sync_fetch_and_add; in fact cppmem also shows one valid
execution if the store is replaced with something like GCC's assembly
for __sync_fetch_and_add (or Linux's assembly for atomic_add_return).

So we should:

1) understand why ATOMIC_SEQ_CST is not enough in this case. QEMU code
seems to be making the same assumptions as Linux about the memory model,
and this is wrong because QEMU uses C11 atomics if available.
Fortunately, this kind of synchronization in QEMU is relatively rare and
only this particular bit seems affected. If there is a fix which stays
within the C11 memory model, and does not pessimize code on x86, we can
use it[1] and document the pitfall.
2) if there's no way to fix the bug, qemu/atomic.h needs to switch to
__sync_fetch_and_add and friends. And again, in this case the
difference between the C11 and Linux/QEMU memory models must be documented.
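
A sketch of what option 2 would mean for the defines quoted at the top
(the legacy __sync builtins are specified as full barriers):

    #define atomic_add(ptr, n) ((void) __sync_fetch_and_add(ptr, n))
    #define atomic_sub(ptr, n) ((void) __sync_fetch_and_sub(ptr, n))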
Torvald, Will, help me please... :((
Paolo
[1] as would be the case if fetch_add were implemented as
fetch_add(RELEASE)+thread_fence(SEQ_CST).
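
Spelled out, the fetch_add of [1] would look something like this
(hypothetical helper, not a proposed patch):

    static inline void atomic_add_mb(int *ptr, int n)
    {
        __atomic_fetch_add(ptr, n, __ATOMIC_RELEASE);  /* fetch_add(RELEASE) */
        __atomic_thread_fence(__ATOMIC_SEQ_CST);       /* thread_fence(SEQ_CST) */
    }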