Bad Neon intrinsics code gen when using ld4/st4 on AArch64
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Linaro GCC | Confirmed | Undecided | Michael Collison |
Bug Description
The attached test case produces the following code for arm-none-eabi:
gcc -S -o- -mcpu=cortex-a9 -mfpu=neon -mfloat-abi=hard /tmp/t.c:
test4:
add r1, r0, r1
vmov.i32 d24, #0 @ v8qi
cmp r0, r1
bxeq lr
.L7:
vld4.8 {d20-d23}, [r0]
vadd.i8 d25, d24, d20
vmov d16, d25 @ v8qi
vadd.i8 d25, d25, d21
vmov d17, d25 @ v8qi
vadd.i8 d25, d25, d22
vadd.i8 d24, d25, d23
vmov d18, d25 @ v8qi
vmov d19, d24 @ v8qi
vst4.32 {d16[0], d17[0], d18[0], d19[0]}, [r0]!
cmp r1, r0
bne .L7
bx lr
(Not perfect, but the extraneous vmovs are understood and are being investigated elsewhere.)
For aarch64-none-elf this produces:
aarch64-
test4:
add x1, x0, x1, uxtw
cmp x0, x1
sub sp, sp, #96
beq .L1
movi v0.2s, 0
add x4, sp, 8
add x3, sp, 16
add x2, sp, 24
.L3:
ld4 {v1.8b - v4.8b}, [x0]
add x5, sp, 32
st1 {v1.16b - v4.16b}, [x5]
ld1 {v3.8b}, [x5]
add x5, sp, 48
add v3.8b, v0.8b, v3.8b
ld1 {v2.8b}, [x5]
add x5, sp, 64
ld1 {v1.8b}, [x5]
add v2.8b, v2.8b, v3.8b
add x5, sp, 80
add v1.8b, v1.8b, v2.8b
ld1 {v0.8b}, [x5]
add v0.8b, v0.8b, v1.8b
st1 {v3.8b}, [sp]
st1 {v2.8b}, [x4]
st1 {v1.8b}, [x3]
st1 {v0.8b}, [x2]
// Start of user assembly
// 15030 "/work/
ld1 {v16.2s - v19.2s}, [sp]
st4 {v16.s - v19.s}[0], [x0]
// 0 "" 2
// End of user assembly
add x0, x0, 16
cmp x1, x0
bne .L3
.L1:
add sp, sp, 96
ret
.size test4, .-test4
.ident "GCC: (GNU) 4.9.0 20130930 (experimental)"
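This affects both Linaro GCC 4.8 and FSF trunk. The AArch64 code has significantly more loads and stores than the ARM code: the ld4 result is spilled to the stack with st1 and reloaded one register at a time, and the four sums are stored to the stack and reloaded again before the st4.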
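The attached test case is not reproduced in this report. As a rough guide to the kind of loop that could produce the listings above, here is a hypothetical reconstruction inferred only from the assembly. The function name test4 comes from the listings; the signature, variable names, and the scalar fallback (added so the sketch also builds on non-Neon hosts) are assumptions, and a little-endian target is assumed for the lane-to-byte mapping.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Hypothetical reconstruction, NOT the attached test case.
   Each iteration reads 32 bytes but advances by 16, as in the listings. */
void test4(uint8_t *p, uint32_t len)
{
    assert(len % 16 == 0);           /* the loop exits only when p == end */
    uint8_t *end = p + len;
    uint8x8_t acc = vdup_n_u8(0);

    while (p != end) {
        uint8x8x4_t in = vld4_u8(p); /* ld4 / vld4.8: de-interleaving load */
        uint32x2x4_t out;

        /* Running sum across the four de-interleaved vectors. */
        acc = vadd_u8(acc, in.val[0]);
        out.val[0] = vreinterpret_u32_u8(acc);
        acc = vadd_u8(acc, in.val[1]);
        out.val[1] = vreinterpret_u32_u8(acc);
        acc = vadd_u8(acc, in.val[2]);
        out.val[2] = vreinterpret_u32_u8(acc);
        acc = vadd_u8(acc, in.val[3]);
        out.val[3] = vreinterpret_u32_u8(acc);

        /* st4 (single lane) / vst4.32: store lane 0 of each vector. */
        vst4_lane_u32((uint32_t *)p, out, 0);
        p += 16;
    }
}

#else

/* Scalar model of the same semantics for hosts without Neon. */
void test4(uint8_t *p, uint32_t len)
{
    assert(len % 16 == 0);
    uint8_t *end = p + len;
    uint8_t acc[8] = {0};

    while (p != end) {
        uint8_t out[16];
        for (int k = 0; k < 4; k++) {
            /* Vector k of the ld4 result holds bytes p[k], p[4+k], ... */
            for (int i = 0; i < 8; i++)
                acc[i] = (uint8_t)(acc[i] + p[4 * i + k]);
            /* 32-bit lane 0 of running sum k is its first four bytes. */
            for (int j = 0; j < 4; j++)
                out[4 * k + j] = acc[j];
        }
        memcpy(p, out, 16);          /* all loads happen before the store */
        p += 16;
    }
}

#endif
```

Because each iteration loads 32 bytes and stores 16, the buffer passed to this sketch needs 16 readable bytes beyond len. In a loop of this shape the four sums could in principle stay in registers between the ld4 and the st4, which is what the ARM listing does apart from the vmovs.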
Changed in gcc-linaro:
status: New → Confirmed
Changed in gcc-linaro:
assignee: nobody → Michael Collison (michael-collison)
Now tracked at https://bugs.linaro.org/show_bug.cgi?id=536