Poorly optimised code generation for Cortex-M0/M0+/M1 vs M3/M4
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
GNU Arm Embedded Toolchain | In Progress | Undecided | Unassigned |
Bug Description
I had believed I was hitting the bug reported at https:/
However, further testing leads me to believe this is a distinct problem.
I am using:
Release Version - gcc-arm-
This is the binary release, downloaded directly from Launchpad.
Host is 64-bit Ubuntu 15.04.
On Cortex-M0 class cores (M0, M0+ and M1), code generation is far from optimal at -Os, and every other optimization level exhibits the same problem. The same code compiled for the M3 does not show these sub-optimal patterns, and testing shows that, for the test case presented, the code GCC generates for the M3 assembles without error for the M0.
The major problem is that accessing memory-mapped registers at known addresses gives each register its own entry in the literal table, when a single entry plus offset addressing would suffice. This makes the code significantly larger and slower than necessary. The problem occurs with several different declarations of the same memory: direct pointers, an array, a structure. A smaller but related problem is that constants are not consistently calculated from known register contents when they could be, and instead create unnecessary literal table entries. I believe the root cause is shared, which is why I have created one bug report for the two issues.
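To make the "single entry plus offset" point concrete, here is a rough model in C. It is not GCC's actual logic. It assumes only the Thumb-1 store encodings (a 5-bit immediate scaled by the access size), and the helper names `thumb1_max_offset` and `bases_needed` are hypothetical. It counts how many base addresses a run of ascending stores needs when each base is reused for as long as the offset still fits.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest immediate offset of a Thumb-1 store: a 5-bit field scaled by
 * the access size in bytes (STRB 0..31, STRH 0..62, STR 0..124). */
static uint32_t thumb1_max_offset(uint32_t access_size)
{
    return 31u * access_size;
}

/* Count the base addresses needed for a run of stores, reusing the
 * current base while the offset is still encodable.
 * Illustrative model only: addrs must be ascending and naturally
 * aligned for access_size. */
static size_t bases_needed(const uint32_t *addrs, size_t n,
                           uint32_t access_size)
{
    size_t count = 0;
    uint32_t base = 0;

    for (size_t i = 0; i < n; i++) {
        if (count == 0 || addrs[i] - base > thumb1_max_offset(access_size)) {
            base = addrs[i];   /* offset no longer fits: start a new base */
            count++;
        }
    }
    return count;
}
```

For the seven byte addresses written by test 6 (0x40002800 to 0x4000280c), this model gives one base, which is what the M3 build emits, against the seven literals in the M0 build.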
There are 6 tests in the attached test case. All are suboptimal compared to the M3 compile of the same tests, even though the M3 output contains no instructions that are not also legal M0 code.
Of gravest concern to me as a general pattern is test 6. I quote it here:
/* Write 8 bit values to known register locations - using an array */
void test6(void)
{
    volatile uint8_t* const r = (uint8_
    r[0] = 0xFF;
    r[1] = 0xFE;
    r[2] = 0xFD;
    r[3] = 0xFC;
    r[4] = 0xEE;
    r[8] = 0xDD;
    r[12] = 0xCC;
}
Which, at -Os with -mcpu=cortex-m0, results in:
000000ec <test6>:
ec: 22ff movs r2, #255 ; 0xff
ee: 4b0a ldr r3, [pc, #40] ; (118 <test6+0x2c>)
f0: 701a strb r2, [r3, #0]
f2: 4b0a ldr r3, [pc, #40] ; (11c <test6+0x30>)
f4: 3a01 subs r2, #1
f6: 701a strb r2, [r3, #0]
f8: 4b09 ldr r3, [pc, #36] ; (120 <test6+0x34>)
fa: 3a01 subs r2, #1
fc: 701a strb r2, [r3, #0]
fe: 4b09 ldr r3, [pc, #36] ; (124 <test6+0x38>)
100: 3a01 subs r2, #1
102: 701a strb r2, [r3, #0]
104: 4b08 ldr r3, [pc, #32] ; (128 <test6+0x3c>)
106: 3a0e subs r2, #14
108: 701a strb r2, [r3, #0]
10a: 4b08 ldr r3, [pc, #32] ; (12c <test6+0x40>)
10c: 3a11 subs r2, #17
10e: 701a strb r2, [r3, #0]
110: 4b07 ldr r3, [pc, #28] ; (130 <test6+0x44>)
112: 3a11 subs r2, #17
114: 701a strb r2, [r3, #0]
116: 4770 bx lr
118: 40002800 .word 0x40002800
11c: 40002801 .word 0x40002801
120: 40002802 .word 0x40002802
124: 40002803 .word 0x40002803
128: 40002804 .word 0x40002804
12c: 40002808 .word 0x40002808
130: 4000280c .word 0x4000280c
Each element accessed in the byte array has put its own address in the literal table: seven literals for seven stores.
By comparison, the M3 build generates:
00000094 <test6>:
94: 4b07 ldr r3, [pc, #28] ; (b4 <test6+0x20>)
96: 22ff movs r2, #255 ; 0xff
98: 701a strb r2, [r3, #0]
9a: 22fe movs r2, #254 ; 0xfe
9c: 705a strb r2, [r3, #1]
9e: 22fd movs r2, #253 ; 0xfd
a0: 709a strb r2, [r3, #2]
a2: 22fc movs r2, #252 ; 0xfc
a4: 70da strb r2, [r3, #3]
a6: 22ee movs r2, #238 ; 0xee
a8: 711a strb r2, [r3, #4]
aa: 22dd movs r2, #221 ; 0xdd
ac: 721a strb r2, [r3, #8]
ae: 22cc movs r2, #204 ; 0xcc
b0: 731a strb r2, [r3, #12]
b2: 4770 bx lr
b4: 40002800 .word 0x40002800
All of which is legal M0 code, and at 36 bytes including its single literal it is half the size of the 72-byte M0 output.
There are 5 other tests in the test case. Each exhibits this behavior for a different pattern of accessing the same memory-mapped registers. The only test which uses STR with an offset is test 5, but it exhibits another, less severe optimization problem that creates unnecessary literal table entries.
The tests access four contiguous 32-bit memory-mapped registers at a known address (test 6 writes bytes instead):
Test 1 - Fixed pointers to known locations.
Test 2 - Accessing the registers as an array.
Test 3 - Accessing the registers as a structure.
Test 4 - Accessing as an array where the array element (register) is a union type.
Test 5 - Accessing the registers as a structure where each register is a union type.
Test 6 - Just writing contiguous bytes in memory as an array.
In each case the addresses are known at compile time and are fixed constants, and so is the data being written.
It seems strange that GCC can generate so much better code for the M3 than for the M0, even when it does not emit any instructions specific to the M3.
The attached file contains my test source (test.c), a bash script to build the code as I tested it, and the output from my tests for both the M0 and M3.
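Since test.c itself is not reproduced in this report, the following is only a guess at the shape of the first three access patterns, not the attached source. The function names, the written values, and the `TARGET_BUILD` macro are all hypothetical, and the base address is borrowed from the test 6 listing.

```c
#include <stdint.h>

/* TARGET_BUILD is a hypothetical macro for this sketch: define it when
 * cross-compiling so the fixed hardware address is used. Without it the
 * code writes to an ordinary array, so it can run on a host. */
#ifdef TARGET_BUILD
#define REG_BASE 0x40002800u            /* address seen in the test 6 listing */
#else
static uint32_t host_regs[4];           /* stand-in for the register block */
#define REG_BASE ((uintptr_t)host_regs)
#endif

/* Test 1 style: one fixed pointer per register. */
void sketch_pointers(void)
{
    volatile uint32_t *const r0 = (volatile uint32_t *)(REG_BASE + 0u);
    volatile uint32_t *const r1 = (volatile uint32_t *)(REG_BASE + 4u);
    volatile uint32_t *const r2 = (volatile uint32_t *)(REG_BASE + 8u);
    volatile uint32_t *const r3 = (volatile uint32_t *)(REG_BASE + 12u);

    *r0 = 0x11u;
    *r1 = 0x12u;
    *r2 = 0x13u;
    *r3 = 0x14u;
}

/* Test 2 style: the same registers accessed as an array. */
void sketch_array(void)
{
    volatile uint32_t *const r = (volatile uint32_t *)REG_BASE;

    r[0] = 0x21u;
    r[1] = 0x22u;
    r[2] = 0x23u;
    r[3] = 0x24u;
}

/* Test 3 style: the same registers accessed as a structure. */
typedef struct {
    volatile uint32_t a, b, c, d;
} regs_t;

void sketch_struct(void)
{
    regs_t *const r = (regs_t *)REG_BASE;

    r->a = 0x31u;
    r->b = 0x32u;
    r->c = 0x33u;
    r->d = 0x34u;
}
```

In all three forms the addresses and the data are compile-time constants, so one base register plus immediate offsets (0, 4, 8, 12) would be enough to reach every register.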
Hmm,
I was playing with test6, which just assigns values consecutively to a byte array. If I make the base address low (0x10 in my test case), the M0 build suddenly produces the code I would expect:
00000134 <test7>:
134: 2310 movs r3, #16
136: 22ff movs r2, #255 ; 0xff
138: 701a strb r2, [r3, #0]
13a: 3a01 subs r2, #1
13c: 705a strb r2, [r3, #1]
13e: 3a01 subs r2, #1
140: 709a strb r2, [r3, #2]
142: 3a01 subs r2, #1
144: 70da strb r2, [r3, #3]
146: 3a0e subs r2, #14
148: 711a strb r2, [r3, #4]
14a: 3a11 subs r2, #17
14c: 721a strb r2, [r3, #8]
14e: 3a11 subs r2, #17
150: 731a strb r2, [r3, #12]
152: 4770 bx lr
But strangely, M3 code generation goes bad, loading each address with its own movs instead of using an offset:
000000b8 <test7>:
b8: 22ff movs r2, #255 ; 0xff
ba: 2310 movs r3, #16
bc: 701a strb r2, [r3, #0]
be: 22fe movs r2, #254 ; 0xfe
c0: 2311 movs r3, #17
c2: 701a strb r2, [r3, #0]
c4: 22fd movs r2, #253 ; 0xfd
c6: 2312 movs r3, #18
c8: 701a strb r2, [r3, #0]
ca: 22fc movs r2, #252 ; 0xfc
cc: 2313 movs r3, #19
ce: 701a strb r2, [r3, #0]
d0: 22ee movs r2, #238 ; 0xee
d2: 2314 movs r3, #20
d4: 701a strb r2, [r3, #0]
d6: 22dd movs r2, #221 ; 0xdd
d8: 2318 movs r3, #24
da: 701a strb r2, [r3, #0]
dc: 22cc movs r2, #204 ; 0xcc
de: 231c movs r3, #28
e0: 701a strb r2, [r3, #0]
e2: 4770 bx lr
But if I move the start address higher in memory, to say 0x200, the M0 code generation changes to this:
00000134 <test7>:
134: 2380 movs r3, #128 ; 0x80
136: 22ff movs r2, #255 ; 0xff
138: 009b lsls r3, r3, #2
13a: 701a strb r2, [r3, #0]
13c: 4b07 ldr r3, [pc, #28] ; (15c <test7+0x28>)
13e: 3a01 subs r2, #1
140: 701a strb r2, [r3, #0]
142: 4b07 ldr r3, [pc, #28] ; (160 <test7+0x2c>)
144: 3a01 subs r2, #1
146: 701a strb r2, [r3, #0]
148: 4b06 ldr r3, [pc, #24] ; (164 <test7+0x30>)
14a: 3a01 subs r2, #1
14c: 701a strb r2, [r3, #0]
14e: 3a0e subs r2, #14
150: 705a strb r2, [r3, #1]
152: 3a11 subs r2, #17
154: 715a strb r2, [r3, #5]
156: 3a11 subs r2, #17
158: 725a strb r2, [r3, #9]
15a: 4770 bx lr
15c: 00000201 .word 0x00000201
160: 00000202 .word 0x00000202
164: 00000203 .word 0x00000203
And suddenly the code generator can calculate address 0x200 without the literal table, but can't calculate 0x201, 0x202 or 0x203. Yet once it has 0x203, it can reach 0x204, 0x208 and 0x20C as offsets from it.
Something very strange is going on with the M0 code generation for these sequences.
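The constants seen across these listings fit a simple pattern, sketched below in C. This is a model of what the M0 output above appears to do, not GCC's real constant-synthesis logic, and `classify_thumb1_const` is a hypothetical name. It assumes only that Thumb-1 MOVS takes an 8-bit immediate and that LSLS can shift the result left.

```c
#include <stdint.h>

typedef enum {
    CONST_MOVS,       /* fits MOVS Rd, #imm8 (0..255) */
    CONST_MOVS_LSLS,  /* an 8-bit value shifted left: MOVS then LSLS */
    CONST_LITERAL     /* anything else in this model: LDR from the literal table */
} const_kind;

/* Classify a constant by how the M0 listings above materialise it.
 * Illustrative model only. */
static const_kind classify_thumb1_const(uint32_t v)
{
    if (v <= 0xFFu)
        return CONST_MOVS;

    /* Strip trailing zero bits; v is nonzero here, so the loop ends. */
    while ((v & 1u) == 0u)
        v >>= 1;

    return (v <= 0xFFu) ? CONST_MOVS_LSLS : CONST_LITERAL;
}
```

Under this model 0x10 is a plain MOVS, 0x200 is MOVS #128 followed by LSLS #2, and 0x201 and 0x40002800 both fall through to the literal table, which matches the listings. It does not explain why the offset form of STRB is skipped for 0x201 to 0x203 when 0x200 is already in a register.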