(In reply to comment #6)
> Similar testcase is gcc's own libcpp/lex.c optimization, which also can access
> a few bytes after malloced area, as long as at least one byte in the value read
> is from within the malloced area.
Those loops are (effectively) vectorised while loops, in which you use
standard carry-chain propagation tricks to ensure that the stopping
condition for the loop does not rely on the data from beyond the malloced
area. It is not possible to vectorise them without such over-reading.
By contrast, Joost's loop (and anything gcc can vectorise) are countable
loops: the trip count is known (at run time) before the loop begins. It
is always possible to vectorise such a loop without generating memory
over reads, by having a vector loop to do (trip_count / vector_width)
iterations, and a scalar fixup loop to do the final (trip_count % vector_width)
iterations.
> I guess valgrind could mark somehow the extra bytes as undefined content and
> propagate it through following arithmetic instructions, complain only if some
> conditional jump was made solely on the undefined bits or if the undefined bits
> were stored somewhere (or similar heuristics).
Well, maybe .. but Memcheck is too slow already. I don't want to junk it up
with expensive and complicated heuristics that are irrelevant for 99.9% of
the loads it will encounter.
If you can show me some way to identify just the loads that need special
treatment, then maybe. I don't see how to identify them, though.
(In reply to comment #6)
> Similar testcase is gcc's own libcpp/lex.c optimization, which also can access
> a few bytes after malloced area, as long as at least one byte in the value read
> is from within the malloced area.
Those loops are (effectively) vectorised while loops, in which you use
standard carry-chain propagation tricks to ensure that the stopping
condition for the loop does not rely on the data from beyond the malloced
area. It is not possible to vectorise them without such over-reading.
By contrast, Joost's loop (and anything gcc can vectorise) are countable
loops: the trip count is known (at run time) before the loop begins. It
is always possible to vectorise such a loop without generating memory
over reads, by having a vector loop to do (trip_count / vector_width)
iterations, and a scalar fixup loop to do the final (trip_count % vector_width)
iterations.
> I guess valgrind could mark somehow the extra bytes as undefined content and
> propagate it through following arithmetic instructions, complain only if some
> conditional jump was made solely on the undefined bits or if the undefined bits
> were stored somewhere (or similar heuristics).
Well, maybe .. but Memcheck is too slow already. I don't want to junk it up
with expensive and complicated heuristics that are irrelevant for 99.9% of
the loads it will encounter.
If you can show me some way to identify just the loads that need special
treatment, then maybe. I don't see how to identify them, though.