2a fetch does not use 'skinny' groups
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Bazaar | Confirmed | Medium | Unassigned |
Bug Description
We may already have a bug report for this, but I couldn't find it, and I wanted to record some notes about the issue.
1) From the top level, the issue is that all fetching using groupcompress is
currently done via a group that contains at least one fulltext (and often one
fulltext for each file being fetched).
This means that if you have revisions 1-10 and someone commits a 20-byte change
to a 1MB file, fetching that change downloads the full 1MB.
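To make the cost concrete, here is a toy illustration in Python (these are invented structures, not the real bzrlib.groupcompress ones):

    # Toy illustration of the current behaviour; invented structures,
    # not the real bzrlib.groupcompress ones.
    group = [
        ('fulltext', 'bigfile-rev10', 1024 * 1024),  # ~1MB base text
        ('delta',    'bigfile-rev11', 20),           # the 20-byte change
    ]

    # A delta's copy instructions address bytes of the group's own
    # stream, so fetching only bigfile-rev11 still ships everything.
    bytes_wanted = 20
    bytes_sent = sum(size for _, _, size in group)
    print(bytes_sent)  # 1048596 -- roughly 50,000x more than needed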
2) One possibility we discussed conceptually is to allow 'skinny' groups.
These would be generated by inserting a shared fulltext into a group, then
compressing the content to be sent against it. When sending the group, we
would omit the shared fulltext bytes and transmit only the groupcompress
delta bytes.
This can be a fairly big win for small changes to large files.
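A minimal sketch of the idea, using zlib's preset-dictionary support as a stand-in for real groupcompress deltas (all names here are invented; the real logic would live in bzrlib.groupcompress, which unlike deflate is not limited to a 32KB window):

    import os
    import zlib

    def make_skinny_payload(shared_fulltext, new_text):
        # Compress the new text against a base both sides already
        # hold; only this payload needs to cross the wire.
        c = zlib.compressobj(zdict=shared_fulltext)
        return c.compress(new_text) + c.flush()

    def expand_skinny_payload(shared_fulltext, payload):
        # The recipient re-supplies its local copy of the shared
        # fulltext instead of receiving it over the network.
        d = zlib.decompressobj(zdict=shared_fulltext)
        return d.decompress(payload)

    base = os.urandom(30000)          # stand-in for the large file
    new = base[:15000] + b'20 bytes of new text' + base[15000:]
    payload = make_skinny_payload(base, new)
    assert expand_skinny_payload(base, payload) == new
    print(len(payload))   # a few hundred bytes, instead of ~30000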
However, there are some caveats:
1) Generally when fetching we ask for a set of keys, and we don't make
assumptions about keys that we aren't requesting.
For example, we say "give me keys A, B, C" but we don't also say "and not D".
We might be able to infer some of that when we have a graph for the given keys
(we have a graph for text keys, but no graph for chk pages).
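For reference, the current call shape (recalled from the VersionedFiles API; treat the details as approximate) gives the callee no channel for naming keys the caller already holds:

    # Approximate current API: the request names only wanted keys.
    # There is no way to add "...and assume I already have D".
    stream = repo.texts.get_record_stream(
        [('file-id', 'rev-a'), ('file-id', 'rev-b'), ('file-id', 'rev-c')],
        'groupcompress',   # ordering hint
        False)             # include_delta_closure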
2) This also might have bad results interacting with stacking. Consider an
example:
  base_repo:    text keys fa1, fa2, fa3; revisions a1, a2, a3
  stacked_repo: text keys fa3, fa4;      revisions a4
  target:       text keys fa1, fa2;      revisions a1, a2
At this point we want to fetch a4 and a3 into the target. With our current
fetch code, this will fetch a4 and fa4 from the stacked location, and then make
a separate request for a3 and fa3 from the fallback.
I'm not asserting how stacked_repo came to have text fa3 but not revision a3;
consider something like a merge revision, a ghost revision, or a few other
possibilities.
Just doing the logic in the get_record_stream() layer, we would probably try
to say "you are asking for only fa4, so I expect you have all of its parents",
and would create a skinny group with fa4 in the send set and fa3 in the omit
set.
However, the target doesn't have fa3 yet. It probably will *get* fa3 when it
makes the fallback fetch from base_repo, but it cannot expand the stream as it
comes across the wire, and would need to buffer.
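A sketch of the buffering the receiver would be forced into (invented names throughout, not existing bzrlib API):

    # Invented names: why a receiver must buffer skinny groups whose
    # omitted base (fa3 here) has not arrived yet.
    def insert_skinny_stream(target, stream):
        pending = []
        for record in stream:
            missing = [k for k in record.omitted_keys
                       if not target.has_text(k)]
            if missing:
                # fa3 is omitted but not present locally, so this
                # group cannot be expanded as it crosses the wire.
                pending.append(record)
            else:
                target.insert(record.expand())
        # Only after the fallback fetch brings fa3 from base_repo can
        # the pending records be expanded and inserted.
        return pending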
3) There are probably other edge cases that could be triggered by combinations
of stacking, ghosts, and smart fetches (since a stacked repository doesn't have
access to the backing repository during a smart fetch).
4) We might be able to make more use of things like SearchResult, which has
more knowledge about what is accessible and what is not.
We could change the get_record_stream() API to be more explicit about keys
that may be used as omitted compression parents, allowing the logic of what
should be grouped to be computed at a higher level and then passed down.
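One hypothetical shape for that change (the omit_candidates parameter does not exist in bzrlib; it is purely illustrative):

    # Hypothetical: a higher level computes, e.g. from a SearchResult,
    # the keys both sides are known to hold, and passes them down as
    # permissible omitted compression parents.
    stream = repo.texts.get_record_stream(
        wanted_keys,
        'groupcompress',
        False,
        omit_candidates=common_keys)   # not a real bzrlib parameter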
5) This would cause some amount of server overhead, as the server has to
generate new compressed forms. This may be more overhead than in similar
systems, since none of the groupcompress delta 'recipes' are valid outside of
the specific group they were built in. (Versus chained deltas, where if you
insert a new delta at the front or back of the chain, the intermediate deltas
all remain valid.)
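To illustrate the difference (a toy rendering, not the actual encodings):

    # Toy rendering, not the real encodings.  A groupcompress recipe
    # addresses byte ranges of its group's own uncompressed stream:
    gc_recipe = [('copy', 0, 1048576),   # bytes 0..1MB of *this group*
                 ('insert', b'the 20 new bytes')]
    # Put that delta in a group with different preceding content and
    # every offset is wrong, so the server must recompress.

    # A chained delta instead names a stable parent text, so it stays
    # valid however the surrounding storage is rearranged:
    chained_delta = {'basis': ('bigfile-id', 'rev-10'),
                     'ops': [('copy', 0, 1048576),
                             ('insert', b'the 20 new bytes')]}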
Changed in bzr:
status: Triaged → Confirmed
tags: added: check-for-breezy
On Tue, 2009-07-21 at 18:53 +0000, John A Meinel wrote:
> 1) Generally when fetching we ask for a set of keys, and we don't
> make assumptions about keys that we aren't requesting.
>
> For example, we say "give me keys A, B, C" but we don't also say
> "and not D".
>
> We might be able to infer some of that when we have a graph for the
> given keys (we have a graph for text keys, but no graph for chk
> pages).
We already have a solution for this - the streaming protocol permits an
'and I need X text' response to a stream; so rather than worrying about
what the server will have, just let the server tell us.
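(Roughly, from memory of the StreamSource/StreamSink interfaces; the details are approximate:)

    # Approximate sketch: the sink names the keys it still needs and
    # the source re-sends just those.
    resume_tokens, missing_keys = sink.insert_stream(
        source.get_stream(search), src_format, [])
    while missing_keys:
        resume_tokens, missing_keys = sink.insert_stream(
            source.get_stream_for_missing_keys(missing_keys),
            src_format, resume_tokens)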
> 2) This also might have bad results interacting with stacking.
> ...
> However, it cannot expand the stream as it comes across the wire,
> and would need to buffer.
Same answer as for 1 - only send the bytes we know the new revs
introduced, and if a parent is missing the recipient will tell us.
> 3) There are probably other edge cases that could be triggered by
> combinations of stacking, ghosts, and smart fetches (since a stacked
> repository doesn't have access to the backing repository during a
> smart fetch).
I'm reasonably sure the current answer is sufficient.
> 4) We might be able to make more use of things like SearchResult,
> which has more knowledge about what is accessible and what is not.
>
> We could change the get_record_stream() API to be more explicit
> about keys that may be used as omitted compression parents, allowing
> the logic of what should be grouped to be computed at a higher level
> and then passed down.
I don't think that's really needed, though I do want the stream interface
to be as powerful as needed.
> 5) This would cause some amount of server overhead, as the server
> has to generate new compressed forms. This may be more overhead than
> in similar systems, since none of the groupcompress delta 'recipes'
> are valid outside of the specific group they were built in. (Versus
> chained deltas, where if you insert a new delta at the front or back
> of the chain, the intermediate deltas all remain valid.)
I think this is tolerable. We're already decompressing everything we
receive, and other systems do recombine texts from streams.
-Rob