Differences between revisions 1 and 13 (spanning 12 versions)
Revision 1 as of 2008-10-24 14:18:43
Size: 3423
Editor: tonfa
Comment: protocol ng, first draft
Revision 13 as of 2010-04-04 10:27:33
Size: 3390
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
== Proposed new wire protocol == ## page was renamed from WireProtocolNG

Status: Draft/Experiment

== Proposed new discovery/changegroup protocol ==
Line 11: Line 15:
The basic algorithm is still the same, we first discover a set of nodes.
Then we get all the changes from those nodes.
Line 16: Line 19:
The discovery protocol is based on the dominance property: http://en.wikipedia.org/wiki/Dominator_(node) ==== Discovery ====
Line 18: Line 21:
{{{heads()}}} Instead of exploring the missing remote nodes like the current protocol, this proposal explores the local nodes, exploiting more local knowledge and having a stateful approach (which can't be done on the server).

It only needs one new server call:

{{{discover(nodes)}}}
 * list of local nodes
Line 20: Line 28:
 return a list of heads
 (everything new must be an ancestor of one of these heads, so start here)
  return a boolean array, value is True if the server has the node, False otherwise. (TODO: investigate what is cheaper, between list and boolean array / to avoid races, should we pass the head as argument?)
Line 24: Line 31:
Basically dominators() is like between() except it is more coarse-grained
{{{dominators(list of couple)}}}
 * list = a list of couple (tip, dom) where dom dominates tip
 * tip = the last revision in a branch that we're interested in (often a head).
 * base = the first revision going backwards from a node that has two parents (the product of a merge).
 * p1 = the first parent of the base.
 * p2 = the second parent of the base.
The client keeps all it's nodes between three states: unknown, missing and common.
At each iteration, it builds a sample of the nodes in unknown state, send it to discover and update the sets accordingly.
The sample size should obvisously be bounded.

Proposed sample construction: first breadth first search, starting from the nodes at {{{headsof(unknown)}}}, keep nodes at a distance 0, 1, 2, 4, 8, 16, ... from the heads. Second breadth first search starting at {{{rootsof(unknown)}}}, keep the nodes at a distance 0, 1, 2, 4, 8, 16, ... from the roots.
If the sample is bigger than MAX_SAMPLE_SIZE, take a random sample from it, but make sure it includes the heads. If the number of unknown nodes is smaller than MAX_SAMPLE_SIZE, send all the unknown nodes to discover.

During testing in largish repository (netbeans, kernel), the number of iteration was almost always lower than 5.

Proposed tweak: the server can compute it's own unknown set and return a sample on it's own. Testing shown that it didn't make a big difference in the number of iterations (it usually removes at most one roundtrip)

==== Changegroup ====

The current changegroup() uses base nodes, it should instead use common nodes.

{{{changegroup(roots, heads)}}}
 * roots = list of nodes, that the server can assume the client knows (as well as all their ancestors).
Line 32: Line 49:
 let s be the sequence of dominators from tip (we only aggregate linear history, so it's a list of (t, b, p1, p2))
 return elements 1, 2, 4, ... (maybe bounded ?)
 find all changesets ancestors from heads and not descended from roots and return them as a single changegroup
Line 36: Line 52:
{{{between(list)}}}
 * branch = Basically a linear set of changes. More formally, a list of changes such that no change in the set (except possibly the last) has two parents, and all changes in the list are related to eachother by parent->child relationships.
 * list = a list of (tip, base) tuples for branches we'd like to know the history of. Presumably the client knows about the base of the branch, but not the tip, and wants to find out where in the branch its knowledge ends.
 * tip = the latest revision in a branch the client is interested in (may not be the actual tip of the branch on the server side)
 * base = the earliest revision in a branch the client is interested in. Always a revision the client knows about.
{{{
 for tip, base in list:
   walk back a linear branch, return elements 1, 2, 4, 8..
   (and this lets us do bisection search if we diverge in the middle of one of these long branches)
}}}

Now it would be better if the processus was symetrical, at the same time as we descend from remote tips,
we could do the same locally.
For that we would need the following call:
{{{has(list)}}}
 * list = a list of nodes
{{{
 return a list of boolean indicating if the node was found on the server.
}}}

To minimize roundtrips, it would be nice to combine the has() call with dominators().

The current changegroup() uses base nodes, it should instead use common nodes.
{{{changegroup(roots)}}}
 * roots = a list of the latest nodes on every service side changeset branch that both the client and server know about.
{{{
 find all changesets descended from roots and return them as a single changegroup
}}}
Line 75: Line 63:
 * uncompressed delta to p1  * uncompressed delta to p1 (or optionally to the previous node)
Line 77: Line 65:
 * where should lightweight copies be put?

=== Prototypes ===

Prototypes can be found here:

 * http://bitbucket.org/parren/dag-discovery/
 * http://bitbucket.org/bboissin/dag-discovery/ (fork)

=== Wishlist ===

 * Estimate early how much data or items have to be transfered and communicate
 this to the other side, so a progress indicator could be more useful.
 -- ThomasArendsenHein <<DateTime(2008-10-24T14:23:01Z)>>

Status: Draft/Experiment

Proposed new discovery/changegroup protocol

The wire protocol has several flaws:

  • it uses the roots of the new branches, this is susceptible to a race (see issue1320) .
  • if the client is missing a lot of nodes but doesn't have any local changes, it will still have to do a lot of roundtrips to discover the base nodes.

See WireProtocol for the current protocol.

Overview

The Protocol

Discovery

Instead of exploring the missing remote nodes like the current protocol, this proposal explores the local nodes, exploiting more local knowledge and having a stateful approach (which can't be done on the server).

It only needs one new server call:

discover(nodes)

  • list of local nodes

  return a boolean array, value is True if the server has the node, False otherwise. (TODO: investigate what is cheaper, between list and boolean array / to avoid races, should we pass the head as argument?)

The client keeps all it's nodes between three states: unknown, missing and common. At each iteration, it builds a sample of the nodes in unknown state, send it to discover and update the sets accordingly. The sample size should obvisously be bounded.

Proposed sample construction: first breadth first search, starting from the nodes at headsof(unknown), keep nodes at a distance 0, 1, 2, 4, 8, 16, ... from the heads. Second breadth first search starting at rootsof(unknown), keep the nodes at a distance 0, 1, 2, 4, 8, 16, ... from the roots. If the sample is bigger than MAX_SAMPLE_SIZE, take a random sample from it, but make sure it includes the heads. If the number of unknown nodes is smaller than MAX_SAMPLE_SIZE, send all the unknown nodes to discover.

During testing in largish repository (netbeans, kernel), the number of iteration was almost always lower than 5.

Proposed tweak: the server can compute it's own unknown set and return a sample on it's own. Testing shown that it didn't make a big difference in the number of iterations (it usually removes at most one roundtrip)

Changegroup

The current changegroup() uses base nodes, it should instead use common nodes.

changegroup(roots, heads)

  • roots = list of nodes, that the server can assume the client knows (as well as all their ancestors).

 find all changesets ancestors from heads and not descended from roots and return them as a single changegroup

A changegroup is a single stream containing:

  • a changelog group
  • a manifest group
  • a list of
    • filename length
    • filename
    • file group (terminated by a zero length filename)

A group is a list of chunks:

  • chunk length
  • self hash, p1 hash, p2 hash, link hash
  • uncompressed delta to p1 (or optionally to the previous node)
  • (terminated by a zero length chunk)
  • where should lightweight copies be put?

Prototypes

Prototypes can be found here:

Wishlist

  • Estimate early how much data or items have to be transfered and communicate this to the other side, so a progress indicator could be more useful.

    -- ThomasArendsenHein 2008-10-24 14:23:01


CategoryInternals

DiscoveryPlan (last edited 2013-04-18 18:37:27 by KevinBullock)