Differences between revisions 20 and 33 (spanning 13 versions)
Revision 20 as of 2014-03-16 09:25:14
Size: 12893
Comment: fix title
Revision 33 as of 2015-06-12 08:44:16
Size: 6476
Comment:
#pragma section-numbers 2
<<Include(A:historic)>>
<!> This plan has been carried out; check the in-code documentation.

= BundleFormat2 =
== Why a New bundle format? ==
 * recursive (to be able to bundle subrepos)
It's possible to envision a format that sends a change, its manifest, and filenodes in each chunk rather than sending all changesets, then all manifests, etc. capabilities
== Changes in current command ==
=== Push Orchestration ===
==== Current situation ====
 * push:
  * changesets:
   * discovery
   * validation
   * actual push
  * phase:
   * discovery
   * pull
   * push
  * obsolescence
   * discovery
   * push
  * bookmark
   * discovery
   * push
==== Aimed orchestration ====
 * push:
  * discovery:
   * changesets
   * phase
   * obs
   * bookmark
  * post-discovery action:
   * current use case: move the phase of common changesets seen as public.
  * local validation:
   * (much easier with everything in hand)
   * complains about:
    * multiple heads
    * new branch
    * troubled changesets
    * divergent bookmarks
    * missing subrepo revisions
    * Rent in Manhattan
    * etc…
  * push:
   * (using a multipart bundle when possible)
    . The one and single remote-side transaction happens here
  * (post-push) pull:
   * The server sends back its own multipart bundle to the client
    . (The server would be able to reply with a multipart bundle, to inform the client of potential phase/bookmark/changeset rewrites, etc…)
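The aimed flow can be sketched in Python. Everything below is hypothetical scaffolding — the `local`/`remote` objects and all their methods are invented for illustration, not Mercurial's real API — and only shows the shape: one discovery pass, one local validation pass, a single multipart bundle (hence a single remote-side transaction), then an advisory reply.

```python
# Hypothetical sketch of the aimed push orchestration. None of these
# names are Mercurial's real API; they only illustrate the flow.

def push(local, remote):
    # 1. A single discovery pass covering every category of data.
    outgoing = {kind: remote.missing(kind, local.have(kind))
                for kind in ('changesets', 'phase', 'obs', 'bookmark')}

    # 2. Post-discovery action, e.g. mark common changesets that the
    # remote reports as public.
    local.post_discovery(outgoing)

    # 3. Local validation: much easier with everything in hand.
    problems = local.validate(outgoing)
    if problems:  # multiple heads, new branch, divergent bookmark, ...
        raise ValueError('; '.join(problems))

    # 4. One push: a single multipart bundle, so the one and only
    # remote-side transaction happens here.
    reply = remote.unbundle(local.bundle(outgoing))

    # 5. (post-push) pull: the server replies with its own multipart
    # bundle; it is advisory, so the client may ignore it.
    local.apply_advisory(reply)
    return reply
```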
==== post-push pull ====
If we let the protocol send arbitrary data to the server, we need the server to be able to send back arbitrary data too.
The idea is to use the very same top-level format. It could contain any kind of thing the client has advertised that it understands. This last phase is advisory, so the client can totally decide to ignore its content.
Possible use cases are:
=== Changes in Pull ===
Same kind of stuff will happen but pull is much simpler. (I'm not worried at all about it). May efficiently pull subrepo revisions.
=== Change in Bundle/Unbundle ===
Maybe we can have the new bundle format start with an invalid entry to prevent an old unbundle from trying to import it.
We could also do a "recursive bundle" in the presence of subrepos. A bundle could contain parts that are bundles of the subrepo revisions referenced by the revisions contained in the main bundle.
== Top level Bundle ==
=== content ===
On the remote side, the server will need to redo the validation that was done on the local side, to ensure that nothing interesting happened between discovery and push. We need to send appropriate data to the remote for validation. This implies either arguments in the command data, or a dedicated section in the bundle. The dedicated section seems the way to go, as it feels more flexible: we do not know what kind of data will be monitored and sent, so we cannot build a sensible set of arguments doing the job. With a dedicated section in the multi-part bundle, we can make this section evolve over time to match the evolution of the data we send to the server.
=== foreseen sections ===
 * presence of subrepo bundles
== Format of the Bundle2 Container ==
The latest description of the binary format can be found as comment in the Mercurial source code. This is the source of truth.
=== Examples of top-level parameters ===
These are examples, **not actual proposals of final parameters**. Some of them are actually very clowny.

==== Mandatory options ====
 * Set a new format of part headers:
  . `PARTVERSION=1`
 * Have the payload use a special compression algorithm:
  . `COMPRESSION=DOGEZIP`
 * Set the encoding of strings in the part header to GOST13052 (or EBCDIC if you insist):
  . `PARTENCODING=GOST13052`
 * Set the integer format in the part header to middle-endian:
  . `ENDIANESS=PDP11`

==== Example advisory options ====
 * ask for debug-level output in the reply:
  . `debug=1`
 * inform of the total number of parts:
  . `nbparts=42`
 * inform of the total size of the bundle:
  . `totalsize=1337`

==== Example of -invalid- options ====
 * List of known heads (use a part for that)
 * username and/or credential (use a part for that)

== Goal ==

The goal of bundle2 is to act as an atomic packet to transmit a set of payloads in an application-agnostic way. It consists of a sequence of "parts" that are handed to and processed by the application layer.

A bundle2 can be read in a single pass from a stream.

A bundle2 starts with a small header, followed by a sequence of parts. Parts have a header of their own.

== Main Header ==

This header contains information about the application-agnostic bundle.

It is encoded as such:

 - bundle version (unsigned Byte)
 - main options:
   - size of main options (unsigned 32 bits integer)
   - main options (text)

The bundle version is a single unsigned Byte used to know the version of the bundle2 format used in this stream. It will be incremented when a major upgrade of the protocol happens.

Current value is `0`.

Unbundling MUST abort when an unknown protocol version is met.
Note that aborts from an unknown protocol version are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.

=== Main Options ===

First comes a 32 bits integer: the size, in Bytes, of the options themselves.

This is the size of the data in the main header. If people need more than 4GB of options, I expect them to run into other troubles before.

The main header data is the list of options that alter the behavior of the top level bundle. This is intended only to control extraction of the payload from it. This is -not- intended for any changes in the application-level understanding of the payload. The list of options follows these rules:

  - Content is a pure alphanumerical list of options,
  - The list is space separated,
  - Options MUST start with a letter,
  - upper case options are mandatory,
  - lower case options MAY be disregarded,
  - mixed case options SHOULD never happen but will be interpreted as upper case,
  - options may be valuated using the form optionname=value,
  - values are alphanumerical + `.`,

  - to summarise, options are a space separated list of things matching:
    `[A-Za-z][A-Za-z0-9]+(=[A-Za-z0-9.]+)`

Note that the first piece sent is the size of the options section, so the options themselves cannot be streamed. This is one more reason why you should not intend to store huge data in the main options.

Note also that aborts from an unknown option are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.
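The option grammar above fits in a few lines of Python. This is a sketch, not Mercurial's actual parser; the regular expression is the summary one, slightly relaxed so that the value group is optional (options "may be valuated") and one-letter names are allowed:

```python
import re

# One option: a letter, then alphanumerics, optionally =value where the
# value is alphanumeric plus '.'.
OPTION_RE = re.compile(r'[A-Za-z][A-Za-z0-9]*(?:=[A-Za-z0-9.]+)?$')

def parse_main_options(text):
    """Split the main-options text into (mandatory, advisory) dicts."""
    mandatory, advisory = {}, {}
    for token in text.split():
        if not OPTION_RE.match(token):
            # A garbled option cannot be skipped safely: the spec calls
            # for a full-scale panic abort.
            raise ValueError('malformed option: %r' % token)
        name, _, value = token.partition('=')
        if name.islower():
            advisory[name] = value           # MAY be disregarded
        else:
            # Mixed case SHOULD never happen but is read as upper case.
            mandatory[name.upper()] = value  # MUST be understood
    return mandatory, advisory
```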

== Parts ==

Parts convey the application-level payload of the bundle. They are handled by the application layer during the unbundle process.

A part consists of three elements: type, options and data.

The type is a simple alphanumerical identifier that lets the application level know what kind of data the part contains, and route it to the applicable processors.

 - upper case types are mandatory and MUST be processed by the server. If the
   server does not know how to handle an upper case type it MUST abort the
   unbundle process.

 - lower case types are advisory and CAN be disregarded during the unbundle process.

 - mixed case SHOULD not appear and WILL be interpreted as upper case.


Options are a set of keys and values that may change the way the data from this part will be processed. Some of them may be mandatory, others advisory.

Data is the actual payload of the part.

=== Parts Header ===


 - size of header (32bits integer)
 - header:
   - size of type (Byte)
   - part type (string (up to 255 char))
   - part mode (Byte)
   - data size (when applicable)
   - options: (see other section)


Note that the first entry is the full size of the header. So the header cannot be streamed, and one should not plan to put massive data in the header itself. (That is what part data is meant for.)

The type is an alphanumerical string of arbitrary size (<256) that will be used to find the application-level handler that processes the data payload. It follows the upper/lower case rules explained in the previous section. Note that routing should be case insensitive: the lower case and upper case versions of the same type MUST be handled by the same code. The case only matters when no handler is found for a given type.

The mode is an enum that defines how the data can be retrieved from the part. The unbundle process MUST be aborted if an unknown mode is met. The two foreseen modes for now are:

 - stream (0x0): the total size of the data is not yet known. See the dedicated
   section below for details.

 - plain (0x1): we know the total size of the data, and it will be available
   directly after the header. The total size of the data is encoded in a 32 bits
   integer right after the mode field.

Note that aborts from an unknown mode are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.
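A sketch of a reader for this header layout (byte order is an assumption — the text does not specify one — and option parsing is elided, since options have their own layout, described below):

```python
import io
import struct

def read_part_header(stream):
    """Read one part header; return None for the end-of-bundle marker."""
    (header_size,) = struct.unpack('>I', stream.read(4))
    if header_size == 0:
        return None                      # empty part: end of bundle
    header = io.BytesIO(stream.read(header_size))
    (type_size,) = struct.unpack('>B', header.read(1))
    part_type = header.read(type_size).decode('ascii')
    (mode,) = struct.unpack('>B', header.read(1))
    if mode == 0x1:                      # plain: data size known
        (data_size,) = struct.unpack('>I', header.read(4))
    elif mode == 0x0:                    # stream: chunked, size unknown
        data_size = None
    else:
        # Unknown mode: panic abort, the channel can no longer be read.
        raise IOError('unknown part mode: %#x' % mode)
    # (options parsing elided; see the dedicated section)
    mandatory = not part_type.islower() # mixed case reads as upper case
    return part_type, mandatory, mode, data_size
```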

==== Parts Options ====

Part options are able to carry arbitrary bytes. Their encoding is therefore more complicated than the main options.

 - number of mandatory options (32 bits integer)
 - number of advisory options (32 bits integer)
 - pairs of option sizes (sequence of 32 bits integer couples)
 - options themselves

First come the number of mandatory and advisory options. Once the number of options is known, we can read N×2 integers to get the lengths of the key/value couple of each option. Then we can proceed to reading all the options.

Note that this forces all mandatory options to be read before the advisory ones.
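A round-trip sketch of this layout (big-endian integers assumed; keys and values are raw bytes, passed as lists of pairs with mandatory options first):

```python
import struct

def encode_part_options(mandatory, advisory):
    """Pack two lists of (key, value) byte pairs, mandatory first."""
    items = list(mandatory) + list(advisory)
    chunks = [struct.pack('>II', len(mandatory), len(advisory))]
    chunks += [struct.pack('>II', len(k), len(v)) for k, v in items]
    chunks += [k + v for k, v in items]
    return b''.join(chunks)

def decode_part_options(data):
    nmand, nadv = struct.unpack_from('>II', data, 0)
    total, offset = nmand + nadv, 8
    # N×2 integers: the (key, value) lengths of each option.
    sizes = struct.unpack_from('>%dI' % (2 * total), data, offset)
    offset += 8 * total
    options = []
    for i in range(total):
        klen, vlen = sizes[2 * i], sizes[2 * i + 1]
        options.append((data[offset:offset + klen],
                        data[offset + klen:offset + klen + vlen]))
        offset += klen + vlen
    return options[:nmand], options[nmand:]
```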

=== Part Data ===

==== Parts Data: Stream Mode ====

In stream mode, data will be read as `<sizeofchunk><chunk>` until an empty chunk is found.

The size of each chunk is encoded in a 32 bits integer.

There is no constraint on the chunk size, but the bundler REALLY SHOULD NOT use 1 Byte long chunks, as that would be very inefficient. The bundler MAY WISH TO stick to a stable and sensible chunk size such as the 4096 Bytes used elsewhere in the code base.
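A stream-mode payload can be drained with a tiny generator (32-bit big-endian chunk sizes assumed):

```python
import io
import struct

def iter_stream_chunks(stream):
    """Yield the chunks of a stream-mode part until the empty chunk."""
    while True:
        (size,) = struct.unpack('>I', stream.read(4))
        if size == 0:        # empty chunk: end of this part's data
            return
        yield stream.read(size)
```

`b''.join(iter_stream_chunks(stream))` then reassembles the whole part payload.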


==== Parts data: Plain Mode ====

In plain mode we know the total size of the data, so we can just read until we reach that amount of data.


=== End Of Bundle Marker ===

The end of the bundle is marked by an "empty part" with a 0-size header.

== Summary of the general structure ==

(the bundle2 format WOULD PROBABLY start with a fixed invalid//empty HG10 bundle)

 - main header
   - bundle version (unsigned Byte)
   - main options:
     - size of main options (unsigned 32 bits integer)
     - main options (text)

 - part: (any number of them)
   - size of header (32bits integer)
   - header:
     - size of type (Byte)
     - part type (string (up to 255 char))
     - part mode (Byte)
     - data size (when applicable)
     - options: (see other section)
       - number of mandatory options (Byte)
       - number of advisory options (Byte)
       - pair of options size (sequence of 32bits integer couple)
       - options themselves
   - data (Bytes (plenty of them))

 - end of bundle marker (empty part)
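Putting the summary together, a minimal walker for the container could look as follows. This is only a sketch: big-endian integers are an assumption, part headers are yielded undecoded, and the probable leading HG10 stub is not handled:

```python
import io
import struct

def walk_bundle2(stream):
    """Yield ('version', n), ('options', text), then raw part headers."""
    (version,) = struct.unpack('>B', stream.read(1))
    if version != 0:
        # Unknown version: panic abort, nothing further can be read.
        raise IOError('unknown bundle2 version: %d' % version)
    yield ('version', version)
    (opt_size,) = struct.unpack('>I', stream.read(4))
    yield ('options', stream.read(opt_size).decode('ascii'))
    while True:
        (header_size,) = struct.unpack('>I', stream.read(4))
        if header_size == 0:             # empty part: end of bundle
            return
        # Header decoding (type, mode, options) is described above.
        yield ('part', stream.read(header_size))
```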


== New type of Part ==
=== Changesets exchange ===
=== New header ===
    deltaParent [lDeltaParent]byte
== Testing bundle2 ==
bundle2 can be enabled by setting the following hgrc option:
{{{
[experimental]
bundle2-exp = True
}}}
CategoryOldFeatures CategoryInternals

Note:

This page is primarily intended for developers of Mercurial.

Note:

This page is no longer relevant but is kept for historical purposes.


BundleFormat2

This page describes the current plan to get a more modern and complete bundle format. (for old content of this page check BundleFormatHG19)

(current content is copy pasted from 2.9 sprint note)

1. Why a New bundle format?

  • lightweight
  • new manifest
  • general delta
  • bookmarks
  • phase boundaries
  • obsolete markers
  • >sha1 support

  • pushkey
  • extensible for new features (required and optional)
  • progress information
  • resumable?
  • transaction commit markers?
  • recursive (to be able to bundle subrepos)

It's possible to envision a format that sends a change, its manifest, and filenodes in each chunk rather than sending all changesets, then all manifests, etc. capabilities

2. Changes in current command

2.1. Push Orchestration

2.1.1. Current situation

  • push:
    • changesets:
      • discovery
      • validation
      • actual push
    • phase:
      • discovery
      • pull
      • push
    • obsolescence
      • discovery
      • push
    • bookmark
      • discovery
      • push

2.1.2. Aimed orchestration

  • push:
    • discovery:
      • changesets
      • phase
      • obs
      • bookmark
    • post-discovery action:
      • current use case: move the phase of common changesets seen as public.
    • local validation:
      • (much easier with everything in hand)
      • complains about:
        • multiple heads
        • new branch
        • troubled changesets
        • divergent bookmarks
        • missing subrepo revisions
        • Rent in Manhattan
        • etc…
    • push:
      • (using a multipart bundle when possible)
        • The one and single remote-side transaction happens here
    • (post-push) pull:
      • The server sends back its own multipart bundle to the client
        • (The server would be able to reply with a multipart bundle, to inform the client of potential phase/bookmark/changeset rewrites, etc…)

2.1.3. post-push pull

If we let the protocol send arbitrary data to the server, we need the server to be able to send back arbitrary data too.

The idea is to use the very same top-level format. It could contain any kind of thing the client has advertised that it understands. This last phase is advisory, so the client can totally decide to ignore its content.

Possible use cases are:

  • sending standard output back
  • sending standard error back
  • notification that a changeset was made public on push
  • notification of partially accepted changeset
  • notification of automatic bookmark move on the server
  • test case result (or test run key)
  • Automatic shipment of Pony to contributor address
  • … (Possibility are endless)

2.2. Changes in Pull

Same kind of stuff will happen but pull is much simpler. (I'm not worried at all about it). May efficiently pull subrepo revisions.

2.3. Change in Bundle/Unbundle

Unbundle would learn to unbundle both formats.

Maybe we can have the new bundle format start with an invalid entry to prevent an old unbundle from trying to import it.

bundle should be able to produce the new bundle format. It probably cannot do so by default for a long time, however :-/

We could also do a "recursive bundle" in the presence of subrepos. A bundle could contain parts that are bundles of the subrepo revisions referenced by the revisions contained in the main bundle.

3. Top level Bundle

3.1. content

On the remote side, the server will need to redo the validation that was done on the local side, to ensure that nothing interesting happened between discovery and push. We need to send appropriate data to the remote for validation. This implies either arguments in the command data, or a dedicated section in the bundle. The dedicated section seems the way to go, as it feels more flexible: we do not know what kind of data will be monitored and sent, so we cannot build a sensible set of arguments doing the job. With a dedicated section in the multi-part bundle, we can make this section evolve over time to match the evolution of the data we send to the server.

3.2. foreseen sections

Here are the ideas we already have about sections:

  • HG10 (old changeset bundle format)
  • HG19 (new changeset bundle with support for modern stuff)
  • pushkey data (phase, bookmarks)
  • obsolescence markers (format 1 and upcoming format 2 ?)
  • client capabilities (to be used for the reply multi part bundle)
  • presence of subrepo bundles

4. Format of the Bundle2 Container

The latest description of the binary format can be found as comment in the Mercurial source code. This is the source of truth.

4.1. Examples of top-level parameters

These are examples, **not actual proposals of final parameters**. Some of them are actually very clowny.

4.1.1. Mandatory options

  • Set a new format of part headers:
    • PARTVERSION=1

  • Have the payload use a special compression algorithm
    • COMPRESSION=DOGEZIP

  • Set encoding of string in part-header to GOST13052 (or EBCDIC if you insist)
    • PARTENCODING=GOST13052

  • Set integer format in part-header to middle-endian
    • ENDIANESS=PDP11

4.1.2. Example advisory options

  • ask for debug level output in the reply
    • debug=1

  • inform of total number of parts:
    • nbparts=42

  • inform of total size of the bundle:
    • totalsize=1337

4.1.3. Example of -invalid- options

  • List of known heads (use a part for that)
  • username and/or credential (use a part for that)

5. New type of Part

5.1. Changesets exchange

5.2. New header

type Header struct {
    length       uint32
    lNode        byte
    node         [lNode]byte

    // if empty (lP1 ==0) then default to previous node in the stream
    lP1          byte
    p1           [lP1]byte

    // if empty, nullrev
    lP2          byte
    p2           [lP2]byte

    // if empty, self (for changelogs)
    lLinknode    byte
    linknode     [lLinknode]byte

    // if empty, p1
    lDeltaParent byte
    deltaParent  [lDeltaParent]byte
}

We'll modify the existing changegroup type so it can pretend to be a new changegroup that just has a variety of empty fields. Progress information fields might be optional.
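The length-prefixed convention of this header is easy to emit. The sketch below is hypothetical — the struct above does not say whether `length` covers itself or which byte order the integers use, so both are assumptions here — with empty fields keeping their documented default meanings:

```python
import struct

def encode_rev_header(node, p1=b'', p2=b'', linknode=b'', deltaparent=b''):
    """Pack one revision header: each field is a 1-byte length + bytes.

    An empty field keeps its documented default meaning: previous node
    in the stream for p1, nullrev for p2, self for linknode, p1 for the
    delta parent.
    """
    body = b''
    for field in (node, p1, p2, linknode, deltaparent):
        body += struct.pack('>B', len(field)) + field
    # 'length' is assumed to be the total size including itself.
    return struct.pack('>I', 4 + len(body)) + body
```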

6. Testing bundle2

bundle2 can be enabled by setting the following hgrc option:

[experimental]
bundle2-exp = True


CategoryOldFeatures CategoryInternals

BundleFormat2 (last edited 2018-02-10 00:05:58 by AviKelman)