Differences between revisions 22 and 23

Note:

This page is primarily intended for developers of Mercurial.

This page describes the current plan to get a more modern and complete bundle format. (For the old content of this page, see BundleFormatHG19.)

Contents

Why a New bundle format ?
Changes in current command
Top level Bundle
1. content
2. foreseen sections
Format of the Bundle2 Container
New type of Part
1. Changesets exchange
2. New header

(current content is copy pasted from 2.9 sprint note)

Why a New bundle format ?

lightweight
new manifest
general delta
bookmarks
phase boundaries
obsolete markers
>sha1 support
pushkey
extensible for new features (required and optional)
progress information
resumable?
transaction commit markers?
recursive (to be able to bundle subrepos)

It's possible to envision a format that sends a change, its manifest, and filenodes in each chunk rather than sending all changesets, then all manifests, etc. capabilities

Changes in current command

Push Orchestration

Current situation

push:
- changesets:
  - discovery
  - validation
  - actual push
- phase:
  - discovery
  - pull
  - push
- obsolescence
  - discovery
  - push
- bookmark
  - discovery
  - push

Aimed orchestration

* push:

discovery:
- changesets
- phase
- obs
- bookmark
post-discovery action:
- current use case: move the phase of common changesets seen as public.
local-validation:
- (much easier with everything in hand)
- complains about:
  - multiple heads
  - new branch
  - troubled changesets
  - divergent bookmark
  - missing subrepo revisions
  - Rent in Manhattan
  - etc…
push:
- (using multipart-bundle when possible)
  - The one and only remote-side transaction happens here
(post-push) pull:
- The server sends back its own multipart-bundle to the client
  - (The server would be able to reply with a multipart-bundle, to inform the client of potential phase/bookmark/changeset rewrites, etc.)

post-push pull

If we let the protocol send arbitrary data to the server, we need the server to be able to send back arbitrary data too.

The idea is to use the very same top-level format. It could contain anything the client has advertised that it understands. This last phase is advisory: the client can decide to ignore its content entirely.

Possible use cases are:

sending standard output back
sending standard error back
notification that a changeset was made public on push
notification of partially accepted changeset
notification of automatic bookmark move on the server
test case result (or test run key)
Automatic shipment of Pony to contributor address
… (Possibilities are endless)

Changes in Pull

The same kind of changes will happen, but pull is much simpler. (I'm not worried at all about it.) It may also allow pulling subrepo revisions efficiently.

Change in Bundle/Unbundle

Unbundle would learn to unbundle both formats.

Maybe we can have the new bundle format start with an invalid entry to prevent old unbundle implementations from trying to import it.

Bundle should be able to produce the new bundle format. It probably cannot do so by default for a long time, however :-/

We could also do a "recursive bundle" in the presence of subrepos. A bundle could contain parts that are bundles of the subrepo revisions referenced by the revisions contained in the main bundle.

Top level Bundle

content

On the remote side, the server will need to redo the validation that was done on the local side, to ensure that nothing interesting happened between discovery and push. We need to send appropriate data to the remote for validation. This implies either arguments in the command data or a dedicated section in the bundle. The dedicated section seems the way to go as it feels more flexible. We do not know what kind of data will be monitored and sent, so we cannot build a sensible set of arguments to do the job. With a dedicated section in the multi-part bundle, we can make this section evolve over time to match the evolution of the data we send to the server.

foreseen sections

Here are the ideas we already have about sections:

HG10 (old changeset bundle format)
HG19 (new changeset bundle with support for modern stuff)
pushkey data (phase, bookmarks)
obsolescence markers (format 1 and upcoming format 2 ?)
client capabilities (to be used for the reply multi-part bundle)
presence of subrepo bundles

Format of the Bundle2 Container

Goal

The goal of bundle2 is to act as an atomic packet to transmit a set of payloads in an application-agnostic way. It consists of a sequence of "parts" that will be handed to and processed by the application layer.

A bundle2 can be read in a single pass from a stream.

A bundle2 starts with a small header followed by a sequence of parts. Parts have a header of their own.

Main Header

This header contains information about the application agnostic bundle.

It is encoded as such:

Magic string 'HG20'
stream parameter:
- size of main stream parameters (unsigned 16 bits integer)
- main stream parameter (text)

Unbundling MUST abort when an unknown magic string is met.

Note that aborts caused by an unknown magic string are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.
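As an illustration of the main header layout described above, here is a minimal Python sketch. The page does not state the byte order of the 16-bit size, so big-endian is an assumption; the names `write_main_header`, `read_main_header` and `BundleAbort` are hypothetical and not taken from Mercurial.

```python
import io
import struct

MAGIC = b"HG20"


class BundleAbort(Exception):
    """Raised when the stream can no longer be trusted (panic abort)."""


def read_exact(stream, n):
    """Read exactly n bytes or abort."""
    data = stream.read(n)
    if len(data) != n:
        raise BundleAbort("truncated stream")
    return data


def write_main_header(out, params_text=b""):
    """Write magic string, 16-bit parameter size, then the parameters."""
    if len(params_text) > 0xFFFF:
        raise ValueError("stream parameters do not fit in 16 bits")
    out.write(MAGIC)
    # ASSUMPTION: big-endian unsigned 16-bit integer.
    out.write(struct.pack(">H", len(params_text)))
    out.write(params_text)


def read_main_header(stream):
    """Return the raw stream parameter text, aborting on unknown magic."""
    magic = read_exact(stream, len(MAGIC))
    if magic != MAGIC:
        # Nothing after this point can be interpreted safely.
        raise BundleAbort("unknown magic string %r" % magic)
    (size,) = struct.unpack(">H", read_exact(stream, 2))
    return read_exact(stream, size)
```

The reader never tries to resynchronise after a bad magic string; it raises, and the caller is expected to drop the whole channel.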

Stream Options

First comes a 16-bit integer. It is the size in bytes of the parameters themselves.

This is the size of the data in the main header. Anyone needing more than 64k of parameters can expect to run into other trouble first.

The main header data is the list of parameters that alter the behavior of the top-level bundle. They are intended only to control extraction of the payload parts. They are -not- intended for any change in the application-level understanding of the payload. The parameters are formatted as a space-separated list of entries. Each entry has the form <name>[=<value>]. Both name and value are urlquoted. The entry name MUST start with a letter. Entries whose name starts with a capital letter are mandatory: the unbundling process MUST abort if an unknown mandatory parameter is encountered. Entries whose name starts with a lower case letter may be safely ignored when unknown.

Note that the first piece sent is the size of the parameters section, so the parameters themselves cannot be streamed. This is one more reason why you should not plan to store huge data in the stream parameters.

Note also that aborts caused by an unknown parameter are nasty, as we do not know how much data remains to be read. This MUST result in a full-scale panic abort that invalidates the whole communication channel.
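The parameter rules above can be sketched as follows. This is only an illustration: it assumes "urlquoted" means percent-encoding as done by Python's `urllib.parse.quote`, and that known parameters are matched by exact name. The function names and `UnknownMandatoryParameter` are hypothetical.

```python
from urllib.parse import quote, unquote


class UnknownMandatoryParameter(Exception):
    """An unknown parameter with a capital first letter was met."""


def encode_stream_params(params):
    """Encode a list of (name, value-or-None) pairs as parameter text."""
    entries = []
    for name, value in params:
        entry = quote(name)
        if value is not None:
            entry += "=" + quote(value)
        entries.append(entry)
    return " ".join(entries).encode("ascii")


def decode_stream_params(blob, known):
    """Decode parameter text, keeping only parameters listed in `known`.

    Unknown lower case parameters are ignored; unknown capitalised
    parameters abort the whole process.
    """
    result = {}
    for entry in blob.decode("ascii").split(" "):
        if not entry:
            continue
        name, sep, value = entry.partition("=")
        name = unquote(name)
        value = unquote(value) if sep else None
        if not name[:1].isalpha():
            raise ValueError("parameter name must start with a letter")
        if name in known:
            result[name] = value
        elif name[0].isupper():
            raise UnknownMandatoryParameter(name)
        # else: advisory and unknown, safely ignored
    return result
```

Because spaces and `=` inside names and values are percent-encoded, splitting on a space and on the first `=` is enough to recover each entry.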

Examples of valid stream option

These are examples, **not actual proposals for final parameters**. Some of them are actually very clowny.

Set a new format of part headers:
- PARTVERSION=1
Have the payload use a special compression algorithm
- COMPRESSION=DOGEZIP
Set encoding of string in part-header to GOST13052 (or EBCDIC if you insist)
- PARTENCODING=GOST13052
Set integer format in part-header to middle-endian
- ENDIANESS=PDP11

Example of -possibly- valid main option

ask for debug level output in the reply
- debug
inform of total number of parts:
- nbparts=42
inform of total size of the bundle:
- totalsize=1337

Example of -invalid- main option

List of known heads (use a part for that)
username and/or credential (use a part for that)

Parts

Parts convey the application level payload of the bundle. They are handled by the application layer during the unbundle process.

A part consists of three elements: type, parameters and data.

Type is a simple alphanumerical identifier that lets the application level know what kind of data the part contains and route it to the applicable processor.

- Types with a capital first letter are mandatory and MUST be processed by the server. If the server does not know how to handle such a type, it MUST abort the unbundle process.
- Types with a lower case first letter are advisory and CAN be disregarded during the unbundle process.

Parameters are a set of keys and values that may change the way the data from this part will be processed. Some of them may be mandatory, others advisory.

Data are the actual payload of the part.

Parts Header

size of header (16bits integer)
header:
- size of type (Byte)
- part type (string (up to 255 char))
- parameters: (see other section)

Note that the first entry is the full size of the header, so the header can't be streamed and one should not plan to put massive data in the header itself. (That is what part data is meant for.)

The type is an alphanumerical string of arbitrary size (<256) that will be used to find the application-level handler that processes the data payload. It follows the upper/lower case rules explained in the previous section. Note that routing should be case insensitive: the lower case and upper case versions of the same type MUST be handled by the same code. The case only matters when no handler is found for a given type.
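A minimal sketch of the routing rule, assuming handlers live in a plain dictionary keyed by the lower-cased type. The `handler` decorator, `dispatch`, `UnknownMandatoryPart` and the `changegroup` type name are all hypothetical illustrations.

```python
class UnknownMandatoryPart(Exception):
    """A mandatory part type has no registered handler."""


HANDLERS = {}


def handler(parttype):
    """Register a function as the handler for a part type."""
    def register(func):
        # Routing is case insensitive: store under the lower case name.
        HANDLERS[parttype.lower()] = func
        return func
    return register


def dispatch(parttype, params, data):
    """Route a part to its handler, honouring the mandatory/advisory rule."""
    func = HANDLERS.get(parttype.lower())
    if func is not None:
        return func(params, data)
    # The case of the first letter only matters when no handler exists.
    if parttype[:1].isupper():
        raise UnknownMandatoryPart(parttype)
    return None  # advisory part with no handler: skipped


@handler("changegroup")
def handle_changegroup(params, data):
    return ("changegroup", len(data))
```

With this layout `dispatch("Changegroup", ...)` and `dispatch("changegroup", ...)` reach the same code, while an unregistered `"Pony"` aborts and an unregistered `"pony"` is silently skipped.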

Parts Options

Part parameters are able to carry arbitrary bytes. Their encoding is therefore more complicated than that of the stream parameters.

number of mandatory parameters (Byte)
number of advisory parameters (Byte)
pair of parameters size (sequence of Byte couple)
parameters themselves

First come the numbers of mandatory and advisory parameters. Once the total number of parameters N is known, we can read N×2 bytes to get the lengths of the key and value of each parameter. Then we can proceed to read all the parameters.

Note that this forces all mandatory parameters to be read before the advisory ones.
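The part parameter layout can be sketched as below. This assumes that keys and values are stored back to back in the same order as their size pairs, which the text implies but does not spell out; `encode_part_params` and `decode_part_params` are hypothetical names.

```python
import struct


def encode_part_params(mandatory, advisory):
    """Encode two lists of (key, value) byte-string pairs."""
    pairs = list(mandatory) + list(advisory)
    # Counts and sizes are single bytes; struct raises if a value exceeds 255.
    out = [struct.pack("BB", len(mandatory), len(advisory))]
    for key, value in pairs:
        out.append(struct.pack("BB", len(key), len(value)))
    for key, value in pairs:
        out.append(key)
        out.append(value)
    return b"".join(out)


def decode_part_params(blob):
    """Return (mandatory, advisory) lists of (key, value) pairs."""
    nmand, nadv = struct.unpack_from("BB", blob, 0)
    total = nmand + nadv
    # N x 2 size bytes follow the two counters.
    sizes = struct.unpack_from("B" * (2 * total), blob, 2)
    offset = 2 + 2 * total
    pairs = []
    for i in range(total):
        klen, vlen = sizes[2 * i], sizes[2 * i + 1]
        key = blob[offset:offset + klen]
        offset += klen
        value = blob[offset:offset + vlen]
        offset += vlen
        pairs.append((key, value))
    # Mandatory parameters always come first.
    return pairs[:nmand], pairs[nmand:]
```

Since every length is known before any key or value is read, values may contain arbitrary bytes, including spaces and `=`, without any quoting.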

Part Data:

Parts Data: Stream Mode

In stream mode, data will be read as <sizeofchunk><chunk> until an empty chunk is found.

The size of each chunk is encoded as a 32-bit integer.

There is no constraint on the chunk size, but the bundler REALLY SHOULD NOT use 1-byte chunks as that would be very inefficient. The bundler MAY WISH TO stick to a stable and sensible chunk size, such as the 4096 bytes used elsewhere in the code base.
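Stream mode framing can be sketched as follows, assuming a big-endian unsigned 32-bit size (the byte order and signedness are not stated on this page). `write_chunks` and `read_chunks` are hypothetical names.

```python
import io
import struct


def write_chunks(out, data, chunksize=4096):
    """Write data as <sizeofchunk><chunk> frames, then an empty chunk."""
    for start in range(0, len(data), chunksize):
        chunk = data[start:start + chunksize]
        out.write(struct.pack(">I", len(chunk)))
        out.write(chunk)
    # A zero size marks the end of the part data.
    out.write(struct.pack(">I", 0))


def read_chunks(stream):
    """Yield chunks until the empty chunk is found."""
    while True:
        header = stream.read(4)
        if len(header) != 4:
            raise EOFError("truncated chunk size")
        (size,) = struct.unpack(">I", header)
        if size == 0:
            return
        chunk = stream.read(size)
        if len(chunk) != size:
            raise EOFError("truncated chunk")
        yield chunk
```

The reader is a generator, so a consumer can process each chunk as it arrives without knowing the total size of the data in advance.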

Parts data: Plain Mode

In plain mode we know the total size of the data, so we can just read until we reach that amount of data.

End Of Bundle Marker

The end of the bundle is marked by an "empty part" with a 0-size header.

Summary of the general structure

(the bundle2 format WOULD PROBABLY start with a fixed invalid//empty HG10 bundle)

main header
- bundle version (unsigned Byte)
- main parameters:
  - size of main parameters (unsigned 16 bits integer)
  - main parameters (text)
part: (any number of them)
- size of header (16bits integer)
- header:
  - size of type (Byte)
  - part type (string (up to 255 char))
  - parameters: (see other section)
    - number of mandatory parameters (Byte)
    - number of advisory parameters (Byte)
    - pair of parameters size (sequence of Byte couple)
    - parameters themselves
- data (Bytes (plenty of them))
empty part (acts as the end of bundle marker)

New type of Part

Changesets exchange

New header

type Header struct {
    length       uint32
    lNode        byte
    node         [lNode]byte

    // if empty (lP1 ==0) then default to previous node in the stream
    lP1          byte
    p1           [lP1]byte

    // if empty, nullrev
    lP2          byte
    p2           [lP2]byte

    // if empty, self (for changelogs)
    lLinknode    byte
    linknode     [lLinknode]byte

    // if empty, p1
    lDeltaParent byte
    deltaParent  [lDeltaParent]byte 
}

We'll modify the existing changegroup type so it can pretend to be a new changegroup that just has a variety of empty fields. Progress information fields might be optional.
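To make the defaulting rules of the Header struct above concrete, here is a rough Python sketch of a reader. Several points are assumptions: the `length` field is read as a big-endian 32-bit integer and returned without interpretation, and an empty p2 is represented as a run of zero bytes the same length as the node. `read_delta_header` is a hypothetical name.

```python
import io
import struct


def _read_field(stream):
    """Read one <length byte><bytes> field."""
    (size,) = struct.unpack("B", stream.read(1))
    return stream.read(size)


def read_delta_header(stream, prevnode):
    """Parse one header, filling empty fields with their defaults."""
    # ASSUMPTION: big-endian; what `length` covers is not specified here.
    (length,) = struct.unpack(">I", stream.read(4))
    node = _read_field(stream)
    # Empty p1 defaults to the previous node in the stream.
    p1 = _read_field(stream) or prevnode
    # Empty p2 means null; ASSUMPTION: null is all zero bytes.
    p2 = _read_field(stream) or b"\x00" * len(node)
    # Empty linknode defaults to the node itself (changelog case).
    linknode = _read_field(stream) or node
    # Empty delta parent defaults to p1.
    deltaparent = _read_field(stream) or p1
    return {
        "length": length,
        "node": node,
        "p1": p1,
        "p2": p2,
        "linknode": linknode,
        "deltaparent": deltaparent,
    }
```

For a linear changelog, every field except the node can be empty, so each header costs only four zero bytes on top of the length and the node.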

CategoryNewFeatures

-  ⇤ ← Revision 22 as of 2014-03-19 23:54:57 → 
  Size: 13243
  Editor: AngelEzquerra
  Comment: Added comments about using "recursive bundles" to bundle subrepos with main bundle
+   ← Revision 23 as of 2014-03-20 20:08:18 → ⇥
  Size: 12211
  Editor: Pierre-YvesDavid
  Comment: update with new idea
-Deletions are marked like this.
+Additions are marked like this.
 Line 155:
- * bundle version (unsigned Byte)
 * main options:
   * size of main options (unsigned 32 bits integer)
   * main options (text)


Bundle version is a single unsigned Bytes used to know the version of bundle2
format used in this stream.  It will be incremented when major upgrade of the
protocol happen

Current value is `0`.

unbundling MUST abort when an unknown protocol version is meet.

Note that abort from unknown protocol version are nasty as we do not know how much data
remains to be read. This MUST result in a full scale panic abort that
+ * Magic string 'HG20
 * stream parameter:
   * size of main stream parameters (unsigned 16 bits integer)
   * main stream parameter (text)


unbundling MUST abort when an unknown Magic string is met

Note that abort from unknown magic string are nasty as we do not know how much
data remains to be read. This MUST result in a full scale panic abort that
-Line 173:
+Line 167:
-=== Main Options ===

First come a 32 bits integer. Its the size in Bytes of the options themselves

The size of data in the main header. If people need more than 4GB of
options I expect them to be run in other troubles before.

The main header data are the list of options at alter the behavior of the top
level bundle. This is intended only to control extraction of the payload from
it.  This is -not- intended for any changes in the application level
understanding of the payload. The list of option follow this rules

  * Content is pure alphanumerical list of option,
  * The list is space separated,
  * Option MUST start with a letter,
  * upper case option are mandatory,
  * lower case option MAY be disregarded,
  * mixed case options SHOULD never happen but will be interpreted as upper case,
  * option may be valuated using the forms optionname=value,
  * value are alphanumerical + `.`,

  * to summarise option are a space separated list of thing matching:
    `[A-Za-z][A-Za-z0-9]+(=[A-Za-z0-9.]+)`

Note that the first piece send is the size of the options section. So options
themselves cannot be stream. This is one more reason why you should not intend
to store huge data in main-option.
+=== Stream Options ===

First come a 16 bits integer. Its the size in Bytes of the parameters themselves

The size of data in the main header. If people need more than 64k of
parameters I expect them to be run in other troubles before.

The main header data are the list of parameters that alter the behavior of the
top level bundle. This is intended only to control extraction of the payload
part.  This is -not- intended for any changes in the application level
understanding of the payload. The parameters are formated as space separated
list of entry. Each entry is in the form `<name>[=<value>]`. both name and value
are urlquoted. The entry name MUST start with a letter. Those with an capital
first letter will are mandatory, the unbundling process MUST abort is an unknown
mandatory parameter is encountered. Those with a lower case first letter may be
safely ignored when unknown.

Note that the first piece send is the size of the parameters section. So
parameters themselves cannot be stream. This is one more reason why you should
not intend to store huge data in main-option.
-Line 205:
+Line 192:
-==== Examples of valid main option ====
+==== Examples of valid stream option ====


Those are example **not actual proposal of final parameters**. Some of them are
actually very clowny.
-Line 249:
+Line 240:
-A parts consist in three elements: type, options and data.
+A parts consist in three elements: type, parameters and data.
-Line 254:
+Line 245:
- * upper case type are mandatory and MUST be processed by the server. If the
+ * Capital first letter type are mandatory and MUST be processed by the server. If the
-Line 258:
+Line 249:
- * lower case type are advisory and CAN be disregarded during the unbundle process.

 * mixed case SHOULD not appear and WILL be interpreted as upper case one.
+ * lower case first letter type are advisory and CAN be disregarded during the unbundle process.
-Line 271:
+Line 260:
- * size of header (32bits integer)
+ * size of header (16bits integer)
-Line 275:
+Line 264:
-   * part mode (Byte)
   * data size (when applicable)
   * options: (see other section)
+   * parameters: (see other section)
-Line 291:
+Line 278:
-The mode is an enum that define how the data can be retrieved from the part.
The unbundle process MUST be aborted is an unknown mode is meet.  The two
foreseen mode are now:

 * stream (0x0): total size of the data are yet unknown. See dedicated section
   below for details.

 * plan (0x1): we know the total size of data and they will be available
   directly after the header. The total size of data is encoded in a 32bits
   integer right after the mode file.

Note that abort from unknown mode are nasty as we do not know how much data
remains to be read. This MUST result in a full scale panic abort that
invalidate the whole communication channel.
-Line 308:
+Line 281:
-Parts options are able to carry arbitrary bytes. Their encoding is therefor more
complicated than the main-options.

 * number of mandatory options (32 bits integer)
 * number of advisory options (32 bits integer)
 * pair of options size (sequence of 32bits integer couple)
 * options themselves

First is the number of mandatory and advisory option. Once the number of
options is known, we can read Nx2 number of integer to get the len of the key,
value couple of each option. Then we can proceed to reading all the options.

Note that this force all mandatory option to be read before the advisory one.
+Parts parameters are able to carry arbitrary bytes. Their encoding is therefor more
complicated than the stream parameters.

 * number of mandatory parameters (Byte)
 * number of advisory parameters (Byte)
 * pair of parameters size (sequence of Byte couple)
 * parameters themselves

First is the number of mandatory and advisory parameters. Once the number of
parameters is known, we can read Nx2 number of Bytes to get the len of the key,
value couple of each parameters. Then we can proceed to reading all the
parameters.

Note that this force all mandatory parameter to be read before the advisory one.
-Line 351:
+Line 325:
-   * main options:
     * size of main options (unsigned 32 bits integer)
     * main options (text)
+   * main parameters:
     * size of main parameters (unsigned 16 bits integer)
     * main parameters (text)
-Line 356:
+Line 330:
-   * size of header (32bits integer)
+   * size of header (16bits integer)
-Line 360:
+Line 334:
-     * part mode (Byte)
     * data size (when if applicable)
     * options: (see other section)
       * number of mandatory options (Byte)
       * number of advisory options (Byte)
       * pair of options size (sequence of 32bits integer couple)
       * options themselves
+     * parameters: (see other section)
       * number of mandatory parameters (Byte)
       * number of advisory parameters (Byte)
       * pair of parameters size (sequence of Byte couple)
       * parameters themselves
-Line 369:
+Line 341:
- * end of bundle marker (empty part)
+ * empty part (act a end of bundle marker)

Diff for "BundleFormat2"