Differences between revisions 25 and 34 (spanning 9 versions)
Revision 25 as of 2014-03-27 17:49:48
Size: 12080
Comment:
Revision 34 as of 2018-02-10 00:05:58
Size: 2056
Editor: AviKelman
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
<<Include(A:dev)>> {{{#!wiki caution
Line 3: Line 3:
= BundleFormat2 = This information was derived by reverse engineering. Some details may be incomplete. Hopefully someone with intimate familiarity with the code can improve it.}}}
Line 5: Line 5:
This page describes the current plan to get a more modern and complete bundle format. (for old content of this page check [[BundleFormatHG19]]) The v2 bundle file format is in practice quite similar to v1 (see BundleFormat), in that it comprises a file header followed by a changegroup, but it differs in a few significant ways.
Line 7: Line 7:
<<TableOfContents>> == Practical differences from v1 bundles ==
 * The file has a more verbose multi-stage ASCII header containing key:value pairs. (more below)
 * Zstandard compression (the new default) is also supported.
 * Uses version 2 deltagroup headers instead of version 1. (see the spec at [[Topic:internals.changegroups|help internals.changegroups]])
 * Everything after the header is shredded into N-byte chunks after it is assembled (N is a parameter defined in the source code).
Line 9: Line 13:
(current content is copy pasted from 2.9 sprint note) == Reading the header ==
Line 11: Line 15:
== Why a New bundle format? == === stage 1 ===
|| 'HG20' || Compression Chunk || rest of file ||
Line 13: Line 18:
 * lightweight
 * new manifest
 * general delta
 * bookmarks
 * phase boundaries
 * obsolete markers
 * >sha1 support
 * pushkey
 * extensible for new features (required and optional)
 * progress information
 * resumable?
 * transaction commit markers?
 * recursive (to be able to bundle subrepos)
Compression Chunk will be either null or contain the ASCII 'Compression=XX' where XX is a code indicating which decompression to use on the rest of the file.
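As an illustration of stage 1, here is a minimal Python sketch (not Mercurial's actual code; the 16-bit big-endian size prefix is an assumption based on the "size of main stream parameters" field described further down this page, and the 'ZS' codec code is likewise only illustrative):

```python
import io
import struct

def read_stage1(stream):
    """Read the stage-1 header: the magic string, then the parameter text.

    The parameter text is where a 'Compression=XX' entry would appear."""
    magic = stream.read(4)
    if magic != b'HG20':
        raise ValueError('not a bundle2 file: %r' % magic)
    # Assumed: the ASCII parameter text is preceded by its size as an
    # unsigned 16-bit big-endian integer.
    (size,) = struct.unpack('>H', stream.read(2))
    params = stream.read(size).decode('ascii')
    return params, stream

# A made-up bundle advertising a compression codec ('ZS' is illustrative).
bundle = io.BytesIO(b'HG20' + struct.pack('>H', 14) + b'Compression=ZS'
                    + b'<rest of file>')
params, rest = read_stage1(bundle)
```

Everything left in `rest` after stage 1 is what stage 2 then decompresses and reads.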
Line 27: Line 20:
It's possible to envision a format that sends a change, its manifest, and
filenodes in each chunk rather than sending all changesets, then all manifests,
etc. capabilities
=== stage 2 ===
|||| rest of file from stage 1 ||
|| Parameters Chunk || shredded changegroup (and possibly other sections?) ||
Line 31: Line 24:
== Changes in current command == Parameters Chunk contains (among possibly other things?) the fact that the file contains a changegroup ('\x0bCHANGEGROUP'), a null chunk, and then a complex nested sequence of two parameter categories. The nested sequence contains, first, indicators for how many key:value pairs are in the first category, followed by how many pairs are in the second category, followed by the length of an ASCII key, followed by the length of its ASCII value (repeated for all keys and values).
Line 33: Line 26:
=== Push Orchestration ===

==== Current situation ====

 * push:
   * changesets:
     * discovery
     * validation
     * actual push
   * phase:
     * discovery
     * pull
     * push
   * obsolescence
     * discovery
     * push
   * bookmark
     * discovery
     * push

==== Aimed orchestration ====

 * push:
   * discovery:
     * changesets
     * phase
     * obs
     * bookmark
   * post-discovery action:
     * current use case: move the phase of common changesets seen as public.
   * local validation:
     * (much easier with everything in hand)
     * complains about:
       * multiple heads
       * new branch
       * troubled changesets
       * divergent bookmarks
       * missing subrepo revisions
       * Rent in Manhattan
       * etc…
   * push:
     * (using a multipart bundle when possible)
       The one and only remote-side transaction happens here.
   * (post-push) pull:
     * The server sends its own multipart bundle back to the client
       (to inform the client of potential phase/bookmark/changeset rewrites, etc…)

==== post-push pull ====

If we let the protocol send arbitrary data to the server, we need the server to be able to send back arbitrary data too.

The idea is to use the very same top-level format. It could contain any kind of content the client has advertised it understands. This last phase is advisory, so the client can decide to ignore its content entirely.

Possible use cases are:

 * sending standard output back
 * sending standard error back
 * notification that a changeset was made public on push
 * notification of partially accepted changesets
 * notification of automatic bookmark moves on the server
 * test case result (or test run key)
 * Automatic shipment of Pony to contributor address
 * … (Possibilities are endless)

=== Changes in Pull ===

The same kind of thing will happen, but pull is much simpler. (I'm not worried about it at all.)
It may efficiently pull subrepo revisions.

=== Change in Bundle/Unbundle ===

Unbundle would learn to unbundle both formats.

Maybe we can have the new bundle format start with an invalid entry to prevent old unbundle implementations from trying to import it.

bundle should be able to produce the new format. It probably cannot do so by default for a long time, however :-/

We could also do a "recursive bundle" in the presence of subrepos. A bundle could contain parts that are bundles of the subrepo revisions referenced by the revisions contained in the main bundle.


== Top level Bundle ==

=== content ===

On the remote side, the server will need to redo the validation that was done on
the client side, to ensure that nothing interesting happened between discovery
and push. We need to send appropriate data to the remote for validation. This
implies either arguments in the command data or a dedicated section in the
bundle. The dedicated section seems the way to go, as it feels more flexible: we
do not know what kind of data will be monitored and sent, so we cannot build a
sensible set of arguments to do the job. With a dedicated section in the
multipart bundle, we can make this section evolve over time to match the
evolution of the data we send to the server.

=== foreseen sections ===

Here are the ideas we already have about sections:

 * HG10 (old changeset bundle format)
 * HG19 (new changeset bundle with support for modern stuff)
 * pushkey data (phase, bookmarks)
 * obsolescence markers (format 1 and upcoming format 2 ?)
 * client capabilities (to be used for the reply multipart bundle)
 * presence of subrepo bundles

== Format of the Bundle2 Container ==

=== Goal ===

The goal of bundle2 is to act as an atomic container to transmit a set of
payloads in an application-agnostic way. It consists of a sequence of "parts"
that are handed to and processed by the application layer.

A bundle2 can be read in a single pass from a stream.

A bundle2 starts with a small header, followed by a sequence of parts. Parts
have a header of their own.

=== Main Header ===

This header contains information about the application agnostic bundle.

It is encoded as such:

 * Magic string 'HG20'
 * stream parameter:
   * size of main stream parameters (unsigned 16 bits integer)
   * main stream parameter (text)


Unbundling MUST abort when an unknown magic string is encountered.

Note that aborting on an unknown magic string is nasty, as we do not know how much
data remains to be read. This MUST result in a full-scale panic abort that
invalidates the whole communication channel.

==== Stream Options ====

First comes a 16-bit integer: the size in bytes of the parameters themselves,
i.e. the size of the data in the main header. If people need more than 64k of
parameters, I expect they will have run into other troubles first.

The main header data is the list of parameters that alter the behavior of the
top-level bundle. It is intended only to control extraction of the payload
parts; it is -not- intended for any changes in the application-level
understanding of the payload. The parameters are formatted as a space-separated
list of entries. Each entry is of the form `<name>[=<value>]`; both name and value
are urlquoted. The entry name MUST start with a letter. Those with a capital
first letter are mandatory, and the unbundling process MUST abort if an unknown
mandatory parameter is encountered. Those with a lower-case first letter may be
safely ignored when unknown.

Note that the first piece sent is the size of the parameters section, so the
parameters themselves cannot be streamed. This is one more reason why you should
not plan to store huge data in the main options.

Note also that aborting on an unknown option is nasty, as we do not know how much
data remains to be read. This MUST result in a full-scale panic abort that
invalidates the whole communication channel.
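The parameter-text rules above can be sketched as a small parser (a hypothetical helper, not taken from the Mercurial source; `parse_stream_params` is an illustrative name):

```python
from urllib.parse import unquote

def parse_stream_params(text):
    """Split a space-separated '<name>[=<value>]' list into two dicts."""
    mandatory, advisory = {}, {}
    for entry in text.split(' '):
        if not entry:
            continue
        name, _, value = entry.partition('=')
        name, value = unquote(name), unquote(value)
        if not name or not name[0].isalpha():
            raise ValueError('parameter name must start with a letter')
        # Capital first letter: mandatory; lower case: safely ignorable.
        target = mandatory if name[0].isupper() else advisory
        target[name] = value
    return mandatory, advisory

# Using two of the example parameters from this page:
mandatory, advisory = parse_stream_params('COMPRESSION=DOGEZIP nbparts=42')
```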

===== Examples of valid stream options =====


These are examples, **not actual proposals of final parameters**. Some of them are
actually very clowny.

 * Set a new format of part headers:

   `PARTVERSION=1`

 * Have the payload use a special compression algorithm

   `COMPRESSION=DOGEZIP`

 * Set encoding of string in part-header to GOST13052 (or EBCDIC if you insist)

   `PARTENCODING=GOST13052`

 * Set integer format in part-header to middle-endian

   `ENDIANESS=PDP11`

===== Examples of -possibly- valid main options =====

 * ask for debug level output in the reply

   `debug`

 * inform of total number of parts:

   `nbparts=42`

 * inform of total size of the bundle:

   `totalsize=1337`

===== Examples of -invalid- main options =====

 * List of known heads (use a part for that)

 * username and/or credential (use a part for that)


=== Parts ===

Parts convey the application-level payload of the bundle. They are handled by
the application layer during the unbundle process.

A part consists of three elements: type, parameters, and data.

The type is a simple alphanumerical identifier that lets the application level know
what kind of data the part contains and route it to the applicable processors.

 * Types with a capital first letter are mandatory and MUST be processed by the server. If the
   server does not know how to handle an upper-case type it MUST abort the
   unbundle process.

 * Types with a lower-case first letter are advisory and CAN be disregarded during the unbundle process.


Options are a set of keys and values that may change the way the data from this
part is processed. Some of them may be mandatory, others advisory.

Data is the actual payload of the part.

==== Parts Header ====


 * size of header (16bits integer)
 * header:
   * size of type (Byte)
   * part type (string (up to 255 char))
   * parameters: (see other section)


Note that the first entry is the full size of the header, so the header can't be
streamed and one should not plan to put massive data in the header itself.
(That is what part data is meant for.)

The type is an alphanumerical string of arbitrary size (<256) that is used
to find the application-level handler that processes the data payload. It follows the
upper/lower-case rules explained in the previous section. Note that routing
should be case-insensitive: the lower-case and upper-case versions of the same
type MUST be handled by the same code. Case only matters when no handler
is found for a given type.


===== Parts Options =====

Part parameters are able to carry arbitrary bytes. Their encoding is therefore more
complicated than that of the stream parameters.

 * number of mandatory parameters (Byte)
 * number of advisory parameters (Byte)
 * pair of parameters size (sequence of Byte couple)
 * parameters themselves

First come the numbers of mandatory and advisory parameters. Once the number of
parameters is known, we can read N×2 bytes to get the lengths of the key/value
couple of each parameter. Then we can proceed to read all the parameters.

Note that this forces all mandatory parameters to be read before the advisory ones.
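This layout can be sketched as follows (a hypothetical helper, not the real implementation; one byte per count and per length is assumed, as described above):

```python
def parse_part_params(data, nmand, nadv):
    """Parse part parameters from `data`, which starts at the size pairs.

    nmand and nadv are the two leading count bytes, assumed already read."""
    total = nmand + nadv
    # N x 2 bytes: (key length, value length) for each parameter.
    sizes = [(data[2 * i], data[2 * i + 1]) for i in range(total)]
    pos = 2 * total
    pairs = []
    for klen, vlen in sizes:
        key = data[pos:pos + klen]
        value = data[pos + klen:pos + klen + vlen]
        pos += klen + vlen
        pairs.append((key, value))
    # Mandatory parameters come first, then the advisory ones.
    return pairs[:nmand], pairs[nmand:]

# One mandatory and one advisory parameter, as in the example elsewhere
# on this page: version=02 (mandatory) and nbchanges=7 (advisory).
example = b'\x07\x02\t\x01' + b'version' + b'02' + b'nbchanges' + b'7'
mandatory, advisory = parse_part_params(example, 1, 1)
```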

==== Part Data: ====


Data is read as `<sizeofchunk><chunk>` until an empty chunk is found.

The size of each chunk is encoded as a 32-bit integer.

There is no constraint on the chunk size, but the bundler REALLY SHOULD NOT
use 1-byte chunks, as that would be very inefficient. The bundler MAY WISH
TO stick to a stable and sensible chunk size, such as the 4096 bytes used
elsewhere in the code base.
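The chunk framing can be sketched as (hypothetical helper; a big-endian unsigned size is assumed):

```python
import io
import struct

def iter_part_data(stream):
    """Yield data chunks until the empty (zero-size) chunk is reached."""
    while True:
        (size,) = struct.unpack('>I', stream.read(4))
        if size == 0:
            return
        yield stream.read(size)

# Two chunks followed by the empty terminating chunk.
payload = io.BytesIO(struct.pack('>I', 3) + b'abc'
                     + struct.pack('>I', 2) + b'de'
                     + struct.pack('>I', 0))
chunks = list(iter_part_data(payload))
```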


==== End Of Bundle Marker ====

End of bundle is marked by an "empty part" with a 0-size header.

=== Summary of the general structure ===

(the bundle2 format WOULD PROBABLY start with a fixed invalid//empty HG10 bundle)

 * main header
   * bundle version (unsigned Byte)
   * main parameters:
     * size of main parameters (unsigned 16 bits integer)
     * main parameters (text)

 * part: (any number of them)
   * size of header (16bits integer)
   * header:
     * size of type (Byte)
     * part type (string (up to 255 char))
     * parameters: (see other section)
       * number of mandatory parameters (Byte)
       * number of advisory parameters (Byte)
       * pair of parameters size (sequence of Byte couple)
       * parameters themselves
   * data (Bytes (plenty of them))

 * empty part (acts as end-of-bundle marker)
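A toy writer for the structure summarised above (purely illustrative: field widths follow this summary, big-endian integers are assumed, and the real Mercurial implementation may differ in details):

```python
import io
import struct

def write_bundle2(out, stream_params, parts):
    """Write the summarised layout: magic, stream params, parts, end marker."""
    out.write(b'HG20')
    out.write(struct.pack('>H', len(stream_params)) + stream_params)
    for ptype, mandatory, advisory, data in parts:
        header = struct.pack('>B', len(ptype)) + ptype
        header += struct.pack('>BB', len(mandatory), len(advisory))
        for key, value in mandatory + advisory:
            header += struct.pack('>BB', len(key), len(value))
        for key, value in mandatory + advisory:
            header += key + value
        out.write(struct.pack('>H', len(header)) + header)
        out.write(struct.pack('>I', len(data)) + data)
        out.write(struct.pack('>I', 0))  # empty chunk ends the part data
    out.write(struct.pack('>H', 0))      # empty part ends the bundle

buf = io.BytesIO()
write_bundle2(buf, b'',
              [(b'CHANGEGROUP', [(b'version', b'02')], [], b'payload')])
raw = buf.getvalue()
```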


== New type of Part ==

=== Changesets exchange ===

=== New header ===

{{{
type Header struct {
    length uint32
    lNode byte
    node [lNode]byte

    // if empty (lP1 ==0) then default to previous node in the stream
    lP1 byte
    p1 [lP1]byte

    // if empty, nullrev
    lP2 byte
    p2 [lP2]byte

    // if empty, self (for changelogs)
    lLinknode byte
    linknode [lLinknode]byte

    // if empty, p1
    lDeltaParent byte
    deltaParent [lDeltaParent]byte
}
}}}
We'll modify the existing changegroup type so it can pretend to be a new changegroup that just has a variety of empty fields. Progress information fields might be optional.
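A sketch of reading the proposed header above (hypothetical: each field is assumed to be a one-byte length followed by that many bytes, after the 32-bit total length):

```python
import io
import struct

def read_field(stream):
    """Read one variable-length field: a one-byte length, then the bytes."""
    (length,) = struct.unpack('>B', stream.read(1))
    return stream.read(length)

def read_proposed_header(stream):
    (total,) = struct.unpack('>I', stream.read(4))  # overall length field
    return {name: read_field(stream)
            for name in ('node', 'p1', 'p2', 'linknode', 'deltaParent')}

# A 20-byte node followed by four empty fields (empty means "use the
# documented default": previous node, nullrev, self, p1 respectively).
fields = read_proposed_header(io.BytesIO(
    struct.pack('>I', 25) + b'\x14' + b'\x11' * 20 + b'\x00\x00\x00\x00'))
```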



----
CategoryNewFeatures
Example Parameters Chunk:
|| chunk length |||| description of contents || #section1 parameters || #section2 parameters || len(key1),len(value1) || len(key2),len(value2) || key1 || value1 || key2 || value2||
|| 4 bytes || \x0bCHANGEGROUP || 4 bytes null || \x01 || \x01 || \x07\x02 || \t\x01 || version || 02 || nbchanges || 7 ||
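Walking through the example row byte by byte (the byte values are taken straight from the table; the leading 4-byte chunk-length field is omitted here):

```python
raw = (b'\x0bCHANGEGROUP'    # length byte 0x0b, then the 11-byte section name
       b'\x00\x00\x00\x00'   # the null chunk
       b'\x01\x01'           # one parameter in each of the two categories
       b'\x07\x02\t\x01'     # len('version')=7, len('02')=2,
                             # len('nbchanges')=9 ('\t'), len('7')=1
       b'version02nbchanges7')

namelen = raw[0]
name = raw[1:1 + namelen]
pos = 1 + namelen + 4                   # skip past the null chunk
ncat1, ncat2 = raw[pos], raw[pos + 1]   # parameter counts per category
pos += 2
sizes = [(raw[pos + 2 * i], raw[pos + 2 * i + 1])
         for i in range(ncat1 + ncat2)]
pos += 2 * (ncat1 + ncat2)
pairs = []
for klen, vlen in sizes:
    pairs.append((raw[pos:pos + klen], raw[pos + klen:pos + klen + vlen]))
    pos += klen + vlen
```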
