Differences between revisions 1 and 11 (spanning 10 versions)
Revision 1 as of 2009-10-24 21:21:19
Size: 7335
Editor: JuanDelgado
Comment: How to parse a Mercurial's bundle format.
Revision 11 as of 2013-11-06 19:03:51
Size: 9786
Comment: Adding more detailed description.
Deletions are marked like this. Additions are marked like this.
Line 3: Line 3:
On the command line, bundle files are generated with the <code>hg bundle</code> command. They consist of a header, followed by a block of binary data, which may be compressed. The header is 6 bytes long and indicates the compression type:
On the command line, bundle files are generated with the `hg bundle` command. They consist of a header, followed by a block of binary data, which may be compressed. The header is 6 bytes long and indicates the compression type:
Line 26: Line 25:
 infile = open(filename, "rb")
 outfile = open(str.join('',[filename,'.uncompressed']), "wb")
 outfile.write('HG10UN')
 for chunk in unbundle(infile):
  outfile.write(chunk)
        infile = open(filename, "rb")
        outfile = open(filename+'.uncompressed', "wb")
        outfile.write('HG10UN')
        for chunk in unbundle(infile):
                outfile.write(chunk)
Line 55: Line 54:
        # old client with uncompressed bundle         # old-style uncompressed bundle with no header - we've read into actual data
        fh.seek(0)
Line 79: Line 79:
== Overview of Bundles ==
 . To understand how the bundle format works, it is helpful to understand how changesets are committed to the repository. You might want to take a look at [[http://mercurial.selenic.com/wiki/ChangeSet#Committing_a_new_changeset|ChangeSet#Committing_a_new_changeset]].
Line 80: Line 82:
== Breakdown of uncompressed bundle files == A bundles contains all the information necessary to add one or more changesets to a repository. A changest includes:
Line 82: Line 84:
Uncompressed bundle files consist of the ''HG10UN'' header followed by a ''changegroup'', which,
according to the WireProtocol page, is:
 * a particular version of the ChangeLog
 * a particular version of the [[Manifest]]
 * a particular version of content for each tracked file

As such, a bundle contains the following for each changeset it includes:

 * changes made to the ChangeLog
 * changes made to the Manifest
 * changes made to a set of files

The bundle is divided into three corresponding sections, with a common structure called a Group used in each section (with a slight variation for the files section). Each Group is composed of one or more structures called Chunks, each of which is simply a 4-byte len field followed by data. The len field is interpretted as a big-endian integer and specifies the number of bytes in the ''entire'' Chunk, i.e., it includes its own 4 bytes.
||||<tablewidth="365px" tableheight="59px"style="text-align:center">'''Chunk ''' ||
||'''4 bytes - big-endian''' ||'''(''len'' - 4) bytes''' ||
||''len'' ||data ||
Line 86: Line 100:
  * A changelog ''group''
  * A manifest ''group''
  * A list of:
    * filename length
    * filename
    * file ''group''
Line 93: Line 101:
A ''group'' is a series of ''chunks'', which have the following form:
Line 95: Line 102:
|| '''4 bytes''' || '''20 bytes - big-endian''' || '''20 bytes - big-endian''' || '''20 bytes - big-endian''' || '''20 bytes - big-endian''' || '''(''len'' - 84) bytes''' ||
|| ''len'' || ''node'' || ''p1'' || ''p2'' || ''cs'' || Data ||
The group is terminated by a ''NullChunk'', which is simply a Chunk whose ''len'' is no more than 4 and therefore has no data (but a NullChunk always has all 4 bytes for the len field).
||||||||<tablewidth="365px" tableheight="59px"style="text-align:center">'''Group''' ||
||Chunk 0 ||... ||Null Chunk ||
Line 98: Line 106:
If ''len'' is less than or equal to 4, then the rest of the fields of the ''chunk'' will
not be written. Such a ''null chunk'' terminates every ''group''.



The changelog section and the manifest section of the bundle are both simply one group each. The final section is called the Filelist, and it is sequence of two-tuples, one two-tuple for each file that was modified by any one of the changesets in the bundle. Each two-tuple in the Filelist contains the file's path, in the form of a chunk, and a Group. the filelist is termianted by a NullChunk (in place of the next filepath chunk):
||||||||||||||<tablewidth="889px" tableheight="108px"style="text-align:center">'''Filelist''' ||
||||<style="text-align:center">'''FileEntry 0''' ||||<style="text-align:center">... ||||<style="text-align:center">'''FileEntry F-1''' ||'''Terminator''' ||
||filepath (Chunk) ||filedata (Group) ||||<style="text-align:center"> ||filepath (Chunk) ||filedata (Group) ||NullChunk ||




Putting it all together a bundle looks like this:
||||||||||||||||||||||||<tablewidth="1014px" tableheight="88px"style="text-align:center">'''Bundle''' ||
||||||||<style="text-align:center">'''Changelog (Group)<<BR>>''' ||||||||<style="text-align:center">'''Manifest (Group)<<BR>>''' ||||||||<style="text-align:center">'''Filelist<<BR>>''' ||
||Chunk 0 ||... ||Chunk C-1 ||NullChunk ||Chunk 0 ||... ||Chunk M-1 ||NullChunk ||FileEntry 0 ||... ||FileEntry F-1 ||NullChunk ||




=== Inside the Groups ===
Inside each Group (1 Group for the changelog, 1 Group for the manifest, 1 Group for each of the modified files), there is 1 Chunk for each changeset which is included in the Bundle. These Chunks are a special species of Chunk called a RevChunk, which have the Chunk data further divided into the following fields:
||'''4 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''(''len'' - 84) bytes''' ||
||''len'' ||''node'' ||''p1 (parent 1)'' ||''p2 (parent 2)'' ||''cs (changeset link)'' ||''revdata'' ||




The ''len'' field is the same as for all other Chunks, it specifies to the total number of bytes in the chunk. The next four fields are each 20 byte nodeids, stored in big-endian binary form (as opposed to the ASCII hexidecimal form commonly seen by the user). Each RevChunk contains the data needed to create a new entry in the corresponding revlog (revlog for the changelog, manifest, or tracked file), and the required nodeids are stored in the four fields. The ''node'' field is the identifier for the new entry that the RevChunk creates, while p1 and p2 are the nodeids for the new entry's parents. The ''cs'' fields is the [[ChangeSetID|ChangeSetId]] for the changeset that this RevChunk belongs to.
Line 102: Line 136:
Each non-null ''chunk'' in the Changelog group contains in its ''Data'' section a patch against a text that is only stored in the form of other patches, beginning with a patch against the empty string (""). Each Changelog entry patches the result of all previous patches (the previous, or ''parent'' patch of a given patch ''p'' is the patch that has a ''node'' equal to ''p'''s ''''''''p1'' field), so that the ''final text'' contains the following newline-terminated fields: '''''
Line 103: Line 138:
Each non-null ''chunk'' in the Changelog group contains in its ''Data'' section a patch against a text that is only stored
in the form of other patches, beginning with a patch against the empty string (""). Each Changelog
entry patches the result of all previous patches (the previous, or ''parent'' patch of a given patch ''p'' is the patch that has a ''node'' equal to ''p''``'s ''p1'' field), so that the ''final text'' contains the following
newline-terminated fields:
 * '''''A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest ''group'' '''''
 * '''''A string describing the user who committed this entry. '''''
 * '''''A Unix time stamp, followed by a space, followed by an offset from UTC, in seconds. '''''
 * '''''The name of a file. '''''
 * '''''A description of the change that was made. '''''
Line 108: Line 144:
  *A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest ''group''
  *A string describing the user who committed this entry.
  *A Unix time stamp, followed by a space, followed by an offset from UTC, in seconds.
  *The name of a file.
  *A description of the change that was made.

Or, in BNF:
'''''Or, in BNF: '''''
Line 117: Line 147:
<final-text> ::= <node-id> "\n" <user-id> "\n" <unix-timestamp> " " <utc-offset> "\n"
                 <filename> "\n" <description>
<node-id> ::= {<hexadecimal-character>}{40}
<user-id> ::= {<printable-character>}*
<unix-timestamp> ::= {<numeral>}*
<utc-offset> ::= {<numeral>}*
<filename> ::= {<os-accepted-filename-character>}*
<description> ::= {<printable-character>}*
Line 126: Line 148:
'''''The ''Data'' of the ''chunk'' contains a series of binary ''blocks'' that represent patches. These ''blocks'' have the following form:
||4 bytes - big-endian ''' ''' ||4 bytes - big-endian ''' ''' ||4 bytes - big-endian ''' ''' ||blocklen bytes ''''' ''''' ||
||start '' '' ||end '' '' ||blocklen '' '' ||textual data ||
Line 127: Line 152:
The ''Data'' of the ''chunk'' contains a series of binary ''blocks'' that represent patches. These ''blocks'' have the following form:
Line 129: Line 153:
|| '''4 bytes - big-endian''' || '''4 bytes - big-endian''' || '''4 bytes - big-endian'''|| '''''blocklen'' bytes''' ||
|| ''start'' || ''end'' || ''blocklen'' || textual data ||
'''''

Line 133: Line 158:
In all Changelog entries, ''start'' and ''end'' represent the byte positions within
the previous entry text version that are to be replaced with the ''textual data''. Each ''start'' within the same ''chunk'' must be greater than or equal to the previous block's ''end''. No ''end'' can be greater than the total length of the previous entry version. Given any non-merging Changelog entry and the ''final text'' of the previous
entry, the next ''final text'' is given by the following pseudocode ('''++''' is the ''append'' operator):
'''''In all Changelog entries, ''start'' and ''end'' represent the byte positions within the previous entry text version that are to be replaced with the ''textual data''. Each ''start'' within the same ''chunk'' must be greater than or equal to the previous block's ''end''. No ''end'' can be greater than the total length of the previous entry version. Given any non-merging Changelog entry and the ''final text'' of the previous entry, the next ''final text'' is given by the following pseudocode ('''''''''''++''' is the ''append'' operator): ''''' '''
Line 138: Line 161:
 patch(blocks, previous_text) =
    /* blocks[-1] is the block before the one currently being
     * processed, or a block containing all zeros
     * if the first block in the entry is being processed.
     */
    if blocks == [] then
       return previous_text[blocks[-1].end .. $]
    else
       return previous_text[blocks[-1].end .. blocks[0].start]
              ++ blocks[0].textual_data ++ patch(blocks[1..$], previous_text)
    end if
 end
Line 151: Line 162:
''For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the ''n''th Changelog entry to its ancestors is shown by the following pseudocode: ''''' '''
Line 152: Line 164:
For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the ''n''th Changelog entry to its ancestors is shown by the following pseudocode:
Line 154: Line 165:
 entries[0] := diff(first_final_text, "")
 
 if p2 == "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0" then
    entries[n] := diff(final_text,
                       patch(entries[n-1], patch(entries[n-2],
                                                 patch(..., patch(entries[0], "")))))
 else
    error "I don't know what the relationship is for merging patches"
 end
Line 164: Line 166:
=== Calculating a Changelog chunk's node field ===
''The ''chunk'' containing a set of Changelog blocks must have a ''node'' that is calculated according to the following pseudocode: ''''' '''
Line 165: Line 169:
=== Calculating a Changelog chunk's node field === {{{
}}}
p1'' and ''p2'' must be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the ''final text'' for Manifest ''node''s. Numerous libraries exist that implement the ''sha1_digest'' function. ''''' '''
Line 167: Line 173:
The ''chunk'' containing a set of Changelog blocks must have a ''node'' that is calculated
according to the following pseudocode:
{{{
 sha1_digest(min(p1,p2) ++ max(p1,p2) ++ final_text)
}}}

''p1'' and ''p2'' must
be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the ''final text'' for Manifest ''node''s. Numerous libraries exist that implement the
''sha1_digest'' function.
== Bundle format v2 ==
''The new bundle format design is described on the BundleFormat2 page. ''''' '''

Bundle format is the format in which changegroups are exchanged. It is used in the WireProtocol, as well as from the command line.

On the command line, bundle files are generated with the hg bundle command. They consist of a header, followed by a block of binary data, which may be compressed. The header is 6 bytes long and indicates the compression type:

  • HG10BZ - Compressed with the Python 'bz2' module

  • HG10GZ - Compressed with the Python 'zlib' module

  • HG10UN - Not compressed.

Decompressor in Python

The following is a Python program to convert an HG10BZ or HG10GZ file into an HG10UN file:

   1 #!/usr/bin/python
   2 
   3 # Program to decompress Mercurial bundles
   4 # This program contains extracts from the Mercurial
   5 # source, and is therefore subject to the GNU General
   6 # Public License.
   7 
   8 import bz2, zlib, sys
   9 
  10 def decompress(filename):
  11         infile = open(filename, "rb")
  12         outfile = open(filename+'.uncompressed', "wb")
  13         outfile.write('HG10UN')
  14         for chunk in unbundle(infile):
  15                 outfile.write(chunk)
  16 
  17 def filechunkiter(f, size=65536, limit=None):
  18     """Create a generator that produces the data in the file size
  19     (default 65536) bytes at a time, up to optional limit (default is
  20     to read all data).  Chunks may be less than size bytes if the
  21     chunk is the last chunk in the file, or the file is a socket or
  22     some other type of file that sometimes reads less data than is
  23     requested."""
  24     assert size >= 0
  25     assert limit is None or limit >= 0
  26     while True:
  27         if limit is None: nbytes = size
  28         else: nbytes = min(limit, size)
  29         s = nbytes and f.read(nbytes)
  30         if not s: break
  31         if limit: limit -= len(s)
  32         yield s
  33 
  34 
  35 def unbundle(fh):
  36     header = fh.read(6)
  37     if header == 'HG10UN':
  38         return fh
  39     elif not header.startswith('HG'):
  40         # old-style uncompressed bundle with no header - we've read into actual data
  41         fh.seek(0)
  42         def generator(f):
  43             yield header
  44             for chunk in f:
  45                 yield chunk
  46     elif header == 'HG10GZ':
  47         def generator(f):
  48             zd = zlib.decompressobj()
  49             for chunk in f:
  50                 yield zd.decompress(chunk)
  51     elif header == 'HG10BZ':
  52         def generator(f):
  53             zd = bz2.BZ2Decompressor()
  54             zd.decompress("BZ")
  55             for chunk in filechunkiter(f, 4096):
  56                 yield zd.decompress(chunk)
  57     return generator(fh)
  58 
  59 if len(sys.argv) != 2:
  60    print "Usage: expandbundle <file>"
  61    exit()
  62 
  63 decompress(sys.argv[1])

Overview of Bundles

  • To understand how the bundle format works, it is helpful to understand how changesets are committed to the repository. You might want to take a look at ChangeSet#Committing_a_new_changeset.

A bundles contains all the information necessary to add one or more changesets to a repository. A changest includes:

  • a particular version of the ChangeLog

  • a particular version of the Manifest

  • a particular version of content for each tracked file

As such, a bundle contains the following for each changeset it includes:

  • changes made to the ChangeLog

  • changes made to the Manifest
  • changes made to a set of files

The bundle is divided into three corresponding sections, with a common structure called a Group used in each section (with a slight variation for the files section). Each Group is composed of one or more structures called Chunks, each of which is simply a 4-byte len field followed by data. The len field is interpretted as a big-endian integer and specifies the number of bytes in the entire Chunk, i.e., it includes its own 4 bytes.

Chunk

4 bytes - big-endian

(len - 4) bytes

len

data

The group is terminated by a NullChunk, which is simply a Chunk whose len is no more than 4 and therefore has no data (but a NullChunk always has all 4 bytes for the len field).

Group

Chunk 0

...

Null Chunk

The changelog section and the manifest section of the bundle are both simply one group each. The final section is called the Filelist, and it is sequence of two-tuples, one two-tuple for each file that was modified by any one of the changesets in the bundle. Each two-tuple in the Filelist contains the file's path, in the form of a chunk, and a Group. the filelist is termianted by a NullChunk (in place of the next filepath chunk):

Filelist

FileEntry 0

...

FileEntry F-1

Terminator

filepath (Chunk)

filedata (Group)

filepath (Chunk)

filedata (Group)

NullChunk

Putting it all together a bundle looks like this:

Bundle

Changelog (Group)

Manifest (Group)

Filelist

Chunk 0

...

Chunk C-1

NullChunk

Chunk 0

...

Chunk M-1

NullChunk

FileEntry 0

...

FileEntry F-1

NullChunk

Inside the Groups

Inside each Group (1 Group for the changelog, 1 Group for the manifest, 1 Group for each of the modified files), there is 1 Chunk for each changeset which is included in the Bundle. These Chunks are a special species of Chunk called a RevChunk, which have the Chunk data further divided into the following fields:

4 bytes - big-endian

20 bytes - big-endian

20 bytes - big-endian

20 bytes - big-endian

20 bytes - big-endian

(len - 84) bytes

len

node

p1 (parent 1)

p2 (parent 2)

cs (changeset link)

revdata

The len field is the same as for all other Chunks, it specifies to the total number of bytes in the chunk. The next four fields are each 20 byte nodeids, stored in big-endian binary form (as opposed to the ASCII hexidecimal form commonly seen by the user). Each RevChunk contains the data needed to create a new entry in the corresponding revlog (revlog for the changelog, manifest, or tracked file), and the required nodeids are stored in the four fields. The node field is the identifier for the new entry that the RevChunk creates, while p1 and p2 are the nodeids for the new entry's parents. The cs fields is the ChangeSetId for the changeset that this RevChunk belongs to.

The Changelog ''group''

Each non-null chunk in the Changelog group contains in its Data section a patch against a text that is only stored in the form of other patches, beginning with a patch against the empty string (""). Each Changelog entry patches the result of all previous patches (the previous, or parent patch of a given patch p is the patch that has a node equal to ps p1 field), so that the final text contains the following newline-terminated fields:

  • A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest group

  • A string describing the user who committed this entry.

  • A Unix time stamp, followed by a space, followed by an offset from UTC, in seconds.

  • The name of a file.

  • A description of the change that was made.

Or, in BNF:

The Data of the chunk contains a series of binary blocks that represent patches. These blocks have the following form:

4 bytes - big-endian

4 bytes - big-endian

4 bytes - big-endian

blocklen bytes

start

end

blocklen

textual data

Producing the Changelog entry's text from the patches describing it

In all Changelog entries, start and end represent the byte positions within the previous entry text version that are to be replaced with the textual data. Each start within the same chunk must be greater than or equal to the previous block's end. No end can be greater than the total length of the previous entry version. Given any non-merging Changelog entry and the final text of the previous entry, the next final text is given by the following pseudocode (++ is the append operator):

For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the nth Changelog entry to its ancestors is shown by the following pseudocode:

Calculating a Changelog chunk's node field

The chunk containing a set of Changelog blocks must have a node that is calculated according to the following pseudocode:

p1 and p2 must be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the final text for Manifest nodes. Numerous libraries exist that implement the sha1_digest function.

Bundle format v2

The new bundle format design is described on the BundleFormat2 page.

BundleFormat (last edited 2018-02-10 00:09:33 by AviKelman)