Size: 7324
Comment:
|
Size: 7169
Comment:
|
Deletions are marked like this. | Additions are marked like this. |
Line 4: | Line 4: |
Line 26: | Line 25: |
infile = open(filename, "rb") outfile = open(str.join('',[filename,'.uncompressed']), "wb") outfile.write('HG10UN') for chunk in unbundle(infile): outfile.write(chunk) |
infile = open(filename, "rb") outfile = open(str.join('',[filename,'.uncompressed']), "wb") outfile.write('HG10UN') for chunk in unbundle(infile): outfile.write(chunk) |
Line 79: | Line 78: |
== Breakdown of uncompressed bundle files == Uncompressed bundle files consist of the ''HG10UN'' header followed by a ''changegroup'', which, according to the WireProtocol page, is: |
|
Line 80: | Line 81: |
== Breakdown of uncompressed bundle files == | * A changelog ''group'' * A manifest ''group'' * A list of: * filename length * filename * file ''group'' |
Line 82: | Line 88: |
Uncompressed bundle files consist of the ''HG10UN'' header followed by a ''changegroup'', which, according to the WireProtocol page, is: |
A ''group'' is a series of ''chunks'', terminated by a ''null chunk''. Non-null chunks have the following form: ||'''4 bytes - big--endian''' ||'''20 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''20 bytes - big-endian''' ||'''(''len'' - 84) bytes''' || ||''len'' ||''node'' ||''p1'' ||''p2'' ||''cs'' ||Data || |
Line 86: | Line 93: |
* A changelog ''group'' * A manifest ''group'' * A list of: * filename length * filename * file ''group'' A ''group'' is a series of ''chunks'', which have the following form: || '''4 bytes''' || '''20 bytes - big-endian''' || '''20 bytes - big-endian''' || '''20 bytes - big-endian''' || '''20 bytes - big-endian''' || '''(''len'' - 84) bytes''' || || ''len'' || ''node'' || ''p1'' || ''p2'' || ''cs'' || Data || If ''len'' is less than or equal to 4, then the rest of the fields of the ''chunk'' will not be written. Such a ''null chunk'' terminates every ''group''. |
A null chunk is simply a 4-byte big-endian integer that is less than or equal to 4. |
Line 102: | Line 96: |
Each non-null ''chunk'' in the Changelog group contains in its ''Data'' section a patch against a text that is only stored in the form of other patches, beginning with a patch against the empty string (""). Each Changelog entry patches the result of all previous patches (the previous, or ''parent'' patch of a given patch ''p'' is the patch that has a ''node'' equal to ''p'''s ''p1'' field), so that the ''final text'' contains the following newline-terminated fields: | |
Line 103: | Line 98: |
Each non-null ''chunk'' in the Changelog group contains in its ''Data'' section a patch against a text that is only stored in the form of other patches, beginning with a patch against the empty string (""). Each Changelog entry patches the result of all previous patches (the previous, or ''parent'' patch of a given patch ''p'' is the patch that has a ''node'' equal to ''p''``'s ''p1'' field), so that the ''final text'' contains the following newline-terminated fields: *A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest ''group'' *A string describing the user who committed this entry. *A Unix time stamp, followed by a space, followed by an offset from UTC, in seconds. *The name of a file. *A description of the change that was made. |
* A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest ''group'' * A string describing the user who committed this entry. * A Unix time stamp, followed by a space, followed by an offset from UTC, in seconds. * The name of a file. * A description of the change that was made. |
Line 126: | Line 116: |
The ''Data'' of the ''chunk'' contains a series of binary ''blocks'' that represent patches. These ''blocks'' have the following form: ||'''4 bytes - big-endian''' ||'''4 bytes - big-endian''' ||'''4 bytes - big-endian''' ||'''''blocklen'' bytes''' || ||''start'' ||''end'' ||''blocklen'' ||textual data || |
|
Line 127: | Line 120: |
The ''Data'' of the ''chunk'' contains a series of binary ''blocks'' that represent patches. These ''blocks'' have the following form: | |
Line 129: | Line 121: |
|| '''4 bytes - big-endian''' || '''4 bytes - big-endian''' || '''4 bytes - big-endian'''|| '''''blocklen'' bytes''' || || ''start'' || ''end'' || ''blocklen'' || textual data || |
|
Line 133: | Line 124: |
In all Changelog entries, ''start'' and ''end'' represent the byte positions within the previous entry text version that are to be replaced with the ''textual data''. Each ''start'' within the same ''chunk'' must be greater than or equal to the previous block's ''end''. No ''end'' can be greater than the total length of the previous entry version. Given any non-merging Changelog entry and the ''final text'' of the previous entry, the next ''final text'' is given by the following pseudocode ('''++''' is the ''append'' operator): |
In all Changelog entries, ''start'' and ''end'' represent the byte positions within the previous entry text version that are to be replaced with the ''textual data''. Each ''start'' within the same ''chunk'' must be greater than or equal to the previous block's ''end''. No ''end'' can be greater than the total length of the previous entry version. Given any non-merging Changelog entry and the ''final text'' of the previous entry, the next ''final text'' is given by the following pseudocode ('''++''' is the ''append'' operator): |
Line 151: | Line 140: |
For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the ''n''th Changelog entry to its ancestors is shown by the following pseudocode: | |
Line 152: | Line 142: |
For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the ''n''th Changelog entry to its ancestors is shown by the following pseudocode: | |
Line 155: | Line 144: |
Line 164: | Line 153: |
=== Calculating a Changelog chunk's node field === The ''chunk'' containing a set of Changelog blocks must have a ''node'' that is calculated according to the following pseudocode: |
|
Line 165: | Line 156: |
=== Calculating a Changelog chunk's node field === The ''chunk'' containing a set of Changelog blocks must have a ''node'' that is calculated according to the following pseudocode: |
|
Line 172: | Line 159: |
''p1'' and ''p2'' must be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the ''final text'' for Manifest ''node''s. Numerous libraries exist that implement the ''sha1_digest'' function. |
''p1'' and ''p2'' must be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the ''final text'' for Manifest ''node''s. Numerous libraries exist that implement the ''sha1_digest'' function. |
Bundle format is the format in which changegroups are exchanged. It is used in the WireProtocol, as well as from the command line.
On the command line, bundle files are generated with the hg bundle command. They consist of a header, followed by a block of binary data, which may be compressed. The header is 6 bytes long and indicates the compression type:
HG10BZ - Compressed with the Python 'bz2' module
HG10GZ - Compressed with the Python 'zlib' module
HG10UN - Not compressed.
Contents
Decompressor in Python
The following is a Python program to convert an HG10BZ or HG10GZ file into an HG10UN file:
1 #!/usr/bin/python
2
3 # Program to decompress Mercurial bundles
4 # This program contains extracts from the Mercurial
5 # source, and is therefore subject to the GNU General
6 # Public License.
7
8 import bz2, zlib, sys
9
10 def decompress(filename):
11 infile = open(filename, "rb")
12 outfile = open(str.join('',[filename,'.uncompressed']), "wb")
13 outfile.write('HG10UN')
14 for chunk in unbundle(infile):
15 outfile.write(chunk)
16
17 def filechunkiter(f, size=65536, limit=None):
18 """Create a generator that produces the data in the file size
19 (default 65536) bytes at a time, up to optional limit (default is
20 to read all data). Chunks may be less than size bytes if the
21 chunk is the last chunk in the file, or the file is a socket or
22 some other type of file that sometimes reads less data than is
23 requested."""
24 assert size >= 0
25 assert limit is None or limit >= 0
26 while True:
27 if limit is None: nbytes = size
28 else: nbytes = min(limit, size)
29 s = nbytes and f.read(nbytes)
30 if not s: break
31 if limit: limit -= len(s)
32 yield s
33
34
35 def unbundle(fh):
36 header = fh.read(6)
37 if header == 'HG10UN':
38 return fh
39 elif not header.startswith('HG'):
40 # old client with uncompressed bundle
41 def generator(f):
42 yield header
43 for chunk in f:
44 yield chunk
45 elif header == 'HG10GZ':
46 def generator(f):
47 zd = zlib.decompressobj()
48 for chunk in f:
49 yield zd.decompress(chunk)
50 elif header == 'HG10BZ':
51 def generator(f):
52 zd = bz2.BZ2Decompressor()
53 zd.decompress("BZ")
54 for chunk in filechunkiter(f, 4096):
55 yield zd.decompress(chunk)
56 return generator(fh)
57
58 if len(sys.argv) != 2:
59 print "Usage: expandbundle <file>"
60 exit()
61
62 decompress(sys.argv[1])
Breakdown of uncompressed bundle files
Uncompressed bundle files consist of the HG10UN header followed by a changegroup, which, according to the WireProtocol page, is:
A changelog group
A manifest group
- A list of:
- filename length
- filename
file group
A group is a series of chunks, terminated by a null chunk. Non-null chunks have the following form:
4 bytes - big--endian |
20 bytes - big-endian |
20 bytes - big-endian |
20 bytes - big-endian |
20 bytes - big-endian |
(len - 84) bytes |
len |
node |
p1 |
p2 |
cs |
Data |
A null chunk is simply a 4-byte big-endian integer that is less than or equal to 4.
The Changelog ''group''
Each non-null chunk in the Changelog group contains in its Data section a patch against a text that is only stored in the form of other patches, beginning with a patch against the empty string (""). Each Changelog entry patches the result of all previous patches (the previous, or parent patch of a given patch p is the patch that has a node equal to ps p1 field), so that the final text contains the following newline-terminated fields: A node-id, in hexadecimal notation, that corresponds to an entry in the Manifest Or, in BNF: The 4 bytes - big-endian 4 bytes - big-endian 4 bytes - big-endian blocklen bytes start end blocklen textual data
In all Changelog entries, For non-merging Changelog entries, every patch except the first is a patch against the result of applying previous patches. The relationship of the
The
group <final-text> ::= <node-id> "\n" <user-id> "\n" <unix-timestamp> " " <utc-offset> "\n"
<filename> "\n" <description>
<node-id> ::= {<hexadecimal-character>}{40}
<user-id> ::= {<printable-character>}*
<unix-timestamp> ::= {<numeral>}*
<utc-offset> ::= {<numeral>}*
<filename> ::= {<os-accepted-filename-character>}*
<description> ::= {<printable-character>}*
Producing the Changelog entry's text from the patches describing it
patch(blocks, previous_text) =
/* blocks[-1] is the block before the one currently being
* processed, or a block containing all zeros
* if the first block in the entry is being processed.
*/
if blocks == [] then
return previous_text[blocks[-1].end .. $]
else
return previous_text[blocks[-1].end .. blocks[0].start]
++ blocks[0].textual_data ++ patch(blocks[1..$], previous_text)
end if
end
entries[0] := diff(first_final_text, "")
if p2 == "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0" then
entries[n] := diff(final_text,
patch(entries[n-1], patch(entries[n-2],
patch(..., patch(entries[0], "")))))
else
error "I don't know what the relationship is for merging patches"
end
Calculating a Changelog chunk's node field
sha1_digest(min(p1,p2) ++ max(p1,p2) ++ final_text)
p1 and p2 must be in the 20-byte binary representation used within the Bundle format, not the hexadecimal notation used within the final text for Manifest nodes. Numerous libraries exist that implement the sha1_digest function.