Current limits on Mercurial file size

  • all files are currently handled as single Python strings in memory

  • the diff/delta algorithm currently assumes two revisions + line indexes + resulting diff can fit in memory (so 3-5x overhead)

  • compression and decompression may use 2x memory
  • the patch algorithm assumes original + all patches + output can fit in memory
  • individual deltas are limited to 2G in the revlog index
  • uncompressed revision length is also limited to 2G in the index
  • revlogs have a maximum length of 48 bits (256 terabytes)
  • dirstate only tracks file sizes up to 2G
  • the wire protocol and bundle format currently have 2G limits on individual deltas
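
Most of the 2G figures above come from signed 32-bit length fields in the on-disk formats. A minimal Python illustration with the struct module (standard-size 'i' is how such a field packs):

    import struct

    # A signed 32-bit field tops out at 2**31 - 1, one byte short of
    # 2GiB -- which is exactly where the 2G limits above come from.
    struct.pack(">i", 2**31 - 1)       # fine
    try:
        struct.pack(">i", 2**31)       # 2GiB: out of range
    except struct.error:
        pass                           # struct refuses to encode it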

Thus, 32-bit systems can handle files approaching 1GB with sufficient RAM+swap before running out of virtual memory: at 3-5x overhead, a file near 1GB already exhausts most of the roughly 3GB of address space a 32-bit process gets.

64-bit systems will instead hit internal 2GB barriers.

Future Directions

With some small changes, the 2GB barriers can probably be pushed back to 4GB. By changing some index and protocol structures, we can push this back to terabytes, but you'll need a corresponding amount of RAM+swap to handle those large files.

To go beyond this limit and handle files much larger than available memory, we would need to do some fairly substantial replumbing of Mercurial's internals. This is desirable for handling extremely large files (video, astronomical data, ASIC design) and reducing requirements for web servers. Possible approaches to handling larger files:

  • use mmap to back virtual memory with disk
  • use a 'magic string' class to transparently bring portions of a file into memory on demand
  • use iterable string vectors for all file contents

The mmap approach doesn't really help as we quickly run into a 3GB barrier on 32-bit machines.
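
For reference, the attraction of mmap is that a mapping already behaves like a string (a minimal sketch; 'bigfile' stands in for any tracked file):

    import mmap

    with open("bigfile", "rb") as f:
        buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        chunk = buf[:4096]    # len() and slicing work as on a str
        buf.close()
    # ...but the mapping itself consumes address space, so a 32-bit
    # process still hits the ~3GB ceiling.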

The magic string technique would require auditing every single use of the string to avoid things like write() that would instantiate the whole string in memory.
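
A sketch of what such a class might look like (names hypothetical); the last method is exactly the sort of path the audit would have to hunt down:

    class magicstring(object):
        """Hypothetical sketch: pull file regions into memory on demand."""
        def __init__(self, path):
            self._f = open(path, "rb")
            self._f.seek(0, 2)             # seek to end to learn the size
            self._size = self._f.tell()

        def __len__(self):
            return self._size

        def __getitem__(self, key):        # slicing reads only that region
            start, stop, _ = key.indices(self._size)
            self._f.seek(start)
            return self._f.read(stop - start)

        def tostring(self):                # the trap: any caller needing the
            return self[0:self._size]      # full contents (say, to write()
                                           # it out) instantiates the whole
                                           # file in memory after all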

If we instead declare that we pass all file contents around as an iterable (list, tuple, or iterator) of large multi-megabyte string fragments, every user will break loudly and need replacing with an appropriate loop, thus simplifying the audit process. This concept can be wrapped in a simple class, but it can't have any automatic conversion to 'str' type. As a first pass, making everything work with one-element lists should be easy.
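
A sketch of such a wrapper (hypothetical; the point is what it deliberately does not provide):

    class stringvector(object):
        """Hypothetical sketch: file contents as an iterable of large
        string fragments.  There is deliberately no method returning
        the joined contents, so code expecting a plain str fails
        loudly and must be rewritten as a loop over fragments."""
        def __init__(self, fragments):
            self._fragments = list(fragments)

        def __iter__(self):
            return iter(self._fragments)

        def __len__(self):                 # total byte length
            return sum(len(f) for f in self._fragments)

The first pass mentioned above is then just stringvector([wholefile]): one fragment, same interface.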

Fixing up the code:

The mpatch code can be made to work on a window without too much effort, but it may be hard to avoid degrading to O(n²) performance overall as we iterate through the window.
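
As a toy illustration of the windowed idea (this is not mpatch; hunks here are a single sorted, non-overlapping layer of (start, end, data) triples against original offsets):

    def streampatch(fragments, hunks):
        """Apply sorted (start, end, data) hunks while streaming
        through the original one fragment (window) at a time."""
        pos = 0                            # offset of current fragment
        it = iter(hunks)
        hunk = next(it, None)
        for frag in fragments:
            fragend = pos + len(frag)
            done = 0                       # bytes of frag already handled
            while hunk and hunk[0] < fragend:
                start, stop, data = hunk
                if start > pos + done:     # unchanged bytes before the hunk
                    yield frag[done:start - pos]
                if start >= pos:           # emit replacement only once, in
                    yield data             # the window where it starts
                done = min(stop - pos, len(frag))
                if stop <= fragend:
                    hunk = next(it, None)  # hunk ends inside this window
                else:
                    break                  # hunk spans into the next window
            if done < len(frag):
                yield frag[done:]          # unchanged tail of the window
            pos = fragend

With many stacked patch layers, each window has to consult every layer, which is where the O(n²) risk comes in.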

The core delta algorithm could similarly be made to delta corresponding chunks of revisions, or could be extended to support a streaming binary diff.
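
A naive sketch of deltaing corresponding chunks (aligned fixed-size blocks; a real streaming binary diff would also have to resynchronise after insertions and deletions):

    def streamdelta(oldf, newf, blocksize=1 << 20):
        """Compare two files block by block, yielding (start, end, data)
        hunks where aligned blocks differ.  Peak memory: two blocks."""
        offset = 0
        while True:
            a = oldf.read(blocksize)
            b = newf.read(blocksize)
            if not a and not b:
                break
            if a != b:
                yield (offset, offset + len(a), b)
            offset += len(a)

The hunks come out in the same (start, end, data) form the patch sketch above consumes.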

Changing compression and decompression to work on iterables is trivial. Adjusting most I/O is also trivial. Various operations like annotate will be harder.
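
For zlib (which revlog uses for hunk compression) the streaming objects already exist, so this really is a small change:

    import zlib

    def compressfragments(fragments):
        z = zlib.compressobj()
        for frag in fragments:
            out = z.compress(frag)
            if out:
                yield out
        yield z.flush()

    def decompressfragments(fragments):
        z = zlib.decompressobj()
        for frag in fragments:
            out = z.decompress(frag)
            if out:
                yield out
        out = z.flush()
        if out:
            yield out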

Extending dirstate and revlog chunks to 4G means going to unsigned pack/unpack specifiers, which is easy enough. Beyond that, more invasive format changes will be needed.
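
In struct terms that first step is one character per pack specifier:

    import struct

    struct.pack(">i", 2**31 - 1)   # signed 32-bit: just under 2GiB
    struct.pack(">I", 2**32 - 1)   # unsigned 32-bit: just under 4GiB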

If revlog is changed to store the end offset of each hunk, the compressed hunk length needn't be stored. This will let us go to 48-bit uncompressed lengths and 64-bit total revlogs without enlarging the index.
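
A sketch under that assumed layout: compressed lengths become differences of consecutive end offsets, and a 48-bit uncompressed length fits in six bytes:

    import struct

    def hunkextent(endoffsets, rev):
        """endoffsets[rev] = end of rev's compressed hunk in the data
        file; the compressed length is the difference of neighbours."""
        start = endoffsets[rev - 1] if rev > 0 else 0
        return start, endoffsets[rev] - start      # (offset, length)

    def pack48(n):
        """Pack an uncompressed length into 6 bytes (max 2**48 - 1,
        i.e. 256 terabytes): pack 64-bit, drop the two zero bytes."""
        assert 0 <= n < 1 << 48
        return struct.pack(">Q", n)[2:]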


CategoryInternals
