Differences between revisions 3 and 17 (spanning 14 versions)
Revision 3 as of 2011-10-24 07:44:07
Size: 5666
Comment: Add hgext/largefiles/design.txt
Revision 17 as of 2013-09-01 01:22:32
Size: 7242
Editor: KevinBot
Comment:

Largefiles extension

<!> This is considered a feature of last resort.

Large binary files tend to be not very compressible, not very "diffable", and not at all mergeable. Such files are not handled well by Mercurial's storage format (Revlog), which is based on compressed binary deltas. largefiles solves this problem by adding a centralized client-server layer on top of Mercurial: largefiles live in a central store out on the network somewhere, and you only fetch the ones that you need when you need them.

1. Status

This extension is distributed with Mercurial 2.0 and later.

Author: Various

2. Overview

The largefiles extension allows for tracking large, incompressible binary files in Mercurial without requiring excessive bandwidth for clones and pulls. Files added as largefiles are not tracked directly by Mercurial; rather, their revisions are identified by a checksum, and Mercurial tracks these checksums. This way, when you clone a repository or pull in changesets, only the largefiles needed to update to the current version are downloaded. This saves both disk space and bandwidth.

If you are starting a new repository or adding new large binary files, using largefiles for them is as easy as adding '--large' to your hg add command. For example:

$ dd if=/dev/urandom of=thisfileislarge count=2000
$ hg add --large thisfileislarge
$ hg commit -m 'add thisfileislarge, which is large, as a largefile'

When you push a changeset that affects largefiles to a remote repository, its largefile revisions will be uploaded along with the changeset. This ensures that the central store gets a copy of every revision of every largefile. Note that the remote Mercurial must also have the largefiles extension enabled for this to work.

When you pull a changeset that affects largefiles from a remote repository, nothing different from Mercurial's normal behavior happens. However, when you update to such a revision, any largefiles needed by that revision are downloaded if they have never been downloaded before. This means that network access is required to update to a revision you have not yet updated to.

If you already have large files tracked by Mercurial without the largefiles extension, you will need to convert your repository in order to benefit from largefiles. This is done with the 'hg lfconvert' command:

$ hg lfconvert --size 10 oldrepo newrepo

By default, in repositories that already contain largefiles, any new file over 10 MB is automatically added as a largefile. To change this threshold, set largefiles.minsize in your Mercurial config file to the minimum size, in megabytes, to track as a largefile:

[largefiles]
minsize = 2

or use the --lfsize option to the add command (also in megabytes):

$ hg add --lfsize 2
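
The size rule described above can be pictured with a small Python sketch. This is an illustration only, not the extension's own code: the function name `should_add_as_largefile` is hypothetical, and the sketch assumes that a megabyte means 1024 * 1024 bytes and that a file exactly at the threshold counts as large.

```python
import os

# Assumption: one megabyte is 1024 * 1024 bytes.
MEGABYTE = 1024 * 1024


def should_add_as_largefile(path, minsize_mb=10.0):
    """Hypothetical helper: True if the file is at least minsize_mb megabytes.

    minsize_mb plays the role of largefiles.minsize (or --lfsize).
    Treating "exactly at the threshold" as large is an assumption.
    """
    return os.path.getsize(path) >= minsize_mb * MEGABYTE
```

With the threshold lowered to 2, as in the config example, a 2 MB file would be picked up while it would be left alone under the default of 10.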

The largefiles.patterns config option lets you list space-separated filename patterns (in shell glob syntax) that should always be tracked as largefiles:

[largefiles]
patterns = *.jpg *.{png,bmp} library.zip content/audio/*

Note: the pattern syntax shown here may be incorrect; see hg help patterns for the supported forms. In particular, *.{png,bmp} seems not to work, whereas re:.*\.(png|bmp) works as expected.
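
To see what the regular expression from the note above accepts, it can be tried with Python's re module. This is a sketch under assumptions: the "re:" prefix is dropped because it is Mercurial pattern syntax rather than part of the expression, the helper names are hypothetical, and Mercurial's own matching rules (see hg help patterns) may differ in detail from a plain re.match call.

```python
import re

# The expression from the note, without Mercurial's "re:" prefix.
IMAGE_PATTERN = re.compile(r".*\.(png|bmp)")

# A stricter variant (an assumption, not from the original page) that
# requires the extension to be at the very end of the path.
IMAGE_PATTERN_ANCHORED = re.compile(r".*\.(png|bmp)$")


def matches_image_pattern(path, pattern=IMAGE_PATTERN):
    """Hypothetical helper: True if path matches the pattern from its start."""
    return pattern.match(path) is not None
```

Note that the unanchored expression also matches a name such as notes.png.txt, because nothing ties the extension to the end of the path; the anchored variant rejects it.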

3. Configuration

Enable the largefiles extension by adding the following lines to your config file:

[extensions]
largefiles =

4. Design

This section explains how largefiles works behind the scenes. If you're just adding/modifying/committing/pushing/pulling in a largefiles repo, you shouldn't have to read this section (although it can't hurt). But if you are setting up or administering Mercurial with largefiles, this is essential reading.

4.1. The local store

Each local repository has a local largefiles store in '.hg/largefiles'. When you add a new largefile to a repository, it is first stored here. When largefiles are downloaded from the central store (see below), a copy is saved there. Files in the local store are also hard-linked to the user cache.

4.2. The user cache

The user cache helps to avoid downloading and storing multiple copies of largefiles. When a largefile is needed but does not exist in the local store, Mercurial checks the user cache. If the needed largefile exists, a hard-link is created in the local store.
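
The lookup order described above (local store first, then user cache, with a hard link created on a cache hit) might be sketched as follows. This is a simplified illustration, not the extension's code: the function name `find_largefile` is hypothetical, and the sketch assumes that both directories hold one file per largefile, named by its SHA-1 hash.

```python
import os


def find_largefile(sha1_hex, local_store, user_cache):
    """Hypothetical helper: return a path in the local store, or None.

    Assumes each store is a flat directory of files named by SHA-1 hash.
    """
    local_path = os.path.join(local_store, sha1_hex)
    if os.path.exists(local_path):
        # Already in the local store: nothing to do.
        return local_path

    cached_path = os.path.join(user_cache, sha1_hex)
    if os.path.exists(cached_path):
        # Cache hit: hard-link instead of copying, so no extra disk
        # space is used (both paths must be on the same filesystem).
        os.makedirs(local_store, exist_ok=True)
        os.link(cached_path, local_path)
        return local_path

    # Neither store has it; the caller would need to download it.
    return None
```

Because the link shares its data with the cached copy, deleting the user cache later does not remove the file from the local store.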

The cache location is OS dependent:

OS X                      /Users/username/Library/Caches/largefiles
Windows (Vista and up)    C:\Users\username\AppData\Local\largefiles
Windows (pre-Vista)       C:\Documents and Settings\username\Application Data\largefiles
Linux                     /home/username/.cache/largefiles

You can set your user cache to a non-default location by setting largefiles.usercache in your Mercurial config:

[largefiles]
usercache = /shared/myusercachedir

The user cache can be deleted at any time to reclaim disk space, but doing so may also result in downloading and storing additional copies of largefiles.

4.2.1. The central store

In a typical setup with a central Mercurial server, the user cache of the account that serves the central repositories acts as a central store for all of those repositories. This central largefiles store has every past revision of every largefile.

<!> Unlike other user caches, the central store should not be deleted! It may be the only cache that holds a largefile used by an old revision.

<!> When a client repository needs to download a largefile, it tries to get it from the repository specified as default in the hgrc file. If no default is specified, or the specified repository is incorrect, the download fails. As an alternative, a default path can be set for a single hg update command:

hg --config paths.default=path-to-repo-with-the-file update

4.3. Implementation details

Each largefile has a standin file in '.hglf/', which is tracked by Mercurial like any other file. The standin contains the SHA-1 hash of the largefile contents. When a largefile is added, removed, copied, renamed, and so on, the same operation is applied to the standin. Thus the history of the standin is the history of the largefile.
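
As a rough illustration of the standin idea, the following Python sketch hashes a file with SHA-1 and writes the hex digest to a matching path under '.hglf/'. It is not the extension's implementation: the helper names are hypothetical, and the trailing newline in the standin is an assumption.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=128 * 1024):
    """Return the SHA-1 hex digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_standin(repo_root, relative_path):
    """Hypothetical helper: write .hglf/<relative_path> with the file's hash.

    The trailing newline is an assumption made for this sketch.
    """
    largefile = os.path.join(repo_root, relative_path)
    standin = os.path.join(repo_root, ".hglf", relative_path)
    os.makedirs(os.path.dirname(standin), exist_ok=True)
    with open(standin, "w") as f:
        f.write(sha1_of_file(largefile) + "\n")
    return standin
```

Since the standin is a small text file, committing it records which largefile revision belongs to the changeset without storing the large contents in the history.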

For performance reasons, the contents of a standin are only updated before a commit. The add, remove, copy, and rename commands add, remove, copy, or rename the standin, but do not update its contents. As a result, the contents of a standin are always the hash of the largefile as of the last commit. To support some commands (such as revert), some standins are temporarily updated and then changed back after the command finishes.

A Mercurial dirstate object tracks the state of the largefiles. The dirstate uses the last modified time and current size to detect if a file has changed without reading the entire contents of the file.
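
The size-and-modification-time check can be sketched in a few lines of Python. This only illustrates the general technique; it is not Mercurial's dirstate code, and the names `FileState`, `snapshot`, and `looks_modified` are hypothetical.

```python
import os
from typing import NamedTuple


class FileState(NamedTuple):
    """What is remembered about a file: size in bytes and mtime in ns."""
    size: int
    mtime_ns: int


def snapshot(path):
    """Record the current size and modification time of a file."""
    st = os.stat(path)
    return FileState(st.st_size, st.st_mtime_ns)


def looks_modified(path, recorded):
    """True if size or mtime differs from the recorded state.

    The file contents are never read, so this is cheap even for very
    large files.
    """
    return snapshot(path) != recorded
```

The trade-off is that a change which preserves both the size and the modification time goes unnoticed by a check like this, which is the price of not hashing the whole file on every status query.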

5. See also

There are a number of older extensions for managing large files. This extension is a descendant of the BfilesExtension and is now the recommended way to handle such files. Alternatives are BigfilesExtension and SnapExtension.


CategoryBundledExtension

LargefilesExtension (last edited 2017-01-10 15:03:55 by CharlesB)