Note:
This page is primarily intended for developers of Mercurial.
Status: Project
Main proponents: Rémi Chaintron
LFS Plan
LFS aims to provide support for large binaries in Mercurial. Users should be able to opt in to and out of LFS without issues. It is designed to be unobtrusive and to provide a more seamless and more stable large file interface than hg largefiles.
1. Key Concept
The key concept is to store, in regular filelogs, the metadata needed to obtain files from a lookaside server. We use deliberate hash mismatches to distinguish stored metadata from content: the SHA-1 of the original content is stored as the node id of the filelog entry, while the metadata is stored as the entry's content. This leads to a hash mismatch during 'hg update', which we catch (a similar approach is used for censored nodes). If we determine that the content is valid metadata, we process it appropriately (either return it, or return the actual file content from the blob store).
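A minimal sketch of the detection logic (standard library only; parse_metadata and fetch_from_blobstore are hypothetical stand-ins for the extension's deserializer and blob store):

    import hashlib

    def hgnode(text, p1, p2):
        # Mercurial computes a filelog node as the SHA-1 of the sorted
        # parent nodes followed by the text.
        s = hashlib.sha1(min(p1, p2) + max(p1, p2))
        s.update(text)
        return s.digest()

    def read_filelog_entry(stored_text, node, p1, p2):
        if hgnode(stored_text, p1, p2) == node:
            return stored_text  # ordinary entry: hash matches
        metadata = parse_metadata(stored_text)  # hypothetical deserializer
        if metadata is not None:
            return fetch_from_blobstore(metadata)  # hypothetical blob fetch
        raise Exception('integrity check failed')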
2. Architecture
2.1. Server
LFS stores large files on a lookaside server rather than within the Mercurial repository. This allows for more efficient blob storage and for alternative backends that are better suited to it, such as Amazon S3. The current consideration is to reuse the Git LFS HTTP protocol. This will facilitate the use of existing server implementations and the conversion of repositories between Git and Mercurial.
The storage backend will be exchangeable and configurable in the client's hgrc.
2.2. Mercurial Core
The changes to Mercurial core involve merging the revlog._checkhash and revlog.checkhash methods into a single operation: check the hash of (text, p1, p2) and return the text if the hash matches. This behavior can be overridden in the extension: if the hashes do not match, the text is passed through the metadata deserializer, and if parsing succeeds, the extension returns the actual file data or the parsed metadata.
Another detection mechanism can use flags in the filelog to mark large files. For example, the checkhash wrapper can probe for this flag and decide whether to attempt to deserialize the metadata in the filelog (and then return the blobstore content or the deserialized metadata), or to fall back to the original checkhash function.
Since checkhash is performed every time the revlog is accessed, we will ensure that its performance is not affected by the deserialization mechanism. An easy improvement is to check the size of the entry before attempting to deserialize it: for example, it would be reasonable (but also configurable) to assume that metadata never exceeds 10KB.
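A sketch of what the wrapper could look like (the merged checkhash signature is an assumption, and deserialize and resolvecontent are hypothetical helpers; the size cap is the configurable limit discussed above):

    from mercurial import error, extensions

    MAX_METADATA_SIZE = 10 * 1024  # assumed configurable cap on metadata size

    def wrappedcheckhash(orig, self, text, node, *args, **kwargs):
        try:
            return orig(self, text, node, *args, **kwargs)
        except error.RevlogError:
            # Entries larger than any plausible metadata file cannot be
            # LFS pointers, so treat the mismatch as genuine corruption.
            if len(text) > MAX_METADATA_SIZE:
                raise
            metadata = deserialize(text)  # hypothetical metadata parser
            if metadata is None:
                raise
            return resolvecontent(metadata)  # hypothetical: blob or metadata

    # installed e.g. via
    # extensions.wrapfunction(revlog.revlog, 'checkhash', wrappedcheckhash)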
2.3. LFS Extension
The LFS extension will perform all of its operations by wrapping core Mercurial's filelog methods, commands, and the checkhash method.
2.3.1. Commit Path
When a large file is committed to Mercurial, the extension writes its contents to the local blob store (cache) and commits a metadata file to the filelog (specifying the file size, hash, object identifier, etc.). The node id stored in the filelog is still the one obtained by hashing the original text together with the two parent revisions.
The local blob store can be considered both a cache and a staging area for blobs to be pushed to the remote blob store.
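A sketch of the commit-side wrapping (the addrevision signature is simplified; THRESHOLD, localblobstore, serialize and hgnode are hypothetical names, hgnode being the node helper sketched under Key Concept):

    import hashlib

    def wrappedaddrevision(orig, self, text, transaction, link, p1, p2):
        if len(text) < THRESHOLD:  # small file: take the normal path
            return orig(self, text, transaction, link, p1, p2)
        oid = hashlib.sha256(text).hexdigest()
        localblobstore.write(oid, text)  # cache locally, push later
        metadata = serialize(oid=oid, size=len(text))  # hypothetical
        # Commit the metadata, but keep the node computed from the
        # original text so the deliberate hash mismatch is preserved.
        return orig(self, metadata, transaction, link, p1, p2,
                    node=hgnode(text, p1, p2))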
2.3.2. Update, Cat and other Read Paths
When a read operation is performed on the filelog, the checkhash method raises an exception. By wrapping checkhash, we can catch the exception and, depending on the hg command that was run, decide whether to return the file contents from the blobstore (e.g. on hg cat) or the actual metadata file (e.g. on hg debugdata). The easiest way to select the behavior needed by the current command (read from the blobstore vs. pass the metadata through) is to use a default callback that reads from the blobstore, and to install a special callback that returns the raw metadata in the wrapper for hg debugdata.
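A sketch of the callback mechanism (all names are illustrative):

    # Default behavior: resolve LFS metadata to the real file contents.
    def readfromblobstore(metadata):
        return localblobstore.read(metadata['oid'])  # hypothetical cache read

    resolvecallback = [readfromblobstore]

    def wrappeddebugdata(orig, ui, repo, *args, **kwargs):
        # hg debugdata wants the raw filelog entry, so temporarily install
        # a callback that passes the metadata through unmodified.
        resolvecallback[0] = lambda metadata: metadata['raw']
        try:
            return orig(ui, repo, *args, **kwargs)
        finally:
            resolvecallback[0] = readfromblobstore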
On hg update [rev], it is also important to check whether all the required blobs are available in the local cache, and to retrieve the missing ones from the remote blob storage. Performing this operation on update, as opposed to pull, ensures that disk space is used only by the files immediately needed by the user (i.e. the requested revision, as opposed to the whole history).
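A sketch of the prefetch step on update (tometadata, localblobstore and remoteblobstore are hypothetical; iterating a changectx yields the files present in that revision):

    def prefetch(repo, ctx):
        # Collect blobs referenced by the target revision that are not
        # yet cached locally, then fetch them in a single batch.
        missing = []
        for f in ctx:
            metadata = tometadata(ctx[f])  # hypothetical: None for small files
            if metadata and not localblobstore.has(metadata['oid']):
                missing.append(metadata)
        if missing:
            remoteblobstore.readbatch(missing, localblobstore)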
2.3.3. Pull/Push Paths
On pull/push, the hash mismatch is simply ignored and the metadata is returned, since the filelog should be transferred between repositories as-is.
However, before pushing the filelog, all the new blobs must be pushed from the local cache to the remote blobstore; only after this operation has succeeded can we safely push the new revlog contents to the remote repository.
Transferring blobs between the local cache and the remote storage can be parallelized using workers.
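Mercurial has its own worker abstraction for this kind of parallelism; the standard-library sketch below only illustrates the shape of the parallel upload (localstore and remotestore are hypothetical):

    from concurrent.futures import ThreadPoolExecutor

    def uploadblobs(oids, localstore, remotestore, maxworkers=4):
        # Upload independent blobs concurrently; the push is aborted if
        # any single upload fails.
        def upload(oid):
            remotestore.write(oid, localstore.read(oid))
        with ThreadPoolExecutor(max_workers=maxworkers) as pool:
            futures = [pool.submit(upload, oid) for oid in oids]
            for future in futures:
                future.result()  # re-raises the first upload error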
The default behavior of the LFS extension is to use the Git LFS HTTP protocol when talking to the storage server. The benefit of this approach is that HG-LFS keeps compatibility with GitHub's reference implementation, leading to loosely coupled components.
2.3.4. Merge/Diff Paths
Since blobs do not produce meaningful results when diffed and merging them is usually impossible, the extension will simply report that the large files differ and require the user to select version A or B, bypassing the usual Mercurial behavior.
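A possible shape for that prompt, as a sketch only (the integration point with Mercurial's filemerge machinery is left out; fcd and fco are the local and other file contexts):

    def mergelargefile(ui, fcd, fco):
        # Diffing opaque blobs is not meaningful, so ask the user to pick
        # one side instead of running the usual merge tools.
        index = ui.promptchoice(
            "large file '%s' has diverged\n"
            "keep (l)ocal or take (o)ther?"
            "$$ &Local $$ &Other" % fcd.path(), 0)
        return fcd if index == 0 else fco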
2.3.5. Configuration
The configuration is stored in .hgrc and differs from client to client (e.g. people have local proxies, prefer different transports/protocols, etc.).
    [lfs]
    # The threshold between regular file and largefile
    # TODO: we should be able to specify different
    # thresholding mechanisms, such as judging by MIME type
    threshold=1M
    # Where the local blob cache is stored
    blobstore=cache/localblobstore
    # Where the HTTP endpoint is located
    # For testing, using 'dummy' will place the blobs in /tmp/hgstore
    remotestore=http://s3.amazonaws.com
    # [Optional] User authentication for the HTTP endpoint
    # Currently, Git-LFS recommends the use of the HTTP Basic authentication scheme
    remoteuser=user
    remotepassword=passwd
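The extension could read these settings along these lines (a sketch; ui.configbytes understands size suffixes such as '1M', and the default shown for blobstore is an assumption):

    def lfsconfig(ui):
        return {
            'threshold': ui.configbytes('lfs', 'threshold'),
            'blobstore': ui.config('lfs', 'blobstore', 'cache/localblobstore'),
            'remotestore': ui.config('lfs', 'remotestore'),
            'remoteuser': ui.config('lfs', 'remoteuser'),
            'remotepassword': ui.config('lfs', 'remotepassword'),
        }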
3. Metadata format
To allow easier migration between repositories that already use Git-LFS and HG-LFS, the metadata format proposed by Git-LFS will be used. It explicitly allows extensions to alter it and add other fields as necessary, which will be required in further iterations on the HG-LFS extension.
    version https://git-lfs.github.com/spec/v1
    oid sha256:27c0a92fc51290e3227bea4dd9e780c5035f017de8d5ddfa35b269ed82226d97
    size 17
    scmid hg:faceb00cfaceb00cfaceb00cfaceb00cfaceb00c
Explanation of the fields:
- version allows Git-LFS and HG-LFS to determine which fields are expected to be present. By default, version, oid and size are required.
- oid is the object identifier; the Git-LFS standard allows any “[hash algo]:[hash]” string to be used as the id, but the reference Git-LFS server implementation currently supports only sha256.
- size is the length of the blob in bytes.
- scmid is the Mercurial node id (the 40-character hexadecimal SHA-1); it can be replaced by “git:shasum”, although the oid already represents the Git identifier of the node.
Other possible fields:
- url specifies where the blob can be retrieved from by other clients.
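A sketch of a parser for this format (standard library only; key/value pairs are separated by a single space, one per line, as in the example above):

    def parsepointer(text):
        # Returns a dict of pointer fields, or None if the text does not
        # look like valid Git-LFS metadata.
        metadata = {}
        for line in text.splitlines():
            key, sep, value = line.partition(' ')
            if not sep or not key or not value:
                return None
            metadata[key] = value
        if metadata.get('version') != 'https://git-lfs.github.com/spec/v1':
            return None
        if 'oid' not in metadata or not metadata.get('size', '').isdigit():
            return None
        return metadata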
4. Diagrams
4.1. Commit Workflow
- User wants blob to be stored into a new revision
- Core addrevision is intercepted by the extension
- The file is stored into the cache (local blobstore)
- Metadata uniquely identifying the original file contents is returned to core
- Core stores the metadata into the filelog, under the original node hash
4.2. Push Workflow
- Core detects hash mismatch in the filelog
- The extension intercepts the hash mismatch
- The actual file content is searched for in the local cache
- The BLOB is retrieved from the local cache
- The actual file data is pushed to the HTTP storage
- Storage succeeded/failed for all blobs
- The actual push operation is blocked until the upload over HTTP completes.
4.3. Update Workflow
http://imgur.com/hB4ga61 (same diagram with "push" on purpose)
- Core detects hash mismatch in the filelog
- The extension intercepts the hash mismatch
- The actual file content is searched for in the local cache
- The BLOB cannot be found in the local cache
- A request to download the file is made to the HTTP endpoint
- The file is retrieved and stored in the local cache
- The actual file data is returned to hg core, to be placed in the filesystem.
4.4. Pull Workflow
- Core detects a hash mismatch after retrieving the filelog from remote repo
- The extension intercepts the hash mismatch
- A no-op is performed: the hash mismatch is normal, and the actual file contents are retrieved on update, to save space