Problem Statement

Cloning and pulling (large) repositories can consume a large amount of CPU on servers. In the face of high client volume, this can lead to resource exhaustion and service unavailability.

These operations can consume large amounts of CPU because every clone or pull that transfers changeset data results in the server creating a changegroup bundle of the data to be transferred. This operation is expensive because the producer has to read revlogs and construct new delta chains from the content. It is essentially re-encoding the revlog on the fly. For revlogs with large entries (such as manifests with 100,000 files) or large diffs, this can take a lot of CPU (and even I/O).

Solution: Pre-Generated Bundles

The inherent problem is that servers are "rebundling" repository data for every clone or pull operation. What if, instead of generating bundles at request time, the server pre-generated the bundles and saved them somewhere? When a client connects, it could obtain the contents of an appropriate bundle, apply it, then pull the changes made since the bundle was created.

This solution works because repository data is generally append-only and immutable. This means that clones and subsequent pulls can effectively be modeled as replays of a linear log of data. Data is strictly additive, so bundling a snapshot of the repository and then transferring the delta since that bundle is effectively equivalent to hg unbundle + hg pull.
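The equivalence can be sketched with a toy model. All names here are illustrative (changesets are opaque strings, not real revlog entries); the point is that, with append-only data, a snapshot plus an incremental pull reproduces a full clone:

```python
# Toy model of the append-only argument above. A real repository stores
# changesets in revlogs, but the append-only property is what makes the
# equivalence hold.

def clone(server_changesets):
    """A regular clone transfers every changeset from the server."""
    return list(server_changesets)

def clone_via_bundle(bundle_snapshot, server_changesets):
    """Apply a pre-generated bundle, then pull only what came after it."""
    local = list(bundle_snapshot)
    have = set(local)
    # The incremental pull: everything the server has that the bundle lacks.
    local.extend(cs for cs in server_changesets if cs not in have)
    return local

server = ["rev0", "rev1", "rev2", "rev3"]
snapshot = ["rev0", "rev1"]  # bundle generated after rev1 landed

assert clone_via_bundle(snapshot, server) == clone(server)
```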

This solution saves a significant amount of CPU on the server because reading a static file off disk (or redirecting elsewhere) is almost certainly much cheaper than rebundling.

Methods of Serving Pre-Generated Bundles

Inline bundle2 Part

In this solution, when a clone or pull is requested, the server takes inventory of the bundles that are available. If an appropriate one is present, the server reads its data and inserts it directly into the bundle2 reply; this is simply streaming bits off disk or from elsewhere. The server then calculates which changesets aren't in the bundle and constructs a new bundle2 part containing those. From the client's perspective, it (likely) receives multiple changegroup parts.
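A minimal sketch of that server-side decision, with illustrative names (none of this is Mercurial's actual API) and changesets modeled as plain sets:

```python
def build_reply_parts(bundles, outgoing):
    """Choose a pre-generated bundle covering part of ``outgoing``, then
    compute the leftover changesets that need a freshly built part.

    bundles:  list of (path, set_of_changesets) pairs for static bundles.
    outgoing: set of all changesets the client is missing.
    """
    parts = []
    covered = frozenset()
    # Only bundles whose contents the client actually needs are usable.
    usable = [(path, cs) for path, cs in bundles if cs <= outgoing]
    if usable:
        # Prefer the bundle that covers the most of the outgoing data.
        path, covered = max(usable, key=lambda b: len(b[1]))
        parts.append(("changegroup-from-file", path))  # streamed off disk
    remainder = outgoing - covered
    if remainder:
        # Only this (hopefully small) part requires CPU-heavy bundling.
        parts.append(("changegroup-generated", sorted(remainder)))
    return parts
```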

Pros:

  * Transparent to the client: a single connection and a single bundle2 reply, with no new client-side behavior required beyond bundle2 support.

Cons / Oddities:

  * The server still transfers every byte itself, so only bundle generation CPU is saved, not bandwidth or the I/O of serving the data.
  * The client receives multiple changegroup parts in a single reply, which is unusual.

External Bundle Download

In this solution, instead of the server sending the pre-generated bundle data inline with the bundle2 reply, it instead advertises a URL (and likely metadata) of a bundle to fetch. The client sees the URL, fetches and applies it, then gets the incremental data from the server.

There are a few variants of this, which will be explained shortly. However, they share some common concerns: the external host must be reachable and trusted by the client, and the advertised bundles must actually be available when the client asks for them.

Inline Followup Variant

The server sends the bundle URL in a bundle2 part and then sends a changegroup part containing the data added since the bundle was generated.

Client-side support for this exists in Mercurial today via the remote-changegroup bundle2 part. However, server-side code for generating these parts and the subsequent changegroup parts is not implemented in core.

Pros:

  * Bulk data transfer is offloaded to the host serving the bundle (e.g. a CDN), while the client still sees a single logical operation.
  * The remote-changegroup bundle2 part already provides the client-side plumbing.

Cons:

  * Server-side support for emitting these parts is not implemented in core.
  * The server must know which changesets each advertised bundle contains in order to compute the follow-up changegroup part.

Disconnect and Return Variant

The server sends a URL. The client detaches, fetches and applies the bundle. Then the client reconnects to the server and does the equivalent of an hg pull (if necessary).

This could be implemented in a few different ways:

  1. Client issues getbundle with capabilities saying it can apply remote hosted bundles. Bundle URL part received. Client disconnects. Applies bundle. Starts over.
  2. Server advertises that it hosts bundles. Client requests a bundle, disconnects, applies bundle, and then reconnects for the pull.

These are very similar. But in #1 the bundles are integrated into the "getbundle" wire protocol command, while in #2 there is likely a separate wire protocol command or "listkeys" namespace advertising bundles, which the client can query directly for bundle info.
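The client flow shared by both variants can be sketched as follows. The network operations are stand-in callables, not real wire protocol code:

```python
def clone_with_hosted_bundle(fetch_manifest, download, apply_bundle, pull):
    """Disconnect-and-return client flow, with the I/O steps injected
    as callables so the control flow is the only thing modeled here."""
    entries = fetch_manifest()       # e.g. a "bundles"-style wire command
    if entries:
        url = entries[0]             # content negotiation would happen here
        data = download(url)         # client has detached from the server
        apply_bundle(data)
    pull()                           # reconnect; incremental data (if any)
```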

Pros:

  * Simple server-side behavior: the bundle fetch and the follow-up pull are independent operations, and the follow-up pull can reuse existing discovery logic.
  * The server does not need to track exactly which changesets each bundle contains.

Cons:

  * Requires multiple connections and round trips, with more orchestration on the client.

Mozilla's Implementation

Mozilla has implemented support for static bundle serving and typically serves >1TB/day using this model, saving hundreds of hours of CPU time on servers.

It is implemented as a Mercurial extension, called bundleclone, that is installed on both the client and server.

The server advertises a "bundles" capability and makes a "bundles" wire protocol command available. When hg clone is performed, the client calls the "bundles" wire protocol command (if available), fetches a manifest of available bundles, fetches and applies an appropriate bundle, then does the equivalent of hg pull.

The bundles manifest is simply a static file served from the .hg directory on the server. The file contains a list of URLs and key-value metadata. Here is an example manifest:

https://hg.cdn.mozilla.net/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.gzip.hg cdn=true requiresni=true compression=gzip
https://hg.cdn.mozilla.net/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.stream.hg cdn=true requiresni=true stream=revlogv1
https://hg.cdn.mozilla.net/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.bzip2.hg cdn=true requiresni=true compression=bzip2
https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.gzip.hg ec2region=us-west-2 compression=gzip
https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.gzip.hg ec2region=us-east-1 compression=gzip
https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.stream.hg ec2region=us-west-2 stream=revlogv1
https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.stream.hg ec2region=us-east-1 stream=revlogv1
https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.bzip2.hg ec2region=us-west-2 compression=bzip2
https://s3-external-1.amazonaws.com/moz-hg-bundles-us-east-1/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.bzip2.hg ec2region=us-east-1 compression=bzip2

In this case, the server has made available nine bundle URLs: a gzip, bzip2, and stream variant of the bundle, each served from one of three locations (a CDN, S3 us-west-2, or S3 us-east-1). 3 types × 3 locations = 9 URLs.
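Parsing such a manifest is straightforward: one URL per line, followed by space-separated key=value attributes. A minimal sketch (not the extension's actual code):

```python
def parse_bundle_manifest(text):
    """Parse a bundleclone-style manifest into (url, attrs) entries,
    preserving line order (the server lists its preferred entry first)."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split()
        url = fields[0]
        attrs = dict(field.split("=", 1) for field in fields[1:])
        entries.append((url, attrs))
    return entries

# Abbreviated version of the example manifest above.
manifest = """\
https://hg.cdn.mozilla.net/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.gzip.hg cdn=true requiresni=true compression=gzip
https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-central/d6ea652c579992daa9041cc9718bb7c6abefbc91.gzip.hg ec2region=us-west-2 compression=gzip
"""

entries = parse_bundle_manifest(manifest)
assert entries[0][1]["requiresni"] == "true"
assert entries[1][1]["ec2region"] == "us-west-2"
```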

The client performs basic content negotiation to select the most appropriate bundle. By default, the first entry is used. However, clients can define "preferences" for certain attributes and their values in their hgrc. For example, clients in AWS us-west-2 will prefer ec2region=us-west-2 so they fetch from a nearby server and get very fast transfer speeds.

There is also a provision for avoiding hosts that require SNI. Python versions before 2.7.9 do not support the SNI TLS extension. hg.cdn.mozilla.net currently requires SNI, so the server advertises this fact and the client ignores that entry unless it supports SNI.
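The negotiation described above (filter out SNI-requiring hosts when unsupported, then rank by configured attribute preferences, falling back to manifest order) could look roughly like this. It is a sketch under those assumptions, not bundleclone's actual code:

```python
def select_bundle(entries, preferences, supports_sni=True):
    """Pick a bundle entry from a parsed manifest.

    entries:     list of (url, attrs) tuples in server-preferred order.
    preferences: list of (key, value) pairs, most important first,
                 e.g. [("ec2region", "us-west-2")] from the client's hgrc.
    """
    if not supports_sni:
        # Old Python (< 2.7.9) cannot do SNI; skip hosts that require it.
        entries = [e for e in entries if e[1].get("requiresni") != "true"]

    def score(entry):
        attrs = entry[1]
        # Lower is better: 0 where the entry matches a preference, 1 where
        # it does not. min() is stable, so ties keep manifest order.
        return [0 if attrs.get(key) == value else 1
                for key, value in preferences]

    return min(entries, key=score) if entries else None
```

With no preferences configured, this degenerates to "use the first usable entry", matching the default behavior described above.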

Mozilla's implementation is only suitable for initial clones. It makes no attempt to record which changesets are in the bundles. The server implementation and content negotiation are limited to attributes like the type and location of the bundle, not its contents. The existing discovery logic around hg pull handles the rest.

A limitation of this approach is that clients are doing the content negotiation. For attributes like compression type, clients probably should be in control of whether they want a small bzip2 bundle or a large but fast "stream" bundle. For location, however, the server may want a say. For example, the server could detect that the client IP is from a known data center and only hand out URLs for that data center. Mozilla may implement this server-side detection eventually. For now, we rely on clients making sensible choices. And the CDN is a reasonable default for most.


CategoryDeveloper CategoryNewFeatures