5.6Sprint/Notes - Mercurial

Mercurial sprint 5.6 (notes from https://mypads.framapad.org/mypads/?/mypads/group/octobus-public-5d3rw470w/pad/view/mercurial-sprint-gb2akv7pm)
Agenda: https://docs.google.com/spreadsheets/d/1-PhQvYRmUk2LRrpKZg32fktAsfBRN93WC1GKCnDK7YY/edit#gid=0

Day 1 (Thursday 05)

Attendees:

    Anton Shestakov 

    Augie Fackler

    Charles Chamberlain

    Danny Hooper

    Georges Racinet

    Gregory Szorc

    Jeff Sipek

    Jörg Sonnenberger

    Kyle Lippincott

    Manuel Jacob

    Martin von Zweigbergk

    Mathias de Maré

    Matt Harbison

    Pierre-Yves David

    Pulkit Goyal

    Raphaël Gomès

    Sushil Khanchi

    Thomas Klausner (from NetBSD)

    Valentin Gatien Baron

    Yuya Nishihara


Dropping Python 2.7

Who gets hurt? / Who is using 2.7?

    OpenVMS

    TortoiseHg (Corporate Windows users)

    Random bytes/str issues

    Windows installers complicated

    macOS DMG issues

    Repository structure is a mess

    CentOS 7

    Action: change our RPMs to use Python 3

    Windows

    Non-ascii filenames are problematic

    pip install pager doesn't work in msys

    make install doesn't work

    WSL is wonky

    NTFS per-directory case sensitivity flag in WSL

    Filesystem I/O performance issues

    Kallithea dropped Python 2 support

    hg-subversion

    It is probably out of date; not ported to modern Mercurial

    FreeBSD builds hg-subversion for Python 3. Unsure if it works.

    hg-fastimport: Jörg going to look at the porting

    Conclusions:

    2.7 blockers are working TortoiseHg MSI and DMG installers

    Come back to this later today to decide!

    Other extensions that do not (fully) support python 3.x:

    hglist (https://alastairs-place.net/projects/hglist/) (Does not work with current mercurial anyway)

    Mozilla version-control-tools repo (https://hg.mozilla.org/hgcustom/version-control-tools/, https://mercurial.selenic.com/wiki/MozillaPushlogExtension)

    hgnested (https://pypi.org/project/hgnested/) (Does not work with current mercurial anyway)


Testing Approach

    https://www.mercurial-scm.org/wiki/IntegrationTestsPlan

    Yuya: lots of tests don't care about UI; internals can be tested without invoking `hg`

    Should we adopt pytest?

    Python ecosystem is using pytest

    Alphare: As long as we have a "code style" for automagic fixtures, I'm +100 for using pytest

    Current tests are complex, which hurts both performance and maintainability (revset tests used as an example); the bad HTTP tests were also considered -> Jörg and Georges will look at that

    Conclusion:

    Georges to send a patch series vendoring pytest and demonstrating its use/integration with run-tests.py, for at least 1 test, to demonstrate how it works.


Command Namespaces

    https://www.mercurial-scm.org/wiki/5.6Sprint#More_command_.22namespace.22

    debug*, perf* commands

    We don't want users running debug* commands

    Question: create a new namespace (name TBD - possibly "admin") where we consolidate some debug* and regular namespace commands to aggregate "repo administration" commands. Upgrade, rollback, verify, debugstrip, etc.

    Proposal: Add - between namespace name and command

    Additional namespaces possible (for example one for experimental commands)

    Conclusion:

    Change canonical commands to use - between words; automatically register non-hyphen aliases to existing commands

    Also convert arguments to use -???

    Some users might also prefer to use _

    New namespaces are a good idea

    Use "maintenance-" for repo admin / maintenance commands?
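
    The alias conclusion above can be sketched as follows. This is only an illustration: the plain dict used as a command table and the command names in it are hypothetical and do not reflect how Mercurial's real command table is keyed.

```python
def register_hyphen_aliases(table):
    """Return a copy of ``table`` where every hyphenated command name also
    answers to its non-hyphenated spelling (e.g. "debug-strip" -> "debugstrip").

    ``table`` is a hypothetical mapping of canonical command name -> function.
    """
    result = dict(table)
    for name, func in table.items():
        alias = name.replace("-", "")
        if alias == name:
            continue  # no hyphen in the canonical name, nothing to register
        # Refuse to silently shadow a different command with the same spelling.
        if alias in table and table[alias] is not func:
            raise ValueError("alias %r collides with an existing command" % alias)
        result[alias] = func
    return result


def debug_strip():
    return "strip"


def debug_rebuild_dirstate():
    return "rebuild"


# Hypothetical canonical names using "-" between words.
commands = register_hyphen_aliases({
    "debug-strip": debug_strip,
    "debug-rebuild-dirstate": debug_rebuild_dirstate,
})
```

    With this, both spellings resolve to the same function, so existing scripts that call the old name keep working.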


IRC / Chat Futures

    IRC is ancient. Should we ditch IRC?

    We are losing potential contributors by using IRC.

    Matrix and Zulip are interesting.

    Replacement needs:

    persistence (no need to always be online)

    anonymous access or very very very easy registration process (e.g. using already existing 3rd-party account or credentials from something in our project infrastructure, like phab)

    bridge to/from IRC

    Conclusion: Let's play with Matrix with IRC gateway -- see https://app.element.io/#/room/#freenode_#mercurial:matrix.org


Website

    Raphael has volunteered to modernize website; big problem is modernizing content

    Tortoise domain expiring in 2021; unsure of future

    The wiki is a graveyard; no review process; no commenting

    Replace wiki with Sphinx-based documentation?


Day 2 (Friday 06)

Tags

    Usability issue: hg clone -r tag doesn't get the tag, just the tagged changeset

    Scaling issue: consider every head

    Weird conflict resolutions using non-stable ordering, merge conflicts in .hgtags are annoying, and other bugs

    Alternative implementations are used by some companies:

    Jane Street uses a dedicated repo for tags

    Nokia also has a separate approach

    Idea: changing the hash function will break all old clients anyway, so use that opportunity to change the tags system together with the hash transition

    We could store the tag information inside the changeset

    Implementing some sort of caching mechanism for side-data (extra?) will fix performance issues for new tags, but also topics, etc.

    We need to take into account people not using evolve

    Tag namespaces for monorepos (also maybe narrow/sparse information)?


Experimental VCS (Martin) called "JJ" (placeholder)

    Written in Rust

    Compatible with Git repos

    Separated into two crates: lib and cli

    All filepaths are UTF-8 (Rust String)

    Meant as a client for Git repos with revsets, evolution, templates, etc.

    Split between ChangeID and Revision ID (follows the change across amends)

    Commands:

    describe: to change metadata

    close: equivalent to commit?

    diff: equivalent to diff

    co/checkout: equivalent to checkout

    prune: equivalent to prune?

    rebase: equivalent to rebase

    restore: equivalent to revert

    merge: equivalent to merge

    op log: remembers all operations you've done to the repo

    --at-op flag to see the state of the repo at any given operation

    op undo/op restore

    working copy changes stay where they are, need to "rebase the working copy" to bring them along on some other revision, like hg update does by default

    unresolved conflicts can be committed, and the resulting revision can be used further. Unresolved conflicts are materialized as conflict markers on checkout. Reordering commits can result in intermediate conflicts but jj knows the overall diff is the same as before and so the final rev has no conflicts

    allows conflicts to be resolved collaboratively; rebase doesn't have to store state

    merges are represented as the diff from the merged parents, so you can rebase them and add/remove parents

    op log is motivated by the need to support a replication-friendly repository format

    Discussion about what could be included into mercurial

    backwards compatibility would hold us back? but revlogs are not really stable, the command line interface and network protocol are the parts that are most stable
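
    The op log idea described above (every operation is remembered, the repo can be viewed at any operation, and operations can be undone) can be sketched like this. It is a toy model, not jj's actual data model: the class name, the use of a set of head names as "repo state", and the exact undo semantics are all assumptions made for illustration.

```python
class OpLog:
    """Append-only log of operations; each entry records the resulting state."""

    def __init__(self):
        # Operation 0 is the empty repository.
        self._ops = [("init", frozenset())]

    def record(self, description, heads):
        """Append an operation and return its id."""
        self._ops.append((description, frozenset(heads)))
        return len(self._ops) - 1

    def at_op(self, op_id):
        """State of the repo as of the given operation (in the spirit of --at-op)."""
        return self._ops[op_id][1]

    def head(self):
        """Current state, i.e. the state after the latest operation."""
        return self.at_op(len(self._ops) - 1)

    def restore(self, op_id):
        """Go back to an earlier state by recording a new operation."""
        return self.record("restore to op %d" % op_id, self.at_op(op_id))

    def undo(self):
        """Undo the latest operation; the log itself is never rewritten."""
        latest = len(self._ops) - 1
        if latest == 0:
            raise ValueError("nothing to undo")
        return self.record("undo op %d" % latest, self.at_op(latest - 1))

    def __len__(self):
        return len(self._ops)


log = OpLog()
commit_op = log.record("commit A", {"A"})
rebase_op = log.record("rebase A -> A'", {"A'"})
undo_op = log.undo()  # the state is back to {"A"}, but the rebase stays in the log
```

    Because undo and restore only append, the log stays append-only, which is one way such a design could be friendly to replication.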


Treating commits like working copy

    https://www.mercurial-scm.org/wiki/RevisionAsWDirPlan

    Turned into a discussion about ambiguous --rev meaning

    Idea: --into <rev> for describing revision to mutate on; --from <rev> for revision to retrieve from

    List all existing commands that take --rev (and also others) to check if the new flags would apply to existing commands

    hg status -r doesn't map to --rev

    Use different commands for very different user-facing results instead of adding a (non-advanced) flag to an existing command. (like histedit/rebase -i)


Replication Friendly Storage

    We aren't replication friendly today

    We don't use fsync properly today, so we are not even compatible with POSIX filesystems (e.g. NFS)

    We need to formalize storage interfaces and have a continuously tested alternative storage backend to keep them working; this would unblock viable alternative storage backends.



Day 3 (Saturday 07)

--auto-shelve flag

    What happens on conflict? And conflicts with untracked files

    Depends on the next discussion (phase-based shelve)


Phase-based shelve

    Using this, we re-use commit manipulation features already in hg instead of having a "side-channel" for shelves

    internal phases are 95% ready: missing the upgrade/downgrade repo (removing all internal except shelves, etc.)

    Performance needs to be worked on and understood before making the switch

    Removes the need for stripping the intermediate stages


State of Rust in hg

    effort to have a Rust executable, rhg, that should eventually become the entry point, and which runs Python for the complex commands (which is most of them)

    would include chg, so it's much friendlier to enable, and could even be enabled by default

    we'd like to have status in pure rust, but where do we stop?

    concern about rust as a hard dependency in BSD. Thoughts from previous sprints: no hard dependency on rust; however, some (non-mandatory) features/performance work could be rust only

    Sometimes it's a bit more complicated, as the persistent-nodemap feature is faster than equivalent functionality in C+python, and while it's possible to use a python-only hg with persistent-nodemap, it's much slower than the equivalent with C+python.

    Greg claims that people with big monorepos should be running on archs well supported by rust (objections about powerpc cpus). May not even need a pure python implementation for performance-only changes. People can clone a repo to remove a rust requirement on a dirstate, for instance. Wouldn't work with nfs repos.

    The network protocol must not require rust, but the local disk format can, and probably should, as it's better use of resources.

    Jörg concerned about significant performance degradation for people not using rust

    We don't really want much new C code at this point, especially code that's not fuzzed

    augie makes the following claim "if we assume we ship pyoxidized binaries on mercurial-scm.org, a format that's used by default by that executable should have a python implementation", not clear if people agree.

    To address the conflicting requirements of: "1. hg repos can be copied around and work, 2. work with reasonable performance if you don't have rust, 3. be fast when you have rust", one suggested option is that hg init may add requirements specific to the hg version you're running (so persistent-nodemap, for instance, if you're running rust), but there would be `hg init --universal` or something similar to prevent these installation-specific requirements

    Greg wants the default experience to be the fastest that the client can provide, because default matters etc

    Conclusions:

    Mercurial will not require Rust on any platform

    Rust is an optional requirement; Mercurial features can be implemented in Rust; But end-user functionality (e.g. workflow hg commands) will not require Rust.

    The Mercurial experience on large repos without Rust may not be great: this is due to the limits of performance with Python

    The official Mercurial distributions will ship with Rust code / features on platforms where Rust is supported

    hg init produces the best possible experience (read: set of enabled requirements) for whatever the running hg client supports

    If there is a Rust feature supported that provides benefit, it is enabled by default

    There is implicit BC here where we no longer guarantee that an hg X.Y can read a repo on machine B that was created on machine A

    If a Mercurial install supports N>1 implementations of given functionality (e.g. Rust versus Python dirstate) and 1 implementation is slower than the other, it should use the faster one by default and possibly emit a warning and/or require a config option to opt in to the known slower implementation.

    We don't have a blanket policy of whether a non-Rust implementation of a feature is required

    We don't require a 2nd implementation at initial implementation time.

    At feature enable-by-default time, we have a formal review of whether a 2nd implementation should be required.

    Having the ability to downgrade the repository format is greatly preferred.
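
    Two of the conclusions above (hg init enables what the running client supports, and the faster implementation wins by default) can be sketched as below. This is a sketch under assumptions: the function names, the `universal` parameter modelled on the suggested `hg init --universal`, and the requirement sets are illustrative and not Mercurial's actual code.

```python
import warnings

# Illustrative requirement sets (assumption: not the real, complete lists).
BASELINE_REQUIREMENTS = {"revlogv1", "store"}
RUST_BENEFITING_REQUIREMENTS = {"persistent-nodemap"}


def init_requirements(have_rust, universal=False):
    """Pick the requirements a newly created repository would get.

    ``universal`` models the suggested opt-out that avoids
    installation-specific requirements.
    """
    requirements = set(BASELINE_REQUIREMENTS)
    if have_rust and not universal:
        requirements |= RUST_BENEFITING_REQUIREMENTS
    return requirements


def choose_implementation(speeds, opt_in_slow=None):
    """Choose among the implementations available in this install.

    ``speeds`` maps implementation name -> relative speed (higher is faster).
    The fastest one is used unless the user explicitly opts in to another,
    in which case a warning is emitted.
    """
    fastest = max(speeds, key=speeds.get)
    if opt_in_slow is None or opt_in_slow == fastest:
        return fastest
    if opt_in_slow not in speeds:
        raise KeyError(opt_in_slow)
    warnings.warn("using known slower implementation %r" % opt_in_slow)
    return opt_in_slow
```

    For example, `init_requirements(have_rust=True)` includes persistent-nodemap, while passing `universal=True` yields only the baseline set that any client can read.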


Releases

    mostly automated (build, sign the release, new version on bugzilla, ...), although the release notes are not. Augie doesn't have time for the latter

    maybe we should write to the release-notes files in the hg repo more. May be able to make phabricator suggest editing the file if it's not already modified

    unclear if api changes notes are useful. One could imagine a link to a revset on the hg repo that would show the commit with "(api)" in the description instead.


Mercurial layered architecture

    Gradually moving towards a "core" and "consumer" split: the core should never output anything user-facing, and be responsible for most of the logic

    discussion about ripgrep support (meaning it knowing about hgignore), since code for that is implemented in the rust code, which burntsushi should accept

    In general splitting building bricks into crates (for the Rust case) like graph traversal, graph splitting, ignore support, etc.


Rust-cpython

    we use rust-cpython, but PyO3 seems to be the most popular option now, so we may want to switch over at some point

    the split between hg-core and hg-cpython means the mercurial rust is ready for this kind of switch

    we should check at some point with mark thomas what the concerns are about memory-safety in PyO3

    pyoxidizer also uses rust-cpython, but greg wants to switch at some point. He had to patch rust-cpython to remove some assumptions that didn't work in the case of embedding the runtime and things into a single executable. May have to do the same thing with PyO3.


Should some extensions be on by default

    rebase: yes, it was discussed in a previous sprint but hasn't happened. Didn't follow the bit about various levels of obsolescence markers or something

    share: yes, with pulkit's safe-share

    clean: if we can make it not wipe by default, so perhaps interactive by default (but not interactive if the purge extension is loaded, for compatibility)

    strip: don't remember what we said here, but I think the previous sprint said we should have hg debugstrip


Dirstate format mentioned in the rust discussion

    it's a tree format (so we have sorted order). It contains full paths for files, at the leaves. Writes can edit the tree without rewriting the whole file (by appending edits, similar to persistent-nodemap), but eventually the file gets rewritten from scratch when too many edits have accumulated.

    went into a lot of discussion about how the new status works, how some parts will require mtime for directories that get updated at the right time with high precision (which are going to be gated on some runtime check, under the assumption that a single checkout is not spread across multiple filesystems). What kind of paths are stored in the file (the same bytes as in the manifest, I believe).
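
    The "append edits, rewrite from scratch once too many have accumulated" behaviour can be sketched as below. The sketch deliberately uses a flat list of records instead of a tree, and the class name and the garbage-ratio threshold are assumptions; it only illustrates the append-then-compact policy, not the actual dirstate format.

```python
class AppendOnlyMap:
    """Path -> state map stored as an append-only list of records."""

    def __init__(self, max_garbage_ratio=0.5):
        self.records = []   # stands in for the on-disk file
        self.live = {}      # path -> index of the current record for that path
        self.max_garbage_ratio = max_garbage_ratio
        self.rewrites = 0   # number of from-scratch rewrites so far

    def set(self, path, state):
        # An edit never touches older records; it appends a new one.
        self.records.append((path, state))
        self.live[path] = len(self.records) - 1
        self._maybe_rewrite()

    def get(self, path):
        return self.records[self.live[path]][1]

    def _maybe_rewrite(self):
        garbage = len(self.records) - len(self.live)
        if garbage > self.max_garbage_ratio * len(self.records):
            # Too many superseded records: rewrite everything, sorted by path.
            compacted = [(p, self.records[i][1])
                         for p, i in sorted(self.live.items())]
            self.records = compacted
            self.live = {p: i for i, (p, _) in enumerate(compacted)}
            self.rewrites += 1


m = AppendOnlyMap()
m.set("a", 1)
m.set("b", 1)
for state in (2, 3, 4):
    m.set("a", state)   # the last of these edits triggers one rewrite
```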


Day 4 (Sunday 08)

Combined revlog

    The general idea is, with a not too big amount of work, to add a new field with a hash of the filename to reduce the number of filelogs

    Optimizes copying a file if both are in the same revlog

    Numbers: NetBSD repo (4.1GB for store) (I missed the part for filelogs...)

    We need an efficient index (so, the current persistent nodemap) if we want to combine revlogs

    Greg: we need strong interfaces for storage internals, a second format that's tested in storage, at which point it should be much easier to change repo formats

    We need to take into account strip performance

    Take alignment of revlog entries into account (current are exactly 64 bytes long, which happens to be a line of cache for modern processors)

    Maybe separate strip into two commands, one that just drops the index, the other does the actual "lose my data" (essentially something like "prune" and "gc")
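
    The alignment point can be made concrete with a small size calculation. The format string below is the layout commonly described for a revlog v1 index entry (an assumption recalled from Mercurial's revlog code, not something stated in these notes), and the variant with an extra 8-byte filename hash is purely hypothetical.

```python
import struct

CACHE_LINE = 64

# Assumed v1 layout: offset+flags (8 bytes), six 4-byte integers,
# a 20-byte node id, and 12 bytes of padding.
INDEX_ENTRY_V1 = struct.Struct(">Qiiiiii20s12x")

# Hypothetical variant that simply appends an 8-byte filename hash.
INDEX_ENTRY_WITH_FILE_HASH = struct.Struct(">Qiiiiii20s12xQ")


def straddles_cache_line(entry_size, index, line=CACHE_LINE):
    """True if entry number ``index`` spans two cache lines, assuming the
    index starts on a cache-line boundary."""
    start = index * entry_size
    end = start + entry_size - 1
    return start // line != end // line


print(INDEX_ENTRY_V1.size)              # size of the assumed current entry
print(INDEX_ENTRY_WITH_FILE_HASH.size)  # size once the extra field is appended
```

    Under these assumptions a 64-byte entry never spans two cache lines, while naively appending a field does, which is why the new field's placement matters.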


Using MMap more

    Reduces memory pressure by sharing kernel pages

    Has some speed drawbacks in rare cases, most of the time it's a big win for large repos

    It helps low-RAM users because less data is copied twice

    The main issue is that memory access can return a SIGBUS if the file has been truncated, so just don't truncate unless you're doing a vacuum or you really really need to. Indirectly you can also have a random prefix for the data, keep that in the docket file and write to a separate file, then rename if you need to truncate

    Alternatively: signal handler for SIGBUS ==> They're very hard to get right in Python. Any reading to a truncated file would return zeros only. But it's probably very hacky

    We need an abstraction for using/not using mmap
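
    The "write to a separate file, record its name in the docket, then rename" approach can be sketched as follows. File names, the docket layout (just the data file name) and the helper names are assumptions for illustration; cleanup of old data files is left out.

```python
import mmap
import os
import tempfile
import uuid


def write_generation(directory, payload):
    """Write ``payload`` to a fresh, uniquely named data file, then atomically
    point the docket at it. Existing data files are never truncated."""
    name = "data-%s" % uuid.uuid4().hex
    with open(os.path.join(directory, name), "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    tmp = os.path.join(directory, "docket.tmp")
    with open(tmp, "w") as f:
        f.write(name)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, os.path.join(directory, "docket"))  # atomic switch
    return name


def open_mapped(directory):
    """Map the data file currently named by the docket, read-only."""
    with open(os.path.join(directory, "docket")) as f:
        name = f.read()
    with open(os.path.join(directory, name), "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


repo = tempfile.mkdtemp()
write_generation(repo, b"0123456789")
old_reader = open_mapped(repo)
write_generation(repo, b"0123")   # a "truncation" becomes a new, shorter file
new_reader = open_mapped(repo)
```

    A reader that mapped the old file keeps seeing the full old contents, so it cannot hit SIGBUS from a truncation; new readers pick up the shorter file through the docket.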


Performance

    Looking back at recent work:

    Clone needs 40% less memory (RSS) now

    Sparse-revlog improves performance in highly merged history

    Persistent nodemap is a lot faster (if you use Rust). On one particularly large repo, latency dropped from ~400ms per side for each pull/push in discovery code to basically < 10ms.

    Copytracing is getting faster, hasn't yet landed. The Python version works fine and is reasonably efficient, the Rust implementation is much faster and might be necessary for very large repos. Numbers are fuzzy, but multiple cases that were in the multiple seconds range (20s+) are down to <1s now.

    Status has gotten about twice as fast with Rust in the general case. In pure Rust (rhg), the current POC code is much faster (25x with an 8c/16t computer on Mozilla-Central)

    chg: performance difference (C is less resource-intensive for now), and aside from fixing small existing bugs, it should work

    Stable-range work can help with many different performance issues

    Some French researchers are interested in fixing graph issues in general

    Issues:

    Evolution problem of building the stable-range caches

    Clone performance: a lot of time is spent on metadata (changelog/manifest) and seems too slow

    In general commands are slower than they should be because of a threshold effect

    Having many heads makes multiple things slow: discovery, branchmap, ??

    Better datastructures for evolve/topic are needed, but this is not worrying, just something that needs to be done

    Valentin: the main issues are things that are in the works (files, cat, status...), but some things are in consideration like:

    using transaction numbers for pull instead of discovery

    performance with narrow for the server-side

    hg update is parallel but slow, slower than single threaded hg cat of the whole repository in rust. So maybe some optimizations could be done with writing files and the parallel work is bottlenecked by Python

    hg bundle is extremely slow


Packaging TortoiseHg

    Packaging for Windows is broken, restart from scratch

    Host dependency components on m-s.o, instead of random developer repos?  e.g. merge modules, Qt/PyQt/sip/Qscintilla source tarballs, etc

    Mac packaging is a bash script

    TortoiseHg is a blocker for Python 3 migration (as previously said). If we merge the two build repos into the main thg repo, Greg will do the work.

    Heptapod CI could gain Windows and Mac runners in the next few months

    Signing packages is important, the Mac story is more complicated, because you need a Mac computer to sign it

    tortoisehg.org is still not in the hands of the Mercurial project and will expire in 2021

    add sitecustomize.py or equivalent to core hg binaries like thg packages have, so extensions are a simple `foo =` for all binaries on all platforms when using `pip install --user foo`

    Need a mechanism to install packages in this manner with PyOxidizer.  Ship a standalone python.exe?
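
    What such a sitecustomize.py might do is sketched below. This is an assumption about the intent (making packages installed with `pip install --user` importable by the bundled interpreter), not a copy of what the thg packages ship, and it may need adjusting for frozen or PyOxidizer-built binaries.

```python
import site
import sys


def enable_user_site():
    """Make the per-user site-packages directory importable.

    Assumption: the bundled interpreter does not already add this directory
    to sys.path on its own.
    """
    user_site = site.getusersitepackages()
    if user_site not in sys.path:
        # addsitedir also processes any .pth files found in the directory.
        site.addsitedir(user_site)
    return user_site


enable_user_site()
```

    With the user site directory on the path, an extension installed with `pip install --user foo` could be enabled with a plain `foo =` entry in the `[extensions]` section, without giving a full path.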


Dirstate cache vs source data

    General discussion about whether the source information in the dirstate (parent revisions, hg add, hg remove, hg rename, hg forget) can be separated from the cache part of the dirstate (things like mtime)

    Such a file would be tiny the vast majority of the time, just 40 bytes

    It would apparently avoid repository corruptions due to google repos being edited on multiple machines
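
    The 40-byte figure presumably corresponds to two 20-byte parent node ids with nothing else recorded (no pending add/remove/rename). A minimal sketch of such a "source data" file under that assumption, with hypothetical helper names:

```python
import struct

NODE_LEN = 20                       # assumption: SHA-1 sized node ids
NULL_NODE = b"\0" * NODE_LEN
PARENTS = struct.Struct(">20s20s")  # first parent, second parent


def pack_parents(p1, p2=NULL_NODE):
    """Serialize the working copy parents; this is the whole file when there
    are no pending adds, removes, renames or forgets."""
    return PARENTS.pack(p1, p2)


def unpack_parents(data):
    """Read the two parents back from the start of the file."""
    return PARENTS.unpack(data[:PARENTS.size])


blob = pack_parents(b"\x11" * NODE_LEN)
```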


Virtual sprints in general

    the next couple of sprints will probably be virtual as well

    general discussion about people talking over people. Maybe we should have a convention for people to signal they want to speak

    some people like virtual sprints because they are easier to attend

    not every discussion should be done at the sprint, people can also talk between themselves in between sprints

    post-sprint note by jeffpc: it looks like Zoom has breakout rooms that the host can enable - I just did a quick test and it seems to work quite well