Mercurial sprint 5.6 (notes from https://mypads.framapad.org/mypads/?/mypads/group/octobus-public-5d3rw470w/pad/view/mercurial-sprint-gb2akv7pm)
Agenda: https://docs.google.com/spreadsheets/d/1-PhQvYRmUk2LRrpKZg32fktAsfBRN93WC1GKCnDK7YY/edit#gid=0

Day 1 (Thursday 05)

Attendees:
  Anton Shestakov
  Augie Fackler
  Charles Chamberlain
  Danny Hooper
  Georges Racinet
  Gregory Szorc
  Jeff Sipek
  Jörg Sonnenberger
  Kyle Lippincott
  Manuel Jacob
  Martin von Zweigbergk
  Mathias de Maré
  Matt Harbison
  Pierre-Yves David
  Pulkit Goyal
  Raphaël Gomès
  Sushil Khanchi
  Thomas Klausner (from NetBSD)
  Valentin Gatien Baron
  Yuya Nishihara

Dropping Python 2.7
  Who gets hurt? / Who is using 2.7?
    OpenVMS
    TortoiseHg (Corporate Windows users)
      Random bytes/str issues
      Windows installers complicated
      macOS DMG issues
      Repository structure is a mess
    CentOS 7
      Action: change our RPMs to use Python 3
    Windows
      Non-ascii filenames are problematic
      pip install pager doesn't work in msys
      make install doesn't work
      WSL is wonky
        NTFS per-directory case sensitivity flag in WSL
        Filesystem I/O performance issues
    Kallithea dropped Python 2 support
    hg-subversion
      It is probably out of date; not ported to modern Mercurial
      FreeBSD builds hg-subversion for Python 3. Unsure if it works.
    hg-fastimport: Jörg going to look at the porting
  Conclusions:
    2.7 blockers are working TortoiseHg MSI and DMG installers
    Come back to this later today to decide!
  Other extensions that do not (fully) support python 3.x:
    hglist (https://alastairs-place.net/projects/hglist/) (Does not work with current mercurial anyway)
    Mozilla version-control-tools repo (https://hg.mozilla.org/hgcustom/version-control-tools/, https://mercurial.selenic.com/wiki/MozillaPushlogExtension)
    hgnested (https://pypi.org/project/hgnested/) (Does not work with current mercurial anyway)

Testing Approach
  https://www.mercurial-scm.org/wiki/IntegrationTestsPlan
  Yuya: lots of tests don't care about UI; internals can be tested without invoking `hg`
  Should we adopt pytest?
    Python ecosystem is using pytest
    Alphare: As long as we have a "code style" for automagic fixtures, I'm +100 for using pytest
  Complexity of current tests both for performance and maintainability, revset tests used as example, consideration for the bad HTTP tests -> Jörg and Georges will look at that
  Conclusion: Georges to send patch series vendoring pytest and demonstrating use/integration with run-tests.py. For at least 1 test to demonstrate how it works.

Command Namespaces
  https://www.mercurial-scm.org/wiki/5.6Sprint#More_command_.22namespace.22
  debug*, perf* commands
  We don't want users running debug* commands
  Question: create a new namespace (name TBD - possibly "admin") where we consolidate some debug* and regular namespace commands to aggregate "repo administration" commands. Upgrade, rollback, verify, debugstrip, etc.
  Proposal: Add - between namespace name and command
  Additional namespaces possible (for example one for experimental commands)
  Conclusion:
    Change canonical commands to use - between words; automatically register non-hyphen aliases to existing commands
    Also convert arguments to use -???
    Some users might also prefer to use _
    New namespaces are a good idea
    Use "maintenance-" for repo admin / maintenance commands?

IRC / Chat Futures
  IRC is ancient. Should we ditch IRC?
  We are losing potential contributors by using IRC.
  Matrix and Zulip are interesting.
  Replacement needs:
    persistence (no need to always be online)
    anonymous access or very very very easy registration process (e.g.
      using already existing 3rd-party account or credentials from something in our project infrastructure, like phab)
    bridge to/from IRC
  Conclusion: Let's play with Matrix with IRC gateway -- see https://app.element.io/#/room/#freenode_#mercurial:matrix.org

Website
  Raphael has volunteered to modernize website; big problem is modernizing content
  Tortoise domain expiring in 2021; unsure of future
  The wiki is a graveyard; no review process; no commenting
  Replace wiki with Sphinx-based documentation?

Day 2 (Friday 06)

Tags
  Usability issue: hg clone -r tag doesn't get the tag, just the tagged changeset
  Scaling issue: consider every head
  Weird conflict resolutions using non-stable ordering, merge conflicts in .hgtags are annoying, and other bugs
  Alternative implementations are used by some companies:
    Jane Street uses a dedicated repo for tags
    Nokia also has a separate approach
  Use the opportunity that changing the hash function will break all old clients anyway, and change the tags system together with the hash transition
  We could store the tag information inside the changeset
  Implementing some sort of caching mechanism for side-data (extra?) will fix performance issues for new tags, but also topics, etc.
  We need to take into account people not using evolve
  Tag namespaces for monorepos (also maybe narrow/sparse information)?

Experimental VCS (Martin) called "JJ" (placeholder)
  Written in Rust
  Compatible with Git repos
  Separated into two crates, lib and cli
  All filepaths are UTF-8 (Rust String)
  Meant as a client for Git repos with revsets, evolution, templates, etc.
  Split between ChangeID (follows the change across amends) and Revision ID
  Commands:
    describe: to change metadata
    close: equivalent to commit?
    diff: equivalent to diff
    co/checkout: equivalent to checkout
    prune: equivalent to prune?
    rebase: equivalent to rebase
    restore: equivalent to revert
    merge: equivalent to merge
    op log: remembers all operations you've done to the repo
      --at-op flag to see the state of the repo at any given operation
      op undo/op restore
  working copy changes stay where they are; you need to "rebase the working copy" to bring them along onto some other revision, like hg update does by default
  unresolved conflicts can be committed, and the resulting revision can be used further. Unresolved conflicts are materialized as conflict markers on checkout.
  Reordering commits can result in intermediate conflicts but jj knows the overall diff is the same as before and so the final rev has no conflicts
  allows conflicts to be resolved collaboratively, rebase doesn't have to store state
  merges are represented as the diff from the merged parents, so you can rebase them and add/remove parents
  the op log is motivated by supporting a replication-friendly repository format
  Discussion about what could be included into mercurial
    backwards compatibility would hold us back?
    but revlogs are not really stable, the command line interface and network protocol are the parts that are most stable

Treating commits like working copy
  https://www.mercurial-scm.org/wiki/RevisionAsWDirPlan
  Turned into a discussion about ambiguous --rev meaning
  Idea: --into <rev> for describing revision to mutate on; --from <rev> for revision to retrieve from
  List all existing commands that take --rev (and also others) to check if the new flags would apply to existing commands
  hg status -r doesn't map to --rev
  Use different commands for very different user-facing results instead of adding a (non-advanced) flag to an existing command. (like histedit/rebase -i)

Replication Friendly Storage
  We aren't replication friendly today
  We don't use fsync properly today so not even POSIX filesystem compatible today (e.g.
    NFS)
  We need to formalize storage interfaces and have a continuously tested alternative storage backend to keep them working, to unblock viable alternative storage backends.

Day 3 (Saturday 07)

--auto-shelve flag
  What happens on conflict? And conflicts with untracked files
  Depends on the next discussion (phase-based shelve)

Phase-based shelve
  Using this, we re-use commit manipulation features already in hg instead of having a "side-channel" for shelves
  internal phases are 95% ready: missing the upgrade/downgrade repo (removing all internal except shelves, etc.)
  Performance needs to be worked on and understood before making the switch
  Removes the need for stripping the intermediate stages

State of Rust in hg
  effort to have a rust executable, rhg, that should eventually become the start point, and which runs python for the complex commands (which is most of them)
    would include chg, so it's much friendlier to enable, and could even be enabled by default
    we'd like to have status in pure rust, but where do we stop?
  concern about rust as a hard dependency in BSD. Thoughts from previous sprints would be no hard dependency on rust, however some (non-mandatory) features/performance work could be rust only
  Sometimes it's a bit more complicated, as the persistent-nodemap feature is faster than equivalent functionality in C+python, and while it's possible to use a python-only hg with persistent-nodemap, it's much slower than the equivalent with C+python.
  Greg claims that people with big monorepos should be running on archs well supported by rust (objections about powerpc cpus).
  May not even need a pure python implementation for performance-only changes.
  People can clone a repo to remove a rust requirement on a dirstate, for instance. Wouldn't work with nfs repos.
  The network protocol must not require rust, but the local disk format can, and probably should, as it's better use of resources.
  Jörg concerned about significant performance degradation for people not using rust
  We don't really want much new C code at this point, especially code that's not fuzzed
  augie makes the following claim "if we assume we ship pyoxidized binaries on mercurial-scm.org, a format that's used by default by that executable should have a python implementation", not clear if people agree.
  To address the conflicting requirements of: "1. hg repos can be copied around and work, 2. work with reasonable performance if you don't have rust, 3. be fast when you have rust", one option that's suggested would be that hg init may add requirements specific to the hg version you're running (so persistent-nodemap for instance if you're running rust), but there would be `hg init --universal` or something to prevent these installation-specific requirements
  Greg wants the default experience to be the fastest that the client can provide, because default matters etc
  Conclusions:
    Mercurial will not require Rust on any platform
    Rust is an optional requirement; Mercurial features can be implemented in Rust; But end-user functionality (e.g. workflow hg commands) will not require Rust.
    The Mercurial experience on large repos without Rust may not be great: this is due to the limits of performance with Python
    The official Mercurial distributions will ship with Rust code / features on platforms where Rust is supported
    hg init produces the best possible experience (read: set of enabled requirements) for whatever the running hg client supports
      If there is a Rust feature supported that provides benefit, it is enabled by default
      There is implicit BC here where we no longer guarantee that an hg X.Y can read a repo on machine B that was created on machine A
    If a Mercurial install supports N>1 implementations of given functionality (e.g.
      Rust versus Python dirstate) and 1 implementation is slower than the other, it should use the faster one by default and possibly emit a warning and/or require a config option to opt in to the known slower implementation.
    We don't have a blanket policy of whether a non-Rust implementation of a feature is required
      We don't require a 2nd implementation at initial implementation time.
      At feature enable-by-default time, we have a formal review of whether a 2nd implementation should be required.
      Having the ability to downgrade the repository format is greatly preferred.

Releases
  mostly automated (build, sign the release, new version on bugzilla, ...), although the release notes are not. Augie doesn't have time for the latter
  maybe we should write to the release-notes files in the hg repo more. May be able to make phabricator suggest editing the file if it's not already modified
  unclear if api changes notes are useful. One could imagine a link to a revset on the hg repo that would show the commits with "(api)" in the description instead.

Mercurial layered architecture
  Gradually moving towards a "core" and "consumer" split: the core should never output anything user-facing, and be responsible for most of the logic
  discussion about ripgrep support (meaning it knowing about hgignore), since code for that is implemented in the rust code, which burntsushi should accept
  In general splitting building bricks into crates (for the Rust case) like graph traversal, graph splitting, ignore support, etc.

Rust-cpython
  we use rust-cpython, but PyO3 seems to be the most popular option now, so we may want to switch over at some point
  the split between hg-core and hg-cpython means the mercurial rust is ready for this kind of switch
  we should check at some point with mark thomas what the concerns are about memory-safety in PyO3
  pyoxidizer also uses rust-cpython, but greg wants to switch at some point.
    He had to patch rust-cpython to remove some assumptions that didn't work in the case of embedding the runtime and things into a single executable. May have to do the same thing with PyO3.

Should some extensions be on by default
  rebase: yes, it was discussed in a previous sprint but hasn't happened. Didn't follow the bit about various levels of obsolescence markers or something
  share: yes, with pulkit's safe-share
  clean: if we can make it not wipe by default, so perhaps interactive by default (but not interactive if the purge extension is loaded, for compatibility)
  strip: don't remember what we said here, but I think the previous sprint said we should have hg debugstrip

Dirstate format
  mentioned in the rust discussion
  it's a tree format (so we have sorted order). It contains full paths for files, at the leaves.
  Writes can edit the tree without rewriting the whole file (by appending edits, similar to persistent-nodemap), but eventually the file gets rewritten from scratch when too many edits have accumulated.
  went into a lot of discussion about how the new status works, how some parts will require mtime for directories that get updated at the right time with high precision (which are going to be gated on some runtime check, under the assumption that a single checkout is not spread across multiple filesystems).
  What kind of paths are stored in the file (the same bytes as in the manifest, I believe).

Day 4 (Sunday 08)

Combined revlog
  The general idea is, with a not too big amount of work, to add a new field with a hash of the filename to reduce the number of filelogs
  Optimizes copying a file if both are in the same revlog
  Numbers:
    NetBSD repo (4.1GB for store) (I missed the part for filelogs...)
  We need an efficient index (so, the current persistent nodemap) if we want to combine revlogs
  Greg: we need strong interfaces for storage internals, a second format that's tested in storage, at which point it should be much easier to change repo formats
  We need to take into account strip performance
  Take alignment of revlog entries into account (current entries are exactly 64 bytes long, which happens to be a cache line on modern processors)
  Maybe separate strip into two commands, one that just drops the index, the other does the actual "lose my data" (essentially something like "prune" and "gc")

Using MMap more
  Reduces memory pressure by sharing kernel pages
  Has some speed drawbacks in rare cases, most of the time it's a big win for large repos
  It helps low-RAM users because less data is copied twice
  The main issue is that memory access can raise a SIGBUS if the file has been truncated, so just don't truncate unless you're doing a vacuum or you really really need to.
    Indirectly you can also have a random prefix for the data, keep that in the docket file and write to a separate file, then rename if you need to truncate
    Alternatively: signal handler for SIGBUS ==> They're very hard to get right in Python. Any read from a truncated file would return zeros only. But it's probably very hacky
  We need an abstraction for using/not using mmap

Performance
  Looking back at recent work:
    Clone needs 40% less memory (RSS) now
    Sparse-revlog improves performance in highly merged history
    Persistent nodemap is a lot faster (if you use Rust). On one particularly large repo latency dropped from ~400ms per side for each pull/push in discovery code to basically < 10ms.
    Copytracing is getting faster, hasn't yet landed. The Python version works fine and is reasonably efficient, the Rust implementation is much faster and might be necessary for very large repos. Numbers are fuzzy, but multiple cases that were in the multiple seconds range (20s+) are down to <1s now.
    Status has gotten about twice as fast with Rust in the general case. In pure Rust (rhg), the current POC code is much faster (25x with an 8c/16t computer on Mozilla-Central)
    chg: performance difference (C is less resource-intensive for now), and aside from fixing small existing bugs, it should work
    Stable-range work can help with many different performance issues
      Some French researchers are interested in fixing graph issues in general
  Issues:
    Evolution problem of building the stable-range caches
    Clone performance: a lot of time is spent on metadata (changelog/manifest) and seems too slow
    In general commands are slower than they should be because of a threshold effect
    Having many heads makes multiple things slow: discovery, branchmap, ??
    Better datastructures for evolve/topic are needed, but this is not worrying, just something that needs to be done
    Valentin: the main issues are things that are in the works (files, cat, status...), but some things are in consideration like:
      using transaction numbers for pull instead of discovery
      performance with narrow for the server-side
      hg update is parallel but slow, slower than single threaded hg cat of the whole repository in rust. So maybe some optimizations could be done with writing files and the parallel work is bottlenecked by Python
      hg bundle is extremely slow

Packaging
  TortoiseHg packaging for Windows is broken; restart from scratch
  Host dependency components on m-s.o, instead of random developer repos? e.g. merge modules, Qt/PyQt/sip/Qscintilla source tarballs, etc
  Mac packaging is a bash script
  TortoiseHg is a blocker for Python 3 migration (as previously said). If we merge the two build repos into the main thg repo, Greg will do the work.
  Heptapod CI could gain Windows and Mac runners in the next few months
  Signing packages is important, the Mac story is more complicated, because you need a Mac computer to sign it
  tortoisehg.org is still not in the hands of the Mercurial project and will expire in 2021
  add sitecustomize.py or equivalent to core hg binaries like thg packages have, so extensions are a simple `foo =` for all binaries on all platforms when using `pip install --user foo`
    Need a mechanism to install packages in this manner with PyOxidizer. Ship a standalone python.exe?

Dirstate cache vs source data
  General discussion about whether the source information in the dirstate (parent revisions, hg add, hg remove, hg rename, hg forget) can be separated from the cache part of the dirstate (things like mtime)
  Such a file would be tiny the vast majority of the time, just 40 bytes
  It would apparently avoid repository corruptions due to google repos being edited on multiple machines

Virtual sprints
  in general the next couple of sprints will probably be virtual as well
  general discussion about people talking over people. Maybe we should have a convention for people to signal they want to speak
  some people like virtual sprints because they are easier to attend
  not every discussion should be done at the sprint, people can also talk between themselves in between sprints
  post-sprint note by jeffpc: it looks like Zoom has breakout rooms that the host can enable - I just did a quick test and it seems to work quite well
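
Illustration for the "Dirstate cache vs source data" discussion: a minimal Python sketch of what a separate "source" file could look like. It assumes the "just 40 bytes" figure means two 20-byte parent nodes and nothing else when there are no pending adds/removes/forgets. The record layout (a one-byte action code plus a length-prefixed path), the names encode_source/decode_source, and the omission of rename records are all hypothetical choices made for this example; nothing here was decided at the sprint or reflects an existing Mercurial format.

```python
import struct

NODE_LEN = 20                        # length of a SHA-1 node, as used by Mercurial today
NULL_NODE = b"\0" * NODE_LEN         # stands for "no second parent"
ENTRY_HEADER = struct.Struct(">cH")  # hypothetical: 1-byte action code + 2-byte path length


def encode_source(p1, p2=NULL_NODE, changes=()):
    """Serialize the two parents followed by pending (action, path) records."""
    if len(p1) != NODE_LEN or len(p2) != NODE_LEN:
        raise ValueError("parents must be %d-byte nodes" % NODE_LEN)
    chunks = [p1, p2]
    for action, path in changes:
        chunks.append(ENTRY_HEADER.pack(action, len(path)))
        chunks.append(path)
    return b"".join(chunks)


def decode_source(data):
    """Inverse of encode_source: return (p1, p2, [(action, path), ...])."""
    p1 = data[:NODE_LEN]
    p2 = data[NODE_LEN:2 * NODE_LEN]
    changes = []
    pos = 2 * NODE_LEN
    while pos < len(data):
        action, length = ENTRY_HEADER.unpack_from(data, pos)
        pos += ENTRY_HEADER.size
        changes.append((action, data[pos:pos + length]))
        pos += length
    return p1, p2, changes


# A clean working copy has no pending add/remove/forget, so only the parents are stored.
clean = encode_source(b"\x11" * NODE_LEN)
print(len(clean))  # 40
```

Cache-like data such as mtimes would live in a different file that can be thrown away and rebuilt, so losing or corrupting it would not lose any user intent recorded here.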