Contact Details
Email: md8580 AT SPAM _BARFOO_ FREE bris DOT ac DOT uk)
Alt Email: qwerty360 AT SPAM _FOOBAR_ FREE gmail DOT com
IRC nick: qwerty360
VOIP: sip://markd@217.155.162.65 (Warning due to known NAT issues and the difficulty of testing this without a second voip server I cannot guarantee this will work! Please tell me if it doesnt work so that I can try and fix any issues!).
About Me
I am a second year Maths and Computer Science Student at the University of Bristol in the UK.
I have been using open source software (and mercurial) for a while, although so far I have never contributed (something I hope to fix this year). I am interested in doing the Google Summer of Code with mercurial as it has benefited me in the past.
I normally program in C or Java although I am now learning python. (Mostly by playing with Mercurial. If anyone has any suggestions of good projects or resources for learning python they would be much appreciated.)
Google Summer of Code 2010
At the moment I am looking at the 'Better Changeset Discovery' idea. Pending discussion with people who know the code better I have a few ideas/questions/notes already about it. The following covers what I believe is a fairly large scope!:
Some(most) of the following does not consider that a rough protocol has already been designed and appears to be in the process of implementation.
- Minimize round trips over everything else. On a 1mbit connection sending 100 160 bit ID's, assuming an overhead taking them up to 300bits is faster than doing an extra round trip. given that connection throughput is increasing all the time, but latency is much harder to improve upon this is going to become more significant over time.
- Minimize server/client operations secondly. Searching the graph for the ID's is not a fast operation. While this can be done in log n time with a sorted list this means pre computing one and adding to it on each commit. This may still be the optimum solution.
- Optimize pull over push
- only need to know differences one way most of the time
- push is normally done after pulling code and merging
- push cares if server is different to give warning and allow for merging, but not how different
- Precomputation: possibly precomputation of the graphs could be done on each commit to find changesets where being common indicates all prececiding changesets must also be common (cuts across 1 point) (check these before checking parallel branches to rapidly classify sections)
- implement as extension? use with large projects for improved performance of major repos? how to ensure precomputation is done by submitters instead of server if extension?
- Use called C for relatively fast graph processing?
- Divide and conquor techniques (treat branches that dont merge later on as sub problems to simplify process).
- varying length of reverse hop based on how often repository is used? more used repositories have more commits to deal with!
- How are others solving the problem
- other DVCS's, e.g git
- Dependancy management (e.g apt) -- similar problem, treat each predecessing changeset as a dependancy for current changeset?
- Testing
- Aim is to improve performance, so you need rigerous testing to ensure that changes have actual benefits
- Automated random repository generation
- allows testing of some extremes
- likely to show performance changes
- e.g. complex branch operations
- dissimilar to real world
- Lots of tests
- proper design would be useful for testing other mercurial development
- allows testing of some extremes
- Large repositories
- Kernel, mercurial, etc
- Real world tests
- rarely worst case
- Combination of automatically generated and manually generated code is best.
- Use hg incoming and hg outgoing to test performance?
- Tests with simulated latency?
- extension using hook with wait command?
Why the project needs doing:
The example below shows that cloning the mercurial stable repository takes 5 times longer than 'hg incoming' takes to decide what changesets are needed between revision 1000 and the current revision (changeset 10646).
markd@m2n-mx-ath-x2:~/code/hg-exper$ time hg clone http://selenic.com/repo/hg-stable clonetest2 real 0m34.081s user 0m10.485s sys 0m0.668s
markd@m2n-mx-ath-x2:~/code/hg-exper$ time hg clone http://selenic.com/repo/hg-stable clonetest3 -r 1000 real 0m4.642s user 0m1.212s sys 0m0.036s markd@m2n-mx-ath-x2:~/code/hg-exper$ cd clonetest3
markd@m2n-mx-ath-x2:~/code/hg-exper/clonetest3$ time hg incoming > screenoutput real 2m24.890s user 0m4.212s sys 0m0.424s markd@m2n-mx-ath-x2:~/code/hg-exper/clonetest3$ ping selenic.com -c 4
--- selenic.com ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3001ms rtt min/avg/max/mdev = 160.950/161.373/162.314/0.616 ms
n.b. Some of the issues related to this may result from the cgi web interface being used.