Contact Details

Email: md8580 AT SPAM _BARFOO_ FREE bris DOT ac DOT uk

Alt Email: qwerty360 AT SPAM _FOOBAR_ FREE gmail DOT com

IRC nick: qwerty360

VoIP: sip://markd@217.155.162.65 (Warning: due to known NAT issues and the difficulty of testing this without a second VoIP server, I cannot guarantee this will work! Please tell me if it doesn't work so that I can try to fix any issues.)

About Me

I am a second-year Maths and Computer Science student at the University of Bristol in the UK.

I have been using open source software (and Mercurial) for a while, although so far I have never contributed (something I hope to fix this year). I am interested in doing the Google Summer of Code with Mercurial as it has benefited me in the past.

I normally program in C or Java, although I am now learning Python (mostly by playing with Mercurial; if anyone has suggestions of good projects or resources for learning Python, they would be much appreciated).

Google Summer of Code 2010

At the moment I am looking at the 'Better Changeset Discovery' idea. Pending discussion with people who know the code better, I already have a few ideas/questions/notes about it. The following covers what I believe is a fairly large scope:

Some (in fact most) of the following does not take into account that a rough protocol has already been designed and appears to be in the process of being implemented.

  • Minimize round trips over everything else. On a 1 Mbit/s connection, sending 100 160-bit IDs, even assuming overhead takes each up to 300 bits, costs about 30 ms of transfer time and is therefore faster than doing an extra round trip (~161 ms to selenic.com in the ping output below). Given that connection throughput is increasing all the time while latency is much harder to improve upon, this is going to become more significant over time. (A back-of-envelope sketch of this arithmetic follows the list.)
  • Minimize server/client operations second. Searching the graph for the IDs is not a fast operation. While it can be done in O(log n) time with a sorted list, that means precomputing the list and adding to it on each commit; this may still be the optimal solution. (A membership-test sketch follows the list.)
  • Optimize pull over push
    • most of the time you only need to know the differences in one direction
    • push is normally done after pulling and merging
    • push cares whether the server differs (to give a warning and allow for merging), but not how different it is
  • Precomputation: the graph could possibly be preprocessed on each commit to find changesets whose being common implies that all preceding changesets must also be common (points where the graph narrows to a single changeset); check these before checking parallel branches to rapidly classify whole sections. (A cut-point detection sketch follows the list.)
    • Implement as an extension? Use with large projects for improved performance of major repos? If it is an extension, how do we ensure the precomputation is done by submitters instead of the server?
    • Call out to C for relatively fast graph processing?
  • Divide and conquer techniques (treat branches that don't merge later on as subproblems to simplify the process).
  • Vary the length of the reverse hop based on how often the repository is used? More heavily used repositories have more commits to deal with!
  • How are others solving the problem?
    • other DVCSs, e.g. git
    • dependency management (e.g. apt) -- a similar problem; treat each preceding changeset as a dependency of the current changeset?
  • Testing
    • The aim is to improve performance, so rigorous testing is needed to ensure that changes have actual benefits.
    • Automated random repository generation (a sketch follows this list)
      • allows testing of some extremes
        • likely to show performance changes
        • e.g. complex branch operations
      • dissimilar to the real world
      • lots of tests
      • a proper design would be useful for testing other Mercurial development
    • Large repositories
      • kernel, Mercurial, etc.
      • real-world tests
      • rarely the worst case
    • A combination of automatically generated and manually constructed test repositories is best.
    • Use hg incoming and hg outgoing to test performance?
    • Tests with simulated latency?
      • an extension using a hook with a wait command? (see the hook sketch after this list)
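
To make the round-trip arithmetic from the first bullet concrete, here is a back-of-envelope Python sketch; the link speed and per-ID overhead are the illustrative figures from the bullet, and the round-trip time is the one measured to selenic.com below:

link_bps = 1000000   # 1 Mbit/s connection
bits_per_id = 300    # 160-bit ID plus assumed protocol overhead
num_ids = 100
rtt_ms = 161         # measured round-trip time to selenic.com (ping output below)

send_ms = 1000.0 * num_ids * bits_per_id / link_bps
print("sending %d IDs: %.0f ms" % (num_ids, send_ms))  # ~30 ms
print("one extra round trip: %d ms" % rtt_ms)          # 161 ms
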
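For the sorted-list membership test in the second bullet, a minimal sketch using Python's standard bisect module; the node values here are fake stand-ins for 160-bit changeset IDs, not Mercurial's actual index:

import bisect

def is_known(sorted_nodes, node):
    # O(log n) membership test against a precomputed sorted list of IDs;
    # the list would be extended with bisect.insort on each commit.
    i = bisect.bisect_left(sorted_nodes, node)
    return i < len(sorted_nodes) and sorted_nodes[i] == node

nodes = sorted([b"\x01" * 20, b"\x7f" * 20, b"\xff" * 20])  # fake IDs
print(is_known(nodes, b"\x7f" * 20))  # True
print(is_known(nodes, b"\x02" * 20))  # False
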
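A sketch of the cut-point detection mentioned under precomputation, under my reading of "cuts across 1 point": scanning revisions in topological order while tracking live heads, any revision that is momentarily the only live head must have every earlier revision as an ancestor, so if it is found to be common, everything before it is common too. The toy parents list is a made-up example, not real repository data:

def cut_points(parents):
    # parents[r] lists the parents of revision r (all < r, so the
    # revisions are already in topological order). Returns every r
    # whose ancestor set is exactly {0, ..., r-1}.
    heads = set()   # processed revisions with no processed child yet
    cuts = []
    for r, ps in enumerate(parents):
        for p in ps:
            heads.discard(p)
        heads.add(r)
        if heads == {r}:
            cuts.append(r)
    return cuts

# Toy history: 0 -> 1, which branches into 2 and 3, merged by 4.
print(cut_points([[], [0], [1], [1], [2, 3]]))  # [0, 1, 4]
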
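For automated random repository generation, a minimal sketch that drives the hg command line; the branch probability and repository size are arbitrary assumptions for illustration:

import os, random, subprocess

def make_random_repo(path, ncommits=50, seed=0):
    # Build a small repository with random branching by shelling out
    # to hg. Purely a test-data generator, not tuned to real-world shape.
    rng = random.Random(seed)
    subprocess.check_call(["hg", "init", path])
    for i in range(ncommits):
        if i > 0 and rng.random() < 0.3:   # 30% of the time, branch off
            parent = rng.randrange(i)      # an earlier revision
            subprocess.check_call(["hg", "-R", path, "update", "-r", str(parent)])
        fname = os.path.join(path, "f%d.txt" % i)
        with open(fname, "w") as f:
            f.write("change %d\n" % i)
        subprocess.check_call(["hg", "-R", path, "add", fname])
        subprocess.check_call(["hg", "-R", path, "commit", "-m", "commit %d" % i])

make_random_repo("randomrepo")
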
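And for the simulated-latency question at the end of the list, a sketch of an in-process Python hook that simply sleeps; preoutgoing is a real Mercurial hook that runs before outgoing data is sent, while the module name and delay value here are assumptions for illustration:

# delayhook.py -- crude latency simulation for local performance tests.
# Enable it in the test repository's hgrc (the module must be importable):
#     [hooks]
#     preoutgoing.delay = python:delayhook.delay
import time

def delay(ui, repo, **kwargs):
    time.sleep(0.161)   # pretend each request costs one ~161 ms round trip
    return False        # a false return value lets the operation continue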

Why the project needs doing:

markd@m2n-mx-ath-x2:~/code/hg-exper$ time hg clone http://selenic.com/repo/hg-stable clonetest2
requesting all changes
adding changesets
adding manifests
adding file changes
added 10647 changesets with 21321 changes to 1479 files
updating to branch default
1143 files updated, 0 files merged, 0 files removed, 0 files unresolved

real 0m34.081s
user 0m10.485s
sys 0m0.668s

markd@m2n-mx-ath-x2:~/code/hg-exper$ time hg clone http://selenic.com/repo/hg-stable clonetest3 -r 1000
requesting all changes
adding changesets
adding manifests
adding file changes
added 995 changesets with 2376 changes to 166 files
updating to branch default
148 files updated, 0 files merged, 0 files removed, 0 files unresolved

real 0m4.642s
user 0m1.212s
sys 0m0.036s

markd@m2n-mx-ath-x2:~/code/hg-exper$ cd clonetest3
markd@m2n-mx-ath-x2:~/code/hg-exper/clonetest3$ time hg incoming > screenoutput

real 2m24.890s
user 0m4.212s
sys 0m0.424s

markd@m2n-mx-ath-x2:~/code/hg-exper/clonetest3$ ping selenic.com -c 4
PING selenic.com (173.11.57.241) 56(84) bytes of data.
64 bytes from waste.org (173.11.57.241): icmp_seq=1 ttl=42 time=162 ms
64 bytes from waste.org (173.11.57.241): icmp_seq=2 ttl=42 time=161 ms
64 bytes from waste.org (173.11.57.241): icmp_seq=3 ttl=42 time=160 ms
64 bytes from waste.org (173.11.57.241): icmp_seq=4 ttl=42 time=161 ms

--- selenic.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3001ms
rtt min/avg/max/mdev = 160.950/161.373/162.314/0.616 ms

CategoryHomepage
