Differences between revisions 27 and 93 (spanning 66 versions)
Revision 27 as of 2010-04-18 15:31:33
Size: 5501
Editor: Pradeepkumar
Comment:
Revision 93 as of 2010-10-22 18:17:21
Size: 8315
Editor: mpm
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
##master-page:HomepageTemplate
#format wiki
#language en
Email: <<MailTo(in3xes AT SPAMFREE gmail DOT com)>>
## page was renamed from in3xes
## page was renamed from Pradeepkumar
== Google Summer of Code-2010 ==
<<TableOfContents>>
Line 6: Line 6:
IRC Nick: inXs_ ----
== Contact Details: ==
{{{
        Name: Pradeepkumar Gayam
        Email: in3xes@gmail.com
        IRC Nick: in3xes (freenode.net)
        Jabber ID: in3xes [AT] gmail.com (talk.google.com)
}}}
My GSoC [[http://bitbucket.org/in3xes/gsoc/src/tip/application|application]] on bitbucket.
Line 8: Line 16:
'''About Me:''' ----
''' '''
Line 10: Line 19:
- Student at [[http://www.iitr.ac.in|IIT]] in India. Mathematics is my major. == Journal: ==
2010-06-23: Study shallow clone how they are related to parent delta
Line 12: Line 22:
- Fan of C, Python, Haskell. You can find some of my small projects on [[http://www.github.com/in3Xes|Github]]. - 2010-06-24: Revlog experiments to improve performance
Line 14: Line 24:
- Love solving mathematics problems by programming. [[http://www.github.com/in3Xes/projecteuler|Solutions]] for the problems in [[http://www.projecteuler.net|Project Euler]]. Solved in python. 2010-06-25: More experiments, fixed the issue, so that it will not differ for old revlogs, still expriments needs to be done.
Line 16: Line 26:
- Experienced user of git. Comfortable with Bazaar. 2010-06-26: Study wire-proto, understand how bundle, unbundle is being done.
Line 18: Line 28:
- Find me on [[http://www.twitter.com/in3Xes|Twitter]] 2010-06-27: Week end
Line 20: Line 30:
== Google Summer of Code-2010 ==
I am interested in "Parent delta". I can also work on conversion tools. But I am mainly focusing on Parent Delta Plan.
2010-06-28: More expreiments, found the reason for the performance issue, [[http://paste.lisp.org/+2EEE|this]] paste explains. Conclusion: 'The number of patches to be applied increases at starting rev of new branch'.
Line 23: Line 32:
A word about project: ----
2010-06-29: Update blog, journal, Attened GSoC meeting,
Line 25: Line 35:
Mercurial calculates diffs against previous revision rather than parent. In some cases this implementation is space inefficient, and it is more sensible to store deltas against parent. This project is about implementing [[http://mercurial.selenic.com/wiki/ParentDeltaPlan|Parent Delta Plan]]. This implementation changes the structure of revlogs, so wire protocol has to be extended to allow backward compatibility. 2010-06-30: Worked through bundle/unbundle code and understand wire-proto completely
Line 27: Line 37:
============================================================================ 2010-07-01: Worked through bundle/unbundle code and understand wire-proto completely
Line 29: Line 39:
=== Proposal ===
TO DO(Tentative): Major:
2010-07-02: - (Tried to work, but, unproductive day)
Line 32: Line 41:
 * Changes in revlog
 * Changes in wire protocol
2010-07-03: - (another unproductive day :-( )
Line 35: Line 43:
Changes in revlog: 2010-07-04: - Week end
Line 37: Line 45:
 . Change revlog structure, now it appends the next delta to existing data.This helps in reconstrfucting a revision at O(1) seeks. If 2010-07-05: Worked on bundle/unbundle, it works for now, have to rewrite the patches,(Ugly code)
Line 39: Line 47:
parent delta is implemented better compression can be achieved but, may have to compromise with number of seeks. ----
2010-07-06: Attended GSoC meeting on IRC, finished bundle/unbundle
Line 41: Line 50:
 * Current:Linear delta model
  . --> O(1) seeks to create any revision.
 * Parent delta model
  * ** I think this can be completed quickly, tonfa shared patches for this***
  . --> An intelligent algorithm to reduce number of seeks to reconstruct any revision.I think constructing revision is
2010-07-07: Pull from remote and localrepo partially working
Line 47: Line 52:
 . not straight forward as it is now. It's not going to be linear.
  . --> Investigation about merge revision. --> Optimization, and investigation for better implementation(If possible). --> Continuous parallel testing.
2010-07-08: Pull, push, clone working over http repo [[http://paste.lisp.org/+2ENB|this]] way.
Line 50: Line 54:
Changes in wire protocol: 2010-07-09: Support for sshrepo added.
Line 52: Line 56:
 * Changes in wire protocol to allow backward computability.
 * Implementation of changes according to old, new servers,client, repos.
  . --> Backward computability matrix (by tonfa): http://mercurial.selenic.com/wiki/ParentDeltaPlan
   . --> There are things to be discussed, and learned before implementation.
 * Rigorous testing
 * Integrating with other's work.(Changeset discovery, Lightweight copies/renames)
2010-07-10: Looked into subrepo issues, sent a patch to recursively pull subrepo, and support pull form clones.
Line 59: Line 58:
About Me: 2010-07-11 Rewrote patches, day was less productive though
Line 61: Line 60:
 . I am student at Indian Institute of Technology, Roorkee. I am currently pursing third year of 5 years degree, Mathematics as major. I 2010-07-12: Rewrote patches, sent them on mailing list
Line 63: Line 62:
am a Linux enthusiast. Interested in open source. I have been learning C,Python. Some of my projects can be found on github ( http://www.github.com/in3Xes ). I have never got opportunity to contribute to open source. GSoC would be stepping stone for that. ----
2010-07-21: Setup dev-env on new computer, import everything.
Line 65: Line 65:
 . I am project leader in IMG ( Information Management Group, IIT Roorkee, http://www.iitr.ac.in/IMG/ ), a group which manages intranet, 2010-07-22: -
Line 67: Line 67:
 . student databases, institute website. I have worked on PHP, Java in web development. 2010-07-23: Look into Discovery Plan and some DAG algorithms
Line 69: Line 69:
Contact Information: 2010-07-24: Look into bts, Nothing much though
Line 71: Line 71:
 . Name: Pradeepkumar Gayam
 Email: in3xes@gmail.com
 . IRC Nick: inXs_ (freenode.net)
 . Jabber ID: in3xes [AT] gmail.com (talk.google.com)
2010-07-25: Week end
Line 76: Line 73:
Schedule: 2010-07-26: Modify work according the changes in wireproto unification.
Line 78: Line 75:
 . I would start coding well before the schedule given in the time line. My final exams will end by May-15th. I can start working ----
2010-07-27: Refine patches according to suggestions in IRC.
Line 80: Line 78:
 . rigorously shortly after exams.
  . Till May-15th(or 24th according to time line) I will be reading code, documentation, getting used to coding style, getting ready to
 start coding, little experimentation and stuff.
2010-07-28: Ponder about compatibility matrix, No final conclusion
Line 84: Line 80:
'''Timeline''' 2010-07-29: --
Line 86: Line 82:
''Changes in revlog structure'': [Total 3 weeks] 2010-07-30: Work on new bundle format.
Line 88: Line 84:
 * Parent delta model:
  . --> An intelligent algorithm to reduce number of seeks to reconstruct any revision.I think constructing revision is not
2010-07-31: Work on wire-protocol stuff.
Line 91: Line 86:
 . straight forward as it is now. It's not going to be linear.
  . [(It can take 1 1/2 to 2 weeks to bring it to working stage, Roughly around 5th-Jun)]
  . --> Investigation about merge revision. --> Optimization, and investigation for better implementation(If possible).
   . --> Continuous parallel testing.
    . [( I would finish testing and other work nearly in 10 days, by 15th-Jun)]
2010-08-01: Week end.
Line 97: Line 88:
''Changes in wire protocol'':[Total 3 to 3 1/2 weeks] 2010-08-02:Work on wire-protocol stuff. Patches half ready.
Line 99: Line 90:
 * Changes in wire protocol to allow backward computability.
  . [(For midterm evaluation, this may not be complete, but significant part of it)]
  . --> Implementation of changes according to old, new servers,client, repos.
   . >> Backward compatibility matrix (by tonfa): http://mercurial.selenic.com/wiki/ParentDeltaPlan . >> There are things to be discussed, and learned before implementation.
----
== Work Progress: ==
''' '''Few things related revlog:
Line 104: Line 94:
 * Rigorous testing.
  . [(I will start testing shortly after mid-term evaluation, i.e., around 18th-July, would take a week to complete)]
  * Integrating with other's work.
   . [(Changeset discovery, Lightweight copies/renames). ( Remaining time, depending upon other students)]
1) A [[http://markmail.org/download.xqy?id=ooqvjdaor4t3sxan&number=1|script]] by tonfa, proof for parent delta.
Line 109: Line 96:
Link to my GSoC [[http://www.bitbucket.org/in3xes/gsoc|application]] on bitbucket.org 2) mpm [[http://markmail.org/thread/ooqvjdaor4t3sxan|explaining]] how space inefficiency is caused.
Line 111: Line 98:
## You can even more obfuscate your email address by adding more uppercase letters followed by a leading and trailing blank.
CategoryHomepage
3) A [[http://mercurial.selenic.com/hg/hg/file/45aabc523c86/contrib/shrink-revlog.py|python script]] in mercurial repo to shrink revlog by sorting in topological order.

=== 4th Week: ===
Earlier in this week I started working on revlog, most of the work is already done by tonfa. He shared patches. All I did was understand what he did, and rewrite those patches with minor changes.

'''strategy:'''

As mpm suggested, I have changed the way new revision is being added to existing revlog, and make corresponding changes in creating a given revision of file. The crucial step (performance affecting step) is creating a given version of file. previously to create a given version of a file first we take base revision , to that we keep adding the deltas till given revision. Earlier deltas are added in linear manner. So creating a given revision of file is simply seeking O(1) times.

''Defination of base in revlog is not given anywhere. From revlog, a base is the version where we add full text of file to the revlog instead of delta.''

example scenario:

We want to create a R1 revision of a file. The parents of R1 are P1 and P2. Base of R1 is B1.

Consider a situation where P2 is null, and P1 < B1. In this case I simply can't take B1 and create R1. I have first find base and parents of P1 and construct P1. The same case might repeat. where parent of P1 is < base of P1. So, first I have to find a base which is ancestor of all the parents, then create a chain of deltas from that base to construct a given revision.

Output of --profile says that, what ever the decrease in the performance is caused only by this function which used for constructing the delta chain. So, I need to find a better algorithm to construct the delta chain. The current algorithm for constructing the delta chain is.

{{{
    def deltachain(self, rev):
        chain = []
        while self.base(rev) != rev:
            chain.append(rev)
            rev = self.deltaparent(rev)
        chain.reverse()
        return rev, chain

    def deltaparent(self, rev):
        if self.base(rev) == rev:
            return nullrev
        elif self.flags(rev) & REVLOG_PARENTDELTA_FLAGS:
            return self.parentrevs(rev)[0]
        else:
            return rev-1
}}}
=== 5th Week: ===
In the starting of the week, I stopped working on revlog for sometime to look at other part of API. I studied abit about shallow clones, and parent delta's role in it. Though I didn't study in much detail, got general idea about how things work in shallow clones. Later worked on wire protocol abit.

After some gap I again started working on revlog performance, did some exprimentation, fixed some issues (which brought back the performace when the revlog is non-parent deltad). From the results I concluded that the decrease in the performance is bacause of number of patches being applied in to construct a revision of a file. In the previous post I wrote that revlog slowed down becase of the deltachain(), and deltaparent() functions are repeated being called. Thought that is one of the reason, but not the major reason. Some important results are pasted below.

(All the experiments are done on mercurial repo, Right now, I don't have good enough computer for larger repos )

 * Old revlog, Normal repo

{{{
hg perfrevlog .hg/store/00manifest.i
! wall 0.570377 comb 0.570000 user 0.560000 sys 0.010000 (best of 18)
hg verify --time
Time: real 11.290 secs (user 10.790+0.000 sys 0.250+0.000)
}}}
 * New revlog, Normal repo

{{{
hg perfrevlog .hg/store/00manifest.i
! wall 0.646474 comb 0.650000 user 0.640000 sys 0.010000 (best of 16)
hg verify --time
Time: real 15.930 secs (user 15.350+0.000 sys 0.290+0.000)
}}}
----
=== Final Week ===
 . After little modification in {{{deltachain()}}} finally I was able to make those patches work on current repos without much affecting performance. In the end {{{deltachain()}}} looks like:

{{{
    def deltachain(self, rev, cache):
        """return chain of revisions to construct a given revision"""
        chain = []
        check = False
        while self.base(rev) != rev and rev != cache:
            chain.append(rev)
            rev = self.deltaparent(rev)
        chain.reverse()
        if rev == cache:
            check = True
        return check, rev, chain
}}}
Results:

With hg1.6.2:

{{{
hg verify --time
Time: real 5.660 secs (user 5.510+0.000 sys 0.160+0.000)
}}}
With patches applied:

{{{
hg verify --time
Time: real 5.800 secs (user 5.690+0.000 sys 0.100+0.000)
}}}
Which means decrease in performance is less than 3%

== Work Remaining: ==
Wire protocol part is not implemented as it requires new bundle format. In push/pull/clone if we request a changegroup of changegroupsubset it has to redelta everytime on both server and client side. To be efficient new bundle format has to used. With new bundle format we can carry flags corresponding to each delta and its deltaparent. Next thing to do is implement a new changegroup to interact with new servers. As I already worked protocol it should not take much time to implement required changes.

Discussion about new bundle format can seen [[http://mercurial.markmail.org/message/anrw4s4svdrrk5lv|here]]

----
CategoryHomepage CategoryGsoc

Google Summer of Code-2010


Contact Details:

        Name:           Pradeepkumar Gayam
        Email:          in3xes@gmail.com
        IRC Nick:       in3xes (freenode.net)
        Jabber ID:      in3xes [AT] gmail.com (talk.google.com)

My GSoC application on bitbucket.


Journal:

2010-06-23: Study shallow clone how they are related to parent delta

2010-06-24: Revlog experiments to improve performance

2010-06-25: More experiments, fixed the issue, so that it will not differ for old revlogs, still expriments needs to be done.

2010-06-26: Study wire-proto, understand how bundle, unbundle is being done.

2010-06-27: Week end

2010-06-28: More expreiments, found the reason for the performance issue, this paste explains. Conclusion: 'The number of patches to be applied increases at starting rev of new branch'.


2010-06-29: Update blog, journal, Attened GSoC meeting,

2010-06-30: Worked through bundle/unbundle code and understand wire-proto completely

2010-07-01: Worked through bundle/unbundle code and understand wire-proto completely

2010-07-02: - (Tried to work, but, unproductive day)

2010-07-03: - (another unproductive day :-( )

2010-07-04: - Week end

2010-07-05: Worked on bundle/unbundle, it works for now, have to rewrite the patches,(Ugly code)


2010-07-06: Attended GSoC meeting on IRC, finished bundle/unbundle

2010-07-07: Pull from remote and localrepo partially working

2010-07-08: Pull, push, clone working over http repo this way.

2010-07-09: Support for sshrepo added.

2010-07-10: Looked into subrepo issues, sent a patch to recursively pull subrepo, and support pull form clones.

2010-07-11 Rewrote patches, day was less productive though

2010-07-12: Rewrote patches, sent them on mailing list


2010-07-21: Setup dev-env on new computer, import everything.

2010-07-22: -

2010-07-23: Look into Discovery Plan and some DAG algorithms

2010-07-24: Look into bts, Nothing much though

2010-07-25: Week end

2010-07-26: Modify work according the changes in wireproto unification.


2010-07-27: Refine patches according to suggestions in IRC.

2010-07-28: Ponder about compatibility matrix, No final conclusion

2010-07-29: --

2010-07-30: Work on new bundle format.

2010-07-31: Work on wire-protocol stuff.

2010-08-01: Week end.

2010-08-02:Work on wire-protocol stuff. Patches half ready.


Work Progress:

Few things related revlog:

1) A script by tonfa, proof for parent delta.

2) mpm explaining how space inefficiency is caused.

3) A python script in mercurial repo to shrink revlog by sorting in topological order.

4th Week:

Earlier in this week I started working on revlog, most of the work is already done by tonfa. He shared patches. All I did was understand what he did, and rewrite those patches with minor changes.

strategy:

As mpm suggested, I have changed the way new revision is being added to existing revlog, and make corresponding changes in creating a given revision of file. The crucial step (performance affecting step) is creating a given version of file. previously to create a given version of a file first we take base revision , to that we keep adding the deltas till given revision. Earlier deltas are added in linear manner. So creating a given revision of file is simply seeking O(1) times.

Defination of base in revlog is not given anywhere. From revlog, a base is the version where we add full text of file to the revlog instead of delta.

example scenario:

We want to create a R1 revision of a file. The parents of R1 are P1 and P2. Base of R1 is B1.

Consider a situation where P2 is null, and P1 < B1. In this case I simply can't take B1 and create R1. I have first find base and parents of P1 and construct P1. The same case might repeat. where parent of P1 is < base of P1. So, first I have to find a base which is ancestor of all the parents, then create a chain of deltas from that base to construct a given revision.

Output of --profile says that, what ever the decrease in the performance is caused only by this function which used for constructing the delta chain. So, I need to find a better algorithm to construct the delta chain. The current algorithm for constructing the delta chain is.

    def deltachain(self, rev):
        chain = []
        while self.base(rev) != rev:
            chain.append(rev)
            rev = self.deltaparent(rev)
        chain.reverse()
        return rev, chain

    def deltaparent(self, rev):
        if self.base(rev) == rev:
            return nullrev
        elif self.flags(rev) & REVLOG_PARENTDELTA_FLAGS:
            return self.parentrevs(rev)[0]
        else:
            return rev-1

5th Week:

In the starting of the week, I stopped working on revlog for sometime to look at other part of API. I studied abit about shallow clones, and parent delta's role in it. Though I didn't study in much detail, got general idea about how things work in shallow clones. Later worked on wire protocol abit.

After some gap I again started working on revlog performance, did some exprimentation, fixed some issues (which brought back the performace when the revlog is non-parent deltad). From the results I concluded that the decrease in the performance is bacause of number of patches being applied in to construct a revision of a file. In the previous post I wrote that revlog slowed down becase of the deltachain(), and deltaparent() functions are repeated being called. Thought that is one of the reason, but not the major reason. Some important results are pasted below.

(All the experiments are done on mercurial repo, Right now, I don't have good enough computer for larger repos )

  • Old revlog, Normal repo

hg perfrevlog .hg/store/00manifest.i
! wall 0.570377 comb 0.570000 user 0.560000 sys 0.010000 (best of 18)
hg verify --time
Time: real 11.290 secs (user 10.790+0.000 sys 0.250+0.000)
  • New revlog, Normal repo

hg perfrevlog .hg/store/00manifest.i
! wall 0.646474 comb 0.650000 user 0.640000 sys 0.010000 (best of 16)
hg verify --time
Time: real 15.930 secs (user 15.350+0.000 sys 0.290+0.000)


Final Week

  • After little modification in deltachain() finally I was able to make those patches work on current repos without much affecting performance. In the end deltachain() looks like:

    def deltachain(self, rev, cache):
        """return chain of revisions to construct a given revision"""
        chain = []
        check = False
        while self.base(rev) != rev and rev != cache:
            chain.append(rev)
            rev = self.deltaparent(rev)
        chain.reverse()
        if rev == cache:
            check = True
        return check, rev, chain

Results:

With hg1.6.2:

hg verify --time
Time: real 5.660 secs (user 5.510+0.000 sys 0.160+0.000)

With patches applied:

hg verify --time
Time: real 5.800 secs (user 5.690+0.000 sys 0.100+0.000)

Which means decrease in performance is less than 3%

Work Remaining:

Wire protocol part is not implemented as it requires new bundle format. In push/pull/clone if we request a changegroup of changegroupsubset it has to redelta everytime on both server and client side. To be efficient new bundle format has to used. With new bundle format we can carry flags corresponding to each delta and its deltaparent. Next thing to do is implement a new changegroup to interact with new servers. As I already worked protocol it should not take much time to implement required changes.

Discussion about new bundle format can seen here


CategoryHomepage CategoryGsoc

Pradeepkumar (last edited 2010-10-22 18:17:21 by mpm)