Differences between revisions 28 and 63 (spanning 35 versions)
Revision 28 as of 2010-04-18 15:32:23
Size: 4845
Editor: Pradeepkumar
Comment:
Revision 63 as of 2010-06-29 07:25:05
Size: 5629
Editor: Pradeepkumar
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
## page was renamed from in3xes
## page was renamed from Pradeepkumar
== Google Summer of Code-2010 ==
----
=== Contact Details: ===
{{{
        Name: Pradeepkumar Gayam
        Email: in3xes@gmail.com
        IRC Nick: in3xes (freenode.net)
        Jabber ID: in3xes [AT] gmail.com (talk.google.com)
}}}
My GSoC [[http://bitbucket.org/in3xes/gsoc/src/tip/application|application]] on bitbucket.
Line 2: Line 14:
----
''' '''
Line 3: Line 17:
== Google Summer of Code-2010 ==
I am interested in "Parent delta". I can also work on conversion tools. But I am mainly focusing on Parent Delta Plan.
=== Journal: ===
2010-06-23: Study shallow clone how they are related to parent delta
Line 6: Line 20:
A word about project: 2010-06-24: Revlog experiments to improve performance
Line 8: Line 22:
Mercurial calculates diffs against previous revision rather than parent. In some cases this implementation is space inefficient, and it is more sensible to store deltas against parent. This project is about implementing [[http://mercurial.selenic.com/wiki/ParentDeltaPlan|Parent Delta Plan]]. This implementation changes the structure of revlogs, so wire protocol has to be extended to allow backward compatibility. 2010-06-25: More experiments, fixed the issue, so that it will not differ for old revlogs, still expriments needs to be done.
Line 10: Line 24:
============================================================================ 2010-07-26: Study wire-proto, understand how bundle, unbundle is being done.
Line 12: Line 26:
=== Proposal ===
TO DO(Tentative): Major:
2010-06-27: Week end
Line 15: Line 28:
 * Changes in revlog
 * Changes in wire protocol
----
=== Work Progress: ===
''' '''Few things related revlog:
Line 18: Line 32:
Changes in revlog: 1) A [[http://markmail.org/download.xqy?id=ooqvjdaor4t3sxan&number=1|script]] by tonfa, proof for parent delta.
Line 20: Line 34:
 . Change revlog structure, now it appends the next delta to existing data.This helps in reconstrfucting a revision at O(1) seeks. If 2) mpm [[http://markmail.org/thread/ooqvjdaor4t3sxan|explaining]] how space inefficiency is caused.
Line 22: Line 36:
parent delta is implemented better compression can be achieved but, may have to compromise with number of seeks. 3) A [[http://mercurial.selenic.com/hg/hg/file/45aabc523c86/contrib/shrink-revlog.py|python script]] in mercurial repo to shrink revlog by sorting in topological order.
Line 24: Line 38:
 * Current:Linear delta model
  . --> O(1) seeks to create any revision.
 * Parent delta model
  * ** I think this can be completed quickly, tonfa shared patches for this***
  . --> An intelligent algorithm to reduce number of seeks to reconstruct any revision.I think constructing revision is
'''__4th Week:__'''
Line 30: Line 40:
 . not straight forward as it is now. It's not going to be linear.
  . --> Investigation about merge revision. --> Optimization, and investigation for better implementation(If possible). --> Continuous parallel testing.
Earlier in this week I started working on revlog, most of the work is already done by tonfa. He shared patches. All I did was understand what he did, and rewrite those patches with minor changes.
Line 33: Line 42:
Changes in wire protocol: '''strategy:'''
Line 35: Line 44:
 * Changes in wire protocol to allow backward computability.
 * Implementation of changes according to old, new servers,client, repos.
  . --> Backward computability matrix (by tonfa): http://mercurial.selenic.com/wiki/ParentDeltaPlan
   . --> There are things to be discussed, and learned before implementation.
 * Rigorous testing
 * Integrating with other's work.(Changeset discovery, Lightweight copies/renames)
As mpm suggested, I have changed the way new revision is being added to existing revlog, and make corresponding changes in creating a given revision of file. The crucial step (performance affecting step) is creating a given version of file. previously to create a given version of a file first we take base revision , to that we keep adding the deltas till given revision. Earlier deltas are added in linear manner. So creating a given revision of file is simply seeking O(1) times.
Line 42: Line 46:
About Me: ''Defination of base in revlog is not given anywhere. From revlog, a base is the version where we add full text of file to the revlog instead of delta.''
Line 44: Line 48:
 . I am student at Indian Institute of Technology, Roorkee. I am currently pursing third year of 5 years degree, Mathematics as major. I example scenario:
Line 46: Line 50:
am a Linux enthusiast. Interested in open source. I have been learning C,Python. Some of my projects can be found on github ( http://www.github.com/in3Xes ). I have never got opportunity to contribute to open source. GSoC would be stepping stone for that. We want to create a R1 revision of a file. The parents of R1 are P1 and P2. Base of R1 is B1.
Line 48: Line 52:
 . I am project leader in IMG ( Information Management Group, IIT Roorkee, http://www.iitr.ac.in/IMG/ ), a group which manages intranet, Consider a situation where P2 is null, and P1 < B1. In this case I simply can't take B1 and create R1. I have first find base and parents of P1 and construct P1. The same case might repeat. where parent of P1 is < base of P1. So, first I have to find a base which is ancestor of all the parents, then create a chain of deltas from that base to construct a given revision.
Line 50: Line 54:
 . student databases, institute website. I have worked on PHP, Java in web development. Output of --profile says that, what ever the decrease in the performance is caused only by this function which used for constructing the delta chain. So, I need to find a better algorithm to construct the delta chain. The current algorithm for constructing the delta chain is.
Line 52: Line 56:
Contact Information: {{{
    def deltachain(self, rev):
        chain = []
        while self.base(rev) != rev:
            chain.append(rev)
            rev = self.deltaparent(rev)
        chain.reverse()
        return rev, chain
Line 54: Line 65:
 . Name: Pradeepkumar Gayam
 Email: in3xes@gmail.com
 . IRC Nick: inXs_ (freenode.net)
 . Jabber ID: in3xes [AT] gmail.com (talk.google.com)
    def deltaparent(self, rev):
        if self.base(rev) == rev:
            return nullrev
        elif self.flags(rev) & REVLOG_PARENTDELTA_FLAGS:
            return self.parentrevs(rev)[0]
        else:
            return rev-1
}}}
'''__5th Week:__'''
Line 59: Line 75:
Schedule: In the starting of the week, I stopped working on revlog for sometime to look at other parts of the api. In the starting I studied abit abotu shallow clones, and parent delta's role in it. Though I didn't go much detailed in it, I got general idea about how things work in shallow clones. Later studied wire protocol abit.
Line 61: Line 77:
 . I would start coding well before the schedule given in the time line. My final exams will end by May-15th. I can start working After some gap I again started working on revlog performance, did a bit of exprimentation, fixed some issues(which brought back the performace when the revlog is non-parent deltad). From the results I concluded that the decrease in the performance is bacause of cache (As tonfa said). In the previous post I wrote that revlog slowed down becase of the deltachain(), and deltaparent() functions are repeated being called. Thought that is one of the reason, but not the major reason. Here are some results.
Line 63: Line 79:
 . rigorously shortly after exams.
  . Till May-15th(or 24th according to time line) I will be reading code, documentation, getting used to coding style, getting ready to
 start coding, little experimentation and stuff.
(All the experiments are done on mercurial repo, Right now, I don't have good enough computer for larger repos )
Line 67: Line 81:
'''Timeline''' -- Old revlog, Normal repo --
Line 69: Line 83:
''Changes in revlog structure'': [Total 3 weeks] {{{
hg perfrevlog .hg/store/00manifest.i
! wall 0.570377 comb 0.570000 user 0.560000 sys 0.010000 (best of 18)
hg verify --time
Time: real 11.290 secs (user 10.790+0.000 sys 0.250+0.000)
}}}
-- New revlog (before fix), Normal repo --
Line 71: Line 91:
 * Parent delta model:
  . --> An intelligent algorithm to reduce number of seeks to reconstruct any revision.I think constructing revision is not
{{{
hg perfrevlog .hg/store/00manifest.i
! wall 0.646474 comb 0.650000 user 0.640000 sys 0.010000 (best of 16)
hg verify --time
Time: real 15.930 secs (user 15.350+0.000 sys 0.290+0.000)
}}}
-- New revlog (after fix), Normal repo --
Line 74: Line 99:
 . straight forward as it is now. It's not going to be linear.
  . [(It can take 1 1/2 to 2 weeks to bring it to working stage, Roughly around 5th-Jun)]
  . --> Investigation about merge revision. --> Optimization, and investigation for better implementation(If possible).
   . --> Continuous parallel testing.
    . [( I would finish testing and other work nearly in 10 days, by 15th-Jun)]
{{{
hg perfrevlog .hg/store/00manifest.i
! wall 0.589325 comb 0.550000 user 0.540000 sys 0.010000 (best of 17)
hg verify --time
Time: real 12.420 secs (user 11.700+0.000 sys 0.300+0.000)
}}}
-- New revlog (before fix), Repo with parent delta'ed manifest --
Line 80: Line 107:
''Changes in wire protocol'':[Total 3 to 3 1/2 weeks]

 * Changes in wire protocol to allow backward computability.
  . [(For midterm evaluation, this may not be complete, but significant part of it)]
  . --> Implementation of changes according to old, new servers,client, repos.
   . >> Backward compatibility matrix (by tonfa): http://mercurial.selenic.com/wiki/ParentDeltaPlan . >> There are things to be discussed, and learned before implementation.

 * Rigorous testing.
  . [(I will start testing shortly after mid-term evaluation, i.e., around 18th-July, would take a week to complete)]
  * Integrating with other's work.
   . [(Changeset discovery, Lightweight copies/renames). ( Remaining time, depending upon other students)]

Link to my GSoC [[http://www.bitbucket.org/in3xes/gsoc|application]] on bitbucket.org

## You can even more obfuscate your email address by adding more uppercase letters followed by a leading and trailing blank.
CategoryHomepage
{{{
hg perfrevlog .hg/store/00manifest.i
! wall 2.044643 comb 2.050000 user 2.020000 sys 0.030000 (best of 5)
hg verify --time
Time: real 42.450 secs (user 41.330+0.000 sys 0.300+0.000)
}}}
-- New revlog (after fix), Repo with parent delta'ed manifest --{{{
hg perfrevlog .hg/store/00manifest.i
! wall 1.967572 comb 1.970000 user 1.950000 sys 0.020000 (best of 6)
hg verify --time
}}}
Time: real 39.150 secs (user 37.470+0.000 sys 0.450+0.000)

Google Summer of Code-2010


Contact Details:

        Name:           Pradeepkumar Gayam
        Email:          in3xes@gmail.com
        IRC Nick:       in3xes (freenode.net)
        Jabber ID:      in3xes [AT] gmail.com (talk.google.com)

My GSoC application on bitbucket.


Journal:

2010-06-23: Study shallow clone how they are related to parent delta

2010-06-24: Revlog experiments to improve performance

2010-06-25: More experiments, fixed the issue, so that it will not differ for old revlogs, still expriments needs to be done.

2010-07-26: Study wire-proto, understand how bundle, unbundle is being done.

2010-06-27: Week end


Work Progress:

Few things related revlog:

1) A script by tonfa, proof for parent delta.

2) mpm explaining how space inefficiency is caused.

3) A python script in mercurial repo to shrink revlog by sorting in topological order.

4th Week:

Earlier in this week I started working on revlog, most of the work is already done by tonfa. He shared patches. All I did was understand what he did, and rewrite those patches with minor changes.

strategy:

As mpm suggested, I have changed the way new revision is being added to existing revlog, and make corresponding changes in creating a given revision of file. The crucial step (performance affecting step) is creating a given version of file. previously to create a given version of a file first we take base revision , to that we keep adding the deltas till given revision. Earlier deltas are added in linear manner. So creating a given revision of file is simply seeking O(1) times.

Defination of base in revlog is not given anywhere. From revlog, a base is the version where we add full text of file to the revlog instead of delta.

example scenario:

We want to create a R1 revision of a file. The parents of R1 are P1 and P2. Base of R1 is B1.

Consider a situation where P2 is null, and P1 < B1. In this case I simply can't take B1 and create R1. I have first find base and parents of P1 and construct P1. The same case might repeat. where parent of P1 is < base of P1. So, first I have to find a base which is ancestor of all the parents, then create a chain of deltas from that base to construct a given revision.

Output of --profile says that, what ever the decrease in the performance is caused only by this function which used for constructing the delta chain. So, I need to find a better algorithm to construct the delta chain. The current algorithm for constructing the delta chain is.

    def deltachain(self, rev):
        chain = []
        while self.base(rev) != rev:
            chain.append(rev)
            rev = self.deltaparent(rev)
        chain.reverse()
        return rev, chain

    def deltaparent(self, rev):
        if self.base(rev) == rev:
            return nullrev
        elif self.flags(rev) & REVLOG_PARENTDELTA_FLAGS:
            return self.parentrevs(rev)[0]
        else:
            return rev-1

5th Week:

In the starting of the week, I stopped working on revlog for sometime to look at other parts of the api. In the starting I studied abit abotu shallow clones, and parent delta's role in it. Though I didn't go much detailed in it, I got general idea about how things work in shallow clones. Later studied wire protocol abit.

After some gap I again started working on revlog performance, did a bit of exprimentation, fixed some issues(which brought back the performace when the revlog is non-parent deltad). From the results I concluded that the decrease in the performance is bacause of cache (As tonfa said). In the previous post I wrote that revlog slowed down becase of the deltachain(), and deltaparent() functions are repeated being called. Thought that is one of the reason, but not the major reason. Here are some results.

(All the experiments are done on mercurial repo, Right now, I don't have good enough computer for larger repos )

-- Old revlog, Normal repo --

hg perfrevlog .hg/store/00manifest.i
! wall 0.570377 comb 0.570000 user 0.560000 sys 0.010000 (best of 18)
hg verify --time
Time: real 11.290 secs (user 10.790+0.000 sys 0.250+0.000)

-- New revlog (before fix), Normal repo --

hg perfrevlog .hg/store/00manifest.i
! wall 0.646474 comb 0.650000 user 0.640000 sys 0.010000 (best of 16)
hg verify --time
Time: real 15.930 secs (user 15.350+0.000 sys 0.290+0.000)

-- New revlog (after fix), Normal repo --

hg perfrevlog .hg/store/00manifest.i
! wall 0.589325 comb 0.550000 user 0.540000 sys 0.010000 (best of 17)
hg verify --time
Time: real 12.420 secs (user 11.700+0.000 sys 0.300+0.000)

-- New revlog (before fix), Repo with parent delta'ed manifest --

hg perfrevlog .hg/store/00manifest.i
! wall 2.044643 comb 2.050000 user 2.020000 sys 0.030000 (best of 5)
hg verify --time
Time: real 42.450 secs (user 41.330+0.000 sys 0.300+0.000)

-- New revlog (after fix), Repo with parent delta'ed manifest --

hg perfrevlog .hg/store/00manifest.i
! wall 1.967572 comb 1.970000 user 1.950000 sys 0.020000 (best of 6)
hg verify --time

Time: real 39.150 secs (user 37.470+0.000 sys 0.450+0.000)

Pradeepkumar (last edited 2010-10-22 18:17:21 by mpm)