Algorithm to detect repo-copies in CVS

FreeBSD is in the process of moving the Ports Collection from CVS to SVN as its version control system. The decision was made to keep the complete history, so the complete CVS repository has to be converted to SVN.

As CVS has no way to record a copy or move of files inside the repository, we copied the CVS files inside the repository whenever we wanted to copy or move a file (the so-called “repocopy”). While this lets you see the full history of a file, the drawback is that you do not really know when a file was copied or moved unless you are strict about recording this info after doing a copy. Guess what, we were not.

Now, with the move to SVN, which has a built-in way to record copies and moves, it would be nice if we could record this info. In an internal discussion someone said it is not possible to detect a repocopy reliably.

Well, I thought otherwise, and an hour later my mail explaining how to detect one went out. Most of that time went into writing down how to do it, not into coming up with a solution. I do not know if someone picked up this algorithm and implemented something for the cvs2svn converter, but I decided to publish the algorithm here in case someone needs similar functionality somewhere else. Note that the following is tailored to the structure of the Ports Collection. This makes it possible to speed up some things (no need to do all steps on all files). If you want to use this in a generic repository where the structure is not as regular as in our Ports Collection, you have to run this algorithm on all files.

It also detects commits where multiple files were committed at once (sweeping commits).


  • check only category/name/Makefile
  • generate a hash of commit log + committer for each commit
  • if you are memory-limited, use the filesystem as storage: create ha/sh/ed/dirs/cvs-rev (the hash split into directories, with a file named after the cvs-rev) and store the pathname in the list inside cvs-rev (pathname = “category-name”)
  • store the hash also in pathname/cvs-rev
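The indexing steps above can be sketched in memory as follows. This is only an illustration, not the code from the original mail: the function names, the input layout (a dict of pathname to a list of revisions) and the choice of SHA-1 are assumptions made for the example. A dict keyed by (hash, cvs-rev) plays the role of ha/sh/ed/dirs/cvs-rev, and a second dict plays the role of pathname/cvs-rev.

```python
import hashlib
from collections import defaultdict


def commit_hash(log, committer):
    """Hash of commit log + committer. SHA-1 is an arbitrary choice here."""
    return hashlib.sha1(f"{committer}\n{log}".encode("utf-8")).hexdigest()


def build_index(history):
    """Build both lookup structures from the Makefile histories.

    history (hypothetical layout):
        {pathname: [(cvs_rev, committer, log), ...]} in revision order.
    """
    by_commit = defaultdict(list)  # (hash, cvs_rev) -> [pathname, ...]
    by_path = {}                   # pathname -> [(cvs_rev, hash), ...]
    for pathname, revisions in history.items():
        by_path[pathname] = []
        for cvs_rev, committer, log in revisions:
            h = commit_hash(log, committer)
            by_commit[(h, cvs_rev)].append(pathname)
            by_path[pathname].append((cvs_rev, h))
    return by_commit, by_path
```

If memory is tight, the same two mappings can be written to disk in the directory layout described in the list instead of being kept in dicts.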

If you end up with only one item in a ha/sh/ed/dirs/cvs-rev, there was no repocopy and no sweeping commit, and you can delete this ha/sh/ed/dirs/cvs-rev.

If you have more than … let’s say … 10 (subject to tuning) pathnames in a ha/sh/ed/dirs/cvs-rev, you have found a sweeping commit and you can delete the ha/sh/ed/dirs/cvs-rev.
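The two pruning rules can be expressed as one filter. This is a minimal sketch that assumes the index is a dict mapping (hash, cvs-rev) to a list of pathnames; the function name and the parameter name sweep_threshold are made up for the example, and 10 is just the tunable value mentioned above.

```python
def drop_singletons_and_sweeps(by_commit, sweep_threshold=10):
    """Keep only entries that may be repocopies.

    by_commit: {(hash, cvs_rev): [pathname, ...]} (hypothetical layout).
    An entry with one pathname is an ordinary commit; an entry with more
    than sweep_threshold pathnames is treated as a sweeping commit.
    """
    return {
        key: paths
        for key, paths in by_commit.items()
        if 1 < len(paths) <= sweep_threshold
    }
```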

The meat

The remaining ha/sh/ed/dirs/cvs-rev are probably repocopies. Take one ha/sh/ed/dirs/cvs-rev and, for each pathname in there (there may be more than 2 pathnames), have a look at pathname/. Take the first cvs-rev of each and check if they have the same hash. Continue with the next rev number for each until you find a cvs-rev which does not contain the same hash. If the number of matching cvs-revs since the beginning is >= … let’s say … 3 (subject to tuning), you have a candidate for a repocopy. If it is >= … 10 (subject to tuning), you have a very good indicator for a repocopy. You have to proceed until you have only one pathname left.
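The comparison described above boils down to measuring how long two histories stay identical from the first revision on. The sketch below assumes each pathname maps to its list of commit hashes in revision order, and it simplifies the "proceed until only one pathname is left" walk into a comparison of every pair of pathnames; all names and the default thresholds (3 and 10, the tunable values from the text) are illustrative.

```python
from itertools import combinations


def shared_prefix_length(hashes_a, hashes_b):
    """Number of leading revisions whose commit hashes are identical."""
    n = 0
    for a, b in zip(hashes_a, hashes_b):
        if a != b:
            break
        n += 1
    return n


def classify(n, candidate_min=3, strong_min=10):
    """Map a shared-prefix length to a verdict, or None if too short."""
    if n >= strong_min:
        return "strong"
    if n >= candidate_min:
        return "candidate"
    return None


def repocopy_candidates(pathnames, by_path):
    """Compare every pair of pathnames found in one index entry.

    by_path: {pathname: [hash, ...]} in revision order (hypothetical layout).
    Returns a list of (path_a, path_b, shared_revisions, verdict).
    """
    results = []
    for a, b in combinations(sorted(pathnames), 2):
        n = shared_prefix_length(by_path[a], by_path[b])
        verdict = classify(n)
        if verdict is not None:
            results.append((a, b, n, verdict))
    return results
```

Comparing pairs is quadratic in the number of pathnames per entry, which should be harmless here because entries with many pathnames were already discarded as sweeping commits.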

You may detect multiple repocopies like A->B->C->D or A->B + A->D + A->C here.

Write out the repocopy candidate to a list and delete the ha/sh/ed/dirs/cvs-rev for each cvs-rev in a detected sequence.

This finds repocopy candidates for category/name/Makefile. To detect the correct repocopy date (there may be cases where another file was changed after the Makefile but before the repocopy), you now have to look at all the files for a given repocopy pair and check if there is a matching commit after the Makefile commit date. If you want to be 100% sure, compare the complete commit history of all files for a given repocopy pair.
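One way to read this last step: the repocopy happened after the latest commit that both ports still share in any of their files. The sketch below assumes that each port's files are given as a dict of filename to a list of (date, hash) tuples, with dates as ISO strings so that they sort chronologically; the function name and data layout are invented for the example.

```python
def last_shared_commit(files_a, files_b):
    """Latest (date, hash) found in the same-named file of both ports.

    files_a, files_b: {filename: [(iso_date, hash), ...]} (hypothetical layout).
    Returns None if the two ports share no commit at all.
    """
    latest = None
    for name in files_a.keys() & files_b.keys():
        shared = set(files_a[name]) & set(files_b[name])
        for entry in shared:
            if latest is None or entry > latest:
                latest = entry
    return latest
```

If the result is later than the last shared Makefile commit, another file was changed between that Makefile commit and the repocopy, and the later date is the better lower bound.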