Algorithm to detect repo-copies in CVS

FreeBSD is in the process of moving the version control system of the Ports Collection from CVS to SVN. The decision was made to keep the complete history, so the complete CVS repository has to be converted to SVN.

As CVS has no way to record a copy or move of files inside the repository, we copied the CVS files inside the repository whenever we wanted to copy or move a file (the so-called “repocopy”). While this lets you see the full history of a file, the drawback is that you do not really know when a file was copied or moved unless you are strict about recording this info after doing a copy. Guess what, we were not.

Now, with the move to SVN, which has a built-in way to record copies and moves, it would be nice if we could record this info. In an internal discussion someone said it is not possible to detect a repocopy reliably.

Well, I thought otherwise, and an hour later my mail went out describing how to detect one. Most of that time was needed to write down how to do it, not to come up with a solution. I do not know if someone picked up this algorithm and implemented something for the cvs2svn converter, but I decided to publish the algorithm here in case someone needs similar functionality somewhere else. Note that the following is tailored to the structure of the Ports Collection. This allows us to speed up some things (no need to do all steps on all files). If you want to use this in a generic repository where the structure is not as regular as in our Ports Collection, you have to run this algorithm on all files.

It also detects commits in which many files were committed at once (sweeping commits).

Preparation

  • check only category/name/Makefile
  • generate a hash of each commitlog+committer
  • if you are memory-limited, use the file system as storage: create ha/sh/ed/dirs/cvs-rev (the hash split into directories) and store the pathname in the list cvs-rev (pathname = “category-name”)
  • also store the hash in pathname/cvs-rev

If you end up with only one item in a ha/sh/ed/dirs/cvs-rev, there was no repocopy and no sweeping commit, and you can delete this ha/sh/ed/dirs/cvs-rev.

If you have more than, let’s say, 10 (subject to tuning) pathnames in a ha/sh/ed/dirs/cvs-rev, you have found a sweeping commit and you can delete the ha/sh/ed/dirs/cvs-rev.
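The preparation steps can be sketched in memory as follows. This is only an illustration, not the code of any existing converter: the input format (tuples of pathname, CVS revision, commit log, committer), the function names and the choice of SHA-1 are assumptions made for the example. The dictionary `buckets` plays the role of ha/sh/ed/dirs/cvs-rev and `per_path` the role of pathname/cvs-rev.

```python
import hashlib
from collections import defaultdict

SWEEP_THRESHOLD = 10  # "subject to tuning", as in the text


def commit_hash(log, committer):
    """Hash of commitlog+committer (SHA-1 is an arbitrary choice here)."""
    return hashlib.sha1(f"{committer}\0{log}".encode("utf-8")).hexdigest()


def build_indexes(revisions):
    """revisions: iterable of (pathname, cvs_rev, log, committer) tuples,
    one per revision of each category/name/Makefile (hypothetical format)."""
    buckets = defaultdict(list)   # (hash, cvs_rev) -> list of pathnames
    per_path = defaultdict(dict)  # pathname -> {cvs_rev: hash}
    for pathname, rev, log, committer in revisions:
        h = commit_hash(log, committer)
        buckets[(h, rev)].append(pathname)
        per_path[pathname][rev] = h
    return buckets, per_path


def classify(buckets, sweep_threshold=SWEEP_THRESHOLD):
    """Drop single-entry buckets, separate sweeping commits from candidates."""
    candidates, sweeping = {}, {}
    for key, paths in buckets.items():
        if len(paths) <= 1:
            continue  # no repocopy and no sweeping commit
        if len(paths) > sweep_threshold:
            sweeping[key] = paths
        else:
            candidates[key] = paths
    return candidates, sweeping
```

Keeping everything in dictionaries only works if the repository metadata fits into memory; the directory layout described above is the disk-backed variant of the same idea.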

The meat

The remaining ha/sh/ed/dirs/cvs-rev are probably repocopies. Take one ha/sh/ed/dirs/cvs-rev and, for each pathname in there (there may be more than 2 pathnames), have a look at pathname/. Take the first cvs-rev of each and check if they have the same hash. Continue with the next rev-number for each until you find a cvs-rev which does not contain the same hash. If the number of matching cvs-revs since the beginning is >= 3, let’s say (subject to tuning), you have a candidate for a repocopy. If it is >= 10 (subject to tuning), you have a very good indicator for a repocopy. You have to proceed until you have only one pathname left.

You may detect multiple repocopies like A->B->C->D or A->B + A->D + A->C here.

Write out the repocopy candidate to a list and delete the ha/sh/ed/dirs/cvs-rev for each cvs-rev in a detected sequence.
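The comparison of the revision sequences could look like the sketch below. It is a simplified pairwise variant of the description above: instead of walking all pathnames of a bucket in lockstep, it compares every pair. It assumes trunk revisions only (1.1, 1.2, ...) and a hypothetical `per_path` mapping of pathname to {cvs-rev: hash}; all names are made up for the example. It does not try to decide which pathname is the original.

```python
from itertools import combinations

CANDIDATE_MIN = 3   # "subject to tuning"
STRONG_MIN = 10     # "subject to tuning"


def rev_key(rev):
    """Sort key so that 1.10 comes after 1.9 (plain string sort would not)."""
    return tuple(int(part) for part in rev.split("."))


def shared_prefix_length(history_a, history_b):
    """Count leading revisions that carry the same hash in both histories."""
    count = 0
    for rev in sorted(history_a, key=rev_key):
        if history_b.get(rev) != history_a[rev]:
            break
        count += 1
    return count


def rate_bucket(pathnames, per_path):
    """Return (path_a, path_b, shared_revs, rating) for every pair in a
    bucket whose common history is long enough to suggest a repocopy."""
    results = []
    for a, b in combinations(sorted(pathnames), 2):
        n = shared_prefix_length(per_path[a], per_path[b])
        if n >= STRONG_MIN:
            results.append((a, b, n, "strong"))
        elif n >= CANDIDATE_MIN:
            results.append((a, b, n, "candidate"))
    return results
```

With chains like A->B->C->D, every pair shares a prefix of a different length, so the pair lengths give a hint about the order in which the copies were made.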

This finds repocopy candidates for category/name/Makefile. To detect the correct repocopy date (there may be cases where another file was changed after the Makefile but before the repocopy), you now have to look at all the files of a given repocopy pair and check if there is a matching commit after the Makefile commit date. If you want to be 100% sure, you compare the complete commit history of all files of a given repocopy pair.
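One way to narrow down the repocopy date is sketched here. It assumes that for both ports of a pair you have, per file name, a list of (timestamp, hash) entries ordered oldest first; this data layout and the function name are assumptions for the illustration. The latest commit that both copies still share in any file is a lower bound for the time of the repocopy.

```python
def last_shared_commit(files_a, files_b):
    """files_a, files_b: {filename: [(timestamp, hash), ...]} for the two
    ports of a repocopy pair, each list ordered oldest first.
    Returns the timestamp of the latest commit shared by both copies,
    or None if they share no leading commit at all."""
    latest = None
    for name in files_a.keys() & files_b.keys():
        for (ts_a, hash_a), (_ts_b, hash_b) in zip(files_a[name], files_b[name]):
            if hash_a != hash_b:
                break  # histories diverge here for this file
            if latest is None or ts_a > latest:
                latest = ts_a
    return latest
```

The repocopy must have happened after the returned timestamp and before the first commit that exists in only one of the two copies.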
