Alexander Leidinger

Just another weblog

Jun
15

Algo­rithm to detect repo-copies in CVS

FreeBSD is on its way to move from CVS to SVN  for the ver­sion con­trol sys­tem for the Ports Col­lec­tion. The deci­sion was made to keep the com­plete his­tory, so the com­plete CVS repos­i­tory has to be con­verted to SVN.

As CVS has no way to record a copy or move of files inside the repos­i­tory, we copied the CVS files inside the repos­i­tory in case we wanted to copy or move a file (the so called “repocopy”). While this allows to see the full his­tory of a file, the draw­back is that you do not really know when a file was copied/moved if you are not strict at record­ing this info after doing a copy. Guess what, we where not.

Now with the move to SVN which has a build-in way for copies/moves, it would be nice if we could record this info. In an inter­nal dis­cus­sion some­one told its not pos­si­ble to detect a repocopy reliably.

Well, I thought oth­er­wise and an hour later my mail went out how to detect one. The longest time was needed to write how to do it, not to come up with a solu­tion. I do not know if some­one picked up this algo­rithm and imple­mented some­thing for the cvs2svn con­verter, but I decided to pub­lish the algo­rithm here if some­one needs a sim­i­lar func­tion­al­ity some­where else. Note, the fol­low­ing is tai­lored to the struc­ture of the Ports Col­lec­tion. This allows to speed up some things (no need to do all steps on all files). If you want to use this in a generic repos­i­tory where the struc­ture is not as reg­u­lar as in our Ports Col­lec­tion, you have to run this algo­rithm on all files.

It also detects com­mits where mul­ti­ple files where com­mit­ted at once in one com­mit (sweep­ing commits).

Prepa­ra­tion

  • check only category/name/Make­file
  • gen­er­ate a hash of each commitlog+committer
  • if you are memory-limited use ha/sh/ed/dirs/cvs-rev and store path­name in the list cvs-rev (path­name = “category-name”) as storage
  • store the hash also in pathname/cvs-rev

If you have only one item in ha/sh/ed/dirs/cvs-rev in the end, there was no repocopy and no sweep­ing com­mit, you can delete this ha/sh/ed/dirs/cvs-rev.

If you have more than … let’s say … 10 (sub­ject to tun­ing) path­names in ha/sh/ed/dirs/cvs-rev you found a sweep­ing com­mit and you can delete the ha/sh/ed/dirs/cvs-rev.

The meat

The remain­ing ha/sh/ed/dirs/cvs-rev are prob­a­bly repocopies. Take one ha/sh/ed/dirs/cvs-rev and for each path­name (there may be more than 2 path­names) in there have a look at pathname/. Take the first cvs-rev of each and check if they have the same hash. Con­tinue with the next rev-number for each until you found a cvs-rev which does not con­tain the same hash. If the num­ber of cvs-revs since the begin­ning is >= … let’s say … 3 (sub­ject to tun­ing), you have a can­di­date for a repocopy. If it is >=  … 10 (sub­ject to tun­ing), you have a very good indi­ca­tor for a repocopy. You have to pro­ceed until you have only one path­name left.

You may detect mul­ti­ple repocopies like A->B->C->D or A->B + A->D + A->C here.

Write out the repocopy can­di­date to a list and delete the ha/sh/ed/dirs/cvs-rev for each cvs-rev in a detected sequence.

This finds repocopy can­di­dates for category/name/Makefile. To detect the cor­rect repocopy-date (there are maybe cases where another file was changed after the Make­file but before the repocopy), you now have to look at all the files for a given repocopy-pair and check if there is a match­ing com­mit after the Makefile-commit-date. If you want to be 100% sure, you com­pare the com­plete commit-history of all files for a given repocopy-pair.

GD Star Rat­ing
load­ing…
GD Star Rat­ing
load­ing…

Tags: , , , , , , , , ,

No Responses to “Algo­rithm to detect repo-copies in CVS

Leave a Reply