FreeBSD is on its way to move from CVS to SVN for the version control system for the Ports Collection. The decision was made to keep the complete history, so the complete CVS repository has to be converted to SVN.
As CVS has no way to record a copy or move of files inside the repository, we copied the CVS files inside the repository in case we wanted to copy or move a file (the so called “repocopy”). While this allows to see the full history of a file, the drawback is that you do not really know when a file was copied/moved if you are not strict at recording this info after doing a copy. Guess what, we where not.
Now with the move to SVN which has a build-in way for copies/moves, it would be nice if we could record this info. In an internal discussion someone told its not possible to detect a repocopy reliably.
Well, I thought otherwise and an hour later my mail went out how to detect one. The longest time was needed to write how to do it, not to come up with a solution. I do not know if someone picked up this algorithm and implemented something for the cvs2svn converter, but I decided to publish the algorithm here if someone needs a similar functionality somewhere else. Note, the following is tailored to the structure of the Ports Collection. This allows to speed up some things (no need to do all steps on all files). If you want to use this in a generic repository where the structure is not as regular as in our Ports Collection, you have to run this algorithm on all files.
It also detects commits where multiple files where committed at once in one commit (sweeping commits).
- check only category/name/Makefile
- generate a hash of each commitlog+committer
- if you are memory-limited use ha/sh/ed/dirs/cvs-rev and store pathname in the list cvs-rev (pathname = “category-name”) as storage
- store the hash also in pathname/cvs-rev
If you have only one item in ha/sh/ed/dirs/cvs-rev in the end, there was no repocopy and no sweeping commit, you can delete this ha/sh/ed/dirs/cvs-rev.
If you have more than … let’s say … 10 (subject to tuning) pathnames in ha/sh/ed/dirs/cvs-rev you found a sweeping commit and you can delete the ha/sh/ed/dirs/cvs-rev.
The remaining ha/sh/ed/dirs/cvs-rev are probably repocopies. Take one ha/sh/ed/dirs/cvs-rev and for each pathname (there may be more than 2 pathnames) in there have a look at pathname/. Take the first cvs-rev of each and check if they have the same hash. Continue with the next rev-number for each until you found a cvs-rev which does not contain the same hash. If the number of cvs-revs since the beginning is >= … let’s say … 3 (subject to tuning), you have a candidate for a repocopy. If it is >= … 10 (subject to tuning), you have a very good indicator for a repocopy. You have to proceed until you have only one pathname left.
You may detect multiple repocopies like A->B->C->D or A->B + A->D + A->C here.
Write out the repocopy candidate to a list and delete the ha/sh/ed/dirs/cvs-rev for each cvs-rev in a detected sequence.
This finds repocopy candidates for category/name/Makefile. To detect the correct repocopy-date (there are maybe cases where another file was changed after the Makefile but before the repocopy), you now have to look at all the files for a given repocopy-pair and check if there is a matching commit after the Makefile-commit-date. If you want to be 100% sure, you compare the complete commit-history of all files for a given repocopy-pair.
Tags: algorithm, category name, cvs repository, dirs, drawback, generic repository, hash, longest time, mail, multiple files —