Alexander Leidinger

Just another weblog


Algo­rithm to detect repo-copies in CVS

FreeBSD is on its way to move from CVS to SVN  for the ver­sion con­trol sys­tem for the Ports Col­lec­tion. The deci­sion was made to keep the com­plete his­tory, so the com­plete CVS repos­i­tory has to be con­verted to SVN.

As CVS has no way to record a copy or move of files inside the repos­i­tory, we copied the CVS files inside the repos­i­tory in case we wanted to copy or move a file (the so called “repocopy”). While this allows to see the full his­tory of a file, the draw­back is that you do not really know when a file was copied/moved if you are not strict at record­ing this info after doing a copy. Guess what, we where not.

Now with the move to SVN which has a build-in way for copies/moves, it would be nice if we could record this info. In an inter­nal dis­cus­sion some­one told its not pos­si­ble to detect a repocopy reliably.

Well, I thought oth­er­wise and an hour later my mail went out how to detect one. The longest time was needed to write how to do it, not to come up with a solu­tion. I do not know if some­one picked up this algo­rithm and imple­mented some­thing for the cvs2svn con­verter, but I decided to pub­lish the algo­rithm here if some­one needs a sim­i­lar func­tion­al­ity some­where else. Note, the fol­low­ing is tai­lored to the struc­ture of the Ports Col­lec­tion. This allows to speed up some things (no need to do all steps on all files). If you want to use this in a generic repos­i­tory where the struc­ture is not as reg­u­lar as in our Ports Col­lec­tion, you have to run this algo­rithm on all files.

It also detects com­mits where mul­ti­ple files where com­mit­ted at once in one com­mit (sweep­ing commits).


  • check only category/name/Make­file
  • gen­er­ate a hash of each commitlog+committer
  • if you are memory-limited use ha/sh/ed/dirs/cvs-rev and store path­name in the list cvs-rev (path­name = “category-name”) as storage
  • store the hash also in pathname/cvs-rev

If you have only one item in ha/sh/ed/dirs/cvs-rev in the end, there was no repocopy and no sweep­ing com­mit, you can delete this ha/sh/ed/dirs/cvs-rev.

If you have more than … let’s say … 10 (sub­ject to tun­ing) path­names in ha/sh/ed/dirs/cvs-rev you found a sweep­ing com­mit and you can delete the ha/sh/ed/dirs/cvs-rev.

The meat

The remain­ing ha/sh/ed/dirs/cvs-rev are prob­a­bly repocopies. Take one ha/sh/ed/dirs/cvs-rev and for each path­name (there may be more than 2 path­names) in there have a look at pathname/. Take the first cvs-rev of each and check if they have the same hash. Con­tinue with the next rev-number for each until you found a cvs-rev which does not con­tain the same hash. If the num­ber of cvs-revs since the begin­ning is >= … let’s say … 3 (sub­ject to tun­ing), you have a can­di­date for a repocopy. If it is >=  … 10 (sub­ject to tun­ing), you have a very good indi­ca­tor for a repocopy. You have to pro­ceed until you have only one path­name left.

You may detect mul­ti­ple repocopies like A->B->C->D or A->B + A->D + A->C here.

Write out the repocopy can­di­date to a list and delete the ha/sh/ed/dirs/cvs-rev for each cvs-rev in a detected sequence.

This finds repocopy can­di­dates for category/name/Makefile. To detect the cor­rect repocopy-date (there are maybe cases where another file was changed after the Make­file but before the repocopy), you now have to look at all the files for a given repocopy-pair and check if there is a match­ing com­mit after the Makefile-commit-date. If you want to be 100% sure, you com­pare the com­plete commit-history of all files for a given repocopy-pair.

GD Star Rat­ing
GD Star Rat­ing

Tags: , , , , , , , , ,

Speed traps with chmod

I have the habit to chmod with the rel­a­tive nota­tion (e.g. g+w or a+r or go-w or sim­i­lar) instead of the absolute one (e.g. 0640 or u=rw,g=r,o=). Recently I had to chmod a lot of files. As usual I was using the rel­a­tive nota­tion. With a lot of files, this took a lot of time. Time was not really an issue, so I did not stop it to restart with a bet­ter per­form­ing com­mand (e.g. find /path –type f –print0 | xargs –0 chmod 0644; find /path –type d –print0 | xargs –0 chmod 0755), but I thought a lit­tle tips&tricks post­ing may be in order, as not every­one knows the difference.

The rel­a­tive notation

When you spec­ify g+w, it means to remove the write access for the group, but keep every­thing else like it is. Nat­u­rally this means that chmod first has to lookup the cur­rent access rights. So for each async write request, there has to be a read-request first.

The absolute notation

The absolute nota­tion is what most peo­ple are used to (at least the numeric one). It does not need to read the access rights before chang­ing them, so there is less I/O to be done to get what you want. The draw­back is that it is not so nice for recur­sive changes. You do not want to have the x-bit for data files, but you need it for direc­to­ries. If you only have a tree with data files where you want to have an uni­form access, the exam­ple above via find is prob­a­bly faster (for sure if the direc­tory meta-data is still in RAM).

If you have a mix of bina­ries and data, it is a lit­tle bit more tricky to come up with a way which is faster. If the data has a name-pattern, you could use it in the find.

And if you have a non-uniform access for the group bits and want to make sure the owner has write access to every­thing, it may be faster to use the rel­a­tive nota­tion than to find a replace­ment command-sequence with the absolute notation.

GD Star Rat­ing
GD Star Rat­ing

Tags: , , , , , , , , ,