Cleaning up your Maven repository

by Jettro Coenradie, August 1, 2011

A few days ago I got a warning that my disk was getting too full. I had just upgraded to Apple OS X Lion. A few things were related to the upgrade, but another large directory was the Maven repository directory. The easy way out is to just remove everything, but I do not want to do that every week. Then I started to think about a solution that deletes only part of the repository. As I like playing around with Groovy, it had to be a Groovy script.

So which libraries or artifacts should be removed? I want to remove all old snapshots, and I want to remove old artifacts for which newer versions have been installed. In this blog post I explain the script: what it can do and how it works.

Requirements

Which libraries should we clean? The Maven repository contains a lot of libraries that we no longer need. Of course there is a risk that we delete too much, but how big is that risk? If we remove too much, we just have to download it again. Not too big of a risk if you ask me. Why not just remove everything at once? Well, that is a bit too much and not something you'd do once a week.

I separate the requirements into two parts: libraries that are versioned and libraries that are snapshots.

Snapshots

I want to throw away all snapshots that are older than 60 days.

Other libraries

I want to throw away all versions that were created more than 90 days ago and have a more recent version.
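The two requirements can be written down as one small decision function. This is only a sketch to make the rules concrete: the name shouldRemove and its parameters are hypothetical and do not appear in the actual script, which takes its thresholds from the Configuration object.

```groovy
// Hypothetical helper, not part of the original script: it expresses the two
// requirements (snapshots and versioned libraries) as a single decision.
boolean shouldRemove(String versionName, int ageInDays, boolean hasNewerVersion) {
    int maxAgeSnapshotsInDays = 60   // threshold for snapshots
    int maxAgeInDays = 90            // threshold for versioned libraries
    if (versionName.endsWith('-SNAPSHOT')) {
        // Snapshots only need to be old enough.
        return ageInDays > maxAgeSnapshotsInDays
    }
    // Versioned libraries must be old AND superseded by a newer version.
    return ageInDays > maxAgeInDays && hasNewerVersion
}

assert shouldRemove('1.0-SNAPSHOT', 61, false)
assert !shouldRemove('1.0', 120, false)
```

Note that an old release without a newer version is kept, however old it is.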

The script

I have chosen to do everything in one file. You can find it on GitHub:

https://github.com/jettro/small-scripts/blob/master/groovy/CleanDir.groovy

Configure the script

I started creating a mechanism to provide command-line parameters, but that was not ideal when I needed more parameters. Therefore I just made it easy to change the configuration in one spot in the source code. You can change the Configuration object to influence what the script does.

now = new Date()
configuration = new Configuration()
cleanedSize = 0
details = []
directoryFilter = new DirectoryFilter()
nonSnapshotDirectoryFilter = new NonSnapshotDirectoryFilter()

class Configuration {
    def homeFolder = System.getProperty("user.home")
    def path = homeFolder + "/.m2/repository"
    def dryRun = true
    def printDetails = true
    def maxAgeSnapshotsInDays = 60
    def maxAgeInDays = 90
    def versionsToKeep = ["3.1.0.M1"]
    def snapshotsOnly = true
}

The first variables, the ones without def, are global variables. It is a bit strange, but if you do not use the keyword def, a variable ends up in the script's binding and is available to all your methods. The configuration object contains the parameters to configure the script. These are the ones you want to alter in order to change the execution result. You can change the location of the repository to clean. You can make the script do a dry run only, change the output details, change the age of artifacts to keep, and choose between removing only snapshots or all kinds of artifacts. The final parameter is versionsToKeep; here you can add version strings that you do not want to remove. These are meant for unusual versions like the example with the milestone in it.
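The difference between a variable with and without def is easy to try out in a separate script. The following is a minimal sketch; the names counter, localOnly, increment and readLocal are made up for the illustration and are not part of CleanDir.groovy.

```groovy
// Declared without def: stored in the script's binding, visible inside methods.
counter = 0
// Declared with def: a local variable of the script body, not visible inside methods.
def localOnly = 'hidden'

def increment() {
    counter = counter + 1
}

def readLocal() {
    try {
        return localOnly
    } catch (MissingPropertyException e) {
        // The method cannot see the local variable of the script body.
        return 'not visible'
    }
}

increment()
assert counter == 1
assert readLocal() == 'not visible'
```

This is why now, configuration, cleanedSize and details can be used directly inside the methods of the script.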

Which directories to delete

Cleaning the repository starts by going through all directories and checking whether each directory contains old artifacts. The definition of old should be flexible, but if the artifacts are old, we remove the directory. The following code block shows the recursive function that checks a directory for sub-directories. If no sub-directories are found, we check the age. The behavior is different for snapshot folders and normal version folders.

private def cleanMavenRepository(File file) {
    def lastModified = new Date(file.lastModified());
    def ageInDays = now - lastModified;
    def directories = file.listFiles(directoryFilter);

    if (directories.length > 0) {
        directories.each {
            cleanMavenRepository(it);
        }
    } else {
        if (ageInDays > configuration.maxAgeSnapshotsInDays && file.canonicalPath.endsWith("-SNAPSHOT")) {
            int size = removeDirAndReturnFreedKBytes(file)
            details.add("About to remove directory $file.canonicalPath with total size $size and $ageInDays days old");
        } else if (ageInDays > configuration.maxAgeInDays && !file.canonicalPath.endsWith("-SNAPSHOT") && !configuration.snapshotsOnly) {
            String highest = obtainHighestVersionOfArtifact(file)
            if (file.name != highest && !configuration.versionsToKeep.contains(file.name)) {
                int size = removeDirAndReturnFreedKBytes(file)
                details.add("About to remove directory $file.canonicalPath with total size $size and $ageInDays days old and not highest version $highest");
            }
        }
    }
}

The directories.each block shows the recursive behavior: for each sub-directory we call this method again. In the first condition of the else branch we check whether we are dealing with an old snapshot folder. If that is the case, we call the removeDirAndReturnFreedKBytes function and store a message in the details list that we use later on for reporting.

In the else-if condition we check for normal version artifacts and whether we configured the script to clean them as well. If so, we have to do a little bit more work. We want to remove all versions that are older than the specified number of days and that have a higher version in the repository. So we need to find the highest version first. Then we check whether the current folder is the highest version. If not, and it is not in the versionsToKeep list, we delete the folder like we did before with the snapshots.
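The function above relies on two helpers that are not shown in this post: DirectoryFilter and removeDirAndReturnFreedKBytes. The sketch below shows how they might look. It is an assumption, not the code from the repository: in particular, passing dryRun as a second parameter is done here only to keep the example self-contained, while the real script uses a single-argument function.

```groovy
// Assumed implementation: accepts directories only.
class DirectoryFilter implements FileFilter {
    boolean accept(File file) {
        return file.directory
    }
}

// Hypothetical stand-in for removeDirAndReturnFreedKBytes: sums the size of all
// files below dir and deletes the directory unless dryRun is set.
int removeDirAndReturnFreedKBytes(File dir, boolean dryRun) {
    long bytes = 0
    dir.eachFileRecurse { if (it.file) bytes += it.length() }
    if (!dryRun) {
        dir.deleteDir()
    }
    return bytes.intdiv(1024) as int
}

// Try it on a temporary directory instead of the real repository.
File sampleDir = java.nio.file.Files.createTempDirectory('m2clean').toFile()
new File(sampleDir, 'artifact.jar').bytes = new byte[4096]
assert removeDirAndReturnFreedKBytes(sampleDir, true) == 4   // dry run keeps the directory
assert sampleDir.exists()
assert removeDirAndReturnFreedKBytes(sampleDir, false) == 4  // real run deletes it
assert !sampleDir.exists()
```

With a dry run you can first inspect the report and only then let the script delete anything.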

So how do we determine the highest version?

Determine the highest version

First step in finding the highest version is finding all the versions.

private String obtainHighestVersionOfArtifact(File file) {
    def folderWithVersions = file.parentFile
    // Keep only the highest version
    def versionsFolders = folderWithVersions.listFiles(nonSnapshotDirectoryFilter)
    def highest = '0'
    versionsFolders.each {
        if (higherThan(highest, it.name)) {
            highest = it.name
        }
    }
    return highest
}

class NonSnapshotDirectoryFilter implements FileFilter {
    boolean accept(File file) {
        return file.directory && !file.name.endsWith("-SNAPSHOT")
    }
}

In order to find all the versions, we move up one folder and ask for all sub-directories that do not end with SNAPSHOT. The way to do this is by providing a filter to the listFiles method. The filter is an instance of the NonSnapshotDirectoryFilter class shown above. Then we go through the folders and compare each folder with the highest version found so far. Comparing the versions is done in the following code block.
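As a side note, Groovy can also coerce a closure to the FileFilter interface, so the same filter could be written without a separate class. This is just an alternative sketch, not what the script does; the variable names and the temporary directory are made up for the example.

```groovy
// A closure coerced to FileFilter: directories that are not snapshots.
FileFilter releasesOnly = { File f -> f.directory && !f.name.endsWith('-SNAPSHOT') } as FileFilter

// Build a tiny fake artifact folder in a temporary directory.
File artifactDir = java.nio.file.Files.createTempDirectory('artifact').toFile()
new File(artifactDir, '1.0').mkdir()
new File(artifactDir, '1.1-SNAPSHOT').mkdir()
new File(artifactDir, 'maven-metadata.xml').text = '<metadata/>'

// Only the release directory passes the filter.
def releaseNames = artifactDir.listFiles(releasesOnly)*.name
assert releaseNames == ['1.0']
artifactDir.deleteDir()
```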

private boolean higherThan(highestVersion, newVersion) {
    def highestArr = highestVersion.tokenize('.')
    def newArr = newVersion.tokenize('.')
    if (highestVersion.endsWith("RELEASE") && !newVersion.endsWith("RELEASE")) {
        return false
    }
    return compareTwoIntegersInArray(highestArr, newArr, 0)
}

private boolean compareTwoIntegersInArray(highestArr, newArr, counter) {
    def counterPlus1 = counter + 1
    if (highestArr[counter] == newArr[counter]) {
        if (highestArr.size() > counterPlus1 && newArr.size() > counterPlus1) {
            return compareTwoIntegersInArray(highestArr, newArr, counterPlus1)
        } else if (newArr.size() > counterPlus1) {
            return true
        }
    } else {
        def highest = highestArr[counter]
        def newest = newArr[counter]
        if (highest.isInteger() && newest.isInteger()) {
            if (highest.toInteger() < newest.toInteger()) {
                return true
            }
        } else {
            if (highest < newest) {
                return true
            }
        }
    }
    return false
}

The most important part to grasp here is what a version looks like and how we determine the highest version. In the higherThan method we take both version strings and tokenize them on the dots. A typical version looks like 3.1.0. That is an easy one; we can also have 3.1.0.RELEASE or 3.1.0-RC1. If the highest version so far ends with RELEASE and the new one does not, the new one is never considered higher. Otherwise we compare the two token lists position by position. If we can compare two tokens as numbers, we do that; if not, we compare them as strings. If one of the versions is shorter (fewer dots) than the other, e.g. 3.0 and 3.0.1, the longer version wins if the shared part is the same. So 3.0.1 is higher than 3.0, but 3.1 is higher than 3.0.1.
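To see these rules at work on a few concrete versions, here is a condensed, iterative sketch of the same comparison. The function isHigher is a hypothetical rewrite for illustration and is not part of the script, which uses the recursive functions shown above.

```groovy
// Hypothetical condensed variant of higherThan / compareTwoIntegersInArray.
boolean isHigher(String current, String candidate) {
    if (current.endsWith('RELEASE') && !candidate.endsWith('RELEASE')) return false
    def a = current.tokenize('.')
    def b = candidate.tokenize('.')
    int shared = Math.min(a.size(), b.size())
    for (int i = 0; i < shared; i++) {
        if (a[i] == b[i]) continue
        if (a[i].isInteger() && b[i].isInteger()) {
            return a[i].toInteger() < b[i].toInteger()   // numeric comparison
        }
        return a[i] < b[i]                               // string comparison
    }
    // All shared tokens are equal: the longer version wins.
    return b.size() > a.size()
}

assert '3.1.0-RC1'.tokenize('.') == ['3', '1', '0-RC1']
assert isHigher('3.0', '3.0.1')     // longer version wins
assert isHigher('3.0.1', '3.1')     // 1 > 0 at the second position
assert isHigher('3.9', '3.10')      // compared as numbers, not as strings
assert !isHigher('3.1.0.RELEASE', '3.2.0')  // the RELEASE rule blocks this one
```

One consequence of falling back to string comparison is that 3.1.0-RC1 is treated as higher than 3.1.0, because the token 0 sorts before 0-RC1. For a cleanup script that only errs on the side of keeping a directory, that seems acceptable.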

Final remarks

Within the script we also do some reporting. This is fairly trivial; refer to the code if you want to see how it works.

I ended up cleaning about 7 GB of my repository, so mission accomplished. The catch: I do not know yet what I removed that I still needed :-).

As always, feel free to ask questions or leave comments with improvements.