Iterating over all CPAN releases

CPAN, PAUSE, iterators. Thu 6 February 2014

Last year I created PAUSE::Packages, which lets you iterate over all dists that PAUSE believes are still on CPAN. For a number of projects, including the CPAN Report 2013, I need to iterate over all releases of all dists. Yesterday I made the first release of CPAN::ReleaseHistory, which makes it easy to do that, in a similar way to PAUSE::Packages.

Up to now I've been grabbing a dump of PAUSE's database and parsing that to get the information I need, but that approach had a number of problems.

Talking to ANDK and DAGOLDEN about this, David pointed out that there were BackPAN indexes available, which might serve my needs. This led me to MSCHWERN's BackPAN::Index. That provides more features than I need, has a lot of dependencies, and doesn't provide exactly the interface I want for a number of my projects / ideas. But it did lead me to the BackPAN index file it uses.

My new module has an interface pretty much the same as PAUSE::Packages, but now you're iterating over all releases ever. Here's how to find all releases of dist enum:

use strict;
use warnings;
use CPAN::ReleaseHistory;

my $iterator = CPAN::ReleaseHistory->new()->release_iterator();
while (my $release = $iterator->next) {
    # Skip uploads for which no dist name could be determined
    next unless defined $release->distinfo->dist;
    next unless $release->distinfo->dist eq 'enum';
    printf "%s  time=%d  size=%d\n",
           $release->path, $release->timestamp, $release->size;
}

The distinfo method returns an instance of CPAN::DistnameInfo, from which you can get the dist name, the PAUSE id of the uploader, and lots more.
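To get a feel for what CPAN::DistnameInfo gives you, here is a minimal sketch that uses that module directly on two paths from this post. The helper name describe_release is made up for illustration, and the authors/id/ prefix is added only to match the form shown in that module's documentation.

```perl
use strict;
use warnings;
use CPAN::DistnameInfo;

# Hypothetical helper: summarise one upload path.
# Returns undef when no dist name can be parsed from the path.
sub describe_release {
    my ($path) = @_;
    my $info = CPAN::DistnameInfo->new("authors/id/$path");
    return unless defined $info->dist;
    return sprintf 'dist=%s version=%s uploader=%s',
                   $info->dist, $info->version, $info->cpanid;
}

# Paths taken from this post: one regular tarball, one bare script.
for my $path ('N/NE/NEILB/enum-1.06.tar.gz', 'M/MA/MAHATMA/phttpd-0.01.45.pl') {
    my $summary = describe_release($path);
    print "$path: ", (defined $summary ? $summary : 'no dist name parsed'), "\n";
}
```

The second path is not a recognised archive, so dist comes back undefined. That is the case a defined check on dist protects against.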

The above code generates the following output:

Z/ZE/ZENIN/enum-1.008.tar.gz  time=897606348  size=4232
Z/ZE/ZENIN/enum-1.009.tar.gz  time=897610837  size=4524
Z/ZE/ZENIN/enum-1.010.tar.gz  time=897682129  size=4509
Z/ZE/ZENIN/enum-1.011.tar.gz  time=900784109  size=5906
N/NJ/NJLEON/enum-0.02.tar.gz  time=901821239  size=3396
Z/ZE/ZENIN/enum-1.013.tar.gz  time=926634892  size=5627
Z/ZE/ZENIN/enum-1.014.tar.gz  time=926636344  size=5666
Z/ZE/ZENIN/enum-1.015.tar.gz  time=927414594  size=5714
Z/ZE/ZENIN/enum-1.016.tar.gz  time=927845988  size=5847
R/RO/ROODE/enum-0.01.tar.gz  time=1205434783  size=9280
N/NE/NEILB/enum-1.016_01.tar.gz  time=1377640563  size=6667
N/NE/NEILB/enum-1.02.tar.gz  time=1378023284  size=6827
N/NE/NEILB/enum-1.03.tar.gz  time=1378145819  size=6902
N/NE/NEILB/enum-1.04.tar.gz  time=1378412340  size=7003
N/NE/NEILB/enum-1.05.tar.gz  time=1378423112  size=7084
N/NE/NEILB/enum-1.06.tar.gz  time=1390608724  size=7230
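The time values in this listing appear to be Unix epoch seconds. As a small sketch under that assumption, they can be turned into readable UTC dates with the core POSIX module; format_release_time is a name made up for this example.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format an epoch timestamp as a UTC date and time.
sub format_release_time {
    my ($epoch) = @_;
    return strftime('%Y-%m-%d %H:%M:%S', gmtime($epoch));
}

print format_release_time(897606348),  "\n";   # first release in the listing:  1998-06-11 23:05:48
print format_release_time(1390608724), "\n";   # latest release in the listing: 2014-01-25 00:12:04
```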

The iterator gives you releases sorted first by dist name, and then by release time.

Not everything on CPAN is a tarball or other archive from which a dist name can be parsed, particularly older uploads. That's why I included the line:

next unless defined $release->distinfo->dist;

Here are some examples of things released to CPAN that this guard line filters out:

A/AN/ANKITAS/AWS-SQS-Simple
S/SR/SREZIC/patches/Net-ZooKeeper-0.35-RT91216.patch
M/MA/MAHATMA/phttpd-0.01.45.pl

I should add an option to the iterator that controls whether you even get to see those things, since most of the time I skip them anyway.
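Until such an option exists, one way to get the same effect is to wrap the iterator. This is only a sketch of the idea, not something the module provides: filtered_iterator and StubIterator are hypothetical names, and the stub stands in for the real release iterator so the example runs without downloading anything.

```perl
use strict;
use warnings;

# Wrap any object with a next() method and return a closure that
# yields only the items for which $keep returns true.
sub filtered_iterator {
    my ($iterator, $keep) = @_;
    return sub {
        while (defined(my $item = $iterator->next)) {
            return $item if $keep->($item);
        }
        return;    # underlying iterator is exhausted
    };
}

# Hypothetical stand-in for a release iterator: walks a fixed list.
package StubIterator {
    sub new  { my ($class, @items) = @_; return bless { items => [@items] }, $class }
    sub next { my ($self) = @_; return shift @{ $self->{items} } }
}

my $stub = StubIterator->new(
    { path => 'A/AN/ANKITAS/AWS-SQS-Simple',    dist => undef  },
    { path => 'N/NE/NEILB/enum-1.06.tar.gz',    dist => 'enum' },
    { path => 'M/MA/MAHATMA/phttpd-0.01.45.pl', dist => undef  },
);

# With real release objects the test would be:
#   sub { defined $_[0]->distinfo->dist }
my $next = filtered_iterator($stub, sub { defined $_[0]{dist} });
while (my $release = $next->()) {
    print $release->{path}, "\n";    # only the enum tarball gets through
}
```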

Caveat

It's currently very simple in how it works: it grabs the index, loads all the relevant entries into memory, sorts them according to the above rules, then writes them to a local file. This obviously takes up quite a bit of memory, so don't use this module on your smartwatch.
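The load, sort, and write steps can be sketched roughly as follows. This is not the module's actual code. It assumes each index line has the form "path epoch-seconds size", uses a deliberately naive helper (naive_dist_name, made up here) in place of CPAN::DistnameInfo, and reads three entries from the enum listing in this post rather than the real index.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed index format: "<path> <epoch-seconds> <size-in-bytes>" per line.
my @index_lines = (
    'N/NE/NEILB/enum-1.06.tar.gz 1390608724 7230',
    'Z/ZE/ZENIN/enum-1.008.tar.gz 897606348 4232',
    'R/RO/ROODE/enum-0.01.tar.gz 1205434783 9280',
);

# Naive stand-in for CPAN::DistnameInfo: strip the directory, the archive
# extension, and a trailing "-<version>". Returns '' if nothing matches.
sub naive_dist_name {
    my ($path) = @_;
    my ($name) = $path =~ m{([^/]+)\.(?:tar\.gz|tgz|zip)$} or return '';
    $name =~ s/-[0-9][^-]*$//;
    return $name;
}

sub parse_index_line {
    my ($line) = @_;
    my ($path, $time, $size) = split ' ', $line;
    return { path => $path, time => $time, size => $size,
             dist => naive_dist_name($path) };
}

# Everything is held in memory at once, which is where the cost comes from.
sub sorted_releases {
    my @releases = map { parse_index_line($_) } @_;
    return sort { $a->{dist} cmp $b->{dist} or $a->{time} <=> $b->{time} } @releases;
}

# Write the sorted entries to a local file (a temp file in this sketch).
my ($fh, $cache_file) = tempfile(UNLINK => 1);
printf {$fh} "%s %d %d\n", @{$_}{qw(path time size)} for sorted_releases(@index_lines);
close $fh or die "close failed: $!";
print "wrote sorted entries to $cache_file\n";
```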
