Iterating over all dists on CPAN

CPAN, PAUSE, iterators, JSON | Sat 14 September 2013

I recently released a new version of my PAUSE::Packages module, which caches information about releases on CPAN and makes it easy to iterate over them. In this post I'll cover the motivation for this module, why I decided to transform the existing PAUSE export file, and how David Golden nudged me to use JSON.

PAUSE::Packages provides an iterator that can be used to loop over the latest version of all dists on CPAN. I've various hacks that want to do this, and now I can write:

my $iterator = PAUSE::Packages->new()->release_iterator();
while (my $release = $iterator->next) {
    process_release($release);
}

The source data is the 02packages.details.txt file, which is generated by PAUSE hourly. This contains one line for every package on CPAN; for example for the latest release of Furl:

Furl                   2.19   T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::ConnectionCache  undef  T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::Headers          undef  T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::HTTP             2.19   T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::Request          undef  T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::Response         undef  T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::ZlibStream       undef  T/TO/TOKUHIROM/Furl-2.19.tar.gz

The first version of PAUSE::Packages used this file directly, but it has two characteristics that didn't really suit my needs.

Package droppings

The first problem was alluded to above: the file contains all packages on CPAN (regular modules and cuckoo packages as well). Let's say you upload one version of your dist, then at some point later refactor it and change some class names, drop some, and add some new ones. If you don't delete the older releases from your CPAN account (a topic for another day), then all of the modules will be listed in 02packages, and they'll be listed against the last release they appeared in. So for the dist Furl, in addition to the modules above, you'll also find the following:

Furl::CallbackStream   undef  T/TO/TOKUHIROM/Furl-0.05.tar.gz
Furl::ConnPool         undef  T/TO/TOKUHIROM/Furl-0.09.tar.gz
Furl::FileStream       undef  T/TO/TOKUHIROM/Furl-0.05.tar.gz
Furl::PartialWriter    undef  T/TO/TOKUHIROM/Furl-0.01.tar.gz

But because each package only appears once (against the most recent release containing it), you can't construct the full package list for earlier releases.

Sorted by package name

The second problem is that 02packages is sorted by package name. This means that if you want to process the file release by release, you actually need to read the whole lot into memory and then iterate over that. Which is what I've been doing until now.
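To make the memory cost concrete, here is a minimal sketch of that read-everything-then-group approach. It assumes the header section of 02packages has already been skipped and that each remaining line holds a package name, a version, and a release path separated by whitespace. The sub name group_by_release is made up for illustration and is not part of PAUSE::Packages.

```perl
use strict;
use warnings;

# Sample lines in the style of 02packages: package, version, release path.
my @lines = (
    'Furl                   2.19   T/TO/TOKUHIROM/Furl-2.19.tar.gz',
    'Furl::ConnPool         undef  T/TO/TOKUHIROM/Furl-0.09.tar.gz',
    'Furl::HTTP             2.19   T/TO/TOKUHIROM/Furl-2.19.tar.gz',
);

# Hypothetical helper: group packages by release path.
# Every line has to be held in memory before any release is complete,
# because the input is sorted by package name, not by release.
sub group_by_release {
    my @input = @_;
    my %packages_for;
    for my $line (@input) {
        my ($package, $version, $path) = split ' ', $line;
        next unless defined $path;    # skip malformed lines
        push @{ $packages_for{$path} },
            { name => $package, version => $version };
    }
    return \%packages_for;
}

my $packages_for = group_by_release(@lines);
for my $path (sort keys %$packages_for) {
    my @names = map { $_->{name} } @{ $packages_for->{$path} };
    print "$path: @names\n";
}
```

Only after the last line has been read do you know that a release's package list is complete, which is why the whole index ends up in a hash.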

Enter PAUSE::Packages 0.02

I decided I would transform 02packages into a format that supported my needs: only include the latest release, and sort by release, so the file could be processed release by release.
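As a rough illustration of the "only the latest release" filter, the sketch below keeps one release path per dist, chosen by comparing version numbers. Both helper subs are hypothetical, the regex only copes with simple Dist-Name-1.23.tar.gz style filenames, and real code would more likely lean on CPAN::DistnameInfo. It is not necessarily how PAUSE::Packages decides which release is the latest.

```perl
use strict;
use warnings;
use version;

# Hypothetical helper: split "T/TO/TOKUHIROM/Furl-2.19.tar.gz" into
# ("Furl", "2.19"). Returns an empty list for paths it can't handle.
sub dist_and_version {
    my ($path) = @_;
    my ($file) = $path =~ m{([^/]+)\z};
    return unless defined $file;
    return unless $file =~ /\A(.+)-v?(\d[\d.]*)\.(?:tar\.gz|tgz|zip)\z/;
    return ($1, $2);
}

# Hypothetical helper: keep only the highest-versioned release per dist,
# returned in dist-name order.
sub latest_releases {
    my @paths = @_;
    my %latest;
    for my $path (@paths) {
        my ($dist, $version) = dist_and_version($path) or next;
        my $seen = $latest{$dist};
        if (!$seen
            || version->parse($version) > version->parse($seen->{version})) {
            $latest{$dist} = { path => $path, version => $version };
        }
    }
    return map { $latest{$_}{path} } sort keys %latest;
}

print "$_\n" for latest_releases(
    'T/TO/TOKUHIROM/Furl-0.05.tar.gz',
    'T/TO/TOKUHIROM/Furl-0.09.tar.gz',
    'T/TO/TOKUHIROM/Furl-2.19.tar.gz',
);
```

Packages that are only listed against the older releases (the droppings) simply fall away, because their release path is never selected.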

My first thought was a simple chunked format:

M/MI/MIYAGAWA/Attribute-Protected-0.03.tar.gz
    Attribute::Protected 0.03

T/TO/TOBYINK/Attribute-QueueStack-0.001.tar.gz
    Attribute::QueueStack 0.001
    Tie::Array::Queue 0.001
    Tie::Array::Stack 0.001

C/CH/CHORNY/Attribute-Signature-1.10.zip
    Attribute::Signature 1.10
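
A file laid out like this can be read one release at a time using Perl's paragraph mode. The following is only a sketch: read_chunks and the hash layout it returns are made up for illustration, and the sample data is a shortened copy of the chunks above.

```perl
use strict;
use warnings;

my $data = <<'END';
M/MI/MIYAGAWA/Attribute-Protected-0.03.tar.gz
    Attribute::Protected 0.03

T/TO/TOBYINK/Attribute-QueueStack-0.001.tar.gz
    Attribute::QueueStack 0.001
    Tie::Array::Queue 0.001
    Tie::Array::Stack 0.001
END

# Hypothetical reader: returns one hashref per release.
sub read_chunks {
    my ($fh) = @_;
    local $/ = '';    # paragraph mode: each read returns one chunk
    my @releases;
    while (my $chunk = <$fh>) {
        # First non-blank line is the release path, the rest are modules.
        my ($path, @module_lines) = grep { /\S/ } split /\n/, $chunk;
        next unless defined $path;
        my @modules = map { [ split ' ', $_ ] } @module_lines;
        push @releases, { path => $path, modules => \@modules };
    }
    return @releases;
}

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
for my $release (read_chunks($fh)) {
    print "$release->{path}\n";
    print "    $_->[0] $_->[1]\n" for @{ $release->{modules} };
}
```
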

This chunked format was easy to inspect and obviously easy to parse in Perl. But David Golden pointed out that it was sub-optimal in at least two ways:

I don't want to use a binary format, so I tried two more approaches. The first just put the release path and the module/version pairs all on the same line. That solved the grep problem. And to humour David I tried a version where the module information is encoded as JSON. I benchmarked the reading of these, including both XS and Pure Perl readers for JSON:

  json_pp: 74 wallclock secs (73.56 usr +  0.10 sys = 73.66 CPU)
  json_xs:  2 wallclock secs ( 1.76 usr +  0.03 sys =  1.79 CPU)
paragraph:  4 wallclock secs ( 3.83 usr +  0.03 sys =  3.86 CPU)
  uniline:  3 wallclock secs ( 3.00 usr +  0.03 sys =  3.03 CPU)

Dang, David was right (assuming everyone installs JSON::XS). So I've gone with JSON for 0.02, though I write a header with a format identifier, so I can change my mind down the road.
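
To give a feel for the idea, here is a sketch of a one-release-per-line layout with the module information encoded as JSON, plus a header line carrying a format identifier. The header text, the exact line layout, and the sub names are all assumptions made for illustration, not the actual cache format written by PAUSE::Packages. It uses JSON::PP so it runs on a stock Perl; a faster XS backend could be swapped in.

```perl
use strict;
use warnings;
use JSON::PP ();

# Hypothetical format identifier, written as the first line of the file.
my $FORMAT_HEADER = '#example-release-index format=1';

# canonical() sorts hash keys, so the same data always gives the same line.
my $json = JSON::PP->new->canonical;

# Hypothetical writer: release path, a space, then a JSON object that maps
# package name to version. One release per line keeps the file grep-friendly.
sub encode_release {
    my ($path, $modules) = @_;
    return $path . ' ' . $json->encode($modules);
}

# Hypothetical reader: assumes release paths contain no whitespace.
sub decode_release {
    my ($line) = @_;
    my ($path, $payload) = split ' ', $line, 2;
    return ($path, $json->decode($payload));
}

my $line = encode_release(
    'T/TO/TOBYINK/Attribute-QueueStack-0.001.tar.gz',
    { 'Attribute::QueueStack' => '0.001', 'Tie::Array::Queue' => '0.001' },
);
print "$FORMAT_HEADER\n$line\n";

my ($path, $modules) = decode_release($line);
print "$path has ", scalar(keys %$modules), " packages\n";
```

A reader can check the header line first and refuse (or switch parser) if the identifier isn't one it understands, which is what leaves room to change the format later.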

PAUSE::Packages grabs 02packages the first time it runs. Thereafter it will make a request with an If-Modified-Since header, so will only pull back the file if it's changed. Given 02packages is updated hourly, I may end up with a different caching strategy, as most of the time you'll probably be happy to run with what you've got.
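
For the curious, a conditional fetch like that can be sketched with HTTP::Tiny, whose mirror method adds an If-Modified-Since header when the local file already exists. This is an illustration only and not necessarily how PAUSE::Packages does it: refresh_cache is a made-up name, and the URL and cache path in the usage comment are assumptions.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: returns 'unchanged' or 'updated', dies on failure.
# An HTTP client object can be passed in, which makes the sub easy to test.
sub refresh_cache {
    my ($url, $file, $http) = @_;
    $http ||= HTTP::Tiny->new;

    # mirror() sends If-Modified-Since based on the mtime of $file,
    # and leaves the file alone if the server answers 304.
    my $response = $http->mirror($url, $file);

    # Check for 304 first: mirror() also flags it as a success.
    return 'unchanged' if $response->{status} == 304;
    return 'updated'   if $response->{success};
    die "Failed to fetch $url: $response->{status} $response->{reason}\n";
}

# Example call (commented out so the sketch runs without network access;
# the URL and local path are assumptions):
# my $state = refresh_cache(
#     'https://www.cpan.org/modules/02packages.details.txt.gz',
#     '/tmp/02packages.details.txt.gz',
# );
# print "02packages cache: $state\n";
```
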

Here's how you could list the modules in every release:

my $iterator = PAUSE::Packages->new()->release_iterator();

while (my $release = $iterator->next) {
    my $di = $release->distinfo;

    next unless defined($di->dist);

    print $di->dist, "  v", $di->version, "\n";
    foreach my $module (@{ $release->modules }) {
        print "\t", $module->name, ' v', $module->version, "\n";
    }
}

The check for defined($di->dist) is needed to skip releases with non-standard release formats that can't be handled by CPAN::DistnameInfo. For example:

T/TO/TOMC/scripts/whenon.dir/LastLog/Entry.pm.gz

The $release object has methods:

That modules method should really be packages, given the file contains cuckoo packages as well as regular modules.

Lessons learned:

Thanks to David Golden: he's given me a couple of good nudges during the development of this module. And thanks to Toby Inkster, who outlined how I could do this in Moops; I'm going to have to work up to that!
