I recently released a new version of my PAUSE::Packages module, which caches information about releases on CPAN and makes it easy to iterate over it. In this post I'll cover the motivation for this module, why I decided to transform the existing PAUSE export file, and how David Golden nudged me to use JSON.
PAUSE::Packages provides an iterator that can be used to loop over the latest version of all dists on CPAN. I've various hacks that want to do this, and now I can write:
my $iterator = PAUSE::Packages->new()->release_iterator();
while (my $release = $iterator->next) {
    process_release($release);
}
The source data is the 02packages.details.txt file, which is generated by PAUSE hourly. It contains one line for every package on CPAN; for example, these are the lines for the latest release of Furl:
Furl 2.19 T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::ConnectionCache undef T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::Headers undef T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::HTTP 2.19 T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::Request undef T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::Response undef T/TO/TOKUHIROM/Furl-2.19.tar.gz
Furl::ZlibStream undef T/TO/TOKUHIROM/Furl-2.19.tar.gz
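Each of those lines holds three whitespace-separated fields: the package name, its version (or undef), and the release path. Here's a minimal sketch of splitting one such line. It assumes the header block at the top of the real file has already been skipped, and parse_line is a name I've made up for illustration; it isn't part of PAUSE::Packages.

```perl
use strict;
use warnings;

# Hypothetical helper: split one 02packages line into its three fields.
sub parse_line {
    my ($line) = @_;
    my ($package, $version, $path) = split ' ', $line;
    return unless defined $path;                # fewer than three fields
    $version = undef if $version eq 'undef';    # PAUSE writes the literal string
    return { package => $package, version => $version, path => $path };
}

my $entry = parse_line('Furl::HTTP 2.19 T/TO/TOKUHIROM/Furl-2.19.tar.gz');
print "$entry->{package} is in $entry->{path}\n";
```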
The first version of PAUSE::Packages used this file directly, but it has two characteristics that didn't really suit my needs.
The first problem was alluded to above: the file contains all packages on CPAN (regular modules and cuckoo packages as well).
Let's say you upload one version of your dist, then at some point later refactor it and change some class names, drop some, and add some new ones. If you don't delete the older releases from your CPAN account (a topic for another day), then all of the modules will be listed in 02packages, and they'll be listed against the last release they appeared in.
So for the dist Furl, in addition to the modules above, you'll also find the following:
Furl::CallbackStream undef T/TO/TOKUHIROM/Furl-0.05.tar.gz
Furl::ConnPool undef T/TO/TOKUHIROM/Furl-0.09.tar.gz
Furl::FileStream undef T/TO/TOKUHIROM/Furl-0.05.tar.gz
Furl::PartialWriter undef T/TO/TOKUHIROM/Furl-0.01.tar.gz
But because each package only appears once (against the most recent release containing it), you can't construct the full package list for earlier releases.
The second problem is that 02packages is sorted by package name. This means that if you want to process the file release by release, you actually need to read the whole lot into memory and then iterate over that. Which is what I've been doing until now.
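To make that concrete, here's a rough sketch of the read-everything-then-group approach, using a handful of the lines shown earlier. It's an illustration of the idea rather than the code from the first version of the module.

```perl
use strict;
use warnings;

# A few lines in 02packages order (sorted by package name), taken from the
# examples above.
my @lines = (
    'Furl 2.19 T/TO/TOKUHIROM/Furl-2.19.tar.gz',
    'Furl::CallbackStream undef T/TO/TOKUHIROM/Furl-0.05.tar.gz',
    'Furl::ConnPool undef T/TO/TOKUHIROM/Furl-0.09.tar.gz',
    'Furl::HTTP 2.19 T/TO/TOKUHIROM/Furl-2.19.tar.gz',
);

# Packages from one release are scattered through the file, so every line
# has to be read before any release is known to be complete.
my %packages_in;
for my $line (@lines) {
    my ($package, $version, $path) = split ' ', $line;
    push @{ $packages_in{$path} }, [ $package, $version ];
}

# Only now can we walk the data release by release.
for my $path (sort keys %packages_in) {
    print "$path\n";
    print "  $_->[0] $_->[1]\n" for @{ $packages_in{$path} };
}
```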
I decided I would transform 02packages into a format that supported my needs: only include the latest release, and sort by release, so the file could be processed release by release.
My first thought was a simple chunked format:
M/MI/MIYAGAWA/Attribute-Protected-0.03.tar.gz
Attribute::Protected 0.03
T/TO/TOBYINK/Attribute-QueueStack-0.001.tar.gz
Attribute::QueueStack 0.001
Tie::Array::Queue 0.001
Tie::Array::Stack 0.001
C/CH/CHORNY/Attribute-Signature-1.10.zip
Attribute::Signature 1.10
This was easy to inspect and obviously easy to parse in Perl. But David Golden pointed out that this was sub-optimal in at least two ways, one being that it can't be grep'd in the same way 02packages can.
I don't want to use a binary format, so I tried two more approaches. The first just put the release path and the module/version pairs all on the same line. That solved the grep problem. And to humour David I tried a version where the module information is encoded as JSON. I benchmarked the reading of these, including both XS and pure-Perl readers for JSON:
json_pp: 74 wallclock secs (73.56 usr + 0.10 sys = 73.66 CPU)
json_xs: 2 wallclock secs ( 1.76 usr + 0.03 sys = 1.79 CPU)
paragraph: 4 wallclock secs ( 3.83 usr + 0.03 sys = 3.86 CPU)
uniline: 3 wallclock secs ( 3.00 usr + 0.03 sys = 3.03 CPU)
Dang, David was right (assuming everyone installs JSON::XS). So I've gone with JSON for 0.02, though I write a header with a format identifier, so I can change my mind down the road.
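For a feel of what reading a one-release-per-line JSON format looks like, here's a small sketch. The line layout (release path, a space, then a JSON object mapping package name to version) is an assumption made up for this example, not necessarily what the cache file actually contains. It uses the core JSON::PP module so it runs anywhere; JSON::XS provides a decode_json function with the same interface.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # core module; JSON::XS has the same function

# Assumed layout, for illustration only: release path, one space, then a
# JSON object mapping package name to version.
my $line = 'T/TO/TOBYINK/Attribute-QueueStack-0.001.tar.gz '
         . '{"Attribute::QueueStack":"0.001",'
         . '"Tie::Array::Queue":"0.001",'
         . '"Tie::Array::Stack":"0.001"}';

# Split on the first run of whitespace only, leaving the JSON intact.
my ($path, $json) = split ' ', $line, 2;
my $versions = decode_json($json);

print "$path\n";
print "  $_ $versions->{$_}\n" for sort keys %$versions;
```

Because a whole release sits on one line, a grep for a package name also shows you the release it belongs to.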
PAUSE::Packages grabs 02packages the first time it runs. Thereafter it will make a request with an If-Modified-Since header, so it will only pull back the file if it has changed. Given 02packages is updated hourly, I may end up with a different caching strategy, as most of the time you'll probably be happy to run with what you've got.
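If you haven't used conditional requests before, here's a sketch of the idea using the core HTTP::Tiny module. Both http_date and refresh_cache are hypothetical helpers written for this post, and this isn't necessarily how PAUSE::Packages does it internally.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Format an epoch time as an HTTP date, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
sub http_date {
    my ($epoch) = @_;
    my @day = qw(Sun Mon Tue Wed Thu Fri Sat);
    my @mon = qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec);
    my ($s, $m, $h, $mday, $month, $year, $wday) = gmtime($epoch);
    return sprintf '%s, %02d %s %04d %02d:%02d:%02d GMT',
        $day[$wday], $mday, $mon[$month], $year + 1900, $h, $m, $s;
}

# Hypothetical helper: fetch $url into $file only if the server copy is
# newer than the local one. Returns 1 if the local file was replaced,
# 0 if the server answered 304 Not Modified.
sub refresh_cache {
    my ($url, $file) = @_;
    my %headers;
    $headers{'If-Modified-Since'} = http_date((stat $file)[9]) if -f $file;

    my $response = HTTP::Tiny->new->get($url, { headers => \%headers });
    return 0 if $response->{status} == 304;    # unchanged, keep the cache
    die "GET $url failed: $response->{status}\n" unless $response->{success};

    open my $fh, '>:raw', $file or die "Can't write $file: $!\n";
    print {$fh} $response->{content};
    close $fh or die "Can't close $file: $!\n";
    return 1;
}
```

HTTP::Tiny also has a mirror method that sends If-Modified-Since based on the local file's timestamp, which would shorten this further.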
Here's how you could list the modules in every release:
my $iterator = PAUSE::Packages->new()->release_iterator();
while (my $release = $iterator->next) {
    my $di = $release->distinfo;
    next unless defined($di->dist);
    print $di->dist, " v", $di->version, "\n";
    foreach my $module (@{ $release->modules }) {
        print "\t", $module->name, ' v', $module->version, "\n";
    }
}
The check for defined($di->dist) is needed to skip releases with non-standard release formats that can't be handled by CPAN::DistnameInfo. For example:
T/TO/TOMC/scripts/whenon.dir/LastLog/Entry.pm.gz
The $release object has the following methods:
path returns the partial release path (e.g. T/TO/TOBYINK/Attribute-QueueStack-0.001.tar.gz above).
distinfo returns an instance of CPAN::DistnameInfo.
modules returns an arrayref of objects which have name and version methods.
That modules method should really be packages, given the file contains cuckoo packages as well as regular modules.
Lessons learned:
Thanks to David Golden: he's given me a couple of good nudges during the development of this module. And thanks to Toby Inkster, who outlined how I could do this in Moops; I'm going to have to work up to that!