River of CPAN discussion at The QA Hackathon

CPAN River • QAH Sun 1 May 2016

At the QAH this year we had another discussion about the River of CPAN: what's been done since last year, and what we should do to keep things moving forward. These are the notes from that discussion, and some of the things that happened after the discussion.

Last year at the QAH we:

Came up with the concept of "the river of CPAN", as a way to talk about dependencies between CPAN distributions.
Talked about how a CPAN author's development process should hopefully mature as a distribution moves upriver. You need to be more rigorous about testing changes to your module before releasing it to CPAN, as doing a bad release can break all the modules that rely on yours.

The driver behind this was trying to improve the overall reliability of CPAN, and in particular the distributions on CPAN that lots of people rely on.

Since last year's QAH the river has been talked about in a number of blog posts, and people have started using the terminology. We've moved things forward a bit, but not as much as I had hoped.

What are the problems?

Many CPAN authors aren't aware that there are other distributions relying on their distribution(s).
Most people learn about the different ways you can break things on CPAN by painful experience: people complaining that you broke their dist.
Many authors, particularly new ones, aren't aware of the different tools and services that can help them improve the reliability of their modules.
When picking a module to rely on, it's not easy to tell the total number of modules relying on it (MetaCPAN gives you the dists that are immediately reliant on it, but not the total).
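To illustrate the difference between immediate and total dependents, here is a minimal Python sketch. It assumes the reverse dependencies are already available as a plain mapping from each dist to the dists that directly use it; the dist names and the function name total_dependents are made up for the example, and this is not how MetaCPAN or the river scripts actually compute it.

```python
from collections import deque


def total_dependents(reverse_deps, dist):
    """Count every dist that relies on `dist`, directly or indirectly.

    `reverse_deps` maps a dist name to the dists that use it directly
    (a hypothetical in-memory structure, assumed for this sketch).
    """
    seen = set()
    queue = deque([dist])
    while queue:
        current = queue.popleft()
        for dependent in reverse_deps.get(current, ()):
            # Skip dists already counted, and never count the dist itself
            # (which also keeps the walk safe if there is a cycle).
            if dependent not in seen and dependent != dist:
                seen.add(dependent)
                queue.append(dependent)
    return len(seen)


# Hypothetical dists: B and C use A directly, D uses B, E uses D.
reverse_deps = {
    "A": ["B", "C"],
    "B": ["D"],
    "D": ["E"],
}

direct = len(reverse_deps["A"])              # immediate dependents only
total = total_dependents(reverse_deps, "A")  # everything downstream
print(direct, total)
```

In this toy graph A has two immediate dependents but four dists downstream in total, which is the gap described in the last point above.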

We talked about what things we should do first, to help improve the situation.

Have MetaCPAN display the river position of all dists

In the meeting we agreed that a great first step would be for MetaCPAN to display the river position of all dists. Rather than display the absolute position, I've been using "logarithmic buckets", an idea initially suggested by David Golden last year. There are 6 buckets:

0 is for dists that are not relied on by any other dists
1 is for dists relied on by 1 to 9 dists (referred to as downstream dependents)
2 is for 10 to 99 downstream dependents
3 is for 100 to 999 downstream dependents
4 is for 1000 to 9999 downstream dependents
5 is for 10000+ downstream dependents. There are only 45 such distributions at the moment.
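The bucket rule above can be sketched in a few lines of Python. This is only an illustration of the rule as listed, not the code I actually run; the function name river_bucket is made up for the example. It counts decimal digits rather than calling a floating-point log10, which gives the same buckets without rounding worries at the boundaries.

```python
def river_bucket(dependents):
    """Map a count of downstream dependents to a river bucket from 0 to 5."""
    if dependents < 0:
        raise ValueError("dependent count cannot be negative")
    if dependents == 0:
        return 0
    # 1-9 has one digit, 10-99 two digits, and so on; everything with
    # five or more digits (10000+) lands in the top bucket.
    return min(len(str(dependents)), 5)


for count in (0, 1, 9, 10, 99, 100, 999, 1000, 9999, 10000, 250000):
    print(count, river_bucket(count))
```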

I'm already generating this data on a weekly basis (I use it for the Pull Request Challenge, the adoption list, and other hacks). I said I'll make it available as JSON, and will work on pulling this code out of my other mess of scripts, so we can have a clean stand-alone service.

I had a chat with Olaf, the leader of the MetaCPAN project, and we agreed the structure of the JSON, which is described on this ticket. Joel Berger already worked on the changes to import the data!

We talked about how this might be shown in MetaCPAN, and a ticket was raised for that. Barbara Veloso was fortunately at the QAH (she's GARU's wife), and she came up with some better suggestions for how it could look.

DarkPAN

The current river data just tells you when other CPAN distributions are using your distribution. But many CPAN modules are used "off CPAN", aka the DarkPAN. While serving a slightly different need, this sort of data would also be helpful. BOOK and DOLMEN decided to start looking into Linux distros, and which CPAN modules are packaged for them.

Let people know when their dist moves upriver

Rather than let authors know every time they gain or lose a dependent, we agreed that we should tell them when their dist moves between buckets (as defined above). This means you'd be notified the first time a CPAN distribution starts using a dist, then on the 10th, the 100th, and so on. How should we notify authors? Suggestions included:

Send them an email.
Open an RT ticket.
Have an RSS feed for all distributions.

There was general agreement that we shouldn't open a ticket, as it would generate too much noise: lots of unclosed tickets, which don't really need action. Tickets are to prompt action, whereas we want to inform. Email can bounce, and end up in spam folders. The problem with RSS is that someone has to know to subscribe to it first.

We'll start off with an email, sent to the person who last released the distribution. We could email everyone who's ever done a release, or everyone who has perms, but both of those approaches would end up notifying a lot of people who probably don't care (any more). We assume that the person who most recently released it is the best bet.
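As a rough sketch of how the notification step might work, the Python below compares two weekly snapshots of dependent counts and picks out the dists whose bucket changed, paired with the person who last released each one. Everything here is assumed for illustration: the snapshot format, the dist names, the PAUSE ids and the function names are all hypothetical, and the real service may look quite different. It reports moves in either direction; filtering to upriver moves only would be a one-line change.

```python
def river_bucket(dependents):
    """Bucket 0 for no dependents, otherwise the digit count capped at 5."""
    return 0 if dependents == 0 else min(len(str(dependents)), 5)


def bucket_moves(previous, current):
    """Return (dist, old_bucket, new_bucket) for dists whose bucket changed.

    `previous` and `current` map dist names to total dependent counts.
    A dist missing from `previous` is treated as having had no dependents.
    """
    moves = []
    for dist, count in sorted(current.items()):
        old = river_bucket(previous.get(dist, 0))
        new = river_bucket(count)
        if old != new:
            moves.append((dist, old, new))
    return moves


# Hypothetical snapshots and releasers.
last_week = {"Foo-Bar": 0, "Baz": 9, "Qux": 12}
this_week = {"Foo-Bar": 1, "Baz": 10, "Qux": 15}
last_releaser = {"Foo-Bar": "ALICE", "Baz": "BOB", "Qux": "CAROL"}

# One notification per move, addressed to the most recent releaser.
to_notify = [
    (last_releaser[dist], dist, old, new)
    for dist, old, new in bucket_moves(last_week, this_week)
]
for author, dist, old, new in to_notify:
    print(f"{author}: {dist} moved from bucket {old} to bucket {new}")
```

Qux gains three dependents here but stays in bucket 2, so its releaser hears nothing, which is the point of notifying on bucket changes rather than on every new dependent.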

I'll implement a first version of this as part of the service that generates the river data, as it's the obvious place to put it. I'll raise an issue on PAUSE to see whether ANDK & Co are open to the idea of having a flag on PAUSE users for "email me useful things about my dists".

This email should be brief and to the point, with pointers to documentation.

Tools to help authors

Last year we talked about the practices we'd like to encourage CPAN authors to adopt as their dists move upriver, and that we should have tools to support them.

Merijn (Tux) has been working on Release-Checklist, which aims to provide a collection of tools to help authors. He's keen for any and all to join in.
Chad (Exodist) developed some similar things to help him while working on Test2.
Dave Rolsky had already created Test::DependentModules, which will test all modules that depend on your module.

We all hoped that Tux and Chad would share experiences and code. Talk to Tux if you want to help with his tools.

Improving the water quality

One of the best ways to improve the average quality of CPAN is to target issues with distributions at the head of the river (i.e. depended on by many other CPAN distributions). A CPAN Testers fail there can mean that many distributions might not install on certain operating systems or versions of Perl.

The trouble is that these can be gnarly issues, and scary dists to work on, precisely because lots of people rely on them. They might need a lot of time, and generally aren't "fun". So how can we motivate people to work on them?

Someone suggested TPF grants, but Rik pointed out that TPF have explicitly said they don't want to encourage bounties in this way. Maybe sponsors?

They could be subjects for mini hackathons: a number of hackathons in 2015 had groups focussing on specific modules.

NEILB's Blog

A blog on the Perl programming language
