Improving the water quality of the CPAN River

CPAN River, quality, metrics. Tue 22 December 2015

As a CPAN distribution moves up the river it needs to become more reliable, as by definition more distributions are relying on it. In this post I propose a simple metric for "suitability for depending on", which is essentially a water quality metric for the CPAN River.

When picking a module to use, there are a number of factors you should consider. The obvious ones are: does it provide the functionality you really need, does it behave as documented for all inputs, and is its performance acceptable to you. But you should also consider whether it's a good distribution to depend on: is it going to impact your module's reliability?

Water quality metric

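If you're going to rely on a module, I think the following should be true: it has no CPAN Testers fails, it ships a META file (META.json or META.yml), and it declares the minimum version of perl it requires.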

To keep it simple, each CPAN distribution is given a pass or fail for this metric.

Having zero CPAN Testers fails isn't always plausible, due to issues with smoke testers, bad perl configurations etc. So the measure I generally use is:

A distribution fails this measure if it has more than 50 CPAN Testers reports, and 2% or more of them are fails.
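As a rough illustration, the pass/fail check could be sketched as follows. This is a minimal sketch, not the script used to produce the figures in this post: the `DistInfo` record and its field names are hypothetical, and the three individual measures are taken from the rows of the table below (CPAN Testers fails, no META, no perl version). It also assumes that a distribution with 50 or fewer reports is not counted as a CPAN Testers fail, which is a literal reading of the rule above.

```python
from dataclasses import dataclass

MIN_REPORTS = 50       # rule applies only with "more than 50" reports
FAIL_PERCENT = 2       # "2% or more of them are fails"


@dataclass
class DistInfo:
    """Hypothetical per-distribution record; field names are assumptions."""
    name: str
    total_reports: int       # all CPAN Testers reports for the dist
    fail_reports: int        # how many of those reports are fails
    has_meta: bool           # ships a META.json or META.yml
    declares_min_perl: bool  # minimum perl version is specified


def fails_cpan_testers(dist: DistInfo) -> bool:
    """True if the dist has more than 50 reports and 2% or more are fails."""
    if dist.total_reports <= MIN_REPORTS:
        return False
    # Integer arithmetic avoids floating-point surprises at exactly 2%.
    return dist.fail_reports * 100 >= FAIL_PERCENT * dist.total_reports


def any_fail(dist: DistInfo) -> bool:
    """Overall metric: the dist fails if any individual measure fails."""
    return (
        fails_cpan_testers(dist)
        or not dist.has_meta
        or not dist.declares_min_perl
    )


if __name__ == "__main__":
    example = DistInfo("Example-Dist", 200, 3, True, False)
    print(example.name, "FAIL" if any_fail(example) else "PASS")
```

In this made-up example the distribution has a 1.5% fail rate, so it passes the CPAN Testers measure, but it still fails overall because it does not declare a minimum perl version.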

I used a fairly recent CPAN snapshot to calculate this for all distributions, and present it below for different stages of the river. I calculated the individual measures for each distribution, and then the overall water quality metric.

                      Number of downstream dependents
                      10k+     1k - 9999   100 - 999   10 - 99   1 - 9    0
# dists               45       195         570         1589      8210     21250
CPAN Testers Fails    4        10          30          172       1473     4762
                      8.9%     5.1%        5.3%        10.8%     17.9%    22.4%
No META               1        2           10          51        483      4440
                      2.2%     1.0%        1.8%        3.2%      5.9%     20.9%
No perl version       28       68          251         779       4879     15401
                      62.2%    34.9%       44.0%       49.0%     59.4%    72.5%
Any Fail              28       75          262         849       5293     16369
                      62.2%    38.5%       46.0%       53.4%     64.5%    77.0%

I first thought about the "water quality" of the CPAN River back in May. The figures for CPAN Testers have improved since then, which is good (though the CPAN Testers data was slightly out of date, as there had been a CPAN Testers issue).

One thing that's interesting is that all of the metrics improve as you move up river, until you get to the head of the river (distributions with 10,000 or more dependents), where they all get a bit worse. I wonder if that's because a lot of those 45 distributions are dual-life ones that have been bundled with Perl 5 since the first release, and so perhaps haven't always been updated to follow new practices?
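The trend can be checked directly against the counts in the table above. The following is a small sketch that recomputes each percentage from the counts and tests whether a measure improves at every step up river before getting worse at the head. The function and variable names are mine, chosen for illustration only.

```python
# Counts copied from the table, ordered from the head of the river
# (10k+ dependents) down to distributions with no dependents.
DISTS = [45, 195, 570, 1589, 8210, 21250]
COUNTS = {
    "CPAN Testers Fails": [4, 10, 30, 172, 1473, 4762],
    "No META": [1, 2, 10, 51, 483, 4440],
    "No perl version": [28, 68, 251, 779, 4879, 15401],
    "Any Fail": [28, 75, 262, 849, 5293, 16369],
}


def rates(counts):
    """Percentage of failing dists per band, rounded to one decimal."""
    return [round(100 * c / n, 1) for c, n in zip(counts, DISTS)]


def improves_until_head(percentages):
    """True if the rate drops at each step up river, then rises at the head."""
    head, *rest = percentages
    upstream = rest[::-1]  # start at 0 dependents and move up river
    dropping = all(a > b for a, b in zip(upstream, upstream[1:]))
    return dropping and head > upstream[-1]


if __name__ == "__main__":
    for measure, counts in COUNTS.items():
        print(measure, rates(counts), improves_until_head(rates(counts)))
```

All four rows follow the same pattern: the rate falls at every stage from 0 dependents up to the 1k - 9999 band, then rises again for the 45 distributions at the head.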

What other factors should be included in a CPAN water quality metric?

Improving the water quality in 2016

One of my main goals for 2016 is going to be improving the water quality of the CPAN River, i.e. of distributions with one or more downstream dependents.

I am going to have this as one of the focusses for the 2016 Pull Request Challenge, and also work on this myself. I'll generate these stats again on the 1st January, and then track them through the year. If anyone wants to join me on this quest, I'll be happy to have the company.
