CPAN adoption candidates v3

CPANadoption Sun 11 August 2013

Two weeks ago I published a list of CPAN modules that might be candidates for adoption, and described the metric used to score them. I had a lot of comments on that version, which has prompted version 3 of the metric. The key change is the use of gating criteria to decide whether a module should even be considered for the list. The new list contains dists that score at least 5 (out of 14), which is about 4% of the dists on CPAN.

There are two stages to the metric now. First we apply gating criteria, and if the dist gets past those, it's then scored for 'adoption potential'. I'll present the gating criteria and scoring rubric first, then explain various parts.

Another new addition: instead of looking at the number of immediate (or direct) dependencies, we now look at the total number of dependent distributions.

Exclusion criteria: the dist isn't included in the list if either:

Inclusion criteria: the dist is scored if either:

If a dist is marked for inclusion, it's then scored according to the following rubric. Unless stated, each rule adds +1 to the score.

  1. The recent bug score can add 0 to 3 points. If there's only 1 bug, then this is capped at 1 point.
  2. There have been 10 or more bugs reported since the last release.
  3. One or more modules has ADOPTME or HANDOFF.
  4. If the dist has at least 1 dependent distribution, it gets 1 + log10(total # dists), which includes indirect dependencies.
  5. Author has more than one dist on CPAN and hasn't released anything in the last 3 years.
  6. Only one person has PAUSE permissions for the distribution.
  7. The distribution only contains one module.
  8. If the module is in core, but has upstream set to 'cpan' it gets +2.
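As a rough illustration of rule 4, here is a minimal Python sketch of how the dependents term might be computed. The function name `dependents_score` is hypothetical, and it assumes that "total # dists" means the count of direct plus indirect dependent distributions; this is not the code actually used to build the list.

```python
import math


def dependents_score(total_dependents: int) -> float:
    """Rule 4 sketch: 0 if nothing depends on the dist, else 1 + log10(total).

    Assumption: total_dependents counts direct and indirect dependents.
    """
    if total_dependents < 1:
        return 0.0
    return 1.0 + math.log10(total_dependents)


# A dist with a single dependent gets 1 point; each tenfold increase adds 1.
for n in (0, 1, 10, 100):
    print(n, dependents_score(n))
```

The logarithm keeps very widely used dists from swamping the rest of the rubric: going from 10 to 100 dependents is worth the same single point as going from 1 to 10.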

Here's a plot which shows the distribution of scores across all CPAN distributions. Note that the y axis is logarithmic.

[Figure: CPAN adoption scores]

This is quite different from the graph for v2 of the metric. For v3 you can see a lot of dists have a score of 0 (86% of CPAN), due to the gating criteria.

I first reported that 87% of dists had a score of 0. This was due to a bug, where I hadn't reinitialised some of my intermediate data after changing the rules for which bugs are counted. Thanks to David Golden for catching this. So ignoring wishlist and unimportant tickets meant that 6% of CPAN dropped out of consideration. That made 92% of dists excluded.

But then I found a bug in my SQL used to ignore wishlist and unimportant tickets:

severity NOT IN ('wishlist', 'unimportant')

There are a lot of tickets where the severity is NULL, and those weren't getting included. That clause is now:

(severity ISNULL OR severity NOT IN ('wishlist', 'unimportant'))
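The difference between the two clauses comes from SQL's three-valued logic: `NULL NOT IN (...)` evaluates to NULL rather than true, so those rows are filtered out. The sketch below reproduces this with Python's built-in `sqlite3` module. The `tickets` table and its rows are made up for illustration and are not the real ticket data.

```python
import sqlite3

# Hypothetical in-memory table standing in for the real ticket data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, severity TEXT)")
conn.executemany(
    "INSERT INTO tickets (severity) VALUES (?)",
    [("critical",), ("wishlist",), ("unimportant",), (None,), (None,)],
)

old_clause = "severity NOT IN ('wishlist', 'unimportant')"
new_clause = "(severity ISNULL OR severity NOT IN ('wishlist', 'unimportant'))"


def count_matching(where_clause: str) -> int:
    """Count the tickets selected by the given WHERE clause."""
    query = f"SELECT COUNT(*) FROM tickets WHERE {where_clause}"
    return conn.execute(query).fetchone()[0]


old_count = count_matching(old_clause)  # NULL-severity rows are dropped
new_count = count_matching(new_clause)  # NULL-severity rows are kept
print(old_count, new_count)
```

With these five rows, the original clause matches only the `critical` ticket, while the corrected clause also picks up the two tickets with no severity set.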

Furthermore, I was previously considering the date when tickets were created, but now I've switched to using the date when each ticket was last updated.

Recent bug score

This is calculated with the following, and capped to a max value of 3:

            # months since last release - 6
bug_score = -----------------------------------
            # months since most recent open bug

This means that dists released in the last 6 months can't appear on the list (unless they have ADOPTME or HANDOFF). For example XML-Twig appeared on the previous list, but doesn't appear this time because it was released in May 2013.

This gives a higher score to modules that were last released a long time ago, but where bugs have been reported recently.
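To make the formula concrete, here is a minimal Python sketch. The function name and parameters are hypothetical, and a few details are assumptions rather than part of the metric as described: a dist with no open bugs scores 0, negative results are clamped to 0 (the rubric says the score ranges from 0 to 3), and a bug updated in the current month is treated as one month old to avoid dividing by zero.

```python
def recent_bug_score(months_since_release: float,
                     months_since_bug: float,
                     open_bugs: int) -> float:
    """Sketch of the recent bug score, capped as described in the rubric."""
    if open_bugs < 1:
        return 0.0  # assumption: no open bugs means no bug score

    # Assumption: treat a bug from the current month as one month old,
    # so that the denominator is never zero.
    denominator = max(months_since_bug, 1)
    raw = (months_since_release - 6) / denominator

    # A single bug caps the score at 1; otherwise the cap is 3.
    cap = 1.0 if open_bugs == 1 else 3.0
    return max(0.0, min(raw, cap))


# Released 3 months ago: the numerator is negative, so the score is 0.
print(recent_bug_score(3, 1, 5))
# Released 30 months ago, latest bug 2 months ago: 24 / 2 = 12, capped at 3.
print(recent_bug_score(30, 2, 5))
# Same dist but with only one open bug: capped at 1.
print(recent_bug_score(30, 2, 1))
```

The first call shows why a dist released within the last 6 months cannot pick up any points from this rule.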

Discussion of other points

Overall I think this is a better measure: there will still be false positives in there, but it feels like there are a lot fewer of them.

Future work

Sources of data:

Let me know if you've got other ideas for extending or refining this.
