CPAN modules for parsing User-Agent strings

Neil Bowers

2012-08-05

This article is a comparison of nine CPAN modules for extracting information out of the User-Agent string passed to web sites by browsers, spiders, and other software agents. These are useful when processing log files, producing analytics, and for other purposes. You may even use them to change your application's behaviour, but that sort of thing is frowned upon.

A good default choice is HTML::ParseBrowser — it has the best overall coverage, and is probably fast enough for most users. If speed is important, use HTTP::UserAgentString::Parser — it's an order of magnitude faster, and has the second-best coverage. If you need to identify whether an agent is a robot, use HTTP::BrowserDetect.

Here's a summary of the modules which gives some indication of maturity and whether the module is actively maintained:

Module	Doc	Version	Author	# bugs	# users	Last update
`HTML::ParseBrowser`	CPAN	`1.05`	Neil Bowers	`1`	`1`	`2012-02-25`
`HTTP::BrowserDetect`	CPAN	`1.44`	Olaf Alders	`0`	`17`	`2012-05-04`
`HTTP::DetectUserAgent`	CPAN	`0.02`	Takaaki Mizuno	`6`	`1`	`2011-10-18`
`HTTP::Headers::UserAgent`	CPAN	`3.02`	Neil Bowers	`0`	`1`	`2011-11-14`
`HTTP::MobileAgent`	CPAN	`0.36`	Yoshiki KURIHARA	`0`	`16`	`2012-07-24`
`HTTP::UserAgentString::Parser`	CPAN	`0.6.1`	Nicolas Moldavsky	`0`	`0`	`2012-06-15`
`Mobile::UserAgent`	CPAN	`1.05`	Craig Manley	`0`	`0`	`2005-10-14`
`Parse::HTTP::UserAgent`	CPAN	`0.35`	Burak Gürsoy	`1`	`2`	`2012-05-14`
`Woothee`	CPAN	`0.2.4`	田籠聡	`1`	`0`	`2012-07-20`

I'll look at each module in turn, then present results of comparing the modules, and finally which module you should use when.

There are two basic types of module:

Provides methods like name(), version(), os(). This is useful for logfile analytics, and similar applications.
Provides methods like is_chrome(), is_firefox(), is_windows(). I guess this is used if you're generating browser-specific code? I have to admit I don't see the attraction of this style of interface, and would be interested to hear from anyone whose code better maps to this style of interface.

In each section I'll show SYNOPSIS style code examples to illustrate basic use of each module. The User-Agent string I'll use for these is:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.54 Safari/535.2

This is taken from the version of Chrome I'm using on my Mac.

What's the problem?

Why is this hard, and why are there so many problems? HTTP user agents are supposed to ("SHOULD" in the terms of the HTTP/1.1 spec) identify themselves, using a simple format for the User-Agent field:

product/version

Major sub-products can also be listed in the User-Agent (UA) string, and additional information can be included in parentheses, and can just be freeform text, so basically any format you like. Here's the UA string from the version of curl on my laptop:

curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5

The first entry gives the product name and version, and subsequent products identify the main libraries used by curl. What's so hard with this, huh? Look at the UA string above from Chrome. If I hadn't already told you, would you know for sure whether it was Chrome or Safari? And look at:

Opera/9.80 (X11; Linux i686; U; hu) Presto/2.9.168 Version/11.50

At least it's not listing Mozilla/5.0 as the first product (which many browsers do), but then we've got two different version numbers. Which one would appear in the About popup?

Some agents put structured information in the parentheses, some put a URI, some don't bother putting the free-form information in parens. Oh, and people lie. Browsers mis-represent themselves in the UA string, for various reasons.

So basically, it's a mess. You might try and do some standardised parsing, but if you want to get good coverage (see the Comparison section at the end), you'll need a bunch of special cases as well.
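To make the "standardised parsing" idea concrete, here's a minimal sketch that splits a UA string into product/version tokens and parenthesised comments. The tokenize_ua() function is a hypothetical helper of my own, not part of any module reviewed here, and it makes no attempt at the special cases.

```perl
use strict;
use warnings;

# Split a User-Agent string into product tokens and parenthesised comments.
# tokenize_ua() is a hypothetical helper, for illustration only.
sub tokenize_ua {
    my ($ua_string) = @_;
    my (@products, @comments);

    # Each step matches either "(comment)" or "product[/version]".
    while ($ua_string =~ m{\G\s*(?:\(([^)]*)\)|([^\s/()]+)(?:/(\S+))?)}gc) {
        if (defined $1) {
            push @comments, $1;
        } else {
            push @products, { name => $2, version => $3 };
        }
    }
    return { products => \@products, comments => \@comments };
}

my $parsed = tokenize_ua(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/535.2 '
  . '(KHTML, like Gecko) Chrome/15.0.874.54 Safari/535.2'
);

print join(', ', map { $_->{name} } @{ $parsed->{products} }), "\n";
# prints: Mozilla, AppleWebKit, Chrome, Safari
```

The tokenising is the easy part. Deciding which of those four products is "the browser" is where the special cases come in.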

HTML::ParseBrowser

HTML::ParseBrowser provides a keyword extractor style of interface.

use HTML::ParseBrowser;

$ua = HTML::ParseBrowser->new( $ua_string );
print "browser  = ", $ua->name, "\n";
print "version  = ", $ua->v, "\n";
print "os       = ", $ua->os, "\n";
print "ostype   = ", $ua->ostype, "\n";
print "osvers   = ", $ua->osvers, "\n";

which prints the following:

browser  = Chrome
version  = 15.0.874.54
os       = Macintosh
ostype   = Macintosh
osvers   =

The ostype method returns the interpreted type of the operating system; for example 'Windows' rather than some specific version or codename. Similarly, osvers is supposed to return the interpreted version of the operating system; for Windows NT 5.1, this method will return 'XP', as it's more commonly known.

The module provides four methods for returning the language or languages supported by the browser, either as a language name, or two-letter language code. The module will also report the same language more than once; for example with the following UserAgent string:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Mozilla/4.8 [en] (Windows NT 5.1; U)

the langs() method returns ['en', 'en'], where it should really just return ['en']. I may just drop the langs() and languages methods in a future release.

The name() method returns the "interpreted name of the browser". For example, 'Mozilla' becomes 'Netscape' (early versions of Netscape Navigator were known as Mozilla), but 'Gecko' becomes Mozilla, 'Netscape6' becomes Netscape. Less confusingly, 'MSIE' is returned as 'Internet Explorer'.

I recently took over maintenance of this module, and guided by this review, have released a number of updates to improve its coverage and performance, particularly related to the popular modern browsers.

HTTP::BrowserDetect

HTTP::BrowserDetect implements both styles of interface:

use HTTP::BrowserDetect;

$ua = HTTP::BrowserDetect->new( $ua_string );
print "name        = ", $ua->browser_string, "\n";
print "version     = ", $ua->public_version, "\n";
print "OS          = ", $ua->os_string, "\n";
print "engine      = ", $ua->engine_string, "\n";
print "is-mac?     = ", $ua->mac, "\n";
print "is-windows? = ", $ua->windows, "\n";
print "is-chrome?  = ", $ua->chrome, "\n";
print "is-safari?  = ", $ua->safari, "\n";

which prints the following:

name        = Chrome
version     = 15
OS          = Mac OS X
engine      = KHTML
is-mac?     = 1
is-windows? = 
is-chrome?  = 1
is-safari?  =

Notice that the is-something methods return an empty string for false, rather than 0. This can still be used in a boolean expression, as the empty string evaluates to false, but I'd prefer to see 0 returned.
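If you do want a 0 or 1, for display or for writing to a file, a small wrapper does the job whatever the module uses to represent false. This is just a sketch; as_flag() is a hypothetical helper, not something the module provides.

```perl
use strict;
use warnings;

# Normalise any Perl true/false value to 1 or 0.
# as_flag() is a hypothetical helper, for illustration only.
sub as_flag { return $_[0] ? 1 : 0 }

print "empty string -> ", as_flag(''),    "\n";   # empty string -> 0
print "undef        -> ", as_flag(undef), "\n";   # undef        -> 0
print "1            -> ", as_flag(1),     "\n";   # 1            -> 1
```

You would then write something like as_flag( $ua->windows ) when printing.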

Because of the approach taken, the module provides about 116 methods, many of which are for identifying specific hardware and/or software (lotusnotes, blackberry, wap, ps3, win95, altavista, and lycos, for example).

The module does also provide some useful generic methods, such as robot() and mobile().

The module basically has a long list of rules based on looking for substrings in the User-Agent string. These are all run when you instantiate the class, and used to set up a data structure. The methods then check values in the data structure.
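The general shape of that approach can be sketched in a few lines. This is not the module's actual code: the class name My::SubstringDetect and its handful of rules are made up for illustration, and I've only borrowed three method names from the example above.

```perl
package My::SubstringDetect;   # hypothetical class, for illustration only
use strict;
use warnings;

# Ordered rules: the first matching substring decides the browser name.
# Order matters, because Chrome's UA string also contains "Safari".
my @BROWSER_RULES = (
    [ 'opera'   => 'Opera'   ],
    [ 'chrome'  => 'Chrome'  ],
    [ 'safari'  => 'Safari'  ],
    [ 'firefox' => 'Firefox' ],
    [ 'msie'    => 'MSIE'    ],
);

sub new {
    my ($class, $ua_string) = @_;
    my $lc_ua = lc( defined $ua_string ? $ua_string : '' );
    my %tests;

    # Run every rule once, up front, and remember the results.
    for my $rule (@BROWSER_RULES) {
        my ($needle, $name) = @$rule;
        next unless index($lc_ua, $needle) >= 0;
        $tests{browser} = $name unless defined $tests{browser};
    }
    $tests{mac}     = index($lc_ua, 'mac')     >= 0 ? 1 : 0;
    $tests{windows} = index($lc_ua, 'windows') >= 0 ? 1 : 0;

    return bless { tests => \%tests }, $class;
}

# Accessors only read the precomputed data structure.
sub browser_string { $_[0]{tests}{browser} }
sub mac            { $_[0]{tests}{mac} }
sub windows        { $_[0]{tests}{windows} }

package main;

my $ua = My::SubstringDetect->new(
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/535.2 '
  . '(KHTML, like Gecko) Chrome/15.0.874.54 Safari/535.2'
);
print $ua->browser_string, " mac=", $ua->mac, " windows=", $ua->windows, "\n";
# prints: Chrome mac=1 windows=0
```

The trade-off is that all the work is done in the constructor, whether or not you ever call the corresponding method.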

As with HTML::ParseBrowser, HTTP::BrowserDetect also does some transforming of browser names. For example 'AOL' becomes 'AOL Browser'; 'mobile safari' becomes 'Mobile Safari', but 'Opera Mobile' becomes 'Opera'. Sometimes these sorts of transformations are useful, other times confusing. And when comparing modules, it was frustrating that there isn't consistency in these name and version transformations.

HTTP::BrowserDetect expects version strings to look like floating point numbers, so gets Chrome's 15.0.861.0 wrong, as it does with 0.13.GIT, 0.11.3-5ubuntu2 and similar. It has methods for the major and minor version numbers, so it would make sense to have a method which returns the "raw version string".
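Here's a short sketch of why treating a version as a number loses information, and how major and minor numbers can be pulled out while keeping the raw string. The split_version() function is a hypothetical helper of my own, not a method of the module.

```perl
use strict;
use warnings;

# Split a raw version string into major, minor, and whatever follows.
# split_version() is a hypothetical helper, for illustration only.
sub split_version {
    my ($raw) = @_;
    my ($major, $minor, $rest) = $raw =~ /^(\d+)(?:\.(\d+))?(.*)$/s
        or return;
    return { raw => $raw, major => $major, minor => $minor, rest => $rest };
}

for my $raw ('15.0.861.0', '0.13.GIT', '0.11.3-5ubuntu2') {
    my $v = split_version($raw);
    {
        no warnings 'numeric';   # numifying these strings would otherwise warn
        # '15.0.861.0' numifies to 15: everything after "15.0" is dropped.
        printf "%-16s as number: %-5s major: %s minor: %s rest: %s\n",
            $raw, $raw + 0, $v->{major}, $v->{minor}, $v->{rest};
    }
}
```

Keeping the raw string alongside the parsed parts means nothing is thrown away, however odd the version looks.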

HTTP::DetectUserAgent

HTTP::DetectUserAgent provides a simple interface of the first type:

use HTTP::DetectUserAgent;

$ua = HTTP::DetectUserAgent->new( $ua_string );
print "type    = ", $ua->type,    "\n";
print "name    = ", $ua->name,    "\n";
print "version = ", $ua->version, "\n";
print "vendor  = ", $ua->vendor,  "\n";
print "os      = ", $ua->os,      "\n";

That's the whole interface: simple and gets the job done. The above prints the following:

type    = Browser
name    = Chrome
version = 15.0.874.54
vendor  = Google
os      = Macintosh

The type method returns one of 'Browser', 'Mobile', 'Crawler', 'Robot', or 'Unknown'. Crawler is used for specific search-engine crawlers, such as Google and Yahoo. Robot is used for LWP, wget, curl, etc.

The module doesn't seem to be maintained, and there have only been two versions released, but it provides reasonable coverage (fourth in the comparison below), and is one of the faster modules.

HTTP::Headers::UserAgent

HTTP::Headers::UserAgent provides basic functions to get the browser name, version, and the operating system:

use HTTP::Headers::UserAgent 3.02;

$ua = HTTP::Headers::UserAgent->new( $ua_string );
print "browser  = ", $ua->browser,  "\n";
print "version  = ", $ua->version,  "\n";
print "os       = ", $ua->os,       "\n";
print "platform = ", $ua->platform, "\n";

which prints the following:

browser  = Chrome
version  = 15
os       = macos
platform = ppc

I have now taken over maintenance of this module, and released version 3.02. The module is now deprecated; it's recommended you use one of the other modules described here. HTTP::Headers::UserAgent is now a front end to HTTP::BrowserDetect.

The bottom line is: don't use this module.

HTTP::MobileAgent

HTTP::MobileAgent aims to identify mobile UserAgents, and the documentation admits that it's mainly Japanese mobiles. Here's the standard example:

use HTTP::MobileAgent;

$ua = HTTP::MobileAgent->new( $ua_string );
print "name          = ", $ua->name, "\n";
print "is_non_mobile = ", $ua->is_non_mobile, "\n";
print "carrier       = ", $ua->carrier, "\n";

which results in the following:

name          = Mozilla
is_non_mobile = 1
carrier       = N

So not very successful. The documentation doesn't say whether 'N' is a special value returned by the carrier() method for non-mobiles. Or it could be the result of failed parsing.

I tried the same code with the user agent string from my iPhone 4:

name          = Mozilla
is_non_mobile = 1
carrier       = N

The module supports methods for checking whether it's a specific carrier (e.g. is_docomo()), if the phone supports GPS (gps_compliant()), and the display() method returns an instance of HTTP::MobileAgent::Display, which gives characteristics of the phone's screen, such as size and whether colour is supported.

This module provides some useful methods, but the coverage is poor, so unless you're in Japan, it's not very useful.

A separate module, HTTP::MobileAgent::Flash, adds methods to HTTP::MobileAgent for checking whether the user agent supports Flash.

HTTP::UserAgentString::Parser

HTTP::UserAgentString::Parser is a Perl API for the User-Agent database maintained at user-agent-string.info. You create an instance of the Parser class, then call the parse() method on it to parse a UA string. The parse() method returns either an instance of HTTP::UserAgentString::Browser or HTTP::UserAgentString::Robot, depending on whether the UA is identified as a robot. If neither, then undef is returned.

Here's the standard example for parsing the Chrome User-Agent string:

use HTTP::UserAgentString::Parser;

$parser = HTTP::UserAgentString::Parser->new();
$ua     = $parser->parse( $ua_string );
print "browser  = ", $ua->name, "\n";
print "version  = ", $ua->version, "\n";
print "os       = ", $ua->os->name, "\n";
print "ostype   = ", $ua->os->family, "\n";

which prints the following:

browser  = Chrome
version  = 15.0.874.54
os       = Mac OS X 10.7 Lion
ostype   = Mac OS X

For Internet Explorer the name() method returns 'IE'.

The browser class also provides methods for identifying whether the user agent is a browser, a mobile browser, an email client, a robot, or an HTTP library.

The os() method returns an instance of HTTP::UserAgentString::OS if the OS could be identified, or undef otherwise. The OS class has methods for getting the operating system name (eg "Windows 98"), family (eg "Windows"), and a few other things.

HTTP::UserAgentString::Parser downloads a data file from user-agent-string.info, which is then cached locally for 7 days. It provides methods for retrieving information about the database being used. By default the local database is stored in /tmp; I think it should really be somewhere like /usr/share.

The HTTP::UserAgentString::Robot class doesn't provide a version() method, so if you're looking at the version, you need to do something like the following:

$parser = HTTP::UserAgentString::Parser->new();
$ua = $parser->parse( $ua_string );
if ($ua->can('version')) {
    ...
}

Or use the isRobot method.

Mobile::UserAgent

Mobile::UserAgent is another module focussed on mobile user agents. Here's the standard example:

use Mobile::UserAgent;

$ua = Mobile::UserAgent->new( $ua_string );
if ($ua->success) {
    print "vendor  = ", $ua->vendor,  "\n";
    print "model   = ", $ua->model,   "\n";
    print "version = ", $ua->version, "\n";
} else {
    print "not a mobile user agent\n";
}

which results in the following:

not a mobile user agent

I tried this example with the iPhone user agent string, but that wasn't recognised as a mobile. Given the module hasn't been updated since 2005, and the iPhone wasn't released until 2007, that's not surprising!

The module provides methods for getting the screen dimensions (if available), checking if the user agent is an i-mode handset, and checking if the agent is Mozilla-like. It also provides a method isStandard() which returns true if the user agent string has the standard model/version format, and isRubbish(), which returns true if the user agent string doesn't follow the standard format. The isRubbish() check isn't simply the negation of isStandard(): it returns true only if the user agent doesn't follow the standard format, doesn't specify Mozilla-likeness, and isn't an i-mode phone.

Given how out-of-date this module is, relative to the speed of change in the mobile phone space, it's probably not worth using.

Parse::HTTP::UserAgent

Parse::HTTP::UserAgent provides a simple looking API, but is actually the front-end for a small handful of classes which try and make sense of the UserAgent string.

use Parse::HTTP::UserAgent;

$ua = Parse::HTTP::UserAgent->new( $ua_string );
die "Cannot parse UserAgent string\n" if $ua->unknown;

print "browser     = ", $ua->name, "\n";
print "version     = ", $ua->version, "\n";
print "raw version = ", $ua->version('raw'), "\n";
print "os          = ", $ua->os, "\n";
print "lang        = ", $ua->lang, "\n";
print "toolkit     = ", $ua->toolkit, "\n";

which prints the following:

browser     = Chrome
version     = 15.000874054
raw version = 15.0.874.54
os          = Macintosh
lang        = Intel Mac OS X 10_7_1
toolkit     = AppleWebKit535.2535.200

It identifies Chrome correctly, but the version is wrong, lang() returns the wrong thing, and the toolkit is reported as something which looks like an amalgam of two parts of the UserAgent string.

Note that if you pass 'raw' to the version method, you'll get the raw version string, which in this case is correct. This isn't currently documented. I'd make the version() method work this way by default, and require parameters for trying to interpret the version string.
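The odd-looking 15.000874054 is what you get if you treat the raw string as a dotted-decimal version and numify it with Perl's core version module. I'm assuming, rather than asserting, that something like this is happening inside the module; the sketch below just shows the transformation.

```perl
use strict;
use warnings;
use version;   # core module

my $raw = '15.0.874.54';
my $v   = version->parse($raw);

# numify() packs each component after the first into three digits:
# 15 . 000 . 874 . 054
print "raw     = $raw\n";
print "numeric = ", $v->numify, "\n";   # numeric = 15.000874054
```

That form is handy for numeric comparison, but it's not something you'd want to show to a user, which is why I'd make the raw string the default.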

This code also results in a warning about use of an uninitialized value. I've submitted a patch for that. Update: a new version has been released with these fixes, so the module is very much in active development.

In addition to the fairly standard interface above, the module also overloads stringification and numification on the UA object returned, so you can do freaky things like:

print "ok\n" if $ua eq 'Opera' && $ua >= 9

which is equivalent to:

print "ok\n" if $ua->name eq 'Opera' && $ua->version >= 9

A neat use of overloading, but I think the more explicit version is going to be less confusing for the average reader.
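If you haven't seen this trick before, here's a minimal sketch of how it can be done with the core overload pragma. My::Agent is a made-up class for illustration, not the module's implementation.

```perl
package My::Agent;   # hypothetical class, for illustration only
use strict;
use warnings;

use overload
    '""'     => sub { $_[0]{name} },      # string context gives the name
    '0+'     => sub { $_[0]{version} },   # numeric context gives the version
    fallback => 1;                        # derive eq, >=, etc. from the above

sub new {
    my ($class, %args) = @_;
    return bless { name => $args{name}, version => $args{version} }, $class;
}

sub name    { $_[0]{name} }
sub version { $_[0]{version} }

package main;

my $ua = My::Agent->new( name => 'Opera', version => 11.50 );

print "overloaded: ok\n" if $ua eq 'Opera' && $ua >= 9;
print "explicit:   ok\n" if $ua->name eq 'Opera' && $ua->version >= 9;
```

Both lines print ok. The same object behaves as a name in string comparisons and as a version in numeric ones, which is exactly what makes it hard to read.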

The module generates a warning if it gets a version string which isn't made up of dotted digits (such as 1.0rc1). If the module can't handle the format of a UA string it shouldn't generate a warning, but should just flag this in what's returned.

Woothee

The Woothee module is a part of a suite of libraries for parsing User-Agent strings; the suite has versions of the Woothee parser in Java, Perl, Python, and Ruby.

Woothee provides a class method parse(), which takes a User-Agent string and returns a hashref with keys name, category, os, version, and vendor:

use Woothee;

$ua = Woothee->parse( $ua_string );
print "browser  = ", $ua->{name}, "\n";
print "category = ", $ua->{category}, "\n";
print "version  = ", $ua->{version}, "\n";
print "os       = ", $ua->{os}, "\n";

which prints the following:

browser  = Chrome
category = pc
version  = 15.0.874.54
os       = Mac OSX

The category key will have one of the following values: 'pc', 'smartphone', 'mobilephone', 'appliance', 'crawler', 'misc', or 'UNKNOWN'.

Woothee provides one other class method, is_crawler(), which takes a User-Agent string, and returns a true value if the agent is a crawler. It does this in a quick and dirty way; the documentation says that if you want accuracy on this check, you should call parse() and check whether the category key has the value 'crawler'.

Comparison

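For this comparison I've excluded the two modules which focus only on mobiles. HTTP::Headers::UserAgent appears in some of the tables for completeness, but since it's now a front end to HTTP::BrowserDetect, I don't consider it further, for the reasons given above.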

Coverage

I built a corpus of user agent strings, based on the list at useragentstring.com, taking those where the agent name and version could be clearly determined. I ran this corpus (of 12,875 user agent strings) past each of the modules. The following table shows the percentage where each module got the name right, the version right, and both right (coverage):

Module	Name (%)	Version (%)	Coverage (%)
HTML::ParseBrowser	65.3	60.4	59.2
HTTP::UserAgentString::Parser	58.1	63.1	51.6
Parse::HTTP::UserAgent	53.9	51.7	47.8
HTTP::DetectUserAgent	44.5	42.1	39.4
Woothee	30.9	27.7	26.4
HTTP::BrowserDetect	40.1	10.7	7.9
HTTP::Headers::UserAgent	39.0	10.7	7.9

I had something of an unfair advantage with HTML::ParseBrowser, as I was testing the module with the same corpus that I use for this evaluation. I'll be offering it to the other module authors.

The corpus contains a lot of outdated browsers and robots, which you're unlikely to see in logfiles today. Looking at the examples that weren't matched by the various modules, many of them were precisely these outdated ones, so take these figures with a pinch of salt.

For comparison, I wrote the simplest UserAgent parser: it just takes the first product/version item in the string and uses that. It scored 12.6%, 12.2%, and 7.7%.
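A baseline like that, plus the scoring, can be sketched as follows. This is not the script used for the figures above: naive_parse() and score_parser() are hypothetical names, and the three-entry corpus is just the strings quoted earlier in this article, with the expected values filled in by hand.

```perl
use strict;
use warnings;

# The naive baseline: trust the first product/version token.
sub naive_parse {
    my ($ua_string) = @_;
    my ($name, $version) = $ua_string =~ m{^\s*([^\s/()]+)/(\S+)};
    return { name => $name, version => $version };
}

# Score a parser against entries of [ ua_string, want_name, want_version ].
# Returns percentages for name correct, version correct, and both correct.
sub score_parser {
    my ($parser, $corpus) = @_;
    my $total = @$corpus or return (0, 0, 0);
    my ($name_ok, $version_ok, $both_ok) = (0, 0, 0);

    for my $entry (@$corpus) {
        my ($ua_string, $want_name, $want_version) = @$entry;
        my $got = $parser->($ua_string);
        my $n = defined $got->{name}    && $got->{name}    eq $want_name;
        my $v = defined $got->{version} && $got->{version} eq $want_version;
        $name_ok++    if $n;
        $version_ok++ if $v;
        $both_ok++    if $n && $v;
    }
    return map { 100 * $_ / $total } ($name_ok, $version_ok, $both_ok);
}

# Illustrative corpus only; expected values are my own reading of each string.
my @corpus = (
    [ 'curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5',
      'curl', '7.21.4' ],
    [ 'Opera/9.80 (X11; Linux i686; U; hu) Presto/2.9.168 Version/11.50',
      'Opera', '11.50' ],
    [ 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_1) AppleWebKit/535.2 '
    . '(KHTML, like Gecko) Chrome/15.0.874.54 Safari/535.2',
      'Chrome', '15.0.874.54' ],
);

printf "name %.1f%%, version %.1f%%, both %.1f%%\n",
    score_parser(\&naive_parse, \@corpus);
# prints: name 66.7%, version 33.3%, both 33.3%
```

The naive parser gets curl entirely right, gets Opera's name but not its version, and calls Chrome "Mozilla", which is a fair miniature of how it fares on the full corpus.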

Modern browsers

I created a mini corpus which just had example user agent strings from Internet Explorer, Firefox, Chrome, Safari, Opera and Camino. The following table shows how well each module did at recognising these.

Browser / OS	Woothee	HTTP::DetectUserAgent	HTTP::BrowserDetect	HTTP::UserAgentString::Parser	HTML::ParseBrowser	HTTP::Headers::UserAgent	Parse::HTTP::UserAgent
Internet Explorer 8 / Windows XP	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Firefox 7 / Mac	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Safari 5 / Mac	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Firefox 3 / Mac	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Camino 2 / Mac		✓✓		✓✓	✓✓		✓✓
Safari 5 / Windows XP	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Firefox 7 / Windows XP	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Chrome 14 / Windows 7	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Firefox 3 / Windows 7	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Internet Explorer 9 / Windows 7	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓
Opera 11 / Mac	✓	✓	✓✓	✓✓	✓✓	✓✓	✓✓
Chrome 15 / Mac	✓✓	✓✓	✓	✓✓	✓✓	✓	✓✓

Two ticks means it got both browser name and version correct, one tick means it got one of those right.

HTTP::BrowserDetect scores particularly badly on this test, because it can't handle double- or triple-dotted version numbers.

Robot detection

Only five of the modules provide a method for finding out if the agent is believed to be a robot or crawler. I used the XML corpus from www.user-agents.org, and trusted that the Robot classification was largely correct. The following table shows how the five modules handled the 1439 robots.

Module	Robot detection (%)
HTTP::BrowserDetect	69.8
Woothee	31.1
HTTP::DetectUserAgent	30.2
HTTP::UserAgentString::Parser	8.5
Parse::HTTP::UserAgent	1.4

Performance

I used the Benchmark module to time each module processing the corpus used in the coverage tests. To get meaningful results, I did 10 runs through the corpus with each module.
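The timing harness looks roughly like the sketch below. The two "parsers" here are trivial stand-ins of my own, and the corpus is two strings from this article, so the numbers it produces are meaningless; it only shows the shape of the Benchmark calls.

```perl
use strict;
use warnings;
use Benchmark qw(timethese timestr);

# Stand-ins for real parsers; swap in calls to the modules being compared.
my %parsers = (
    first_token => sub { my ($n, $v) = $_[0] =~ m{^([^\s/]+)/(\S+)};     return $n },
    last_token  => sub { my ($n, $v) = $_[0] =~ m{([^\s/]+)/(\S+)\s*$}; return $n },
);

# A stand-in corpus; the real one would be read from a file.
my @corpus = (
    'curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 OpenSSL/0.9.8r zlib/1.2.5',
    'Opera/9.80 (X11; Linux i686; U; hu) Presto/2.9.168 Version/11.50',
);

# One benchmark iteration is a full pass over the corpus; do 10 passes each.
my %jobs = map {
    my $parse = $parsers{$_};
    ( $_ => sub { $parse->($_) for @corpus } );
} keys %parsers;

# The 'none' style suppresses Benchmark's default report.
my $results = timethese(10, \%jobs, 'none');

for my $name (sort keys %$results) {
    printf "%-12s %s\n", $name, timestr($results->{$name});
}
```

With a corpus this small Benchmark may complain about too few iterations; with thousands of strings per pass that goes away.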

The table shows the number of seconds taken, so smaller is better:

Module	Time (s)
HTTP::UserAgentString::Parser	0.12
Woothee	5.00
HTTP::DetectUserAgent	5.14
HTML::ParseBrowser	6.87
Parse::HTTP::UserAgent	15.58
HTTP::BrowserDetect	25.83
HTTP::Headers::UserAgent	26.07

Which just shows the benefit of the approach taken by HTTP::UserAgentString::Parser, using a lookup file.

Conclusion

For general purpose parsing of a UserAgent string, to identify the agent name and version, the overall best choice is HTML::ParseBrowser. It gives the best coverage, both overall and on modern browsers, and while it's not the fastest, it's probably fast enough for most users.

If you're processing a lot of data, so speed is important, and you're only really worried about identifying the most popular browsers, then HTTP::UserAgentString::Parser would be a good choice. It's much faster than the rest and comes second on coverage.

If identification of robots is important to you, then HTTP::BrowserDetect is clearly the best module to use. If you want good browser identification as well, then use this in conjunction with HTML::ParseBrowser.

If you're looking for information about mobile browsers, HTTP::MobileAgent might be worth a look. It currently has a strong bias towards Japanese mobiles, but is under active development, so maybe you could help get coverage outside Japan.

The perfect module

There is no one standout module, so here's my wishlist, which is not much more than a mashup of features from the modules described above:

Good coverage at recognising the browser.
A simple interface if all you want is the browser and version, with more under the hood if you want it.
By default performs sensible mapping from the name given in the UserAgent string (for example MSIE would be returned as Internet Explorer), but provides an interface for getting the raw agent name.
The version method would return the raw version string, but would also provide a way to get at major, minor version numbers, etc.
Provides methods for the operating system and version. Again, this would map the raw data to more user-friendly values ("NT 6.1" would be OS "Windows", version "7"), but would also provide a way to get at the raw data.
An agent_type method would return 'mobile', 'browser', 'robot', 'tablet', or 'unknown'.
If information is available, a device method would return an object which has methods for getting the screen resolution, etc.
Kept up-to-date. The documentation might refer to a website, where you can direct your browser. You'd be asked to give information about your browser, which would add it to the corpus.

It's the last point that's the big challenge. If the currently active authors pooled their efforts, the above doesn't seem implausible. That's my next goal after a bit more work on HTML::ParseBrowser.

HTTP::UserAgentString::Parser is a new entry on this list, but it's already a good contender for the crown.

Here's one possible interface:

  $ua = Perfect::UserAgent::Parser->new( $ua_string );
  
  $name = $ua->name;
  print "name       = ", $name, "\n";           # Internet Explorer
  print "raw name   = ", $name->raw, "\n";      # MSIE
  
  $version = $ua->version;
  print "version    = ", $version, "\n";        # 15.0.874.54
  print "major      = ", $version->major, "\n"; # 15
  
  print "agent type = ", $ua->agent_type, "\n"; # browser

  print "os       = ", $ua->os, "\n";           # Windows 7
  print " name    = ", $ua->os->name, "\n";     # Windows
  print " version = ", $ua->os->version, "\n";  # 7

This is using stringification overloading so if you just want the browser name and version, the interface is simple, but you can dig deeper if you want to.