One of the Perl weekly challenges this week is to use the language detection API at detectlanguage.com. This takes a string of UTF-8 text, and tells you what language or languages it might be. This blog post describes my first cut at a Perl 5 interface to the API.
To use the API you have to sign up; they have a free level, and you don't have to hand over {d,cr}edit card details, which is good.
This is how you access the API:
use WebService::DetectLanguage;
my $key = "xxxxxxxxxxxx";
my $api = WebService::DetectLanguage->new(key => $key);
I've created some basic documentation for the module, but if you're going to use it you'll have to read the DetectLanguage API documentation.
There's an endpoint you can use to get a list of supported languages; this was easy to think about, so I did this first.
For each language you get a code and name, so I created a ::Language class. The languages() method returns a list of objects:
foreach my $lang ($api->languages) {
    printf "code=%s name=%s\n",
        $lang->code,
        $lang->name;
}
There's another simple endpoint for getting details of your account, so I created an ::AccountStatus class for this:
my $status = $api->account_status();
printf "plan=%s status=%s\n",
    $status->plan,
    $status->status;
The endpoint for detecting languages has two modes: one takes a single string, and the other takes multiple strings.
When the service tries to identify the language it returns one or more guesses, where each guess has a language, a reliability flag, and a confidence level. For my first attempt, I just had a single method, which returned a list of array refs, one per input string, with each arrayref containing one or more language guesses. This felt clumsy for the simple case of a single string though.
So I ended up with detect and multi_detect methods, with the latter behaving as described above.
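To make the "list of array refs" shape concrete, here is a minimal sketch that walks such a structure. It makes no API call: the data is invented stand-in data, plain hashrefs take the place of the module's result objects, and the summarise() helper is hypothetical rather than part of WebService::DetectLanguage.

```perl
use strict;
use warnings;

# Stand-in data, NOT real API output: one arrayref per input string,
# each holding one or more guesses. Plain hashrefs are used here in
# place of the module's result objects.
my @results = (
    [ { code => 'en', confidence => 11.2, reliable => 1 } ],
    [ { code => 'fr', confidence => 9.8,  reliable => 1 },
      { code => 'it', confidence => 2.1,  reliable => 0 } ],
);

# Hypothetical helper: the outer loop visits one entry per input string,
# the inner loop visits each guess made for that string.
sub summarise {
    my @lines;
    my $n = 0;
    foreach my $guesses (@_) {
        $n++;
        foreach my $guess (@$guesses) {
            push @lines, sprintf "string %d: %s (confidence %.2f, reliable %s)",
                $n, $guess->{code}, $guess->{confidence},
                $guess->{reliable} ? 'Yes' : 'No';
        }
    }
    return @lines;
}

print "$_\n" for summarise(@results);
```

The nested loop is what makes this shape awkward when there is only one string, which is the clumsiness that splitting out a separate single-string method avoids.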
The other issue is that the language detection results only include a language code, not the code and name.
An application that wanted the name would have to either call the languages() method, or cache the names itself.
I submitted a suggestion that they include the language name in a future version, but that was nixed.
So my Language class has a cached hash of language codes and names, and uses this to get the name of any language in the result.
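The caching idea can be sketched in a few lines. This is only an illustration under assumptions: the %NAME_FOR_CODE hash, its three entries, and the language_name() function are hypothetical, and the module's real internal table and method names may well differ.

```perl
use strict;
use warnings;

# Hypothetical sketch: a hard-coded subset of language codes and names.
# The real module's internal table is assumed to be much longer.
my %NAME_FOR_CODE = (
    en => 'ENGLISH',
    fr => 'FRENCH',
    de => 'GERMAN',
);

# Return the cached name for a code, or undef if the code is unknown
# (for example, a language added to the service after the cache was built).
sub language_name {
    my ($code) = @_;
    return $NAME_FOR_CODE{ lc $code };
}

# Requires Perl 5.10+ for the defined-or operator (//).
printf "%s => %s\n", $_, language_name($_) // '(unknown)' for qw(en fr xx);
```

The lookup costs no extra API request, but a code missing from the table comes back as undef, so a caller needs a fallback for languages the cache doesn't know about.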
Here's how you use the single string case:
my $text = "It was a bright cold day in April, ...";
my @results = $api->detect($text);
foreach my $result (@results) {
    printf "language = %s (%s) confidence = %f reliable = %s\n",
        $result->language->name,
        $result->language->code,
        $result->confidence,
        $result->is_reliable ? 'Yes' : 'No';
}
Which results in:
language = ENGLISH (en) confidence = 15.360000 reliable = Yes
This is a usable first pass at a Perl interface to the API. I don't know how often the API's interface might change in the future (it's only version 0.2 right now), or how frequently they add new languages. If the language list changes frequently, then I might end up regretting the decision to cache the names locally. They have a pretty long list of supported languages though, so I'm guessing it won't change all that often.
The main thing I think it needs right now is more complete documentation; at the moment it relies on you reading the API documentation.
I've released version 0.01 of WebService::DetectLanguage to CPAN.
I emailed the team at detectlanguage.com, and let them know about the Perl module available on CPAN. They've now listed Perl as one of the supported languages on their home page.