Automatically translating a gettext-ed programme with Apertium

The problem

As part of my i18n/l10n (internationalization/localization) work, I wanted to demo our application translated into another language. I've previously written about how to translate your app into pirate speak, but that was just a toy.

Apertium to the rescue

Apertium is an open source machine translation toolkit. It supports many language pairs and is easy to use from the command line. Since polib gives us easy access to gettext .po files, we can combine the two!

How to automatically translate your .po file

Requirements

First, install the apertium package (ubuntu install link) and the language pair that you want to use (see the list of current apertium language pairs). For example, if you want to translate your English .po file into Catalan, you'd install apertium-en-ca (ubuntu install link).

Secondly, the programme uses the polib python library (polib documentation), which you can install with easy_install/pip.
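
Before running anything, it can help to confirm that both requirements are actually visible to Python. The following is only a minimal sketch, assuming Python 3; missing_requirements is a hypothetical helper that is not part of apertium-po.py, and it does not check whether the language pair itself is installed.

```python
import importlib.util
import shutil


def missing_requirements():
    """Return a list of the requirements that could not be found."""
    missing = []
    # shutil.which looks for an executable on PATH, like the shell's `which`
    if shutil.which("apertium") is None:
        missing.append("apertium (command-line tool)")
    # find_spec returns None when a top-level module cannot be imported
    if importlib.util.find_spec("polib") is None:
        missing.append("polib (Python library)")
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    if problems:
        print("Missing: " + ", ".join(problems))
    else:
        print("apertium and polib were both found")
```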

Invoke the programme

The code is on github; you can also download the apertium-po.py file directly.

Then invoke the programme like so: python apertium-po.py /path/to/my/file.po LANG1-LANG2. For example, to translate the file ~/django.po from English to Catalan, run: python apertium-po.py ~/django.po en-ca. Note that the programme saves the translations back into the same .po file, so keep a copy of the original if you need it.

No direct translation?

I wanted to translate an English gettext file into French. Apertium does not currently have this language pair, so you can't translate directly. However you can translate from English to Catalan to French. This programme is able to do this sort of double translation, just specify the language code like so: en-ca/ca-fr (read that as "english-to-catalan, then catalan-to-french"), so: python apertium-po.py ~/django.po en-ca/ca-fr
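
The chaining idea can be sketched without apertium at all. In the example below, fake_translate and FAKE_DICTIONARIES are hypothetical stand-ins for the real translator, and chain_translate is an illustrative name; only the splitting on "/" mirrors what the programme does.

```python
def chain_translate(text, lang_direction, translate_one):
    """Apply translate_one(text, pair) once per pair in 'en-ca/ca-fr' style codes."""
    for pair in lang_direction.split("/"):
        text = translate_one(text, pair)
    return text


# Hypothetical stand-in for apertium: a tiny lookup table per language pair.
FAKE_DICTIONARIES = {
    "en-ca": {"hello": "hola", "cat": "gat"},
    "ca-fr": {"hola": "salut", "gat": "chat"},
}


def fake_translate(text, pair):
    """Look each word up in the fake dictionary; unknown words pass through."""
    table = FAKE_DICTIONARIES[pair]
    return " ".join(table.get(word, word) for word in text.split())


print(chain_translate("hello cat", "en-ca/ca-fr", fake_translate))  # salut chat
```

Each step only ever sees the output of the previous one, which is why errors from the first translation carry over into the second.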

Caveats

Obviously machine translation is usually of low quality, and should not be relied on for real production work. Doing two rounds of machine translation (English to Catalan to French) can make the translation of even lower quality. A proper human translator is certainly better than this approach.
This programme can take a long time to run (it manages about 50 translations per minute). This could be because I'm calling apertium inefficiently (spawning a new process for every string), or because machine translation of text is a very computationally hard thing to do.
If apertium can't translate a word or part of the text, it just copies the words from the original into the output. Ergo you might have English words in your Spanish website, etc. If you are doing English → Catalan → French, you might have Catalan text in your French text.
This programme assumes that the target language has the same pluralization rule as English. Trying to sensibly translate plural forms automatically is complicated and I don't know how to do it. For more info see: this description of why we need different plural rules, and a big list of various languages' pluralization rules.
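
To make the last caveat concrete, here is a small sketch of three gettext Plural-Forms rules written out as Python functions. The rules in the comments are the commonly published ones for English, French and Polish; the function names are just for illustration. Note that French already differs from English at zero, so the en-ca/ca-fr example above is affected.

```python
# Each function takes a count n and returns the index of the msgstr[] form
# that gettext would select under that language's Plural-Forms rule.

def english_plural(n):
    # nplurals=2; plural=(n != 1)
    return int(n != 1)


def french_plural(n):
    # nplurals=2; plural=(n > 1)
    return int(n > 1)


def polish_plural(n):
    # nplurals=3;
    # plural=(n==1 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)
    if n == 1:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


for n in (0, 1, 2, 5, 22):
    print(n, english_plural(n), french_plural(n), polish_plural(n))
```

For n = 0 English picks form 1 (the plural) while French picks form 0 (the singular), and Polish needs a third form that the programme never fills in.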

How does it work?

The source code

download


#! /usr/bin/env python

__author__ = 'Amanda McCann <amanda@technomancy.org>'
__version__ = '1.0'
__licence__ = 'GPLv3'

import polib, subprocess, re, sys

def translate_subpart(string, lang_direction):
    """Simple translate for just a certain string"""

    for codes in lang_direction.split("/"):
        # Run one apertium process per language pair, feeding it the text on stdin
        translater = subprocess.Popen(['apertium', '-u', '-f', 'html', codes], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        output, _ = translater.communicate((string + "\n").encode("utf8"))
        # Drop the trailing newline that we added above
        string = output[:-1].decode("utf8")

    return string

def translate(string, lang_direction):
    """Takes a string that is to be translated and returns the translated string, doesn't translate the %(format)s parts, they must remain the same text as the msgid"""
    # simple format chars like %s can be 'translated' ok, they just pass through unaffected
    named_format_regex = re.compile(r"%\([^\)]+?\)[sd]", re.VERBOSE)
    matches = named_format_regex.findall(string)
    new = None

    if len(matches) == 0:
        # There are no format specifiers in this string, so just do a straight translation

        # this fails if we've missed a format specifier
        assert "%(" not in string, string

        new = translate_subpart(string, lang_direction)

    else:

        # we need to do a more complicated translation that preserves the format specifiers
        full_trans = translate_subpart(string, lang_direction)

        for match in matches:
            # then, for each format specifier, replace back in the string

            translated_match = translate_subpart(match, lang_direction)

            # during the translation some extra punctuation/spaces might have been added
            # remove them
            translated_match_match = named_format_regex.search(translated_match)
            assert translated_match_match
            translated_match = translated_match_match.group(0)

            # put back the format specifier, the case of the format specifier might have changed
            replace = re.compile(re.escape(translated_match), re.IGNORECASE)
            full_trans = replace.sub(match, full_trans)

        
        new = full_trans

    return new

def translate_po(filename, lang_direction):
    """Given a .po file, Translate it"""
    pofile = polib.pofile(filename)

    # pretend the same plural forms as English
    pofile.metadata['Plural-Forms'] = 'nplurals=2; plural=(n != 1)'

    try:
        total = len(pofile)
        num_done = 0

        for entry in pofile:

            if entry.msgid_plural == '':
                # not a pluralized string
                entry.msgstr = translate(entry.msgid, lang_direction)

            else:
                # pluralised string
                # we just pretend to use the same rules as english
                entry.msgstr_plural['0'] = translate(entry.msgid, lang_direction)
                entry.msgstr_plural['1'] = translate(entry.msgid_plural, lang_direction)

            num_done += 1
            if num_done % 10 == 0:
                print("Translated %d of %d" % (num_done, total))

    finally:
        pofile.save(filename)

if __name__ == '__main__':
    translate_po(sys.argv[1], sys.argv[2])
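
The least obvious part of the script is how translate() protects named format specifiers such as %(name)s. The sketch below reproduces that idea in isolation, assuming Python 3. Here fake_translate is a hypothetical stand-in for apertium, and translate_keeping_placeholders is an illustrative name rather than a function from the script.

```python
import re

# Same pattern as the script: named specifiers like %(name)s or %(count)d
NAMED_FORMAT = re.compile(r"%\([^\)]+?\)[sd]")


def fake_translate(text):
    # Hypothetical stand-in for apertium: swaps one word and, as a real
    # translator might, changes the case of the placeholder name.
    return text.replace("Hello", "Hola").replace("%(name)s", "%(Name)s")


def translate_keeping_placeholders(msgid, translate_one):
    """Translate msgid, then restore its original named format specifiers."""
    translated = translate_one(msgid)
    for placeholder in NAMED_FORMAT.findall(msgid):
        # Translate the placeholder alone to see what the translator turns it into
        match = NAMED_FORMAT.search(translate_one(placeholder))
        if match is None:
            raise ValueError("translator destroyed %r" % placeholder)
        # Replace the mangled form, ignoring case, with the original specifier
        pattern = re.compile(re.escape(match.group(0)), re.IGNORECASE)
        translated = pattern.sub(lambda m: placeholder, translated)
    return translated


result = translate_keeping_placeholders("Hello %(name)s!", fake_translate)
print(result)                    # Hola %(name)s!
print(result % {"name": "Ana"})  # Hola Ana!
```

Without the restore step the translated string would contain %(Name)s, and formatting it with a "name" key would raise a KeyError.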


Suggestions? Feedback?

As always, if you have suggestions/feedback, feel free to email me, or fork the project on github.