The problem
As part of my i18n/l10n (internationalization/localization) work, I wanted to do a demo of our application translated into another language. I've previously written about how to translate your app to pirate speak, but that was just a toy.
Apertium to the rescue
Apertium is an open source machine translation toolkit. It has many languages and is easy to use from the command line. Since polib gives us easy access to the gettext/.po files, we can combine the two!
How to automatically translate your .po file
First install the packages
Firstly, install apertium and the translation language pair that you want to use; both are packaged for Ubuntu, and the apertium project keeps a list of current language pairs. For example, if you want to translate your English .po file into Catalan, you'd install apertium-en-ca.
Secondly, the script uses the polib Python library (see the polib documentation), which needs to be installed as well.
Invoke the programme
The code is on github, or you can download the apertium-po.py file directly.
Then invoke the programme like so:
python apertium-po.py /path/to/my/file.po LANG1-LANG2
For example, to translate the file ~/django.po from English to Catalan, run:
python apertium-po.py ~/django.po en-ca
No direct translation?
I wanted to translate an English gettext file into French. Apertium does not currently have this language pair, so you can't translate directly. However, you can translate from English to Catalan and then from Catalan to French. This programme is able to do this sort of double translation; just specify the language codes as en-ca/ca-fr (read that as "English to Catalan, then Catalan to French"), like so:
python apertium-po.py ~/django.po en-ca/ca-fr
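The chaining itself is simple: split the direction on "/" and feed the output of each step into the next. Here is a minimal sketch of that idea. The names chain_translate, fake_translate and FAKE_DICTIONARY are hypothetical stand-ins invented for illustration; the real script pipes every step through the apertium command instead of a lookup table.

```python
def chain_translate(text, lang_direction, translate_one):
    """Run text through each language pair in turn, e.g. "en-ca/ca-fr"."""
    for pair in lang_direction.split("/"):
        text = translate_one(text, pair)
    return text


# Toy stand-in for apertium (an assumption for this sketch):
# a tiny lookup table keyed by language pair.
FAKE_DICTIONARY = {
    "en-ca": {"hello": "hola"},
    "ca-fr": {"hola": "bonjour"},
}


def fake_translate(text, pair):
    # Words the table does not know are passed through unchanged.
    return FAKE_DICTIONARY[pair].get(text, text)


print(chain_translate("hello", "en-ca/ca-fr", fake_translate))
```

Because each step only sees the previous step's output, anything the first pair gets wrong or leaves untranslated is handed on to the second pair as-is.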
- Obviously machine translation is usually of low quality, and should not be relied on for real production work. Doing two rounds of machine translation (English to Catalan to French) can make the translation of even lower quality. A proper human translator is certainly better than this approach.
- This programme can take a long time to run (it does about 50 translations per minute). This could be because I'm calling apertium inefficiently (spawning a new process for every string), or because machine translation of text is computationally hard.
- If apertium can't translate a word or part of the text, it just copies the words from the source into the output. Ergo you might have English words in your Spanish website, etc. If you are doing English → Catalan → French, you might have Catalan text in your French translation.
- This programme assumes that the target language has the same pluralization rules as English. Trying to sensibly translate plural forms automatically is complicated and I don't know how to do it. For more info see this description of why we need different plural rules, and a big list of various languages' pluralization rules.
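To see why the plural assumption matters, compare the English rule that the script writes into the Plural-Forms header with the rule gettext conventionally uses for French. The sketch below hard-codes both rules as Python functions; the function names are made up for illustration and are not part of the script.

```python
def english_plural_index(n):
    # Plural-Forms: nplurals=2; plural=(n != 1)
    return int(n != 1)


def french_plural_index(n):
    # Plural-Forms: nplurals=2; plural=(n > 1)
    return int(n > 1)


# Print which msgstr index each rule picks for a few counts.
for n in (0, 1, 2):
    print(n, english_plural_index(n), french_plural_index(n))
```

The two rules disagree for n = 0: English picks the plural form (index 1), French the singular (index 0). A French file produced with the English header would therefore show the wrong form for zero items, and languages with more than two plural forms fare worse.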
How does it work?
The source code
```python
#! /usr/bin/env python

__author__ = 'Rory McCann <firstname.lastname@example.org>'
__version__ = '1.0'
__licence__ = 'GPLv3'

import polib, subprocess, re, sys


def translate_subpart(string, lang_direction):
    """Simple translate for just a certain string"""
    for codes in lang_direction.split("/"):
        # one apertium process per string and per language pair
        translater = subprocess.Popen(['apertium', '-u', '-f', 'html', codes],
                                      stdin=subprocess.PIPE, stdout=subprocess.PIPE)
        output, _ = translater.communicate((string + "\n").encode("utf8"))
        # drop the trailing newline again
        string = output[:-1].decode("utf8")
    return string


def translate(string, lang_direction):
    """Takes a string that is to be translated and returns the translated
    string, doesn't translate the %(format)s parts, they must remain the same
    text as the msgid"""
    # simple format chars like %s can be 'translated' ok, they just pass through unaffected
    named_format_regex = re.compile(r"%\([^\)]+?\)[sd]", re.VERBOSE)
    matches = named_format_regex.findall(string)
    new = None

    if len(matches) == 0:
        # There are no format specifiers in this string, so just do a straight translation

        # this fails if we've missed a format specifier
        assert "%(" not in string, string

        new = translate_subpart(string, lang_direction)

    else:
        # we need to do complicated translation of the bits inside
        full_trans = translate_subpart(string, lang_direction)

        for match in matches:
            # then, for each format specifier, replace back in the string
            translated_match = translate_subpart(match, lang_direction)

            # during the translation some extra punctuation/spaces might have been added
            # remove them
            translated_match_match = named_format_regex.search(translated_match)
            assert translated_match_match
            translated_match = translated_match_match.group(0)

            # put back the format specifier, the case of the format specifier might have changed
            replace = re.compile(re.escape(translated_match), re.IGNORECASE)
            full_trans = replace.sub(match, full_trans)

        new = full_trans

    return new


def translate_po(filename, lang_direction):
    """Given a .po file, translate it"""
    pofile = polib.pofile(filename)

    # pretend the same plural forms as English
    pofile.metadata['Plural-Forms'] = 'nplurals=2; plural=(n != 1)'

    try:
        total = len(pofile)
        num_done = 0
        for entry in pofile:
            if entry.msgid_plural == '':
                # not a pluralized string
                entry.msgstr = translate(entry.msgid, lang_direction)
            else:
                # pluralised string
                # we just pretend to use the same rules as english
                entry.msgstr_plural['0'] = translate(entry.msgid, lang_direction)
                entry.msgstr_plural['1'] = translate(entry.msgid_plural, lang_direction)

            num_done += 1
            if num_done % 10 == 0:
                print("Translated %d of %d" % (num_done, total))
    finally:
        # save whatever has been translated so far, even after an error
        pofile.save(filename)


if __name__ == '__main__':
    translate_po(sys.argv[1], sys.argv[2])
```
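The fiddliest part of the script is keeping named format specifiers such as %(name)s identical to the msgid. The snippet below is a simplified, standalone sketch of that step, using the same pattern as the script. It skips the script's extra step of translating each specifier on its own, and fake_translation is a made-up string standing in for apertium output that has changed the case of a specifier.

```python
import re

# Same pattern as in the script: matches %(name)s or %(count)d
named_format_regex = re.compile(r"%\([^\)]+?\)[sd]")

msgid = "Hello %(name)s, you have %(count)d new messages"
specifiers = named_format_regex.findall(msgid)
print(specifiers)

# Hypothetical translator output in which one specifier was capitalised.
fake_translation = "Hola %(Name)s, tens %(count)d missatges nous"

restored = fake_translation
for spec in specifiers:
    # A case-insensitive replace puts the original spelling back.
    pattern = re.compile(re.escape(spec), re.IGNORECASE)
    restored = pattern.sub(lambda m, s=spec: s, restored)
print(restored)
```

If a specifier came back with a different name, Python's string formatting would raise a KeyError when the translated string is used, which is why the script is strict about restoring them.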
As always, if you have suggestions/feedback, feel free to email me, or fork the project on github.