History
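This phrasebook is a program for translating touristic phrases between 14 European languages included in the MOLTO project (Multilingual On-Line Translation): Bulgarian, Catalan, Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Polish, Romanian, Spanish, and Swedish.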
A Russian version is not yet finished but is planned. Other languages may also be added.
The phrasebook is implemented in the GF programming language (Grammatical Framework). It is the first demo for the MOLTO project, released in the third month (by June 2010). The first version is a very small system, but it will be extended in the course of the project.
The phrasebook has the following requirement specification:
The phrasebook is available as open-source software, licensed under GNU LGPL.
The source code resides in www.grammaticalframework.org/examples/phrasebook/
Interlingua-based translation
Incremental parsing
Mixed modalities
Quasi-incremental translation: many basic types are also used as phrases
Disambiguation, esp. of politeness distinctions
Fall-back to statistical translation
Feed-back from users
The use of resource grammars and functors
Example-based grammar writing and grammar induction from statistical models (Google Translate)
Compile-time transfer: especially, in Action in Words
The level of skills involved in grammar development
Grammar testing
Sentences: general syntactic structures implementable in a uniform way. Concrete syntax via the functor SentencesI.
Words: words and predicates, typically language-dependent. Separate concrete syntaxes.
Greetings: idiomatic phrases, string-based. Separate concrete syntaxes.
Phrasebook: the top module putting everything together. Separate concrete syntaxes.
DisambPhrasebook: disambiguation grammars generating feedback phrases if the input language is ambiguous.
Numeral: resource grammar module directly inherited from the library.
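The role of the disambiguation grammar can be illustrated with a toy model in Python. This is only a sketch of the idea, not the GF implementation: abstract trees are reduced to plain strings, and the tree names, the phrases, and the `parse` and `translate` helpers are all invented for illustration.

```python
# Toy interlingua: every abstract tree is a plain string (names are invented).
# Each "concrete syntax" maps a tree to a phrase in one language.
LINEARIZATIONS = {
    "Eng": {
        "HowAreYou_Polite": "how are you",
        "HowAreYou_Familiar": "how are you",
        "ThankYou": "thank you",
    },
    # Hypothetical disambiguation grammar: English with explicit feedback.
    "DisambEng": {
        "HowAreYou_Polite": "how are you (polite)",
        "HowAreYou_Familiar": "how are you (familiar)",
        "ThankYou": "thank you",
    },
    "Ger": {
        "HowAreYou_Polite": "wie geht es Ihnen",
        "HowAreYou_Familiar": "wie geht es dir",
        "ThankYou": "danke",
    },
}


def parse(lang, phrase):
    """Return every abstract tree whose linearization in `lang` is `phrase`."""
    return sorted(t for t, s in LINEARIZATIONS[lang].items() if s == phrase)


def translate(src, tgt, phrase):
    """Translate via the abstract trees.

    Returns (feedback, translation) pairs. When the input is ambiguous,
    the feedback phrase comes from the disambiguation grammar of `src`.
    """
    trees = parse(src, phrase)
    results = []
    for tree in trees:
        if len(trees) > 1:
            feedback = LINEARIZATIONS["Disamb" + src][tree]
        else:
            feedback = phrase
        results.append((feedback, LINEARIZATIONS[tgt][tree]))
    return results


for feedback, translation in translate("Eng", "Ger", "how are you"):
    print(feedback, "=>", translation)
```

In this sketch the English input has two readings, so each German translation is paired with a feedback phrase that tells the user which reading it belongs to. An unambiguous input such as "thank you" is passed through without extra feedback.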
Here is the module structure as produced in GF by
    > i -retain DisambPhrasebookEng.gf
    > dg -only=Phrasebook*,Sentences*,Words*,Greetings*,Numeral,NumeralEng,DisambPhrasebookEng
    > ! dot -Tpng _gfdepgraph.dot >pgraph.png
The abstract syntax defines the ontology behind the phrasebook. Some explanations can be found in the ontology document, which is produced from the abstract syntax files Sentences.gf and Words.gf by make doc.
The phrasebook uses the PGF server written in Haskell and the minibar library written in JavaScript. Since the sources of these systems are available, anyone can build the phrasebook locally on her own computer.
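As a rough illustration of how a client might talk to a locally built server, the Python sketch below constructs a translation request URL. It assumes the query-string interface of the GF web service (`command=translate` with `input`, `from`, and `to` parameters). The host, port, `/grammars/` path, and the concrete syntax names `PhrasebookEng` and `PhrasebookSwe` are assumptions that may not match a given installation. The code only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode


def translate_url(base_url, grammar, phrase, source, target):
    """Build a translation request URL for a PGF service.

    The parameter names follow the GF web service convention as assumed
    here; check them against the server version you actually run.
    """
    query = urlencode({
        "command": "translate",
        "input": phrase,
        "from": source,
        "to": target,
    })
    return f"{base_url}/grammars/{grammar}?{query}"


# Hypothetical local setup: host, port, and language module names are assumptions.
url = translate_url(
    "http://localhost:41296",
    "Phrasebook.pgf",
    "thank you",
    "PhrasebookEng",
    "PhrasebookSwe",
)
print(url)
```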
Language | Grammarian's language skills | Grammarian's GF skills | Informant used for development | Informant used for testing | Use of external tools | Impact of external tools | Changes on the resource grammar | Development time
---|---|---|---|---|---|---|---|---
Bulgarian | ### | ### | - | - | - | ? | # | ##
Catalan | ### | ### | - | - | - | ? | # | #
Danish | - | ### | + | + | + | ## | # | ##
Dutch | - | ### | + | + | + | ## | # | ##
English | ## | ### | - | + | - | - | _ | #
Finnish | ### | ### | - | - | - | ? | # | ##
French | ## | ### | - | + | - | ? | # | #
German | # | ### | + | + | + | ## | ## | ###
Italian | ### | # | - | - | - | ? | ## | ##
Norwegian | # | ### | + | - | + | ## | # | ##
Polish | ### | ### | + | + | + | # | # | ##
Romanian | ### | ### | - | - | + | # | ### | ###
Spanish | ## | # | - | - | - | ? | _ | ##
Swedish | ## | ### | - | + | - | ? | - | ##
Explanation on scores
The figure presents the process of creating a Phrasebook using an example-based approach for a language X, where X is one of Danish, Dutch, German, or Norwegian.
The time needed for preparing the configuration files for a grammar will not be needed in the future, since the files are reusable for other applications. The time for the second step can be saved if automatic tools, such as Google Translate, are used. This is only possible for languages with a simpler morphology and syntax and with large corpora available. Good results were obtained for German and Dutch with Google Translate, but for languages like Romanian or Polish, which are both complex and lack sufficient resources, the results are discouraging.
If the statistical oracle works well, the only step where a human translator is needed is the evaluation and feedback step. On average, 2 rounds of about 4 hours each were needed for the languages for which we performed the experiment. More effort may be needed for more complex languages.
Disambiguation grammars for other languages than English
Extend the abstract lexicon in Words by hand or (semi)automatically for
Customizable phone distribution: make your own selection of the 2^15 language subsets when downloading the phrasebook to a phone
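The number of possible selections follows from counting subsets: a set of n languages has 2^n subsets, including the empty one. The Python sketch below checks that formula on a small sample and then evaluates it. The figure 2^15 presumably counts 15 languages, for example the 14 current ones plus the planned Russian version; that reading is an assumption, as are the sample language codes.

```python
from itertools import combinations


def count_subsets(n):
    """Number of subsets (including the empty one) of a set with n elements."""
    return 2 ** n


def enumerate_subsets(items):
    """List every subset of `items`; feasible only for small inputs."""
    return [
        set(combo)
        for size in range(len(items) + 1)
        for combo in combinations(items, size)
    ]


# Check the formula on a small, hypothetical selection of language codes.
sample = ["Eng", "Fre", "Swe"]
assert len(enumerate_subsets(sample)) == count_subsets(len(sample))

print(count_subsets(15))  # 32768 possible selections for 15 languages
print(count_subsets(14))  # 16384 for the 14 languages currently listed
```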
The basic things "everyone" can do are
Words and greetings in Greetings
The missing concrete syntax entries are added to the WordsL.gf files for each language L. The morphological paradigms of the GF resource library should be used. Actions (prefixed with A, as AWant) are a little more demanding, since they also require syntax constructors. Greetings (prefixed with G) are pure strings.
Some explanations can be found in the implementation document, which is produced from the concrete syntax files SentencesI.gf and WordsEng.gf by make doc.
Here are the steps to follow for contributors:
Run darcs pull.
Run make present in gf/lib/src/.
Work in gf/examples/phrasebook/.
Run make pgf.
Run darcs record . (in the phrasebook subdirectory).
Run darcs send -o my_phrasebook_patch to create a patch file, which you can send to GF maintainers.
Go to gf/src/server/ and follow the instructions in the project Wiki.
Make sure that Phrasebook.pgf is available to your GF server (see project wiki).
Start lighttpd (see project wiki).
Open gf/examples/phrasebook/www/phrasebook.html and use your phrasebook!
The grammarian need not be a native speaker of the language.
For many languages, the grammarian need not even know the language - native informants are enough.
However, evaluation by native speakers is necessary.
Correct and idiomatic translations are possible.
A typical development time was 2-3 person working days per language.
Google Translate helps in bootstrapping grammars, but its output must be checked.
Resource grammars should give some more support
The Phrasebook has been built in the MOLTO project funded by the European Commission.
The authors are grateful to their native speaker informants helping to bootstrap and evaluate the grammars: Richard Bubel, Grégoire Détrez, Rise Eilert, Karin Keijzer, Michał Pałka, Willard Rafnsson, Nick Smallbone.