GF character encoding changes

Thomas Hallgren
2017-06-29

Changes to character encodings in GF grammar files

Between the release of GF 3.5 and the next version, two changes were made relating to character encodings in GF grammar files:

  1. The default character encoding was changed from Latin-1 (also known as iso-8859-1, cp1252) to UTF-8.

  2. The way you specify alternate character encodings was changed. Instead of using a flags coding = ... declaration in the source file, you should now use a pragma --# -coding=... at the top of the file.

Advantages

UTF-8 is the default encoding for text files on many systems these days, so it makes sense to use it as the default for GF grammar files too.

Changing how alternate encodings are specified allows conversion to Unicode to be done before parsing, which means that

How are my grammar files affected?

If your files still compile without errors after the change, you don't need to do anything. (But see Known problems below!) If you get one of the following errors,

you need to add a --# -coding=... pragma to your file (or convert it to UTF-8).

Grammars will still compile with GF 3.5 after these changes.

Note that GF only understands one option per pragma line. If you already have a --# -path=... pragma, you cannot put the -coding=... option on the same line. Add it on a separate line:

  	--# -path=...
  	--# -coding=...
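
If you have many files to update, the pragma can be added with a small script. The following is a minimal sketch in Python, not part of GF: the function name add_coding_pragma is hypothetical, and the encoding name passed to it (latin1 in the usage note) is only a placeholder for whatever name your GF version accepts. It places the new pragma on its own line, after any pragma lines already at the top of the file.

```python
def add_coding_pragma(source, coding):
    """Return `source` with a `--# -coding=...` pragma on its own line.

    Hypothetical helper: `coding` is inserted verbatim, so it must be an
    encoding name that your GF version understands.
    """
    lines = source.splitlines(keepends=True)

    # Leave the text unchanged if a coding pragma is already present.
    if any(line.startswith("--# -coding=") for line in lines):
        return source

    # Skip past pragma lines already at the top (for example --# -path=...).
    i = 0
    while i < len(lines) and lines[i].startswith("--#"):
        i += 1

    # Make sure the preceding pragma line is terminated, so that the new
    # pragma really ends up on a separate line.
    if i > 0 and not lines[i - 1].endswith("\n"):
        lines[i - 1] += "\n"

    lines.insert(i, f"--# -coding={coding}\n")
    return "".join(lines)
```

For example, add_coding_pragma(text, "latin1") leaves an existing --# -path=... line untouched and adds the coding pragma on the line below it.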

The recommendation for the future is to use UTF-8 for all source files.
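
Converting an existing file to UTF-8 can be done with any text editor or conversion tool. As one possible approach, here is a minimal Python sketch, not part of GF: the function name convert_to_utf8 is hypothetical, and the source encoding defaults to Latin-1 on the assumption that the file was written for the old default. Files that already decode as UTF-8 (including pure ASCII files) are left untouched.

```python
from pathlib import Path


def convert_to_utf8(path, source_encoding="latin-1"):
    """Re-encode the file at `path` to UTF-8 in place.

    Returns True if the file was rewritten, False if it already was
    valid UTF-8. `source_encoding` is an assumption about the old file.
    """
    p = Path(path)
    raw = p.read_bytes()

    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass

    # Decode with the assumed old encoding and write back as UTF-8.
    text = raw.decode(source_encoding)
    p.write_bytes(text.encode("utf-8"))
    return True
```

The sketch only changes the bytes of the file. Any flags coding = ... declaration or --# -coding=... pragma already in the file still names the old encoding and has to be removed or updated by hand.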

Known problems

The intention is that if a grammar file is affected by the changed default encoding, then you will see one of the messages listed in the previous section when you compile the grammar. But there are a couple of issues to be aware of: