Startseite > informatik, oss > RTF: Fail and you

RTF: Fail and you

As with any programming gig, you are bound to encounter some interesting fails.

One of our customers is providing RTF files generated by a third party. To do the silly little things we computational linguists tend todo – a bit of information extraction here, some sentiment analysis there – I convert the RTF to HTML using rtf2html to process it with BeautifulSoup.

But, of course, the umlauts are broken: any umlaut in the generated HTML is preceded by an @ sign in my editor, which turns out to be 0 – a null byte. This caused some pre-emptive facepalms because I knew I was in for some pain.

Looking at the relevant section in the RTF document, I see the following:

\uc2 M\u228\’00\’E4rz

This is supposed to „März“. I found the download for the 1.91. RTF spec on microsoft.com where you can choose between .doc and .docx. Microsoft, this is bad and you should feel bad.

The spec tells me that RTF initially was not unicode-aware. When unicode was introduced in a later revision, they decided any characters which are not ASCII should be represented by their Unicode code position preceded by \u. Following this unicode representation, a close representation of the character in the declared codepage for the document should be used. In our case, the representation consists of two ASCII characters: \’00 and \’E4. The \uc2 keyword tells unicode-enabled RTF readers to skip 2 characters after the unicode character because they’re obviously redundant.

The unicode presentation here is \u228 which properly points to „ä“. For RTF readers which are not unicode-enabled, the unknown \u keyword is simply ignored and the regular \’xx representations in the declared codepage are used.

So we have two fails here:

  • rtf2html does not support Unicode. This is no big deal per se as the RTF spec is backwards compatible in this regard, but me being a big fan of Unicode, I will consider it s a minor fail. If rtf2html supported unicode, the broken document would be displayed properly
  • the alternate representation of the Unicode character is broken. The declared codepage is cp1252, so why would you ever use a two-byte replacement? Interestingly enough, \’E4 is indeed the correct cp1252 code for „ä“.

Specs are hard. Let’s go shopping. Do people actually test their software before they deploy it? I’m either going to patch rtf2html to ignore 0 (which will probably break other documents in equally hilarious ways) or add proper unicode support which should not be too hard.

Advertisements
Kategorien:informatik, oss
  1. Es gibt noch keine Kommentare.
  1. No trackbacks yet.

Kommentar verfassen

Trage deine Daten unten ein oder klicke ein Icon um dich einzuloggen:

WordPress.com-Logo

Du kommentierst mit Deinem WordPress.com-Konto. Abmelden / Ändern )

Twitter-Bild

Du kommentierst mit Deinem Twitter-Konto. Abmelden / Ändern )

Facebook-Foto

Du kommentierst mit Deinem Facebook-Konto. Abmelden / Ändern )

Google+ Foto

Du kommentierst mit Deinem Google+-Konto. Abmelden / Ändern )

Verbinde mit %s

%d Bloggern gefällt das: