Koha ILS

Diacritics gone wrong

by Melia Meggs on Jan 22, 2013

Have you ever noticed a funky symbol in your catalog that looks like a diamond with a question mark in the middle? Here’s an example:

Gabriel Garc�ia M�arquez

Are you wondering what this symbol means and why it appears in your records? This is a diacritic behaving badly. The symbol is the Unicode replacement character (U+FFFD), which software displays when it meets bytes it cannot decode. Instead of the diamond with the question mark, you are probably supposed to be seeing an umlaut, an acute accent, or any number of other diacritics (most commonly found in non-English languages).

If you see these diacritics gone wrong in your catalog, it probably happened when the record was imported. We see this problem often with libraries that get their records from OCLC, for example. The problem above arises when you export records from OCLC in one encoding and then import them into Koha as another. The encoding must be the same on both sides.
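To make the mismatch concrete, here is a minimal Python sketch. It assumes the name was exported as MARC-8, where the acute accent is stored as the single byte 0xE2 placed before the letter it modifies, and that the importing side reads those bytes as UTF-8. The variable names are made up for illustration, and this is not how Koha itself is implemented.

```python
# MARC-8 stores the acute accent as byte 0xE2 *before* the base letter,
# so "García Márquez" is written as Garc + 0xE2 + ia, M + 0xE2 + arquez.
marc8_bytes = b"Gabriel Garc\xe2ia M\xe2arquez"

# Reading those bytes as UTF-8 fails at each 0xE2: UTF-8 expects that byte
# to start a multi-byte sequence, but a plain ASCII letter follows it.
# errors="replace" swaps each undecodable byte for U+FFFD.
misread = marc8_bytes.decode("utf-8", errors="replace")

print(misread)  # Gabriel Garc�ia M�arquez
```

Note that the accent is not recoverable from the misread text: once the byte has been replaced by U+FFFD, the original information is gone, which is why the fix has to happen at export or import time.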

There are two main options for character encoding. The MARC-8 encoding standard was developed in 1968 specifically for libraries, alongside the introduction of the MARC format. Much later, in 1993, the UTF-8 encoding standard was released. UTF-8 supports every character in the Unicode character set, which includes over 110,000 characters covering 100 scripts. It therefore supports far more characters than MARC-8, and it has become the dominant character encoding for the World Wide Web, accounting for more than half of all web pages. MARC-8 encoding, on the other hand, is rarely used outside of library records. As Mark V Sullivan points out in his blog post from May 2012, “libraries which deal heavily in alternate character sets are more likely to be aware of character encoding issues and export in Unicode encoding.”

When you import records into Koha from Tools > Stage MARC Records for Import, you can choose what type of encoding to use: MARC-8 or UTF-8. In Koha, UTF-8 is the default import encoding.

Please see the Koha manual for more info on importing MARC records.
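If you are not sure which encoding to pick at import time, the record itself usually says. In MARC 21, position 09 of the 24-character leader declares the character coding scheme: a blank means MARC-8 and "a" means UCS/Unicode. The sketch below reads that one byte. The function name and the sample leaders are hypothetical, and the flag only tells you what the record claims, which may not match the bytes it actually contains.

```python
def declared_encoding(record: bytes) -> str:
    """Return the encoding a MARC 21 record declares in leader position 09."""
    if len(record) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    flag = record[9:10]
    if flag == b"a":
        return "UTF-8"    # leader/09 = "a": UCS/Unicode
    if flag == b" ":
        return "MARC-8"   # leader/09 = blank: MARC-8
    return "unknown"


# Two made-up leaders that differ only in position 09.
unicode_leader = b"00714cam a2200205 a 4500"
marc8_leader = b"00714cam  2200205 a 4500"

print(declared_encoding(unicode_leader))  # UTF-8
print(declared_encoding(marc8_leader))    # MARC-8
```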

The next thing to do is double check your OCLC export settings and verify that the default export encoding is also UTF-8. To modify your export settings in OCLC:

“Go to Export Options Screen”
On the General tab, click Admin.
At the Preferences screen, click Export Options.
Once the export encoding is set to UTF-8, select “Save My Default” so that this setting is used every time.

Please see the OCLC manual for more info on exporting MARC records.

Remember: the important thing is that the encoding matches! If it’s UTF-8 coming out of OCLC, it also needs to be UTF-8 going into Koha. You must use the same encoding all through the pipeline. Otherwise, it’s like you’re asking your computer to use a French dictionary to translate Russian words, and the � symbol peppered throughout your catalog can be interpreted as, “What IS this character, and what am I supposed to do with it?!”
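One way to catch a mismatch before it reaches the catalog is to test an export file yourself. The sketch below assumes you intend to import as UTF-8 and simply checks whether each record in the file decodes cleanly. It splits on the MARC record terminator (byte 0x1D); the function name and sample data are invented for illustration. A record that contains only plain ASCII will pass either way, so treat this as a rough screen and not as proof.

```python
RECORD_TERMINATOR = b"\x1d"  # ends every record in a MARC file


def records_not_utf8(data: bytes) -> list:
    """Return the positions (0-based) of records that are not valid UTF-8."""
    bad = []
    records = [r for r in data.split(RECORD_TERMINATOR) if r]
    for index, record in enumerate(records):
        try:
            record.decode("utf-8")  # strict: raises on any invalid byte
        except UnicodeDecodeError:
            bad.append(index)
    return bad


# Simplified stand-ins for records: one really is UTF-8, one is MARC-8.
utf8_record = "Gabriel García Márquez".encode("utf-8") + RECORD_TERMINATOR
marc8_record = b"Gabriel Garc\xe2ia M\xe2arquez" + RECORD_TERMINATOR

print(records_not_utf8(utf8_record + marc8_record))  # [1]
```

If the list comes back non-empty, the file is probably not UTF-8, and it is worth rechecking the export settings before staging it.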
