Koha How-To
How to Prevent Diacritic Errors in Koha
Diacritics
What are diacritics?
Have you ever noticed a funky symbol in your catalog that looks like a diamond with a question mark in the middle? Here’s an example:
Gabriel Garc�ia M�arquez
Are you wondering what this symbol means and why it appears in your records? This is a diacritic behaving badly. Instead of seeing the diamond with the question mark, you are probably supposed to be seeing an umlaut or an acute accent, or any number of other diacritics (most commonly found in non-English languages).
If you see these diacritics gone wrong in your catalog, it probably happened when the record was imported.
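To see why that diamond shows up, here is a minimal Python sketch (not part of Koha) that reproduces the symptom: text written in one encoding but read back as another. Latin-1 stands in for MARC-8 here purely for illustration, since Python cannot decode real MARC-8 on its own.

```python
# Minimal sketch: bytes written in one encoding, read back as another.
name = "Gabriel García Márquez"

# Suppose the exporting system wrote the name as Latin-1 (a single-byte
# encoding, standing in for MARC-8 in this illustration)...
raw_bytes = name.encode("latin-1")

# ...but the importing system was told the bytes were UTF-8.
garbled = raw_bytes.decode("utf-8", errors="replace")

print(garbled)  # Gabriel Garc�a M�rquez  (U+FFFD, the replacement character)
```

The � you see in the catalog is U+FFFD, the Unicode replacement character: the importer hit bytes it could not make sense of in the encoding it was told to expect.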
Cataloging
We often see this problem with libraries that get their records from OCLC, for example. It arises when you export records from OCLC in one encoding and then import them into Koha as another; the encoding must be the same on both ends.
Character Encoding
There are a couple of different options available for character encoding.
The two encoding formats most commonly used in Koha are:
- MARC-8
- UTF-8
To find out more about these character encodings, see the Resources section below.
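One quick way to check which encoding a MARC file claims to use is to look at position 09 of the record leader: in MARC 21, "a" means UCS/Unicode (UTF-8) and a blank means MARC-8. Here is a small Python sketch that reads that byte from the first record of a binary MARC file; the filename records.mrc is just a placeholder.

```python
# Sketch: inspect Leader/09 ("Character coding scheme") of the first record
# in a binary MARC file. In MARC 21, "a" = UCS/Unicode (UTF-8), blank = MARC-8.
# "records.mrc" is a placeholder filename.
with open("records.mrc", "rb") as f:
    leader = f.read(24)              # the MARC 21 leader is always 24 bytes

coding = chr(leader[9])
if coding == "a":
    print("Leader says: UTF-8 (Unicode)")
elif coding == " ":
    print("Leader says: MARC-8")
else:
    print(f"Leader says: unexpected value {coding!r}")
```

Keep in mind that the leader only records what the file claims; a mislabeled export can still contain bytes in the other encoding.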
When you import records into Koha from Tools > Stage MARC Records for Import, you can choose what type of encoding to use: MARC-8 or UTF-8. In Koha, UTF-8 is the default import encoding.
Below is a screenshot of importing records into Koha, where the encoding option lives:
Please see the Koha manual for more info on importing MARC records.
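Before choosing UTF-8 in the staging tool, it can save headaches to confirm that the file really does decode as UTF-8. This is a rough sketch using only the Python standard library; records.mrc is again a placeholder filename, and note that a file that passes this check could still be plain-ASCII MARC-8, since ASCII is valid UTF-8.

```python
# Sketch: check whether a file decodes cleanly as UTF-8 before staging it
# with the UTF-8 option. "records.mrc" is a placeholder filename.
def looks_like_utf8(path):
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError as err:
        print(f"Not valid UTF-8: byte {err.object[err.start]:#04x} at offset {err.start}")
        return False

if looks_like_utf8("records.mrc"):
    print("The file decodes as UTF-8.")
else:
    print("The file is not valid UTF-8; it may be MARC-8 or another encoding.")
```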
The next thing to do is double-check your OCLC export settings and verify that the default export encoding is also UTF-8. To modify your export settings in OCLC:
- To get to the Export Options screen, click Admin on the General tab.
- At the Preferences screen, click Export Options.
- When you’re done modifying these preferences, select “Save My Default” to make this option the default every time.
Please see the OCLC manual for more info on exporting MARC records.
Remember: the important thing is that the encoding matches! If it’s UTF-8 coming out of OCLC, it also needs to be UTF-8 going into Koha. You must use the same encoding all through the pipeline. Otherwise, it’s like you’re asking your computer to use a French dictionary to translate Russian words, and the � symbol peppered throughout your catalog can be interpreted as, “What IS this character, and what am I supposed to do with it?!”
If you are seeing these broken diacritics and your library does not get its records from OCLC, check with the vendor you receive records from and find out what encoding they use when sending your records; that is a good place to start.
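When you have that conversation with your vendor, it can help to report what the records themselves claim. Here is a sketch that tallies the Leader/09 value across every record in a binary MARC file; vendor.mrc is a placeholder filename, and the sketch assumes a well-formed file where Leader/00-04 holds each record's length.

```python
# Sketch: tally the character coding scheme (Leader/09) claimed by each
# record in a binary MARC file. "vendor.mrc" is a placeholder filename.
from collections import Counter

with open("vendor.mrc", "rb") as f:
    data = f.read()

counts = Counter()
pos = 0
while pos + 24 <= len(data):
    length = int(data[pos:pos + 5])      # Leader/00-04: record length (digits)
    counts[chr(data[pos + 9])] += 1      # Leader/09: character coding scheme
    pos += length                        # jump to the start of the next record

labels = {"a": "UTF-8 (Unicode)", " ": "MARC-8"}
for value, n in counts.items():
    print(f"{labels.get(value, repr(value))}: {n} record(s)")
```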
Important
If your data contains special characters or diacritics, make sure your file is encoded in UTF-8. Otherwise, the special characters will not be imported correctly.
Resources
What is MARC-8?
The MARC-8 encoding standard was developed in 1968 specifically for libraries, alongside the introduction of the MARC format.
What is UTF-8?
Much later, in 1993, the UTF-8 encoding standard was released. UTF-8 supports every character in the Unicode character set, which includes over 110,000 characters covering 100 scripts. UTF-8 supports far more characters than MARC-8 and has become the dominant character encoding for the internet.
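As a tiny illustration, UTF-8 represents characters outside plain ASCII with multi-byte sequences, which is one reason a file labeled with the wrong encoding falls apart the way it does:

```python
# Sketch: UTF-8 encodes non-ASCII characters as multi-byte sequences.
for ch in "eé米":
    encoded = ch.encode("utf-8")
    print(f"{ch!r} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

# 'e' -> 1 byte(s): 65
# 'é' -> 2 byte(s): c3 a9
# '米' -> 3 byte(s): e7 b1 b3
```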
Read more by Kelly McElligott