MarcEdit QuickTip #2 – Unicode in your source file

Converting a batch of multi-lingual bibliographic records stored in an MS Excel worksheet to .mrc using MarcEdit? Make sure the charset is set to UTF-8 when you save the spreadsheet or export it to CSV.

Naveed Bhatti, a fellow Koha ILS user from neighboring Pakistan, pinged me last week about a problem he was facing. He had multi-lingual bibliographic data stored in an MS Excel worksheet. He wanted to use MarcEdit 6’s Delimited Text Translator tool, available under the “Add-ins” menu, to convert this file into a Unicode (UTF-8) encoded MARC21 (.mrc) file so that he could import the records into Koha.

However, when he attempted to generate the .mrk file (MarcEdit’s intermediate MarcBreaker mnemonic format before export to .mrc), he saw a string of “?????? ????? ???” wherever the spreadsheet had text in the Arabic script. Naveed thought he must be missing something small but crucial, perhaps a setting.

[Image: naveed_02]

I had a hunch, but I wanted to check the data before commenting. So I asked for a few sample records, which he sent over the next day. I checked and found it was a simple charset conversion glitch at the spreadsheet end of things. I could easily generate both the incorrect and the correct output (see above) just by changing the charset in the export filter. I was using LibreOffice Calc on Windows 8.1, and the default export charset was *not* set to Unicode (UTF-8). With that default charset, the exported / saved file did not contain the correct Unicode codepoints for the non-Latin data; the characters had already been lost by the time the file was written. As a result, at MarcEdit’s end, it became a simple case of garbage-in-garbage-out: MarcEdit never received the correct non-Latin data in the first place.
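To see why the damage happens at export time, here is a minimal Python sketch of the same effect. It is only an illustration, not what LibreOffice or MarcEdit do internally. The sample word and the choice of Windows-1252 (cp1252) as the non-Unicode charset are my assumptions; any legacy charset without Arabic letters behaves the same way.

```python
# Sample Arabic-script word, written as escapes so the snippet survives
# any editor encoding. The word itself is just an illustrative assumption.
title = "\u0643\u062a\u0627\u0628"

# Assumption: the spreadsheet exported with a legacy Latin charset (cp1252).
# That charset has no Arabic letters, so each one is replaced by "?".
lossy = title.encode("cp1252", errors="replace")
print(lossy)  # b'????'

# Exporting as UTF-8 keeps every codepoint, so the text round-trips intact.
utf8 = title.encode("utf-8")
print(utf8.decode("utf-8") == title)  # True
```

Once the "?" bytes are in the file, no setting further down the line can bring the original letters back, which is why the fix has to be made in the spreadsheet export.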

The screenshot below shows the correct filter to use if you are using LibreOffice Calc. If you are using MS-Office, you should see something similar.

[Image: naveed_03]
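If you want to double-check an exported CSV before handing it to MarcEdit, a rough sanity check along the following lines may help. This is a sketch under my own assumptions: the function name `check_export` and the "three or more question marks in an all-ASCII file" heuristic are hypothetical, and a clean result does not prove the data is correct.

```python
import re


def check_export(data: bytes) -> str:
    """Rough check of exported CSV bytes (hypothetical helper).

    Returns "not-utf8" if the bytes are not valid UTF-8,
    "suspect" if the text is pure ASCII but contains runs of "?"
    (a hint that non-Latin text was lost during export),
    and "ok" otherwise.
    """
    try:
        # "utf-8-sig" also accepts a leading byte order mark, if present.
        text = data.decode("utf-8-sig")
    except UnicodeDecodeError:
        return "not-utf8"
    if text.isascii() and re.search(r"\?{3,}", text):
        return "suspect"
    return "ok"


# Usage with a file on disk (the filename is an assumption):
# with open("records.csv", "rb") as fh:
#     print(check_export(fh.read()))
```

Reading the file in binary mode is deliberate here, so that Python does not silently apply the platform's default charset before the check runs.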
