MarcEdit QuickTip #3 – Getting your 952 (items / holdings data) field in place for import into Koha

Shows how to de-duplicate a .mrc file, by merging duplicate bibliographic records spread all over the file and then gather up the holdings record into repeatable 952 field that Koha expects its for item records.

Last night Pawan Sharma, a fellow user on “Koha Users” reached out for some help with importing his items into Koha. Like many other, he too had moved his catalog data from Microsoft Excel spreadsheet to MarcEdit utilizing MarcEdit’s “Delimited Text Translator” feature which at the end of the process had given him a .mrc file. This he proceeded to upload into Koha by using the More > Tools > Catalog > Stage MARC records for import option.

There were no surprises here, *except* that for every single books with multiple copies Koha imported each of the copies as a separate biblio record, instead of a single entry for the biblio with multiple item records attached to it via Marc21 952 repeatable field that Koha uses for managing holdings data. Simply put his data needed to be de-duplicated with the holding data merged back before import, typically using the ISBN number of the records (MARC field 020).

NOTE: If you wish to read more about Koha’s holding records schema see “Holdings data fields (9xx)” from the Koha Community wiki.

For someone who has not done this before, MarcEdit’s de-duplication and then merging it can seem like a daunting task. This post will hopefully demystify the process.

The discussions on Koha Users were based on a lot of assumptions, especially with no idea about Pawan’s data. So, I offered to take a look at it. He first sent me a .mrc file that had 12806 records, which I immediately converted into MarcEdit’s MarcBreaker mnemonic, human readable format.

marcedit_01

And proceeded to take a “Field count” report (see under “Tools” menu of MarcEditor) to check exactly how many records had ISBN (MARC21 field 020) out of the total number of records.

marcedit_01A

The result as can be seen above – NOT A SINGLE ONE of the 12806 biblio records had an ISBN number! Well, this file can be de-duplicated and merged, but *not* using MarcEdit. Only being told about this Pawan mentioned that he had other .mrc files that had ISBN and so he sent a second .mrc (LG-32016-32979.mrc) file over. Turns out of the total of 965 biblio records in this second file, 828 records had ISBN numbers defined.

marcedit_02A

The next task was to extract the records that *had* ISBN numbers. The remaining 137 can not be dealt with in this process and will have to be dealt with separately. For now, we closed the file LG-32016-32979.mrk file with 965 records and went back to the MarcEdit main window in order to use the “Delete Selected Records” option available under Tools > Select MARC Records

marcedit_02B

The next few steps are simple, if not immediately apparent to a new user of MarcEdit. We’ll use the numbered markers on the screenshot to explain it in steps. First, we selected the LG-32016-32979.mrk file with the 965 records in step #1; next we typed in 020 (since we want to match for ISBN) in the Display Field option (by default it shows 245$a); third step was to click on “Import File” button. After the file is imported (takes just a second or two depending on your file size) this the top-left data grid which was blank so far, will show up data similar to this. Finally in step #4, we will click the “Does Not Match” link. Records that do not have an ISBN number will be selected just like the big red arrow here shows.

marcedit_02CThe last step is to click on “Delete Selected”, this will open a File Save dialog with the title “Remaining Records”. In the case, we provided the name LG-32016-32979_ISBN.mrk and saved it and exited from this deletion utility.

This file LG-32016-32979_ISBN.mrk now has the 828 records with ISBN numbers and each of which has a holding records. This is what we will work with for the deduplication process.

marcedit_02D

Using the Tools > Record Deduplication option of MarcEditor, we will now remove the duplicate records into a separate file and save it with the name LG-32016-32979_ISBN_DEDUP.mrk. We will use ISBN as the field to use to identify duplicates. A popup showed us that 828 records processes, so we are done with deduplication. We will also need to save our original work file LG-32016-32979_ISBN.mrk. This file now contains biblio records with unique ISBN number. A quick check with the Fields Count tool showed us there were now 523 records (down from 828 records originally, the rest 305 records are the duplicates that are now saved in LG-32016-32979_ISBN_DEDUP.mrk).

marcedit_02E

 Now for the next step MARC Merge, which was the last step in this process. We have to go back to the main MarcEdit window and use the menu option Tools > Merge Records. The order of files we specify here is highly *important*. The “Source File” in this case was LG-32016-32979_ISBN.mrk (the file with the 523 records with unique ISBN numbers), the “Merge File” is LG-32016-32979_ISBN_DEDUP.mrk (the file where we had removed the duplicates to in the previous step) and finally, “Save File” is simply the name of the new merged file we are going to create (Hint: this is the final file that we will push to Koha). We named the final file as LG-32016-32979_ISBN_MERGED.mrk. The Record Identifier is of course 020 (i.e. ISBN number) and we move on the next screen.

marcedit_02F

This is next step is basically *everything* we have been working for in this post so far, we select the field to merge in from “Merge File” into the “Source File” and click next.

marcedit_02GIn this case everything went well and we were presented with the following screen that said “Merge Completed” and gave us the full path and filename to our merged file LG-32016-32979_ISBN_MERGED.mrk.

marcedit_02H

Of course we opened up the LG-32016-32979_ISBN_MERGED.mrk file in MarcEditor. The first thing was to check the Field Count report, and this is what we saw 523 biblio records with a total of 828 holding records, which sounds right! Below is example of the merged holdings.

marcedit_02I

Of course there is still the task of exporting the MarcBreaker (.mrk) back to .mrc so that Koha can ingest it for its MARC21 staging workflow, but everyone knows that 🙂

NOTE: For reference to this tutorial I’m attaching the zip file containing all the LG-32016-32979 files used in this example.

MarcEdit QuickTip #2 – Unicode in your source file

Converting a batch of multi-lingual bibliographic records stored in a MS-Excel worksheet to .mrc using MarcEdit? Be sure to check your charset is set to UTF-8 while saving the spreadsheet or its CSV export.

Naveed Bhatti, a fellow Koha ILS user from neighboring Pakistan pinged last week over a problem he was facing. He had multi-lingual bibliographic data stored in an MS Excel worksheet. He wanted to use MarcEdit 6’s Delimited Text Translator tool available under the “Add-ins” menu to convert this file into an Unicode (UTF-8) encoded MARC21 (.mrc) file so that he could import the records into Koha.

However, when he attempted to generate the .mrk file (MarcEdit’s intermediate MarcBreaker mnemonic format before export to .mrc) instead of seeing the Arabic script, he saw a bunch of “?????? ????? ???” wherever there was text in the Arabic script in the spreadsheet. Naveed thought he must be missing something small but crucial, perhaps a setting.

naveed_02

I had a hunch, but I wanted to check the data before commenting. So I asked for a few sample records, which he sent over the next day. I checked and found it was a simple case of charset conversion glitch at the spreadsheet end of things. I could easily generate both the incorrect as well as the correct output (see above) with a simple change of the charset filter. I was using LibreOffice Calc on Windows 8.1 and the default export charset was *not* set to Unicode (UTF-8). The case being, with the default export charset the exported / saved file did not contain the correct Unicode codepoints in the data for the non-Latin data. As a result, at MarcEdit’s end, it became a simple case of garbage-in-garbage-out instead of receiving the correct non-Latin data.

The screenshot below shows the correct filter to use if you are using LibreOffice Calc. If you are using MS-Office, you should see something similar.

naveed_03

Koha’s MARC modification templates comes to the rescue

A client had defined an item type they called “Hindi graphic books“. They used the notation “H-GR” to denote such an item e.g. an Amar Chitra Katha graphic novel. Their data was stored in a MS-Excel spreadsheet as simple tabular data. We were tasked with moving this data into Koha. So, to start off, we defined a new itemtype in Koha with the code “H-GR” with the description set to “Hindi Graphics”. The new item type correctly showed up on the OPAC advanced search under the “Item type” search filter. We then converted the data in Excel into MARC21 records using MarcEdit’s “Delimited Text Translator” tool and imported the resulting .mrc file into Koha. Import went well. However, when we searched the newly entered entries, we found that in the OPAC results page, the Item type showed up in the search facet area as “H=GR” instead of as it should – “Hindi Graphics”. Uh oh! something was certainly wrong!

itemtype_error
Figure 1: Incorrect item type displayed

Using MarcEdit we could see that this had happened due to an additional space having gotten accidentally padded to the itemtype data during data migration. To the system “H-GR” is NOT EQUAL to “H-GR “. *Every* single record in this particular batch of imported MARC records had their 942$c and 952$y fields affeccted.

itemtype_error_01
Figure 2: Extra whitespace padding highlighted.

Thankfully for us, Koha ships with the MARC modification template tool, which is basically like a “find and replace” feature on steroids! So in the end we simply defined a rule that allowed us to replace all the affected 942$c in a very simple, single batch mode operation.

itemtype_error_06A

With the template (we named it ‘correctH_GR’) in place (see above), all we now had to do was to the gather together the ids (i.e. biblionumber) of the affected records and then tell Koha to run the template against this batch of biblionumbers, using the Batch record modification tool as shown below.itemtype_error_04

After this the Batch item modification tool was used to update 952$y which is actually stored in items.itype field. And like above, with just one difference that instead of biblionumber we now need the list of barcodes for all the H-GR item type, in order to modify it.

itemtype_error_05A

And finally our original problem is solved!

itemtype_error_06

PRO-TIP: MARC modification template tool is not only useful for post-facto correction. It can be used even while we are staging MARC records in the first place, offering us to make wide-ranging changes and updates to our incoming MARC record.

MarcEdit QuickTip #1

Introducing the MarcEdit quick tip series

If like us, you too work with data conversion into MARC21, MarcEdit is probably a trusty tool. However, there are a few “gotchas” that can leave any MarcEdit newbie confused or wasting time in figuring things out. This series of short tuts will focus on providing quick tips that may go a long way to alleviate some of these.

Tip #01 – MarcEdit does not like accessing open files

You are trying to use the nifty Delimited Text Translator tool from the Add-ins menu. You seem to do everything right as shown in the screenshot below:

marcedit_error00

But instead of reading the file, MarcEdit throws up a largish window full of error like this:

marcedit_error01

And if you click on “OK” above you are shown this window, which is basically MarcEdit telling you “NO CAN DO” in no uncertain terms.

marcedit_error02

TIP: The file you tried to give to MarcEdit as the “Source File” is open  and MarcEdit failed to open the file (since it is already open) for reading. Close it and try again. This time you should not have this problem.