Imagine the following scenario: over the last several months you have been steadily entering your library’s bibliographic data into a spreadsheet, working either from the accession register or from the physical copies in your library. You have managed to create ~23,000 records and now wish to import these into your favourite ILS, i.e. Koha.
While working in the spreadsheet, you made a single column for author information and recorded each name in the <lastname>, <firstname> style for the personal name entries (i.e. MARC21 tag 100). However, you also used the same column to capture corporate (110) and meeting (111) name entries.
So how do you now pick out these non-100 records, the 110s and 111s, from among the ~23,000 records so that you do not leave any corporate body entry in the personal name field? Manual curation is possible, but it is simply too error-prone and hugely time-consuming.
Luckily, since you have used the <lastname> comma <firstname> style consistently (or almost consistently) for all personal name entries, you can use your LibreOffice Calc spreadsheet to do some magic. 😀
Step 1: A formula to identify possible corporate entries.
As we see in the example above, Calcutta University has been entered AS-IS and not as University, Calcutta, so let us look for records that *do not* have any commas in them. [Sidenote: we will hit some false positives, but the magnitude of the problem will be less than searching all 23,000 records.]
And for that we enter the following formula in cell B2 (where A2 is our first record and A1 is the header row):
The first part is just a safety check to ensure that our cell is not blank. Of course, if it is blank then it obviously cannot be a corporate body entry, so the answer we want in B2 is “NO”. Once we know it is indeed not blank, we move on to check whether we can find our “comma” in the cell. If the comma is present then it is assumed to be a personal name entry, hence also “NO”. If the comma is not found *and* the cell is not blank, then it is safe to assume it is most likely a corporate name (110) or a meeting name (111).
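The same decision logic can be sketched outside the spreadsheet. The snippet below is only an illustration in Python, not the Calc formula itself; the function name `flag_possible_corporate` is hypothetical, and treating a whitespace-only cell as blank is an assumption of this sketch.

```python
def flag_possible_corporate(author: str) -> str:
    """Return "YES" for a non-blank entry that contains no comma, else "NO"."""
    # Safety check: a blank cell cannot be a corporate body entry.
    # Assumption: a cell holding only whitespace is treated as blank too.
    if not author.strip():
        return "NO"
    # A comma suggests the <lastname>, <firstname> personal name style.
    if "," in author:
        return "NO"
    # Non-blank and no comma: most likely a corporate or meeting name.
    return "YES"


# Sample values taken from the examples in this post, plus a blank cell.
for name in ["Devi, Mahasweta", "Calcutta University", "Mahasweta Devi", ""]:
    print(repr(name), "->", flag_possible_corporate(name))
```

Note that “Mahasweta Devi” is flagged “YES” here as well, which is exactly the kind of false positive that has to be weeded out by hand later.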
Let's see what happens when we apply this formula to all our ~23,000 records.
N.B. The “false positive” happened as the person doing the data entry did not format the name of “Mahasweta Devi” as “Devi, Mahasweta”.
Step 2: Always count the eggs in your basket
We started with exactly 22,959 records. Let us see how many “YES” records our formula has found: 941. We have narrowed our search down to just about 4% of the total records.
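As a quick check of that arithmetic, here is a small Python sketch that counts the “YES” flags and works out their share of the total. The helper name `summarize` is hypothetical, and the list of flags is synthetic, built only from the two counts quoted above.

```python
def summarize(flags):
    """Return (yes_count, total, percentage of "YES") for a list of flags."""
    total = len(flags)
    yes = sum(1 for flag in flags if flag == "YES")
    share = 100 * yes / total if total else 0.0
    return yes, total, share


# Synthetic flag column rebuilt from the counts in this post:
# 941 "YES" rows out of 22,959 records.
flags = ["YES"] * 941 + ["NO"] * (22959 - 941)
yes, total, share = summarize(flags)
print(f"{yes} of {total} flagged ({share:.1f}%)")  # 941 of 22959 flagged (4.1%)
```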
Step 3: Filtering out the “YES” records
Luckily for us, LibreOffice Calc has a nice filtering tool. It is available under the Data -> Filter -> Standard filter menu option. We had earlier named our column “boolean” in cell B1, so we’ll now use this tool to filter out all 941 records that are marked as “YES”.
And immediately our spreadsheet will show us only the “YES” records like this:
Step 4: Removing the “YES” from false positives
The false positives have a story to tell. They tell us that we need to do better quality control of our data entry. We also probably need to ensure that the persons entering the data understand how to handle Muslim or foreign names.
Example: Let us take the name of my good friend and Koha expert Mr. Vimal Kumar Vazhappally. Now the correct way to address him is as Mr. Vimal Kumar and *not* as Mr. Vazhappally. Vazhappally isn’t really his surname; rather, it is the name of his village.
For now, the simplest way to correct a false positive is to visually check the A column and, if it is apparent that the corresponding cell in the B column wrongly has a “YES”, move to that cell in B and delete the formula from it.
Instead of looking through 22,959 records, we are now going to check fewer than 1,000 records, and this time we are looking only for false positives.
Step 5: Two formulas to separate the 100s and 110s
After we remove the B column cell values of the false positives (the ones with “YES” that are not a corporate entity), we need to add two more columns to our spreadsheet. These will hold our separated 100 and 110 / 111 values.
So in order to copy over the values of corporate / meeting names (110 / 111) to the C column, we will define the following formula in the cell C2:
And similarly in the next column to the right, in the cell D2 we will define this formula:
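The split these two columns perform can be sketched as follows. This is a Python illustration of the logic only, not the Calc formulas themselves; the function name `split_entry` is hypothetical, and it assumes that a cleared B cell (a corrected false positive) reads as an empty string.

```python
def split_entry(author: str, flag: str):
    """Return (corporate_or_meeting, personal) values for one row.

    The first value corresponds to column C (110 / 111),
    the second to column D (100).
    """
    if flag == "YES":
        # Still flagged after manual review: corporate or meeting name.
        return author, ""
    # "NO" or a cleared flag: treat the entry as a personal name.
    return "", author


rows = [
    ("Devi, Mahasweta", "NO"),
    ("Calcutta University", "YES"),
    ("Mahasweta Devi", ""),  # false positive whose flag was deleted by hand
]
for author, flag in rows:
    corporate, personal = split_entry(author, flag)
    print(f"C={corporate!r:25} D={personal!r}")
```

Each row ends up with its value in exactly one of the two columns, so nothing is lost and nothing is duplicated.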
Next we will simply copy the range C2:D2 over to the entire range C3:D22959 and we are DONE!
Columns C and D now have our separated records for all 22,959 records. And it took me less time to do the actual correction than it took me to write this blog post, take screenshots, crop, annotate, upload and proofread the final post.
We found more than 350 corporate body / meeting name entries in the list, which could be separated out of the 22,959 records using this technique.