Getting LCC labels for a MARC record analysis project

Hello! I’m back with another edition of “Elliot writes out the process for how I did something, mostly for myself, but puts on the internet in case anyone else finds it useful.” This one started out as one thing (counting percentages of MARC records in a set that contain a given field) and morphed into another thing, as well (use OpenRefine to add labels to LCC class numbers).

I’ve been discussing with one of my colleagues ways to enrich our MARC record for musical scores to improve their searchability in the catalog. One thing that we’re interested in is including more 505 table of contents notes, particularly for collections or anthologies of printed music. There are a lot of reasons that would be nice, but a big one is that collections are one of the main sources in our collection for works by contemporary composers, composers of color, and women composers.

As we figure out if that’s worthwhile and how we might approach it, I wanted to do some analysis of our notated music records to see how many have 505 notes. I wanted to break it down by LCC classification, since we’re thinking that that might be a helpful way of identifying priority areas and chunking the project into manageable pieces. So here’s what I did:

Getting the data

  1. In Alma, did an advanced search for all physical notated music records, then exported those MARC records.
    • There were a total of 12,949 records, which is a lot but manageable for what I was hoping to do.
  2. Used MarcEdit to export the fields I care about to a tab-delimited file: 001, 245, 050, 090, 505
    • Much later in the process, I realized it would have been helpful to include the 035 field at this stage. Whoops.

Processing in OpenRefine and class number cleanup

  1. Imported the tab-delimited data into OpenRefine
  2. Created a new column called “TOC?”, where I entered either yes or no based on the presence of a 505 field in the record (using OR’s “Facet by blank” facet)
  3. Combined call number data from 050 and 090, and do some preliminary cleanup (mostly removing indicators/subfields and stray extra characters from the front)
    • At this point, I realized it would have been better to have the call numbers from the holdings records, rather than the bib records, but that would have been more complicated to get out of Alma so I decided to go with what I had. Good enough for this kind of exploratory analysis.
  4. Used a regular expression to create a new column with just the first part of the call number, up to the first period that comes before a letter (hoping to get just the base classification number)
    • value.match(/[A-Z]+*\.?\d*).*/).toString()
Screenshot of an OpenRefine window for "Add column based on column Call No normalized".
Regex for getting the base class number

Getting more info about class numbers

  1. Now I had the basic class number for each record. But I’m not super familiar with music class numbers, so I wanted to include the label for each number. I knew that on our old friend id.loc.gov, you can get data about each LC class in a variety of data formats, and you can construct the URL using the class number itself, e.g. https://id.loc.gov/authorities/classification/ML3551.html.  I first tried to work with the JSON versions, but I’m more comfortable with XML, so ended up using the RDF/XML format, e.g. https://id.loc.gov/authorities/classification/ML3551.rdf
  2. Before making a bunch of URL requests to the Library of Congress, I used OpenRefine’s records mode and blank down feature, to make sure that I only had to make one request for each class number.  So only 757 URL requests, instead of 12,949.
Screenshot of OpenRefine in records mode, showing three records with multiple rows each, with the column "Call No base" as the first column
Records mode, always a lifesaver
  1. Okay, now the good stuff: I created a new column in OpenRefine by fetching URLs based on Call No base column, using expression 'https://id.loc.gov/authorities/classification/'+value+'.rdf'.  Now I have a column full of XML!
  2. Looking at the RDF/XML, I decided I wanted the <rdfs:label> field that contained the full chain of where this class number fits in the classification hierarchy.  The problem is, though, that there are multiple <rdfs:label> elements in the output – so using just plain parseXml().select(’rdfs|label’) wouldn’t work.  I want the instance of that element that is directly under the <lss:ClassNumber> element.  So I looked at the documentation for Jsoup (I know, what an idea, right?), and figured out I could use value.parseXml().select("rdf|RDF > lcc|ClassNumber > rdfs|label")[0].toString() to get exactly the element I wanted.
    • (This makes it sound a lot more straight-forward than it actually was. Imagine lots of flailing around and frantic googling at this stage, particularly when I was trying to use JSON results before switching to XML.)
    • Pro-tip that I learned in this process: if your XML elements have namespaces, you have to enter them in the select expression as “rdfs|label”, instead of “rdfs:label”.
Screenshot of three columns in OpenRefine. First column is labeled "Call No base", and the value is "M1621". Second column is labeled "RDFXML" and has a lot of XML data in it. Third column is labeled "rdfs Label", and the value is "Music and Books on Music--Music--Vocal music--Secular vocal music--One solo voice--Accompaniment of keyboard instrument, keyboard and one other instrument, or unaccompanied--Separate works--Keyboard instrument accompaniment"
Class number, retrieved XML, and the full hierarchy label for that number
  1. Ta-dah!  Now I have a column with the full LCC description for each class number in my dataset.  I used fill down in OpenRefine to add the class number and label to all of the rows.
  2. There were about 7 records where the process didn’t work because the data wasn’t structured normally or the call number was missing from the bib, so I added those manually.

At this point, I thought I might be fancy and do something with pandas (which I’m very much a beginner with) to count the number of records without a 505 per class number. But I decided I knew how to do what I wanted to in Excel, so I just did that instead. 🤷 I used some simple deduping and COUNTIF formulas to get the number of records per class number, and the number in each class number that do not have a 505 field. And it includes the full classification hierarchy label for each number, which is helpful. That allowed me to highlight all of the rows with “Collection” in the classification label, which has already been helpful in pinpointing some class numbers to explore in more depth.

Screenshot of an Excel spreadsheet, showing columns labeled # in class, Class Number, # missing TOC, % missing TOC, and Class Label.
The final product in Excel

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.