
Simon Spero Questions LOC Authorities

Simon Spero announced yesterday on the SILS student listserv a preliminary release of some of his dataset of Library of Congress authority records. His email is copied below and sets the stage for a great deal of new work.

Tagging – anyone?

The happy meeting of folksonomy and authority records? Our first look into how these align “in the wild”? Where are the strengths and weaknesses in each? How can we improve on each – since we now have a fuller picture of how they relate… LibraryThing? Are you listening? Researchers – come and get it.

Simon writes:

This may be of interest to some: last month I created and deployed a custom web agent designed to recover full MarcXML authority records via

There are still some inaccuracies that appear to reflect problems on the original; until these issues can be resolved, I’m only making a limited release (bad authorities are worse than no authorities).

The current results are available in

authorities – contains all authority records, broken down by heading tag (1XX). You can either fetch individual batches of records or download a tar file containing all batches.

Be careful when uncompressing these files: although the compressed data takes only 637 MB, the compression ratio is around 15:1, so expect roughly 9.5 GB once uncompressed (XML is not the world’s most compact encoding).
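One way to check the uncompressed size before committing disk space is to stream through the archive and add up the member sizes without extracting anything. The following is a minimal sketch using Python's standard tarfile module; the function name uncompressed_size is made up for illustration, and it assumes the download is a gzip-compressed tar file, as the .tgz extension suggests.

```python
import tarfile


def uncompressed_size(path):
    """Sum the sizes of the regular files in a gzip-compressed tar archive.

    Nothing is written to disk: mode "r|gz" reads the archive as a
    forward-only stream, and each member's size comes from its tar header.
    """
    total = 0
    with tarfile.open(path, mode="r|gz") as archive:
        for member in archive:
            if member.isfile():
                total += member.size
    return total
```

Calling uncompressed_size("subjects-NFC.tgz") on a downloaded copy returns the number of bytes the extracted files would occupy, which can then be compared with the free space on the target disk.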

subjects-NFC.tgz – contains only subject headings.

– is a little RubyCocoa application for viewing MarcXML files.
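For readers who want to see what "broken down by heading tag (1XX)" means in practice, here is a small sketch that reads a MarcXML record and reports its 1XX heading field. The sample record is invented for illustration and is not taken from the dataset; the helper name heading_tag is likewise hypothetical. The namespace is the standard MARC21 slim namespace used by MarcXML.

```python
import xml.etree.ElementTree as ET

# Standard MarcXML ("MARC21 slim") namespace, in ElementTree's {uri} form.
MARC_NS = "{http://www.loc.gov/MARC21/slim}"


def heading_tag(record):
    """Return the tag of the first 1XX datafield in a record, or None."""
    for field in record.iter(MARC_NS + "datafield"):
        tag = field.get("tag", "")
        if len(tag) == 3 and tag.startswith("1"):
            return tag
    return None


# Invented sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="040" ind1=" " ind2=" ">
    <subfield code="a">XX</subfield>
  </datafield>
  <datafield tag="150" ind1=" " ind2=" ">
    <subfield code="a">Example heading</subfield>
  </datafield>
</record>"""

print(heading_tag(ET.fromstring(SAMPLE)))  # prints 150
```

In MARC authority records, 150 is a topical term heading, while 100 and 110 are personal and corporate names, so grouping on this tag separates subjects from names.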

Please let me know if you spot any problems.




Fred 2.0
Phase 1: Library Of Congress Authorities Files

Open Catalog Liberation Council, Provisional ALA
22nd December 2006
Fred 2.0 is dedicated to the memory of
Frederick G. Kilgour (Jan. 6, 1914 – July 31, 2006)
Distinguished Research Professor Emeritus
School Of Information And Library Science
University of North Carolina at Chapel Hill

This phase of the project is dedicated to the men and women at the Library of Congress and outside, who have worked for the past 108 years to build these authorities, often in the face of technology seemingly designed to make the task as difficult as possible.


With a custom agent, we harvested 6.95 million authority records through the publicly accessible interface to the Library of Congress authority files located at

Retrieved records have been converted into MarcXML.
Accented characters have been converted into NFC (Unicode Normalization Form C, canonical composition).
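As a brief illustration of what NFC conversion does, the sketch below uses Python's standard unicodedata module to compose a base letter and a combining accent into a single precomposed character. It is not the conversion code used for this release; the helper name to_nfc and the sample string are made up for the example.

```python
import unicodedata


def to_nfc(text):
    """Return text in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: five code points in total.
decomposed = "Cafe\u0301"
composed = to_nfc(decomposed)

print(len(decomposed), len(composed))  # prints 5 4
print(composed == "Caf\u00e9")         # prints True
```

Because the released records are already in NFC, any search strings or local headings compared against them should be normalized the same way first, or visually identical headings may fail to match.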

Initial checks against indicate that the retrieved data faithfully reflect the data on the original system; however, these checks are still only preliminary.

Cross checks against Classification Web have revealed some inconsistencies. For this reason, we are releasing these records for research purposes only. These data are not suitable for production use.
