Why Web 2.0 is leading back to full cataloging

Just an observation of interest to librarians, about Web 2.0 types of websites.

Two examples of rich Web 2.0 sites are Last.fm and LibraryThing.

We often think of Web 2.0 sites in terms of the idea of “tagging instead of cataloging.” In fact, rich 2.0 sites, the ones that do a lot of data processing to create their services, usually have both free-form tagging by users and standards-controlled metadata about objects, and it is actually often the latter that drives the main functionality of the sites. This is the case with both Last.fm and LibraryThing.

Most readers of Library Juice are probably very familiar with LibraryThing. They know that users can apply whatever tags they want to the books in their collections, and that when they add a book to their LibraryThing collection, a Z39.50 connection in the background imports some basic cataloging data: from Amazon.com by default, but optionally from any of a number of major research libraries.

Tags can be words used universally across the userbase of the site (like “philosophy”) or just among a minority of users or only by you (“Thursdays and Marie’s”). Folksonomies like the user tags on LibraryThing don’t have to be universal throughout; there can be lots of tags that are useful only to a minority of users or to individuals. Users of websites that have user tagging are tolerant of the meta noise that results from this and from the standards-free nature of folksonomies.

The real functionality of both Last.fm and LibraryThing, though, rests not on user tags but on the standards-based metadata for the objects in them: books for LibraryThing and music tracks for Last.fm. In both cases, casual users can simply rely on the data that the system loads into their profiles automatically, and the more technically inclined enthusiasts can modify the data to make it more accurate and consistent. In LibraryThing, for example, data from Amazon, which is the default data source, is often inaccurate and sparse; serious users can correct the cataloging to fix the spellings of names or to add information. Dedicated LibraryThing users can also “combine” different editions of a book into the same work for processing purposes, so that owners of different editions of the same book will be linked as owners (without the cataloging of their specific copies being changed).
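The “combine” idea can be sketched in a few lines of Python. This is only an illustration, not LibraryThing's actual data model: the record fields, the edition identifiers (`ed-1` and so on), and the `owners_by_work` helper are all hypothetical. The point is that the edition-to-work mapping lives apart from each user's own catalog record, so linking owners never changes anyone's copy.

```python
from collections import defaultdict

# Hypothetical catalog records: each user's copy keeps its own edition data.
copies = [
    {"owner": "ann", "edition": "ed-1", "title": "The Odyssey (paperback)"},
    {"owner": "bob", "edition": "ed-2", "title": "The Odyssey (hardcover)"},
    {"owner": "cho", "edition": "ed-3", "title": "An unrelated book"},
]

# The "combine" step: a separate mapping from edition to work.
# The copy records above are never modified.
edition_to_work = {
    "ed-1": "work:odyssey",
    "ed-2": "work:odyssey",
}

def owners_by_work(copies, edition_to_work):
    """Group owners by work; an uncombined edition stands alone."""
    groups = defaultdict(set)
    for copy in copies:
        work = edition_to_work.get(copy["edition"], "edition:" + copy["edition"])
        groups[work].add(copy["owner"])
    return dict(groups)

linked = owners_by_work(copies, edition_to_work)
```

Here `linked["work:odyssey"]` contains both ann and bob, even though they own different editions, while cho's uncombined edition stays in a group of its own.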

The functionality of LibraryThing is enhanced because it makes use of cataloging that has already been done by professional catalogers. Data in LibraryThing that comes from Amazon is not as rich or as accurate as the data from research libraries, but in most cases it is quicker to get, and it is still based on essentially the same Z39.50 standard, which is in turn based on cataloging standards.

Last.fm functions in another realm, and deserves some explanation. It is a valuable example for discussing this type of issue, because:

  • it does a lot of data processing that is based on metadata about objects;
  • it does it in such a way that consistency and accuracy of data are important;
  • the data structure of this information (ID3) is not quite up to the needs of the application;
  • there are competing metadata standards in use, both of which are used by sources that feed data into the Last.fm database.

Last.fm is a complex site, and seems to have every conceivable Web 2.0 feature in it. In essence, it is a combination of a social networking website and a music streaming website that works by statistically comparing the music-listening histories of its users. By comparing one user’s “charts” with other users’, it generates a group of musical “neighbors” for that user and a stream of music that that user might like. In a similar way, it can create a stream of music based on listeners to a particular artist. Add a “friends” function, tagging, groups, public messages, private messages, a forum, and a variety of streaming functions based on its database, and you have a wonderfully rich website. (Actual audio tracks are provided to Last.fm by record companies for their promotional value.)
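To make “statistically comparing listening histories” concrete, here is a minimal sketch in Python. Last.fm's actual method is not described here, so this swaps in cosine similarity over per-artist play counts, a common technique for this kind of comparison. The usernames, play counts, and function names are all made up for illustration.

```python
import math

# Hypothetical "charts": play counts per artist for each user.
charts = {
    "me":    {"John Coltrane": 40, "Miles Davis": 25, "Bill Evans": 10},
    "user2": {"John Coltrane": 30, "Miles Davis": 20},
    "user3": {"Bill Evans": 5, "Glenn Gould": 50},
}

def cosine(a, b):
    """Cosine similarity between two play-count charts (0.0 if either is empty)."""
    shared = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def neighbors(user, charts):
    """Rank every other user by similarity to `user`, most similar first."""
    scores = [
        (other, cosine(charts[user], chart))
        for other, chart in charts.items()
        if other != user
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = neighbors("me", charts)
```

With this data, user2 ranks well ahead of user3 as a neighbor. Note that the comparison only works because the artist names are identical strings in every chart, which is exactly why consistent metadata matters.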

A user’s “charts” in Last.fm can be updated in two ways: either by listening to Last.fm’s own streams or by setting up one’s own media player (such as iTunes) to upload metadata when one plays one’s own music through it.

In order for Last.fm to connect me to other people who listen to the same artist, it needs consistent information in its data about the artist. A quick and easy example of where the system often falls short in this regard is classical music recordings, where in some cases the composer is recorded as the artist, and in other cases the performer is. To classical music aficionados, the performer matters much more than to the average listener, who is mostly interested in the composer. Similarly, jazz fans want to know who the players were on a certain Coltrane recording, where the average listener only needs to know that it was John Coltrane. Information about such details of a recording can be added to any of the already existing ID3 fields (which are limited to things like artist, album, and track), but this tends to be done in inconsistent ways or not at all, leading to non-matches in the database. The ID3 data structure, which is what is used for metadata in MP3s and a number of other media formats, only has room for a few fields, and doesn’t allow for anything like the richness of MARC.
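The composer-versus-performer problem is easy to reproduce. The sketch below assumes, purely for illustration, that listeners are matched on the exact text of a single artist field; the records and the `listeners_by_artist` helper are hypothetical and do not represent Last.fm's real schema.

```python
# Hypothetical uploads of the same recording, tagged two different ways:
# one user put the composer in the artist field, the others the performer.
plays = [
    {"user": "ann", "artist": "Johann Sebastian Bach", "track": "Goldberg Variations: Aria"},
    {"user": "bob", "artist": "Glenn Gould", "track": "Goldberg Variations: Aria"},
    {"user": "cho", "artist": "Glenn Gould", "track": "Goldberg Variations: Aria"},
]

def listeners_by_artist(plays):
    """Group users by the exact string in the artist field."""
    listeners = {}
    for play in plays:
        listeners.setdefault(play["artist"], set()).add(play["user"])
    return listeners

by_artist = listeners_by_artist(plays)
```

All three users played the same recording, but ann ends up alone under the composer's name while bob and cho are grouped under the performer's. With only one artist field to work with, nothing in the data says that the two entries describe the same thing.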

To deal with some of the needs that ID3 doesn’t answer by itself, another standards layer was set up, really as a set of guidelines, first by CDDB, which is the private industry database of CD track information now known as Gracenote, and then by freeCDDB, the open-source, volunteer-created database now called MusicBrainz.

In terms of Last.fm’s use of music metadata, there are essentially two problems. The first problem is that although Last.fm has selected a standard (MusicBrainz), much of the data uploaded by users comes from users who aren’t interested in dealing with nuts and bolts and want things to work automatically. These users are using iTunes or WinAmp or similar commercial software applications, and these download CD track information directly from Gracenote. (The cost of the software includes a license to download metadata from Gracenote.) This means that when they upload their track information into Last.fm, their data follows a different metadata standard than the Last.fm standard, and they probably don’t know or care about this, since they are just users who “want it to work.”

The other main problem with Last.fm data is that both the Gracenote and MusicBrainz databases of CD track information are loaded with bad metadata that doesn’t fit either standard. In the case of Gracenote, the data comes from two places: first, Gracenote employees (who you can bet are not trained catalogers making a decent wage), and second, the record companies themselves (which have other priorities than supplying clean data according to a somewhat complex standard). With MusicBrainz it is simply that the data comes from volunteers and it is a collaborative project. In both Gracenote and MusicBrainz, there are systems set up for correcting bad metadata, but these only work so well. In the end, the data in both databases is much worse than what we find in our library catalogs (admitting that those are far from perfect).

On Last.fm, the onus is on the user to upload tracks that have proper metadata; this is because of the conception of the site as user-driven. Most users, however, aren’t that into changing their iTunes metadata to match the MusicBrainz standard or simply to clean them up. Users who do care exhort them to “Get their damn tags right.” A well-subscribed group exists on Last.fm with exactly that name. This, I believe, shows some tension between the idea of Web 2.0 sites being collaborative, user-driven, and defined by free-form tagging, and the need for consistency and accuracy in their more data-driven manifestations.

So… If sites like Last.fm eventually become a part of life for the majority of people, I think there will be an emergence of support for the role of professional catalogers somewhere in the system, so that the majority of users, who “just want it to work,” will be satisfied. Free-form tagging has its place, but where consistency and accuracy count, as they do in many Web 2.0 sites, I think reliance on users will turn out to have been a dead end, and there will be a new appreciation for our professionalism.

9 comments on “Why Web 2.0 is leading back to full cataloging”

  1. I’ve been getting more into the Last.fm community lately, and the lack of authority control, particularly in regard to band/artist identities, is becoming a bigger problem for me. Sure, there are the ubiquitous “Various Artists” and similarly-scrobbled tracks, but one of the huge stumbling blocks I’ve come across is the need to disambiguate artists with the same name.

  2. Pingback: Cataloging Futures
  3. Ben: Note that the issue with the disambiguation of bands with the same name on last.fm is not caused by user-provided data, and professional cataloging wouldn’t solve this issue. This is a technical limitation of last.fm’s database. One that I found extremely annoying as another Tangent entered themselves in last.fm’s database before I got there…
