October 07, 2008

SWAP and E-prints structures don't match

For a long time now, our Data Services manager Stuart Hunt has been saying that E-prints and SWAP don't fit together. Stuart should know: he's a metadata expert and he made all the changes to our E-prints configuration files to try to make it capable of hosting SWAP metadata records for us. I've just been talking to Stuart to get my head around some of the problems we're experiencing, trying to marry SWAP and E-prints.

SWAP expects a hierarchy and E-prints is flat. There's this thing called FRBR (pronounced "ferber" by those in the know!) that SWAP is supposed to follow. That stands for "Functional Requirements for Bibliographic Records", I believe. But the point of it is that SWAP refers to a "work" that is the concept from which all "expressions" (versions) are derived. SWAP metadata describes the relationships between such versions. When trying to do this in E-prints, the conceptual "work" is lost, as far as recording and presenting data is concerned, because there is no way to describe a "work". Each item in WRAP has its own metadata record, and if there are two versions of the same work, e.g. a conference paper and a later journal article of the same title and about the same topic, those items will each have their own metadata record, which would describe the relationship between those items. But there is no "work" actually described in WRAP, because E-prints simply isn't structured to describe it.
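To make the difference concrete, here is a minimal sketch in Python. It is an illustration only: the class and field names (`Work`, `Expression`, `FlatRecord`, `related_ids`) and the example title are made up for this post, and are neither SWAP's actual element names nor anything from the E-prints configuration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Expression:
    """One version of a work, e.g. a conference paper or a journal article."""
    title: str
    version: str


@dataclass
class Work:
    """The abstract work from which all expressions are derived (hypothetical model)."""
    title: str
    expressions: List[Expression] = field(default_factory=list)


@dataclass
class FlatRecord:
    """A stand-alone item record that can only point sideways at other items."""
    record_id: int
    title: str
    version: str
    related_ids: List[int] = field(default_factory=list)


# Hierarchical view: one work, with two expressions hanging off it.
work = Work(
    title="An Example Paper",
    expressions=[
        Expression("An Example Paper", "conference paper"),
        Expression("An Example Paper", "journal article"),
    ],
)

# Flat view: two records that refer to each other, with no parent "work".
flat_records = [
    FlatRecord(1, "An Example Paper", "conference paper", related_ids=[2]),
    FlatRecord(2, "An Example Paper", "journal article", related_ids=[1]),
]

# A list of works has one entry; a list of items has one entry per version.
works_list = [work.title]
items_list = [record.title for record in flat_records]
print(works_list)
print(items_list)
```

In the flat version the relationship between the two items survives, but there is nowhere to hang a description of the work itself, which is the gap described above.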

This is why we are not providing our academics with lists of the works they have published when WRAP is searched by their name (e.g. http://wrap.warwick.ac.uk/cgi/saved_search?savedsearchid=3). Our metadata records describe the items they have sent us and not the published works. They have sent us, or we have harvested, early versions of their works, because of copyright restrictions on the final published journal articles. I had thought that we would describe the published works and link the unpublished items to those descriptions, because that was my original understanding of SWAP, but of course that would be bad cataloguing practice if the version we have is a discussion paper. The metadata record must describe the actual item. The "work" is not described in WRAP because E-prints does not support such a hierarchical structure; if it were described, that would meet our academics' needs better than our current setup.

I'm concerned that the latest version of Eprints (we have just upgraded to 3.0.5, whilst 3.1 is now released) diverges from our SWAP model further. Eprints developers talk about the "e-print" which is the metadata record, as far as I'm concerned, and the "document". In their model, several "documents" might be attached to one "e-print". 3.1 allows more information to be attached to the document itself. So, if we were to start again with a SWAP implementation and Eprints, would we want to edit the "e-print" record to contain minimal metadata, and add lots of lovely rich SWAP elements to the "document" metadata? Would it even be possible to enrich the "document" metadata so much? And Stuart tells me that this still wouldn't meet all the hierarchical description that SWAP actually provides for. I lost him at that point, I confess... time to read up some more about SWAP. But even if it would still not be a complete solution, would it be one that would meet some of SWAP's features, just like our current implementation, whilst also meeting the need of the academic to see all their published work in one list?
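As a rough sketch of the "minimal e-print, rich document" idea I'm wondering about, again in Python: the `EPrint` and `Document` classes, their fields, the file names and the `versions` helper are all hypothetical and are not the Eprints data model or API. This only shows the shape of the question, not an answer to it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """A file attached to an e-print, carrying its own metadata (assumed shape)."""
    filename: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class EPrint:
    """The parent record, kept deliberately minimal in this sketch."""
    eprint_id: int
    metadata: Dict[str, str] = field(default_factory=dict)
    documents: List[Document] = field(default_factory=list)


# One e-print with minimal metadata and two documents described in more detail.
eprint = EPrint(
    eprint_id=1,
    metadata={"title": "An Example Paper"},
    documents=[
        Document("preprint.pdf", {"version": "discussion paper"}),
        Document("accepted.pdf", {"version": "accepted manuscript"}),
    ],
)


def versions(record: EPrint) -> List[str]:
    """List the version label recorded on each attached document."""
    return [doc.metadata.get("version", "unknown") for doc in record.documents]


print(versions(eprint))
```

Here the e-print plays something like the role of the "work" in a list of publications, while the version detail sits on each document. Whether the real document metadata could carry that much is exactly what I don't know.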

Does it matter whether we input or store SWAP, if what is coming out of the repository is SWAP? That is the approach that the Eprints developers are taking, as far as I can tell, since they wrote a SWAP export plugin. But our problem with that is that if you don't describe all the relationships at the input stage, you can neither store nor export those relationships, so all that you end up with is the same metadata re-packaged to fit a different application profile. The point of using SWAP for WRAP is the completeness and richness that it allows us to describe about the items in WRAP, because that demonstrates quality, which is important to the University of Warwick's image.

SWAP for WRAP is not all about others' harvesting our data or what we get out of the repository from a technical point of view. It is about what we can say about the articles that our academics have written. There is also an element of future proofing in our motivation, in terms of how we might be able to use our metadata to link between citations in the future, for example, and to present WRAP records alongside records from other data sources such as the library catalogue. The catalogue, incidentally, describes monographs, whilst WRAP does not, so a search with results from both sources would provide academics with a more complete record of everything they have published...

It will be interesting to see which SWAP elements we're actually using at the record creation stage, because although we have SWAP metadata elements in our workflow, we often don't have enough information to create records that are any richer than an ordinary E-prints metadata record. Or at least that is my impression so far, and I may be wrong. To describe a relationship between two items, you need to know about both those items. Most authors are prepared to supply us with only one version of their work, usually the most up-to-date one that they can, if not the final version itself. Often they're vague about which version they have supplied to us, as well. That may change in the future, but for now, that's the case.

Also, our workflow in E-prints is pretty long and off-putting for those not used to ignoring the irrelevant elements for each item. With expert metadata librarians (cataloguers) creating our records, that's fine: they get to know the schema well, and which elements they will want to use to describe each item. But if you're following an author self-archiving model, that's not fine. They want minimum key strokes and simple processes for depositing, although of course we will want to prompt them to describe everything they know, because it is only they who know what version they are supplying. Is SWAP preventing us from following an author self-archiving model? Not really, it's all about how we present SWAP to the authors, and E-prints doesn't make it look pretty at the moment.

The question of how long it takes to edit and polish author-created records so that they meet our quality requirements is another matter entirely. The problem with aiming for high quality records is that they do take time to create, and if authors are self-archiving, they will want to see their items appearing live in the repository as soon as possible. So there may well be a tension between SWAP and author self-archiving from that perspective.

The matter of presentation applies to the metadata record view and search results views in E-prints as much as it does to the deposit workflows. It's not about SWAP itself and it's not about E-prints itself. It's about how we get them to work together, and although we have an example of that in WRAP as it is at present, I wouldn't say that we necessarily have the best possible solution for Warwick's needs, just that we did the best possible at the time with the resources that we had, and E-prints is moving on. So the lessons that our funders, JISC, might learn from WRAP as regards SWAP and Eprints will most likely not apply to all future implementations. And that is what happens when you're pioneering...


- One comment

  1. code Gorilla

    How right you are… and you're not the only one saying this!

    This is a fundamental problem we have with the current “Repository” technology: EPrints & DSpace are both centred on the object (the “e-print”) being deposited, not where that object fits into the grand scheme of things.

    Paul Needham, myself, and others have been banging this drum for some time now. I was button-holing people about this at OR08, and even wrote about it back in August… and I’ve even tried to short-circuit the problem by talking direct to the code-monkeys that produce the software.

    Don’t get me wrong: we NEED to go through the current repository ideas to find out what works and what doesn’t; we NEED to find out about duplicate deposits and authors having different ways of writing their name (I’ve got one man with three different variants… and they are all self-deposits too!).
    We need to find out if we want a metadata-rich but data-poor landscape; a data-rich, metadata-poor landscape; or if we are prepared to expend resources to make the landscape rich in both data and metadata.

    Personally, I think the idea of trying to capture 30-something fields of metadata after the event is doomed before it even starts.
    We know people don’t browse web sites beyond the first page unless there is a good incentive to do so.
    We know there is a “keystroke” problem.
    We *know* there is a resource problem.

    Why, oh why, oh why (“Points of View”, you have a lot to answer for!) do we not look towards a system where metadata is entered at the time and only for the conceptual item it relates to: the area of interest; the specific research grant; the article being started; the peer-reviewed item being published; the book that’s been written… Keep all this data; call it a “Comprehensive aRchive of Interesting Stuff” if you want… and make public the bits of it that the academics are happy to make public.

    Oh, look! there’s something that looks remarkably similar to a “repository”.

    [/Rant]

    13 Oct 2008, 15:19

