Writing about web page http://www.bl.uk/aboutus/stratpolprog/digi/datasets/workshoparchive/archive.html
Last Friday saw the second DataCite event, jointly hosted by JISC and the British Library, this one focussing on metadata for datasets. This is an area of interest for me not just because of the developments in research data management (RDM) that are starting to impact repositories but also because of my background in metadata. The organisers warned us that they were starting with the basics and getting increasingly complicated as the day went on, and this was certainly true!
The sessions started with a very good introduction, from Alex Ball of the DCC, on some of the essential metadata needed for both data citation and data discovery. As he put it, this is the difference between known-item searching (data citation) and speculative searching (data discovery). The needs of users undertaking these two activities are fundamentally different but do overlap. Through analysis of 15 schemas currently being used by data centres, he highlighted 16 common metadata fields that appear in the majority of the schemas. None of these fields will come as much of a surprise to people creating and using metadata, but they might be unfamiliar to the researchers who may have to create this metadata.
Elizabeth Newbold, from the British Library, spoke about the development of the DataCite schema, listing the essential/mandatory fields that they expect people to provide to DataCite in return for minting the DOIs. These fields mostly correspond to the data citation fields mentioned by Alex, but DataCite is hoping that data centres will supply some of the additional metadata for discovery as well. This is key to the BL's other presentation, from Rachel Kotarski, who spoke about developments at the BL in transforming the DataCite metadata into MARC records for use in the main BL catalogue. Rachel described a pilot project to add the dataset metadata to a trial instance of Primo as a 'proof of concept', to assess whether users were looking for this kind of material and, if so, what kind of metadata they wanted when trying to discover it. At least one JISC RDM project in the room now plans to send much more of its metadata to DataCite to allow better harvesting by the BL, and it's certainly something we need to bear in mind when developing Warwick's services in this area.
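To make the idea of mandatory fields a little more concrete, here is a minimal Python sketch of the kind of check a repository might run before asking for a DOI. It assumes the five mandatory properties I understand the DataCite schema of the time to require (identifier, creator, title, publisher and publication year); the dictionary keys, the function name and the sample record are all made up for illustration and are not part of any DataCite tool.

```python
# Assumed mandatory properties; the key names are hypothetical, chosen for this sketch.
MANDATORY_FIELDS = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_mandatory_fields(record):
    """Return the mandatory fields that are absent or empty in a record."""
    return [field for field in MANDATORY_FIELDS if not record.get(field)]


# Invented sample record: publisher is blank and publication year is missing.
record = {
    "identifier": "10.1234/example-dataset",  # hypothetical DOI
    "creators": ["Example, A."],
    "title": "Example survey dataset",
    "publisher": "",
}

print(missing_mandatory_fields(record))
```

A check like this only covers the citation metadata; the additional discovery fields would need their own, probably softer, validation.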
David Boyd from the data.bris project laid out in detail how they are building on the new capacity for data storage at Bristol to build integrated services around data registration, publishing and discovery. This was an excellent insight into how one university has conceived the whole data model, and it highlighted some key areas of integration with other services that are possible with joined-up processes. I particularly took away the details of the range of ways in which they are thinking about automating metadata creation to remove some of the burden on researchers.
Michael Charno from the Archaeology Data Service (ADS) gave some insight from one of the existing data centres, who have been in the game a lot longer than most, in a fascinating talk entitled '2000 years in the making, 2 weeks to record, 2 days to archive, too difficult to cite?'. The ADS model charges data creators/projects to host their data and presents the data free at the point of access to the user. Two of their current challenges are persuading users to reuse data and preventing data loss: archaeology is inherently a destructive process, so the records of the excavation are often the only evidence remaining at the end of the project. Michael pointed us all towards a set of guidance documents and toolkits used by the ADS to advise researchers on creating metadata, but admitted they didn't have any evidence on how much use these tools got. Another area of work discussed was mapping the current schema, developed in house for the ADS, to the new DataCite schema.
The final two talks highlighted issues of interoperability. Steve Donegan from the STFC spoke about the difficulties of reconciling the variety of schemas used by different environmental sciences as part of developing the NERC Data Centre. He highlighted the different metadata needs of scientists, who want the raw data, and government agencies, who want the data at one level of analysis higher for policy decisions. Steve finished by discussing in some technical depth the challenge of making the NERC data compliant with the European INSPIRE standard. Finally, David Shotton of the University of Oxford spoke on a range of projects at Oxford looking at the DataCite metadata. Firstly, he has worked on a schema to make DataCite metadata available in RDF (the new mapping, DataCite2RDF, is available in draft form at http://bit.ly/N3VKsx) using a range of SPAR ontologies. He also spoke about a colleague's project to create a web form to help researchers generate DataCite metadata in an easily exportable XML format, and finally about the importance of citing data in the reference list as well as in the text, allowing it to be picked up by services like Scopus and Web of Knowledge.
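As a rough illustration of what a form like that might produce behind the scenes, the Python sketch below turns a handful of values into DataCite-style XML using only the standard library. This is my own simplified guess, not the Oxford tool: the function name and sample values are invented, the element names follow the DataCite kernel as I understand it, and the XML namespace and schema location are deliberately left out, so the output would not validate against the real schema as it stands.

```python
import xml.etree.ElementTree as ET


def build_datacite_xml(doi, creators, title, publisher, year):
    """Build a simplified, namespace-free DataCite-style record as an XML string."""
    resource = ET.Element("resource")

    identifier = ET.SubElement(resource, "identifier", identifierType="DOI")
    identifier.text = doi

    # One <creator> element per name, each wrapping a <creatorName>.
    creators_el = ET.SubElement(resource, "creators")
    for name in creators:
        creator = ET.SubElement(creators_el, "creator")
        ET.SubElement(creator, "creatorName").text = name

    titles = ET.SubElement(resource, "titles")
    ET.SubElement(titles, "title").text = title

    ET.SubElement(resource, "publisher").text = publisher
    ET.SubElement(resource, "publicationYear").text = str(year)

    return ET.tostring(resource, encoding="unicode")


# Invented sample values, standing in for what a researcher might type into a form.
xml_text = build_datacite_xml(
    doi="10.1234/example-dataset",  # hypothetical DOI
    creators=["Example, A.", "Sample, B."],
    title="Example survey dataset",
    publisher="Example University",
    year=2012,
)
print(xml_text)
```

The attraction of doing it this way is that researchers never see the XML; they fill in a few boxes and the machine-readable record falls out at the end.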
Discussions at the end of the day centred on versioning and DOIs for subsets of datasets, as well as the importance of keeping things machine readable as well as human readable! Overall this was a fascinating day that provided a little of everything, from clear guidance on the essential metadata required for the basic functions to very complex topics showing how far good metadata can take you. Lots of food for thought for the development of our own services!