October 26, 2010

ISKO–UK Linked Data Conference

Writing about web page http://www.iskouk.org/events/linked_data_sep2010.htm

**Finally getting round to making this live after having to put off the editing for OAW and the start of term!**

This event, hosted by UCL, was one that I had been looking forward to for some time. Whether or not linked data is the 'next big thing' in web technology, and one that has the potential to solve a number of thorny problems for the administrators and maintainers of web resources in the face of increasingly complex demands, is a question that only time will answer. However, as it stands at present, linked data has enormous potential as a service and as a tool, and I wanted to find out more before I started getting any awkward questions from stakeholders!

The sessions on the day were a nice mix of technical and non-technical, and my biggest fear, of being lost before the end of the keynote, was mercifully misplaced. Also, very usefully, the presenters not only spoke about the technology and standards underpinning the creation of linked data but also presented us with a number of real-world examples of things that linked data can be used to achieve. These kinds of presentations are the ones I'm always on the lookout for with any new development, because it's always easier to say to someone "linked data can do all these kinds of things" when you have some way to show the power of linked data directly.

Prof Nigel Shadbolt of the University of Southampton gave the morning keynote, focusing on the current policy of the government and the ways in which this might create a 'tipping point' for the ideas behind the semantic web. Here we saw that not only had the previous government made a commitment to releasing government data through the data.gov.uk portal (now at 4000 datasets and counting!) but that this commitment has survived the change in leadership. This release of public data has allowed the users of public services to hold the providers to account. It has also opened up a number of ideas about streamlining data collection, with the expected issues of trust and privacy raised. A lot of the applications based on this information have place at their heart as the central piece of information, and this allows errors to be ‘crowd-sourced’ out! Prof Shadbolt also introduced the idea of a “star rating” of data publishing, as a measure of quality ranging from ‘1 star’ data (better that it’s on the web than not) to ‘5 star’ data (with full linked data coding). A national infrastructure is being built at the moment that requires 5 star data, as a way for departments to interact and to allow for ‘back linking’ as well as ‘forward linking’, in order to investigate the levels of relationships that exist between information that might not have been obvious before. And if national linked data is possible could we extend it to a global network of linked data? And if we can have global linked data can we meaningfully compare the UK with other countries? Are the ontologies we use compatible?

Antoine Isaac from the Vrije Universiteit Amsterdam spoke next about SKOS (Simple Knowledge Organisation System) and linked data. SKOS has been designed to remove as much ambiguity as possible from the representation of terminology in RDF, but in a way that is easier to use than a formal, rigid ontology language like OWL. SKOS has a number of basic features: concepts, lexical properties, semantic relations and documentation. Taking the example presented on the day: Cats is a SKOS ‘concept’ around which are built relationships and multilingual labels. For example ‘Cats’ has an rdf:type of skos:Concept and skos:prefLabel values of ‘cats’, ‘chats’ and ‘коты’ in English, French and Russian respectively. It also has a skos:broader term of ‘mammals’ and a skos:related term of ‘wildcats’. However any concept can only have one skos:prefLabel in any one language, e.g. you can’t use both ‘animals’ and ‘beasts’. Data coded in SKOS RDF can be used to infer relationships between things that might not explicitly be stated, but this process needs monitoring, e.g. for the broader/narrower functions you only need to code in one direction and the system assumes the reverse is true. Overall an interesting project that is a lot easier to use than other ontologies and can be used to create links between other ontologies, thus making them more accessible.
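To make the cats example a little more concrete, here is a minimal sketch in plain Python (no RDF library) of the two behaviours described above: inferring the reverse of a broader/narrower link, and checking the one-prefLabel-per-language rule. The `ex:` identifiers, the triple layout and both function names are my own assumptions for illustration and are not from the talk; a real system would use a proper RDF toolkit.

```python
from collections import defaultdict

# Triples are (subject, predicate, object). Labels are (text, language) pairs.
# The "ex:" identifiers are hypothetical, chosen to mirror the cats example.
TRIPLES = [
    ("ex:cats", "rdf:type", "skos:Concept"),
    ("ex:cats", "skos:prefLabel", ("cats", "en")),
    ("ex:cats", "skos:prefLabel", ("chats", "fr")),
    ("ex:cats", "skos:prefLabel", ("коты", "ru")),
    ("ex:cats", "skos:broader", "ex:mammals"),
    ("ex:cats", "skos:related", "ex:wildcats"),
]


def infer_inverse(triples):
    """Return the triples plus the reverse links that SKOS implies."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == "skos:broader":
            # broader and narrower are inverses of each other
            inferred.add((obj, "skos:narrower", subject))
        elif predicate == "skos:narrower":
            inferred.add((obj, "skos:broader", subject))
        elif predicate == "skos:related":
            # related is symmetric
            inferred.add((obj, "skos:related", subject))
    return inferred


def duplicate_pref_labels(triples):
    """Find concepts with more than one prefLabel in the same language."""
    seen = defaultdict(list)
    for subject, predicate, obj in triples:
        if predicate == "skos:prefLabel":
            text, lang = obj
            seen[(subject, lang)].append(text)
    return {key: labels for key, labels in seen.items() if len(labels) > 1}
```

Running `infer_inverse(TRIPLES)` adds the statement that mammals has cats as a narrower term even though only the broader direction was coded, which is exactly the kind of automatic inference that needs monitoring.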

Richard Wallis of Talis Information took us on a ‘Linked Data Journey’ next: a potted history of the development of the semantic web from the beginnings of the web to the present day. Along the way we stopped briefly at important developments such as the US’s data.gov and the UK’s equivalent data.gov.uk, the forthcoming legislation.gov.uk, as well as the standards used to manage the data: SPARQL, RDF, SKOS and others. Richard mentioned the developments brought about by the data.gov.uk initiative, which has started to involve government departments sharing identifiers to allow the data to be more easily retrieved and used. The overall message to people sitting on the fence about linked data was to get the data on the web as a first priority; the linked data will come afterwards.

Steve Dale spoke briefly on the Knowledge Hub, a social media project to aggregate the ‘communities of practice’ found in local government and to help staff to use those communities of practice to improve the working of government departments. He made the point that there is lots of data available but that it’s not always easily accessible, certainly not in a machine-readable way. There is an increasing need to compare your performance with others and to identify the best and most appropriate measures on which to compare yourself. The Knowledge Hub is designed with the idea that communities + data intelligence = improvement and innovation. Behind the scenes the system works on principles similar to those used by Amazon to personalise your recommendations.

The afternoon keynote, given by Prof Martin Hepp on the GoodRelations Ontology, had a very practical perspective. This project has a very practical, commercial purpose, which is very important as it is only going to be with commercial engagement that some of these principles are ever going to take off properly, in a similar way to the way the web itself exploded during the dot.com boom. The GoodRelations Ontology is based on leveraging linked data to start the process of ‘matchmaking’ in market economies. A good business lives and dies on its suppliers. Transaction costs are now estimated to account for at least 50% of the GDP of a modern market economy, so the amount of money to be saved by making it easier and quicker for a company to find the best fit is considerable. The root of the problem is that the WWW is the largest data shredder in history, taking structured data as it is added to the web and then removing the entire context. This now-unstructured data cannot be reassembled into structured data. In this project links are important but not the whole story: there is also a need to record data semantics, hold data in a structured manner, group links by type, and link to information within documents. The GoodRelations project has spent 8-10 years trying to create a global schema for commerce data on the web and now feels that it is getting close, and with 16% of all current RDF triples having a basis in the GoodRelations project they might be right!

Andy Powell of Eduserv spoke about the work of the DCMI (Dublin Core Metadata Initiative) to align itself with the semantic web. This talk focussed particularly on the challenges of using linked data in a practical manner and the fact that linked data is not the only way for the web to develop. A short history of the Dublin Core metadata schema was given, along with an acceptance of the fact that some of the elements were very “fuzzy buckets” for people to put things into. Dublin Core is seen as having a bit of a ‘flat-world’ modelling approach: it can only deal with one thing at a time, and there has been very little abstraction of the model since it was first proposed; it was just moved from HTML to XML to RDF. If linked data is the future then linked data must be successful on the web, and this means that RDF has to be successful, and it hasn’t been so far. DC can be seen as providing a useful vocabulary of the core ‘classes’ and ‘properties’ that can be used in a linked data environment.

John Goodwin’s demonstrations on the use of linked data within the Ordnance Survey data were fascinating and raised some interesting questions. For example, when you say Essex do you mean the town or the county, and does the ‘machine’ you are using know that? What happens to geographic data when the boundaries of local government change? The temporal aspect of geographic data is a continuing problem. Linked data within the BBC website is allowing news stories to be grouped geographically, and the problems of harmonising data across a number of formats were also raised. The final problem mentioned was that of vernacular geography: the Ordnance Survey has done the ‘real’ geography but the emergency services are more interested in knowing what people mean when they say….

In the next talk we were introduced to PoolParty, thesaurus management software, by Andreas Blumauer. The idea behind PoolParty was to give people a tool to allow them to publish any knowledge as semantic linked data. The semantic web could be either the high level modelling of OWL or the lower level of the SKOS information. The knowledge people can publish can either be open or closed access enterprise information and uses the SKOS coding as a standard to take advantage of the open data available. Functions available include auto-complete for terms, text analysis, linked data mapping, drag-and-drop term adding and advanced reporting. The most useful way to use the system, I thought, was the ability to create a bespoke thesaurus and map it to existing schemas, something as a cataloguer I often wished I could do.

The final presentation of the day was from Bernard Vatant from Mondeca discussing the backend products offered by the company to align semantic web technologies with the real world needs of the people using them. He presented an interesting view of the web:

  • Internet (1970’s) = network of identified, connected and addressable computers.
  • Web 1.0 (ca. 1990) = network of identified, connected and addressable resources.
  • Semantic web (ca. 2010) = network of identified, connected and addressable representations.

His view of the semantic web is that we need an extra level beyond what is currently being offered by thesaurus products, that of context. In the current process terms denote concepts and can be represented by things, and it is this coordination of terms, concepts and things that creates the context. Bernard Vatant described this intersection as the ‘semiotic triangle’. This intersection of linguistics, cognition and technology is one of the areas that excite me the most about semantic web technology.

The day was rounded off with a full panel discussion that covered some very big questions, for example ‘can you really define a universal concept of anything?’ and ‘is linked data really the future?’, which the speakers of the day made a valiant attempt at answering. Some comments I particularly liked (paraphrased in most cases): ‘linked data allows you to circumvent many problems by allowing you to link vocabularies to each other’; ‘data is the new raw material’; ‘data is free/open, roll out is free, sell the services built over the system’; ‘the internet is already changing the traditional business models, this just takes it a little further’; ‘take up is still determined by the discipline of the author’. All in all a fascinating day that may (or may not, depending on who you believe) have given a sneak peek of the future.

October 22, 2010

International Open Access Week 2010

Writing about web page http://go.warwick.ac.uk/lib-openaccess

The 4th International Open Access Week is drawing to a close now and, looking back on a busy week of events, I think that we can be quietly proud of the way things have gone here at Warwick. This year we celebrated in a number of ways:

  • We held two experimental drop-in sessions which generated some interesting discussion on the citation advantage and how to convince colleagues, as well as a discussion on the importance of accurate metadata!
  • I recorded my first conversational podcast for display on the new Knowledge Centre website.
  • We hosted a well attended event, intended for researchers but better attended by Library staff. The researchers missed a really excellent talk by Gerry Lawson of the Natural Environment Research Council about the views and attitudes of funders to Open Access, as well as talks by myself and Jen Delasalle about a whole collection of other Open Access topics.
  • I was invited to speak at the regular meeting of our subject staff to give them a refresher on WRAP, Open Access and other things!  I found this meeting really useful and I think both sides came away with ideas to better support the work of the other, which is always fantastic!
  • And finally I celebrated Open Access Week with the addition of two new members of my team who have managed to more than double the size of the WRAP team in one go!  The timing was coincidental but it was a great way for the Library and University to demonstrate their commitment to open access and WRAP!

There have been lessons learnt from my first Open Access Week but I think overall it was a moderate success. The WRAP competition continues to run and I'll announce the winner early next week!

I'll close with a huge thank you to Gerry Lawson for speaking at Wednesday's event and an equally big thank you to Jen for speaking and for co-organising the whole week!

September 07, 2010

Highlights of Repository Fringe 2010

Writing about web page http://www.repositoryfringe.org/

I'm just back from a trip to gloriously sunny Scotland (which was obviously breaking out the good weather for the festival) and the 2010 Repository Fringe Event.

Hosted at the National E-Science Centre (NESC) in the heart of Edinburgh, the sessions began with Sheila Cannell (Director of Library Services, University of Edinburgh) asking us to consider fireworks. She invited us to join in with the firework display at the end of the Edinburgh festival, which in her words were 'open fireworks' (paid for by a combination of public money and the 'subscriptions' of a few), and to use thinking that would light up the sky. This nicely set the tone for the next couple of days.

The keynote by Tony Hirst (Open University) followed, in which he presented us with an outsider's view of repositories on the theme of openness. The central theme of the talk was "content as data", and he urged us to consider new ways to store and present the information in our repositories to our users. New ways to manipulate the data and new ways to present the data were central, as was information we might want to start recording but currently aren't, such as 'open queries' showing users exactly how the charts in an article were generated from the underlying data. In a nice touch Dr. Hirst finished with a revisit to S.R. Ranganathan's 'Five Laws of Library Science' as he encouraged us to keep our repositories as living organisms rather than as a place where research is dumped and forgotten about.

The following session by Herbert Van de Sompel (Los Alamos National Laboratory) introduced us to the Memento Project, a way to provide web users with time travel! It is a clever way to allow your web browser to access the web as it would have been on a certain date, using the same URI that you have for the current version and with as much of the functionality the page had originally as possible. This is one thing I'm looking forward to experimenting with; if you use Firefox the link above will lead you to the gadget to try it out for yourself!
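For the curious, the "same URI, different date" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Accept-Datetime` header name comes from the Memento specification as later published rather than from the talk, and the function name is hypothetical. The sketch only builds the request header and does not contact any server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def memento_headers(when):
    """Build the header asking for a page as it was at the given moment.

    Assumes the Accept-Datetime header from the Memento specification;
    naive datetimes are treated as UTC.
    """
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    when = when.astimezone(timezone.utc)
    # HTTP dates are always expressed in GMT
    return {"Accept-Datetime": format_datetime(when, usegmt=True)}
```

A client would send these headers along with a request for the ordinary URI of the page, and a Memento-aware server or archive would then be expected to point the browser at the closest archived version.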

Repo Fringe was my first experience of the Pecha Kucha style of presentations (for those not in the know: 20 slides, 20 seconds a slide, autorun for 6 mins 40 per presentation) and they looked just as nerve-wracking as you might expect! On the first day we had an update on the Open Access Repository Junction, beauty and the Jorum repository, Glasgow's Enlighten repository through the metaphor of cake, the problem of dataset identity, research data management and the Incremental project, and finally the Edina Addressing History project. I will admit I was hard pressed to choose my favourite when the time came to vote! I was also impressed at how many ways there are to approach these sessions and how much information you can pack into just under seven minutes.

The EPrints team reinforced their reputation for giving some of the more entertaining presentations that any conference is likely to see with their live demo of EPrints 3.3 and the new Bazaar functionality. A very interesting look at what is to come in terms of the software many of us in the audience are using!

The first round table of the conference for me was on the thorny issue of the relationship between an institution's CRIS (Current Research Information System) and its institutional repository (IR). The talk was sparked by the work done on the CRISPool project, which was aimed at creating a cross-institutional CRIS to cover the Scottish Universities Physics Alliance (SUPA) research group. The discussion invited us to consider whether the distinction is a false one or whether the issue is to consider what functionality best fits where in the system. Is it right that IRs exist when we could all have CRISs? Could we create a centralised, national IR and have all our CRISs harvest from there? Should we be looking to integrate CRIS functionality with IRs? What impact does the REF have on the discussion? In all we didn't come to any definite answers (not that I think that is the purpose of round tables of this sort) but we all took away something to think about.

Day two began with Chris Awre (University of Hull) discussing hangover cures through the ages (the Romans apparently favoured deep-fried canaries) before moving on to the main meat of his presentation on the Hydra Project, a collaboration between the Universities of Hull, Virginia, Stanford and Fedora Commons. This unfunded project is aimed at providing solutions to identified common problems on the understanding that no single institution can (or needs to) create a full range of content management solutions on its own. For Hydra collaboration is the key to the success of a project, with each institution providing what it can. The project makes use of Ruby on Rails technology and the work of Project Blacklight, an open source 'next-gen discovery tool', to allow a more sophisticated search function.

The second round table of the event was focused on linking data to research articles. This is an area that we are looking to move into in the future and so I was fascinated to hear some of the comments and opinions from places that already had systems running. From the responses of the attendees I was not alone in this; many institutions seem to realise that this is an important area and that the implications of a project such as this can be huge. The keyword here was always going to be linking, but linking what to what? What is a dataset? As there is a clear difference between a dataset associated with an article and a working dataset, can we pull out only the data that was used in the article and store it with the article without losing the meaning of the data? The point was made that the cost of storage (while large) pales in comparison to the cost of curating many small things, as with curation you have a cost associated with each item. We discussed the fact that with archives the expectation is that you just put things inside, while with repositories you have the added issue of people trying to reuse the data. In the current age of research funding cuts the reuse of data is going to become critical as fewer and fewer institutions are going to be able to afford to run the experiment again from scratch! The issue of trust was discussed: can we trust a conversion of a dataset for preservation? Will it have maintained all of the formulae that are inherent in the dataset? The spectre of 'ClimateGate' was raised: will the availability of the data safeguard against this in the future? If we are linking to things inside a dataset, do we have the functionality to 'cite' a small part of the larger whole without making the link meaningless? All this and metadata schemas were touched upon in a stimulating discussion that could have run a lot longer than it did.
Again we came to no conclusions but everyone I spoke to afterwards had gained at least one thing that they hadn't considered before to think about!

The second round of Pecha Kucha talks was as interesting as the first and covered: the Ready for REF project, looking at the XML output needed for the REF reporting; JISC RePosit, working to simplify the deposit process through use of research information systems like Symplectic; more on research data management, this time from the Edina team and looking particularly at the creation of training tools; the JISC CETIS services and their approaches to open educational resources; ShareGeo and Digimap; and finally the SONEX think tank on work done by this group.

Possibly the most challenging presentation of the event was from Michael Foreman (University of Edinburgh) introducing the concept of 'Topic Models'.  The concept from a paper by Blei and Lafferty (2009) about their work with articles in JSTOR allows people to create maps of related documents based on the statistical analysis of the frequency of words within the article.  A lot of the meat of the statistics did stretch my understanding to the limit but anyone (and everyone in the room certainly did) could see the value to be gained from work of this variety as we search for more and more automated ways to define the content of items in our repositories and the way they relate to others.

The closing presentation from Kevin Ashley (Digital Curation Centre) gave us a round up of the presentations that had gone before it and a round up of the development of the repository world as a whole, and as a way of looking forward revisited the idea of citing data. He urged us to be aware that we are "Standing on the shoulders of Giants" and also to remember that sometimes fireworks are a good way to burn a lot of money very quickly! Curation issues were raised: what to keep? How long do we keep it for? He noted that repositories have not yet had to consider throwing things away and that we may have to at some point! The concept of the value of data being unknowable was also raised, with the example given of data from ships' logs being used three times: first to navigate, secondly to tell historians about economic and trade conditions, and most recently to discover evidence of climate change. Again we came back to the idea of the 'data behind the graph', the information in the article that we just can't get hold of, as well as the fact that people don't always realise that data can be changing all the time; nothing is truly static.

Overall the two days in Edinburgh were packed with many interesting things but the thing I took away from it most was the fact that there is always a different way of looking at something but that you should never forget your foundations.

July 12, 2010

Open Repositories 2010

Writing about web page http://or2010.fecyt.es/publico/Home/index.aspx

There will be a full report of the event going up here soon but I thought I'd get a few of the highlights (non-football related, I'm afraid) up in advance. Presented in no particular order, here are some of the things I took away from the conference.

  • News that Spain's new law for Science, Technology and Innovation, which will mandate the open access publishing in a repository of all publicly funded research no more than 12 months after completion, is (hopefully) to be ratified later this year (Proyecto de Ley de la Ciencia, la Tecnología y la Innovación, Article 36).
  • The 'buzzword' of the conference was 'linked-data', why you should use it, how to code it and most of all how to share it.
  • Need for an awareness that the published paper is only part of the process; research is not just about the results but also about the process of getting the results. It is just as valuable to researchers for us to archive this data as well.
  • Everyone knows what the problems and issues are in the broad areas of repositories and Open Access and the solutions are as numerous as the problems. However at the moment development is so close to the present that people are not having as much choice about waiting for their preferred option to be ready.
  • Some institutions want their mandate in place before they even have a repository.  This has definitely helped them in that they are now starting the repository from a position of community engagement but I can see problems if they have any delays in the building of the repository.
  • Interoperability and integration with other library systems were highlighted as particular issues and concerns and a number of presentations touched on this, bringing us again back to linked-data.
  • Repository drivers (particularly in terms of research assessment) are sometimes driving repositories away from the 'core' or 'ideal' of open access to research.
  • Non-text research outputs lead to non-standard repositories.  Possibly obvious, but it's worth bearing in mind we don't all have the same challenges, and that even if we think we've got it worked out, unexpected deposits can play havoc with systems.  Also it is to our advantage not to get locked into the idea of a single output type.
  • Disambiguation is the next big challenge and a number of different projects were presented in this area, both in session and as posters.
  • Libraries in general and repositories in particular need to be aware that each discipline has its own 'language'. We need to strive to be the common language that allows them all to communicate, not another language for them to learn.
  • The more we can move into their preferred working environment instead of forcing them to learn a new one the better, lessons can be learnt from the social networking world (hands up how many of you have linked all your accounts so you only have to update one!?!).
  • The Carrot vs Stick debate: both approaches work and some institutions are using some very big sticks indeed!
  • Digital Preservation doesn't have to be hard, but you do have to want to do it!

Finally, congratulations to Richard Davis and Rory McNicoll of the University of London Computer Centre for winning the 'Developers' Challenge' (for details see here) with a tool to hugely increase the number of useful links out of a repository record.  Also to Colin Smith, Chris Yates and Sheila Chudasama of the Open University for winning the poster contest (available here).

June 01, 2010

The Changing of the Guard

A first post from WRAP's new E-Repositories Manager.

I have been working in Academic libraries since 2003, having been part of the University of Nottingham and the University of Birmingham before joining Warwick as a Metadata Librarian attached to WRAP in May 2009.

My background is chiefly in the areas of cataloguing, resource description, information retrieval and subject indexing (expect to hear me banging on about name authorities a great deal!) so this position builds on my strengths but is also going to be a real challenge for me. I anticipate that this blog will continue to be used in the same way that it was set up, to cover thoughts, impressions, plans and developments in the Warwick repository landscape.

March 03, 2010

A change of role

Writing about web page http://blogs.warwick.ac.uk/libresearch/

I have moved to a different role at the library now: I am Academic Support Manager (Research) from 1 March onwards. Mind you, I've been doing some work for this role before the switch and so I owe time back to the repository. Which is convenient, until a new repository manager is appointed!

I've started a new blog for the more general research support issues, which this post links to.

February 23, 2010

Highlights of UKCoRR meeting, Feb 2010

Last Friday I was at the UKCoRR members' meeting. As their Chair, I reported on my activities and announced speakers. As a repository manager, I learnt a lot from the other participants.

Louise Jones introduced the day, as the University of Leicester library were our hosts. They have recently appointed a Bibliometrician at Leicester and they're acquiring a CRIS to work alongside their repository. They have a mandate for deposit and Gareth Johnson's presentation later in the day about the repository at Leicester mentioned that they have more than enough work coming in, without the need for advocacy work to drum up deposits. I guess that the CRIS will come in handy for measuring compliance with the mandate!

Gareth's presentation also included some nice pie charts showing what's in their repository by type, and what's most used from the repository, by type and then again by "college" (their college is like a faculty). Apparently he had to hand-count the statistics for the graphs... well done Gareth!

Nicky Cashman spoke about her work at Aberystwyth and I found it interesting that one of their departments' research projects on genealogy has hundreds of scanned images of paper family trees that they are looking for a home for, at the end of their project. They don't require a database to be built around their data as they already built one, and they want to link from it to the scanned images. This sounds like a great example of the kind of work that the library/repository can do to support researchers with their research data. The problem is, though, that in order to host that kind of material in a repository there will be substantial costs, (cataloguing each item, storing it and preserving it) and these costs perhaps ought to have been included in the original research bid. Researchers ought to be thinking about such homes for their data at the beginning of their projects, rather than at the end.

Nick Sheppard spoke about his work on Bibliosight and using the data provided through Web of Science's Web Services. There was some discussion about the fact that you can't get the abstract out of WoS, because they don't own the copyright in it and so can't grant us permission to use it...

Jane Smith of Sherpa demonstrated some of the newer and more advanced features of RoMEO. I think that the list of publishers who comply with each funder's mandate is something that might be of use to researchers looking to get published. Also, the FAQs might be useful for new users of RoMEO.

I would like to see the Sherpa list of publishers who allow final version deposit enhanced to include which of them will allow author affiliation searching as well, so that we can find our authors' articles in final versions and put them into the repository. And another column to say whether the final versions are already available on open access or not, because I'd prioritise those not already available on open access.

One development that has been considered for SherpaRoMEO is that it should list the repository deposit policy at journal title level, because publishers often have different terms for different titles. However, in trying to develop such a tool, it has transpired that one journal might appear to have many copyright rights owners, when looking at the different sources of information about journal publishers. For instance, the society or the publisher who acts on their behalf might each claim the rights and each have different policies. Which rights owner's policy ought SherpaRoMEO to display?

Hannah Payne spoke about the Welsh Repository Network, who have a Google custom search covering all the Welsh repositories. I like it, but would wish to see a more powerful cross-searching interface. In the afternoon we did a copyright workshop that had also been run at one of the WRN events.

So there is plenty I can take away from the day.

February 22, 2010

Referring sites

Today I've been writing up some handover notes on statistics for the next E-Repositories Manager at Warwick.

One thing that has been interesting me for a while is the "Referring sites" information on Google Analytics. Most of our visitors come from Google itself, and the great blue wedge on the pie chart that is search engine referrals resembles a pac-man shape: it has been swallowing up all other sources of visitors, month on month...

Ideally, we'd like for people to be linking to documents in the repository, and for people to be following these links: this would increase our "Google juice"... and perhaps such an effect would result in more visitors from search engines, and thus my pie-chart of visitor sources will always look like a blue pac-man character!

The referring site that brings us most visitors is Warwick's own, and within the Warwick domain, the page we created under the University's "Research" page brings us most visitors. This is good news because it shows the importance of us having this page, and not only linking to the repository within the library's pages.

The next most important pages are the ones from within the library's website, which is fine. Our next most important source of visitors is from the profile page of one particular academic who is very good at linking to his papers in WRAP!

It would probably be a good advocacy tactic to write to authors to say how many visitors have come to WRAP by following links on their pages... if we had the time to go through all these stats! Given that many of the profile pages which are bringing visitors to WRAP are those generated by the University's "MyProfile" system, then it would also serve as good advocacy for MyProfile.

(NB for non-Warwick people: MyProfile is what we call the part of InfoEd which documents academics' work and is used by our Research Support Services department. It is used well by some departments and not very well by others, and not all departments choose to have staff profile pages driven by its data. It serves as a kind of publications database for Warwick and is one of the reasons why WRAP remains full text only. We share our data with MyProfile through a report sent every month and Warwick authors can update WRAP by uploading a file through MyProfile.)
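The tally of visitors per referring page mentioned above need not be done by hand. What follows is only a minimal sketch, assuming the referral figures have been exported as pairs of referring URL and visit count; the `referrals` rows, the example.ac.uk addresses and the `visits_by_page` helper are all invented for illustration and are not taken from our real statistics.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical export of referral data: (referring URL, visits).
# Every URL and figure here is invented for illustration only.
referrals = [
    ("https://www.example.ac.uk/research/", 120),
    ("https://www.example.ac.uk/library/repository/", 80),
    ("https://www.example.ac.uk/staff/a_author/?tab=publications", 30),
    ("https://www.example.ac.uk/staff/a_author/", 15),
]


def visits_by_page(rows):
    """Sum visits per referring page (host plus path), ignoring query strings."""
    totals = Counter()
    for url, visits in rows:
        parts = urlparse(url)
        totals[parts.netloc + parts.path] += visits
    return totals


totals = visits_by_page(referrals)

# List the referring pages from most to fewest visits.
for page, visits in totals.most_common():
    print(f"{visits:5d}  {page}")
```

Grouping by host and path means that the same profile page reached with different query strings is counted once, which is what you would want before writing to an author about the visitors their page has sent.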

February 15, 2010

Ranking repositories

Writing about web page http://repositories.webometrics.info/methodology_rep.html

Webometrics have published their rankings for repositories, and their methodology is described online. This is the first time they've actually listed WRAP and we're at no. 273. They are primarily focussed on repositories like WRAP that are all about research content. Their criteria for measurement are listed as:

"Size (S). Number of pages recovered from the four largest engines: Google, Yahoo, Live Search and Exalead.
Visibility (V). The total number of unique external links received (inlinks) by a site can be only confidently obtained from Yahoo Search and Exalead.
Rich Files (R). Only the number of text files in Acrobat format (.pdf) extracted from Google and Yahoo are considered.
Scholar (Sc). Using Google Scholar database we calculate the mean of the normalised total number of papers and those (recent papers) published between 2001 and 2008."

But if you decided that the Webometrics ranking were an important one (a whole other issue!) then you might want to work on influencing these...

50% of the ranking is given to Visibility, so you'd want to concentrate on getting people to link in to your content from other sites. This is not only good for Webometrics, but reputedly also for your "Google Juice" (ie how high your content appears in Google results lists). I've yet to investigate whether we can find any stats out for ourselves from Yahoo Search or Exalead. However, sending this message out to your authors that they should link in to your content and encourage others to do so could cloud the main issue, which is about getting them to send us content in the first place. I think that this kind of a message is one for a mature repository to focus on, where there is already a culture of high deposits. Because the main priority for a repository is surely to make lots of content available on OA, not to score well in a repository ranking!

20% is dependent upon size. So getting lots of content and focussing on this message with your authors is important too. It is my highest priority in any case...

15% is dedicated to "Rich files", which seems to mean the number of PDF files... this isn't necessarily the best thing for a repository from a preservation angle, nor if you would like to allow data-mining on your content. It might not even be the best display format for all types of content. So it would seem to me to be the least important metric to focus on, if I understand it correctly.

The final 15% is dependent on Google Scholar... Google Scholar does not currently index all of WRAP's content. I have written to them about this, and I know that other repositories have the same issue, but I still haven't got to the bottom of it. My theory, from reading their "about" pages, is that they are indexing our content but not presenting it in their results sets because they de-duplicate articles in favour of final published versions: they present these rather than repository results, so if I look for all content on the wrap domain through GScholar I won't get as many results as I have articles in the repository. If my theory is right then it could be significant to learn whether Webometrics is using their raw data before any such de-duplication.  I might be wrong, though!

Also note the dates of publication that are relevant to the GScholar data. We have said to authors that as far back in time as they feel is important/significant is fine with us (helps to win them over, useful for REF data and web-pages driven by RSS feeds from WRAP). But if you wanted to be more strategic in raising your ranking on Webometrics then you'd need to change the policy to focus on content published in the last 10 years...
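To see how the four weightings trade off against each other, here is a rough sketch. It assumes each indicator has already been normalised to a value between 0 and 1 and that the indicators are simply combined as a weighted sum; Webometrics' actual normalisation and combination method is not something I have checked, so `combined_score` and all the indicator values below are hypothetical.

```python
# Weightings as described above: Visibility 50%, Size 20%,
# Rich files 15%, Scholar 15%.
WEIGHTS = {
    "visibility": 0.50,
    "size": 0.20,
    "rich_files": 0.15,
    "scholar": 0.15,
}


def combined_score(indicators):
    """Weighted sum of indicators, each ASSUMED to be normalised to 0..1."""
    missing = set(WEIGHTS) - set(indicators)
    if missing:
        raise ValueError(f"missing indicators: {sorted(missing)}")
    return sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)


# A hypothetical repository sitting mid-table on every indicator.
baseline = {"visibility": 0.5, "size": 0.5, "rich_files": 0.5, "scholar": 0.5}

# The same repository after improving one indicator by the same amount.
more_inlinks = dict(baseline, visibility=0.6)
more_content = dict(baseline, size=0.6)

for label, repo in [
    ("baseline", baseline),
    ("more inlinks", more_inlinks),
    ("more content", more_content),
]:
    print(f"{label:13s} {combined_score(repo):.3f}")
```

Under these assumptions, the same improvement in Visibility moves the combined score two and a half times as far as it would in Size, which is why a repository chasing the ranking would be tempted to focus on inlinks rather than deposits.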

I don't think we shall be playing any such games! But it is interesting to see what ranking providers consider to be important...

December 14, 2009

What does repository deposit mean?

Follow-up to Theses and early draft deposit in repositories: is that publication? from WRAP repository blog

Last week I attended a meeting with some publishers and it seems to me that there is considerable potential for confusion amongst those not involved in repository management, about what repository deposit actually means. The two main areas of confusion seem to be:

1) Not all content in all repositories is necessarily open access. Some repositories have metadata-only records along with some records which also have full text items available on open access. Some also have full text items that are locked such that only repository staff and the author can see them, or such that only members of the institution can see them. Some repositories add a "request a copy" button to their records so that those who can't see the locked full text can request it from the author. Sometimes the locked access is in order to meet a publisher's requirement or sometimes it is because the author prefers that requests are sent to him/herself so that s/he can know who is reading his/her work.

Publishers' agreements with authors and their information about what can and can't be done usually refer to whether repository deposit is allowed or not. I suspect that more of them would allow repository deposit if the article were locked to be accessible only within the institution or only to the author and repository staff.

2) Just because an item is available on open access, that does not mean that it is available for further copying by anyone! Publishers might also be more inclined to allow repository deposit and open access availability if they knew that allowing this is not granting permission for others to on-copy from the repository. Some repositories do also ask authors to grant a Creative Commons (CC) licence for the use of the article they deposit, and when this is the case then the article will also be available for further copying. Authors can do this when it is clear that they own the copyright themselves. Those repositories which do use the CC licence don't all expect every single item they hold to be deposited with such a licence, although perhaps that would be an ideal scenario. WRAP isn't one of those repositories which asks authors to sign a CC licence, for now. It would just be another hurdle to deposit and our main aim is to make the works available without subscription barrier.

Publishers' agreements with authors who have paid for their article to be made available on open access on the publishers' site do not state that repository deposit is also allowed, although it seems that (some, at least) do expect that to be the case without their stating it. Perhaps their agreements with the authors do grant copyright back to the authors and that's why they expect it, but it's not always clear to repository managers that this is the case.

We don't put open access articles into the WRAP repository unless permission is expressly granted by the publisher, or the copyright is clearly owned by the author and the author grants permission. Open access seems to have been conflated with waiving of copyright, but copyright still exists in open access works. BioMed Central are very clear that their open access articles can be further copied, and they state how, etc, so they're an example of how open access should be handled by publishers, in my opinion. This is another reason that I wouldn't consider deposit in WRAP to be a form of publication. WRAP has no copyright ownership over the works it holds: that still rests with the rights owners.

For WRAP, we are clear that we want full text to be made available on open access for all journal articles and for as many PhD theses as possible. We don't have metadata-only records for journal articles but we do for theses, and we also allow theses to be deposited but locked to repository staff only. The works in WRAP are not made available with any particular licence and rights owners would still need to be consulted before further copying could be done.

It seems to me that there are so many different flavours of repository, all with ever so slightly different aims and purposes and so we're all doing slightly different things with them. No wonder there is so much potential for confusion! In any case, I was very glad to begin speaking to publishers as I did last week with some representatives from the Highwire publishers, in my role as Chair of the UK Council of Research Repositories.
