All entries for Monday 15 February 2010
February 15, 2010
Writing about web page http://repositories.webometrics.info/methodology_rep.html
Webometrics have published their rankings for repositories, and their methodology is described online. This is the first time they've actually listed WRAP and we're at no. 273. They are primarily focussed on repositories like WRAP that are all about research content. Their criteria for measurement are listed as:
"Size (S). Number of pages recovered from the four largest engines: Google, Yahoo, Live Search and Exalead.
Visibility (V). The total number of unique external links received (inlinks) by a site can be only confidently obtained from Yahoo Search and Exalead.
Rich Files (R). Only the number of text files in Acrobat format (.pdf) extracted from Google and Yahoo are considered.
Scholar (Sc). Using Google Scholar database we calculate the mean of the normalised total number of papers and those (recent papers) published between 2001 and 2008."
But if you decided that the Webometrics ranking were an important one (a whole other issue!) then you might want to work on influencing these...
50% of the ranking is given to Visibility, so you'd want to concentrate on getting people to link in to your content from other sites. This is not only good for Webometrics, but reputedly also for your "Google Juice" (ie how high your content appears in Google results lists). I've yet to investigate whether we can find any stats out for ourselves from Yahoo Search or Exalead. However, sending this message out to your authors that they should link in to your content and encourage others to do so could cloud the main issue, which is about getting them to send us content in the first place. I think that this kind of a message is one for a mature repository to focus on, where there is already a culture of high deposits. Because the main priority for a repository is surely to make lots of content available on OA, not to score well in a repository ranking!
20% is dependent upon size. So getting lots of content and focussing on this message with your authors is important too. It is my highest priority in any case...
15% is dedicated to "Rich files" which seems to be if there are pdf files... this isn't necessarily the best thing for a repository from a preservation angle, nor if you would like to allow data-mining on your content. It might not even be the best display format for all types of content. So it would seem to me to be the least important metric to focus on, if I understand it correctly.
The final 15% is dependent on Google Scholar... Google Scholar does not currently index all of WRAP's content. I have written to them about this, and I know that other repositories have the same issue but I still haven't go to the bottom of it. My theory is that, if you read their "about" pages, they are indexing our content but not presenting it in their results sets because they de-duplicate articles in favour of final published versions: they present these rather than repository results, so if I look for all content on the wrap domain through GScholar I won't get as many results as I have articles in the repository. If my theory is right then it could be significant to learn whether Webometrics is using their raw data before any such de-duplication. I might be wrong, though!
Also note the dates of publication that are relevant to the GScholar data. We have said to authors that as far back in time as they feel is important/significant is fine with us (helps to win them over, useful for REF data and web-pages driven by RSS feeds from WRAP). But if you wanted to be more strategic in raising your ranking on Webometrics then you'd need to change the policy to focus on content published in the last 10 years...
I don't think we shall be playing any such games! But it is interesting to see what ranking providers consider to be important...