All 3 entries tagged Google
February 15, 2010
Writing about web page http://repositories.webometrics.info/methodology_rep.html
Webometrics have published their rankings for repositories, and their methodology is described online. This is the first time they've actually listed WRAP and we're at no. 273. They are primarily focussed on repositories like WRAP that are all about research content. Their criteria for measurement are listed as:
"Size (S). Number of pages recovered from the four largest engines: Google, Yahoo, Live Search and Exalead.
Visibility (V). The total number of unique external links received (inlinks) by a site can be only confidently obtained from Yahoo Search and Exalead.
Rich Files (R). Only the number of text files in Acrobat format (.pdf) extracted from Google and Yahoo are considered.
Scholar (Sc). Using Google Scholar database we calculate the mean of the normalised total number of papers and those (recent papers) published between 2001 and 2008."
But if you decided that the Webometrics ranking was an important one (a whole other issue!) then you might want to work on influencing these...
50% of the ranking is given to Visibility, so you'd want to concentrate on getting people to link in to your content from other sites. This is not only good for Webometrics, but reputedly also for your "Google Juice" (ie how high your content appears in Google results lists). I've yet to investigate whether we can find any stats out for ourselves from Yahoo Search or Exalead. However, sending this message out to your authors that they should link in to your content and encourage others to do so could cloud the main issue, which is about getting them to send us content in the first place. I think that this kind of a message is one for a mature repository to focus on, where there is already a culture of high deposits. Because the main priority for a repository is surely to make lots of content available on OA, not to score well in a repository ranking!
20% is dependent upon size. So getting lots of content and focussing on this message with your authors is important too. It is my highest priority in any case...
15% is dedicated to "Rich files", which seems to mean the number of pdf files... this isn't necessarily the best thing for a repository from a preservation angle, nor if you would like to allow data-mining on your content. It might not even be the best display format for all types of content. So it would seem to me to be the least important metric to focus on, if I understand it correctly.
The final 15% is dependent on Google Scholar... Google Scholar does not currently index all of WRAP's content. I have written to them about this, and I know that other repositories have the same issue but I still haven't got to the bottom of it. My theory is that, if you read their "about" pages, they are indexing our content but not presenting it in their results sets because they de-duplicate articles in favour of final published versions: they present these rather than repository results, so if I look for all content on the wrap domain through GScholar I won't get as many results as I have articles in the repository. If my theory is right then it could be significant to learn whether Webometrics is using their raw data before any such de-duplication. I might be wrong, though!
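To see how much the 50/20/15/15 split favours inlinks, here is a rough sketch in Python. It assumes that each indicator has already been normalised to a score between 0 and 1 and that the four are simply combined as a weighted sum; how Webometrics actually normalises and combines them is not something I have confirmed. The two repositories and their scores are made up for illustration.

```python
# Weights as quoted above: Visibility 50%, Size 20%, Rich Files 15%, Scholar 15%.
WEIGHTS = {"visibility": 0.50, "size": 0.20, "rich_files": 0.15, "scholar": 0.15}


def composite_score(indicators):
    """Weighted sum of indicator scores.

    Assumption: every score is already normalised to the range 0..1.
    This is an illustration, not Webometrics' published calculation.
    """
    missing = set(WEIGHTS) - set(indicators)
    if missing:
        raise ValueError(f"missing indicators: {sorted(missing)}")
    return sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)


# Hypothetical repositories with invented scores.
# repo_a has lots of content and pdfs but few inlinks;
# repo_b has half the content but three times the inlinks.
repo_a = {"visibility": 0.2, "size": 0.8, "rich_files": 0.8, "scholar": 0.4}
repo_b = {"visibility": 0.6, "size": 0.4, "rich_files": 0.4, "scholar": 0.4}

for name, repo in (("repo_a", repo_a), ("repo_b", repo_b)):
    print(name, round(composite_score(repo), 2))
```

Under these assumptions the smaller but better-linked repo_b comes out ahead of repo_a, which is the point about Visibility carrying half the weight.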
Also note the dates of publication that are relevant to the GScholar data. We have said to authors that as far back in time as they feel is important/significant is fine with us (helps to win them over, useful for REF data and web-pages driven by RSS feeds from WRAP). But if you wanted to be more strategic in raising your ranking on Webometrics then you'd need to change the policy to focus on content published in the last 10 years...
I don't think we shall be playing any such games! But it is interesting to see what ranking providers consider to be important...
January 19, 2009
1) 6 monthly report of data changes on WRAP to show which records have been altered since the date they were added into the live repository. (For sharing data with Warwick’s Research Support Services.) Not currently possible.
2) A graph to show how the pattern of new record creation/repository growth has gone, over the last x months/year. I can get this from ROAR. (http://www.roar.org)
3) Monthly report of all records added since last month, with data in specific formats to suit RSS’ InfoEd system (and/or other departments at Warwick). Key issues with sharing with RSS: need to store staff number (or key to call up staff number) for each Warwick staff member amongst the authors, and lack of security for such data in WRAP. Also, page range is currently exported as, eg 51-72, whereas RSS need it as "start page 51, end page 72". More investigation into the technical possibilities for data sharing needs to be done. It may be significant that InfoEd attaches information to a person’s profile, relating to publications (& other activities). Whereas WRAP attaches information about authors to a record describing a publication.
4) Statistics on visitors to WRAP and what they are clicking on, where they come from, etc. Google Analytics does this well enough for me: I can see where they’re clicking, what keywords brought them to WRAP, to where in WRAP, and who their network provider is, (which is a clue to some academic interest, and also helps to identify internal interest). I can see what countries visitors are in, and what cities, etc. I can do all this at a per paper level, but I have to know which paper(s) I want to look at.
5) To look at features like those listed above, for a set of data (eg all by one author, or all for a particular department). Departments and authors may well want to know who is looking at their work in WRAP. I can look at a particular paper, but not at a set: I would have to collate reports for each paper, in some way. IRStats should be able to do this, if we were to install it successfully on WRAP… although it may require some change in our workflow. At the moment, most papers are added to WRAP by our very own administrator, since authors use a separate (& simple) submission form. Authors do not upload data about their own publications and therefore the papers are not attached to separate accounts in WRAP. I believe that IRStats would need separate accounts to be used for each author’s papers, in order to produce reports on all of an author’s papers. Our administrator could create accounts in authors’ names and then log in as the author before creating the record… but that all presupposes that we can get IRStats to work, and that it does work as I expect.
6) It would also be better for me (and for those interested in the data) if I did not have to look up statistics such as those already provided by GA myself, but if those interested could just look them up, on demand. In theory, I can grant access to the GA reports to anyone with a Google account… although this requires some intervention from me. And Google Analytics is great for those who know how to use it, but I can see academics being put off learning how to use it. There are barriers to authors getting data about all the wonderful good WRAP is doing in bringing an audience to their work!
7) GA is great for looking at the site and our html files, but tells us nothing about pdf/word document downloads. The difference between “the most downloaded document” and “the most looked at record” could be very important indeed, if any correlation with citations is to be explored. Also, I can tell from GA if someone has followed the link to the DOI on a particular record. I can’t tell whether anyone has followed the link from within the pdf file to the full text, published version, though.
8) What are people searching for from the repository's own search form: which fields do they search by? GA can only tell me whether people click through from our Advanced form to the Simple search one, and indeed whether people follow the link to search the repository in the first place from our home page. Thus far, there aren’t so many people searching, and we expect that people will not search through our form but on search engines like Google, with keywords which GA does record and tell us, so this isn’t particularly crucial.
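The page range mismatch mentioned in point 3 is the sort of thing a small script could bridge at export time. Here is a minimal sketch in Python, assuming the range arrives as plain text like 51-72; the function name and the start_page/end_page field names are my own inventions, not anything defined by WRAP or InfoEd.

```python
import re


def split_page_range(page_range):
    """Split a page range such as "51-72" into start and end pages.

    Returns None when the text is not two numbers joined by a hyphen
    or en dash (for example roman numerals or a single page), so that
    such records can be checked by hand rather than guessed at.
    """
    match = re.fullmatch(r"\s*(\d+)\s*[-\u2013]\s*(\d+)\s*", page_range)
    if match is None:
        return None
    return {"start_page": int(match.group(1)), "end_page": int(match.group(2))}


pages = split_page_range("51-72")
if pages is not None:
    print(f"start page {pages['start_page']}, end page {pages['end_page']}")
```

This only deals with the formatting issue; the staff number and data security questions in point 3 still need proper investigation.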
I’m also not sure of how to make GA discount visits from members of the WRAP team… but I expect that’s something I ought to look into.
I’ve learnt a lot about what GA can tell me about WRAP and its visitors. I find it fascinating to delve in every now and again and see what brings people to us. It can be used as a website management tool, to see how to make important links more visible and hence more clicked upon. It can be used in advocacy to authors, explaining why they might want to put work into WRAP, showing that others do look at it.
What I would like to do is to compare our statistics with those of other repositories, at other institutions. It’s not easy to find other repositories that are comparable with ours in their features (full text, mediated metadata, voluntary deposit), never mind such repositories at comparable institutions. But it is possible to find those who are much further ahead of us, and it would be good to see where we might be heading, in terms of visitor profiles, whether most visitors came from search engines (as now) or direct links, etc. I would like to know whether the most popular content in others’ repositories is journal articles or unpublished content, and whether there is a particular subject that gets heavier attention than others. So, I would like to be sure that, whatever statistics package we use for WRAP, it is one that would enable us to compare our repository with others. There isn’t such a package or method of using a combination of statistics packages, yet.
February 08, 2008
Writing about web page http://www.jisc.ac.uk/media/documents/programmes/reppres/gg_final_keynote_11012008.pdf
An interesting report on research carried out into what the Google generation is actually like in its information seeking behaviour. All generations seem to like simple searching rather than deploying complex information skills, it seems. So it's a good job that repositories like WRAP can be indexed by Google and therefore their contents found through a Google search, rather than repositories waiting for people to visit the repository itself. It will be interesting to see how many users come to our content through services that index our records like Google rather than directly through our interface.