May 11, 2009
I've had confirmation from the Google Help forum that Google Analytics does differentiate between Google and Google Scholar. So none of those visitors who come to us from Google have come via Google Scholar... Google itself is our main source of visitors to the metadata records and other pages on WRAP.
Watch this blog for news of whether I find out how people get to the pdf files!
May 05, 2009
More thoughts on repository statistics!
My basic reasons for looking at repository statistics are:
1) Can assess and demonstrate that you are meeting aims/targets (& set such targets).
2) Can gain interest/approval/support on the back of large numbers!
3) Providing authors with information about who is looking at their work could motivate them to deposit.
4) Might generate some competitive spirit!
5) Identifying popular content might help in measuring citation impact of repository deposit.
Looking back at the basic reasons:
1) Aims and targets need to be set for the next 12 months, as we have emerged from our JISC funded project. I can only aim for things that I can measure so this becomes a circular argument! Ideally, I would like to be able to ensure that we are getting deposits of all appropriate items, across the whole University - and to know that we can handle such numbers. So what I really need to be able to do is to measure what the University's authors are actually producing.
I need to know about numbers of visitors to WRAP, and whether or not these can be boosted, in order to meet the goal of WRAP being a showcase for Warwick research.
Measuring how people get to WRAP is important, because if they all come via Google and bypass our metadata entirely, then this might cause us to review our metadata creation workflows. The value of metadata goes beyond bringing visitors to the repository, however, and that also needs to be documented.
2) Shouting about large numbers is fairly crude as a way of getting attention, so a crude measurement such as GA is probably appropriate. Having said that, the Apache logs record higher numbers, so I should be reporting those numbers rather than GA ones!
3) Providing information to authors. Well, GA is entirely inappropriate for that. I can provide some information for some authors, and that has been welcome. But the ideal scenario would be for authors to be able to access such information for themselves, whenever they want. And it really is a huge gap in what I can share with authors if I can't tell them about accesses to the actual pdf files. I'm not sure what authors' interest in statistics is in those repositories that do help authors to check for themselves. Authors here aren't clamouring for figures about who is accessing their work: some are pleased when I write to them with figures, but that is probably because I only write to our top content authors, so I'm always spreading good news!
Generally, authors want to know if visitors are indeed academic, which is often very difficult to tell but GA does give me some clues. Being able to tell authors a little bit about visitors to WRAP is reassuring for them, and whilst addressing their every concern is more than I can manage, not knowing about pdf file visitors is a huge gap.
Authors are also concerned about their publishers, and it would be great to be able to demonstrate that repositories like WRAP don't harm publisher interests. This would not only reassure authors, but also perhaps reassure publishers and it would make the business of populating a repository so much simpler if publishers were supportive.
4) The competitive spirit could be between individuals or departments or even institutions. It could be based upon numbers of items in the repository, or numbers of visits or all sorts of different criteria. The competitive spirit ought to be directed towards appropriate measures. Focussing on numbers of items in the repository is probably enough for now: our main goal is to grow the repository.
Some element of benchmarking against other institutions is also going to be important, when it comes to resourcing decisions. This will mean measuring how many items we have, of what type and whether of full text or metadata only. Measuring how fast the collection is growing will help us to plan our workflows accordingly, and also be useful for benchmarking.
5) Measuring impact on citation: this is something that we claim as a benefit of repository deposit. I am always careful to claim this only insofar as it is common sense that more readily accessible work will be read more, and that more widely read research will be cited more. Even so, departments are asking me for evidence that repository deposit will boost citations. The repository does seem to fit into departmental meetings along with departments' concerns to raise citations, so it is no surprise that the two are so closely associated. Evidence of this sort of impact would be highly influential in terms of encouraging deposit, if I could find it. I believe that the problem is that, by the time a repository has had its effect, it will be one of a number of factors influencing higher citations.
What I can hope to do, is to prove that the most highly visited items in the repository become the most highly cited. I need to know which items are most highly visited, and to look at the reasons why that might be.
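If I ever do get per-paper visit counts and citation counts side by side, that comparison could be made concrete with a rank correlation. A minimal sketch in Python, using entirely made-up illustrative figures (not real WRAP data):

```python
# Sketch: are the most-visited items also the most-cited?
# The visit and citation counts below are invented for illustration.

def rank(values):
    """Rank values from 1 (smallest) upwards, averaging ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation between two equal-length lists."""
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

visits = [134, 90, 75, 40, 12]   # monthly record views per paper (invented)
citations = [11, 2, 9, 8, 1]     # later citation counts (invented)

print(round(spearman(visits, citations), 3))  # → 0.7
```

A correlation near 1 would suggest the heavily visited papers do go on to be heavily cited; it still wouldn't prove that the repository caused the citations, only that the two move together.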
April 30, 2009
A (very) quick investigation into what Google knows about WRAP...
An "Advanced Search" through Google, on the WRAP site, for pdf files tells me that Google is indexing 659 pdf files. This is more than our total number of records (580-odd), but that figures, because we sometimes have more than one file per record.
Remove the filetype criterion and you get 2,440 results. The top one is our homepage, but the second one is what seems to be a random record within WRAP. It's not one of our Google Analytics' "Top Content" items. It's not the first one added to the repository (no. 381). Not sure what is happening there.
Doing the same searches on Google Scholar gives considerably fewer results. I had thought that Google Scholar was indexing our content, and it is: some of it. 210 pages & files in total, to be precise. This is something that was raised on the UKCoRR discussion list recently as a concern amongst repository managers: we just don't know what Google Scholar is doing with our content. Most of the 210 pages are pdf files, though (145 of them), and the top results are again fairly random in their order, as far as I can tell.
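For anyone wanting to repeat these counts, the same searches can be expressed directly with Google's ordinary search operators rather than the Advanced Search form; something like the following (the hostname is an assumption about WRAP's address):

```
site:wrap.warwick.ac.uk filetype:pdf    (indexed pdf files only)
site:wrap.warwick.ac.uk                 (all indexed pages and files)
```

The reported result counts are estimates, so they are best treated as rough indicators rather than exact figures.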
Google Analytics is great at telling me lots of interesting stuff about visitors to WRAP. But its biggest weakness is that it can't measure accesses to pdf files. I was reminded today of just how big that weakness is, when our IT Services department looked at the access logs for me and told me that we're averaging about 5000-6000 pdf files accessed per day, whilst the metadata records are only accessed around 700-1000 times a day.
...and I had thought that no-one was looking at the pdf files. Silly me!
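Since the pdf figures have to come from the Apache logs rather than from GA, here is a rough sketch of how a daily pdf-versus-record tally might be pulled out of a combined-format access log. The sample log lines and URL patterns are illustrative assumptions, not WRAP's actual setup:

```python
# Sketch: daily counts of pdf accesses vs other page accesses,
# parsed from Apache "combined" log format lines.
from collections import Counter
import re

# client, identity, user, [date], "request method and path", ...
LINE = re.compile(r'\S+ \S+ \S+ \[(\d{2}/\w{3}/\d{4}):[^\]]+\] "GET (\S+)')

def tally(lines):
    pdf, records = Counter(), Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue  # skip malformed lines and non-GET requests
        day, path = m.groups()
        if path.lower().endswith(".pdf"):
            pdf[day] += 1
        else:
            records[day] += 1
    return pdf, records

sample = [
    '1.2.3.4 - - [05/May/2009:10:00:01 +0100] "GET /123/1/paper.pdf HTTP/1.1" 200 1024',
    '1.2.3.4 - - [05/May/2009:10:00:05 +0100] "GET /123/ HTTP/1.1" 200 2048',
    '5.6.7.8 - - [05/May/2009:11:30:00 +0100] "GET /456/2/thesis.PDF HTTP/1.1" 200 4096',
]
pdf, records = tally(sample)
print(pdf["05/May/2009"], records["05/May/2009"])  # → 2 1
```

A real version would also want to filter out robot traffic and repeated requests, which is exactly the sort of cleaning that makes raw log counts look so much bigger than GA's.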
But it is surprising that Google Analytics was trying to tell me anything about pdf file accesses at all. I looked at my "Top Content" today, and did a search for URLs containing "pdf". And I found that the top 10 pdf files for the last month had had 134 accesses between them. I investigated the most popular of these further, and its traffic sources included one access through Google. What is Google/Google Analytics doing?!!
I have not yet had a reply to my query on one of the Google forums, about whether Google Analytics differentiates between Google Scholar and Google referrals to our pages.
Why am I so hung up on Google? I think most of our visitors come from there.
Why am I so keen to find out what Google Analytics can tell me about our visitors? I can't tell academics to put their work in the repository if I can't prove that:
a) more people will read their work as a result.
b) those people are likely to be the kind of people who will cite their work & further academic knowledge.
c) visitors to WRAP are not going to adversely affect journal publishers.
Of all of these, point c is probably the most contentious. Academics are very protective over their journal publishing model and I can see why: WRAP certainly can't do everything that their publishers can. I am keen to hunt out whatever evidence I can find, to reassure our academics and to back up claims that repository deposit leads to more citations. It is citations that Universities are seeking, in time for the REF!
April 27, 2009
Some more trawling through the Google Analytics Help section today has turned up some discussion on their forum, which suggests that GA doesn't count clicks on links to external web pages, only clicks on links to elsewhere within the site.
This is something of a relief: I am not surprised that people are not clicking to read the pdf files of articles in WRAP, because I expected them to be interested in reading the "version of record", which is the one on the publisher's website which we always link to. I was very surprised that GA's site overlay reported no clicks from our records to publishers' web pages. Why would so many people be looking at the record in WRAP, while none were looking at the article itself? The only answer I could think of was that they must have gone back to their Google results that brought them to WRAP in the first place, or that they were satisfied by reading the article record in WRAP. Yet some must surely have wanted to read more, especially given that so many visitors seem to have come from academic networks... and now it all makes sense that people could indeed be clicking on our links to publishers' pages, but we simply can't measure those clicks.
So I can reassure authors that no-one will be reading the pdf versions we hold in WRAP(!) unless they have no other option because either they don't have a subscription to the published version or the published version is no longer available. Which is kind of what they want: many authors don't really feel comfortable with making their own early versions available. Now all I need to do is to convince authors of why we want the full text in WRAP at all, given that I know no-one is looking at it! My usual list of reasons is:
1) Google indexes the full text file, bringing visitors to your work in WRAP.
2) There will be those without subscription access who will be glad to read the earlier version. (This will include those in the commercial world but also those in academia in less wealthy countries).
3) This will be a back-up version of the work, for times when the publisher might be unable to make the work available - either temporarily, due to a technical hitch, or for the long term.
4) It is the long term nature of a repository that makes it different to putting your article on your own web page. Putting your article into an institutional repository is like libraries of old having copies of books and journals on shelves for future generations to consult.
As a librarian, my concern is to collect scholarly works and to make them available when they are needed. WRAP may be an electronic collection, but what we (the library) are trying to achieve with WRAP is very much our traditional role: we're just finding different ways to do that, as technology changes the possibilities for us.
April 16, 2009
I've been thinking about all the things that might lead to an author becoming highly cited, or raising citations for a particular paper, in ways other than just WRAP deposit. I often feel pressured to prove that WRAP deposit can raise citations: it is one of the claims that we make, so we should be able to prove it. Yet it seems to me to be impossible to do: all I can do is point to articles which say that open access publishing raises citations, and say that WRAP deposit is a form of open access publishing. It stands to reason that if more people can find and read your article, then more people will cite it, in the long run. But authors seem to want better evidence than that, and they would prefer to have evidence that WRAP itself will raise their citations, not just about repositories generally.
So, I've been looking at what Google Analytics can tell me (again!) and matching that to tactics that are rumoured to raise citations. One such tactic is to use a key phrase repeatedly in article titles, or to publish consistently on/around a particular theme, so that you get known as an expert for something in particular. I'm not sure whether anyone ever does this in such a calculated way; it's probably more likely that a particular phrase becomes associated with an expert because it was his/her work that introduced the concept. But anyway, GA can tell me which keywords have led people to WRAP.
This month, the highest keyword search leading to visits to WRAP is "interracial sex", and other keyword phrases that people are searching for when they come to WRAP are: "street slang", "leishmaniasis recombinant vaccines" and "educational leadership theories". Other phrase searches include entire article titles.
What do such phrase searches tell me? Well, in the case of article titles, it is clear that it is the academics' work that is being sought. In the case of keyword phrases, it could be that "social searching", as much as academic searching, is leading visitors to WRAP in some cases. Looking at the papers that these keywords led to, and at the "content overlay" feature of GA, which tells me where people clicked when they visited that page, I can't see that people are clicking on the pdf or the publisher's link. They appear to be looking at the WRAP record and then looking away again: this might mean that they read the abstract and learnt enough, or that they were indeed looking for something else entirely. The most popular papers in WRAP correspond with the keyword phrases that are leading most visitors to WRAP. I've looked in some detail at the profile of visitors to those popular papers, and from the network locations of the visitors, many are indeed on identifiably academic networks. Even those on commercial networks could be academics working from home.
In short, what I can say is that keyword phrases will bring visitors to your paper in WRAP - if your paper is there. At least some of those will be the kind of visitor that you will want to have. It really doesn't take that much effort to deposit: visit a web page, upload a file, tick a couple of boxes (literally 2!) and paste a reference in. Time will tell whether all that effort is worth it, because the business of becoming highly cited takes a very, very long time and a lot more than just repository deposit.
If I can possibly prove that WRAP deposit will raise citations, I will do. But in the meantime, there needs to be work in the repository for me to look at the statistics for... it's early days for WRAP still, and even for repositories.
April 03, 2009
Writing about web page http://www2.warwick.ac.uk/services/library/main/research/instrep/erepositories/
I've just noticed that it's almost a month since I last posted to this blog. It's been incredibly busy as we've just concluded the JISC funded project of WRAP, and now we're entering a new era in which WRAP is the repository but no longer a project.
Our end of project report is now on the web (see link above to project page), and for those with an interest in repository statistics, I recommend the appendix on what we can tell about WRAP through Google Analytics, ROAR and the University's web tool known as Sitebuilder.
The future direction of WRAP is uncertain as the University's steering group are considering what to do with Publications data, but we will certainly be carrying on with WRAP as we have done, increasing our deposits of full text journal articles and handling PhD theses.
We just announced to the University of Warwick staff that we have surpassed 500 items in WRAP (through a University-wide electronic newsletter known as "inbox insite"), and to date we have 536 items, so the collection is most certainly growing, and with it our visitor numbers are increasing.
As we've had more interest in WRAP we've had more enquiries, and our FAQs have been expanded beyond what I think is a reasonable number, so I've introduced an index, using a feature of Sitebuilder that creates a "table of tags" from the keywords in the metadata for the pages. First, though, I had to revise the keywords for all those pages, to make sure I used consistent terms, applied consistently throughout the collection. I like to call this a "tagsonomy", and if you're at Warwick and would like to read more about these tags, please see these other blog postings: http://search.warwick.ac.uk/blogs?q=tagsonomy
January 19, 2009
1) 6 monthly report of data changes on WRAP to show which records have been altered since the date they were added into the live repository. (For sharing data with Warwick’s Research Support Services.) Not currently possible.
2) A graph to show how the pattern of new record creation/repository growth has gone, over the last x months/year. I can get this from ROAR. (http://www.roar.org)
3) Monthly report of all records added since last month, with data in specific formats to suit RSS’ InfoEd system (and/or other departments at Warwick). Key issues with sharing with RSS: need to store staff number (or key to call up staff number) for each Warwick staff member amongst the authors, and lack of security for such data in WRAP. Also, page range is currently exported as, eg 51-72, whereas RSS need it as "start page 51, end page 72". More investigation into the technical possibilities for data sharing needs to be done. It may be significant that InfoEd attaches information to a person’s profile, relating to publications (& other activities). Whereas WRAP attaches information about authors to a record describing a publication.
4) Statistics on visitors to WRAP and what they are clicking on, where they come from, etc. Google Analytics does this well enough for me: I can see where they're clicking, what keywords brought them to WRAP, to where in WRAP, and who their network provider is (which is a clue to some academic interest, and also helps to identify internal interest). I can see what countries visitors are in, and what cities, etc. I can do all this at a per paper level, but I have to know which paper(s) I want to look at.
5) To look at features like those listed above, for a set of data (eg all by one author, or all for a particular department). Departments and authors may well want to know who is looking at their work in WRAP. I can look at a particular paper, but not at a set: I would have to collate reports for each paper, in some way. IRStats should be able to do this, if we were to install it successfully on WRAP… although it may require some change in our workflow. At the moment, most papers are added to WRAP by our very own administrator, since authors use a separate (& simple) submission form. Authors do not upload data about their own publications and therefore the papers are not attached to separate accounts in WRAP. I believe that IRStats would need separate accounts to be used for each author's papers, in order to produce reports on all of an author's papers. Our administrator could create accounts in authors' names and then log in as the author before creating the record… but that all presupposes that we can get IRStats to work, and that it does work as I expect.
6) It would also be better for me (and for those interested in the data) if I did not have to look up statistics such as those already provided by GA myself, but if those interested could just look them up, on demand. In theory, I can grant access to the GA reports to anyone with a Google account… although this requires some intervention from me. And Google Analytics is great for those who know how to use it, but I can see academics being put off learning how to use it. There are barriers to authors getting data about all the wonderful good WRAP is doing in bringing an audience to their work!
7) GA is great for looking at the site and our html files, but tells us nothing about pdf/word document downloads. The difference between “the most downloaded document” and “the most looked at record” could be very important indeed, if any correlation with citations is to be explored. Also, I can tell from GA if someone has followed the link to the DOI on a particular record. I can’t tell whether anyone has followed the link from within the pdf file to the full text, published version, though.
8) What are people searching for from the repository's own search form: which fields do they search by? GA can only tell me whether people click through from our Advanced form to the Simple search one, and indeed whether people follow the link to search the repository in the first place from our home page. Thus far, there aren’t so many people searching, and we expect that people will not search through our form but on search engines like Google, with keywords which GA does record and tell us, so this isn’t particularly crucial.
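On the page-range issue in point 3 above: the transformation itself is tiny. A minimal Python sketch, assuming ranges are exported as simple strings like "51-72" (the output field names are just illustrative, not InfoEd's actual schema):

```python
# Sketch: splitting an exported page range like "51-72" into the
# separate start/end fields that a system like InfoEd expects.
# Assumes simple numeric ranges; real export data would need
# more careful validation (roman numerals, single pages, etc.).
def split_page_range(pages):
    start, sep, end = pages.partition("-")
    if not sep or not start.isdigit() or not end.isdigit():
        raise ValueError(f"not a simple page range: {pages!r}")
    return {"start page": int(start), "end page": int(end)}

print(split_page_range("51-72"))  # → {'start page': 51, 'end page': 72}
```

The harder problems of data sharing (matching authors to staff numbers, and keeping such data secure) are organisational rather than technical, so a snippet like this is the easy part.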
I’m also not sure of how to make GA discount visits from members of the WRAP team… but I expect that’s something I ought to look into.
I’ve learnt a lot about what GA can tell me about WRAP and its visitors. I find it fascinating to delve in every now and again and see what brings people to us. It can be used as a website management tool, to see how to make important links more visible and hence more clicked upon. It can be used in advocacy to authors, explaining why they might want to put work into WRAP, showing that others do look at it.
What I would like to do is to compare our statistics with those of other repositories, at other institutions. It's not easy to find other repositories that are comparable with ours in their features (full text, mediated metadata, voluntary deposit), never mind such repositories at comparable institutions. But it is possible to find those who are much further ahead of us, and it would be good to see where we might be heading, in terms of visitor profiles, whether most visitors came from search engines (as now) or direct links, etc. I would like to know whether the most popular content in others' repositories is journal articles or unpublished content, and whether there is a particular subject that gets heavier attention than others. So, I would like to be sure that, whatever statistics package we use for WRAP, it is one that would enable us to compare our repository with others. There isn't yet such a package, nor an established method of combining statistics packages to that end.
December 15, 2008
Writing about web page http://www2.warwick.ac.uk/services/library/main/research/instrep/ir_value_event
Our recent event at Northampton went pretty well: I've been writing up comments from the workshops, so that I can share them with event delegates... here they are for all to see, arranged in a way that makes sense to me, but others may have more to add, from their notes.
What are you aiming for? (Researcher) / How can you measure/demonstrate success at this?

Aim: It's just a vehicle, so it should make my research & career look good.
Measures: My best publications are all included (in my opinion). No worries about IPR: this is taken care of on my behalf. I don't have to check anything, nor do I get any complaints.

Aim: A single submission only, and all systems use that single data source.
Measure: I am only asked to send this information to one system, once.

Aim: It helps me to network with other academics.
Measures: I get e-mails/contact from people who've read my articles. I can contact other authors.

Aim: I can look at other people's stuff.
Measure: Articles I am interested in are available for me to read.

Aim: My work is preserved.
Measure: If it lives at the same place for two years, that might be long enough for me!
What are you aiming for? (Repository Manager) / How can you measure/demonstrate success at this?

Aim: I want lots of information in there: that's a sign of health!
Measures: Benchmarking against other institutions of similar size and character? Total number of items; growth pattern. Comparison with ISI Web of Science information about my institution?

Aim: I want lots of traffic, people looking at that information, which will indicate that it is useful to others.
Measures: Measure the number of visitors to your repository. Ever? Every month? Benchmarking against other institutions of similar size and character? What kinds of people, anyway? Are they academics? Feedback from users can indicate whether or not it is easy and useful for them.

Aim: External recognition of the value of our repository will demonstrate its value to authors at my institution.

Aim: I need to be able to demonstrate that there is a lot of traffic, so I need good reporting/statistics handling.
Measure: Can I get the information I need?

Aim: The repository will have a high profile in the University.
Measures: Senior staff will be aware of the repository. Invitations to present/talk about the repository will indicate that others have heard of us and want to know more.

Aim: The whole community will be engaged and will have bought into the repository.
Measure: If I get continuation funding, that will be a sign that some of this will have happened.

Aim: The repository deposit process becomes part of the lifecycle of research activity: culture has been changed.
Measure: Deposits flow in without prompting, and there are relatively few queries to be chased up or enquiries to be handled.
What are you aiming for? (Central administration/Management) / How can you measure/demonstrate success at this?

Aim: It should make the institution look good.
Measures: The repository should have a clear identity. The highest profile, high quality work should be represented.

Aim: It should generate income for us and be cost effective.
Measure: Small staff costs to run.

Aim: It should be a complete record, so we can make our staff accountable for their work and use it in reporting for the REF.
Measure: What is the level of engagement: is everyone, from every department, represented?

Aim: Data should be re-useable.
Measures: Can I use the data for our REF and to meet FOI requests? Do all the Uni systems that need this data have access to it?

Aim: There should not be risks in storing this data.
Measure: If no-one complains that's a good sign, and no law suits either!

Aim: The data should be preserved.
Measure: We can access and manipulate the data in the long term.
November 10, 2008
Well, my thoughts on the topic so far stretch to:
1) Numbers of visits/visitors, which you can get as a whole since launch and/or as a month on month comparison. Ours don't tell us too much, except that people don't visit WRAP much at weekends and our visitor numbers have grown since we launched. As we're also growing content, this just confirms that there's nothing I should be worried about! I'm not altogether sure of the best way to measure these using Google Analytics: should I be looking at page visits or visitors? Should I be looking at Unique Visitors if I'm going to look at Visitors? At the moment we're talking pretty small scale differences and there is no difference in the pattern, so for my own needs, any of these would be appropriate. But what if I wanted to benchmark against another repository? (GA does have a "benchmark" feature which supposedly benchmarks your website against other sites of the same size. I don't fully understand it and it makes WRAP look really good, but I don't believe it's all that useful to benchmark WRAP against unknown websites!)
Information about visitors includes looking at which countries and networks they have come from. I can drill down further within the UK visitors to find out which cities the visitors came from. Of course the largest contingent of our visitors were from Coventry, and from within the Warwick network. But there are other academic networks appearing in the list, including Southampton, Durham, Birmingham, Edinburgh and others.
2) Traffic sources. The latest beta "Advanced segments" option shows me very nicely the whole number of visits as a line graph, with different coloured lines for the traffic sources, be they direct visits (eg bookmarks, someone types in the URL), search engine referrals or web page referrals. The pattern seems to be remarkably similar across all three, although the search engines are by far the largest traffic sources. Looking further at which web pages link to WRAP is an interesting exercise... likewise for looking into which keywords were typed into the search engines that led to a visit on WRAP. Mostly the web pages are Warwick Uni ones. At first the keywords were nearly all general enough to suggest that people were looking for WRAP itself, or something like it. But now that we have more content, the keywords are getting much more specific.
3) Content: Pageviews give you a lovely big number, if that's what you need to show! But I hardly think it is more useful than the number of visits or visitors. TopContent tells me that the pages in WRAP that are visited most are the home page, search pages, admin pages and the browse pages, etc. This is the closest to telling me which are the most visited papers in the repository, which is useful for advocacy. Except that I can't possibly know whether the papers themselves were read, only that their records were read... The site overlay feature might show this for a single record, but I can't compare papers on such popularity of pdf download. And I cannot tell much information about the visitors to an individual paper: I can see which keywords led to that paper, which sources linked to it. But not whether the visitors were on an academic network or not, from the UK or not.
The Top Landing pages tell me which pages within WRAP people are reaching WRAP through. Our most important page is our home page, but after that are actual article records. I can use this in advocacy work, to claim that "the paper that has had most direct hits to it within WRAP is...." But of course that would not necessarily be the most popular paper in WRAP. Just the one that more people are following links from elsewhere to. Academics could easily boost this statistic for their paper just by sharing the WRAP URL for their work.
The Top Exit pages provide a nice balance to those, so presumably our visitors are looking at precisely what they wanted to find and not hanging around (also described in the high Bounce Rate). However, people are exiting from our search page, browse by department page and latest additions page as well. I am a little concerned about people who don't make it past the search page: we link directly to the advanced search form, but I might want to change that if there is a real problem with this. But I'm not worried yet, it's just something to watch.
Site Overlay looks like a great feature but I don't understand what on earth all those percentages mean! If I'm right, when I look at the record for an article and I can see the link for the pdf, if it says "0%" then that means that no-one has clicked on it. But I'm not sure I've got that right.
But that's only all about what I can do with Google Analytics. The list of what I would really like to be watching/providing to authors is most likely to be entirely different.
January 22, 2008
Writing about web page http://www.rsp.ac.uk/events/ProfBrief.php
I attended my second Repositories Support Project briefing day yesterday, at the British Library. I like going to the BL as it's easy to get to, the conference facilities are really very good, and there's always the exhibition to go round in your lunch break so you do get a proper break from whatever you're learning on the day itself. But I did get the slow train yesterday, so I deserved that break!
The themes for yesterday's event were Funder mandates, Repository Metrics, Repository Statistics and Preservation Metadata. I've linked to the programme, which appears to include slides from most of the presentations already.
I found the background information about funder mandates very useful: I kind of knew what was being said as I followed the announcements at the time, but it is good to see a summary that clarifies things, and the main point that occurred to me is that the funders do indeed hold the key to both authors and publishers' involvement with open access repositories.
The repository metrics presentation was interesting and entertaining, but perhaps less relevant to our repository at the moment as our VC is already keen on the repository. But no doubt we will need to be able to demonstrate its value in order to keep that interest.
The Repository Statistics tool that was shown looked most interesting, although it was a pity that the presentation did not include a demonstration of the download due to lack of time.
I was less interested by the preservation metadata workshop, but I still gleaned some useful stuff from that, including considering how we might want to record any preservation processes that might be run at some point in the future.