Statistics

This page discusses some points related to the gathering of statistics for Digital Humanities and TEI projects.

Kinds of statistics
There are in general three reasons to produce statistics. It is important to know what you want to do with your statistics before you start or you may end up being unable to measure what you need.
 * 1) to monitor workflow and progress towards set targets. These are most often reported internally and often include documents ingested per unit time, measured OCR error rates, server uptime, server response time, collection size, etc. They are typically tied to a workflow or service contract (real or implied) of some description.
 * 2) to find out how users are using a collection. These focus on how users interact with the collection and often include most popular pages, popular search terms, referring websites, etc. They are typically used in aggregate to measure what people are interested in and how easy a site is to use.
 * 3) to motivate participants. These focus on how individual items or pages are used, and serve to motivate participants such as contributors to Open Archives.

Beyond the web
Measuring the use of HTML websites which don't allow redistribution of content and which can track users is relatively trivial. Proxying, caching, redistributability, multiple formats (ePub, PDF, Open Document Format, etc.) and printable formats all introduce complexity. For example, if a student downloads a Wikibook in ePub format (thinking they're getting it from the website, but really fetching it from their institution's caching proxy) and gives it to their sibling, who prints a copy for their parents, how does Wikipedia know how many people read that printed copy? The answer is that Wikipedia gave up counting hits long ago and only tracks a few measures of relative popularity for technical and housekeeping work.

Google Analytics
A number of TEI projects use Google Analytics as a tool for measuring user statistics. It uses a piece of JavaScript at the bottom of every page to track users as they browse the website and continue on their way across the web. It is hosted, so once you install a fragment of JavaScript someone else takes care of everything else. It is very good at tracking which search terms are bringing users to a site, and at filtering out bots, scripts and malware, and it produces nice PDF reports. It only sees JavaScript-using web browsers, and many of its features assume that the website is selling something in a monetary transaction, which makes them irrelevant to most TEI projects. Sites can grant third parties read access to their stats, enabling sharing and comparison of stats.
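The underlying mechanism — a snippet on each page fires a small request (a "beacon") back to a collector, carrying the page URL and referrer — can be sketched on the collector side in Python. Everything here (the `/collect` endpoint, the `p` and `r` parameter names) is a hypothetical illustration of the general technique, not Google's actual protocol.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Tally of pageviews per page path; a real collector would persist this.
pageviews = Counter()

def record_beacon(beacon_url):
    """Parse a hypothetical tracking-beacon URL and count the pageview.

    Assumes the page's JavaScript snippet requested something like
    /collect?p=<page>&r=<referrer> -- illustrative only.
    """
    query = parse_qs(urlparse(beacon_url).query)
    page = query.get("p", ["(unknown)"])[0]
    referrer = query.get("r", [""])[0]
    pageviews[page] += 1
    return page, referrer

record_beacon("http://stats.example.org/collect?p=/Guidelines/P5&r=http://google.com/search?q=tei")
record_beacon("http://stats.example.org/collect?p=/Guidelines/P5")
```

The referrer field is what lets a hosted service work out which search terms are bringing users to a site.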

It is also used on parts of the http://www.tei-c.org/ website (P5 Guidelines, most navigation: yes; P4 Guidelines, some archival content: no).

https://www.google.com/analytics/

Wikipedia
As an encyclopedia, Wikipedia is almost entirely dependent on primary sources for verification; it thus accumulates links to various primary-source-containing digital humanities projects over time. Counting the number of links from such an actively curated collection to a collection of primary sources can be used as a measure of the utility of the primary sources. There is a page available for this: http://en.wikipedia.org/wiki/Special:LinkSearch. Such measures are biased towards English-language resources and topics with good Wikipedia coverage (such as the world wars).
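Special:LinkSearch also has a machine-readable counterpart in the MediaWiki API (`list=exturlusage`), which makes the link-counting scriptable. A minimal sketch, assuming the API's usual JSON response shape; the target domain is a placeholder:

```python
from urllib.parse import urlencode

API = "http://en.wikipedia.org/w/api.php"

def linksearch_url(target, limit=500):
    """Build a MediaWiki API query for pages linking to `target`
    (the API counterpart of Special:LinkSearch)."""
    params = {
        "action": "query",
        "list": "exturlusage",
        "euquery": target,   # e.g. "*.example-dh-project.org" (placeholder)
        "eulimit": limit,
        "format": "json",
    }
    return API + "?" + urlencode(params)

def count_linking_pages(response):
    """Count distinct Wikipedia pages in a decoded exturlusage response."""
    hits = response.get("query", {}).get("exturlusage", [])
    return len({hit["title"] for hit in hits})
```

Counting distinct titles rather than raw links avoids double-counting a Wikipedia article that cites several pages of the same collection.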

Adding links to your own website to Wikipedia is usually considered bad form. For a detailed discussion of interactions between digital humanities projects and Wikipedia, see http://en.wikipedia.org/wiki/Wikipedia:Advice_for_the_cultural_sector

Webalizer
A tool for generating statistics from web server logs, Webalizer is perhaps one of the oldest and most staid ways of measuring statistics. It effectively creates a website out of the statistics, which can be accessed in the same way as any other website. It is not particularly slick, and is prone to over-reporting.

http://www.mrunix.net/webalizer/

Piwik
A newer tool; its screenshots look similar to Urchin / Google Analytics.

http://piwik.org/

AWStats
Log-based statistics, similar to Webalizer. See for example: http://vgstats.huygensinstituut.knaw.nl/awstats/awstats.pl?config=vg

http://awstats.sourceforge.net/

Urchin
The code base from which Google Analytics was developed.

http://www.google.com/urchin/

Mint
http://haveamint.com/

StatsCounter
http://statcounter.com/

Examples of web statistics in the Digital Humanities
Examples stolen from the Wikipedia page for the TEI. The reason that 'Raw hits per day' is so much higher than 'Google pageviews per day' is that the raw hits include requests by real users for images, JavaScript, etc., as well as requests from bots and scripts (including, of course, Google's own web crawlers). Raw hits are perhaps a useful measure of server load, but pageviews are a much better measure of how many real people are really using a website.
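The gap between the two numbers can be made concrete with a small log filter. A sketch, assuming Common Log Format lines and a deliberately crude asset/bot heuristic (real log analysers such as AWStats ship much larger filter lists):

```python
import re

# Requests for page assets and requests from known crawlers inflate
# raw hit counts without representing a human reading a page.
ASSET_RE = re.compile(r"\.(png|jpe?g|gif|css|js|ico)(\?|\s|$)", re.IGNORECASE)
BOT_RE = re.compile(r"bot|crawler|spider", re.IGNORECASE)

def hits_and_pageviews(log_lines):
    """Count (raw hits, pageviews) over Common Log Format lines."""
    raw = pages = 0
    for line in log_lines:
        raw += 1
        if BOT_RE.search(line) or ASSET_RE.search(line):
            continue  # asset request or crawler: a hit, not a pageview
        pages += 1
    return raw, pages
```

On a typical log, one human pageview drags along many asset hits plus crawler traffic, which is exactly why the two columns diverge so sharply.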