Statistics

This page discusses some points related to gathering statistics for Digital Humanities and TEI projects.

Kinds of statistics

There are, in general, three reasons to produce statistics. It is important to know what you want to do with your statistics before you start, or you may end up being unable to measure what you need.

  1. to monitor workflow and progress towards set targets. These are most often reported internally and often include documents ingested per unit time, measured OCR error rates, server uptime, server response time, collection size, etc. They are typically tied to a workflow or service contract (real or implied) of some description.
  2. to find out how users are using a collection. These focus on how users interact with the collection and often include most popular pages, popular search terms, referring websites, etc. They are typically used in aggregate to measure what people are interested in and how easy a site is to use.
  3. to motivate participants. These focus on the way individual items / pages are used, and serve to motivate participants such as contributors to Open Archives.

Beyond the web

Measuring the use of HTML websites which don't allow redistribution of content and which can track users is relatively trivial. Proxying, caching, redistributability, multiple formats (ePub, PDF, Open Document Format, etc.) and printable formats all introduce complexity. For example, if a student downloads a Wikibook in ePub format (thinking they're getting it from the website, but really from their institutional caching proxy) and gives it to their sibling, who prints it for their parents, how does Wikipedia know how many people read that printed copy? The answer is that Wikipedia gave up counting hits long ago and only tracks a few measures of relative popularity for technical and housekeeping work.

Google Analytics

A number of TEI projects use Google Analytics as a tool for measuring user statistics. It uses a piece of JavaScript at the bottom of every page to track users as they browse the website and continue on their way across the web. It is hosted, so once you install a fragment of JavaScript, someone else takes care of everything else. Very good for tracking which search terms are bringing users to a site. Very good at filtering out bots, scripts and malware. Nice PDF reports. Only sees JavaScript-using web browsers. Many features assume that the website is selling something in a monetary transaction and are thus irrelevant to most TEI projects. Sites can grant read access to third parties to look at their stats, enabling sharing and comparison of statistics.

https://www.google.com/analytics/

Wikipedia

As an encyclopedia, Wikipedia is almost entirely dependent on primary sources for verification; it thus accumulates links to various primary-source-containing digital humanities projects over time. Counting the number of links from such an actively curated collection to a collection of primary sources can be used as a measure of the utility of the primary sources. There is a special page available for this: http://en.wikipedia.org/wiki/Special:LinkSearch Such measures are biased towards English-language resources and topics with good Wikipedia coverage (such as the world wars).
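The count can also be automated against the MediaWiki API, whose exturlusage module provides the data behind Special:LinkSearch. The Python sketch below is only an illustration: it assumes the API's current paging conventions and uses www.nzetc.org (from the table further down this page) purely as an example domain.

  # Count English Wikipedia pages that link to a given domain, using the
  # MediaWiki API's exturlusage module (the data behind Special:LinkSearch).
  import json
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"

  def count_wikipedia_links(domain):
      params = {
          "action": "query",
          "list": "exturlusage",
          "euquery": domain,    # e.g. "www.nzetc.org"
          "eunamespace": "0",   # only count links from articles
          "eulimit": "500",
          "format": "json",
      }
      total = 0
      while True:
          url = API + "?" + urllib.parse.urlencode(params)
          with urllib.request.urlopen(url) as response:
              data = json.load(response)
          total += len(data.get("query", {}).get("exturlusage", []))
          if "continue" not in data:
              return total
          params.update(data["continue"])  # follow the API's paging cursor

  print(count_wikipedia_links("www.nzetc.org"))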

Adding links to your own website to Wikipedia is usually considered bad form. For a detailed discussion of interactions between digital humanities projects and Wikipedia, see http://en.wikipedia.org/wiki/Wikipedia:Advice_for_the_cultural_sector

Webalizer

A tool for generating statistics from web server logs, Webalizer is perhaps one of the oldest and most staid ways of measuring statistics. It effectively creates a website out of the statistics, which can be accessed in the same way as any other website. Not particularly slick, and prone to over-reporting.

http://www.mrunix.net/webalizer/
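To make the "log-based statistics" idea concrete, here is a minimal Python sketch of the core of what such tools do: tally successful GET requests per URL from an Apache/NCSA combined-format access log. The filename access.log is an assumption, and nothing filters out robots, images or stylesheets here, which is one reason raw log counts over-report compared with visitor-oriented tools.

  # Tally successful GET requests per URL from a combined-format access log.
  # The filename "access.log" is assumed; adjust to your server's log path.
  import re
  from collections import Counter

  # combined format: host ident user [date] "request" status bytes "referer" "agent"
  LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) ')

  hits = Counter()
  with open("access.log", encoding="utf-8", errors="replace") as log:
      for line in log:
          match = LINE.match(line)
          if not match:
              continue  # skip malformed lines
          host, method, path, status = match.groups()
          if method == "GET" and status == "200":
              hits[path] += 1

  for path, count in hits.most_common(20):
      print(f"{count:8d}  {path}")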

Piwik

A newer tool; screenshots look similar to Urchin / Google Analytics.

http://piwik.org/

AWStats

Log-based statistics, similar to Webalizer.

http://awstats.sourceforge.net/

Urchin

The code base from which Google Analytics was developed.

http://www.google.com/urchin/

Mint

http://haveamint.com/

StatCounter

http://statcounter.com/

Examples of web statistics in the Digital Humanities

Project (Wikipedia article) | Project URL | Wikipedia links | Google hits per day | Raw hits per day
http://en.wikipedia.org/wiki/British_National_Corpus | http://www.natcorp.ox.ac.uk | 16 | |
http://en.wikipedia.org/wiki/Oxford_Text_Archive | http://ota.ahds.ac.uk/ | 4 | |
http://en.wikipedia.org/wiki/Perseus_Project | http://www.perseus.tufts.edu/ | 8410 | |
http://en.wikipedia.org/wiki/Women_Writers_Project | http://www.wwp.brown.edu/ | 15 | |
http://en.wikipedia.org/wiki/New_Zealand_Electronic_Text_Centre | http://www.nzetc.org/ | 2091 | |
http://en.wikipedia.org/wiki/The_SWORD_Project | http://www.crosswire.org/sword/ | 9 | |
http://en.wikipedia.org/wiki/FreeDict | http://freedict.org | 2 | |

See also

  1. http://www.jicwebs.org/standards.php
  2. http://en.wikipedia.org/wiki/Web_analytics
  3. http://www.jiscmu.ac.uk/ (I know they do monitoring / statistics of websites but they don't seem to have documentation of how/what they measure and why)
  4. http://news.netcraft.com/ Good for "what web server are/were they running?" type checking
  5. http://waybackmachine.org/ Good for finding the dates of site-wide updates