Statistics

This page discusses some points related to gathering statistics for Digital Humanities and TEI projects.

Kinds of statistics

There are, in general, three reasons to produce statistics. It is important to know what you want to do with your statistics before you start, or you may end up being unable to measure what you need.

  1. to monitor workflow and progress towards set targets. These are most often reported internally and often include documents ingested per unit time, measured OCR error rates, server uptime, server response time, collection size, etc. They are typically tied to a workflow or service contract (real or implied) of some description.
  2. to find out how users are using a collection. These focus on how users interact with the collection and often include most popular pages, popular search terms, referring websites, etc. They are typically used in aggregate to measure what people are interested in and how easy a site is to use.
  3. to motivate participants. These focus on the way individual items / pages are used, and serve to motivate participants such as contributors to Open Archives.

Beyond the web

Measuring the use of HTML websites which don't allow redistribution of content and which can track users is relatively trivial. Proxying, caching, redistributability, multiple formats (ePub, PDF, Open Document Format, etc.) and printable formats all introduce complexity. For example, if a student downloads a Wikibook in ePub format (thinking they're getting it from the website, but really from their institutional caching proxy) and gives it to their sibling, who prints it for their parents, how does Wikipedia know how many people read that printed copy? The answer is that Wikipedia gave up counting hits long ago and only tracks a few measures of relative popularity for technical and housekeeping work.

Google Analytics

A number of TEI projects use Google Analytics as a tool for measuring user statistics. It uses a piece of JavaScript at the bottom of every page to track users as they browse the website and continue on their way across the web. It is hosted, so once you install a fragment of JavaScript, someone else takes care of everything else. Very good for tracking which search terms are bringing users to a site. Very good at filtering out bots, scripts and malware. Nice PDF reports. Only sees JavaScript-using web browsers. Many features assume that the website is selling something in a monetary transaction and are thus irrelevant to most TEI projects. Sites can grant read access to third parties to look at their stats, enabling sharing and comparison of statistics.

https://www.google.com/analytics/

Wikipedia

As an encyclopedia, Wikipedia is almost entirely dependent on primary sources for verification; it thus accumulates links to various primary-source-containing digital humanities projects over time. Counting the number of links from such an actively curated collection to a collection of primary sources can be used as a measure of the utility of the primary sources. There is a special page available for this: http://en.wikipedia.org/wiki/Special:LinkSearch Such measures are biased towards English-language resources and topics with good Wikipedia coverage (such as the world wars).
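The count can also be automated against the MediaWiki API, whose exturlusage module provides the data behind Special:LinkSearch. The Python sketch below is only an illustration: it assumes the API's current paging conventions and uses www.nzetc.org (from the table further down this page) purely as an example domain.

  # Count English Wikipedia pages that link to a given domain, using the
  # MediaWiki API's exturlusage module (the data behind Special:LinkSearch).
  import json
  import urllib.parse
  import urllib.request

  API = "https://en.wikipedia.org/w/api.php"

  def count_wikipedia_links(domain):
      params = {
          "action": "query",
          "list": "exturlusage",
          "euquery": domain,    # e.g. "www.nzetc.org"
          "eunamespace": "0",   # only count links from articles
          "eulimit": "500",
          "format": "json",
      }
      total = 0
      while True:
          url = API + "?" + urllib.parse.urlencode(params)
          with urllib.request.urlopen(url) as response:
              data = json.load(response)
          total += len(data.get("query", {}).get("exturlusage", []))
          if "continue" not in data:
              return total
          params.update(data["continue"])  # follow the API's paging cursor

  print(count_wikipedia_links("www.nzetc.org"))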

Adding links to your own website to Wikipedia is usually considered bad form. For a detailed discussion of interactions between digital humanities projects and Wikipedia, see http://en.wikipedia.org/wiki/Wikipedia:Advice_for_the_cultural_sector

Webalizer

A tool for generating statistics from web server logs, Webalizer is perhaps one of the oldest and most staid ways of measuring statistics. It effectively creates a website out of the statistics, which can be accessed in the same way as any other website. Not particularly slick, and prone to over-reporting.

http://www.mrunix.net/webalizer/
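To make the "log-based statistics" idea concrete, here is a minimal Python sketch of the core of what such tools do: tally successful GET requests per URL from an Apache/NCSA combined-format access log. The filename access.log is an assumption, and nothing filters out robots, images or stylesheets here, which is one reason raw log counts over-report compared with visitor-oriented tools.

  # Tally successful GET requests per URL from a combined-format access log.
  # The filename "access.log" is assumed; adjust to your server's log path.
  import re
  from collections import Counter

  # combined format: host ident user [date] "request" status bytes "referer" "agent"
  LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) ')

  hits = Counter()
  with open("access.log", encoding="utf-8", errors="replace") as log:
      for line in log:
          match = LINE.match(line)
          if not match:
              continue  # skip malformed lines
          host, method, path, status = match.groups()
          if method == "GET" and status == "200":
              hits[path] += 1

  for path, count in hits.most_common(20):
      print(f"{count:8d}  {path}")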

Piwik

A newer tool; screenshots look similar to Urchin / Google Analytics.

http://piwik.org/

AWStats

Log-based statistics, similar to Webalizer.

http://awstats.sourceforge.net/

Urchin

The code base from which Google Analytics was developed.

http://www.google.com/urchin/

Mint

http://haveamint.com/

StatCounter

http://statcounter.com/

Examples of web statistics in the Digital Humanities

Project (Wikipedia article) | Project URL | Wikipedia links | Google hits per day | Raw hits per day
http://en.wikipedia.org/wiki/British_National_Corpus | http://www.natcorp.ox.ac.uk | 16 | |
http://en.wikipedia.org/wiki/Oxford_Text_Archive | http://ota.ahds.ac.uk/ | 4 | |
http://en.wikipedia.org/wiki/Perseus_Project | http://www.perseus.tufts.edu/ | 8410 | |
http://en.wikipedia.org/wiki/Women_Writers_Project | http://www.wwp.brown.edu/ | 15 | |
http://en.wikipedia.org/wiki/New_Zealand_Electronic_Text_Centre | http://www.nzetc.org/ | 2091 | |
http://en.wikipedia.org/wiki/The_SWORD_Project | http://www.crosswire.org/sword/ | 9 | |
http://en.wikipedia.org/wiki/FreeDict | http://freedict.org | 2 | |

See also

  1. http://www.jicwebs.org/standards.php
  2. http://en.wikipedia.org/wiki/Web_analytics
  3. http://www.jiscmu.ac.uk/ (I know they do monitoring / statistics of websites but they don't seem to have documentation of how/what they measure and why)
  4. http://news.netcraft.com/ Good for "what web server are/were they running?" type checking
  5. http://waybackmachine.org/ Good for finding the dates of site-wide updates