How are you using tags?

May 6, 2007

I’m wondering how people use tags, and how that differs from the way keywords are used in the scientific literature.

Usually when I add tags on web services like or Flickr, I tend to add as many tags as possible. For example, even if a man is not the main subject of a photo, I’ll use the tag “man”. The rationale is that you never know whether, one day, I (or someone else) will want to find a photo with a man and a tree (for example), the tree being the main subject. The problem is that I think I’m “diluting” the power of the main tags. Another example: for a website that helps people find post-doc jobs, I’ll use the following tags: “jobs postdoc research science grants PhD job”. The problem is that “grants” is not really related (there is no list of available grants; it’s only that some jobs require grants, and you never know what you’ll look for later).

In the biomedical sciences (and many other scientific fields, I guess), we use “keywords” when submitting a paper to a peer-reviewed journal. This helps in the selection of peer reviewers but, more importantly, it allows us to find interesting papers. The main difference from tags, imho, is that we only use a small number of keywords. For example, in this article, the author only used 4 keywords (and that’s considered sufficient). If this article had been a webpage, I would have added some more tags: MS, Mw, pI, proteomics, …

Why is there a difference? Is it relevant? How are you using tags? Is there a “good” strategy?

I collected tag lists from some users and tried to compare them (*) to my own tag list …

User (abbrev.)   N links   N tags   Mean citations per tag   Max citations for a tag
je                   401      757                     3.52                        84
ad                   614     1123                     3.07                        98
do                   113      195                     2.33                        27
ch                  2320      582                    15.86                       326
de                  3528     1550                    13.29                       923

With 5 people, I don’t pretend that it’s significant … We clearly have two groups: me and my friends (the first three rows), with < 1000 links and a mean citation per tag of roughly 2 to 3.5. The last two rows are from 2 people taken “at random” (well, I excluded people with < 1000 links, like the first group). When I plot the histogram of tag usage, I always get the same trend: a huge number of tags used only a few times and very few tags cited very often (as expected, see figure below).

[Figure: histogram of my tag usage]
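For readers who want to reproduce this kind of table on their own bookmarks, here is a minimal sketch in Python. It is not the script I used: the input format (one list of tags per link) and the function names `tag_stats` and `usage_histogram` are made up for illustration, and it assumes that “mean citation per tag” means the total number of tag uses divided by the number of distinct tags.

```python
from collections import Counter


def count_tag_uses(bookmarks):
    """Count how many links each tag is attached to.

    `bookmarks` is a hypothetical input format: one list of tags per link.
    A tag repeated on the same link is counted once for that link.
    """
    return Counter(tag for tags in bookmarks for tag in set(tags))


def tag_stats(bookmarks):
    """Return the four columns of the table for one user."""
    counts = count_tag_uses(bookmarks)
    n_tags = len(counts)
    total_uses = sum(counts.values())
    return {
        "n_links": len(bookmarks),
        "n_tags": n_tags,
        # Assumption: mean citation per tag = total tag uses / distinct tags.
        "mean_per_tag": total_uses / n_tags if n_tags else 0.0,
        "max_per_tag": max(counts.values(), default=0),
    }


def usage_histogram(bookmarks):
    """Map 'number of uses' to 'number of tags used that many times'."""
    return Counter(count_tag_uses(bookmarks).values())


if __name__ == "__main__":
    # Tiny made-up sample, only to show the calls.
    sample = [
        ["jobs", "postdoc", "research"],
        ["research", "science"],
        ["research", "python"],
        ["science"],
    ]
    print(tag_stats(sample))
    print(sorted(usage_histogram(sample).items()))
```

With real data, the histogram should show the long tail described above: many tags at the “used once or twice” end and only a handful of heavily used tags.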

Rashmi Sinha’s cognitive analysis of tagging is a good starting point for understanding the tagging process. But it would be nice to find other important resources and/or learn from others’ experiences …

(*) Data and Python scripts are available upon request. I had to write my own Python scripts to retrieve the data since, unfortunately, Michael G. Noll’s Unofficial Python API for research is not available anymore.

Update on May 6th (a bit later): Michael’s API is back! I’ll use it later 🙂 Thanks Michael. If you want to spend your holiday in Canada, you can go to the ACM Document Engineering 2007 where he’ll present a paper related to this subject. Another thing: when I looked again at the table above, I noticed two “trends” (remember, I don’t pretend to be exhaustive nor significant): people with < 1000 links have more tags than links; people with > 1000 links have more links than tags. Is there a more precise limit? I guess this has something to do with the fact that people are only interested in a “small” number of subjects and tend to collect as many variations (links/webpages) as possible on each subject.
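The tags-versus-links observation can be checked directly from the table. This is only a quick sketch: the numbers are copied from the table above, the helper name `tags_per_link` is made up, and with 5 users it says nothing about where a real limit might lie.

```python
# (user, number of links, number of tags), copied from the table in this post.
ROWS = [
    ("je", 401, 757),
    ("ad", 614, 1123),
    ("do", 113, 195),
    ("ch", 2320, 582),
    ("de", 3528, 1550),
]


def tags_per_link(n_links, n_tags):
    """Ratio of distinct tags to links; above 1 means more tags than links."""
    return n_tags / n_links


if __name__ == "__main__":
    # Sort by number of links to see where the ratio crosses 1.
    for user, n_links, n_tags in sorted(ROWS, key=lambda row: row[1]):
        ratio = tags_per_link(n_links, n_tags)
        side = "more tags than links" if ratio > 1 else "more links than tags"
        print(f"{user}: {n_links} links, {n_tags} tags, ratio {ratio:.2f} ({side})")
```

Sorting by number of links makes the gap visible: there is no user between 614 and 2320 links in this sample, so the crossover could be anywhere in that range.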

