Category: Projects

People love free e-mail services

… at least in my biased population. This observation comes from a file containing all the people interested in ISAL, which I had to parse (the file, not the people). It’s a tab-delimited file with names, e-mail addresses, locations, interests, etc. (379 unique e-mail addresses in total). I used Python for that purpose.
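For the curious, the parsing itself is nothing special; here is a minimal sketch of the idea (the file name and the column holding the e-mail address are hypothetical, since they depend on the actual file layout):

#!/usr/bin/python
# Minimal sketch: count e-mail domains found in a tab-delimited file.
# The file name and the column index (2) are hypothetical examples.
emails = set()
for line in open('members.txt'):
    fields = line.rstrip('\n').split('\t')
    if len(fields) > 2 and '@' in fields[2]:
        emails.add(fields[2].strip().lower())

domains = {}
for email in emails:
    domain = email.split('@')[1]
    domains[domain] = domains.get(domain, 0) + 1

for domain, count in sorted(domains.items(), key=lambda x: -x[1]):
    print "%s: %.2f%%" % (domain, 100.0 * count / len(emails))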

Since I had all the e-mail addresses in a Python set (*), I decided to do some stats. I know it’s useless, but here are the results: 30.34% Hotmail accounts, 27.18% Yahoo, 7.39% Gmail, 7.12% Rediffmail (a popular e-mail service in India) and “only” 7.65% KULeuven. As we can see, members mainly use free e-mail accounts (probably because the majority of them are students – the “S” in ISAL). And less than 10% of members have a KULeuven address, although ISAL is a student organisation from Leuven (the “L” in ISAL). Of course, R can produce nice charts. And since its documentation states that pie charts are “a very bad way of displaying information”, I also produced a regular bar plot.

Pie chart of ISAL members’ e-mail accounts
Bar plot of ISAL members’ e-mail accounts

With that very interesting information in mind, I’ll now be able to go to sleep (and to work, because a lot of work is waiting for me this month!).

(*) a set is interesting here because it cannot contain duplicates, and the file does contain duplicates.

Plugins for Digital Object Identifier lookup

I’ve just written some “search plugins” for Firefox (1.x and 2.x) that allow you to quickly look up a specific Digital Object Identifier (DOI). These DOIs are used more and more in the biomedical sciences. One of their interesting features is that they allow direct linking to the scientific article.

The plugins are available here. If you already have Firefox 2, the installation procedure is very easy: all you have to do is go to the plugins page, click on the small arrow near your Firefox search box and choose the “Add DOI lookup” option; it will then be installed automatically for you.
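As an aside (and independently of how the plugin itself is implemented), a DOI lookup essentially boils down to resolving the identifier through dx.doi.org. Here is a small Python illustration of that idea, using an arbitrary example DOI:

#!/usr/bin/python
# Illustration only: resolve a DOI to the publisher's article page.
# The DOI below is just an example.
import urllib2

doi = "10.1186/1740-3391-4-10"
response = urllib2.urlopen("http://dx.doi.org/" + doi)
print response.geturl()   # the final URL after redirection, i.e. the article page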

How to remove files ending with '~'

The vim text editor produces a file ending with a tilde (~) as a kind of backup of the currently modified file (this is a default behaviour). On my MS-Windows machine (Pentium M, 1.73 GHz), I was tired of manually deleting these files, so I first used the “Search” option in the File Explorer. After some time, I got tired of waiting for the results.

So I wrote a Python script and a batch script to find all these files. They run much faster than the Search GUI. The first time I launch them, they are still slow (but faster than the GUI). As you can see in the graph below, the second time I launch them, they run at least 10 times faster. I’m not a specialist, but I guess it has something to do with caching at the OS level. On the first run, the batch script is 20% slower than the Python script; after that, the Python script is 50% slower than the batch script (but between 3.7 s and 5.6 s, the difference is not big).

Comparison of .bat and .py files

Here are the scripts: find files ending with ~ in Batch (the problem is that you have to compute the duration yourself), find files ending with ~ in Python and remove files ending with ~ in Python (each script is about 1 kb).
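If you just want the idea without downloading anything, here is a minimal sketch of the Python approach (not the exact scripts linked above): it walks a directory tree with os.walk and lists, or optionally deletes, the files ending with ‘~’. The starting directory is a hypothetical example.

#!/usr/bin/python
# Minimal sketch: find (and optionally remove) files whose name ends with '~'.
# The starting directory is a hypothetical example.
import os

startdir = 'C:\\Documents'
delete = False   # set to True to actually remove the files

for root, dirs, files in os.walk(startdir):
    for name in files:
        if name.endswith('~'):
            path = os.path.join(root, name)
            print path
            if delete:
                os.remove(path)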

Each of these scripts was run as the first application after my computer was turned on. I didn’t repeat the measurements (doing real stats wasn’t the goal anyway). Deleting all the files (after having found them) took 5.4 s. It just goes to show what one can do just before the beginning of a lab seminar.

Simple Sitemap.xml builder

In a recent post, Alexandre wrote about web indexing and pointed to a nice tool for webmasters: the sitemap. The Sitemap Protocol “allows you to inform search engines about URLs on your websites that are available for crawling” (since it’s a Google creation, it seems that only Google is using it, according to Alexandre).

If you have shell access to your webserver and Python installed on it, Google provides a nice Python script to automatically create your sitemap.

I don’t have any shell access to my webserver 😦 But I can write a simple Python script 🙂 Here it is: sitemapbuilder.py (4 kb). After specifying the local directory where all your files are and the base URL of your on-line website (yes, you need to edit the script), launch it and voilà! You can now upload your sitemap.xml file and tell web crawlers where to find the information on your website.

Interested in the other options you can specify?

  • You can specify an array of accepted file extensions. By default, I’ve put ‘htm’, ‘html’ and ‘php’ but you can add ‘pdf’ if you want.
  • You can specify an array of filenames to strip. By default, I strip all ‘index.*’ (with * = one of the accepted extensions) because http://www.poirrier.be is the same as http://www.poirrier.be/index.html but easier to remember
  • You can specify a change frequency (it will be the same for all files)
  • You can specify a priority (it will also be the same for all files and even omitted if equal to the default value)

On the technical side, there is nothing fancy (I don’t even use any XML tool to generate the file). I was impressed by how easy it is to walk through a directory tree with the Python os.walk function (it could be interesting to use it in the future blog system I mentioned earlier).
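To give an idea of the approach, here is a simplified sketch (not the script itself): the local directory, base URL, extensions and change frequency below are hypothetical examples, the real script handles the options listed above, and the sketch uses the sitemaps.org namespace mentioned in the edit below.

#!/usr/bin/python
# Simplified sketch of a sitemap builder: walk a local directory tree
# with os.walk and write one <url> entry per accepted file.
# The directory, base URL and change frequency are hypothetical examples.
import os

localdir = '/home/me/website'
baseurl = 'http://www.example.org/'
extensions = ['htm', 'html', 'php']
changefreq = 'monthly'

out = open('sitemap.xml', 'w')
out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
for root, dirs, files in os.walk(localdir):
    for name in files:
        if name.split('.')[-1].lower() not in extensions:
            continue
        relpath = os.path.join(root, name)[len(localdir):].lstrip('/\\')
        relpath = relpath.replace('\\', '/')
        # strip 'index.*' so that http://.../ is used instead of http://.../index.html
        if relpath.split('/')[-1].startswith('index.'):
            relpath = '/'.join(relpath.split('/')[:-1])
        out.write('  <url>\n')
        out.write('    <loc>%s%s</loc>\n' % (baseurl, relpath))
        out.write('    <changefreq>%s</changefreq>\n' % changefreq)
        out.write('  </url>\n')
out.write('</urlset>\n')
out.close()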

Finally, you can see the sitemap.xml generated for my family website.

Edit on November 17th: it seems that Google, Yahoo! and Microsoft have teamed up to release a “more official” sitemaps specification: http://www.sitemaps.org/.

Automated Pubmed reference to BibTeX

In biology, we often need to use PubMed, a search engine for biomedical citations from MEDLINE and other life science journals.

In the MS-Windows world, you have nice, proprietary tools (like Reference Manager or Endnote) that retrieve citations from PubMed, store them in a database and let you use them in proprietary word processing software (in fact, in MS-Word only, since neither WordPerfect nor OpenOffice.org is supported). If you use BibTeX (for LaTeX) as your citation repository, there aren’t many tools. The best one, imho, is JabRef, a free reference manager written in Java (for me, the only “problem” is that it adds custom, non-BibTeX tags). Or you can edit the BibTeX file yourself with any text editor.

The problem with manual editing is that it is error-prone (even when copying/pasting from the web). Since Python programming is my hobby horse at the moment, there are two solutions to this problem:

  1. Use Biopython to get a reference from PubMed, but are you ready to pull in a huge module dependency just to use one function?
  2. Write your own Python script, using a PubMed URL to download your reference and a little bit of XML parsing to extract the relevant info (one could use the ESearch and EFetch tools, but my lazy nature tells me to simply use the URL).

Obviously, I chose to write my own Python script. Starting from this PubMed XML format example (full DTDs), the BibTeX output for each reference should look like this:

@article{poirrier06,
  author = {Poirrier, J.E. and Poirrier, L. and Leprince, P. and
Maquet, P.},
  title = {Gemvid, an open source, modular, automated activity
recording system for rats using digital video},
  year = 2006,
  journal = {Journal of circadian rhythms},
  volume = 4,
  pages = {10},
  pmid = 16934136,
  doi = {10.1186/1740-3391-4-10}
}

The script is here (4 kb). First, use PubMed to check the reference you want, then take its PubMed ID (PMID) and launch the program, redirecting the output to your BibTeX file, for example:

./pyP2B.py 16934136 >> myrefs.bib

If you like, you can edit the script to change the tab size (here = 2).

How does it work?

  1. With PubMed, I do not use the official tools but a plain HTTP query; it is much simpler. The script requests the citation corresponding to the PMID. Since it gets an HTTP answer, I need to parse this answer and replace some character entities to obtain a valid XML file.
  2. Once I have the XML file, and after some checking, I use XPath expressions from lxml (for me, XPath is quick and dirty compared to writing a DOM/SAX parser, but it works!).
  3. Then the script simply prints the result to the standard output (even when it’s an error; a possible improvement would be to print errors to the error output). You then only need to redirect this output into your BibTeX file.
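To illustrate these steps, here is a minimal sketch (not the actual pyP2B code: it goes through the EFetch URL, which the script itself avoids in favour of the plain PubMed URL, it skips the entity clean-up and it extracts only a few fields):

#!/usr/bin/python
# Sketch only: fetch one PubMed record through the EFetch URL and print
# a simplified BibTeX entry on standard output.
import sys
import urllib2
from lxml import etree

pmid = sys.argv[1]
url = ("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
       "?db=pubmed&retmode=xml&id=" + pmid)
root = etree.fromstring(urllib2.urlopen(url).read())

def first(xpath):
    # return the first result of an XPath query, or an empty string
    result = root.xpath(xpath)
    if result:
        return result[0]
    return ''

authors = []
for author in root.xpath('//AuthorList/Author'):
    lastname = author.findtext('LastName') or ''
    initials = author.findtext('Initials') or ''
    authors.append(lastname + ', ' + initials)

print "@article{pmid%s," % pmid
print "  author = {%s}," % ' and '.join(authors)
print "  title = {%s}," % first('//ArticleTitle/text()')
print "  year = %s," % first('//PubDate/Year/text()')
print "  journal = {%s}," % first('//Journal/Title/text()')
print "  volume = %s," % first('//JournalIssue/Volume/text()')
print "  pages = {%s}," % first('//Pagination/MedlinePgn/text()')
print "  pmid = %s" % pmid
print "}"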

Edit on October 23rd: this script has errors when dealing with non-ASCII characters like the “ö” in “Angelika Görg”. I won’t fix it for the moment.

Dasher: where do you want to write today?

Hannah Wallach put her slides about Dasher on the web (quite similar to these ones from her mentor). Dasher is an “information-efficient text-entry interface”.

What got me interested in Dasher is her introduction about the ways we communicate with computers and how they help us communicate with them: keyboards (even reduced ones), gesture alphabets, text-entry prediction, etc. I am interested in the ways people can enter text on a touch screen, without a physical keyboard. Usually, people use a virtual keyboard (as in tourist kiosks or handheld devices), but it is apparently not the best solution.

The Dasher team came up with an interesting way of entering text, where pulling and pushing elements on screen form words (with the help of the computer, which “guesses” words from the previous letters). It requires a lot of visual attention, but this can be turned into a feature for people unable to use their hands (for a physical keyboard and mouse; one man even wrote his entire B.Sc. thesis with Dasher and his eyes!).

You can download Dasher for a wide range of operating systems and even try it in your web browser (Java required) (btw, it’s the first piece of software I’ve seen that has adopted the GNU GPL 3). After reading the short explanation, you’ll easily be able to write your own words, phrases and texts.

The Dasher team is interested in the way people interact with the computer; they use a language model to display the next letters. On the human side, I wonder whether this kind of tool has an influence on how the human brain works. Visual memory should be involved with a physical keyboard (“where are the letters?”) but also here (same question, except that the location of the letters changes all the time). Here, letters move, but one can learn that boxes are bigger when the probability of the next letter is higher. How is the brain involved in such a system? What exactly is it learning? Are there fast and slow learners in this task? It could be interesting to look into this …

Looking for a good free UML2 modelling editor …

I was using Poseidon as a modelling editor for my UML2 diagrams. It is Java-based and I was able to run it on both GNU/Linux and MS-Windows. It is not free software, but the Community Edition was free (as in “free beer”) and had all the tools I modestly needed. The only catch: all the diagrams had a string at the bottom stating that they were not meant to be used for commercial purposes (for educational purposes, I wrote a small program that removes it).

Today, Gentleware’s boss announced that Poseidon will go away. He said it will be replaced by Apollo for Eclipse and by a new licensing model (renting) starting at €5 per month. An unregistered version will be available, but it won’t be possible to export, print, save, etc.

First, this shows one of the problems of using free-as-in-free-beer-but-proprietary software (as opposed to really free software): the owner can change the licence, the software’s availability and the usage conditions at any time. Secondly, although I understand the move from a commercial/business point of view (if they need money), I wonder whether they are not depriving themselves of a potential user base (Community Edition users who would recommend the paid version in a professional environment).

Anyway, I am now looking for a new, good and free UML2 modelling editor. After a quick search, I’ve found:

  • ArgoUML, a Java-based editor supporting UML1.4, able to import/export Java code (*) (BSD licence)
  • Umbrello UML Modeller, written for KDE, which can import/generate code from/for Python, Java, C++, … (GPL)
  • BOUML, which I think is the only one in this list that supports UML2; it can generate code for C++, Java (and IDL) (GPL)
  • PyUT, a class diagram editor written in Python supporting UML1.3; it can import/export Python and Java source code and export C++ code (GPL)

I really don’t have time at the moment to test all these tools. As soon as I have time, I’ll give them a try. Meanwhile, if you have other suggestions and/or any experience with one of them, please feel free to post a comment.

(*) Although I am not comfortable with code auto-generation tools, the ability to import/generate code for a programming language is a good indication that the modelling tool understands and takes into account that language’s specificities. You don’t want Java syntax highlighting when developing a Python application.

Playing with Python and Gadfly

Following my previous post where I retrieved EXIF tags from photos posted on Flickr, here is the next step: my script now stores data in a database.

There are a lot of free database wrappers for Python. Although I first thought of using pysqlite (because I am already using SQLite in another project), I decided to use Gadfly, a real SQL relational database system entirely written in Python. It does not need a separate server, it complies with the Python DB-API (allowing easy changes of database system) and it’s free.

Using Gadfly is very easy and its tutorial is very comprehensible. Put together, here is an example of creating a database, adding some data and retrieving it (testGadfly.py):

#!/usr/bin/python
# test for Gadfly
import os
import gadfly
import time

DBdir = 'testDB'
DBname = 'testDB'

if os.path.exists(DBdir):
    print 'Database already exists. I will just open it'
    connection = gadfly.gadfly(DBname, DBdir)
    cursor = connection.cursor()
else:
    print 'Database not present. I will create it'
    os.mkdir(DBdir)
    connection = gadfly.gadfly()
    connection.startup(DBname, DBdir)
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE camera (t FLOAT, make VARCHAR, model VARCHAR)")

print 'Add some items'
t = float(time.time())
cmake = 'Nikon'
cmodel = 'D400'
cmd = "INSERT into camera (t, make, model) VALUES\
    ('" + str(t) + "','" + cmake + "','" + cmodel + "')"
cursor.execute(cmd)

print 'Retrieve all items'
cursor.execute("SELECT * FROM camera")
for x in cursor.fetchall():
    print x

connection.commit()
print 'Done!'

Regarding the initial project, the script became too long to be pasted in this post, but you can download it here: flickrCameraQuantifier2.py (5 kb). To run it, you need the EXIF and Flickr wrappers and the Gadfly DB system installed. At the beginning of the script, you can define the total number of iterations (sets of queries) you want (variable niterations), the sleep duration between queries (variable sleepduration) and the number of photos to get for each query (variable nphotostoget). Everything is then stored in a Gadfly database (default name: cameraDB). If you want to read what is stored, here is a very basic script: flickrCQ2Reader.py.
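The core of the script looks roughly like this (a simplified sketch, not the downloadable script: it reuses photos_getrecent and the EXIF code from the previous post and the camera table from the Gadfly example above; the database directory name and the variable values are hypothetical, and all error handling is omitted):

#!/usr/bin/python
# Simplified sketch of the main loop: at each iteration, fetch the most
# recent photos, read their EXIF make/model and store them in the Gadfly
# 'camera' table (created as in the testGadfly.py example above).
# The directory name and the values below are hypothetical.
import time
import urllib2

import EXIF
import flickr
import gadfly

niterations = 5
sleepduration = 5
nphotostoget = 1

connection = gadfly.gadfly('cameraDB', 'cameraDBdir')
cursor = connection.cursor()

for i in range(niterations):
    for img in flickr.photos_getrecent('', str(nphotostoget), '1'):
        url = str(img.getURL(size='Original', urlType='source'))
        # save the photo locally, then read its EXIF tags
        f = open('tmp.jpg', 'wb')
        f.write(urllib2.urlopen(url).read())
        f.close()
        tags = EXIF.process_file(open('tmp.jpg', 'rb'))
        cmake = str(tags.get('Image Make', ''))
        cmodel = str(tags.get('Image Model', ''))
        # same naive INSERT as in testGadfly.py above
        cmd = ("INSERT into camera (t, make, model) VALUES ('" +
               str(time.time()) + "','" + cmake + "','" + cmodel + "')")
        cursor.execute(cmd)
    time.sleep(sleepduration)

connection.commit()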

For example, I’ve just run 125 queries (with 5 s between each query). I got 88 photos (70.4% of the queries), 27 of which had no EXIF tags (30.68% of all the photos). Among the camera makers, Canon accounts for 27%, Fuji for 11%, Nikon for 18% and Sony for 21% of all the photos with EXIF tags at that moment. This is approximately what Flagrant Disregard found. I don’t have more time right now, but one could improve the data retrieval script to automate the statistics and their presentation …

Edit on October 9th, 2006: added the links to the missing scripts

Playing with Python, EXIF tags and Flickr API

Some days ago, I was quite amused by Flagrant Disregard’s Top Digital Cameras: every day, these people take 10,000 photos uploaded to Flickr and look at their camera makes and models. This kind of study is interesting because one can see what people are actually using and which camera models can give good results (with a good photographer, of course). I was just disappointed that they say nothing about their sampling method or the statistics they apply to their data. So I thought I could do a similar survey and publish the results along with the method.

Once more, I’ll do this with Python. Instead of reading binary data from JPEG files to look at EXIF tags myself, I’ll use external “modules” (wrappers). After a small survey of the web, GeneCash’s EXIF.py seemed to be the best solution. Indeed, to get the camera make and model of a test image, the code is simply:

import EXIF
f = open('testimage.jpg', 'rb')
tags = EXIF.process_file(f)
print "Image Make: %s - Image Model: %s" % (tags['Image Make'], tags['Image Model'])

Now, to access Flickr’s most recent photos, I had two options:

  1. I open the Flickr “most recent photos” page and parse the HTML to get the photos. This can be done with regular expressions or XML parsing.
  2. I use the Flickr API, where there is a specially designed method: flickr.photos.getRecent

I chose the second option and looked at the three kits for Python referenced by Flickr:

  • The FlickrClient author admits his kit is outdated and links to Beej’s Python Flickr API
  • Beej’s Python Flickr API seems interesting, but there isn’t much documentation and, being a beginner in Python, I was quickly lost
  • Finally, James Clarke’s flickr.py seemed to be a nice and easy-to-use wrapper, so I decided to go with it.

Unfortunately, the getRecent method isn’t implemented (James Clarke hasn’t maintained this wrapper since 2005). I tried to use the photos_search method (a wrapper for the flickr.photos.search method), hoping that calling it without any tag would give me the most recent photos. But some people probably thought of it before me, because Flickr has disabled parameterless searches. Look at the error:

import flickr
z = flickr.photos_search('', False, '', '', '','', '', '', '', '', '2', '', '')

Traceback (most recent call last):
[...]
FlickrError: ERROR [3]: Parameterless searches have been disabled. Please use flickr.photos.getRecent instead.

So, I was forced to implement the getRecent method. Fortunately, it wasn’t too difficult. Here is the code you can insert at line 589 in James Clarke’s flickr.py (or download my flickr.py here):

def photos_getrecent(extra='', per_page='', page=''):
    """Returns a list of Photo objects.

    """
    method = 'flickr.photos.getRecent'

    data = _doget(method, API_KEY, extra=extra, per_page=per_page, page=page)
    photos = []
    if isinstance(data.rsp.photos.photo, list):
        for photo in data.rsp.photos.photo:
            photos.append(_parse_photo(photo))
    else:
        photos = [_parse_photo(data.rsp.photos.photo)]
    return photos

Now that I have Python, an EXIF wrapper and a Flickr wrapper with a getRecent method, I can write a small script that fetches the 10 most recent images from Flickr and displays their camera make and model (if they have one) (flickrCameraQuantifier.py):

#!/usr/bin/python
import urllib2

import EXIF
import flickr

recentimgs = flickr.photos_getrecent('', '10', '1')

imgurls = []
for img in recentimgs:
    try:
        imgurls.append(str(img.getURL(size='Original', urlType='source')))
    except:
        print 'Error while getting an image URL'

for imgurl in imgurls:
    imgstream = urllib2.urlopen(imgurl)
    # save the image
    f = open('tmp.jpg', 'wb')
    for line in imgstream.readlines():
        f.write(line)
    f.close()
    # get the tags
    f = open('tmp.jpg', 'rb')
    try:
        tags = EXIF.process_file(f)
        if len(str(tags['Image Make'])) > 0:
            if len(str(tags['Image Model'])) > 0:
                print "Image Make: %s - Image Model: %s" % (tags['Image Make'], tags['Image Model'])
            else:
                print "Image Make: %s" % (tags['Image Make'])
        else:
            print "No Image Make nor Model available"
    except:
        print 'Error while getting tags from an image'
    f.close()

print "Done!"

Out of 10 images, it can usually give 7–9 camera models. I haven’t checked yet whether the errors are due to my script or to the lack of EXIF tags in the submitted images. The EXIF tag detection is a bit slow (imho) but it’s ok. And it’s a “one shot” script: once it finishes its work, nothing remains in memory. So the next step is to use a flat file or a database connection to remember the details found.

I suggest the following method: at regular intervals (say every 5 minutes), the script retrieves the most recent photo uploaded to Flickr and stores its camera make and model somewhere. Each day, one would then be able to do some decent statistics. I prefer sampling one photo at a time at regular intervals rather than 10 photos at one precise moment, because people usually upload their pictures in batches: there is a risk that these 10 photos are from the same person and taken with the same device.

Stream redirection in Python

Two computers are on the same network. A firewall separates the internet from the intranet. Both computers can access everything on the intranet, but only one of them is allowed to access the internet. The problem: how to listen to a music stream on the computer that cannot access the internet.

A possible solution is to run a stream-redirection program on the computer that can access the internet. The computer that cannot access the internet can then get the stream from the intranet (figure below).

Illustration of the redirection

Since I’ve begun playing with Python, I tried to write such a program in this language. Here is the code (preset for a Belgian radio station, Pure FM):

#!/usr/bin/python
import socket
import traceback
import urllib2

# get the right input stream (in case it changes everyday)
# change the address to suit your need (yes: user input needed!)
address = "http://old.rtbf.be/rtbf_2000/radios/pure128.m3u"
content = urllib2.urlopen(address)
stream = content.readlines()
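# the first line of the .m3u playlist is expected to look like
# http://host:port/path : strip the leading 'http://' and the trailing
# newline, then split it into host, port and path below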
stream = stream[0][7:len(stream[0])-1]
inHost = stream[0:stream.index(":")]
inPort = int(stream[stream.index(":")+1:stream.index("/")])
inPath = stream[stream.index("/"):len(stream)]

# set output stream (default is localhost:50008)
outHost = ''
outPort = 50008

# get the in/out sockets
inSock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
inSock.connect((inHost, inPort))
outSock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
outSock.bind((outHost, outPort))

outSock.listen(1)
outNewSock, outAddress = outSock.accept()

# get the info from a *file*, not a simple host URL ...
inSock.send("GET " + inPath + " HTTP/1.0\r\nHost: " + inHost + "\r\n\r\n")

try:
    while 1:
        inData = inSock.recv(2048)
        if not inData:
            # an empty read means the input stream was closed on the other side
            print "No data: input stream closed"
            break
        print "Read ", len(inData), " bytes"
        outNewSock.send(inData)
        print "Sent data to out"
except Exception:
    traceback.print_exc()

# clean up (the program can also simply be stopped with Ctrl+C or something like that):
outNewSock.close()
outSock.close()
inSock.close()

Now the computer that doesn’t have access to the internet can connect to port 50008 on the other computer to get the stream and listen to music. It was quite simple.

Note that if you try this within IDLE on MS-Windows, you’ll get some errors because of problems with the synchronisation of the MPEG stream.