Playing with Python and Gadfly

Following my previous post where I retrieved EXIF tags from photos posted on Flickr, here is the next step: my script now stores data in a database.

There are a lot of free database wrappers for Python. Although I first thought of using pysqlite (because I am already using SQLite in another project), I decided to use Gadfly, a real SQL relational database system written entirely in Python. It does not need a separate server, it complies with the Python DB-API (which makes it easy to switch to another database system later) and it’s free.

Using Gadfly is very easy and its tutorial is easy to follow. Put together, here is an example that creates a database, adds some data and retrieves it (testGadfly.py):

#!/usr/bin/python
# test for Gadfly
import os
import gadfly
import time

DBdir = 'testDB'
DBname = 'testDB'

if os.path.exists(DBdir):
    print 'Database already exists. I will just open it'
    connection = gadfly.gadfly(DBname, DBdir)
    cursor = connection.cursor()
else:
    print 'Database not present. I will create it'
    os.mkdir(DBdir)
    connection = gadfly.gadfly()
    connection.startup(DBname, DBdir)
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE camera (t FLOAT, make VARCHAR, model VARCHAR)")

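print 'Add some items'
t = time.time()
cmake = 'Nikon'
cmodel = 'D400'
# Bind the values with '?' placeholders instead of building the SQL string
# by hand: t is passed as a float and quotes in the text need no escaping
cmd = "INSERT INTO camera (t, make, model) VALUES (?, ?, ?)"
cursor.execute(cmd, (t, cmake, cmodel))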

print 'Retrieve all items'
cursor.execute("SELECT * FROM camera")
for x in cursor.fetchall():
    print x

connection.commit()
print 'Done!'
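
Since Gadfly follows the Python DB-API, the same steps should carry over to another driver with few changes. As a rough sketch (not something from my actual script), here is the same test against the sqlite3 module from the standard library, written for Python 3. The function name demo and the in-memory database are my own choices for illustration, and CREATE TABLE IF NOT EXISTS is SQLite syntax that replaces the os.path.exists() check above.

```python
import sqlite3
import time


def demo(path=':memory:'):
    # connect() opens the database, creating it if needed
    # (':memory:' keeps everything in RAM, handy for a quick test)
    connection = sqlite3.connect(path)
    cursor = connection.cursor()
    # Same table as in testGadfly.py; IF NOT EXISTS is SQLite-specific
    cursor.execute(
        "CREATE TABLE IF NOT EXISTS camera (t FLOAT, make VARCHAR, model VARCHAR)")
    # sqlite3 uses '?' placeholders for bound values
    cursor.execute(
        "INSERT INTO camera (t, make, model) VALUES (?, ?, ?)",
        (time.time(), 'Nikon', 'D400'))
    connection.commit()
    cursor.execute("SELECT * FROM camera")
    rows = cursor.fetchall()
    connection.close()
    return rows


for row in demo():
    print(row)
```

Only the connection step and the table creation really differ; the cursor, execute, fetchall and commit calls are the ones the DB-API defines.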

Regarding the initial project, the script became too long to paste into this post, but you can download it here: flickrCameraQuantifier2.py (5 KB). To run it, you need to have installed the wrappers for EXIF and Flickr and the Gadfly DB system. At the beginning of the script, you can set the total number of iterations (sets of queries) you want (variable niterations), the sleep duration between queries (variable sleepduration) and the number of photos to get for each query (variable nphotostoget). Everything is then stored in a Gadfly database (default name: cameraDB). If you want to read what is stored, here is a very basic script: flickrCQ2Reader.py.
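
To give an idea of how these three settings fit together, here is a simplified sketch in Python 3. It is not the downloadable script: fetch_photos is a hypothetical placeholder for the Flickr and EXIF step, collect is a name I made up for this post, and the values are only illustrative.

```python
import time

# Same names as the settings described above; the values are illustrative only
niterations = 3      # total number of iterations (sets of queries)
sleepduration = 0.1  # seconds to wait between two queries
nphotostoget = 2     # photos requested by each query


def fetch_photos(n):
    # Hypothetical placeholder for the Flickr + EXIF step: it returns
    # n (make, model) tuples instead of querying anything
    return [('Nikon', 'D400') for _ in range(n)]


def collect(niterations, sleepduration, nphotostoget,
            fetch=fetch_photos, sleep=time.sleep):
    rows = []
    for i in range(niterations):
        for make, model in fetch(nphotostoget):
            # One (t, make, model) tuple per photo
            rows.append((time.time(), make, model))
        # Pause between queries, but not after the last one
        if i < niterations - 1:
            sleep(sleepduration)
    return rows


rows = collect(niterations, sleepduration, nphotostoget)
print('%d rows collected' % len(rows))
```

Each tuple has the (t, make, model) shape of the camera table from testGadfly.py, so it could be handed to an INSERT statement as is.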

For example, I’ve just run 125 queries (with 5 s between each query). I got 88 photos (70.4% of queries), 27 of them without EXIF tags (30.68% of all the photos). Among the camera makers, Canon has 27%, Fuji 11%, Nikon 18% and Sony 21% of all the photos with EXIF tags at that moment. This is approximately what Flagrant disregard found. I don’t have time anymore, but one could improve the data retrieval script to automate the statistics and their presentation…
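
As a starting point for that automation, here is a small Python 3 sketch of the arithmetic. The helper names percentage and make_shares are mine, and the list of makes passed to make_shares is made-up sample data, not my results. In practice it would be the make column read back from the database.

```python
from collections import Counter


def percentage(part, whole):
    # Share of part in whole, in percent (0.0 when whole is empty)
    return 100.0 * part / whole if whole else 0.0


def make_shares(makes):
    # Count each camera maker and convert the counts to percentages
    counts = Counter(makes)
    total = sum(counts.values())
    return {make: percentage(n, total) for make, n in counts.items()}


# The two ratios quoted above
print('%.1f%% of queries returned a photo' % percentage(88, 125))   # 70.4%
print('%.2f%% of photos had no EXIF tags' % percentage(27, 88))     # 30.68%

# Made-up sample, only to show the shape of the result
print(make_shares(['Canon', 'Canon', 'Nikon', 'Sony']))
```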

Edit on October 9th, 2006: added the links to the missing scripts