Category: Websites

First trace for OpenStreetMap

OpenStreetMap is a “project aimed squarely at creating and providing free geographic data such as street maps to anyone who wants them. The project was started because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive or unexpected ways.” I thought it was worth participating in, and it is better documented than the UPCT project. So I got a Locosys NaviGPS GT-11 and used it for the first time on the way to FOSDEM (and back). I made a small mistake by setting the interval between points to 30 s: on a highway, at 120 km/h, 30 s means 1 km, and the road direction can change a lot over that distance. When I have more time, the next step will be to do some editing and mark roads, highways, interesting landmarks, etc. Stay tuned …

The GT-11 is only a GPS data logger: it has no built-in maps, no route finder, etc. That’s why I got a little lost while trying to get out of Brussels 😦 But at least you can see your errors afterwards.

Ruby France logo proposals

I couldn’t sleep tonight (*), so I made two small variations on the official Ruby logo for Ruby France, since they are looking for a new logo (**). Double-click to enlarge, single-click to shrink back to the small images (***):

 

I also like Greg’s proposal.

(*) it happens very often these days
(**) No, I don’t know Ruby
(***) This doesn’t work if JavaScript is disabled in your browser (it usually isn’t), and Internet Explorer doesn’t render the transparency correctly; please use a real browser instead

Simple Sitemap.xml builder

In a recent post, Alexandre wrote about web indexing and pointed to a nice tool for webmasters: the sitemap. The Sitemap Protocol “allows you to inform search engines about URLs on your websites that are available for crawling” (since it is a Google creation, it seems that only Google uses it, according to Alexandre).

If you have shell access to your webserver and Python installed on it, Google has a nice Python script to automatically create your sitemap.

I don’t have any shell access to my webserver 😦 But I can write a simple Python script 🙂 Here it is: sitemapbuilder.py (4 KB). After specifying the local directory where all your files are and the base URL of your on-line website (yes, you need to edit the script), launch the script and voilà! You can now upload your sitemap.xml file and tell web crawlers where to find the information on your website.

Interested in the other options you can specify?

  • You can specify an array of accepted file extensions. By default, I’ve put ‘htm’, ‘html’ and ‘php’ but you can add ‘pdf’ if you want.
  • You can specify an array of filenames to strip. By default, I strip all ‘index.*’ files (with * being one of the accepted extensions) because http://www.poirrier.be is the same as http://www.poirrier.be/index.html, but easier to remember
  • You can specify a change frequency (it will be the same for all files)
  • You can specify a priority (it will also be the same for all files and even omitted if equal to the default value)

On the technical side, there is nothing fancy (I don’t even use any XML tool to generate the file). I was impressed by how easy it is to walk through a directory tree with Python’s os.walk function (it could be interesting to use it in the future blog system I mentioned earlier).
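For the curious, the core of the script boils down to something like this (a minimal sketch, not the actual sitemapbuilder.py; the directory, base URL and default values are placeholders to adapt, and the sketch skips the URL escaping a real generator should do):

    import os

    LOCAL_ROOT = "/home/me/website"        # local copy of the site (placeholder)
    BASE_URL = "http://www.example.org"    # public base URL (placeholder)
    EXTENSIONS = ("htm", "html", "php")    # accepted file extensions
    STRIP = ["index." + ext for ext in EXTENSIONS]  # index.* maps to the directory URL
    CHANGEFREQ = "weekly"                  # same change frequency for every file

    urls = []
    for dirpath, dirnames, filenames in os.walk(LOCAL_ROOT):
        for name in filenames:
            if name.rsplit(".", 1)[-1].lower() not in EXTENSIONS:
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), LOCAL_ROOT)
            rel = rel.replace(os.sep, "/")
            if name.lower() in STRIP:      # /dir/index.html becomes /dir/
                rel = rel[:-len(name)]
            urls.append(BASE_URL + "/" + rel)

    with open("sitemap.xml", "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        # namespace from the joint sitemaps.org spec mentioned in the edit below
        out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
        for url in sorted(urls):
            out.write("  <url>\n    <loc>%s</loc>\n" % url)
            out.write("    <changefreq>%s</changefreq>\n  </url>\n" % CHANGEFREQ)
        out.write("</urlset>\n")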

Finally, you can see the sitemap.xml generated for my family website.

Edit on November 17th: it seems that Google, Yahoo! and Microsoft have teamed up to release a “more official” sitemaps specification: http://www.sitemaps.org/.

Search for images by sketching

On his blog, Laurent wanted to know who this guy is. I thought it was an interesting starting point to see how good Retrievr is, “an experimental service which lets you search and explore in a selection of Flickr images by drawing a rough sketch”.

Although my drawing skills really need to improve (and their drawing tool could be more refined; always blame others for your weaknesses 😉 ), a first sketch gives some interesting results (see screenshot below): 7 of the retrieved photos (44%) show a black-and-white human face in “frontal view” (if you count the dog, it’s even 8 correct images).

Click to enlarge

If I just give Retrievr the photo’s URL, the results are not as good (see screenshot below). I am nearly 100% sure this is because it is a greyscale photo that was scanned as a colour image.

Click to enlarge

When I download the image to my hard disk, convert it to greyscale and upload it to Retrievr, the results are much closer to what I expected: 10 images (63%) show a person with his/her hair, either from the front or from the back.

Click to enlarge
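For the record, the greyscale conversion itself is a one-liner with the Python Imaging Library (the filenames below are just examples):

    from PIL import Image

    # Open the colour scan and save it as an 8-bit greyscale ("L" mode) image
    Image.open("laurent_photo.jpg").convert("L").save("laurent_photo_grey.jpg")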

So this photo is not among the “most interesting” ones on Flickr (it is probably not even on Flickr). I suppose that if Retrievr were applied to a larger subset of photos, the probability of finding it would be higher (but so would the noise, i.e. the number of similar photos). If you like playing with Flickr, other interesting Flickr mashups can be found here.

The Belgian press is fighting for its rights (really?)

A lot of blogs, Belgian or not, are talking about the fact that the Belgian French-speaking press (led by CopiePress, a Belgian rights management company) successfully sued Google in Belgium over indexing, author rights, content copying, etc. The full court order is available on the Belgian Google homepage (in French).

I am not a lawyer. So I read the order:

  • CopiePress wanted the Belgian court to examine the lawfulness of the Google News and Google Cache services under Belgian law
  • CopiePress wanted Google to remove all links to any data from CopiePress clients
  • CopiePress wanted Google to publish the order on the front page of its Belgian website

So, CopiePress won the first case (it will be heard again on appeal). I assume that the Belgian courts are doing their job, so let us consider that Google broke Belgian law with these services. If you want to know more about the legal side, P. Van den Bulck, E. Wery and M. de Bellefroid wrote an article about which Belgian laws Google seems to have broken (in French).

I am not a lawyer, but I grew up with the internet. In my opinion, the internet was technically not designed for the kind of use CopiePress wants. The internet was designed to share information in a decentralised way: all TCP/IP requests are equal (there is no intrinsic difference between paid and unpaid, or subscribed and unsubscribed, access). Search engines were “invented” later, when it became difficult to find a piece of information on the internet. Later on, people invented technical solutions to avoid being indexed by robots (the robots.txt convention) or to prevent access to “protected” content that has not been paid for. For instance, the Le Soir robots file is useless (it disallows nothing), and the La Libre Belgique robots file is only there to protect access statistics and advertisement images. LeMonde.fr, on the other hand, successfully protected its report on interns: no direct access, no Google cache.
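To make the point concrete: keeping a crawler away from paid archives only takes a few lines of robots.txt (a generic example, not the actual file of any of these newspapers; the path is invented):

    # Hypothetical robots.txt for a news site: keep crawlers out of the paid archives
    User-agent: Googlebot
    Disallow: /archives/

    User-agent: *
    Disallow: /archives/

And a simple <meta name="robots" content="noarchive"> tag in the article pages is enough to keep them out of the Google cache.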

Like many other people (including commenters on these newspapers’ own blogs, and even journalists working for them), I think these newspapers will lose readers (hits) and will lose their credibility with the younger generation of readers who, rightly or wrongly, love all these free web services (Google, Flickr, YouTube, Skyblog, etc.). At least they have lost mine, because I am sure there are other ways to keep Google off their pages, and because I am asking myself some questions which suggest that they just want some free advertising, or are even hiding something else (see below).

Why aren’t they suing other search engines? Yahoo! indexes pages and articles from these newspapers, and even keeps cached copies of them. MSN Newsbot also indexes pages and articles from these newspapers, with direct links to the articles (no roundabout way through the front page and its ads). Etc. I suppose it is because Google is the internet’s big player and the leading search engine at the moment, and they want to catch the public’s attention.

A very good article by D. Sullivan suggests that they are doing this for money only. Here is their new business plan: if we don’t succeed in selling our products, we’ll sue an internet big player for money!

Why didn’t the Flemish-language newspapers launch such a lawsuit against Google? Either they like what Google is doing, or they don’t care (or they are preparing such a lawsuit).

Finally, these French-language newspapers launched this lawsuit at the very moment a French-speaking professional journalists’ association is running a press campaign against these newspapers’ practices with freelance journalists: minimum pay, undefined conditions, etc. That’s strange, because Google Cache has existed for at least two years; didn’t they notice it before?

In summary, I am sure there are other ways to make search engines “friendly” to your news website. This lawsuit gives a bad impression of Belgium and its French-language press in the electronic world. I wonder how long it will take before they complain again that their readership is falling. I am not defending Google; I am just criticising the French-language newspapers’ lawsuit.

Digital access to the ULg libraries

Although the webpage of the University of Liege (ULg) network of libraries is very old and ugly, the network is starting to use new, technologically advanced tools to provide digital access to its content (articles, books, theses and other media). Three tools have recently become available:

  • Source gives access to all media currently available in the libraries (it replaces the Telnet-based Liber, for those who used it before). Source is based on Aleph from ExLibris, a piece of proprietary software.
  • PoPuPS is a publication platform for scientific journals from the ULg and the FSAGx. PoPuPS is based on the Lodel CMS, a free (GPL) web publishing system. Articles in this database seem to be Open Access, although no precise licence is defined (and some articles look strange: see the second picture in this geological article).
  • BICTEL/e is an institutional repository of Ph.D. theses. It seems to have been developed internally by the UCL.

With these tools, the ULg is trying to catch up with the Open Access movement. Source is already connected to other types of databases, but it seems that PoPuPS and BICTEL are not (yet) connected to cross-reference systems like DOI, nor do they use standardised metadata as Eprints does.

P.S. An old tool is still very useful: Antilope gives you the location (in Belgium) of publications. If your library doesn’t have a subscription to a specific journal, maybe another library in Belgium has it.

How to fight spam in a wiki?

On Friday, while waiting for a librarian to fetch the old articles I wanted to read, I spent a few minutes removing spam from the AEL wiki. This form of spam is very easy to spot because it is always the same: <small> HTML tags enclosing some 30 links, whose link text contains well-known, adult-oriented spam words (see the end of the MsSecurity page, where I didn’t have time to remove the spam).

After the librarian gave me my articles, I went back to my lab, thinking about a possible solution … This kind of spam is constant, so why not write a simple software bot that fetches each and every page of a wiki, checks whether it contains litigious content, and then either moves on to the next page or cleans the content? Such a bot wouldn’t prevent spam, but it could act quickly after spam is added (e.g. if you launch it every hour/day/week with cron).
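Something along these lines, for instance (a very rough sketch: the wiki URL and page list are invented, the spam test only looks for the <small>-block signature described above, and it merely reports suspect pages instead of cleaning them):

    import re
    import urllib.request

    WIKI_BASE = "http://wiki.example.org/"          # hypothetical wiki base URL
    PAGES = ["FrontPage", "MsSecurity", "SandBox"]  # a real bot would read the page index

    for page in PAGES:
        html = urllib.request.urlopen(WIKI_BASE + page).read().decode("utf-8", "replace")
        # The spam described above: a <small> block stuffed with dozens of links
        for block in re.findall(r"<small>.*?</small>", html, re.I | re.S):
            if block.count("href=") >= 10:
                print("possible spam on", page)     # a real bot would log in and clean it
                break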

This afternoon, I realised I could not be the only one trying to find ways to fight spam on wikis. Indeed, fighting wiki spam already has a verb, “to chongq” (although it also includes retaliation), a Wikipedia page (even two) and many other dedicated pages.

Basically, there are three types of approach to fighting spam: wiki-specific methods, general HTTP/web methods and manual actions.

  1. Wiki-specific methods are add-ons to your wiki system that help prevent spammers from modifying your wiki. For example, MediaWiki has its own anti-spam features and a Spam blacklist extension, TWiki has a Black List plugin, etc. Once set up, you generally do not need to care about them (except to check that they are working properly, to update them, etc.).
  2. General HTTP/web methods use general web mechanisms and/or special features independent of the wiki software you use. These systems are also automated: Bad Behaviour, CAPTCHA images, the “rel=nofollow” attribute in link tags (see the sketch after this list), etc.
  3. Finally, manual actions can be taken by any human: removing spam like I did, renaming well-known wiki pages like the sandbox, etc. The only advantage of this method is that the human brain can easily adapt to new forms of spam. Otherwise, it is rather time-consuming …
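To illustrate the second family of methods: the “rel=nofollow” trick simply rewrites every link the wiki renders so that search engines ignore it, which removes most of the incentive to spam. A hand-rolled sketch of the idea (wiki engines that support it do this internally):

    import re

    def add_nofollow(html):
        """Add rel="nofollow" to every <a ...> tag that lacks a rel attribute."""
        def fix(match):
            tag = match.group(0)
            if "rel=" in tag.lower():
                return tag                          # leave existing rel attributes alone
            return tag[:-1] + ' rel="nofollow">'
        return re.sub(r"<a\s[^>]*>", fix, html, flags=re.I)

    print(add_nofollow('<a href="http://spam.example.com/">cheap pills</a>'))
    # -> <a href="http://spam.example.com/" rel="nofollow">cheap pills</a>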

Finally, I read that some spam bots remove spam, but only part of it. This is the kind of tool I would like to write, except that it should remove all the spam. (But before that one, I should start on the simple, geeky blog software.)

Addendum on August 21st, 2006: independently of this post, Ploum made an interesting summary of a post by Mark Pilgrim (a rather old post: 2002!). In his post, Mark Pilgrim sees two kinds of anti-spam solutions: Clubs and Lojacks.

With a Club solution, your wiki is protected against lazy spammers. Clubs are technical solutions that make it harder for spammers to vandalise your website/wiki/blog/etc. The Club works as long as not everyone has it: once everyone has a Club, spammers will think a little and update their software to circumvent most of them. In conclusion, “the Club doesn’t deter theft, it only deflects it.”

With a Lojack solution, your wiki isn’t necessarily protected, but spammers who vandalise it can be traced back. “Although it does nothing to stop individual crimes, by making it easier to catch criminals after the fact, Lojack may make auto theft less attractive overall.”

My bot that completely removes spam is definitely not a Lojack, but it’s not a Club either. It lets you be spammed and it doesn’t trace spammers back. Still, it makes wikis less attractive to spammers, since their links will be removed soon after being added.

(By the way, I’ve just noticed that comments were automatically disabled for every post. That was not intentional.)

Links: Original and GEGL

While looking for something totally different (what exactly is the Mascot score? A partial answer is here), I found the Original photo gallery, a two-part tool for putting digital photos on the web. It could be interesting for the family website I plan to build. Since my hosting company enabled PHP Safe Mode, I cannot use most on-line gallery tools. Original seems to be an ideal solution because all the processing is done off-line, on my own PC, and everything is then uploaded to the website. Still, it is not a static gallery like the one Picasa produces (for example).

I also saw GEGL (Generic Graphics Library), “an image processing library for on-demand image processing. It is designed to handle various image processing tasks needed in GIMP.” They have just released a first version. Øyvind Kolås is also the maintainer of Babl, a “dynamic, any to any, pixel format conversion library”. I am interested in all kinds of (free as in free speech) image processing libraries because I am trying to correctly open and manipulate .gel files (see the problems described in this forum thread; basically, they are .tiff files with additional tags and a different way of encoding pixels).
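Out of curiosity, the extra TIFF tags can at least be listed with the Python Imaging Library, even if the pixel data itself is decoded incorrectly (a quick sketch; the filename is made up and it assumes PIL manages to parse the file’s TIFF structure at all):

    from PIL import Image

    im = Image.open("gel_scan_0001.gel")   # hypothetical .gel file (a TIFF container)
    print(im.format, im.size, im.mode)     # what PIL thinks the image is
    if hasattr(im, "tag"):                 # TIFF images expose their tag directory
        for tag_id, value in im.tag.items():
            print(tag_id, value)           # vendor-specific tags show up here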

P.S. I quickly installed and tried Istanbul, as I previously planned to do. It works for a few seconds of recording, but then it stops. I haven’t had time yet to see what is wrong (I think my resolution is too high: 1280×1024).

P.P.S. Oh, yes … I also put myself forward as, and became, the new contact person of the French team of LinuxFocus. I would like to thank Iznogood for the work done before me, and I hope he will stay active in the free software community. I will upload my latest translations as soon as possible and try to put some pep into the French-speaking team.