In a recent post, Alexandre wrote about web indexing and pointed to a nice tool for webmasters: the sitemap. The Sitemap Protocol “allows you to inform search engines about URLs on your websites that are available for crawling” (since it’s a Google creation, it seems that only Google uses it, according to Alexandre).
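For the curious, a sitemap is just a small XML file listing your URLs. Here is a minimal hand-made example (the URL and values are made up, and the namespace is the one from the joint sitemaps.org specification mentioned in the edit at the end of this post; the original Google spec used its own schema URL):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```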
If you have shell access to your webserver and Python on it, Google provides a nice Python script to create your sitemap automatically.
I don’t have any shell access to my webserver 😦 But I can write a simple Python script 🙂 Here it is: sitemapbuilder.py (4 KB). After specifying the local directory where all your files are and the base URL of your on-line website (yes, you need to edit the script), launch the script and voilà! You can now upload your sitemap.xml file and tell web crawlers where to find the information on your website.
Interested in the other options you can specify? (You’ll find a sketch putting them all together right after this list.)
- You can specify an array of accepted file extensions. By default, I’ve put ‘htm’, ‘html’ and ‘php’ but you can add ‘pdf’ if you want.
- You can specify an array of filenames to strip. By default, I strip all ‘index.*’ (with * = one of the accepted extensions) because http://www.poirrier.be is the same as http://www.poirrier.be/index.html but easier to remember.
- You can specify a change frequency (it will be the same for all files).
- You can specify a priority (it will also be the same for all files, and it is even omitted when equal to the default value).
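To make this concrete, here is a minimal sketch of how such a generator can look. It is not the actual sitemapbuilder.py: all the names and default values below (LOCAL_DIR, BASE_URL, and so on) are only illustrative, and a real script should also XML-escape any special characters in the URLs.

```python
import os

# --- configuration (edit these; the values are only examples) ---
LOCAL_DIR = "/home/me/www"            # local copy of the website
BASE_URL = "http://www.example.com/"  # where the files live on-line
EXTENSIONS = ("htm", "html", "php")   # accepted file extensions
STRIP_NAMES = ["index." + ext for ext in EXTENSIONS]  # strip 'index.*'
CHANGE_FREQ = "monthly"               # same <changefreq> for all files
PRIORITY = "0.5"                      # 0.5 is the default, so it gets omitted

def build_sitemap():
    suffixes = tuple("." + ext for ext in EXTENSIONS)
    entries = []
    # os.walk recurses through the whole tree for us
    for dirpath, dirnames, filenames in os.walk(LOCAL_DIR):
        for name in filenames:
            if not name.endswith(suffixes):
                continue
            rel = os.path.relpath(os.path.join(dirpath, name), LOCAL_DIR)
            rel = rel.replace(os.sep, "/")
            # http://site/ is the same page as http://site/index.html,
            # so keep only the shorter, easier-to-remember URL
            if name in STRIP_NAMES:
                rel = rel[:-len(name)]
            lines = ["  <url>",
                     "    <loc>%s</loc>" % (BASE_URL + rel),
                     "    <changefreq>%s</changefreq>" % CHANGE_FREQ]
            if PRIORITY != "0.5":  # omit <priority> at its default value
                lines.append("    <priority>%s</priority>" % PRIORITY)
            lines.append("  </url>")
            entries.append("\n".join(lines))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            + "\n".join(entries) + "\n</urlset>\n")

if __name__ == "__main__":
    with open("sitemap.xml", "w") as out:
        out.write(build_sitemap())
```

Since the output is so regular, plain string formatting is enough; running the script leaves a sitemap.xml in the current directory, ready to upload.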
On the technical side, there is nothing fancy (I don’t even use any XML tool to generate the file). I was impressed by how easy it is to walk through a directory tree with Python’s os.walk function (it could be interesting to use it in the future blog system I mentioned earlier).
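If you have never used it, os.walk yields a (dirpath, dirnames, filenames) tuple for every directory it visits, so listing a whole tree takes only a few lines (the path below is, of course, just an example):

```python
import os

# print every file under the given directory, however deep it is nested
for dirpath, dirnames, filenames in os.walk("/home/me/www"):
    for name in filenames:
        print(os.path.join(dirpath, name))
```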
Finally, you can see the sitemap.xml generated for my family website.
Edit on November 17th: it seems that Google, Yahoo! and Microsoft have teamed up to release a “more official” sitemaps specification: http://www.sitemaps.org/.