In biology, we often need to use PubMed, a search engine for biomedical articles that covers citations from MEDLINE and other life science journals.
In the MS-Windows world, you have nice, proprietary tools (like Reference Manager or EndNote) that retrieve citations from PubMed, store them in a database and let you use them in proprietary word processing software (in fact, in MS-Word only, since neither WordPerfect nor OpenOffice.org is supported). If you use BibTeX (for LaTeX) as your citation repository, there aren’t many tools. The best one, imho, is JabRef, a free reference manager written in Java (for me, the only “problem” is that it adds custom, non-BibTeX tags). Or you can edit the BibTeX file yourself with any text editor.
The problem with manual editing is that it is error-prone (even when copying and pasting from the web). Since Python programming is my hobby horse at the moment, I see two solutions to this problem:
- Use Biopython to get a reference from PubMed. But are you ready to take on a huge module dependency just to use one function?
- Write your own Python script, using a PubMed URL to download your reference and a little bit of XML parsing to extract the relevant info (one could use the ESearch and EFetch tools, but my lazy nature tells me to simply use the URL).
Obviously, I chose to write my own Python script. Each reference in this PubMed XML format example (full DTDs) should be converted into a BibTeX entry like this:
@article{poirrier06,
  author = {Poirrier, J.E. and Poirrier, L. and Leprince, P. and Maquet, P.},
  title = {Gemvid, an open source, modular, automated activity recording system for rats using digital video},
  year = 2006,
  journal = {Journal of circadian rhythms},
  volume = 4,
  pages = {10},
  pmid = 16934136,
  doi = {10.1186/1740-3391-4-10}
}
The script is here (4kb). First, use PubMed to find the reference you want, then take its PubMed ID (PMID) and launch the program, appending its output to your BibTeX file with a shell redirection, for example:
./pyP2B.py 16934136 >> myrefs.bib
If you like, you can edit the script to change the tab size (here = 2).
How does it work?
- I do not query PubMed with the proper tool but with a plain HTTP query, which is much simpler. The script asks for the citation matching the PMID. Since it gets an HTTP answer, I need to parse this answer to replace the character entities it contains and obtain a valid XML file.
- Once I have the XML file, and after some checking, I extract the fields with XPaths from lxml (for me, XPaths are quick and dirty compared to writing a DOM/SAX structure, but it works!).
- Then the script simply prints the result to the standard output (even if it’s an error; a possible improvement would be to print errors on the error output). You simply need to get this output into your BibTeX file with the correct redirection.
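To make these three steps concrete, here is a minimal sketch in Python 3. It is not the original pyP2B.py. It assumes the HTTP answer wraps entity-escaped XML in a `<pre>` element, it uses the standard library's `xml.etree.ElementTree` (which supports a subset of XPath) instead of lxml so that it runs without extra dependencies, and the function names `extract_xml` and `pubmed_xml_to_bibtex` are made up for illustration. The sample record is hand-written and trimmed, built from the reference shown earlier, so nothing is downloaded.

```python
import html
import sys
import xml.etree.ElementTree as ET

# Hand-written, trimmed record for PMID 16934136 (not a real download).
SAMPLE_XML = """<PubmedArticleSet><PubmedArticle>
<MedlineCitation>
  <PMID>16934136</PMID>
  <Article>
    <Journal>
      <JournalIssue><Volume>4</Volume><PubDate><Year>2006</Year></PubDate></JournalIssue>
      <Title>Journal of circadian rhythms</Title>
    </Journal>
    <ArticleTitle>Gemvid, an open source, modular, automated activity recording system for rats using digital video.</ArticleTitle>
    <Pagination><MedlinePgn>10</MedlinePgn></Pagination>
    <AuthorList>
      <Author><LastName>Poirrier</LastName><Initials>JE</Initials></Author>
      <Author><LastName>Poirrier</LastName><Initials>L</Initials></Author>
      <Author><LastName>Leprince</LastName><Initials>P</Initials></Author>
      <Author><LastName>Maquet</LastName><Initials>P</Initials></Author>
    </AuthorList>
  </Article>
</MedlineCitation>
<PubmedData><ArticleIdList>
  <ArticleId IdType="doi">10.1186/1740-3391-4-10</ArticleId>
</ArticleIdList></PubmedData>
</PubmedArticle></PubmedArticleSet>"""

# Assumed shape of the HTTP answer: escaped XML inside <pre>...</pre>.
SAMPLE_RESPONSE = "<html><body><pre>%s</pre></body></html>" % html.escape(SAMPLE_XML)


def extract_xml(http_body):
    """Step 1: cut out the <pre> payload and turn entities back into markup."""
    start = http_body.index("<pre>") + len("<pre>")
    end = http_body.index("</pre>", start)
    return html.unescape(http_body[start:end]).strip()


def pubmed_xml_to_bibtex(xml_text, indent=2):
    """Steps 2 and 3: pull fields out with XPath-like paths, build one entry."""
    root = ET.fromstring(xml_text)
    article = root if root.tag == "PubmedArticle" else root.find(".//PubmedArticle")
    if article is None:
        raise ValueError("no PubmedArticle element found")

    def text(path):
        # Only the leading text of the node is used (inline markup is ignored).
        node = article.find(path)
        return node.text.strip() if node is not None and node.text else ""

    authors = []
    for author in article.findall(".//AuthorList/Author"):
        last = author.findtext("LastName", default="")
        dotted = "".join(c + "." for c in author.findtext("Initials", default=""))
        authors.append("%s, %s" % (last, dotted))

    year = text(".//JournalIssue/PubDate/Year")
    fields = [
        ("author", "{%s}" % " and ".join(authors)),
        ("title", "{%s}" % text(".//ArticleTitle").rstrip(".")),
        ("year", year),
        ("journal", "{%s}" % text(".//Journal/Title")),
        ("volume", text(".//JournalIssue/Volume")),
        ("pages", "{%s}" % text(".//Pagination/MedlinePgn")),
        ("pmid", text(".//MedlineCitation/PMID")),
        ("doi", "{%s}" % text(".//ArticleId[@IdType='doi']")),
    ]
    # Citation key: first author's last name plus two-digit year.
    key = (authors[0].split(",")[0].lower() if authors else "unknown") + year[-2:]
    pad = " " * indent
    body = ",\n".join("%s%s = %s" % (pad, name, value) for name, value in fields)
    return "@article{%s,\n%s\n}" % (key, body)


if __name__ == "__main__":
    try:
        print(pubmed_xml_to_bibtex(extract_xml(SAMPLE_RESPONSE)))
    except (ValueError, ET.ParseError) as exc:
        # Errors go to the error output so they never end up in the .bib file.
        print("error: %s" % exc, file=sys.stderr)
        sys.exit(1)
```

Sending errors to the error output means that a failed lookup leaves the BibTeX file untouched when the output is appended with `>>`. Note that unescaping everything at once is a shortcut: a record whose text contains an escaped ampersand would need more careful handling.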
Edit on October 23rd: this script fails on non-ASCII characters like “ö” in Angelika Görg. I won’t fix it for the moment.
Update on Feb 14th, 2007: I fixed the error with non-ASCII characters. I also wrote a small page on my website explaining how the script works: http://www.poirrier.be/~jean-etienne/software/pyp2b/
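For readers hitting the same problem in their own scripts, here is one possible way to deal with accented characters. This is only a sketch under my own assumptions and not necessarily how the fix was done in the script: it decomposes each character with the standard `unicodedata` module and rewrites a few common accents as LaTeX commands. The function name `latex_escape` and the small accent table are hypothetical, and characters outside the table are left unchanged.

```python
import unicodedata

# Small, non-exhaustive table: combining mark -> LaTeX accent command.
COMBINING_TO_LATEX = {
    "\u0300": "`",   # grave
    "\u0301": "'",   # acute
    "\u0302": "^",   # circumflex
    "\u0303": "~",   # tilde
    "\u0308": '"',   # diaeresis
}


def latex_escape(text):
    """Replace accented letters found in the table with LaTeX accent commands."""
    out = []
    for ch in text:
        # NFD splits e.g. "ö" into the base letter "o" plus a combining diaeresis.
        decomposed = unicodedata.normalize("NFD", ch)
        if len(decomposed) == 2 and decomposed[1] in COMBINING_TO_LATEX:
            out.append("{\\%s%s}" % (COMBINING_TO_LATEX[decomposed[1]], decomposed[0]))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(latex_escape("Görg, A."))
```

With a modern LaTeX setup, an alternative is to keep the BibTeX file in UTF-8 and skip the conversion entirely.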