Automated PubMed reference to BibTeX

In biology, we often need to use PubMed, a search engine for biomedical articles that covers citations from MEDLINE and other life science journals.

In the MS-Windows world, you have nice, proprietary tools (like Reference Manager or EndNote) that retrieve citations from PubMed, store them in a database and allow you to use them in proprietary word processing software (in fact, in MS-Word only, since neither WordPerfect nor OpenOffice.org is supported). If you are using BibTeX (for LaTeX) as your citation repository, there aren’t many tools. The best one, imho, is JabRef, a free reference manager written in Java (for me, the only “problem” is that it adds custom, non-BibTeX tags). Or you can edit the BibTeX file yourself with any text editor.

The problem with manual editing is that it is error-prone (even when copying and pasting from the web). Since Python programming is my hobby horse at the moment, I see two solutions to this problem:

Use Biopython to get a reference from PubMed. But are you ready to take on a huge module dependency just to use one function?
Write your own Python script, using a PubMed URL to download your reference and a little bit of XML parsing to extract the relevant information (one could use the ESearch and EFetch tools, but my lazy nature tells me to simply use the URL).
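The second option starts with building a URL from the PMID and downloading whatever it returns. Here is a minimal sketch of that step using only the standard library. The post does not give the URL the script queries, so the endpoint below (NCBI's EFetch) is an assumption chosen for illustration, and `build_url` and `download` are hypothetical names.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: NCBI's EFetch endpoint, not necessarily the URL used by the script.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_url(pmid):
    """Return the URL that asks for one PubMed record as XML."""
    query = urlencode({"db": "pubmed", "id": pmid, "retmode": "xml"})
    return f"{EFETCH_URL}?{query}"


def download(pmid):
    """Fetch the record over HTTP and return the response body as text."""
    with urlopen(build_url(pmid)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call download(16934136) to actually hit the network.
url = build_url(16934136)
print(url)
```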

Obviously, I chose to write my own Python script. Each reference, retrieved in the PubMed XML format (described by the full DTDs), should come out like this:

@article{poirrier06,
  author = {Poirrier, J.E. and Poirrier, L. and Leprince, P. and
Maquet, P.},
  title = {Gemvid, an open source, modular, automated activity
recording system for rats using digital video},
  year = 2006,
  journal = {Journal of circadian rhythms},
  volume = 4,
  pages = {10},
  pmid = 16934136,
  doi = {10.1186/1740-3391-4-10}
}
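Producing an entry in this shape is mostly string formatting. The sketch below is one way to do it, not the script's actual code: `format_bibtex` is a hypothetical helper that writes integers bare (as in `year = 2006`) and wraps everything else in braces, with the indentation width as a parameter.

```python
def format_bibtex(key, fields, indent=2):
    """Return an @article entry as a string.

    Integers are written bare; every other value is wrapped in braces.
    """
    pad = " " * indent
    lines = []
    for name, value in fields.items():
        if isinstance(value, int):
            lines.append(f"{pad}{name} = {value}")
        else:
            lines.append(f"{pad}{name} = {{{value}}}")
    return "@article{" + key + ",\n" + ",\n".join(lines) + "\n}"


entry = format_bibtex(
    "poirrier06",
    {
        "author": "Poirrier, J.E. and Poirrier, L. and Leprince, P. and Maquet, P.",
        "title": "Gemvid, an open source, modular, automated activity "
                 "recording system for rats using digital video",
        "year": 2006,
        "journal": "Journal of circadian rhythms",
        "volume": 4,
        "pages": "10",
        "pmid": 16934136,
        "doi": "10.1186/1740-3391-4-10",
    },
)
print(entry)
```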

The script, pyP2B.py, weighs about 4 kB. First, use PubMed to find the reference you want, then take its PubMed ID (PMID) and launch the program, redirecting its output to your BibTeX file, for example:

./pyP2B.py 16934136 >> myrefs.bib

If you like, you can edit the script to change the indentation width (2 spaces here).

How does it work?

With PubMed, I do not use the proper tool but a plain HTTP query, which is much simpler. The script asks for the citation matching the PMID. Since it gets an HTTP answer, I need to parse this answer, replacing the entities it contains, to obtain a valid XML file.
Once I have the XML file, and after some checking, I use XPath expressions from lxml (for me, XPath is quick and dirty compared to writing a DOM/SAX structure, but it works!).
Then the script simply prints the result to the standard output (even if it’s an error; a possible improvement would be to print errors to the standard error output). You simply need to send this output into your BibTeX file with the correct redirection.
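The three steps above can be sketched roughly as follows. This is an illustration under several assumptions, not the script itself: it uses the standard library's `xml.etree.ElementTree` (which supports only a subset of XPath) in place of lxml, the sample record is hand-written and heavily trimmed, the escaped answer is simulated with `html.escape`, and `extract_fields` and `run` are hypothetical names. It also applies the improvement mentioned above by sending errors to the error output.

```python
import html
import io
import sys
import xml.etree.ElementTree as ET

# Hand-written, heavily trimmed record using PubMed element names.
SAMPLE = """<PubmedArticle><MedlineCitation>
  <PMID>16934136</PMID>
  <Article>
    <Journal>
      <JournalIssue><Volume>4</Volume><PubDate><Year>2006</Year></PubDate></JournalIssue>
      <Title>Journal of circadian rhythms</Title>
    </Journal>
    <Pagination><MedlinePgn>10</MedlinePgn></Pagination>
    <AuthorList>
      <Author><LastName>Poirrier</LastName><Initials>JE</Initials></Author>
      <Author><LastName>Maquet</LastName><Initials>P</Initials></Author>
    </AuthorList>
  </Article>
</MedlineCitation></PubmedArticle>"""


def extract_fields(xml_text):
    """Pull a few BibTeX fields out of one PubmedArticle element."""
    root = ET.fromstring(xml_text)
    citation = root.find("MedlineCitation")
    if citation is None:
        raise ValueError("no MedlineCitation element found")
    authors = []
    for author in citation.findall("Article/AuthorList/Author"):
        # "JE" becomes "J.E." to match the BibTeX style shown earlier.
        initials = "".join(c + "." for c in author.findtext("Initials", default=""))
        authors.append(f"{author.findtext('LastName')}, {initials}")
    return {
        "author": " and ".join(authors),
        "journal": citation.findtext("Article/Journal/Title"),
        "year": citation.findtext("Article/Journal/JournalIssue/PubDate/Year"),
        "volume": citation.findtext("Article/Journal/JournalIssue/Volume"),
        "pages": citation.findtext("Article/Pagination/MedlinePgn"),
        "pmid": citation.findtext("PMID"),
    }


def run(response, out=sys.stdout, err=sys.stderr):
    """Print the fields to `out`, or a message to `err`; return an exit code."""
    try:
        # Step 1: replace entities; step 2: parse and query the XML.
        fields = extract_fields(html.unescape(response))
    except (ET.ParseError, ValueError) as exc:
        print(f"error: {exc}", file=err)
        return 1
    # Step 3: print the result.
    for name, value in fields.items():
        print(f"{name} = {value}", file=out)
    return 0


# Simulate an escaped HTTP answer, then a broken one.
out_buf, err_buf = io.StringIO(), io.StringIO()
ok_code = run(html.escape(SAMPLE), out_buf, err_buf)
bad_code = run("not xml at all", out_buf, err_buf)
print(out_buf.getvalue(), end="")
```

Keeping errors off the standard output means a failed lookup no longer ends up appended to the BibTeX file by `>> myrefs.bib`.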

Edit on October 23rd: this script has errors when dealing with non-ASCII characters like “ö” in Angelika Görg. I won’t fix it for the moment.

Jean-Etienne February 14, 2007

Update on Feb 14th, 2007: I corrected the error when dealing with non-ascii chars. I also wrote a small webpage to explain how it works on my website: http://www.poirrier.be/~jean-etienne/software/pyp2b/

epot’s blog » Blog Archive » GUI version of pyP2B February 16, 2007

[…] My python script pyP2B was command-line only. Tonight, I played for the first time with Tk, re-wrote pyP2B as a class and thus added a GUI. […]