The National Academies Press is putting some of its books online. I was particularly interested in the Guidelines for the Care and Use of Mammals in Neuroscience and Behavioral Research. The only “trick” is that they provide the book one page at a time (either in HTML or in PDF format). If you want entire chapters or the whole book in one file, you have to purchase it. I think that is a fair deal (how many publishers do that?).
Now, I was sure I could automate retrieving the PDFs and end up with one file containing the whole book. The pages are numbered from 1 to 209. So, I wrote this small Bash script to retrieve all the pages (all the PDFs):
#!/bin/bash
# ./getbook.sh -> retrieve PDFs from nap.edu/
c=1
while [ $c -lt 210 ]
do
    wget -c http://print.nap.edu/pdf/0309089034/pdf_image/$c.pdf
    c=$((c+1))
done
In a few minutes, I was able to get all the PDFs. 🙂

Now, I want them all in a single PDF. Here, I’ll use pdfjoin (from PDFJam) to … join them. Of course, I could start typing one big command like “pdfjoin 1.pdf 2.pdf 3.pdf ...”, but I was sure there was a better solution. Again, I used a Bash script:
#!/bin/bash
# ./joinpdf.sh -> join PDFs from nap.edu/
c=2
s="1.pdf"
while [ $c -lt 210 ]
do
    s="$s $c.pdf"
    c=$((c+1))
done
pdfjoin $s --fitpaper false --paper a4paper --outfile book.pdf
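As a side note, if GNU seq is available, the list of file names could probably be generated without the loop. This is only a sketch, reusing the same pdfjoin options as above:

# assumes GNU seq; prints "1.pdf" through "209.pdf", one name per line
pdfjoin $(seq -f '%g.pdf' 1 209) --fitpaper false --paper a4paper --outfile book.pdf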
Now, I have a wonderful book.pdf that I can read on my computer or print out. 🙂
P.S.1: it’s not Perl, but I am sure there is more than one way to do it.
P.S.2: you can combine the two Bash scripts to do everything in one go. In that case, it would be interesting to put the maximum page number in a variable (the two scripts above hard-code the loop bound 210, i.e. pages 1 to 209); see the sketch after these postscripts.
P.S.3: as usual, explanations around these scripts are longer than the scripts themselves!
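For P.S.2, a rough sketch of the combined script could look like the following. The script name and the MAX variable are my own choices; the URL and the pdfjoin options come from the two scripts above:

#!/bin/bash
# ./getbook-all.sh -> hypothetical combined script: download and join in one go
MAX=209                        # last page of the book (pages 1..MAX)
c=1
s=""
while [ $c -le $MAX ]
do
    wget -c http://print.nap.edu/pdf/0309089034/pdf_image/$c.pdf
    s="$s $c.pdf"
    c=$((c+1))
done
pdfjoin $s --fitpaper false --paper a4paper --outfile book.pdf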