Saturday 12 November 2022

Progress report

Writing a book takes time, and while it has taken me a bit longer than expected, I am making progress. So much so that I am already thinking about printing it. However, this post is not about the content but about technical issues.

I found a print-on-demand service, very aptly called Book on Demand, which seems reasonable. Truth be told, I did not spend a lot of time researching and comparing features and prices. For what I want (printing about five books for the family) any book service will do. This one offers what I assume is the usual selection of softcover and hardcover, paper weights, color vs. B&W and a long list of other options. You can pay for many things, or do it all yourself (my case). Not surprisingly, my book has a ton of photos (more on that later), and it would be nice to print them in color, except that obviously not every page has a photo, only about one third of them. I am currently at over 300 pages (A5), and that's a lot of pages of black text to print in "color", which would be too expensive. Luckily, there is an option to enter the list of pages that should be printed in color, which reduces the price significantly.

The way I saw it there were two main options:

Option A: Finish the pdf and make a note manually of which pages have photos. This would take about ten minutes, maybe fifteen minutes?

Option B: Write some code that inspects the pdf and spits out the list of pages with photos. Who knows how long this takes; for sure more than 15 minutes, as there are likely many unknown unknowns, plus I have no idea about the structure of pdf files.

It goes without saying that I went for Option B. And even writing this post took more than 15 minutes.

On to the tech details. I ended up installing PyPDF2 and following this. But for some reason my pdf seems to have many images on every page (I generate it from LibreOffice; no idea if that makes a difference).

Since the easy way did not work, I did it the hard way, aka the fun way: opening the pdf file as a text file (ignoring utf-8 errors and not even bothering to uncompress the pdf file first) and peeking into it until I figured out the following.
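A minimal sketch of that "open it as text" step in Python could look like the following. The function name `read_pdf_as_text` is made up for illustration, and this is not the original script, just one way to do what is described above.

```python
def read_pdf_as_text(path):
    """Read a PDF as if it were a text file, dropping undecodable bytes."""
    # errors="ignore" silently skips byte sequences that are not valid UTF-8
    # (compressed streams, JPEG data), so the plain-text parts of the file,
    # such as "34 0 obj" headers and dictionaries, stay readable.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return handle.read()
```

The resulting string is full of garbage where the binary streams were, but the object headers survive, and those are all that is needed here.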

There is an object which gives some information about pages in the pdf (my pdf currently has 317 pages). I just had to figure out what those numbers followed by '0 R' were about.

1487 0 obj
<</Type/Pages
/Resources 1524 0 R
/MediaBox[ 0 0 419 595 ]
/Kids[ 1 0 R 4 0 R 7 0 R 10 0 R 13 0 R 16 0 R...
56 0 R 60 0 R 64 0 R 69 0 R 74 0 R 78 0 R 81 0 R...
122 0 R 125 0 R 128 0 R 131 0 R 135 0 R 138 0 R...
...
990 0 R 993 0 R 996 0 R 999 0 R 1002 0 R 1008 0 R...
1045 0 R 1048 0 R 1051 0 R 1054 0 R 1057 0 R 1060 0 R...
...1120 0 R 1123 0 R 1126 0 R 1129 0 R ]
/Count 317>>
endobj
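Pulling those numbers out of the /Pages object can be sketched with a regular expression. This assumes a flat page tree (a single /Kids list with one entry per page, as in the dump above) and generation number 0 everywhere; the function name `first_object_per_page` is hypothetical.

```python
import re

def first_object_per_page(pdf_text):
    """Return the object number listed in /Kids for each page, in page order."""
    # Find the /Type/Pages object and capture what sits inside /Kids[ ... ].
    match = re.search(r"/Type\s*/Pages\b.*?/Kids\s*\[(.*?)\]", pdf_text, re.DOTALL)
    if match is None:
        raise ValueError("no /Pages object with a /Kids list found")
    # Each entry looks like "13 0 R"; keep only the object number.
    return [int(num) for num in re.findall(r"(\d+)\s+0\s+R", match.group(1))]
```

The position in the returned list is the page number minus one, so the first entry belongs to page 1.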

It turns out that by searching for "images" I found the list of my inserted images. Something like:

34 0 obj
<</Type/XObject/Subtype/Image/Width 881 /Height 644 ... /Length 122470>>
stream
JFIF
O]NV
...
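Collecting the object numbers of all such image objects can be sketched the same way. This assumes the image dictionary starts right after `N 0 obj` and has no nested dictionary before `/Subtype/Image`, which holds for the example above but not necessarily for every pdf; `image_object_numbers` is a made-up name.

```python
import re

def image_object_numbers(pdf_text):
    """Return the numbers of all objects declared as /Subtype/Image."""
    # "[^>]*?" keeps the search inside the dictionary that directly follows
    # "N 0 obj", so a page object is not confused with a later image object.
    pattern = re.compile(r"(\d+)\s+0\s+obj\s*<<[^>]*?/Subtype\s*/Image\b")
    return [int(num) for num in pattern.findall(pdf_text)]
```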

And putting two and two together, I figured out that the first list gives the first object that goes on each page, so the first page has objects 1 to 3, the second page objects 4 to 6, etc. I know that the first page with an image is the 11th one (at least in this iteration of the book), which would include objects 31 to 34 and, bingo, the first image is object 34. As a second check, I looked at the last page with a photo in my pdf (the 288th page), and its object range also includes the object listed for the last image. So that seems to be it, actually a lot easier than expected.
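The matching step itself (which page "owns" a given image object) can be sketched with a binary search over the list of first objects. This is an illustration under the assumptions above, not the original script: it expects the first-object list to be in ascending order, and the function name and the toy numbers for pages 7 to 12 are invented to match the "objects 31 to 34" example.

```python
from bisect import bisect_right

def pages_with_images(page_starts, image_objects):
    """Return sorted 1-based page numbers whose object range holds an image."""
    pages = set()
    for obj in image_objects:
        # bisect_right counts how many pages start at or before this object,
        # which is the 1-based number of the page whose range contains it.
        page = bisect_right(page_starts, obj)
        if page > 0:  # objects numbered before the first page belong to no page
            pages.add(page)
    return sorted(pages)

# Toy data: page 11 starts at object 31 and page 12 at object 35.
starts = [1, 4, 7, 10, 13, 16, 19, 22, 25, 28, 31, 35]
print(",".join(str(p) for p in pages_with_images(starts, [34])))
```

With the toy data, object 34 falls in the range that starts at 31, so the script prints 11. The comma-separated output is a guess at a convenient format for pasting into a color-page field.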

This takes less than 30 lines of code, and it reminds me of the book Automate the Boring Stuff with Python, by Al Sweigart. And yes, it took me more than 15 minutes, but I am sure I will make changes to the output pdf, and if nothing else, this has been way more satisfying than doing it manually. I don't get to code any more at work, and sometimes I miss it.


2 comments:

Kety said...

Looking forward to reading it on paper.
Kisses

Kety said...

It had been a while since I last stopped by here. You have had time to write a book. :-))