As a gift and help to my Dad, I've been working on digitizing a manuscript that he wrote over 30 years ago but did not publish. The work is about hot-air balloons and hot-air ballooning. Last night I made some progress on digitizing it. I have now structured nearly 50% of the OCR'd text of the manuscript.
I'm hoping that we can finish cleaning it up and self-publish the work, possibly chapter by chapter via a service such as Leanpub. I think that the work is a fun read and that it doubles as a time capsule.
Below are the steps that I've taken on this project so far. The tasks in each of the main sections were done many weeks apart, but they could all be done in one day. Structuring the text has been the slowest part so far.
Scans to PDFs
I first scanned the pages of the manuscript on our printer/scanner, which saved each page as a separate JPG file with an incrementing number in the filename (page one's name as ...0001, page two's as ...0002, etc.). Because of that number, I could easily list the files in sequence rather than having to put them back in order. Accordingly, I was able to use the code here to put all of the JPG files into one PDF file and preserve their sequence. (The solutions above that one did not help.)
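For anyone who wants to script this step, here is a minimal sketch of the same idea in Python. It is not the code I linked to. It assumes the scans sit in one folder, that each filename ends in a numeric counter, and that the third-party img2pdf package is installed. The function names and the scan_0012.jpg style of filename are made up for illustration.

```python
from __future__ import annotations

import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the trailing counter in a scan's filename, e.g. 12 for 'scan_0012.jpg'."""
    match = re.search(r"(\d+)$", path.stem)
    if match is None:
        raise ValueError(f"no page counter in filename: {path.name}")
    return int(match.group(1))


def ordered_scans(folder: str) -> list[Path]:
    """List the JPG scans in a folder, sorted by their filename counter."""
    scans = [p for p in Path(folder).iterdir() if p.suffix.lower() in {".jpg", ".jpeg"}]
    return sorted(scans, key=page_number)


def scans_to_pdf(folder: str, output_pdf: str) -> None:
    """Write every scan in the folder to one PDF, one page per image."""
    import img2pdf  # third-party (assumption): pip install img2pdf

    pages = [str(p) for p in ordered_scans(folder)]
    with open(output_pdf, "wb") as out:
        out.write(img2pdf.convert(pages))
```

Sorting on the parsed number rather than on the raw filename keeps the order right even if the scanner does not zero-pad its counter.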
PDF Images to OCR Text
I then used the GUI OCR program gscan2pdf to run OCR software on each of the pages and embed the recognized text inside the PDF. (The text is hidden in the PDF until you select it on a page; otherwise the PDF shows you the image of the page.)
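I did this step by hand in the GUI, but it could probably be scripted too. The sketch below swaps in a different tool, the OCRmyPDF command-line program, which also adds a hidden text layer to an image-only PDF. It assumes ocrmypdf is installed and on the PATH; the function names and file names are placeholders, and I have not compared its results with what gscan2pdf produced.

```python
import subprocess


def build_ocr_command(input_pdf: str, output_pdf: str, language: str = "eng") -> list[str]:
    """Build the ocrmypdf command line for adding a text layer to a scanned PDF."""
    # -l selects the OCR language; "eng" is an assumption about the manuscript.
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]


def add_text_layer(input_pdf: str, output_pdf: str, language: str = "eng") -> None:
    """Run ocrmypdf and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(input_pdf, output_pdf, language), check=True)
```

Calling add_text_layer("manuscript_scans.pdf", "manuscript_ocr.pdf") would leave the scanned file untouched and write the searchable copy next to it.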
Structuring the Text
I copied all of the text embedded in the PDF via CTRL+C, then pasted it into my Markdown editor of choice, Typora. With the editor side by side with the PDF open in my PDF viewer, I started formatting the heading text as headings, fixing paragraphs that were running together, noting where diagrams belong and need to be added or where some are referenced but seem to be missing, and removing the headers that were typed on each page (the page numbers, author name, and book title).
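I've been removing the per-page headers by hand, but that part in particular looks scriptable. Here is a minimal sketch, assuming each header item (page number, author name, book title) sits on its own line in the pasted text. The function name and the sample phrases are hypothetical, not taken from the manuscript.

```python
import re


def strip_running_headers(text: str, header_phrases: list[str]) -> str:
    """Drop lines that are bare page numbers or exactly match a known header phrase."""
    phrases = {p.strip().lower() for p in header_phrases}
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):
            continue  # a line holding only a number is treated as a page number
        if stripped.lower() in phrases:
            continue  # author name or book title repeated at the top of a page
        kept.append(line)
    return "\n".join(kept)


sample = "Example Book Title\nA. Author\n12\nThe burner roared.\n\nWe rose slowly.\n13\n"
cleaned = strip_running_headers(sample, ["Example Book Title", "A. Author"])
print(cleaned)
```

A line in the body that happens to contain only a number would be dropped as well, so the result still needs a read-through against the PDF.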