Comment on Searching through a bulk of pdf files
hoppolito@mander.xyz 6 days ago
For the OCR process you can probably wrangle up a simple bash pipeline with ocrmypdf and just let it run in the background once until all your PDFs have a text layer.
With that tool it should be doable with something like a simple while loop:
find . -type f -name '*.pdf' -print0 | while IFS= read -r -d '' file; do echo "Processing $file ..." ocrmypdf "$file" "$file" # ocrmypdf "$file" "${file%.pdf}_ocr.pdf" # if you want a new file instead of overwriting the old done
If you need additional languages or other options you’ll have to delve a little deeper into the ocrmypdf documentation but this should be enough duct tape to just whip up a full OCR cycle.
Darkassassin07@lemmy.ca 5 days ago
That’s a neat little tool that seems to work pretty well. Turns out the files I thought I’d need it for already have embedded OCR data, so I didn’t end up needing it. Definitely one I’ll keep in mind for the future though.