Comment on Searching through a bulk of pdf files

hoppolito@mander.xyz 6 days ago

For the OCR step you can probably put together a simple bash pipeline with ocrmypdf and let it run once in the background until all your PDFs have a text layer.

With that tool it should be doable with something like a simple while loop:

find . -type f -name '*.pdf' -print0 |
    while IFS= read -r -d '' file; do
        echo "Processing $file ..."
        ocrmypdf "$file" "$file"
        # ocrmypdf "$file" "${file%.pdf}_ocr.pdf"   # if you want a new file instead of overwriting the old
    done

If you need additional languages or other options, you’ll have to dig a little deeper into the ocrmypdf documentation, but this should be enough duct tape to get a full OCR pass going.
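As a rough sketch of what those extra options could look like: the variant below wraps the same loop in a function, passes a language list with -l, and adds --skip-text so pages that already have text are left alone when you re-run it. The names ocr_tree and OCR_LANGS are just made up for this example, and the language codes assume the matching Tesseract language packs are installed. Check the flags against your installed ocrmypdf version before trusting it with a large archive.

```shell
#!/usr/bin/env bash
# Sketch: OCR every PDF under a directory, in place.
# ocr_tree and OCR_LANGS are illustrative names, not part of ocrmypdf.

OCR_LANGS="${OCR_LANGS:-eng}"   # Tesseract language codes, e.g. "eng+deu"

ocr_tree() {
    local dir="${1:-.}" file
    find "$dir" -type f -name '*.pdf' -print0 |
        while IFS= read -r -d '' file; do
            echo "Processing $file ..."
            # --skip-text skips OCR on pages that already contain text,
            # so the loop can be re-run after an interruption.
            # </dev/null guards against the command consuming the
            # file list that the loop is reading from stdin.
            ocrmypdf --skip-text -l "$OCR_LANGS" "$file" "$file" </dev/null ||
                echo "Failed: $file" >&2
        done
}

# Usage: OCR_LANGS=eng+deu ocr_tree ~/Documents/scans
```

Failures are only reported to stderr and the loop moves on, so one broken PDF does not stop the whole run.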
