Comment on Searching through a bulk of pdf files

tofu@lemmy.nocturnal.garden 4 months ago

The OCR thing is its own task, but for just searching for a string in PDFs, pdfgrep is very good.

pdfgrep -ri CoolNumber69 /path/to/folder

Darkassassin07@lemmy.ca 4 months ago
That works magnificently. I added -l so it spits out a list of files instead of listing each matching line in each file, then set it up with an alias. Now I can ssh in from my phone and search the whole collection for any string with a single command.

Thanks again!
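
For anyone wanting to copy the setup described above, here is a minimal sketch of what such a shortcut could look like. It is not the poster's actual alias: the name `pdfsearch` and the `PDF_DIR` default are made up for illustration, and a shell function is used instead of an alias so the search string can be passed as an argument.

```shell
# Hypothetical wrapper: the name pdfsearch and the PDF_DIR default are assumptions.
PDF_DIR="${PDF_DIR:-$HOME/Documents/pdfs}"

pdfsearch() {
    # -r: recurse into subfolders
    # -i: ignore case
    # -l: print only the names of files that contain a match
    pdfgrep -ril "$1" "$PDF_DIR"
}

# Usage: pdfsearch CoolNumber69
```

Dropping something like this into your shell startup file (for example `~/.bashrc`) should make it available in interactive sessions.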

- tofu@lemmy.nocturnal.garden 4 months ago
  Glad to hear that!
  
Darkassassin07@lemmy.ca 4 months ago
Interesting; that would be much simpler. I’ll give that a shot in the morning, thanks!

- hoppolito@mander.xyz 4 months ago
  In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga), which lets you search through a whole bunch of files like PDFs quickly. And it’s cached, so while the first indexing takes a moment, any further search is lightning fast.
  
  It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on) but I mostly used it to quickly search through thousands of research papers. Takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
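
A rough sketch of how the rga workflow described above might be wrapped up, in case it helps. The name `papersearch` and the `PAPERS_DIR` default are hypothetical, and it assumes rga hands ordinary ripgrep flags such as `-i` and `-l` through to rg.

```shell
# Hypothetical helper: the name papersearch and the PAPERS_DIR default are assumptions.
PAPERS_DIR="${PAPERS_DIR:-$HOME/papers}"

papersearch() {
    # -i: ignore case
    # -l: list only the files that contain a match
    # Both are regular ripgrep flags, assumed here to be passed through by rga.
    rga -il "$1" "$PAPERS_DIR"
}

# Usage: papersearch "some phrase"
```

The first call over a big folder would be the slow one, since that is when the text gets extracted and cached; later calls reuse the cache.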