
Comment on What is an efficient workflow to separate and organize bulk scanned PDF documents? (At work; software is limited.)


endless@lemmy.ml 3 months ago

I don’t have the volume where learning a completely new technology would be worthwhile. I would have to manually verify each one anyway because it has to be perfect. The documents do not have any format as nice as a heading at the top. I’m willing to put in the time to go through each page; I just need a fast way to tag them, then automate separation and renaming.


Sunsofold@lemmings.world 3 months ago
Hmm. Well, first off, if you mean you don’t know how to write a script and don’t view it as worth learning for this task, that limits the options a fair amount. If you mean you don’t want to learn about the particulars of script-based PDF editing or OCR, that’s understandable.

If you don’t want to script at all, you should be able to segment the PDFs via Acrobat, or even just ‘print to PDF’ with page ranges in most viewers. There are ways of bulk renaming files once you have segmented them, even without scripting, though whether and how that helps depends on your use case.

If you want to script just a little, I made a script ages ago where I used the document’s file name to hold the metadata of what needed to be modified. You could certainly do that. For example: open the doc in one window, select the file for renaming in your file explorer, scroll through the doc and type the page numbers where each new document starts into the rename field [documentName3,7,15,22,29.PDF], then run a script to segment the PDF at those page numbers, so you end up with ‘documentName-1.PDF’ containing pages 1 to 2, another with pages 3 to 6, etc.
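A rough sketch of the bookkeeping half of that idea, in Python as an illustration only (the original script is not shown in this thread, and the function names here are made up). It reads the page numbers out of the file name and works out which pages go into which output file. The actual cutting is left out; it would need a PDF library such as pypdf, and the total page count would have to come from the PDF itself.

```python
import re
from pathlib import PurePath


def parse_split_points(filename):
    """Return (base_name, split_pages) from a name like 'documentName3,7,15.PDF'."""
    stem = PurePath(filename).stem  # file name without the extension
    # Lazy base name followed by a comma-separated run of page numbers.
    # Caveat: a base name that itself ends in a digit is ambiguous here,
    # so a separator character before the list would be safer in practice.
    match = re.fullmatch(r"(.*?)(\d+(?:,\d+)*)", stem)
    if match is None:
        raise ValueError(f"no page list found in {filename!r}")
    base, numbers = match.groups()
    return base, [int(n) for n in numbers.split(",")]


def page_ranges(split_pages, total_pages):
    """Turn 'a new document starts on page N' markers into (first, last) ranges."""
    if split_pages != sorted(set(split_pages)):
        raise ValueError("split pages must be strictly increasing")
    if split_pages and not (2 <= split_pages[0] and split_pages[-1] <= total_pages):
        raise ValueError("split pages must lie between 2 and the last page")
    starts = [1] + split_pages
    ends = [page - 1 for page in split_pages] + [total_pages]
    return list(zip(starts, ends))


def plan_segments(filename, total_pages):
    """List (output_name, first_page, last_page) for every segment."""
    base, split_pages = parse_split_points(filename)
    return [
        (f"{base}-{index}.PDF", first, last)
        for index, (first, last) in enumerate(page_ranges(split_pages, total_pages), start=1)
    ]
```

Calling `plan_segments("documentName3,7,15,22,29.PDF", 30)` on a hypothetical 30-page scan gives six entries, starting with `documentName-1.PDF` for pages 1 to 2 and `documentName-2.PDF` for pages 3 to 6. Printing the plan before cutting anything also gives a cheap way to check the tagging by eye.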

With a bit more effort the script could also do some of the renaming, though how useful that would be depends on the particulars of your case. I could see extending the previous script a little so the page annotations include a doc type, e.g. 13cn meaning segment at page 13 and label it as ‘originalDocumentName-clientNotification’, or even 13’arbitraryText’ to use the arbitrary text as the new file name.
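The annotated version could be parsed along these lines. Again this is only a sketch in Python under assumptions: the `cn` code comes from the example above, but the lookup table, the function names, and the fallback name for an unlabelled page are all invented for illustration.

```python
import re

# Short codes mapped to document types. Only "cn" appears in the comment above;
# any further entries would be your own conventions.
LABELS = {"cn": "clientNotification"}

# A page number, optionally followed by quoted free text or a letter code.
# Both straight and curly single quotes are accepted.
TOKEN = re.compile(r"(\d+)(?:['‘’](.+)['‘’]|([A-Za-z]+))?")


def parse_annotation(token, base):
    """Return (start_page, output_stem) for one token such as '13cn'."""
    match = TOKEN.fullmatch(token.strip())
    if match is None:
        raise ValueError(f"cannot parse annotation {token!r}")
    page = int(match.group(1))
    free_text, code = match.group(2), match.group(3)
    if free_text is not None:
        return page, free_text  # quoted text becomes the whole new name
    if code is not None:
        if code not in LABELS:
            raise ValueError(f"unknown document type code {code!r}")
        return page, f"{base}-{LABELS[code]}"
    # Hypothetical fallback for a bare page number: name the piece after its start page.
    return page, f"{base}-{page}"


def parse_annotations(spec, base):
    """Parse a comma-separated list of tokens.

    Caveat: free text that itself contains a comma would break this simple split.
    """
    return [parse_annotation(token, base) for token in spec.split(",")]
```

For instance, `parse_annotations("3,13cn", "originalDocumentName")` pairs page 3 with the fallback name and page 13 with `originalDocumentName-clientNotification`. Keeping the codes in one small table means a typo in a tag fails loudly instead of producing a misnamed file.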

The particularity of your case may be precisely why it hasn’t been automated yet.

- endless@lemmy.ml 3 months ago
  I am not going to learn how to train an AI for this task. It is nontrivial to install anything and I cannot use any remote/online tools. I would need to find an appropriate local AI (DeepSeek?) and learn how to use it from scratch.
  
  I could write a bash script to modify filenames at home on my Linux machine. But at work I just have Windows. It has… PowerShell? I guess. I’ve never used that and to be honest I have no desire to. I would have to install something to cut up the PDFs. There is ocrmypdf, which could do everything, and there are various other CLI PDF manipulation tools in the repos. But I would have to ask to have it installed, along with any other dependencies required. Not gonna happen.
  
  I want a way to easily go through hundreds of pages, look at them and quickly tag them. That is a perfect task for a GUI. To use a script I would have to scroll through the PDF in one application then switch back and forth into a text editor, to manually create a text document specifying which pages are in what document, and what category etc. I’d sooner do it on paper. But I’m sure there is a solution for this, I just don’t know what it is.
  
  - Sunsofold@lemmings.world 3 months ago
    At the purely GUI level, if you’re being granted Acrobat, it turns out you can extract arbitrary subsets of pages manually, very quickly. You can then rename them. I haven’t learned PowerShell personally, but it absolutely could be used to batch rename files, even if it’s a somewhat silly-looking language compared to bash. Again, though, how much work that involves depends on your desired naming conventions.
    