Comment on Pdf to odt/docx conversion has me weeping!

observantTrapezium@lemmy.ca 8 months ago

I know the pain. While there are definitely solutions that work sometimes, there’s just no “one size fits all” that I’m aware of. PDFs can represent text very differently internally.

What I did for one project where extracting the text produced a complete mess was to convert the PDF pages to images and then OCR them…
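For anyone who wants to try the same route, here is a rough sketch of the "render pages to images, then OCR" idea. It is not the code from that project. It assumes the third-party `pdf2image` (needs Poppler) and `pytesseract` (needs the Tesseract binary) packages, and the function name `ocr_pdf` and its parameters are made up for illustration.

```python
from typing import Callable, Iterable, Optional


def ocr_pdf(
    pdf_path: str,
    dpi: int = 300,
    rasterize: Optional[Callable[[str, int], Iterable[object]]] = None,
    recognize: Optional[Callable[[object], str]] = None,
) -> str:
    """Render each PDF page to an image, OCR it, and join the page texts."""
    if rasterize is None:
        # Assumption: pdf2image is installed, along with Poppler.
        from pdf2image import convert_from_path

        rasterize = lambda path, d: convert_from_path(path, dpi=d)

    if recognize is None:
        # Assumption: pytesseract is installed, along with Tesseract itself.
        import pytesseract

        recognize = pytesseract.image_to_string

    # One image per page, in page order.
    pages = rasterize(pdf_path, dpi)

    # OCR each page separately and trim surrounding whitespace.
    texts = [recognize(page).strip() for page in pages]

    # Separate pages with a blank line.
    return "\n\n".join(texts)
```

Calling `ocr_pdf("scan.pdf")` would return the recognized text as one string. The two hooks are only there so the rendering or OCR step can be swapped out (or stubbed) without touching the rest. You lose all layout and formatting this way, so it only helps when plain text is enough.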

fossilesque@mander.xyz 8 months ago
StirlingPDF is basically one-size-fits-all.

- observantTrapezium@lemmy.ca 8 months ago
  Interesting, I’ll keep it in mind next time I have to deal with this problem (hopefully never but who knows).
  
  A few years ago I was in contact with researchers who were developing an AI tool to parse PDFs (I think they didn’t care about converting to editable formats, only about extracting data). From their material, I got the impression that it’s extremely difficult to do right using traditional algorithms.
  
  - fossilesque@mander.xyz 8 months ago
    news.ycombinator.com/item?id=44287043
    