Comment on Pdf to odt/docx conversion has me weeping!
Treczoks@lemmy.world 1 week ago
The problem lies in the PDFs themselves. In there are objects that represent lines of glyphs. If you are lucky. A conversion tool can guess which of those lines belong together and produce the text.
It cannot know any intentions behind it, though. Take a numbered list. The first line is two line objects: the number plus the . or the ), and the first line of text. The conversion tool can now guess. As the line blocks with the numbers are all left of the line blocks with text, this could be a numbered list. Or it could be a table with two columns. Nothing in the PDF is giving any hints.
And that is the easy part. This assumes that the document either uses default fonts, or keeps its embedded fonts untouched. If they use embedded fonts and a PDF optimizer that only embeds the used characters and renumbers them, any copy or conversion tool is bound to fail.
Same with protected PDFs where you simply cannot copy the text from the start.
And then there are PDFs that just consist of scanned pages. Here you would need an OCR software to get something readable out of them.
PDF is an archival, output format, the end of a process. Not something to work from.
Always preserve the original file. Keep it safe. If you change tools, make sure you have a conversion path into something editable. The PDF is for giving away, nothing else.
ChaoticNeutralCzech@feddit.org 1 week ago
Renumbering characters during font minimization? I haven’t encountered that, it would break searching and copying.
Anyway, PDFs for example don’t even say whether a line of text is left, center or justified – they usually store the coordinates of the first character and then spacing to each subsequent one unless defined by the font.
And what if the document contains text boxes, or other Word objects? Well, the text is separate from the underlying rectangle (if there is one) and it’s up to the conversion tool to guess if it’s part of the main text layer.
Sorry, it’s really hard to edit PDFs. You might want to use Inkscape for editing the graphical parts. If you also need to edit paragraphs, I suggest recreating the document by pasting them into Word/LibreOffice, and importing any graphical shapes as SVGs (use Inkscape for the conversion, then you can try Word’s “Graphic > Convert to Shapes” feature).
Really, every software that uses PDF should treat it as an export format, hopefully making it clearer that “saving as PDF” is a visually lossless but structurally lossy and messy process.
Treczoks@lemmy.world 6 days ago
The compressing and renumbering seems to be more common with embedded Chinese fonts - Space-wise it makes a lot of sense. But yes, mark and copy text, paste it into word or writer, and you get gibberish. Can’t verify the search, though. And, of course, Google translate can’t do anything with it, either.