Comment on Recreating uncensored Epstein PDFs from raw encoded attachments
trolololol@lemmy.world 1 day ago
Curious here, this is base64? And what’s behind it is more often than not an image or text? And you need to do OCR to get the characters?
Maybe for the text it could use a dictionary to rubber stamp whether that zero is actually a letter O, etc.?
I’m curious to know what the challenge is and what your approach is.
kescusay@lemmy.world 1 day ago
Yes, it’s base64. And what’s behind it could be anything that can be attached to an email.
In this case, it’s a PDF. If the base64 text can be extracted accurately, then the PDF that was attached to the email can be recreated.
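A minimal sketch of that decode step, assuming the base64 text has already been transcribed correctly. The helper names `decode_attachment` and `looks_like_pdf` are made up for illustration, and the sample payload is a tiny stand-in, not a real document:

```python
import base64
import binascii


def decode_attachment(b64_text: str) -> bytes:
    """Decode base64 text, as wrapped in an email body, into raw bytes."""
    # Email base64 is split across many lines; drop all whitespace first.
    cleaned = "".join(b64_text.split())
    try:
        # validate=True rejects characters outside the base64 alphabet
        # instead of silently skipping them.
        return base64.b64decode(cleaned, validate=True)
    except binascii.Error as exc:
        raise ValueError(f"not valid base64: {exc}") from exc


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files begin with the marker %PDF-."""
    return data.startswith(b"%PDF-")


# Stand-in payload, wrapped to short lines the way an email would wrap it.
original = b"%PDF-1.4\n% tiny stand-in, not a real document\n"
encoded = base64.b64encode(original).decode("ascii")
wrapped = "\n".join(encoded[i:i + 20] for i in range(0, len(encoded), 20))

pdf_bytes = decode_attachment(wrapped)

# A single misread character (here the classic zero vs. capital O) still
# decodes without error, but the resulting bytes are different.
misread = encoded.replace("0", "O", 1)
corrupted_bytes = decode_attachment(misread)
```

Note that a misread character that is still inside the base64 alphabet decodes without any error, so a successful decode alone says nothing about whether the transcription was right.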
The challenge is basically twofold:
As for my approach, I’m basically just slowly and painstakingly running several OCR tools on small bits at a time, merging the resulting outputs, and doing my best to correct mistakes manually.
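For anyone wondering what "merging the resulting outputs" could look like in code, here is a rough sketch using a per-character majority vote. This is not the exact process described above, just one way to do it, and it assumes the OCR outputs for a line are already aligned to the same length, which real OCR output often is not. The function name `merge_ocr_lines` is hypothetical:

```python
import string
from collections import Counter

# Characters that may legally appear in standard base64 text.
BASE64_ALPHABET = set(string.ascii_letters + string.digits + "+/=")


def merge_ocr_lines(candidates: list[str]) -> tuple[str, list[int]]:
    """Merge several OCR readings of the same line by per-position vote.

    Returns the merged line and the positions that had no strict
    majority, so those can be checked by hand.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    length = len(candidates[0])
    if any(len(c) != length for c in candidates):
        raise ValueError("candidates must all have the same length")

    merged = []
    uncertain = []
    for i, chars in enumerate(zip(*candidates)):
        # Ignore readings that cannot occur in base64 at all.
        counts = Counter(ch for ch in chars if ch in BASE64_ALPHABET)
        if not counts:
            merged.append("?")
            uncertain.append(i)
            continue
        best, votes = counts.most_common(1)[0]
        merged.append(best)
        if votes <= len(chars) / 2:
            uncertain.append(i)  # no strict majority: flag for manual review
    return "".join(merged), uncertain


# Three readings of one line; the third confuses zero with capital O.
readings = ["JVBERi0xLjQK", "JVBERi0xLjQK", "JVBERiOxLjQK"]
line, flagged = merge_ocr_lines(readings)
```

The vote only helps when the tools make different mistakes. If every tool misreads the same character the same way, nothing gets flagged, which is where the manual correction comes in.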
trolololol@lemmy.world 1 day ago
Ah yes, PDF is a clusterfuck where anything is valid, I think, so minimal redundancy.
Text and image formats are way more lenient and are full of redundancies.