Comment on Recreating uncensored Epstein PDFs from raw encoded attachments
hodgepodgin@lemmy.zip 23 hours ago
I tried to leave a comment, but it doesn’t seem to be showing up there.
I’ll just leave it here:
Too tired to look into this, but one suggestion: since the hangup seems to be telling an L apart from a 1, you may need to get into per-pixel measurements. That's probably necessary unless an ML or OCR model is at least 99.5% accurate, because in a document with thousands of ambiguous L's any inaccuracy leaves you guessing among 2^N candidates, which becomes infeasible quickly. Maybe reverse-engineer the font rendering and build an exact replica of the source image to compare against? I trust some talented hacker will nail this in no time.
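Something like this, as a very rough sketch of the per-pixel idea (Python with Pillow; the font path, glyph size, and crop coordinates are all made-up placeholders you'd have to measure from the actual scan):

```python
# Rough sketch: render each candidate glyph the way the source document might
# have, then pick whichever rendering is closest to the cropped character.
# Font, size, and coordinates below are placeholders, not real values.
from PIL import Image, ImageDraw, ImageFont

def render_glyph(char, font, size):
    """Render one character black-on-white at the given canvas size."""
    img = Image.new("L", size, color=255)
    ImageDraw.Draw(img).text((0, 0), char, font=font, fill=0)
    return img

def pixel_distance(a, b):
    """Sum of absolute per-pixel differences between two same-sized grayscale images."""
    return sum(abs(pa - pb) for pa, pb in zip(a.getdata(), b.getdata()))

def classify(glyph_crop, font, candidates="l1IL"):
    """Return (distance, char) for the candidate glyph that best matches the crop."""
    return min(
        (pixel_distance(glyph_crop, render_glyph(c, font, glyph_crop.size)), c)
        for c in candidates
    )

# Hypothetical usage -- you'd have to identify the actual font and crop box:
# font = ImageFont.truetype("Courier.ttf", 40)
# crop = Image.open("page_12.png").convert("L").crop((x0, y0, x1, y1))
# print(classify(crop, font))
```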
RIotingPacifist@lemmy.world 22 hours ago
How big is N though?
Qwaffle_waffle@sh.itjust.works 16 hours ago
64
hodgepodgin@lemmy.zip 6 hours ago
Since there are 78 pages, I'm guessing at least one ambiguity per page? Anyways, it's dreadfully big.
RIotingPacifist@lemmy.world 6 hours ago
2^78 is large, but computers can do an awful lot per second, so if only some of the pages contain attachments you're looking at more like 2^40-2^55, and the low end of that is something you could bruteforce in weeks if you can do millions of attempts a second.
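A toy sketch of what that brute force could look like (Python; the l/1 choice set and the JPEG magic-number check are assumptions about the ambiguity and the attachment type, and a pure-Python loop like this only covers small N, so the real thing would need something much faster plus a better validity test):

```python
# Toy brute force: substitute every combination of "l"/"1" at the ambiguous
# positions and keep candidates whose base64 decodes to something plausible.
# The JPEG magic check is a stand-in for whatever validity test the real
# attachment format allows (checksums, parseability, etc.).
import base64
from itertools import product

def variants(ocr_text, ambiguous_positions, choices=("l", "1")):
    """Yield ocr_text with every combination of choices at the ambiguous positions."""
    chars = list(ocr_text)
    for combo in product(choices, repeat=len(ambiguous_positions)):
        for pos, ch in zip(ambiguous_positions, combo):
            chars[pos] = ch
        yield "".join(chars)

def bruteforce(ocr_text, ambiguous_positions):
    for cand in variants(ocr_text, ambiguous_positions):
        try:
            blob = base64.b64decode(cand, validate=True)
        except ValueError:
            continue  # not even valid base64, skip
        if blob[:3] == b"\xff\xd8\xff":  # JPEG magic bytes (assumed file type)
            yield cand, blob
```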
mEEGal@lemmy.world 19 hours ago
Asking the real questions