hoppolito
@hoppolito@mander.xyz
- Comment on Searching through a bulk of pdf files 5 days ago:
For the OCR process you can probably wrangle up a simple bash pipeline with ocrmypdf and just let it run in the background once until all your PDFs have a text layer.
With that tool it should be doable with something like a simple while loop:
    find . -type f -name '*.pdf' -print0 |
      while IFS= read -r -d '' file; do
        echo "Processing $file ..."
        ocrmypdf "$file" "$file"
        # ocrmypdf "$file" "${file%.pdf}_ocr.pdf"  # if you want a new file instead of overwriting the old
      done
If you need additional languages or other options you’ll have to delve a little deeper into the ocrmypdf documentation but this should be enough duct tape to just whip up a full OCR cycle.
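As a rough sketch of what "additional languages or other options" could look like, here is the same loop wrapped in a function. The function name `ocr_all` and the `OCR_CMD` override are made up for illustration (the override only exists so the loop can be dry-run with a stand-in command). `--skip-text` and `-l` are ocrmypdf options as far as I know, but check `ocrmypdf --help` for your version before trusting this.

```sh
# ocr_all DIR [LANGS]: OCR every PDF below DIR in place.
# ocr_all and OCR_CMD are hypothetical names, not part of ocrmypdf.
ocr_all() {
    dir=${1:-.}
    langs=${2:-eng}   # Tesseract language codes, e.g. eng+deu
    find "$dir" -type f -name '*.pdf' -exec sh -c '
        cmd=$1; langs=$2; shift 2
        for file do
            echo "Processing $file ..."
            # --skip-text leaves pages that already have a text layer alone
            "$cmd" --skip-text -l "$langs" "$file" "$file" ||
                echo "Failed: $file" >&2
        done
    ' sh "${OCR_CMD:-ocrmypdf}" "$langs" {} +
}
```

Called as `ocr_all ~/papers eng+deu`, it should walk the folder once and report any file that could not be processed instead of stopping at the first error.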
- Comment on Searching through a bulk of pdf files 5 days ago:
In case you are already using ripgrep (rg) instead of grep, there is also ripgrep-all (rga) which lets you search through a whole bunch of files like PDFs quickly. And it’s cached, so while the first indexing takes a moment any further search is lightning fast.
It supports a whole truckload of file types (pdf, odt, xlsx, tar.gz, mp4, and so on) but I have mostly used it to quickly search through thousands of research papers. It takes around 5 minutes to index everything for my 4000 PDFs on the first run, then it’s smooth sailing for any further searches from there.
- Comment on How do I determine what a mystery dongle does? 2 weeks ago:
Do you know if this functionality can be turned off? I’ve been stung by the ‘gibberish’ once or twice but never enough to dive into the docs for it :)
- Comment on Being a Mastodon Moderator 4 weeks ago:
On the other hand, I would, as the day-to-day behind-the-scenes work of a community place like this is really interesting to me.
As evidently would, or perhaps already has, Cris_Color@lemmy.world, so I don’t think it’s an unreasonable article hook/assumption to make.
- Comment on My Ultimate Self-hosting Setup 5 weeks ago:
I used the recommended migration tool and it worked okay for many containers, but iirc the docker ones needed one of the security options in their config changed manually, since the tool didn’t convert it properly (maybe the nesting option?).
That may very well have changed in the meantime, or I may simply have made a mistake; that was during my experimentation phase.
Ultimately, I did rebuild my instances from the ground up, since I also switched file systems and wanted to make better use of incus profiles (e.g. one with docker provisioned, one with monitoring and so on), so I couldn’t give you a long-term migration review.
For me that was (relatively) painless by just migrating the docker volumes in place and rebuilding the stacks, of course ymmv.
If you decide on migrating and stumble upon issues don’t hesitate to hit me up - I’m only an amateur but maybe I can still help!
- Comment on My Ultimate Self-hosting Setup 5 weeks ago:
After having my dinky homelab machine on proxmox for a couple years, since the start of the year I am now running basically everything under a clean Debian system using incus and docker on the individual lxc guests.
Incus has completely replaced proxmox for me and it’s so much easier to reason about (for me at least) that I wanted to maybe point your cold hands in that direction too ;)
- Comment on Vintage gaming advertising pictures: a gallery 1 month ago:
Heh this post blew my mind twice in one package: I was definitely one of those that believed it was a real ad. I distinctly remember some discussions about the serialized nature of it or not. So as you said, super well done.
But secondly, the official ad you posted instead has three nipples at once? And one male two female on top? That almost seems weirder to me.
- Comment on Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not 1 month ago:
One point I would refute here is determinism. AI models are, by default, deterministic. They are made from deterministic parts and “any combination of deterministic components will result in a deterministic system”. Randomness has to be externally injected into e.g. current LLMs to produce ‘non-deterministic’ output.
There is the notable exception of newer models like ChatGPT4, which seemingly produce non-deterministic outputs (i.e. give one the same sentence and it produces different outputs even with its temperature set to 0), but my understanding is that this is due to floating-point inaccuracies which lead to different token selection. That would make it a function of our current processor architectures and not something inherent in the model itself.
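To make the floating-point point concrete, here is a tiny sketch using awk (which computes in double precision). It only shows the general effect that adding the same numbers in a different order can give slightly different results; it says nothing about how any particular model or GPU actually sums things, and the function name `fp_sum_orders` is made up for this example.

```sh
# fp_sum_orders: add the same three numbers with two different groupings
# and report whether the results are bit-for-bit equal.
fp_sum_orders() {
    awk 'BEGIN {
        left  = (0.1 + 0.2) + 0.3
        right = 0.1 + (0.2 + 0.3)
        printf "%.17g\n%.17g\n", left, right
        verdict = (left == right) ? "equal" : "different"
        print verdict
    }'
}
```

Running `fp_sum_orders` prints both sums and then `different` as the last line. If hardware that sums in parallel does not fix the order of additions, two nearly tied token scores could plausibly flip even at temperature 0.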
- Comment on Judge Rules Training AI on Authors' Books Is Legal But Pirating Them Is Not 1 month ago:
I am not sure what your contention, or gotcha, is with the comment above but they are quite correct. And additionally chose quite an apt example with video compression since in most ways current ‘AI’ effectively functions as a compression algorithm, just for our language corpora instead of video.
- Comment on Searching advice for selfhosting critical data 3 months ago:
I took a look at the BuyVM offer you mentioned since it sounds really good, but am I understanding correctly that to actually use the 1TB storage offer I would also have to order a dedicated VM with them? (i.e. no mounting from a vps with a different provider)
- Comment on Organic Maps migrates to Forgejo due to GitHub account blocked by Microsoft. 4 months ago:
Forgejo is in fact working on being decentralized, just like the underlying git structure is. There are some first federation things in there, but the full implementation is still pretty far out.