Algorithm based on LLMs doubles lossless data compression rates


Submitted ⁨⁨1⁩ ⁨year⁩ ago⁩ by ⁨NoSpotOfGround@lemmy.world⁩ to ⁨technology@lemmy.world⁩

https://techxplore.com/news/2025-05-algorithm-based-llms-lossless-compression.html


Comments


tekato@lemmy.world ⁨1⁩ ⁨year⁩ ago
Interesting how they forgot to go over the architecture for LMDecompress.

andallthat@lemmy.world ⁨1⁩ ⁨year⁩ ago
I tried reading the paper. There is a free preprint version on arXiv. The article linked by OP also links, at the end, to the code they used and the data they tried compressing.

While most of the theory is above my head, the basic intuition is that compression improves if you have some level of “understanding” or higher-level context of the data you are compressing. And LLMs are generally better at doing that than numeric algorithms.

As an example, if you recognize a sequence of letters as the first chapter of the book Moby-Dick, you’ll probably transmit that information more efficiently than a compression algorithm. “The first chapter of Moby-Dick”; there … I just did it.
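
The intuition can be put as a formula. This is the standard Shannon code-length bound, not something taken from the paper, and the symbols (x_i for the i-th symbol, p for the model's predicted probability) are just notation assumed here. A compressor driven by a predictive model spends roughly

```latex
L(x_1, \dots, x_n) \approx -\sum_{i=1}^{n} \log_2 p(x_i \mid x_1, \dots, x_{i-1}) \ \text{bits}
```

So a symbol the model predicts with probability 0.9 costs about 0.15 bits, while one it gives probability 1/256 costs 8 bits. Better prediction means fewer bits.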

skip0110@lemm.ee ⁨1⁩ ⁨year⁩ ago
This is not new knowledge and predates the current LLM fad.

See the Hutter prize which has had “machine learning” based compressors leading the ranking for some time: prize.hutter1.net

It’s important to note that, when applied to compressors, the model does produce a code (aka encoding) that exactly reproduces the input. But on a different input the same model is unlikely to produce an impressive compression.

futatorius@lemm.ee ⁨1⁩ ⁨year⁩ ago
Where I work, we’ve been looking into data compression that’s optimized by an ML system. We have a shit-ton of parameters, and the ML algorithm compares the number of sig figs in each parameter to its byte size, and truncates where that doesn’t cause any loss of fidelity. So far it looks promising, with a really good compression factor, but we still need to do more work on de-skilling the decompression at the receiving end.
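
To make "truncate where that doesn’t cause any loss of fidelity" concrete, here is a minimal sketch. It is not the system described above (there is no ML in it); it only assumes that "fidelity" means a value must round-trip exactly, and the names `narrowest_format`, `pack_params` and `unpack_params` are made up for illustration.

```python
import struct

# Candidate IEEE 754 widths, narrowest first: half, single, double precision.
FORMATS = [("e", 2), ("f", 4), ("d", 8)]

def narrowest_format(value: float) -> str:
    """Return the smallest struct format that stores `value` exactly."""
    for code, _size in FORMATS:
        try:
            packed = struct.pack("<" + code, value)
        except OverflowError:
            continue  # value is out of range for this width
        if struct.unpack("<" + code, packed)[0] == value:
            return code
    return "d"  # fallback (e.g. NaN never compares equal)

def pack_params(values):
    """Hypothetical container: one tag byte per value, then the value."""
    out = bytearray()
    for v in values:
        code = narrowest_format(v)
        out += code.encode() + struct.pack("<" + code, v)
    return bytes(out)

def unpack_params(blob):
    sizes = dict(FORMATS)
    values, i = [], 0
    while i < len(blob):
        code = chr(blob[i])
        values.append(struct.unpack_from("<" + code, blob, i + 1)[0])
        i += 1 + sizes[code]
    return values

params = [0.5, 1.0, 3.25, 0.1, 1e10]
blob = pack_params(params)
assert unpack_params(blob) == params  # nothing was lost
print(8 * len(params), "bytes as doubles ->", len(blob), "bytes packed")
```

The tag byte is the cheap way to keep the decoder simple: the receiving end only needs to read a tag and a fixed-size value, with no knowledge of how the widths were chosen.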

I wouldn’t have thought LLM was the right technology to use for something like this.

fluxion@lemmy.world ⁨1⁩ ⁨year⁩ ago
Middle-LLM compression

- sherlock@feddit.nu ⁨1⁩ ⁨year⁩ ago
  We’re heading for a 5.2 Weissman score
  
deur@feddit.nl ⁨1⁩ ⁨year⁩ ago
This is just a more complex version of shared dictionary compression which I think one of the web compression algorithms does. Stupid LLM fuckers at it again with dumb garbage.
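
For comparison, plain shared dictionary compression can be sketched with zlib's preset dictionary (`zdict`) support in Python. The sample strings and the helper names `compress_with_dict` and `decompress_with_dict` are assumptions for illustration, and this says nothing about how the paper's method works internally.

```python
import zlib

def compress_with_dict(data: bytes, shared: bytes) -> bytes:
    # Both sides must already hold `shared`; it is never transmitted.
    c = zlib.compressobj(level=9, zdict=shared)
    return c.compress(data) + c.flush()

def decompress_with_dict(blob: bytes, shared: bytes) -> bytes:
    d = zlib.decompressobj(zdict=shared)
    return d.decompress(blob) + d.flush()

# Hypothetical shared dictionary and a message that overlaps with it.
shared = (b"Call me Ishmael. Some years ago, never mind how long precisely, "
          b"having little or no money in my purse")
message = b"Call me Ishmael. Some years ago, having little or no money"

plain = zlib.compress(message, 9)
primed = compress_with_dict(message, shared)

assert decompress_with_dict(primed, shared) == message  # still lossless
print(len(message), "raw,", len(plain), "zlib,", len(primed), "zlib + shared dict")
```

The gain only shows up when the message resembles the dictionary; unrelated input compresses no better than plain zlib.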

Harlehatschi@lemmy.ml ⁨1⁩ ⁨year⁩ ago
Ok so the article is very vague about what’s actually done. But as I understand it the “understood content” is transmitted and the original data reconstructed from that.

If that’s the case I’m highly skeptical about the “losslessness” or that the output is exactly the input.

But there are more things to consider, like de-/compression speed and compatibility. I would guess it’s pretty hard to reconstruct data with a different LLM or even a newer version of the same one, so you have to make sure that when you decompress your data years later, you do it with a compatible LLM.

And when it comes to speed I doubt it’s nearly as fast as using zlib (which is neither the fastest nor the best compressing…).

And all that for a high risk of bricked data.

- modeler@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I’m guessing that exactly the same LLM model is used (somehow) on both sides - using different models or different weights would not work at all.
  
  An LLM is (at core) an algorithm that takes a bunch of text as input and produces as output a list of word/probability pairs whose probabilities sum to 1.0. You could place a wrapper on this that sorts the list of words by probability, most likely first. A specific word can then be identified by its index in the list, i.e. first word, tenth word etc.
  
  (Technically the system uses ‘tokens’ which represent either whole words or parts of words, but that’s not important here).
  
  A document can be compressed by feeding in each word in turn, creating the list in the LLM, and searching for the new word in the list. If the LLM is good, the output will be a stream of small integers. If the LLM is a perfect predictor, the next word will always be the top of the list, i.e. a 1. A bad prediction will be a relatively large number in the thousands or millions.
  
  Streams of small numbers are very well (even optimally) compressed using extant technology.
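
The scheme described above can be sketched in a few lines of Python. Big caveats: `ToyPredictor` is a made-up stand-in for the LLM (it only counts which byte tends to follow which), it works on bytes instead of tokens, ranks start at 0 instead of 1, and zlib plays the part of the "extant technology". It is a guess at the general idea, not the paper's algorithm.

```python
import zlib
from collections import defaultdict

class ToyPredictor:
    """Stand-in for the LLM: ranks all 256 possible next bytes, likeliest first."""

    def __init__(self):
        # counts[prev][nxt] = how often byte `nxt` has followed byte `prev` so far
        self.counts = defaultdict(lambda: [0] * 256)

    def ranking(self, prev):
        c = self.counts[prev]
        # Descending count; ties broken by byte value so both sides agree exactly.
        return sorted(range(256), key=lambda s: (-c[s], s))

    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1

def compress(data: bytes) -> bytes:
    model, prev, ranks = ToyPredictor(), 0, bytearray()
    for b in data:
        ranks.append(model.ranking(prev).index(b))  # position of the real byte
        model.update(prev, b)
        prev = b
    return zlib.compress(bytes(ranks), 9)  # small numbers compress well

def decompress(blob: bytes) -> bytes:
    model, prev, out = ToyPredictor(), 0, bytearray()
    for r in zlib.decompress(blob):
        b = model.ranking(prev)[r]  # same model state, so same ranking
        out.append(b)
        model.update(prev, b)
        prev = b
    return bytes(out)

text = b"the cat sat on the mat. " * 40
blob = compress(text)
assert decompress(blob) == text  # bit-exact round trip
print(len(text), "bytes ->", len(blob), "bytes")
```

The round trip is exact because the decompressor replays the same predictions in the same order; the model never generates content on its own, it only reorders the candidates.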
  
- barsoap@lemm.ee ⁨1⁩ ⁨year⁩ ago
  
  “I would guess it’s pretty hard to reconstruct data with a different LLM”
  
  I think the idea is to have compressor and decompressor use the exact same neural network. Looks like arithmetic coding with a learned function.
  
  But yes model size is probably going to be an issue.
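
A rough sketch of what "arithmetic coding with a learned function" means, using exact fractions so there are no rounding worries. The `AdaptiveModel` here is a trivial letter-frequency counter standing in for the neural network, and the alphabet and function names are assumptions for illustration, not the paper's design.

```python
from fractions import Fraction
from math import ceil

ALPHABET = sorted("abcdefghijklmnopqrstuvwxyz .")

class AdaptiveModel:
    """Stand-in for the learned function: adaptive symbol frequencies."""

    def __init__(self):
        self.counts = {s: 1 for s in ALPHABET}

    def intervals(self):
        # Map each symbol to (start, probability) within [0, 1).
        total, cum, out = sum(self.counts.values()), 0, {}
        for s in ALPHABET:
            out[s] = (Fraction(cum, total), Fraction(self.counts[s], total))
            cum += self.counts[s]
        return out

    def update(self, s):
        self.counts[s] += 1

def encode(msg):
    model, low, width = AdaptiveModel(), Fraction(0), Fraction(1)
    for s in msg:
        start, p = model.intervals()[s]
        low, width = low + width * start, width * p  # narrow the interval
        model.update(s)
    k = 0
    while Fraction(1, 2 ** k) > width:  # enough bits to land inside the interval
        k += 1
    return ceil(low * 2 ** k), k  # k-bit integer naming a point in the interval

def decode(code, k, n):
    model, x = AdaptiveModel(), Fraction(code, 2 ** k)
    low, width, out = Fraction(0), Fraction(1), []
    for _ in range(n):
        for s, (start, p) in model.intervals().items():
            lo = low + width * start
            if lo <= x < lo + width * p:  # which sub-interval holds the point?
                out.append(s)
                low, width = lo, width * p
                model.update(s)
                break
    return "".join(out)

msg = "the cat sat on the mat."
code, k = encode(msg)
assert decode(code, k, len(msg)) == msg
print(8 * len(msg), "bits raw ->", k, "bits coded")
```

The decoder only works because it rebuilds exactly the same model state as the encoder at every step, which is why the two sides need the exact same network.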
  
  - Harlehatschi@lemmy.ml ⁨1⁩ ⁨year⁩ ago
    Ye but that would limit the use cases to very few. Most of the time you compress data to either transfer it to a different system or to store it for some time, in both cases you wouldn’t want to be limited to the exact same LLM. Which leaves us with almost no use case.
    
    I mean… cool research… kinda… but pretty useless.
    
Alphane_Moon@lemmy.world ⁨1⁩ ⁨year⁩ ago
I found the article to be rather confusing.

One thing to point out is that the video codec used in this research (but for which results weren’t published for some reason), H264, is not at all state of the art.

H265 is far newer and they are already working on H266. There are also other much higher quality codecs such as AV1. For what it’s worth, they do reference H265, but I don’t have access to the source research paper so it’s difficult to say what they are comparing against.

The performance relative to FLAC is interesting though.

- InvertedParallax@lemm.ee ⁨1⁩ ⁨year⁩ ago
  VVC is H266; the spec is ready, it’s just not in a lot of hardware, or even decent software, yet. That often takes a few years.
  
  AV1 isn’t much better than HEVC (H265), it’s just open and patent free and Google is pushing it like crazy.
  
  It has, iirc, 1 major feature over HEVC, non-square subpictures; beyond that it has some extensions for animation and slideshows, basically.
  
- paraphrand@lemmy.world ⁨1⁩ ⁨year⁩ ago
  I wonder what the practical reasons for starting with h.264 are.
  
  - entropicdrift@lemmy.sdf.org ⁨1⁩ ⁨year⁩ ago
    Low/no patent issues, much lower complexity
    
besselj@lemmy.ca ⁨1⁩ ⁨year⁩ ago
So if I have two machines running the same local LLM and I pass a prompt between them, I’ve achieved data compression by transmitting the prompt rather than the LLM’s expected response to the prompt? That’s what I’m understanding from the article.

Neat idea, but what if you want to transmit some information that an LLM can’t generate accurately?

- taladar@sh.itjust.works ⁨1⁩ ⁨year⁩ ago
  And how do I get the prompt that will reliably generate the data from the data? Usually for compression we do not start from an already compressed version.
  
AbouBenAdhem@lemmy.world ⁨1⁩ ⁨year⁩ ago

“The basic idea behind the researchers’ data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit any data, but can simply generate what the user wants them to transmit on the other end”

Great… but if that’s the case, maybe we should re-think whether we need to transmit that data in the first place.

dotdi@lemmy.world ⁨1⁩ ⁨year⁩ ago
Can’t wait to find hallucinated data in your uncompressed files.
