Comment on Implementing a spellchecker on 64 kB of RAM back in the 1970s led to a compression algorithm that's technically unbeaten and part of it is still in use today

NoSpotOfGround@lemmy.world ⁨4⁩ ⁨days⁩ ago

The real meat of the story is in the referenced blog post: blog.codingconfessions.com/…/how-unix-spell-ran-i…:

TL;DR

If you’re short on time, here’s the key engineering story:

  • McIlroy’s first innovation was a clever linguistics-based stemming algorithm that reduced the dictionary to just 25,000 words while improving accuracy.
  • For fast lookups, he initially used a Bloom filter—perhaps one of its first production uses. Interestingly, Dennis Ritchie provided the implementation. They tuned it to have such a low false positive rate that they could skip actual dictionary lookups.
  • When the dictionary grew to 30,000 words, the Bloom filter approach became impractical, leading to innovative hash compression techniques.
  • They computed that 27-bit hash codes would keep collision probability acceptably low, but needed compression.
  • McIlroy’s solution was to store differences between sorted hash codes, after discovering these differences followed a geometric distribution.
* Using Golomb's code, a compression scheme designed for geometric distributions, he achieved 13.60 bits per word—remarkably close to the theoretical minimum of 13.57 bits.
  • Finally, he partitioned the compressed data to speed up lookups, trading a small memory increase (final size ~14 bits per word) for significantly faster performance.

source
Sort:hotnewtop