Thank you
NoSpotOfGround@lemmy.world 4 days ago
The real meat of the story is in the referenced blog post: blog.codingconfessions.com/…/how-unix-spell-ran-i…:
TL;DR
If you’re short on time, here’s the key engineering story:
- McIlroy’s first innovation was a clever linguistics-based stemming algorithm that reduced the dictionary to just 25,000 words while improving accuracy.
- For fast lookups, he initially used a Bloom filter—perhaps one of its first production uses. Interestingly, Dennis Ritchie provided the implementation. They tuned it to have such a low false positive rate that they could skip actual dictionary lookups.
- When the dictionary grew to 30,000 words, the Bloom filter approach became impractical, leading to innovative hash compression techniques.
- They computed that 27-bit hash codes would keep collision probability acceptably low, but needed compression.
- McIlroy’s solution was to store differences between sorted hash codes, after discovering these differences followed a geometric distribution.
* Using Golomb's code, a compression scheme designed for geometric distributions, he achieved 13.60 bits per word—remarkably close to the theoretical minimum of 13.57 bits.
- Finally, he partitioned the compressed data to speed up lookups, trading a small memory increase (final size ~14 bits per word) for significantly faster performance.
db2@lemmy.world 4 days ago
potate@lemmy.ca 4 days ago
The blog post is an incredible read.
ch00f@lemmy.world 4 days ago
For anyone struggling, lemmy web interface added the colon into the URL for the blog post link. Here’s a clickable version without the colon:
blog.codingconfessions.com/…/how-unix-spell-ran-i…
0x0@programming.dev 4 days ago
here’s another
NoSpotOfGround@lemmy.world 3 days ago
Thanks, and sorry about that! I removed the colon from near my URL now, just in case.