Comment on I'm gonna die on this hill or die trying
Redjard@lemmy.dbzer0.com 3 weeks agoI’d expect tokenizers to include spaces in tokens. You get words constructed from multiple tokens, so can’t really insert spaces based on them. And too much information doesn’t work well when spaces are stripped.
In my tests plenty of llms are also capable of seeing and using double spaces when accessed with the right interface.
CodeInvasion@sh.itjust.works 3 weeks ago
The tokenizer is capable of decoding spaceless tokens into compound words following a set of rules referred to as a grammar in Natural Language Processing (NLP). I do LLM research and have spent an uncomfortable amount of time staring at the encoded outputs of most tokenizers when debugging. Normally spaces are not included.
There is of course a token for spaces in special circumstances, but I don’t know exactly how each tokenizer implements those spaces. So it does make sense that some models would be capable of the behavior you find in your tests, but that appears to be an emergent behavior, which is very interesting to see it work successfully.
I intended for my original comment to convey the idea that it’s not surprising that LLMs might fail at following the instructions to include spaces since it normally doesn’t see spaces except in special circumstances. Similar to how it’s unsurprising that LLMs are bad at numerical operations because of how the use Markov Chain probability to each next token, one at a time.
Redjard@lemmy.dbzer0.com 3 weeks ago
Yeah, I would expect it to be hard, similar to asking an llm to substitiute all letters e with an a. Which I’m sure they struggle with but manage to perform it too.
In this context though it’s a bit misleading explaining the observed behavior of op with that though, since it implies it is due to that fundamental nature of llms when in practice all models I have tested fundamentally had the ability.
It does seem that llms simply don’t use double spaces (or I have not noticed them doing it anywhere yet), but if you trained or just systemprompted them differently they could easily start to. So it isn’t a very stable method for non-ai identification.