Comment on I'm gonna die on this hill or die trying

<- View Parent
CodeInvasion@sh.itjust.works ⁨1⁩ ⁨day⁩ ago

This is because spaces typically are encoded by model tokenizers.

In many cases it would be redundant to show spaces, so tokenizers collapse them down to no spaces at all. Instead the model reads tokens as if the spaces never existed.

For example it might output: thequickbrownfoxjumpsoverthelazydog

Except it would actually be a list of numbers like: [1, 256, 6273, 7836, 1922, 2244, 3245, 256, 6734, 1176, 2]

Then the tokenizer decodes this and adds the spaces because they are assumed to be there. The tokenizer has no knowledge of your request, and the model output typically does not include spaces, hencr your output sentence will not have double spaces.

source
Sort:hotnewtop