This already happens intrinsically in the models. The tokens are abstracted in the internal layers and only translated in the output layer back to next token prediction. Training visual models is slightly different because you’re not outputting tokens but pixel values (or possibly bounding boxes or edges, but not usually; conversely if not generative you may be predicting labels which could theoretically be in token space).
The field itself is actually fairly stagnant in architecture. It’s still just attention layers all the way down. It’s just adding more context length and more layers and wider layers while training on more data. I personally think this approach will never achieve AGI or anything like it. It will get better at perfectly reciting its training data, but I don’t expect truly emergent phenomena to occur with these architectures just because they’re very big. They’ll be decent chatbots, but we already have that, and they’ll just consumer ever more resources for vanishingly small improvements (and won’t functionally improve any true logical capability beyond regurgitating logical paths already trodden in their training data but in a very brittle way, because they do not actually understand the logic or why the logic is valid, they have no true state model of objects which are described in the token space they’re traversing probabilistically).
in_my_honest_opinion@piefed.social 18 hours ago
I was unclear I guess, I was talking about injecting other models, running their prediction pipeline for the specific topic, and then dropped out of the window to be replaced by another expert. This functionality handled by a larger model that is running the context window. Not nested models, but interchangeable ones dependent on the vector of the tokens.
Currently this is handled with MCP servers and as I understand it those use natural language.