Improvements are made all the time. You can’t feed a very large SVM the same data as a transformer network and expect it to perform as well. Transformers are used because they can more easily learn complicated patterns with less data.
I think I’ve read somewhere that a neural network with only one hidden layer can theoretically approximate any continuous function if the hidden layer is large enough (the universal approximation theorem), but it would take an incredible amount of data to get there, so it’s not practical.
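To make the one-hidden-layer point concrete, here is a rough sketch in plain Python. It doesn't train anything: it builds the weights of a single-hidden-layer ReLU network by hand so that it interpolates sin(x), and shows the error shrinking as the hidden layer gets wider. The function names (`build_relu_net`, `predict`, `max_error`) and the choice of sin(x) are mine, purely for illustration, and it says nothing about how much data real training would need.

```python
import math

def build_relu_net(f, lo, hi, width):
    """Hand-construct a one-hidden-layer ReLU net that interpolates f on [lo, hi].

    Returns (bias, knots, coeffs): hidden unit i computes relu(x - knots[i])
    and contributes coeffs[i] times that to the output.
    """
    step = (hi - lo) / width
    knots = [lo + i * step for i in range(width)]
    values = [f(lo + i * step) for i in range(width + 1)]
    slopes = [(values[i + 1] - values[i]) / step for i in range(width)]
    # Each unit adds the *change* in slope at its knot.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, width)]
    return values[0], knots, coeffs

def predict(net, x):
    """Forward pass: bias plus the weighted sum of ReLU activations."""
    bias, knots, coeffs = net
    return bias + sum(c * max(0.0, x - k) for k, c in zip(knots, coeffs))

def max_error(f, net, lo, hi, samples=2000):
    """Largest absolute error on an evenly spaced grid over [lo, hi]."""
    grid = (lo + (hi - lo) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - predict(net, x)) for x in grid)

if __name__ == "__main__":
    lo, hi = 0.0, 2 * math.pi
    for width in (4, 16, 64):
        net = build_relu_net(math.sin, lo, hi, width)
        print(width, max_error(math.sin, net, lo, hi))
```

The error keeps dropping as `width` grows, which is the "large enough hidden layer" part of the claim. Getting gradient descent to find weights like these from samples is a separate problem.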
Over time other models will be discovered that can make better use of the training data.
barsoap@lemm.ee 6 months ago
The paper isn’t about parameter size but the need for exponentially more training data to get a mere linear increase in output performance.
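As a toy illustration of what "exponentially more data for a linear gain" means: if you assume the score grows like the logarithm of the dataset size, every extra point of score costs a constant multiple of the data. The log-base-10 curve and the names `performance` and `samples_needed` are assumptions for the sake of the example, not numbers from the paper.

```python
import math

def performance(n_samples, base=10.0):
    """Hypothetical score that grows logarithmically with dataset size."""
    return math.log(n_samples, base)

def samples_needed(target_score, base=10.0):
    """Invert the toy curve: how many samples reach a given score."""
    return base ** target_score

if __name__ == "__main__":
    # Each +1 in score multiplies the required data by `base`.
    for score in range(1, 6):
        print(score, int(samples_needed(score)))
```

Under that assumption, going from a score of 2 to 3 takes ten times the data, and going from 3 to 4 takes ten times that again.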