Korrelan, if adding more data didn't make a brain more intelligent, then I could literally stop researching how to build AGI tomorrow, because I wouldn't need any new information. Every day, my own brain's goal is to learn more, and in doing so I become smarter at: 1) finding new data, 2) deciding how to look at that data, and 3) choosing what to learn next. In doing so I come closer to being able to stumble on the paragraph explaining every detail of how the brain works. I update my research domains every day, specializing in / exploiting where I will explore next.
The problems with GPT-2 are many, and they all share the same trait. You can learn and learn all the data you want, updating your weights until you've seen "eat" follow "dogs" 88,100 times and "run" follow "dogs" 88,050 times, learning that dogs eat just a bit more often than they run. But the search space is enormous and you'll almost never see the same sentence twice, because we don't even use the whole space, only quantized points in it. So instead of storing every phrase similar to "my cats eat food", I store only that one. At some point adding more data stops improving the model: you quickly learn all the low-level features like a, b, c, th, sh, ed, ion, and then any larger feature you learn is not shared as much and not as useful. The smaller features are just as powerful and are learnt much faster, and they're all you need; you don't need to store every possible 40-word phrase, only a model of physics! The issue with GPT-2/3 is that it doesn't know how to do that trick... fully. It does build a model, and it does recognize "my cat ate" as similar to "her dog ran" to some degree (maybe a bit less than it could, or too much), but it's not fully digging up the information hidden in the data. So it's stuck with the ever-growing Trie Tree I mentioned, stuck close to the root, and adding more data doesn't help the Trie Tree either lol.
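To make the Trie Tree point concrete, here is a minimal sketch of exact-match next-word counting in a word-level trie. It's my own illustration of the idea, not GPT-2's mechanism; the names (TrieNode, NGramTrie, predict) are made up for this example.

```python
# Minimal sketch of the exact-match Trie Tree idea: store every observed
# context prefix with counts of the word that followed it, then predict
# by walking the prefix. All class/function names here are hypothetical.
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.children = {}                    # previous word -> deeper TrieNode
        self.next_counts = defaultdict(int)   # next word -> times it followed this context

class NGramTrie:
    def __init__(self, order=3):
        self.root = TrieNode()
        self.order = order                    # max context length stored

    def train(self, words):
        for i in range(len(words) - 1):
            nxt = words[i + 1]
            node = self.root
            node.next_counts[nxt] += 1        # context-free count at the root
            for back in range(self.order):    # descend on the reversed context
                j = i - back
                if j < 0:
                    break
                w = words[j]
                if w not in node.children:
                    node.children[w] = TrieNode()
                node = node.children[w]
                node.next_counts[nxt] += 1

    def predict(self, context):
        # walk as deep as the observed data allows, then rank next words
        node = self.root
        for w in reversed(context[-self.order:]):
            if w not in node.children:
                break
            node = node.children[w]
        return sorted(node.next_counts.items(), key=lambda kv: -kv[1])

trie = NGramTrie(order=3)
trie.train("my cats eat food . her dog ran home . dogs eat food".split())
print(trie.predict(["dogs"]))   # -> [('eat', 1)], only what was literally seen
```

The thing to notice is that almost every new sentence adds brand-new branches near the leaves that will rarely be revisited; only the shallow nodes near the root keep accumulating counts. That's the "stuck closer to the root" problem, and more data just makes the tree wider, not smarter.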
What GPT-2 needs to do is see more information in the same-sized dataset, better and smarter. Do you understand what I mean, korrelan? Does anyone here? Semantics lets you do that: you use "cats" when prompted with "dogs _?_" to help prediction, because the two words share contexts/predictions (see the sketch below). Without that trick, GPT-2 needs a LOT more data, so much that it's nearly impossible to give a number; getting more out of the same data requires tricks like Semantics, and many others no one talks about, instead of storing a monster Trie Tree that has "seen all" 40-word sentences many times over.
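Here is a toy sketch of what I mean by Semantics helping prediction: words that share next-word contexts (like "cats" and "dogs") borrow each other's counts. The tiny counts table, the cosine weighting, and the function names are all made up for illustration, not a claim about how GPT-2 actually does it.

```python
# Toy sketch of the "Semantics" trick: blend a word's own next-word counts
# with counts from words that share its contexts. All numbers are invented.
from collections import Counter

# hand-made next-word counts observed for each word
next_counts = {
    "cats": Counter({"eat": 5, "run": 3, "sleep": 4}),
    "dogs": Counter({"eat": 2, "run": 2}),
    "cars": Counter({"drive": 6, "stop": 3}),
}

def cosine(a, b):
    # cosine similarity over next-word count vectors = how much context they share
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def blended_prediction(word):
    # mix the word's own counts with similar words' counts,
    # weighted by context overlap
    blend = Counter()
    for other, counts in next_counts.items():
        w = 1.0 if other == word else cosine(next_counts[word], counts)
        for nxt, c in counts.items():
            blend[nxt] += w * c
    return blend.most_common()

print(blended_prediction("dogs"))
# "sleep" now gets some weight after "dogs" even though "dogs sleep" was
# never seen, because cats and dogs share contexts; "cars" contributes nothing.
```

That's the whole point: the exact-match Trie Tree can only predict what it literally saw, while sharing across semantically similar words digs extra information out of the same dataset.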
@Ivan, and maybe America can bounce back like that, like a seed: the people still know how to do what they do and still have the main machine tools.