Ok I think I got it finally! Below is my last note, and below that is my final note from tonight, which clarifies that it was more random trying than actual intelligence lol. It makes complete, total sense! Or, prove me wrong.
Prior understanding:
Who here can explain the following to a 5-year-old kid, or to an old mom, in clear English?
https://openai.com/blog/emergent-tool-use/
I.e. how well do you understand it? How many people can you recruit into AGI?
How well can you summarize it (in 100 words? And 200 words? And 500 words?) with no filler content, only the main points?
Can you make it seem boring, like cake?
Who here can draw an intuitive viz so mom can understand it?
&
Here's my go at it.
1) There is a simulated world with 2 teams: hiders (blue) and seekers (red).
2) Looking at their paper, each little agent can move forward/backward, rotate, grab a nearby object, or lock an object in place.
3) They seem to start off by randomly jiggling around, but they learn tricks: general tricks that work in diverse environments. For example, they seem to learn that holding an object is usually better than not, that they are more likely to win when a teammate is nearby, or that making non-sharp turns around walls lets them run faster. They decide when to use or combine these tricks, and usually do so based on seeing the recognizable cues.
4) The red team learns to move towards the blue team members because one of its random jiggles happened to bring a hider into its line of sight. So now the red team has found general tricks for finding the blue team and starts winning by the game's goal criteria. Up to that point there was only random jiggling; then it won a game and learned what works, and when, across diverse environments.
5) At that point the red team cannot optimize much further, but the blue team can. Through more random jiggling, the blue team learns to use the boxes to trap themselves inside a room. This trick can be used in different environments, and learning it could have stemmed from the already-learnt tendency to carry boxes around.
6) This competition continues until it can no longer continue: the total/global optimization of evolution.
7) At the last stage, when OpenAI moves a ramp to the back of the room, one blue hider learns to prepare his friend's box and then goes to get his own box once the other arrives with the ramp, because the time limit before the red team enters the room is so short. Without any sentence/plan generator that reasons out its plans, this must be the result of the brute-force algorithm mentioned above. They do use past tactics to learn new ones, which generalize to new environments, but this was clearly learnt by randomly jiggling around (while still performing learnt tactics, so not completely random jiggling). Btw it's really a small environment, without many objects or much time for long game plans, and multiple plans lead to the same win/solution.
8) So it seems that they learn by randomly trying actions. The actions that get rewarded keep working in diverse scenes, whenever they see similar cues/conditions. This learnt behavior, topped with some remaining random behavior, can result in deeper learnt behavior that wins deeper games (a minimal toy version of this loop is sketched right after this list).
9) They say they use the same kind of algorithms GPT-2 uses: they look at objects instead of words and decide what next action to write into the story plan. If an agent sees a scene like one it has seen before, and knows what it did most frequently in previously won games, it will use that learnt action (the second sketch below shows the rough idea).
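To make point 8 concrete, here is a minimal sketch, in plain numpy, of the "random jiggling gets reinforced" loop. This is not OpenAI's code: the toy corridor environment, the sizes, and the learning rate are all invented for illustration. It only shows how an agent that samples actions at random ends up preferring whatever happened to be part of a winning episode.

```python
import numpy as np

rng = np.random.default_rng(0)

N_STATES = 6          # a tiny 1-D corridor; "winning" = reaching the far end
N_ACTIONS = 2         # 0 = step left, 1 = step right
MAX_STEPS = 20
LR = 0.5

# Tabular policy: one row of action logits per state (softmax gives probabilities).
logits = np.zeros((N_STATES, N_ACTIONS))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def run_episode():
    """Sample actions until the time limit; reward 1 only if the goal is reached."""
    s, trajectory = 0, []
    for _ in range(MAX_STEPS):
        a = rng.choice(N_ACTIONS, p=softmax(logits[s]))   # random jiggle, biased by what worked
        trajectory.append((s, a))
        s = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
        if s == N_STATES - 1:
            return trajectory, 1.0                        # accidental win
    return trajectory, 0.0

for episode in range(500):
    trajectory, reward = run_episode()
    # REINFORCE-style update: only winning episodes change anything, and they
    # make every (state, action) pair that appeared in them a bit more likely.
    for s, a in trajectory:
        grad = -softmax(logits[s])
        grad[a] += 1.0
        logits[s] += LR * reward * grad

print("learned P(step right) per state:",
      np.round([softmax(logits[s])[1] for s in range(N_STATES - 1)], 2))
```

Run it and the "step right" probability climbs towards 1 in every state, purely because right-stepping episodes kept accidentally winning; that's the same story as a seeker stumbling onto "walk towards the hider you can see".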
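And here is a rough numpy sketch of point 9, reading it as attention over the entities the agent can see (which is roughly how the paper describes the policy) rather than over words. The entity features and weights below are random placeholders, and the real policy is trained end to end with RL, so this only shows the data flow, not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16                              # feature size per entity
ACTIONS = ["forward", "backward", "rotate", "grab", "lock"]

# One feature vector per visible entity (self, teammate, boxes, ramp, ...),
# faked here with random numbers.
entities = rng.normal(size=(5, D))

# Random (untrained) weights standing in for the learned ones.
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
W_action = rng.normal(size=(D, len(ACTIONS)))

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# Self-attention: each entity looks at every other entity and mixes in their
# features, weighted by how relevant they look (like words attending to words).
Q, K, V = entities @ Wq, entities @ Wk, entities @ Wv
attn = softmax(Q @ K.T / np.sqrt(D), axis=-1)      # (5, 5) relevance weights
mixed = attn @ V                                   # (5, D) context-aware features

# Pool over entities and score the next action, the way a language model
# scores the next word.
summary = mixed.mean(axis=0)
action_probs = softmax(summary @ W_action)
print(dict(zip(ACTIONS, np.round(action_probs, 2))))
```

With trained weights instead of random ones, the attention weights are where the "if a teammate/box/ramp is nearby, do X" type cues would live.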
The Update:
Ok I got it: they run millions of rounds. At stage 1 they start random and must accidentally do the right action to find the hiders or run from the seekers. By stage 2/3 they use boxes by accident; they must, there are no hints. The only things they could have learned so far are that they won the last game when their friend was seen nearby, or by making sharp corner turns. The ramp use, unless they re-used the other agents' model, is the same thing: it was an accident! The only search-space hint was to be near their buddy... boooo...

Anyway, by the next stage the blues learned to take their ramp inside. This was an accident too, although I assume they supposedly had hints by then: running around with objects, staying near the buddy, the buddy doing the same behavior (actually he didn't, lol, he stayed at base), going back to base, and making sharp turns. By the last stage, the helping out of his friend by preparing his box was, again, an ACCIDENT! Why would he even go up to it? Ok, to carry it, but it required the random behavior of doing that and then dropping it, and that one happened to work; he obviously didn't know it would!

So it is a combo of raw simple RL, competition like GANs do, and some learnt hints of what is what and what has worked, entailing all of that (a toy version of that competition loop is sketched below).
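To close the loop on the "raw simple RL plus competition like GANs do" conclusion, here is a toy self-play sketch: two sides, each trained only against the other's current behavior, take turns improving, and every improvement creates a new problem for the opponent, which is roughly where the blog's "stages" come from. The matching game, spot count and learning rate are all invented for the sketch; it is not the hide-and-seek environment.

```python
import numpy as np

rng = np.random.default_rng(0)

N_SPOTS = 4            # hider picks a spot, seeker picks a spot to check
LR = 0.3
logits_hider = np.zeros(N_SPOTS)
logits_seeker = np.zeros(N_SPOTS)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def play():
    """One round: the seeker wins (reward 1) if it checks the hider's spot."""
    h = rng.choice(N_SPOTS, p=softmax(logits_hider))
    s = rng.choice(N_SPOTS, p=softmax(logits_seeker))
    return h, s, 1.0 if h == s else 0.0

for stage in range(6):
    # Train one side for a while against the frozen opponent, then swap:
    # the same alternating pressure that pushes hiders to forts, then seekers
    # to ramps, then hiders to locking the ramps, and so on.
    train_seeker = (stage % 2 == 0)
    for _ in range(2000):
        h, s, seeker_reward = play()
        if train_seeker:
            grad = -softmax(logits_seeker); grad[s] += 1.0
            logits_seeker += LR * seeker_reward * grad
        else:
            grad = -softmax(logits_hider); grad[h] += 1.0
            logits_hider += LR * (1.0 - seeker_reward) * grad
    print(f"stage {stage}: hider {np.round(softmax(logits_hider), 2)}"
          f"  seeker {np.round(softmax(logits_seeker), 2)}")
```

Each printed stage shows one side piling onto whatever the frozen opponent currently favors, and the next stage shows the other side shifting away again; that's the arms-race flavor of the hide-and-seek stages, minus the physics.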