Demand for discussing training results of a Deep Reinforcement Learning agent


Marco

Hello everybody!

I'm currently messing with Unity's ML-Agents. I built a game simulation called "Basket Catch". Using Unity's implementation of Proximal Policy Optimization (PPO), I managed to train an agent that controls a box to catch rewards and avoid punishments.

This is what one trained behavior looks like:


The agent makes use of 28 inputs: the distances to the left and right ends of its environment, plus its 13 "eyes".

Each eye contributes two inputs, telling the agent whether it sees a punishment or a reward and at what distance.

The output actions are moving left, moving right, and doing nothing. Rewards are signaled for catching a blue box (+1) and a red box (-1).
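To make the input layout concrete, here is a minimal sketch of how those 28 observations could be assembled. The function name, the flag encoding, and the example values are all my assumptions for illustration; this is not Marco's actual code.

```python
# Hypothetical layout of the 28 observations described above.
# Names and encodings are illustrative, not taken from the actual project.

def build_observations(dist_left, dist_right, eyes):
    """eyes: list of 13 (flag, distance) tuples, one per 'eye'.
    flag: +1.0 if the eye sees a reward, -1.0 for a punishment,
    0.0 if it sees nothing (assumed encoding)."""
    obs = [dist_left, dist_right]      # 2 wall-distance inputs
    for flag, distance in eyes:        # 13 eyes x 2 inputs each = 26
        obs.extend([flag, distance])
    return obs                         # 2 + 26 = 28 inputs total

# Example: all eyes see nothing, agent centered in a 10-unit-wide arena.
example = build_observations(5.0, 5.0, [(0.0, 0.0)] * 13)
assert len(example) == 28
```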

Demand for discussion
Via this link you can find all the training sessions I have run so far, which I'd like to discuss in order to shrink the training time or get a better result. If you use TensorBoard, you can find the graph files here.

The most influential hyperparameter so far is the batch size. The smaller the batch size, the longer the training takes, but it scores the best results, with value estimates greater than 5. This was achieved with a batch size of one, but the training took more than 3 hours to converge. The slow training is probably related to the GPU: GPUs are efficient at a few large batches, not at many small ones. So I have to check whether the CPU can do the training faster.
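The batch-size observation above comes down to simple arithmetic: with a fixed experience buffer, a smaller batch means many more sequential gradient updates, each too small to keep a GPU busy. The buffer size below is a made-up example, not the thread's actual training config.

```python
# Illustrative arithmetic only; 2048 is a hypothetical buffer size,
# not the configuration used in the experiments above.

def updates_per_buffer(buffer_size, batch_size):
    # Number of sequential gradient updates needed to consume one buffer.
    return buffer_size // batch_size

assert updates_per_buffer(2048, 1) == 2048   # batch size 1: 2048 tiny updates
assert updates_per_buffer(2048, 256) == 8    # batch size 256: 8 large updates
```

Each tiny update launches its own GPU kernel with almost no parallel work, which is why a CPU can plausibly win at batch size one.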

So far I have trained with 1, 9 and 18 agents. It doesn't look like this makes a difference in the training progress. All agents cumulatively train one brain. Also, the CPU and GPU are far from fully utilized, so I'm not quite sure what can be tuned to get the most out of my hardware and increase the training speed. The interface between the Unity build and Python is based on sockets.
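"All agents cumulatively train one brain" can be pictured as every agent's experience landing in one shared buffer that a single policy trains on. The class and method names below are illustrative, not the actual ML-Agents API.

```python
# Rough sketch of several parallel agents feeding one shared policy buffer.
# Purely illustrative; not how ML-Agents is actually implemented.

class SharedBrain:
    def __init__(self):
        self.buffer = []  # experiences from all agents, pooled together

    def record(self, agent_id, obs, action, reward):
        # agent_id is kept only for bookkeeping; the policy update treats
        # experiences from all agents identically.
        self.buffer.append((agent_id, obs, action, reward))

brain = SharedBrain()
for agent_id in range(18):            # 18 parallel agents, one step each
    brain.record(agent_id, [0.0] * 28, 0, 0.0)

assert len(brain.buffer) == 18        # 18 agents fill the buffer 18x faster
```

More agents fill the buffer faster per wall-clock step, but the number of gradient updates per collected experience stays the same, which may be why the agent count alone didn't change the training progress.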

Any ideas and questions are welcome. My goal is to shrink the training time down to around 5 minutes, as my example is simpler than Unity's, which uses a continuous action space to balance a ball on a board in 3D.
« Last Edit: November 15, 2017, 06:23:38 pm by Marco »

*

ranch vermin

God, the computers aren't fast enough to do this kind of random development, and you need a new body for every new population member... there's got to be a more optimized way than doing it like this...

*

keghn

Those links are not working on my computer, so I reposted them.

Unity ML - Agents (Beta):
https://github.com/Unity-Technologies/ml-agents 

Proximal Policy Optimization Algorithms, PPO:
https://arxiv.org/abs/1707.06347

Basket Catch training sessions, via this link:
https://docs.google.com/spreadsheets/d/1O-0cE_txWTZISck8NKzIBEJjdUag0FmxFtKvESE67ec/edit#gid=0

Download here:
https://drive.google.com/file/d/1iA1T82JlnnmN1P-fRPpopkfC-QmPajlP/view


*

korrelan

Very cool thread... I'm liking your work, Marco.

Quote
theres got to be a more optimized way than doing it like this

There is. The end result has to be at least as good as a human, preferably better. So the ideal scenario is for a human to play the game and the bot to watch and learn. All humans aren't equal, so get lots of humans to play. The bot then just has to mimic the human performance, throwing in a slight genetic enhancement to see if the performance improves; if it does, keep it... if it doesn't, scrap it.

Unlike the game of Go, this is a motor-skill game.

 :)
It thunk... therefore it is!

*

Marco

Quote
So the ideal scenario is for a human to play the game and the bot to watch and learn.

This is called imitation learning, and it is one of the features Unity is going to release for their ML agents.
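In the spirit of the watch-and-learn idea, here is a minimal imitation-learning sketch: record (observation, action) pairs from human play, then have the bot copy the action of the most similar recorded observation. This nearest-neighbor approach is purely illustrative; Unity's imitation-learning feature will work differently, and all names here are made up.

```python
# Minimal behavioral-cloning sketch via nearest-neighbor lookup.
# Illustrative only; not Unity's imitation-learning implementation.
import math

def record_demo(demos, observation, human_action):
    # Store one (observation, action) pair from a human play session.
    demos.append((observation, human_action))

def mimic(demos, observation):
    # Copy the action whose recorded observation is closest (Euclidean).
    def distance(demo):
        obs, _ = demo
        return math.dist(obs, observation)
    _, action = min(demos, key=distance)
    return action

demos = []
record_demo(demos, [0.2, 0.8], "move_left")   # box was to the left
record_demo(demos, [0.8, 0.2], "move_right")  # box was to the right

assert mimic(demos, [0.25, 0.75]) == "move_left"
```

A real implementation would train a network on the demonstrations instead of doing a lookup, but the principle is the same: supervised learning on human play instead of reward-driven exploration.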

I just uploaded a behavior which probably plays at a superhuman level.






P.S.: Fixed the links.

*

Marco

I still haven't found the perfectly tuned parameters. I'll leave it as it is for now and start on a new environment where an agent controls a cannon to shoot comets. The inputs will be sensory, just like in Basket Catch, but the outputs will be continuous this time.

I've got some more ideas for interesting challenges. Some of them are listed on my GitHub repository.

In the near future I have to come up with an environment which can be related to Digital Fabrication or Industry 4.0.

*

keghn

I like the "sliding window algorithm":

https://www.youtube.com/watch?time_continue=1&v=ShbRCjvB_yQ
http://www.geeksforgeeks.org/window-sliding-technique/

A window could slide over frames of video. Say the window views 7 frames of video, i.e. k = 7 (or any size you like).
The first three frames are fed into a NN. The 4th frame is your outputs and also what is being seen right now. Frame 5 is a prediction of what frame 4 should look like and which outputs are predicted to be used. The remaining frames, 6 and 7, are long-range predictions that evolve into frames 4 and 5.

There are two NNs here.
The NN that looks at frames 1, 2, 3 at the same time is a detector and control-generator NN. Frames 5, 6, and 7 are produced by a generator NN.
If you are going off recorded video, you can compare the predicted frames 6 and 7 against frames 6 and 7 of the pre-recorded video.
But if you are running this in real time, then frame 4 is recorded and passed to frame 3 when the sliding window moves forward.
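The window bookkeeping described above can be sketched as follows. This only shows how a k = 7 window splits a frame stream into context, current frame, and prediction targets; the two NNs themselves are not implemented, and all names are mine.

```python
# Sketch of the k=7 sliding window: frames 1-3 are the detector NN's input,
# frame 4 is the current frame/output, frames 5-7 are the generator NN's
# prediction targets. Window bookkeeping only; no networks involved.

def sliding_windows(frames, k=7):
    windows = []
    for i in range(len(frames) - k + 1):
        window = frames[i:i + k]
        windows.append({
            "context":     window[0:3],  # frames 1-3: detector/control input
            "current":     window[3],    # frame 4: what is seen right now
            "predictions": window[4:7],  # frames 5-7: generator's forecasts
        })
    return windows

ws = sliding_windows(list(range(10)))    # 10 frames -> 4 overlapping windows
assert len(ws) == 4
assert ws[0]["context"] == [0, 1, 2]
assert ws[0]["current"] == 3
assert ws[0]["predictions"] == [4, 5, 6]
```

In the real-time case described above, each step would compare the previous window's predictions against the new window's actual frames before sliding forward.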

MarioFlow: 
https://www.youtube.com/watch?time_continue=340&v=4a-W3Od5-t8



*

Marco

I just started on a new environment:

 

