Ideas/opinions for troubleshooting exploding output values (DQN)

Marco · « **on:** August 08, 2017, 12:25:45 pm »

Hello folks!

While introducing myself, I mentioned that I am working on adding a Deep Reinforcement Learning (DQN) implementation to the library ConvNetSharp. I've been facing one particular issue for days now: during training, the output values grow exponentially until they reach negative or positive infinity. I could use some fresh ideas or opinions on this, to help me find further ways of tracking down the cause.

So here is some information about the project itself. Afterwards, I'll list the troubleshooting steps taken so far.

For C#, there are not many promising neural network libraries out there. I already worked with Encog, which is quite convenient, but it provides neither GPU support nor convolutional neural nets. The alternative, which I have chosen now, is ConvNetSharp. The drawback of that library is the lack of documentation and in-code comments, but it supports CUDA (using managed CUDA). Another alternative would be to implement some interface between C# and Python, but I don't have a good idea for such an approach, e.g. TCP will most likely turn out to be a bottleneck.

The DQN implementation is adapted from ConvNetJS's deepqlearn.js and a former ConvNetSharp port. For testing my implementation, I created a slot machine simulation using 3 reels, which are stopped individually by the agent. The agent receives the current 9 slots as input. The available actions are Wait and Stop Reel. A reward is handed out according to the score as soon as all reels are stopped. The best score is 1. If I use the old ConvNetSharp port with the DQN Demo, the action values (output values of the neural net) stay below 1. The same scenario on my implementation, using the most recent version of ConvNetSharp, shows exponential growth of the outputs during training.

Here is what I checked so far.

Logged inputs, outputs, rewards, experiences (all, except the growing outputs, look fine)
Used former DQN ConvNetSharp Demo for testing the slot machine simulation (well, the agent does not come up with a suitable solution, but the outputs do not explode)
Varied hyperparameters, such as a very low learning rate or big and small training batches
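
One more check that could be added to the logging, sketched here under assumptions that are not from the original post: if every reward satisfies |r| <= maxAbsReward and the discount factor gamma is below 1, the true action values cannot exceed maxAbsReward / (1 - gamma) in magnitude. Flagging the first training step where an output leaves that range helps narrow down where the growth starts. The helper names and the gamma value used in the usage note are hypothetical.

```javascript
// Hypothetical helpers for a DQN sanity check; not part of ConvNetSharp or ConvNetJS.

// One-step Q-learning target: r + gamma * max_a' Q(s', a'),
// or just r when the transition ended the episode.
function tdTarget(reward, nextQValues, gamma, terminal) {
  if (terminal) return reward;
  return reward + gamma * Math.max(...nextQValues);
}

// Upper bound on |Q| when every reward satisfies |r| <= maxAbsReward
// and 0 <= gamma < 1 (geometric series of discounted rewards).
function qBound(maxAbsReward, gamma) {
  return maxAbsReward / (1 - gamma);
}

// True when any network output is non-finite or lies outside the bound.
function outputsLookDiverged(qValues, maxAbsReward, gamma) {
  const bound = qBound(maxAbsReward, gamma);
  return qValues.some(q => !Number.isFinite(q) || Math.abs(q) > bound);
}
```

With a best score of 1 and, say, gamma = 0.9, any output with a magnitude far above 10 would already indicate a problem in the update rather than in the environment.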

There are two components that are unclear to me. The regression layer of ConvNetSharp was introduced recently, and I'm not sure if I'm using the Volume (i.e. Tensor) object as intended by its author. As I'm not familiar with the actual implementation details of neural nets, I cannot figure out if the issue is caused by ConvNetSharp or not. I was in touch with the author of ConvNetSharp a few times, but still wasn't able to make progress on this issue. Parts of this issue are tracked on GitHub.

It would be great if someone has some fresh ideas for getting new insights about the underlying issue.

keghn · « **Reply #1 on:** August 08, 2017, 03:37:53 pm »

Exploding values? Do your neurons use a squashing function?

Marco · « **Reply #2 on:** August 08, 2017, 05:50:43 pm »

I'm not sure if this is desired for regression. To my understanding of the ConvNetSharp Code, there is no squashing function during the training process on the regression layers. Regardless, the values are expected to be in the range 0 to maybe 2. Growing beyond that range for the slot machine is just wrong.

Korrelan · « **Reply #3 on:** August 08, 2017, 07:22:51 pm »

I'm not familiar with this setup, but I would check the Sigmoid function in the new version of ConvNetSharp.

keghn · « **Reply #4 on:** August 08, 2017, 07:46:16 pm »

Tanh will keep the output between -1 and 1, which is about the same thing:

https://www.google.com/imgres?imgurl=http://mathworld.wolfram.com/images/interactive/TanhReal.gif&imgrefurl=http://mathworld.wolfram.com/HyperbolicTangent.html&h=233&w=360&tbnid=L1oXlQOCweP7rM:&tbnh=136&tbnw=211&usg=__uFQAtr9nqUDIA8HsgLc6M6LPlSU=&vet=1&docid=W4RxENI2SKGxAM&client=ubuntu&sa=X&ved=0ahUKEwj-3pSBnsjVAhXCzVQKHc_BDtgQ9QEILDAA

Or could it be a Sigmoid or ReLU function times two?

If, early in a deep neural network, the signal coming out of a neuron gets pinned to its maximum or minimum, then the information is lost to the following layer. If the signal bounces in and out of maximum output or minimal amplification and spends 50 percent of the time there, then the information is only available to the following layer 50 percent of the time. Or, if the signal is not needed at all to begin with, but is pulled within detection and the following layers are using it, then that would not be good?

So I guess you would be normalizing your data between 0 and 2 or -1 and 1?
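
As a small illustration of the squashing and normalization options mentioned above, here is a sketch in plain JavaScript. The function names `sigmoidTimesTwo` and `normalize` are made up for this example; only `Math.tanh` and `Math.exp` are built in.

```javascript
// Logistic sigmoid: output between 0 and 1.
function sigmoid(x) {
  return 1 / (1 + Math.exp(-x));
}

// Sigmoid scaled by two: output between 0 and 2.
function sigmoidTimesTwo(x) {
  return 2 * sigmoid(x);
}

// Min-max normalization of a value from [min, max] into [lo, hi],
// e.g. into [-1, 1] (the tanh range) or [0, 2].
function normalize(value, min, max, lo, hi) {
  if (max === min) throw new Error("min and max must differ");
  return lo + ((value - min) * (hi - lo)) / (max - min);
}

// Math.tanh is built in and keeps its output between -1 and 1.
const squashed = [-5, 0, 5].map(Math.tanh);
```

Note that squashing the output only hides unbounded growth of the pre-activation values; it does not remove the cause of it.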

Marco · « **Reply #5 on:** August 08, 2017, 07:51:42 pm »

This is how the neural net is set up:

Input Layer (9 nodes)
Hidden Layer, ReLU Activation (10 nodes)
Output Layer, Regression (2 nodes)

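The Sigmoid activation function is not used. The loss of the regression layer is computed similarly to this (this is the ConvNetJS version):

// y is a struct: y.dim is the output index to train, y.val its target value
var i = y.dim;
var yi = y.val;
// the gradient is passed only along dimension i
var dy = x.w[i] - yi;
x.dw[i] = dy;
// L2 loss on that single output
loss += 0.5*dy*dy;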

I could use Softmax for the last layer, but that would be for classification and not regression. The inputs are normalized.
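
To make the single-output regression loss concrete, here is a self-contained sketch. Plain arrays stand in for the `w` (values) and `dw` (gradients) of the output volume, and the function name `regressionBackwardSingle` is hypothetical; it is not the ConvNetJS or ConvNetSharp API. In DQN only the output of the action that was actually taken receives a gradient, which is why all other entries are zeroed.

```javascript
// Hypothetical stand-in for the regression layer's backward pass on one dimension.
// w:  network outputs (one value per action)
// dw: gradient buffer of the same length, overwritten by this call
// y:  { dim, val } where dim is the action index and val the training target
function regressionBackwardSingle(w, dw, y) {
  dw.fill(0);                 // no gradient for actions that were not taken
  const dy = w[y.dim] - y.val; // prediction error on the chosen action
  dw[y.dim] = dy;             // derivative of 0.5 * dy^2 with respect to w[y.dim]
  return 0.5 * dy * dy;       // L2 loss
}

// Usage: outputs for Wait and Stop Reel, training the second action towards 1.45.
const outputs = [0.3, 0.8];
const grads = [0, 0];
const loss = regressionBackwardSingle(outputs, grads, { dim: 1, val: 1.45 });
```

If the gradient were written to every output, or accumulated across calls without being reset, the weights would receive far larger updates than intended, which is one way outputs can run away.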

keghn · « **Reply #6 on:** August 08, 2017, 08:08:10 pm »

A slot machine is the worst thing for this example. What "F" is your input data? The input from the last spin? Or data logged from the past hundred?

Marco · « **Reply #7 on:** August 08, 2017, 08:24:46 pm »

Single-armed and contextual bandits have been used before for reinforcement learning tasks.

About the slot machine:
Each reel has 3 slots. Each slot is occupied by an item (Peach, Cherry, ..., Seven). As soon as the slot machine starts, each reel starts spinning (pulling new items from a custom probability distribution). The first stop action stops the left reel, the second stops the middle reel, and of course the last stop action stops the right reel. So this is not a typical single-armed bandit where all reels stop at the same time or automatically. It works like the slot machines seen in the games Digimon World or Pokemon.

Each tick of the main loop updates each reel. Decisions are to be made on each tick (waiting or stopping a reel).

Maybe the slot machine is not the right fit for Deep Reinforcement Learning, but it already reveals issues of the DQN implementation. The old implementation does not suffer from that high value issue, but the newer one does. I could implement the poison and apples example of ConvNetJS for additional tests.

Marco · « **Reply #8 on:** August 14, 2017, 09:33:34 am »

Here is a small update:

The author of ConvNetSharp fixed a major bug in the regression layer, so the outputs no longer grow exponentially.

The next step is to keep testing the implementation. Concerning the slot machine example, I haven't achieved a reasonable result yet. I'm going to try out different reward signals soon. For the Apples&Poison Demo, the performance drops severely as soon as the training starts.

Does anybody know of a good demo for verification? The goal is to successfully train some model within minutes.

keghn · « **Reply #9 on:** August 14, 2017, 02:43:49 pm »

So the operator of this slot machine can stop a spinning wheel or reel when it sees that the desired symbols are showing? Then it can do the same with the two other spinning reels? Or when the first is stopped, do the other reels stop in a chain reaction? What about the order in which wheels are selected? What about the time between selecting wheels?

Marco · « **Reply #10 on:** August 14, 2017, 03:09:29 pm »

With the start of the slot machine, all 3 reels start spinning. The available actions are StopReel and Wait. StopReel stops the first reel, while the other two reels keep spinning. Triggering StopReel again stops the second reel, so only the third one is still spinning. And of course the last reel is stopped by executing StopReel once more.

It's not a typical single-armed bandit which just works on a single probability. The agent looks at the state of the whole slot machine: it can observe all the slots' items and can then decide to stop one reel. After executing one action (either wait or stop), the slot machine is updated, so the spinning reels shift their slots' items down. For each change of the slots, the agent is asked to stop or wait.
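
The mechanics described above can be sketched roughly as follows. This is not the actual simulation code: the names, the item encoding as integers 0 to 5, the order of "stop first, then update", and the scaling of the observation are all assumptions made for illustration. The item source `drawItem` is passed in so that any probability distribution can be plugged in.

```javascript
// Hypothetical sketch of the three-reel environment; all names are illustrative.
const ITEM_COUNT = 6; // assumption: six item kinds, encoded as integers 0..5
const WAIT = 0;
const STOP_REEL = 1;

// Three reels, each showing three slots; `stopped` counts the reels already stopped.
function createMachine(drawItem) {
  const reels = [];
  for (let r = 0; r < 3; r++) {
    reels.push([drawItem(), drawItem(), drawItem()]);
  }
  return { reels, stopped: 0 };
}

// Applies one action, then updates every reel that is still spinning.
// Returns true once all three reels are stopped.
function step(machine, action, drawItem) {
  if (action === STOP_REEL && machine.stopped < 3) {
    machine.stopped++; // reels stop from left to right
  }
  for (let r = machine.stopped; r < 3; r++) {
    machine.reels[r].pop();               // bottom item leaves the reel
    machine.reels[r].unshift(drawItem()); // new item enters at the top
  }
  return machine.stopped === 3;
}

// The 9 slot items as network input, scaled to the range 0..1.
function observe(machine) {
  return [].concat(...machine.reels).map(item => item / (ITEM_COUNT - 1));
}
```

A training loop would call `observe`, let the agent pick WAIT or STOP_REEL, call `step`, and hand out the reward when a reel stops or when `step` returns true.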

For this example I came up with new reward signals. Before, the reward was based on the total outcome - meaning the achieved score after stopping all reels.
Now, each reel provides a reward or punishment. The first reel rewards the agent based on the item (e.g. a 7 is worth 1 and a cherry is worth 0.01). The second reel is all about whether the agent managed to score a match or not. Scoring a match grants a reward of 1, and if it doesn't match, the agent is punished with -0.5. The third reel behaves the same as the second concerning the reward.
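
A minimal sketch of this per-reel reward signal could look like the following. The function name and the way a "match" is decided (the stopped item equals the item of the previously stopped reel) are assumptions; the thread only states the values for 7 and cherry, so the value table contains just those two and would need the remaining items filled in.

```javascript
// Item values for the first reel. Only these two values are given in the thread.
const FIRST_REEL_VALUES = { seven: 1, cherry: 0.01 };

// Hypothetical reward function for stopping reel `reelIndex` (0, 1 or 2) on `item`.
// `previousItem` is the item the previously stopped reel landed on (unused for reel 0).
function reelReward(reelIndex, item, previousItem, firstReelValues) {
  if (reelIndex === 0) {
    if (!(item in firstReelValues)) {
      throw new Error("no first-reel value defined for item: " + item);
    }
    return firstReelValues[item];
  }
  // Second and third reel: +1 for a match, -0.5 otherwise.
  return item === previousItem ? 1 : -0.5;
}
```

With these values the sum of rewards over one episode lies between -1 and 3, which is a useful range to compare the network outputs against.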

Edit:
One more piece of info: the slot machine has 6^9 = 10,077,696 (about 10 million) states.

Marco · « **Reply #11 on:** August 22, 2017, 02:07:45 pm »

Progress is still sparse. I still didn't get a good result for the slot machine, so I switched to the Apples&Poison Demo. Well, this revealed a huge performance problem, because ConvNetSharp seems to be single-threaded (no imports for using threads). The old port of ConvNetJS, which comes with the Apples&Poison Demo, runs much faster, but is single-threaded as well.

As ConvNetSharp features CUDA, I wanted to overcome this problem for now by letting the GPU do the job, but from that point on I have been pushed from one exception to the next. The first one was about setting the project to 64-bit only, and now I'm stuck on a CUDA exception concerning memory allocation.

ConvNetSharp does not come with any documentation or comments in the code. What do you think? Should I stick with ConvNetSharp? The alternative would be to build an interface to Python or to implement neural networks myself. Of course, my own implementation would take much more effort, but it has educational advantages. And I'd approach it using compute shaders in Unity, which wouldn't limit the usage to NVIDIA GPUs.

Zero · « **Reply #12 on:** August 22, 2017, 02:36:18 pm »

Hi,
You listed the pros and cons of implementing NNs yourself, which I understand. But what are the pros of sticking with ConvNetSharp? (I mean, since it doesn't work nicely.)

Marco · « **Reply #13 on:** August 22, 2017, 02:57:16 pm »

Zero · « **Reply #14 on:** August 23, 2017, 08:27:28 pm »

The "GPU exceptions" con overshadows two pros out of four, doesn't it?

Also, don't you think the educational benefits of DIY weigh heavily in the balance?
