Tips for debugging in (D)RL

Posted at 2019-11-03

Introduction

This article was not really meant to be public, so I haven't polished the writing. Please kindly ignore any grammatical mistakes if they exist...

Hyper-param Optimisation

I'm still somewhat confused by this myself, but my approach is to use a decaying learning rate and watch the loss curves to see when they begin to converge. In this example, loss_critic only starts decreasing when lr_critic (the critic's learning rate) is 2e-3, so I probably need to increase it.
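
To make this concrete, here is a minimal sketch of that workflow, assuming a PyTorch critic trained with Adam; the network, the placeholder batches, the decay factor, and the TensorBoard tags are all illustrative, not my actual training code.

```python
# Minimal sketch: decay the critic's learning rate and log both the loss and the
# current lr to TensorBoard, so you can see at which lr_critic the loss starts to drop.
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(critic.parameters(), lr=1e-2)                    # initial lr_critic
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)    # decay each step
writer = SummaryWriter()

for step in range(10_000):
    states = torch.randn(32, 4)      # placeholder batch; use real transitions in practice
    targets = torch.randn(32, 1)     # placeholder TD targets
    loss_critic = nn.functional.mse_loss(critic(states), targets)

    optimizer.zero_grad()
    loss_critic.backward()
    optimizer.step()
    scheduler.step()

    # Watching these two curves side by side in TensorBoard shows the lr range
    # where the critic loss actually begins to converge.
    writer.add_scalar("loss_critic", loss_critic.item(), step)
    writer.add_scalar("lr_critic", scheduler.get_last_lr()[0], step)
```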

Debugging RL algorithms in general

  • Start with the simplest environment available
    • As this guy summarised on GitHub, for continuous action space problems it is better to start with Pendulum, while for discrete action spaces, CartPole.
    • Always make sure to play with a random agent at the beginning of your journey of implementing an algorithm, to familiarise yourself with the environment (a minimal sketch appears after the quoted comment below).
    • Once your algorithm can solve those tasks, you may move on to more difficult ones.
  • Feature engineering
    • As people have discussed, we may need to scale the observations (e.g., raw pixels) / states (e.g., sensor signals) to fit in [0, 1]. But as far as I've read in some codebases, e.g., dopamine / baselines / tf-agents, they don't scale the observations in DQN; rather, they seem to prefer scaling the states over the observations. Of course, reward scaling does matter, so don't miss it like this guy:

I don't think I would agree. I think there's always a trick or a bug. In my particular case I'm working on at the moment what turned out to be the game-changer (and as of tonight made my RL agents actually learn something :)) was rescaling the reward from [-1, 1] to [0, 1] as suggested in this seemingly unrelated post and, admittedly, several of the pointers mentioned above. Thanks again to everyone that contributed!
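
To tie the points above together (playing with a random agent, and scaling observations/rewards), here is a minimal sketch. It assumes the pre-0.26 gym step/reset API that was current when this was written, with Pong-v4 as an arbitrary Atari example; the wrapper names ScaleObservation and RescaleReward are just illustrative, not taken from any of the codebases mentioned above.

```python
# Minimal sketch: wrap an env so observations land in [0, 1] and rewards are
# rescaled from [-1, 1] to [0, 1], then drive it with a random agent as a sanity check.
import gym
import numpy as np


class ScaleObservation(gym.ObservationWrapper):
    """Scale raw pixel observations from [0, 255] to [0, 1]."""

    def observation(self, obs):
        return np.asarray(obs, dtype=np.float32) / 255.0


class RescaleReward(gym.RewardWrapper):
    """Map rewards from [-1, 1] to [0, 1], as in the quoted comment above."""

    def reward(self, reward):
        return (np.clip(reward, -1.0, 1.0) + 1.0) / 2.0


env = RescaleReward(ScaleObservation(gym.make("Pong-v4")))

# Random agent: a cheap way to check that the env, the wrappers and the
# observation/reward ranges behave as expected before any learning code runs.
obs = env.reset()
for _ in range(1000):
    obs, reward, done, info = env.step(env.action_space.sample())
    assert 0.0 <= obs.min() and obs.max() <= 1.0
    assert 0.0 <= reward <= 1.0
    if done:
        obs = env.reset()
```

Doing the scaling inside wrappers keeps the agent code untouched; scaling at the network input is the other common choice.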

  • Think a lot, experiment less.
    • I once went through an endless trial-and-error loop of debugging, repeating similar experiments without devoting much time to the thought process. I would literally start an experiment, wait some 10-20 minutes while doing other tasks, check TensorBoard, then stop the experiment to change some other part. After reading this great article, I started following his work-log style: deeply analysing the result of each experiment and training for longer. It turns out this makes my thinking much clearer and reduces the time needed to find a bug or to understand a hyper-parameter's influence on the agent's behaviour.
  • Train for longer.
    • As mentioned just above, while I was in that thought process I accidentally found that an experiment which I'd thought wouldn't go well actually turned out better than before, simply because I waited a bit longer than in the previous run. So, as John Schulman mentioned, train for longer than stated in the original paper.

Why is the PyTorch implementation working but the TensorFlow one isn't??

References
