Perception, Control, Cognition

Practical Advice for Building Deep Neural Networks

In our machine learning lab, we’ve accumulated tens of thousands of training hours across numerous high-powered machines. The computers weren’t the only ones to learn a lot in the process, though: we ourselves have made a lot of mistakes and fixed a lot of bugs.

Here we present some practical tips for training deep neural networks, based on our experience (rooted mainly in TensorFlow). Some of the suggestions may seem obvious to you, but at some point they weren't obvious to at least one of us. Other suggestions may not apply to your particular task, or might even be bad advice for it: use discretion!

We acknowledge these are all well-known methods. We, too, stand on the shoulders of giants here! Our objective with this article is simply to summarize them at a high level for use in practice.

General Tips

Debugging a Neural Network

If your network isn't learning (meaning: the loss/accuracy is not converging during training, or you're not getting the results you expect), try these tips:

An Example Case Study

To help make the process described above more relatable, here are a few loss charts (via TensorBoard) from some actual regression experiments with a convolutional neural network that we built.
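As a quick point of reference, here is a minimal sketch of how a loss curve can be logged for TensorBoard using the tf.summary API; the log directory and the dummy loss values are purely illustrative, not from these experiments.

```python
import tensorflow as tf

# Minimal sketch: write a scalar "loss" series that TensorBoard renders as a chart.
# The log directory and the dummy loss curve are illustrative only.
writer = tf.summary.create_file_writer("logs/example_run")

with writer.as_default():
    for step in range(1000):
        dummy_loss = 1.0 / (step + 1.0)  # stand-in for the real training loss
        tf.summary.scalar("loss", dummy_loss, step=step)
```

Pointing TensorBoard at the log directory (tensorboard --logdir logs) then shows the curve.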

At first, the network was not learning at all:

We tried clipping the values, to prevent them from going out of bounds:
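As a rough illustration, here is a minimal sketch of value clipping with tf.clip_by_value; the clipping range and the tensor being clipped are made up, and in practice the clip might be applied to outputs, targets, or gradients depending on what is going out of bounds.

```python
import tensorflow as tf

# Sketch of value clipping: anything outside [-1, 1] is squashed back to the boundary.
# The range and the tensor here are illustrative.
values = tf.constant([-3.2, -0.4, 0.0, 0.7, 5.1])
clipped = tf.clip_by_value(values, clip_value_min=-1.0, clip_value_max=1.0)
print(clipped.numpy())  # approximately [-1.  -0.4  0.   0.7  1. ]
```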

Huh. Look at how crazy the un-smoothed values in that chart are. Learning rate too high? We tried decaying the learning rate and training on just one input:

You can see where the first few changes to the learning rate occurred (at about steps 300 and 3000). Obviously, we decayed too quickly. So we gave it more time between decays, and it did better:

You can see we decayed at steps 2000 and 5000. This was better, but still not great, because the loss didn't go to 0.
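This kind of step-wise schedule can be sketched with Keras's piecewise-constant decay; the boundaries and rates below are illustrative stand-ins loosely mirroring the decays at steps 2000 and 5000, not the exact values from these runs.

```python
import tensorflow as tf

# Sketch of a step-wise learning-rate decay: hold the rate constant, then drop
# it at chosen step boundaries. Boundaries and rates are illustrative.
schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[2000, 5000],
    values=[1e-3, 1e-4, 1e-5],  # LR before step 2000, between 2000 and 5000, and after 5000
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```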

Then we disabled LR decay and tried moving the values into a narrower range instead by putting the inputs through a tanh. While this obviously brought the error values below 1, we still couldn’t overfit the training set:
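Here is a minimal sketch of that idea; the scale factor is a made-up example, and in practice it should reflect the spread of your own inputs so the tanh doesn't saturate for typical values.

```python
import tensorflow as tf

# Sketch of squashing wide-ranging inputs into (-1, 1) with tanh.
# Dividing by a rough estimate of the input's spread keeps the tanh
# from saturating for typical values; the scale here is illustrative.
raw_inputs = tf.constant([-250.0, -10.0, 0.0, 30.0, 400.0])
scale = 100.0
squashed = tf.tanh(raw_inputs / scale)
print(squashed.numpy())  # values now lie strictly between -1 and 1
```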

This is where we discovered, by removing batch normalization, that the network was quickly outputting NaN after one or two iterations. We left batch norm disabled and changed our initialization to variance scaling. These made all the difference! We were able to overfit our test set of just one or two inputs. While the chart on the bottom clips the Y axis, the initial error value was well above 5, showing a reduction in error by almost 4 orders of magnitude:
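A minimal sketch of what those two changes can look like in tf.keras, with variance-scaling initialization and no batch normalization; the architecture and shapes below are placeholders, not our actual network.

```python
import tensorflow as tf

# Sketch: a small convolutional regression model with variance-scaling
# initialization and no batch normalization. The layer sizes and input
# shape are placeholders.
init = tf.keras.initializers.VarianceScaling()

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", kernel_initializer=init),
    tf.keras.layers.Conv2D(64, 3, activation="relu", kernel_initializer=init),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, kernel_initializer=init),  # single regression output
])

# Build and run on a dummy batch just to check that the shapes work out.
print(model(tf.zeros([1, 64, 64, 3])).shape)  # (1, 1)
```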

The top chart is heavily smoothed, but you can see that it overfit the test input extremely quickly, and the loss of the whole training set marched down below 0.01 over time. This was without decaying the learning rate. We then continued training after dropping the learning rate by one order of magnitude, and got even better results:

These results were much better! But what if we decayed the learning rate geometrically rather than splitting training into two parts?

When we multiplied the learning rate by 0.9995 at each step, the results were not as good:

… presumably because the decay was too quick. A multiplier of 0.999995 did better, but the results were nearly equivalent to not decaying it at all. We concluded from this particular sequence of experiments that batch normalization was hiding exploding gradients caused by poor initialization, and that decaying the learning rate was not particularly helpful with the Adam optimizer, except perhaps for one deliberate decay at the end. Along with batch norm, clipping the values was just masking the real problem. We also tamed our high-variance input values by putting them through a tanh.
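For completeness, per-step geometric decay like the 0.9995 multiplier above can be expressed with Keras's exponential-decay schedule; the initial learning rate here is illustrative.

```python
import tensorflow as tf

# Sketch of geometric (per-step) learning-rate decay: with decay_steps=1 and
# decay_rate=0.9995, the rate is multiplied by 0.9995 at every step.
# The initial learning rate is illustrative.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=1,
    decay_rate=0.9995,
)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```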

We hope you will find these basic tips useful as you become more familiar with building deep neural networks. Often, it’s just simple things that can make all the difference.