“Loss” computation in MatConvNet and Caffe
I wrote a contrastive loss layer for MatConvNet, and the loss value it produced was dramatically different from the one Caffe gives.
After days of digging into the code, I found that neither loss.m nor softmaxloss.m divides by N or by HxWx1xN; they divide only by HxW. Yet if you run the example code, cnn_train.m reports the correct loss value at every iteration. Why?
The answer is that cnn_train.m divides the loss by N for you before displaying it. The gradient used for backpropagation, however, is still N times larger than the per-sample average, which makes the network converge faster. So I tried to “correct” the code by dividing both the loss and the gradient by N, but that turned out to be a mistake: with the gradient divided by N, the network either does not converge or converges slowly to a bad valley.
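To make the factor of N concrete, here is a minimal sketch in plain Python. It is not MatConvNet or Caffe code: the function names and the toy squared-error loss are assumptions chosen only for illustration. It compares a loss summed over the batch with the same loss averaged over the batch.

```python
def summed_loss_and_grad(preds, targets):
    """Toy squared-error loss summed over the batch (hypothetical helper)."""
    residuals = [p - t for p, t in zip(preds, targets)]
    loss = sum(0.5 * r * r for r in residuals)
    grad = residuals  # derivative of the summed loss w.r.t. each prediction
    return loss, grad


def mean_loss_and_grad(preds, targets):
    """The same loss, with loss and gradient both divided by the batch size N."""
    n = len(preds)
    loss, grad = summed_loss_and_grad(preds, targets)
    return loss / n, [g / n for g in grad]


preds = [0.5, 2.0, -1.0, 3.0]
targets = [1.0, 1.0, 0.0, 1.0]
n = len(preds)

sum_loss, sum_grad = summed_loss_and_grad(preds, targets)
mean_loss, mean_grad = mean_loss_and_grad(preds, targets)

# The summed gradient is exactly N times the averaged gradient.
ratios = [s / m for s, m in zip(sum_grad, mean_grad)]
print(sum_loss, mean_loss, ratios)
```

With a fixed learning rate, a gradient that is N times larger means an update step that is N times larger, so dividing the gradient by N without also raising the learning rate shrinks every step by that factor.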
In the meantime I looked at Caffe’s architecture and found that its loss layers are defined just like the mathematical equations, with gradients divided by N. I’m confused: why do the two different approaches lead to the same result?
As a result, my workaround is... to multiply the contrastive loss by N so that it trains correctly in MatConvNet.
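The workaround can be sketched as follows in plain Python. This is an illustration under stated assumptions, not the actual MatConvNet layer: the function name, the `normalize` flag, and the margin value are hypothetical, and the formula is the usual contrastive loss with a 1/(2N) factor, which I believe matches the Caffe-style definition.

```python
import math


def contrastive_loss(pairs, labels, margin=1.0, normalize=True):
    """Contrastive loss over a batch of feature pairs (hypothetical helper).

    labels[i] == 1 marks a similar pair, 0 a dissimilar pair.
    normalize=True divides by N (the equation-style form);
    normalize=False leaves the loss summed over the batch.
    """
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        d = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
        if y == 1:
            total += d * d                        # pull similar pairs together
        else:
            total += max(margin - d, 0.0) ** 2    # push dissimilar pairs apart
    total *= 0.5
    return total / len(labels) if normalize else total


pairs = [([0.0, 0.0], [3.0, 4.0]), ([0.0, 0.0], [0.0, 0.5])]
labels = [1, 0]
n = len(labels)

averaged = contrastive_loss(pairs, labels, normalize=True)
summed = contrastive_loss(pairs, labels, normalize=False)

# Multiplying the averaged loss by N recovers the summed form.
print(averaged, summed, averaged * n)
```

Multiplying the averaged loss (and, by the same factor, its gradient) by N gives the summed form, which is the scale the other MatConvNet loss layers described above appear to use.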









