Having defined the loss, we now have to compute its gradient with respect to the output neurons of the CNN in order to backpropagate it through the net and optimize the loss function by tuning the net parameters. The loss terms coming from the negative classes are zero. However, the gradient of the loss with respect to those negative classes does not vanish, since the Softmax of the positive class also depends on the negative class scores.
The gradient expression will be the same for all classes in \(C\) except the ground truth class \(C_p\), because the score of \(C_p\) (\(s_p\)) is in the numerator of the Softmax.
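As a quick sanity check of the statement above, here is a minimal sketch in plain Python (no framework). It uses the standard result that the gradient equals the Softmax output for every negative class and the Softmax output minus 1 for the positive class, and compares it with a finite-difference estimate. The function names and the example scores are illustrative assumptions, not taken from any library.

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_ce_loss(scores, positive):
    """Cross-Entropy loss for a single ground truth class index."""
    return -math.log(softmax(scores)[positive])

def softmax_ce_grad(scores, positive):
    """Analytic gradient: softmax(s) for negatives, softmax(s) - 1 for the positive."""
    grad = softmax(scores)
    grad[positive] -= 1.0
    return grad

def numeric_grad(scores, positive, eps=1e-6):
    """Central finite differences, used only to check the analytic gradient."""
    grad = []
    for i in range(len(scores)):
        up = list(scores)
        up[i] += eps
        down = list(scores)
        down[i] -= eps
        diff = softmax_ce_loss(up, positive) - softmax_ce_loss(down, positive)
        grad.append(diff / (2 * eps))
    return grad

scores = [2.0, -1.0, 0.5]  # hypothetical CNN outputs for 3 classes
positive = 0               # index of the ground truth class
analytic = softmax_ce_grad(scores, positive)
numeric = numeric_grad(scores, positive)
```

Note that the gradient entries of the negative classes are non-zero even though their loss terms are zero, which is the point made above.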
- Caffe: SoftmaxWithLoss Layer. Is limited to multi-class classification.
- PyTorch: CrossEntropyLoss. Is limited to multi-class classification.
- TensorFlow: softmax_cross_entropy. Is limited to multi-class classification.
In this Facebook work they claim that, despite being counter-intuitive, Categorical Cross-Entropy loss, or Softmax loss, worked better than Binary Cross-Entropy loss in their multi-label classification problem.
> Skip this part if you are not interested in Facebook or me using Softmax Loss for multi-label classification, which is not standard.
When Softmax loss is used in a multi-label scenario, the gradients get a bit more complex, since the loss contains an element for each positive class. Consider \(M\) to be the positive classes of a sample. The CE Loss with Softmax activations would be:
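Written out under the definitions used here (\(M\) denoting both the set of positive classes and their number, the \(1/M\) scaling factor, and a sum over all classes in \(C\), where the index \(j\) is my own notation), the loss presumably takes this form:

\[
CE = \frac{1}{M} \sum_{p \in M} -\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)
\]

Differentiating this expression with respect to a score \(s_k\) gives the Softmax output for a negative class, and the Softmax output minus \(1/M\) for a positive class:

\[
\frac{\partial CE}{\partial s_k} =
\begin{cases}
\dfrac{e^{s_k}}{\sum_{j}^{C} e^{s_j}} - \dfrac{1}{M} & \text{if } k \in M \\[2ex]
\dfrac{e^{s_k}}{\sum_{j}^{C} e^{s_j}} & \text{otherwise}
\end{cases}
\]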
Where each \(s_p\) in \(M\) is the CNN score for each positive class. As in the Facebook paper, I introduce a scaling factor \(1/M\) to make the loss invariant to the number of positive classes, which may differ per sample.
As neither the Caffe Softmax with Loss layer nor the Multinomial Logistic Loss layer accepts multi-label targets, I implemented my own PyCaffe Softmax loss layer, following the specifications of the Facebook paper. Caffe Python layers let us easily customize the operations done in the forward and backward passes of the layer:
Forward pass: Loss computation
We first compute Softmax activations for each class and store them in probs. Then we compute the loss for each image in the batch, considering there might be more than one positive label. We use a scale_factor (\(1/M\)) and we also multiply losses by the labels, which can be binary or real numbers, so they can be used, for instance, to introduce class balancing. The batch loss will be the mean loss of the elements in the batch. We then save the data_loss to display it and the probs to use them in the backward pass.
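The forward steps described above can be sketched as a standalone function. This is a plain-Python illustration under stated assumptions, not the PyCaffe layer itself: the function name, the list-of-lists batch layout, and the example inputs are hypothetical, and every sample is assumed to have at least one positive label.

```python
import math

def softmax(scores):
    """Numerically stable Softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multilabel_softmax_forward(batch_scores, batch_labels):
    """Return (data_loss, probs) for a batch.

    batch_scores: one list of class scores per sample.
    batch_labels: same shape; 0 for negatives, a non-zero weight for positives.
    """
    probs = [softmax(row) for row in batch_scores]
    losses = []
    for p_row, l_row in zip(probs, batch_labels):
        num_pos = sum(1 for l in l_row if l != 0)
        scale_factor = 1.0 / num_pos  # assumes at least one positive per sample
        loss = 0.0
        for p, l in zip(p_row, l_row):
            if l != 0:  # only positive classes contribute to the loss
                loss += -math.log(p) * l * scale_factor
        losses.append(loss)
    data_loss = sum(losses) / len(losses)  # mean loss over the batch
    return data_loss, probs

# Hypothetical batch: 2 samples, 4 classes, all scores equal (probs are 0.25).
batch_scores = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
batch_labels = [[1, 1, 0, 0], [0, 0, 0, 1]]
data_loss, probs = multilabel_softmax_forward(batch_scores, batch_labels)
```

Because of the \(1/M\) scaling, the sample with two positives and the sample with one positive contribute the same loss here, \(\log 4\) each.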
Backward pass: Gradients computation
In the backward pass we need to compute the gradients of each element of the batch with respect to each one of the class scores \(s\). As the gradient for all the classes in \(C\) except the positive classes in \(M\) is equal to probs, we assign the probs values to delta. For the positive classes in \(M\) we subtract 1 from the corresponding probs value and use scale_factor to match the gradient expression. We compute the mean gradients over the whole batch to run the backpropagation.
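A matching sketch of the backward steps, again in plain Python rather than as a PyCaffe layer. The function name and example values are hypothetical, and labels are assumed to be binary here (with real-valued label weights the gradient would also need to account for those weights).

```python
def multilabel_softmax_backward(probs, batch_labels):
    """Gradient of the mean batch loss with respect to every class score.

    probs: Softmax activations saved in the forward pass (list of lists).
    batch_labels: same shape; 0 for negatives, 1 for positives (assumed binary).
    """
    batch_size = len(probs)
    # For negative classes the gradient is simply the Softmax activation.
    delta = [list(row) for row in probs]
    for r, l_row in enumerate(batch_labels):
        num_pos = sum(1 for l in l_row if l != 0)
        scale_factor = 1.0 / num_pos  # assumes at least one positive per sample
        for c, l in enumerate(l_row):
            if l != 0:
                # Positive class: algebraically equal to probs - scale_factor.
                delta[r][c] = (scale_factor * (delta[r][c] - 1.0)
                               + (1.0 - scale_factor) * delta[r][c])
    # Average over the batch, matching the mean used for the batch loss.
    return [[d / batch_size for d in row] for row in delta]

# Hypothetical saved activations for one sample with 4 classes, 2 of them positive.
probs = [[0.25, 0.25, 0.25, 0.25]]
batch_labels = [[1, 1, 0, 0]]
grads = multilabel_softmax_backward(probs, batch_labels)
```

In this example each positive class gets a gradient of 0.25 - 0.5 = -0.25 and each negative class gets 0.25, so the gradients of a sample sum to zero, as expected for a Softmax-based loss.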
Binary Cross-Entropy Loss
Also called Sigmoid Cross-Entropy loss. It is a Sigmoid activation plus a Cross-Entropy loss. Unlike Softmax loss, it is independent for each vector component (class), meaning that the loss computed for every CNN output vector component is not affected by other component values. That’s why it is used for multi-label classification, where the insight of an element belonging to a certain class should not influence the decision for another class. It’s called Binary Cross-Entropy Loss because it sets up a binary classification problem between \(C’ = 2\) classes for every class in \(C\), as explained above. So when using this loss, the formulation of Cross-Entropy Loss for binary problems is often used:
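For a single class with score \(s\) and target \(t \in \{0, 1\}\), the usual binary formulation is \(CE = -t \log(f(s)) - (1 - t) \log(1 - f(s))\), with \(f\) the Sigmoid function (the symbols \(s\), \(t\) and \(f\) are my notation here). A minimal plain-Python sketch of it follows, applied independently to each class; the function names and example values are illustrative assumptions, and a production implementation would typically use a more numerically stable form.

```python
import math

def sigmoid(s):
    """Sigmoid activation for a single score."""
    return 1.0 / (1.0 + math.exp(-s))

def binary_cross_entropy(score, target):
    """Binary Cross-Entropy for one class: target is 1 (positive) or 0 (negative)."""
    p = sigmoid(score)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def multilabel_bce(scores, targets):
    """Per-class losses; each one depends only on its own score and target."""
    return [binary_cross_entropy(s, t) for s, t in zip(scores, targets)]

scores = [1.5, -0.3, 2.0]  # hypothetical CNN outputs for 3 classes
targets = [1, 0, 1]        # multi-label ground truth
losses = multilabel_bce(scores, targets)

# Changing only the second score leaves the other two losses untouched.
changed = multilabel_bce([1.5, 4.0, 2.0], targets)
```

Comparing `losses` and `changed` shows the independence described above: only the entry of the class whose score changed is different, whereas with Softmax every entry would move.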