Review of the paper by Yair Degani, Hayit Greenspan, and Jacob Goldberger, accepted at ISBI 2018

Problem :
Annotations from human experts can be unreliable, depending on the complexity of the task, differences in medical training, or prior experience.

Idea :
The authors propose modeling the annotation errors as a noise function that depends on the true distribution of the data and on noise parameters. These noise parameters are learned during the optimization of the network.

Solution :
A noisy-annotation layer is appended to the end of a standard neural network to learn $\theta(i,j) = p(z=j \mid y=i)$, the probability of the noisy label $z$ given the true label $y$. In the framework of the neural network, this is equivalent to:
$$p(z=j | x;w;\theta) = \sum_{i=1}^{k} \theta(i,j) p(y=i|x,w)$$
where $x$ denotes the input features, $w$ the network parameters, and $\theta$ the noise-layer parameters.
The probability vector over the noisy labels is obtained as the matrix-vector product of the noise parameters and the probability vector over the true labels, $P_z = \theta^T P_y$, where the noise matrix is parameterized row-wise by a softmax: $\theta(i,j) = \frac{\exp(b_{ij})}{\sum_{l=1}^{k} \exp(b_{il})}$.
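A minimal numpy sketch of this forward pass (function names, logit values, and shapes are my own illustration, not the authors' code):

```python
import numpy as np

def noise_matrix(b):
    """Row-wise softmax of the trainable logits b, so that each row
    theta[i] = p(z | y=i) is a valid probability distribution."""
    e = np.exp(b - b.max(axis=1, keepdims=True))  # numerically stabilized
    return e / e.sum(axis=1, keepdims=True)

def noisy_label_probs(p_y, b):
    """p(z=j | x) = sum_i theta(i,j) * p(y=i | x), i.e. theta^T @ p_y."""
    theta = noise_matrix(b)
    return theta.T @ p_y

# k = 2 classes; near-identity logits mean "annotators are mostly reliable"
b = np.array([[2.0, 0.0],
              [0.0, 2.0]])
p_y = np.array([0.9, 0.1])       # network's belief over the true label
p_z = noisy_label_probs(p_y, b)  # resulting distribution over noisy labels
```

Note that `p_z` remains a valid distribution because each row of `theta` sums to one; the noise layer only redistributes probability mass between classes.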

In practice, the model is first trained without the noise layer; the layer is then plugged onto the end of the network for a second training stage.
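One natural way to start the second stage (an assumption on my part, not stated in the review) is to initialize the noise-layer logits close to a scaled identity, so that $\theta \approx I$ and the combined model initially reproduces the stage-one predictions:

```python
import numpy as np

k = 2  # number of classes (assumed binary, as in the experiments)

# Logits close to a scaled identity: at the start of stage two,
# theta ~ I and p(z|x) ~ p(y|x); training then moves theta away from
# the identity only if the noisy labels warrant it.
b_init = 6.0 * np.eye(k)
theta_init = np.exp(b_init) / np.exp(b_init).sum(axis=1, keepdims=True)
```

With this initialization, adding the layer cannot initially hurt the pretrained classifier, since it starts out as (approximately) the identity map on the softmax output.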

Results :
Regions of interest (ROIs) are extracted from the original images, and a set of classical features (GLCM, GLRLM, Gabor filters) is computed. Since this is a multi-view problem, each ROI has two views, giving a total of 24 features (2 views x 12 features). The neural network consists of two hidden layers with ReLU activations, dropout, and a softmax output. The noisy-annotation layer is placed after the softmax in order to produce $p(z|x)$.
For the experiments, the network is trained on synthetic noisy data obtained by flipping the binary labels with probability $p$ (to simulate noisy annotations). The results show that as $p$ increases, the model with the noise layer achieves better accuracy than the baseline (around +0.05). It even yields better accuracy in the absence of noise.
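The synthetic-noise protocol described above amounts to flipping each binary label independently with probability $p$; a small numpy sketch (my own illustration of the setup):

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_labels(y, p, rng):
    """Simulate unreliable annotators: flip each binary label with prob p."""
    flips = rng.random(y.shape) < p
    return np.where(flips, 1 - y, y)

# 10,000 clean negative labels, corrupted with p = 0.2
y = np.zeros(10_000, dtype=int)
z = flip_labels(y, 0.2, rng)
# z.mean() is now close to 0.2: roughly 20% of the labels are flipped
```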

Personal Opinion :
This is an interesting topic in general; dealing with annotations from multiple experts is a situation that should be handled with care. While I find the idea of modeling the label noise a good start, I have concerns about the experimental setup and the interpretation of the results. In the experiments section, the authors chose to simulate annotation noise by randomly flipping labels; how similar is this to a real-life setting?
The effectiveness of the proposed layer is attributed to the noise being captured by parameters learned during training; in my opinion, those parameters could also be increasing the capacity of the network, or learning inter-class relationships (as a Conditional Random Field does). Figure 5, showing classification accuracy, indicates that adding the noise layer improves over the baseline even in the absence of noise, raising questions about what is really happening under the hood.

The authors are encouraged to provide any information that could help in understanding this work better, or to share any follow-up to it.

Original Paper