Source: Deep Learning on Medium
Problems of MIXUP
However, not all data generated by MIXUP is good data.
In the figure below, the augmented data on the left (labels between 0 and 1) are likely to be good augmentations that smooth the decision boundary, but those on the right are not. The augmented points on the right are all labeled “red” (label=1), yet they lie in the feature space of the blue data (label=0).
Manifold Mixup addresses this issue. Manifold Mixup is a data augmentation method that performs MIXUP in an intermediate layer, where the feature space is better aligned than it is at the input.
The above figure is a conceptual diagram of MIXUP performed in the input layer and in an intermediate layer. With input layer MIXUP (left), good augmented data cannot be obtained because the feature spaces of blue and red are intertwined in a complicated way. In the intermediate layer, on the other hand, the feature space is better aligned than in the input layer, so good augmented data can be obtained.
Which layer should we mix up?
So which layer is the best one to mix up? One study offers a hint. In that study, Soft Nearest Neighbor Loss with a temperature term is introduced in the intermediate layers of ResNet, and its behavior is investigated. A large Soft Nearest Neighbor Loss value indicates that the features of different classes are intertwined, and a small value indicates that the features are separated by class.
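To make the quantity concrete, here is a minimal NumPy sketch of Soft Nearest Neighbor Loss for a single batch. The function name, the `eps` stabilizer, and the toy data are my own assumptions for illustration; this is not the code used in the study.

```python
import numpy as np

def soft_nearest_neighbor_loss(features, labels, temperature=1.0):
    """Soft Nearest Neighbor Loss for one batch.

    features: array of shape [batch, dim]; labels: array of shape [batch].
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    # Pairwise squared Euclidean distances, shape [batch, batch].
    diff = features[:, None, :] - features[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    # Similarity weights scaled by the temperature; a point is not its own neighbor.
    weights = np.exp(-sq_dist / temperature)
    np.fill_diagonal(weights, 0.0)
    # Mask selecting pairs that share a class label.
    same_class = labels[:, None] == labels[None, :]
    eps = 1e-12  # assumed small constant to avoid log(0) and division by zero
    numerator = np.sum(weights * same_class, axis=1)
    denominator = np.sum(weights, axis=1)
    return float(-np.mean(np.log(eps + numerator / (eps + denominator))))

# Toy data: two tight clusters far apart.
points = np.array([[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]])
separated = soft_nearest_neighbor_loss(points, np.array([0, 0, 1, 1]))
entangled = soft_nearest_neighbor_loss(points, np.array([0, 1, 0, 1]))
```

When each cluster holds a single class, the loss is close to zero. When the labels are spread across both clusters, the loss is much larger, which matches the reading of "intertwined" above.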
The figure below shows the value of Soft Nearest Neighbor Loss in each block of ResNet. ResNet is known as a high-performance image classification network, but the figure shows that the features of each class are not separated except in the final layer.
This suggests the hypothesis that, if you use ResNet, you should perform MIXUP in the final layer. In the Manifold Mixup paper, the authors report that the best result is obtained when mixing in the final layer.
However, this figure itself appears only in the appendix. It seems that the authors of Manifold Mixup did not attach much importance to the result.
I compare final layer MIXUP with input layer MIXUP. The data used is CIFAR10. The code (Jupyter Notebook) used for this experiment has been uploaded to my GitHub.
Since mixing must be performed in an intermediate layer, I created a MIXUP layer. It takes two inputs and a ratio, and mixes the two inputs according to that ratio.
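The operation itself is small. As a rough sketch in plain NumPy rather than a framework layer (the name `mix` and the shape check are assumptions, not the notebook's code), it could look like this:

```python
import numpy as np

def mix(a, b, ratio):
    """Blend two same-shaped arrays: ratio * a + (1 - ratio) * b."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    if a.shape != b.shape:
        raise ValueError("inputs must have the same shape")
    # ratio = 1 returns a unchanged, ratio = 0 returns b unchanged.
    return ratio * a + (1.0 - ratio) * b

mixed = mix([1.0, 0.0], [0.0, 1.0], 0.25)
```

In a deep learning framework the same arithmetic would live inside a layer that receives the two tensors and the ratio as its inputs, so that it can be placed at any depth of the network.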
With this layer, both Input Mixup (ordinary MIXUP) and Final Layer Mixup (Manifold Mixup performed in the final layer) can be implemented. I use ResNet18 here.
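The two variants differ only in where the mixing happens. The toy sketch below uses a hypothetical two-layer network in place of ResNet18, and it assumes that "final layer" means the features just before the classifier head. The weights and function names are made up for illustration.

```python
import numpy as np

# Hypothetical toy weights; the experiment itself uses ResNet18.
W1 = np.array([[1.0, -1.0], [-1.0, 1.0]])  # stand-in feature extractor
W2 = np.array([[1.0], [1.0]])              # stand-in classifier head

def backbone(x):
    # Linear map followed by ReLU, standing in for the convolutional blocks.
    return np.maximum(x @ W1, 0.0)

def head(h):
    return h @ W2

def mix(a, b, ratio):
    return ratio * a + (1.0 - ratio) * b

def input_mixup_forward(x1, x2, ratio):
    # Mix the raw inputs, then run the whole network once.
    return head(backbone(mix(x1, x2, ratio)))

def final_layer_mixup_forward(x1, x2, ratio):
    # Run the backbone on each input separately, then mix the features.
    return head(mix(backbone(x1), backbone(x2), ratio))

x1 = np.array([[1.0, 0.0]])
x2 = np.array([[0.0, 1.0]])
out_input = input_mixup_forward(x1, x2, 0.5)
out_final = final_layer_mixup_forward(x1, x2, 0.5)
```

Because of the nonlinearity, the two outputs differ for the same pair of inputs and the same ratio, which is why the choice of layer matters.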
Training code implementation
In the training code, random numbers are drawn from a beta distribution every epoch, and MIXUP is performed at that ratio (Line 95 of the code).
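As a hedged illustration of that step, and not the notebook's actual code, the sketch below draws one ratio per epoch from a beta distribution and mixes a batch with a shuffled copy of itself. The value `alpha = 0.2`, the batch shape, the helper name `mixup_batch`, and the random stand-in data are all assumptions.

```python
import numpy as np

def mixup_batch(x, y_onehot, ratio, rng):
    """Mix a batch with a randomly permuted copy of itself."""
    perm = rng.permutation(len(x))
    x_mixed = ratio * x + (1.0 - ratio) * x[perm]
    # Labels are mixed with the same ratio, giving soft labels in [0, 1].
    y_mixed = ratio * y_onehot + (1.0 - ratio) * y_onehot[perm]
    return x_mixed, y_mixed

rng = np.random.default_rng(0)
alpha = 0.2  # assumed beta distribution parameter
x = rng.normal(size=(8, 32, 32, 3))           # stand-in for a CIFAR10 batch
y = np.eye(10)[rng.integers(0, 10, size=8)]   # one-hot labels, 10 classes

ratios = []
for epoch in range(3):
    # One ratio per epoch, drawn from Beta(alpha, alpha).
    ratio = rng.beta(alpha, alpha)
    ratios.append(ratio)
    x_mixed, y_mixed = mixup_batch(x, y, ratio, rng)
    # A real training loop would fit the model on (x_mixed, y_mixed) here.
```

For Final Layer Mixup, the same ratio and permutation would be applied to the features instead of to `x`, while the labels are mixed in the same way.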