Source: Deep Learning on Medium
Classifying metastases is probably not an easy task even for a trained pathologist, and extremely difficult for an untrained eye. According to Libre Pathology, lymph node metastases can have these features:
- Foreign cell population — key feature (Classic location: subcapsular sinuses)
- Cells with cytologic features of malignancy
- Cells in architectural arrangements seen in malignancy; highly variable — dependent on tumour type and differentiation
We know that the label of the image is influenced only by the center region (32 x 32px) so it would make sense to crop our data to that region only. However, some useful information about the surroundings could be lost if we crop too close. This hypothesis could be confirmed by training models with varying crop sizes.
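As a minimal sketch of the cropping idea (the helper name and the use of raw numpy arrays are illustrative assumptions, not the kernel's actual code):

```python
import numpy as np

def center_crop(img, size=32):
    """Crop an (H, W, C) image array to its central size x size region."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# PCam tiles are 96 x 96 px; only the central 32 x 32 px region
# determines the label, but a looser crop keeps some context.
img = np.zeros((96, 96, 3), dtype=np.uint8)
print(center_crop(img).shape)      # (32, 32, 3)
print(center_crop(img, 48).shape)  # (48, 48, 3)
```

Training the same model with `size` set to 32, 48, 64, and 96 would be one way to test the hypothesis above.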
There are a couple of ways to avoid overfitting: more data, augmentation, regularization, and less complex model architectures. For this type of data we can use the following augmentations:
- random rotation
- random crop
- random flip (horizontal and vertical both)
- random lighting
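The listed augmentations can be sketched in plain numpy; this is a hypothetical illustration only (the real kernel uses fast.ai's transform pipeline, and the crop size and lighting range here are assumptions):

```python
import numpy as np

def augment(img, rng, crop=90):
    """Apply one random draw of the augmentations listed above."""
    # random crop: pick a random crop x crop window inside the tile
    top = rng.integers(img.shape[0] - crop + 1)
    left = rng.integers(img.shape[1] - crop + 1)
    img = img[top:top + crop, left:left + crop]
    # random rotation by a multiple of 90 degrees
    img = np.rot90(img, k=rng.integers(4))
    # random horizontal and vertical flips
    if rng.random() < 0.5:
        img = img[:, ::-1]
    if rng.random() < 0.5:
        img = img[::-1, :]
    # random lighting: scale brightness by up to +/- 10%
    return np.clip(img * rng.uniform(0.9, 1.1), 0, 255)

rng = np.random.default_rng(42)
out = augment(np.full((96, 96, 3), 128.0), rng)
print(out.shape)  # (90, 90, 3)
```

Rotations by multiples of 90 degrees and flips are safe here because pathology slides have no canonical orientation.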
Note that the same augmentations can also be applied when we are predicting (inference). This is called test-time augmentation (TTA), and it can improve our results if we run inference multiple times for each image and average out the predictions.
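The TTA idea can be sketched as follows; `predict` is a placeholder for the trained model and the flip stands in for the full augmentation set, so none of this is fast.ai API:

```python
import numpy as np

def tta_predict(predict, img, rng, n=8):
    """Average the model's predictions over n randomly augmented copies."""
    preds = []
    for _ in range(n):
        aug = img[:, ::-1] if rng.random() < 0.5 else img  # random h-flip
        preds.append(predict(aug))
    return float(np.mean(preds))

# Toy stand-in model: "probability of tumor" as mean intensity of the tile.
toy_model = lambda x: x.mean() / 255.0
rng = np.random.default_rng(0)
img = np.full((96, 96, 3), 128.0)
print(round(tta_predict(toy_model, img, rng), 3))  # 0.502
```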
We will use the Fast.ai v1 software library, which is built on PyTorch. What I like about Fast.ai is that it includes “out of the box” support for many recent advancements in deep learning research. If you want to use the 0.7 version of Fast.ai, see commit version 9 of this kernel. We will be using a pretrained convnet model and transfer learning to adjust the weights to our data. Going for a deeper model architecture would start overfitting faster.
First, we find the optimal learning rate and weight decay values. The optimal learning rate is just before the base of the loss curve, before the start of divergence. It is important that the loss is still descending at the point where we select the learning rate.
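The learning-rate range test behind this works by sweeping the learning rate exponentially from a tiny value to a large one while recording the loss at each step; the sketch below shows only the schedule, and the function name and bounds are illustrative assumptions, not the fast.ai implementation:

```python
def lr_schedule(step, total_steps, lr_min=1e-7, lr_max=10.0):
    """Exponential sweep: lr_min at step 0, lr_max at the final step."""
    return lr_min * (lr_max / lr_min) ** (step / total_steps)

# The loss is recorded at each step of the sweep; we then pick a learning
# rate from the region where the loss is still clearly descending.
print(lr_schedule(0, 100))   # starts near lr_min
print(lr_schedule(50, 100))  # geometric midpoint of the sweep
```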
As for the weight decay, which is the L2 penalty of the optimizer, Leslie Smith proposes selecting the largest value that will still let us train at a high learning rate, so we do a small grid search with weight decays of 1e-2, 1e-4, and 1e-6. We want to select the largest weight decay that reaches a low loss and has the highest learning rate before the loss shoots up. Of the tested values, 1e-4 seems like the largest weight decay that allows us to train at the maximal learning rate.
Next, we train only the heads while keeping the rest of the model frozen. Otherwise, the random initialization of the head weights could harm the relatively well-performing pre-trained weights of the model. After the heads have adjusted and the model somewhat works, we can continue to train all the weights.
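The freeze-then-unfreeze schedule can be illustrated with a toy optimizer step; the dict-of-layers structure below is purely an illustration of the concept, not the fast.ai or PyTorch API:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = {
    "backbone": {"w": rng.standard_normal(4), "trainable": False},  # pretrained, frozen
    "head":     {"w": np.zeros(4),            "trainable": True},   # new head
}

def sgd_step(layers, grads, lr=0.1):
    """Plain SGD update that skips frozen layers."""
    for name, layer in layers.items():
        if layer["trainable"]:
            layer["w"] = layer["w"] - lr * grads[name]

grads = {"backbone": np.ones(4), "head": np.ones(4)}
sgd_step(layers, grads)                 # phase 1: only the head is updated
layers["backbone"]["trainable"] = True  # unfreeze once the head has adjusted
sgd_step(layers, grads)                 # phase 2: all weights are updated
```

During phase 1 the pretrained backbone weights stay exactly where they were, shielding them from the noisy gradients produced by the randomly initialized head.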
We can see from the plotted losses that there is a small rise after the initial drop, caused by the increasing learning rate during the first half of the cycle. The losses rise temporarily while max_lr drives the model out of local minima, but this pays off in the end when the learning rate is decreased.
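A simplified triangular sketch of this one-cycle schedule makes the loss behavior easy to see (fast.ai's fit_one_cycle also anneals momentum, and the max_lr and div values here are illustrative assumptions):

```python
def one_cycle_lr(step, total_steps, max_lr=1e-2, div=25.0):
    """Triangular one-cycle schedule: lr climbs to max_lr over the first
    half of training (causing the temporary rise in loss), then decays
    back down over the second half."""
    base_lr = max_lr / div
    half = total_steps / 2
    frac = step / half if step <= half else (total_steps - step) / half
    return base_lr + frac * (max_lr - base_lr)

print(one_cycle_lr(0, 100))   # base lr at the start
print(one_cycle_lr(50, 100))  # peak (max_lr) at mid-cycle
print(one_cycle_lr(100, 100)) # back to base lr at the end
```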
A confusion matrix helps us understand the ratio of false negatives to false positives and is a quick way of looking at our model’s performance. It is a simple table that shows counts of actual labels vs. predicted labels. Here we can see that the model has learned to distinguish tumor from negative samples and is already performing well.
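For binary labels the table is tiny to compute by hand; in this minimal sketch 0 = negative and 1 = tumor, and the example counts are made up for illustration, not the kernel's actual results:

```python
def confusion_matrix(actual, predicted):
    """Return 2x2 counts: rows = actual label, columns = predicted label."""
    m = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        m[a][p] += 1
    return m

actual    = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 1, 1, 0, 0]
print(confusion_matrix(actual, predicted))  # [[2, 1], [1, 2]]
```

Here m[0][1] counts false positives (negative tiles predicted as tumor) and m[1][0] counts false negatives, the clinically costlier error.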