Source: Deep Learning on Medium
How did our models perform?
We evaluated each of the models based on the below metrics:
Metrics of Evaluation
Log loss function
We used both binary cross entropy and weighted log loss to score model performance. Weighted log loss was the metric defined by RSNA; it gives a higher weight to the detection of any hemorrhage (the ‘any’ label) than to the identification of the individual subtypes.
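A class-weighted log loss of this kind can be sketched as below. The exact label weights are an assumption for illustration (here the ‘any’ label simply counts double); the code is a minimal sketch, not the RSNA scoring implementation:

```python
import numpy as np

def weighted_log_loss(y_true, y_pred, weights, eps=1e-7):
    """Label-weighted binary cross entropy, averaged over samples.

    weights: one weight per label column. The 'any' label would get a
    larger weight than the subtype labels (exact values assumed here).
    """
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Weighted mean across labels, then plain mean across samples
    return np.average(per_label, axis=1, weights=weights).mean()

# Six labels: any, epidural, intraparenchymal, intraventricular,
# subarachnoid, subdural -- hypothetical weighting, 'any' counts double
label_weights = np.array([2, 1, 1, 1, 1, 1])
```

With uniform weights this reduces to ordinary binary cross entropy, which makes it easy to sanity-check.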
Recall
In our problem, recall measures how many of the actual hemorrhages the model identifies correctly. The higher the recall, the lower the number of false negatives, and hence the fewer hemorrhages left undetected. In this context, where an unidentified hemorrhage can have dire consequences, we found recall to be a very important parameter for model evaluation.
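The link between missed hemorrhages and recall can be seen with a toy example using scikit-learn's `recall_score` (the labels below are made up for illustration):

```python
from sklearn.metrics import recall_score

# 1 = hemorrhage present, 0 = absent (toy labels, not real scan data)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]  # one hemorrhage missed (a false negative)

# Recall = TP / (TP + FN) = 3 / 4
print(recall_score(y_true, y_pred))  # 0.75
```

Every additional false negative pulls this number down, which is exactly why we monitored it so closely.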
What results did we get?
We trained each of our models using the tuned hyperparameters found through our preliminary modelling. We recorded the loss, training time, and recall of each model for further comparison.
How we programmed it
We used ShuffleSplit for cross-validation; it randomly samples the training dataset on each iteration to generate a training set and a test set. With k-fold on larger datasets there may be a trade-off between cross-validation time and using the power of the entire training data, a trade-off ShuffleSplit avoids.
from sklearn.model_selection import ShuffleSplit

# Read train set
df = read_trainset()

# Train and validation split (change number of splits as needed)
ss = ShuffleSplit(n_splits=3, test_size=0.2, random_state=42).split(df.index)
train_idx, valid_idx = next(ss)

model = MyDeepModel(engine=InceptionV3, input_dims=(224, 224, 3), batch_size=32,
                    learning_rate=3e-4, num_epochs=3, decay_rate=0.8,
                    decay_steps=3, weights="imagenet", verbose=1)
history = model.fit_model(df.iloc[train_idx], df.iloc[valid_idx])
Log loss function
We observed the optimal validation loss of 0.06 with Inception V3 at epoch 2, after which the validation loss increased (overfitting). An example of how the training and validation loss change with each epoch can be seen below:
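Overfitting of this kind is typically handled by keeping the model from the epoch with the lowest validation loss. A minimal sketch of that selection logic (a generic illustration with hypothetical loss values, not the code we used):

```python
def best_epoch(val_losses):
    """Return the 0-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_losses)), key=lambda i: val_losses[i])

# Hypothetical per-epoch validation losses: the loss rises after epoch 2
val_losses = [0.09, 0.06, 0.08]
print(best_epoch(val_losses))  # 1, i.e. the second epoch
```

In practice the same effect is achieved with an early-stopping callback that restores the best weights once validation loss stops improving.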
The precision-recall curves for the different models showed that classification performance was good for detecting hemorrhages, but suffered when it came to identifying subtypes. Epidural was the most affected, as can be seen clearly in the image below.
Since recall is one of our primary evaluation metrics, we focused on evaluating model performance with respect to it. The precision-recall curves show that a lower threshold yields better recall. Given the trade-off with precision, we chose a lower threshold for classifying our data.
This sample plot is based on Inception V3 for the label ‘any’, at a threshold of 0.15.
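How lowering the decision threshold raises recall can be sketched with scikit-learn's `precision_recall_curve`. The scores below are synthetic stand-ins for the ‘any’ label, not our models' actual output:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def recall_at(y_true, y_score, threshold):
    """Recall when probabilities at or above `threshold` count as positive."""
    preds = (y_score >= threshold).astype(int)
    tp = ((preds == 1) & (y_true == 1)).sum()
    fn = ((preds == 0) & (y_true == 1)).sum()
    return tp / (tp + fn)

# Synthetic predicted probabilities for the 'any' label (illustration only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_score = np.array([0.9, 0.7, 0.4, 0.2, 0.6, 0.3, 0.2, 0.1, 0.05, 0.02])

# Full curve, as used for the precision-recall plots
precision, recall, thresholds = precision_recall_curve(y_true, y_score)

print(recall_at(y_true, y_score, 0.5))   # 0.5: two of four hemorrhages missed
print(recall_at(y_true, y_score, 0.15))  # 1.0: all four caught at the lower threshold
```

The lower threshold trades away some precision (more false positives) for the higher recall that matters most in this setting.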