Reproducing Deep Transfer Learning for Art Classification Problems

Several images of the Rijksmuseum art collection

The lack of available training data is a well-known issue in the Deep Learning community. It is one of the main reasons for the development of the research field of Transfer Learning (TL). The basic notion of TL is to train a machine learning algorithm on a new task (e.g. a classification problem) while exploiting knowledge that the algorithm has already learned on a previous, related task (a different classification problem). TL has proved to be extremely successful in Deep Learning. In practice, researchers often apply TL to make up for the lack of data for their own task instead of training a network from scratch.

Sabatelli et al. (2018) explored the efficiency of Transfer Learning by applying it to the field of Art Classification. The main idea is to take a ready-to-use deep convolutional network that was pre-trained on a different dataset and set it loose on your own collection of art images. To summarize their paper, the authors experimented with several networks pre-trained on ImageNet and tested how well they performed on two different datasets: the Rijksmuseum dataset and the Antwerpen dataset. They tackled three classification problems: classification by material, by type and by artist. They explored two different approaches: the ‘off-the-shelf’ approach and the ‘fine-tuning’ approach. The first relates to a frozen network where the parameters of the network are unchanged. The final top-layer classifier is the only component of the architecture which is actually trained. The second approach unfreezes all parameters and trains the entire network along with the classifier. They summarized their results in a nice table shown below … and we tried to reproduce them.

Results as published by the authors.


For the reproduction of the paper, we downloaded the Rijksmuseum collection of art images from the Rijksmuseum challenge 2014 website. This is the same dataset used by the authors and consists of over 100K images. We split the data into 80% training, 10% validation and 10% test data. To make the data usable for the models, we categorized all images by material, by artist and by type.

In addition to reproducing the results, we extended the research in two ways:

  1. The models were tested on a different dataset, the iMet 2020 collection of the New York Metropolitan Museum of Art.
  2. A different model, AlexNet, was evaluated on the Rijksmuseum and iMet datasets.

The first contribution was made by experimenting with an additional dataset. We downloaded the iMet 2020 collection of art images, which consists of over 140K labeled items. Unfortunately, the label categories were not identical to the Rijksmuseum ones. The provided labels were categorized by country, by culture, by material and by tags. We decided to create one subset of images labeled by country and one subset of images labeled by material, which left the first with 25K labeled images and the second with 17K labeled images. Both were split into 90% training data and 10% validation data.

The second contribution was the training of AlexNet on the Rijksmuseum dataset and the iMet dataset. Like the other models, it was evaluated ‘off-the-shelf’ and after ‘fine-tuning’. Its performance was then compared to that of the other models.


Judging by the label counts reported by the authors, the datasets we created differ from the ones used in the paper. We processed the data as follows:

  • Retrieving data from the XML
  • Splitting the data with an 8:1:1 ratio into train, test and val folders respectively, and adding each artwork to a folder representing its label
  • Removing label folders that weren’t present in all 3 sets
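
The splitting and filtering steps above can be sketched roughly as follows. This is a minimal illustration and not our exact pipeline: the function name `split_and_filter`, the global shuffle before splitting, and the in-memory dictionaries that stand in for the label folders are all assumptions made for this example.

```python
import random
from collections import defaultdict


def split_and_filter(items, ratios=(8, 1, 1), seed=0):
    """Split (image_id, label) pairs into train/val/test and drop rare labels.

    Hypothetical helper: the returned nested dict mirrors a folder layout
    of the form split/label/image.
    """
    names = ("train", "val", "test")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # assumption: one global shuffle

    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    parts = {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

    # Group every split by label, like placing images in label folders.
    folders = {name: defaultdict(list) for name in names}
    for name, part in parts.items():
        for image_id, label in part:
            folders[name][label].append(image_id)

    # Keep only the labels that occur in all three splits.
    kept = set.intersection(*(set(folders[name]) for name in names))
    return {
        name: {label: folders[name][label] for label in sorted(kept)}
        for name in names
    }
```

A label with only a handful of images is likely to be missing from at least one split and is then removed everywhere, which is how this kind of filtering lowers the label count.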

This resulted in fewer labels than reported in the paper. As the table below shows, the difference is significant for type and material. This difference in the number of labels is the most likely reason why our models outperformed the ones in the paper, possibly because the pre-processing removed images that are challenging to classify.

The number of items and labels for each challenge and dataset


TL algorithms have become abundant over the past years, and many plug-and-play algorithms can be found online. We based our algorithm mainly on this PyTorch implementation and this one in Google Colab. For the full code, we refer to our GitHub page.

Importing the data

Given the sheer size of the datasets, the Rijksmuseum dataset in particular (~10GB after pre-processing), we had to figure out a way to get the data into Colab. Simply uploading it took hours every time we started a Colab session and would at times fail, so that was not an option. Linking Colab to Google Drive led to similar issues. We also got Colab to work on a local runtime, but our own GPUs were not nearly fast enough to get the job done.

We found our solution by linking Colab to a Kaggle account. Uploading to Kaggle still took hours, but had to be done only once. Also, exporting the big dataset from Kaggle to Colab and unzipping it was a matter of minutes, and required just the following lines of code.

Code snippet for integrating Colab with Kaggle and unzipping the data.
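
As a rough idea of what this step can look like in plain Python, here is a minimal sketch. It assumes the `kaggle` command-line tool is installed and an API token is already configured in the session. The helper names `download_from_kaggle` and `unzip_all` are hypothetical, and the dataset slug is a placeholder to be replaced with the real `owner/dataset-name`.

```python
import subprocess
import zipfile
from pathlib import Path


def download_from_kaggle(dataset_slug, dest_dir):
    """Download a Kaggle dataset archive with the Kaggle CLI.

    Assumes the `kaggle` CLI is on the PATH and credentials are set up.
    """
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["kaggle", "datasets", "download", "-d", dataset_slug, "-p", str(dest_dir)],
        check=True,  # raise an error if the download fails
    )


def unzip_all(archive_dir, target_dir):
    """Extract every .zip file in archive_dir into target_dir."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    extracted = []
    for archive in sorted(Path(archive_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target_dir)
        extracted.append(archive.name)
    return extracted


# Example usage (placeholder slug, not a real dataset):
# download_from_kaggle("owner/dataset-name", "downloads")
# unzip_all("downloads", "data")
```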

Here are the Kaggle links for the Rijksmuseum dataset classified by material, type and artist, and for the iMet dataset classified by material and by country.

The TL algorithm

The algorithm consists of two main stages: defining the pre-trained model and retraining it. The following piece of code allows for setting the model architecture we want to use, as well as whether to freeze the pre-trained parameters.

Code snippet for selecting the desired pre-trained model. The parameters are frozen if necessary.

The next piece of code is the actual TL procedure, where the specified model will train on the new data for at most a given number of epochs, 25 in our case. Just as in the paper, we implemented an early stopping mechanism which makes the algorithm terminate if the accuracy on the validation set has not improved for 7 epochs in a row.

Code snippet for the actual training of the unfrozen parameters in the network.
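
The early stopping logic on its own can be sketched independently of any deep learning framework. In the minimal example below, `train_one_epoch` and `evaluate` are hypothetical callables standing in for the real training and validation passes, and the function name `train_with_early_stopping` is likewise an assumption.

```python
def train_with_early_stopping(train_one_epoch, evaluate, max_epochs=25, patience=7):
    """Train for at most max_epochs and stop early on a validation plateau.

    train_one_epoch: callable that runs one pass over the training data.
    evaluate: callable that returns the current validation accuracy.
    Training stops once the accuracy has not improved for `patience`
    consecutive epochs.
    """
    best_acc = float("-inf")
    best_epoch = -1
    history = []

    for epoch in range(max_epochs):
        train_one_epoch()
        acc = evaluate()
        history.append(acc)

        if acc > best_acc:
            # New best validation accuracy: remember it and keep going.
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row.
            break

    return best_acc, best_epoch, history
```

In a real run one would typically also store a copy of the model weights whenever a new best accuracy is reached, so that the best model can be restored after stopping.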

We now make the distinction between the two types of TL the authors explored: fine-tuning vs off-the-shelf models. To recap, fine-tuning allows all parameters to be retrained, while off-the-shelf only retrains the final softmax classification layer of the network, keeping all other parameters fixed. The lines of code to initialize and run these two TL procedures are shown below.


Code snippet for optimizing a network by fine-tuning.


Code snippet for optimizing a network by using it off-the-shelf.


The results of our reproduction of “Deep Transfer Learning for Art Classification Problems” are given in the following paragraphs. They are split into the results for the Rijksmuseum dataset and those for the iMet dataset. The Rijksmuseum dataset was evaluated on the classification tasks of type, material and artist. The iMet dataset was evaluated on classification by country of origin and by material.

We trained three different neural network models on both datasets: ResNet50 (shown in blue in the figures), VGG19 (red), and AlexNet (yellow). Solid lines show the ‘fine-tuning’ approach and dashed lines show the results with the network used ‘off-the-shelf’. The figures below show the accuracy per model as a function of the number of completed epochs. A straight line indicates that the early stopping mechanism was applied. The table shows the highest validation accuracy obtained.

Rijksmuseum dataset

In the table below, the final accuracies of our models are listed next to those from the paper. For many of them the difference is quite small. One can see, however, that the models generally performed better in the reproduction. The possible reasons for this are given in the discussion. The most notable differences are the gap for the type classification problem and the large difference in off-the-shelf ResNet50 performance for artist classification.

Table with the accuracies of our reproduced experiments compared to the original accuracies.

The figure below shows how stable off-the-shelf learning is with VGG19 compared to ResNet50 and AlexNet.

Test accuracy of material classification on the Rijksmuseum dataset

This figure repeats the trend of AlexNet underperforming compared to the other models.

Test accuracy of type classification on the Rijksmuseum dataset.

The figure for artist classification, which plots accuracy per epoch, shows that ResNet50 had an unstable learning process compared to VGG19 and AlexNet.

Test accuracy of artist classification on the Rijksmuseum dataset.

iMet 2020 dataset

Table with the accuracies on the iMet 2020 dataset.

The iMet dataset results, shown in the figures below, give an accuracy between 0.73 and 0.81 for material classification with the fine-tuned models, which is slightly lower than the material accuracy on the Rijksmuseum set. We expect this to be because our iMet 2020 subsets contain far less data.

Test accuracy of material classification on the iMet 2020 dataset.

The classification by country shows that the models are almost as capable at recognizing country of origin as at recognizing material.

Test accuracy of country classification on the iMet 2020 dataset.


The results of the reproduction project were slightly odd: our models, although quite similar to the authors’ models, generally outperformed them by a small margin. This can be seen in the figures and tables. We believe the reason for this difference is the pre-processing of our data. The authors’ pre-processing was not entirely clear to us, and because we believed that asking the authors would taint the reproduction, our pre-processing produced a different outcome. This can be seen in the table of dataset differences. Because fewer labels are left in our datasets, the models have an easier classification job, which allows them to perform slightly better. For further research, it would be interesting to see whether the performance of a given model can be plotted as a function of the number of labels, with the labels of the smallest clusters being removed at every iteration.

One result that differed severely was the off-the-shelf performance of ResNet50 for artist classification. Because the number of labels is fairly similar, we could not determine why such a difference occurred.

The new contributions of our research also provided some valuable insights. The first is that AlexNet performs worse than VGG19 and ResNet50 for this application of Transfer Learning. On every occasion its accuracy was lower than that of the other models.

The second contribution, of applying this technique to the iMet dataset, shows that this learning method can be translated to different art collections and different label types.


All in all, this reproduction shows that the study had valid results. It also shows that the study can easily be extended with further interesting research. Last but not least, it was very educational for us and gave us more insight into the field of Deep Learning.


A large thank you to our Professor Jan van Gemert and our TA Yapkan Choi who assisted us in the process!


  • Sabatelli, M., Kestemont, M., Daelemans, W., & Geurts, P. (2018). Deep Transfer Learning for Art Classification Problems. ECCV Workshops.