Comic Strip Generation

Source: Deep Learning on Medium

The validation loss was much higher than the training loss. However, the loss is a debatable measure of how good the generated text was: text that is semantically similar to the ground truth, but not token-for-token identical, still incurs a high loss.
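To illustrate why token-level loss punishes near-misses, here is a toy sketch (the vocabulary and probabilities are invented for illustration, not taken from our model):

```python
import math

# Hypothetical model distribution over a tiny vocabulary for the next token.
# The model puts most of its mass on "pasta", a plausible stand-in for the
# ground-truth token "lasagna".
probs = {"lasagna": 0.10, "pasta": 0.85, "cat": 0.05}
target = "lasagna"

# Cross-entropy only looks at the probability assigned to the exact
# ground-truth token, so a semantically close guess still scores badly.
loss = -math.log(probs[target])
print(round(loss, 2))  # high loss (~2.3) despite a reasonable prediction
```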

The code for the decoder and training was modified from here and the code for text encoding and overall architecture was taken from here.

Evaluation and Results

We evaluated our pix2pix-generated results with the Fréchet Inception Distance (FID) [19]. FID treats the ground-truth and the generated images as samples from two different distributions and approximates each with a multivariate Gaussian; the score is then computed from the difference between the estimated means and covariance matrices.
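As a sketch of the computation (not the exact implementation we used), the Fréchet distance between two Gaussians N(μ₁, Σ₁) and N(μ₂, Σ₂) is ‖μ₁ − μ₂‖² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). A minimal NumPy version, assuming the means and covariances have already been estimated from Inception features (the function name is our own):

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 @ sigma2)^(1/2)) via eigenvalues; valid for PSD covariances.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_covmean = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * tr_covmean)

# Identical distributions give a distance of 0; shifting one mean raises it.
mu, sigma = np.zeros(2), np.eye(2)
print(frechet_distance(mu, sigma, mu, sigma))                # ≈ 0.0
print(frechet_distance(mu, sigma, mu + [1.0, 0.0], sigma))   # ≈ 1.0
```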

A lower FID score is better, as it means the generated images are closer to the ground truth. When we saw that the FID for our test results was fairly high, we also calculated the FID between the first two panels and the third panel of the original strips, and found those values to be quite high as well, implying that there is a lot of diversity/variance in the original dataset itself.

For text generation, we initially used metrics such as perplexity and accuracy, but later decided to judge the text qualitatively.
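For reference, perplexity is just the exponential of the average per-token negative log-likelihood; a minimal sketch:

```python
import math

def perplexity(token_log_probs):
    # Perplexity = exp(average negative log-likelihood per token).
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model that assigns uniform probability 1/4 to every token in a
# 4-word vocabulary has perplexity 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))  # ≈ 4.0
```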


All our experiments were implemented in PyTorch and, for the most part, trained on either Google Cloud Platform (GCP) or Google Colab.

Conclusion & Future Scope

  • Our best results were obtained by using the images of the first two panels as context by passing them to a conditional GAN, i.e., pix2pix, to generate the third panel.
  • For text generation, both the pre-trained LSTM and GPT-2, each fine-tuned on our Garfield-only text dataset, produced good results.
  • We explored joint embedding but did not achieve results better than those of our individual image and text experiments. However, we believe the future scope of this project lies in this direction.


References
[1] Iyyer, M., Manjunatha, V., Guha, A., Vyas, Y., Boyd-Graber, J., Daume, H., & Davis, L. S. (2017). The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7186–7195).

[2] Ronneberger, O., Fischer, P., Brox, T. (2015, October). U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234–241). Springer, Cham.

[3] Radford, A., Metz, L. and Chintala, S., 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

[4] Kingma, D.P. and Welling, M., 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[5] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A. and Bengio, Y., 2014. Generative adversarial nets. In Advances in neural information processing systems (pp. 2672–2680).

[6] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A. and Chen, X., 2016. Improved techniques for training GANs. In Advances in neural information processing systems (pp. 2234–2242).

[7] Chen, Xi, et al., 2016. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems.

[8] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H. and Bengio, Y., 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

[9] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. In Advances in neural information processing systems (pp. 5998–6008).

[10] Wang, A. and Cho, K., 2019. BERT has a mouth, and it must speak: BERT as a Markov random field language model. arXiv preprint arXiv:1902.04094.


[12] Conneau, A., Kiela, D., Schwenk, H., Barrault, L. and Bordes, A., 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

[13] Devlin, J., Chang, M.W., Lee, K. and Toutanova, K., 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[14] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).

[15] He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

[16] Liu, Z., Luo, P., Wang, X. and Tang, X., 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision (pp. 3730–3738).

[17] Mescheder, L., Geiger, A. and Nowozin, S., 2018. Which training methods for GANs do actually converge? arXiv preprint arXiv:1801.04406.


[19] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B. and Hochreiter, S., 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (pp. 6626–6637).

[20] Ronneberger, O., Fischer, P. and Brox, T., 2015, October. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention (pp. 234–241). Springer, Cham.