Extended Story Visualization


Experiments

Fixed Length Generation

In our first experiment, we implemented a DistilBERT story encoder to verify whether it would provide enough context to the generator for the output images to make sense. We noticed that as training progressed, the generated images became clearer and more coherent. In most of the generated sequences, the DistilBERT embeddings passed enough context information to accurately identify the characters present in the input story as well as their positions with respect to each other and to other objects in the scene. One such sequence is described below, where the model generates all the characters in their correct positions but fails to detect that a bed should be part of the generated scene.

P2iSeq-generated visual stories matching the context distribution of their ground-truth counterparts.

Additionally, the color distribution of the scene and the background environment (outdoors/indoors) are generally well illustrated by the model, as seen in the examples below.
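
For readers who want to see how such an encoder can be wired up, the snippet below is a minimal sketch of embedding story sentences with a pretrained DistilBERT model via the Hugging Face transformers library; the checkpoint name, mean pooling, and example sentences are illustrative assumptions rather than our exact configuration.

# Minimal sketch: per-sentence story embeddings with DistilBERT.
# Checkpoint, pooling, and example sentences are illustrative assumptions.
import torch
from transformers import DistilBertTokenizer, DistilBertModel

tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
encoder = DistilBertModel.from_pretrained("distilbert-base-uncased")

def encode_story(sentences):
    """Return one context vector per sentence, shape (num_sentences, 768)."""
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state         # (batch, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1).float()     # ignore padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)       # mean-pooled embeddings

story = ["Pororo and Crong are playing outside in the snow.",
         "Crong throws a big snowball straight at Pororo."]
story_embeddings = encode_story(story)   # conditioning input for the image generator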

Summarized Generation

Once we achieved considerably good results in the fixed-length image generation experiments compared to the StoryGAN baseline, we moved on to the variable-length approach proposed with the Extractive Text Summarizer module. This added a higher level of complexity to the optimization problem: GAN training became more unstable, and the discriminator's job became easier because it was harder for the generator to create an interpolated image distribution covering the full story information with fewer images.

Summarized image generation example with sentence selection.

As shown in the figure above, the model was capable of identifying the underlying distribution both in terms of background (e.g., the framed sequence shows a snowy background in the first image even though none of the selected sentences mentions it) and in terms of character selection, since the summarizer chose the features illustrating the most relevant characters in the story; however, image resolution suffered in attaining this objective.
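
As a rough illustration of the summarization step, the sketch below runs the BERT-based extractive summarizer from [6] (the bert-extractive-summarizer package) to keep a fraction of the story sentences before image generation; the summarization ratio and the example story are illustrative assumptions, not our training setup.

# Minimal sketch: selecting a subset of story sentences with the
# bert-extractive-summarizer package; the ratio value is illustrative.
from summarizer import Summarizer

story = ("Pororo and Crong are playing outside in the snow. "
         "Crong throws a big snowball straight at Pororo. "
         "Pororo falls down into the snow and starts laughing loudly. "
         "Poby joins them and together they build a huge snowman.")

model = Summarizer()
# Keep roughly half of the sentences; only the selected ones condition the
# generator, which must then cover the full story with fewer images.
summary = model(story, ratio=0.5)
print(summary)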

Conclusions

StoryGAN's approach, documented in [1], provides a reliable framework for the story visualization problem, capable of identifying key features from a given written story. And while an extractive summarizer fetches the relevant sentences to tell the story visually, the approach is still unable to reach reliable image quality. It is capable of learning the underlying context distribution of the Pororo dataset but needs extra information to derive the image space that lies between consecutive frames of the full image sequence. Throughout our experiments, the generator proved to be a slower learner than the story discriminator, since with summarization it had to infer the image-to-image distribution while interpolating, i.e., trying to create images that convey the same amount of information with fewer images.

Another difficult parameter to tune during training was the summarization ratio of the extractive module, i.e., the fraction of sentences it keeps. Hence, we believe that adding a soft relaxation to the sentence selection process would be an interesting way to help the model identify the right number of sentences to generate at training time, as it would allow the error from generating the wrong number of images for a given story to be backpropagated.
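
A minimal sketch of what such a soft relaxation could look like is shown below, assuming per-sentence embeddings as input: each sentence gets a keep/drop score, and a Gumbel-Softmax relaxation (available in PyTorch) keeps the selection differentiable so the generation loss can flow back into the selector. The scorer, temperature, and embedding size are hypothetical placeholders, not part of our implementation.

# Sketch of soft-relaxed sentence selection with Gumbel-Softmax (PyTorch).
# Scorer, temperature, and embedding size are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftSentenceSelector(nn.Module):
    def __init__(self, embed_dim=768, tau=0.5):
        super().__init__()
        self.scorer = nn.Linear(embed_dim, 1)   # per-sentence keep/drop logit
        self.tau = tau

    def forward(self, sentence_embeddings):
        # sentence_embeddings: (num_sentences, embed_dim)
        logits = self.scorer(sentence_embeddings)        # (num_sentences, 1)
        logits = torch.cat([logits, -logits], dim=-1)    # [keep, drop] logits
        # Differentiable, near-discrete keep probabilities.
        keep = F.gumbel_softmax(logits, tau=self.tau, hard=False)[:, 0]
        # Soft-selected embeddings: gradients flow through `keep`, so generating
        # the wrong number of images for a story can be penalized end to end.
        return sentence_embeddings * keep.unsqueeze(-1), keep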

Related Work

Several other tasks related to story visualization include retrieving story images from a pre-collected training set instead of generating them [15] and a “cut and paste” technique for cartoon generation [8]. The opposite of story visualization is visual storytelling, where the output is a paragraph describing an input sequence of images; text generation models or reinforcement learning are typically used for this task [11, 14, 10].

[1] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, and J. Gao. A sequential conditional GAN for story visualization. CVPR, 2019.
[2] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. NIPS, 2014.
[3] V. Sanh, L. Debut, J. Chaumond, and T. Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper, and lighter. NIPS, 2019.
[4] K. Kim, M. Heo, S. Choi, and B. Zhang. PororoQA: A cartoon video series dataset for story understanding. NIPS, 2016.
[5] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. ICML, 2016.
[6] D. Miller. Leveraging BERT for extractive text summarization on lectures. CoRR, 2019.
[7] H. Cai, C. Bai, Y.-W. Tai, and C.-K. Tang. Deep video generation, prediction, and completion of human action sequences. arXiv preprint arXiv:1711.08682, 2018.
[8] T. Gupta, D. Schwenk, A. Farhadi, D. Hoiem, and A. Kembhavi. Imagine this! scripts to compositions to videos. ECCV, 2018.
[9] J. He, A. Lehrmann, J. Marino, G. Mori, and L. Sigal. Probabilistic video generation using holistic attribute control. arXiv preprint arXiv:1803.08085, 2018.
[10] Q. Huang, Z. Gan, A. Celikyilmaz, D. Wu, J. Wang, and X. He. Hierarchically structured reinforcement learning for topically coherent visual story generation. arXiv preprint arXiv:1805.08191, 2018.
[11] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra, et al. Visual storytelling. In NAACL, 2016.
[12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
[13] Y. Li, M. R. Min, D. Shen, D. Carlson, and L. Carin. Video generation from text. AAAI, 2018.
[14] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing. Recurrent topic-transition gan for visual paragraph generation. arXiv preprint arXiv:1703.07022, 2017.
[15] H. Ravi, L. Wang, C. Muniz, L. Sigal, D. Metaxas, and M. Kapadia. Show me a story: Towards coherent neural story illustration. In CVPR, 2018.
[16] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. CVPR, 2018.
[17] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[18] X. Yan, J. Yang, K. Sohn, and H. Lee. Attribute2image: Conditional image generation from visual attributes. In ECCV, 2016.
[19] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. ICCV, 2017.
[20] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio. Chatpainter: Improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216, 2018.
[21] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
[22] D. Miller. Leveraging BERT for extractive text summarization on lectures. arXiv preprint arXiv:1906.04165, 2019.
[23] K. Clark, C. Manning. Improving Coreference Resolution by Learning Entity-Level Distributed Representations. In ACL 2016.