Everyone’s a critic

Original article was published on Artificial Intelligence on Medium

I mean it’s right, but…

Previously, we looked at image captioning and concluded that whilst sometimes accurate, captioning lacked imagination. To be fair, that’s because it wasn’t trained to have imagination; it was trained to generate the right caption. It’s like criticising a shovel for not composing music.

So can we train a model to be a bit more creative in its answers using the same architecture? Taking the Show, Attend and Tell captioning architecture as a base, let’s see if we can make something a bit more descriptive. To do so, we need a source of descriptive writing about images to learn from.

The basic captioning process. Easy, right?

A handy place to gather linked image/text data is Reddit, and tools like PRAW (the Python Reddit API Wrapper) allow for easy scraping. To improve the odds that the comments are actually talking about the image and aren’t just noise, we’ll use various art subreddits, and build ourselves a little art critic.
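As a rough idea of what that scraping step can look like, here is a minimal sketch rather than the code actually used for this article. The helper names (`collect_pairs`, `scrape_subreddit`), the `min_words` cut-off, the choice of top posts, and the placeholder credentials are all assumptions made for illustration; the PRAW calls themselves (`praw.Reddit`, `subreddit(...).top`, `comments.replace_more`, `comments.list`) are the library's standard ones.

```python
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def collect_pairs(submissions, min_words=3):
    """Return (image_url, comment_text) pairs from PRAW-style submissions.

    min_words is a hypothetical noise filter, not something from the article.
    """
    pairs = []
    for submission in submissions:
        # Keep only posts that link directly to an image file.
        if not submission.url.lower().endswith(IMAGE_EXTENSIONS):
            continue
        # Drop the "load more comments" placeholders so only real comments remain.
        submission.comments.replace_more(limit=0)
        for comment in submission.comments.list():
            text = comment.body.strip()
            # Skip comments that were deleted or removed.
            if text in ("[deleted]", "[removed]"):
                continue
            # Skip very short comments, which rarely describe the image.
            if len(text.split()) < min_words:
                continue
            pairs.append((submission.url, text))
    return pairs


def scrape_subreddit(name, limit=100):
    """Fetch top posts from one subreddit and pair each image with its comments."""
    import praw  # imported here so collect_pairs works without PRAW installed

    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",          # placeholder credentials
        client_secret="YOUR_CLIENT_SECRET",  # placeholder credentials
        user_agent="art-critic-dataset-sketch",
    )
    return collect_pairs(reddit.subreddit(name).top(limit=limit))
```

Calling `scrape_subreddit("Art")` with real credentials would return a list of image URL and comment pairs, which can then be downloaded and tokenised just like a captioning dataset.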

As we’re building our dataset from internet comments, the quality is pretty variable. Comments range from the useful “I prefer your style over something that is hyper realistic. I think this is a happy mid point between cartoony stylized and realistic” to “i like it a lot but where the hell is where are my at”, so we shouldn’t expect Adrian Searle-quality criticism, but let’s see what we can get.

Eggcellent Work

By u/EC

Initially I thought I had created something wonderful when the training image on the left was greeted with:

“Can I offer you an egg in this trying time?”

But it turns out that it was just repeating verbatim a comment in the training set, as well as an It’s Always Sunny meme. Oh well.

This ably demonstrates our problem with reusing captioning code: our network is using sparse categorical cross-entropy as its loss function. This means that it is rewarded for getting as close as possible to the original training text, word for word, which works for captions but isn’t great for descriptive criticism. One way to improve our model would be to assess not the accuracy of the text, but perhaps the readability instead. We’ll explore this at a later date.
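To see why this loss rewards parroting, here is a small plain-Python sketch of the same quantity: the average negative log-probability the model assigns to each reference word. It is an illustration only, not the training code; the four-word vocabulary and the probabilities are made up.

```python
import math


def sparse_categorical_cross_entropy(target_ids, predicted_probs):
    """Mean negative log-probability given to the reference token at each step."""
    losses = [
        -math.log(step_probs[target])
        for target, step_probs in zip(target_ids, predicted_probs)
    ]
    return sum(losses) / len(losses)


# Toy vocabulary: 0="a", 1="lovely", 2="nice", 3="painting" (made up).
reference = [0, 1, 3]  # the training comment: "a lovely painting"

# A model that reproduces the training comment almost word for word.
parrot = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

# A model that says "a nice painting": just as sensible, but one word differs.
paraphrase = [
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.01, 0.97, 0.01],
    [0.01, 0.01, 0.01, 0.97],
]

parrot_loss = sparse_categorical_cross_entropy(reference, parrot)
paraphrase_loss = sparse_categorical_cross_entropy(reference, paraphrase)
print(f"parrot: {parrot_loss:.3f}  paraphrase: {paraphrase_loss:.3f}")
```

The paraphrase is penalised heavily for a single perfectly reasonable word swap, so the cheapest way for the network to lower its loss is to memorise training comments, eggs and all.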

For now, let’s see how well it fares on new images:

Well it is.

Just like Microsoft’s CaptionBot, it’s technically right, but I can’t really get excited about it. However, on other examples, we do begin to get something more interesting: