A Toddler Learns to Speak

On language and its equivalents

“…from the demonstration of ML capabilities through images we can infer capabilities in other modalities and potential increases of sophistication and higher dimensional scales. By modality I mean the categorization of ML environment of operation such as the differences between images, video, speech, text, language, sound, music, market actions, or even eventually higher order considerations such as writing style, reasoning, emotion, or personality traits. If there is a noun to describe something, expect that we will be able to classify and depict it. If there is a defining feature or characteristic of some data, expect we will be able to extract it even if we can’t necessarily describe it in our limited vocabulary or three dimensional imagination. If there is an adjective for a trait, expect we will find a way to measure or manipulate along that axis (if not yet or yet with very high fidelity then eventually). … It will even be possible to translate between different modalities … just like translating between languages is done today.”

From the Diaries of John Henry

This passage from a prior essay had a fair bit of speculation about the eventual capabilities of a type of machine learning fueled interface. It was one of the (several) themes of that post that we can infer higher order possibilities of a system from representations of reduced dimension — much as one might better intuit some dataset after a principal component analysis, for instance. Anyone who has even loosely followed the developments of this field in recent times will have come across a whole series of demonstration images, ranging from the psychedelia of a deep dream to GAN generated celebrity faces. While each of these demonstrations may initially surprise, my experience is that this kind of initial reaction quickly fades as an observer becomes acclimated to a new normal, a world with added possibilities. Easy to overlook in the repetition of this process is the velocity, or even acceleration, of these instances of, for lack of a better word, induced wonder.

Michelangelo — Sketches of the Virgin, the Christ Child Reclining on a Cushion, and Other Infants

I am somewhat fond of the excerpted passage above; it came near the culmination of a week-long writing binge, and I didn’t even know it was coming until it was there on the page in front of me. I think part of the appeal is that it is the closest I’ve ever come to writing science fiction extrapolated from modern technology — in the ballpark of what could fuel a short story of the type found at the conclusion of one of Jack Clark’s excellent Import AI email newsletters, for instance. I’d like to use this essay to expand on this same passage, exploring potential implications and especially the limits of what is implied. In the process I’ll engage in a fair bit of speculation and ask some questions that I won’t necessarily know the answer to. I’ll likely also retread some of the same material from the John Henry post, so I apologize in advance for any repetition. The exercise is primarily for my own benefit (the writing helps me organize my thoughts), but if the process triggers a useful thought or eventual inspiration in some interested reader, all the better.

“The main point of essay writing, when done right, is the new ideas you have while doing it. A real essay, as the name implies, is dynamic: you don’t know what you’re going to write when you start. It will be about whatever you discover in the course of writing it.” — Paul Graham

“…from the demonstration of ML capabilities through images we can infer capabilities in other modalities and potential increases of sophistication and higher dimensional scales.”

From the Diaries of John Henry

Not knowing how to ease into what will likely be a somewhat technical discussion, I’m just going to jump right in the deep end and take it from there.

An image can be represented for evaluation by a machine learning algorithm as a collection of five dimensional vectors, where two of those dimensions capture the “pixel” locations on the x and y axes, and the other three approximate the color of a pixel using intensities of red, green, and blue — similar to a liquid crystal display pixel for anyone that has ever sat too close to their television. Although such a matrix of pixel vectors carries all of the information of an image, interpreting it requires processing by a feedforward neural network, which for this application is generally a machine learning architecture known as a convolutional neural network. A convolutional network is a type of machine learning algorithm trained using supervised learning. Each layer of the convolutional network is a kind of pattern matching filter, and as an image is fed further through the network the number of filters applied, along with the equivalent window size under evaluation, gets progressively larger. Early filters may detect generic micro features such as edges, a subsequent layer might categorize those edges into types of shapes, and a later layer might categorize those shapes into more sophisticated or domain specific categories — as a farcical example, consider a late stage filter that activates for everything in an image that’s lame. Bear in mind that the ordering and design of these filters are not programmed by human intent but developed algorithmically by the training process. As an image progresses through the layers, its representation (under evaluation by an increasing number of filters) is progressively compressed through the application of max-pooling, which reduces each evaluated window to a single point: the maximum of the filter outputs in that window. This type of compression is common in machine learning architectures, as it forces a network to find higher order representations and in the process denoises the signal.
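To make the pooling step concrete, here is a minimal sketch of 2x2 max-pooling in plain Python. The 4x4 feature map values are made up for illustration; in a real network they would be the outputs of a learned filter.

```python
def max_pool_2x2(feature_map):
    """Reduce each non-overlapping 2x2 window to its maximum value."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows, 2):
        pooled_row = []
        for c in range(0, cols, 2):
            # Collect the four filter outputs in this window and keep the max.
            window = [feature_map[r + dr][c + dc]
                      for dr in (0, 1) for dc in (0, 1)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 6, 8],
]
print(max_pool_2x2(feature_map))  # [[6, 4], [7, 9]]
```

Note how the 4x4 grid compresses to 2x2: each output keeps only the strongest filter response in its window, which is exactly the denoising-by-compression effect described above.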

Max-pooling demonstration via Wikipedia

The interpretation of an image thus condenses the representation of the pixels into increasingly abstracted representations as the convolutional transformation proceeds deeper into the hidden layers, alternating between filters of increasing sophistication and representational rescaling via max-pooling. The output of the convolutional network’s evaluation could serve a range of functions, such as inputs to the operation of a self driving car, but I’ll focus here on one specific type of output: the translation of an image to a textual description. In short, even in current practice it is possible for a convolutional network to categorize and describe features of an image, in the process translating from the modality of images to that of text.

Image via A Review on Deep Learning Techniques Applied to Semantic Segmentation by A. Garcia-Garcia, S. Orts-Escolano, S.O. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez [link]

As the theme of this essay centers around language, let’s now contrast image evaluations with methods for interpreting text. In vectorized representations of language, the equivalent of an image’s pixel could be the letters of the alphabet that make up our words. We actually have a choice in the level of granularity at which to evaluate language: we can represent a text for our algorithm’s input as coded individual letters, or alternatively develop representations at the granularity of unique words, or even go up another layer to statements / word groupings. As the granularity increases, the range of values for a representation grows quickly, but in parallel the depth of the early network layers required to evaluate it goes down. For example, feeding the network a series of letters means each “pixel” has a range of 26 values (or more if you include punctuation or other symbols), while a dictionary of words could have anywhere from 10,000 values for a narrow domain to 100,000+ for generic language (the current Oxford dictionary contains 171,476 words, which excludes names and other proper nouns). However, a network evaluating at the granularity of letter pixels would need to be much deeper (many more early hidden layers), since it would have to capture rules of spelling and word roots and the correspondingly huge range of complexity in translating from letters to words. While I expect deep learning techniques are capable in this range, in practice it is more common to evaluate at the granularity of individual words (at least for known words; I expect there may be other methods available for interpreting novel words). The representation of words for evaluation can initially be captured in a one hot encoding, which for the example of a 10,000 word vocabulary would be a sparse vector with 9,999 values of 0 and a single 1 in the slot corresponding to the word in question.
An ‘embedding matrix’ can then be developed via a neural network, using the surrounding context of target words in a training text corpus, to facilitate translation of one hot encodings into dense vectors of reduced dimensionality (by dense meaning most of the values are non-zero). For the current example, the 10,000 unit one hot encoding vector could potentially be embedded into a 300 unit dense vector representation. What’s neat about these dense vector encodings is that it is even possible to perform vector addition and subtraction to translate from one definition to another — for example, one might take the densely encoded vector for ‘king’, subtract the vector for ‘man’, and add the vector for ‘woman’ to arrive at the vector for ‘queen’.
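Here is a toy illustration of that king − man + woman ≈ queen arithmetic in plain Python. The 3-dimensional “embeddings” below are invented for demonstration only; real learned embeddings (e.g. from word2vec) have hundreds of dimensions and come from training on a text corpus.

```python
import math

# Made-up toy embeddings; real ones are learned from context in a corpus.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.2, 0.9, 0.1],
    "woman": [0.2, 0.2, 0.8],
}

def vec_combine(a, b, c):
    """Compute a - b + c element-wise."""
    return [ai - bi + ci for ai, bi, ci in zip(a, b, c)]

def nearest_word(vec, skip):
    """Return the vocabulary word closest (Euclidean) to vec, excluding `skip`."""
    def dist(word):
        return math.dist(vec, embeddings[word])
    return min((w for w in embeddings if w not in skip), key=dist)

analogy = vec_combine(embeddings["king"], embeddings["man"], embeddings["woman"])
print(nearest_word(analogy, skip={"king", "man", "woman"}))  # queen
```

The nearest-neighbor lookup at the end matters: in a real embedding space the arithmetic rarely lands exactly on a word’s vector, so the convention is to report the closest vocabulary word.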

Word2vec vector demonstration image via Tensorflow

Having developed encodings to represent individual words, the series of these vectors can then be fed into a recurrent neural network (a neural net with a kind of memory between steps of progression, which is useful for evaluating patterns in time series data — or in our case text series, where the order of words can impact a statement’s interpretation), with the network serving to evaluate and interpret the collective meaning of a passage. Common outputs of a recurrent network evaluation include predictions for subsequent time steps (such as you may find with the autocomplete option when composing an email in Gmail) or language translation services (such as translating a passage from French to English).
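As a toy illustration of that between-step memory, here is a single-unit recurrent cell in plain Python. The weights are fixed made-up values rather than trained ones, and the “word vectors” are reduced to single scalars; the point is only that the final state depends on the order of the inputs.

```python
import math

def rnn_summary(word_vectors, w_in=0.5, w_rec=0.8):
    """Fold a sequence of scalar 'word vectors' into one hidden state."""
    hidden = 0.0
    for x in word_vectors:
        # Each step mixes the current input with the carried-over memory.
        hidden = math.tanh(w_in * x + w_rec * hidden)
    return hidden

# Reordering the sequence changes the summary -- the network is order-aware,
# unlike a simple sum or average of the word vectors.
print(rnn_summary([1.0, 0.0, -1.0]) != rnn_summary([-1.0, 0.0, 1.0]))  # True
```

A bag-of-words average would score both orderings identically; the recurrence is what lets “dog bites man” and “man bites dog” come out different.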

Never did two men make the same judgment of the same thing; and ’tis impossible to find two opinions exactly alike, not only in several men, but in the same man, at diverse hours.

– Michel de Montaigne — Of Experience

Having looked at the evaluation of both images and text, let’s pause for a second and contrast the two further. Some of the major differences between the two approaches included the input data matrix dimensions and the range of values for the numerically coded data fed into our neural network, as well as the class of neural network architectures then applied to evaluate it (i.e. convolutional network vs. recurrent neural network). On the other hand, of particular note should be that in both cases we were able to output an evaluation in the form of pure text. This shared modality of output implies that there exists an equivalency between the two input forms, if not in data structure then at least in abstraction. We’ve already seen that we can translate between different languages of text, and we’ve even seen that we can translate from images to text — now how about the other direction: is it possible to translate from text to computer generated images? It turns out that it is, using a technique known as Generative Adversarial Networks (aka GAN). In a GAN the goal is to generate new data that is lifelike and representative of the properties found in our training set. For example, one may wish for the machine to generate pictures of imaginary birds based on a textual description. The generation is achieved by pairing a generative algorithm with a classifier algorithm that serves as a kind of reinforcement teacher. As the generator attempts life-like creations, the classifier reinforces those aspects that can pass for real data and rejects those that do not, so the two have a kind of adversarial back and forth competition. Through the interaction of the two, the generated data comes much closer to the representative training properties than what we could achieve otherwise. Using the realistic images from a GAN we can thus translate from a text description to a whole new image.
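To illustrate the adversarial back and forth — and only that — here is a drastically simplified, gradient-free caricature in plain Python. Real GANs train two neural networks against each other with gradient descent; in this sketch the “real data” is a single value, the generator is a lone parameter, and the discriminator is just a running estimate of what real data looks like.

```python
REAL_VALUE = 5.0  # stand-in for the real training data distribution

def train_gan(steps=200, lr=0.1):
    theta = 0.0        # generator's single parameter (its fake sample)
    d_estimate = 0.0   # discriminator's current notion of "real"
    for _ in range(steps):
        # Discriminator step: refine its estimate of the real data.
        d_estimate += lr * (REAL_VALUE - d_estimate)
        # Generator step: nudge the fake sample toward whatever the
        # discriminator currently accepts as real.
        theta += lr * (d_estimate - theta)
    return theta

print(round(train_gan(), 2))  # converges toward 5.0
```

Even in this cartoon, the key dynamic survives: the generator never sees the real data directly, only the discriminator’s evolving judgment of it, yet its output converges toward the real distribution.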

via “Generative Adversarial Text to Image Synthesis” by Reed, Akata, Yan, Logeswaran, Schiele, and Lee — Link

“By modality I mean the categorization of ML environment of operation such as the differences between images, video, speech, text, language, sound, music, market actions, or even eventually higher order considerations such as writing style, reasoning, emotion, or personality traits. … It will even be possible to translate between different modalities … just like translating between languages is done today.”

From the Diaries of John Henry

Let’s consider what is meant here by increases of sophistication or higher dimensional scales. In this John Henry quote I gave several examples of potential modalities. So far we have looked at the modalities of text and images. Consider now a 3D image as may be produced by a LIDAR: although lacking a color gamut, it allows us to incorporate a third spatial dimension (the z axis) into our representation (and I suspect that when paired with a camera it could capture the full color range of measurement as well). Alternatively, consider what happens when we incorporate the time dimension, allowing for moving images. In each case the data representation of our numerically coded input is simply expanded to facilitate an extra dimension in the input matrix — in the case of 3D images an extra coordinate for location along the z axis, in the case of video a series of independent images per step along the time axis. In many cases modalities of increased sophistication can thus be captured by simply adding extra dimensions to our representations. [*author’s note: the rest of this paragraph is a bit of a tangent and contains a fair bit of speculation so can be skipped without losing much of the narrative: Now it should be noted that although in both examples we are simply adding a dimension, I believe (although some of this is speculation) that traditionally they are treated differently, with the third spatial dimension accommodated by increasing the input dimensions of a convolutional net, and the added time dimension treated by feeding each time step’s respective image through a convolutional net and then feeding the resulting output series through a recurrent network. Now again this part of the discussion is partly speculation, but I want to highlight an interesting discussion that took place in Francois Chollet’s Deep Learning With Python. Although the specific passage escapes me, I recall the author describing an alternative to recurrent neural networks for one dimensional time series data, specifically suggesting the use of one dimensional convolutional networks in their place, with a key advantage of the approach being that the convolutional network is easier to train. Now I’m left to wonder: if convolutional networks can handle time series data, why can’t we simply incorporate the time dimension into the full convolutional network? Is there some unique feature of the time dimension that requires it to be cordoned off from the others? Anyway, just a little tangent of thought.] Sorry, I suspect some of this has potential to put one to sleep.
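The shape bookkeeping described above can be sketched in a few lines of plain Python: each richer modality just adds one axis to the input tensor. The specific sizes below (32x32 images, 16 depth slices, 8 frames) are arbitrary toy values.

```python
def shape(nested):
    """Return the shape of a uniformly nested list, like a tensor shape."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

def zeros(*dims):
    """Build a nested list of zeros with the given dimensions."""
    if len(dims) == 1:
        return [0.0] * dims[0]
    return [zeros(*dims[1:]) for _ in range(dims[0])]

image = zeros(32, 32, 3)     # (x, y, RGB): a flat color image
volume = zeros(16, 32, 32)   # LIDAR-style (z, x, y): extra spatial axis, no color
video = zeros(8, 32, 32, 3)  # (time, x, y, RGB): one image per time step

print(shape(image), shape(volume), shape(video))
# (32, 32, 3) (16, 32, 32) (8, 32, 32, 3)
```

Whether the network then treats the new axis convolutionally (as with depth) or recurrently (as is traditional with time) is a separate architectural choice, as discussed in the tangent above; the input representation itself changes in the same mechanical way.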

Eros Sleeping, Greek 3rd — 2nd century BC

Now wake up and ask yourself, is there any modality that can’t be similarly abstracted to a textual description, and thus demonstrate potential for translation?

excerpt from Yoshua Bengio NIPS 2015 Deep Learning Tutorial presentation — Link

“If there is a noun to describe something, expect that we will be able to classify and depict it.”

From the Diaries of John Henry

The use of a neural network for classification purposes is fairly straightforward, and can be accomplished with relatively simple architectures such as logistic regression. A common demonstration in introductory tutorials is a sentiment classifier — say a door camera that only unlocks for happy smiling faces. Such a binary logistic network’s sigmoid output activation can even be extended to a full dictionary of identifiers (after training) using a softmax function. Now are there any classifications that are outside the reach of modern tech? If you ask a computer “Isn’t it ironic?” will it have an answer? Will a computer ever be able to tell when you are being sarcastic? I’m going to get a little abstract here, but consider an artist painting a collection of carefree children running through a field. A thousand different artists could paint the same scene a thousand different ways, each in a style unique to that artist. Now if you were to extract an ironic tone to overlay on another arbitrary passage of text, could that be considered isomorphic to overlaying one artist’s style onto the painting of another? After all, haven’t we already demonstrated that images and text are just two different translations of the same vectorized language?
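The sigmoid-to-softmax extension mentioned above can be shown in a few lines of plain Python. The raw scores would come from the final layer of a trained network; here they are made-up values, and the three sentiment labels are a hypothetical dictionary of identifiers.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the max before exponentiating avoids numerical overflow.
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["happy", "neutral", "sad"]   # hypothetical identifier dictionary
scores = [2.0, 1.0, 0.1]               # made-up final-layer outputs
probs = softmax(scores)

print(labels[probs.index(max(probs))])  # happy
```

Where a sigmoid gives one yes/no probability, softmax spreads a full probability distribution over the whole label dictionary, which is what lets the same architecture scale from a smile detector to a many-class classifier.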

“If there is a defining feature or characteristic of some data, expect we will be able to extract it even if we can’t necessarily describe it in our limited vocabulary or three dimensional imagination. If there is an adjective for a trait, expect we will find a way to measure or manipulate along that axis (if not yet or yet with very high fidelity then eventually).”

From the Diaries of John Henry

Image source: Francois Chollet’s Deep Learning With Python.

The thing about these types of higher level abstractions of tone or artist style is that, when all is said and done, they are just a single added dimension in an output space. I expect that through techniques such as variational autoencoders we will develop the ability to turn a knob along the axis of irony or any other arbitrary trait, just like the above image manipulates smiles and frowns. A brain’s imagination is somewhat handicapped by our experience in a four dimensional world: although mathematicians can perform linear algebra on a matrix of arbitrary dimensions, our ability to visualize such spaces is generally weak. Computers have no such limitations; they see little difference between a 3D sphere and a high dimensional hyper-sphere, and in general are only eventually constrained by the curse of dimensionality. Now consider again the isomorphic equivalency of images and text, and that text can be embedded into a vectorized representation upon which we can perform algebraic manipulations. Does that mean we can perform equivalent manipulations on images? You should already know the answer.
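The “knob” idea can be sketched directly: add a scaled trait direction to a latent code, then (in a real system) decode the result back into an image. Both the 4-dimensional latent code and the “smile axis” below are invented for illustration; in practice both would be learned by something like a variational autoencoder.

```python
def apply_trait(latent, trait_axis, knob):
    """Move a latent code along a trait direction by `knob` units."""
    return [l + knob * t for l, t in zip(latent, trait_axis)]

neutral_face = [0.1, 0.4, -0.2, 0.7]  # hypothetical latent code for a face
smile_axis = [0.0, 0.5, 0.0, -0.3]    # hypothetical learned "smile" direction

slight_smile = apply_trait(neutral_face, smile_axis, 0.5)
big_smile = apply_trait(neutral_face, smile_axis, 2.0)

print([round(v, 2) for v in slight_smile])  # [0.1, 0.65, -0.2, 0.55]
```

The knob value is continuous, which is the point: the same arithmetic that jumped discretely from king to queen can also sweep smoothly along an axis, from a hint of a smile to a grin.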

via “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” by Radford, Metz, and Chintala — Link

We previously asked whether there are any classifications that are outside the reach of modern tech. Let’s revisit that and extend it to these kinds of vectorized manipulations (taking the scenic route to get there). In contract negotiations, my experience is that (although not ideal) it is not uncommon for parties to incorporate a “reasonable” standard into a clause when codifying the full intent in detail is too cumbersome or the domain too complex, thus deferring to a subsequent interpretation during implementation for contentious edge cases — depending on a hopefully shared standard of common sense to deal with the unexpected. As we look to apply machine learning to increasingly complex domains, I expect we will need computers to perform a comparable evaluation of reasonableness in cases of incomplete information or novel circumstances missing from a training corpus. We’ll need an artificial form of common sense. If you give a self-driving car the goal of driving “safely”, there will certainly be circumstances found on the road that are missing from your representation. Using our image modality for demonstration, consider the example of pictures with portions intentionally deleted, leaving an incomplete representation. It turns out that it is possible to use the context of the visible surroundings to infer content and thus “inpaint” the missing portions of the image, which I speculate could potentially serve as a solution for dealing with circumstances of incomplete information in other modalities.
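Here is a toy sketch of the inpainting idea in plain Python: fill a deleted pixel using the context of its visible neighbors. Real context encoders use a trained network to infer missing content; simple neighbor averaging stands in for that here, and the 3x3 “picture” is invented for illustration.

```python
MISSING = None  # marker for an intentionally deleted pixel

def inpaint(grid):
    """Replace each missing cell with the mean of its visible 4-neighbors."""
    filled = [row[:] for row in grid]
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if value is MISSING:
                neighbors = [grid[nr][nc]
                             for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                             if 0 <= nr < len(grid) and 0 <= nc < len(row)
                             and grid[nr][nc] is not MISSING]
                filled[r][c] = sum(neighbors) / len(neighbors)
    return filled

picture = [
    [1.0, 2.0, 1.0],
    [2.0, MISSING, 2.0],
    [1.0, 2.0, 1.0],
]
print(inpaint(picture)[1][1])  # 2.0
```

The deleted pixel is recovered purely from its surrounding context, which is the same principle the learned approach scales up: infer the plausible content of a gap from everything visible around it.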

Image via “Context Encoders: Feature Learning by Inpainting” by Pathak, Krähenbühl, Donahue, Darrell, and Efros [link]

In the seminal paper Computing Machinery and Intelligence, Alan Turing described his imitation game as a kind of test for computer intelligence: if a computer can produce textual language in conversation that is indistinguishable to an observer from that produced by a human, then it has passed the test for intelligence. I expect performance in this test requires a kind of reasonableness standard of its own. Now let’s use the trick of modality equivalence to extend Turing’s test to other domains. If a Turing test requires the production of reasonable text, and text can be translated to other modalities like images, sound, video, etc., well then a computer that can pass the Turing test should be capable of manipulating the representation of some environment, in any modality, along any axis that can be described by language, in a fashion that could reasonably be inferred to have been produced by human intelligence. Let’s call this higher standard for proof of intelligence Turing² (read as “Turing Squared”). There, I just named something new. Feeling really creative now.

And there never were, in the world, two opinions alike, no more than two hairs, or two grains: their most universal quality is diversity.

– Michel de Montaigne — Of the Resemblance of Children to Their Fathers

Having addressed everything that was intended for this paper (and then some), let’s now awkwardly transition to a closing passage of science fiction, meant to light-heartedly demonstrate a few of the implications of these concepts.

It was the actor’s union strike that started it. Some absurd conflict over health insurance benefits and retirement accounts snowballed into a full walkout by the celebrity elite, grinding the production industry to a halt. Little did the (not traditionally too bright) actors realize that this form of action could be construed as a breach of prior collectively bargained agreements, a nullification that stripped them of rights associated with likeness reproductions in future productions. With a huge archive of prior films, including unpublished scenes, the studios had all they needed to extract life-like representations, and they immediately put them to work in new films.

The writer’s union was the next to fall. Seeing all of the now unemployed actors, many of the top names refused to work on productions that didn’t hire live talent. With the resulting falloff in poorly written films’ box office receipts, the studios again turned to the machines for development of storylines. This insourcing proceeded until eventually the only remaining creative talent was the team of directors. Instead of writing specific character dialogue or giving stage directions, the directors would merely feed suggestions in the form of desired emotions or audience impact, which would then be directly translated into a polished scene with new dialogue and tone. If a movie scene was found too dry, they would turn the comedy knob a few clicks to the right, causing the generated characters to start acting just a bit wackier for a brief time. If they wanted to tug at the audience’s heartstrings, they would tell the computer to incorporate some feel-good music into the otherwise diverse soundtrack. If they wanted the movie to make more money, they would tell the computer to pander to the desired audience.

This creative process wasn’t entirely without hiccups. As the automation accelerated, the creative directors soon became so ambitious that they didn’t feel the need to review all of the output in detail. Films would be shown to the public with obvious mistakes that would only be caught and cleaned up in the few days after release. The directors would give direction to the computer without consideration for the current limits of algorithmic understanding — computers would interpret metaphors as literal statements, and I bet in more cases vice versa.

Some would say that the most successful work of computer generated entertainment turned out to be the rewrite of the final episode of Seinfeld. A lot of creative liberty was taken, including throwing out the title ‘Seinfeld’ for the more accessible ‘George and his Wacky Friends’. The ending was changed to a more traditional plot arc for the show. Jerry and Newman ended up burying a hatchet in a parakeet’s grave which they took from a cigar store Indian statue in Kramer’s apartment. George finally realizes that Susan had been the best thing that had ever happened to him and instead of wallowing in self-pity over the loss of what could have been, he does the opposite of his instincts and quickly finds a new love and an improved life. Elaine asks her boss for a raise.

Hi, I’m an amateur blogger writing for fun. If you enjoyed or got some value from this post feel free to like, comment, or share. I can also be reached on linkedin for professional inquiries or twitter for personal.

Source: Deep Learning on Medium