Curious Resources: Volume 3

Speaker Diarization, Descriptive Scene Building, & Automatic Interaction Systems

“Long exposure shot of a man creating a circle of light, as the moon rises behind” by Austin Neill on Unsplash

Welcome back!

Each edition we try to bring you a set of interesting finds in interactive technology. If you missed the previous edition, there’s another collection of great links waiting for you when you’re done here.

This issue we’re covering:

  • Speaker Diarization
  • Descriptive Scene Creation
  • Automatic Interaction Systems

Follow me on Github or get in touch with me on LinkedIn!

Speaker Diarization

Although we’ve recently seen vast improvements in speech recognition systems such as Amazon’s Alexa and Google’s Assistant, those improvements have for the most part come in systems that deal with a single speaker. They’re great at figuring out what’s being said — just not who’s saying it.

Speaker diarization is the process of segmenting out distinct speakers from a single audio stream. A quick way to think about it is: who spoke when.
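To make "who spoke when" concrete, here's a minimal toy sketch. It assigns fixed-length audio windows to speakers by nearest-centroid matching on voice embeddings, then merges adjacent windows with the same label into segments. The embeddings and centroids below are invented for the example; a real system would compute them from audio with a trained speaker-embedding model.

```python
# Toy "who spoke when": label each one-second window with the nearest
# speaker centroid, then merge consecutive windows with the same label.

def diarize(window_embeddings, centroids, window_sec=1.0):
    """Return (start, end, speaker) segments from per-window embeddings."""
    def nearest(emb):
        # Pick the speaker whose centroid is closest in squared distance.
        return min(centroids,
                   key=lambda s: sum((a - b) ** 2 for a, b in zip(emb, centroids[s])))

    labels = [nearest(e) for e in window_embeddings]
    segments = []
    for i, label in enumerate(labels):
        start = i * window_sec
        if segments and segments[-1][2] == label:
            # Same speaker keeps talking: extend the current segment.
            segments[-1] = (segments[-1][0], start + window_sec, label)
        else:
            segments.append((start, start + window_sec, label))
    return segments

# Mock 2-D embeddings: five one-second windows of audio.
centroids = {"alice": (0.9, 0.1), "bob": (0.1, 0.9)}
windows = [(0.8, 0.2), (0.85, 0.15), (0.2, 0.8), (0.15, 0.9), (0.9, 0.1)]
print(diarize(windows, centroids))
# → [(0.0, 2.0, 'alice'), (2.0, 4.0, 'bob'), (4.0, 5.0, 'alice')]
```

Real diarization also has to discover how many speakers there are (clustering rather than matching against known centroids), which is where much of the difficulty lives.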

Diarization is a necessary piece of any voice-based system that would create rich text transcripts from courtroom proceedings, meetings, podcasts, and panel discussions.

It’s important that such a system recognize new speakers it has never heard before, across a variety of languages and voices, and independent of background noise. Transcripts can contain additional information like speaker positions or neural-network-generated descriptions of visual elements.

We’re interested in using those transcripts in combination with the audio itself as the data source for generative content. A few examples:

  • We can take data from a company like Bridgewater, who records employee conversations, and turn every conversation into a repeatable learning environment. We can put you in the scene, pause it before a decision, and ask: What would you do?
  • We can take the entire history of Radio Drama and create virtual movie sets where we can shoot a million movies from any angle with any lens we want. Reimagine a favorite television show without having to recast it.
  • We can take audio of a podcast or panel discussion and feed it into an environment where the audio drives an avatar’s facial or skeletal animations. Sound comes from the position of the currently speaking character, since we can map speaker classifications to entities. You’ve got a ticket to an endless virtual concert hall — the room could be replicated infinitely or expanded to match

IBM is into speaker diarization; it’s now built into the Watson Speech-to-Text API.

Who’s speaking? : Speaker Diarization with Watson Speech-to-Text API – IBM Cloud Blog

Amazon’s not quite there yet with their Transcribe service, but it’s coming soon.

Amazon Transcribe – Automatic Speech Recognition – AWS

Microsoft is into speaker diarization as well.

Notice a trend? You’ll find Google represented in the “Speaker Diarization with LSTM” paper below.

Here are some cool python libraries to get you started with your own experiments.

Pyannote has absorbed some other libraries, such as TristouNet. It’s being actively developed, but be wary: “The API is unfortunately not documented yet.” :/


Around here, we LOVE a paper with code. Or is it code with a paper? Thanks to the LIMSI team in France!


Regarding datasets: if you’re already paying for transcripts of calls or meetings from another service, you’ve got a great resource.

If you don’t, or if your dataset is small and needs augmentation, here are a few different speech corpora: meeting recordings in English, Finnish Parliament proceedings, and French radio and TV.

I-vectors were all the rage in diarization for a while. “I-vectors convey the speaker characteristic among other information such as transmission channel, acoustic environment or phonetic content of the speech segment.”

Front-End Factor Analysis for Speaker Verification – IEEE Journals & Magazine
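For reference, the core model in that paper is compact: a speaker- and channel-dependent GMM mean supervector M is written as

M = m + Tw

where m is the speaker-independent UBM mean supervector, T is the low-rank total-variability matrix, and w is the i-vector itself, a latent vector with a standard-normal prior. Everything that varies between recordings — speaker, channel, environment — gets squeezed into that one low-dimensional w.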

Then along came deep learning. Here’s a newer paper that starts to move away from the idea of i-vectors.

Descriptive Scene Building

One unavoidable dilemma when building virtual worlds is the void.

You can build absolutely anything you want, but you’re faced with doing so in an endless expanse that stretches as far as you can see in every direction. It’s pretty intimidating to get started, to say nothing of the effort required to fill such a vast space with attractive content.

To generate an interesting object, you’ve got to be a specialist in 3D modeling, lighting, and texture painting — all while using a manual Human Interface Device like a mouse, a keyboard, or more recently a 3D mouse or a set of 6DOF controllers.

Or you’ve got to become an expert at scouring the internet to find objects someone else has created that fit your budget — often by compromising on the initial concept. While this practice allows creators to avoid losing momentum and occasionally leads to delightful outcomes, the end result may not be the ideal form.

After all that, you’ve got to arrange all of these pieces of content in the space, usually with a combination of rotation and movement handles attached to individual objects, one at a time.

And if you want the objects to be anything more than decoration, you’ve got to script interactions that specify any and all possible uses of an item beyond basic physics — e.g. a piece of paper that you can write on, crumple up and throw in a trash can, and that will burn when it’s put in a toaster. (More on that below in the section on Automatic Interaction Systems.)

It should be easier.

If we can find the words for it, we can make it so.

Hence, descriptive scene building as a way forward.

We’ll be able to create and change the spaces that we inhabit online in real-time. And we’ll do so by describing with natural language what we want to see and where we want to be. Intelligent systems will get to know, riff off of and even preempt our desires as they adjust the world around us to suit our current mood and intent. The atoms that we build with in mixed reality can be the smart nano-dust that engineers dream of. As we describe our desire it will manifest and we’ll know it when we see it.

Pragmatic ambiguity will be a feature — if there is a possibility space wherein a number of arrangements would match your specification, you may be asked to curate your preferred embodiment. Or the system might show an arrangement that oscillates over time. Imagine a room that’s never the same but always recognizable.

The first generation of descriptive scene building techniques output static environments. Will the next be so constrained? Or will we use its constraints as creative fuel?

This 2006 paper seems to have kicked things off. Can you imagine trying to do this with the natural language processing tools available at the time? I bet they’d love another go.

Things advanced in 2009–2010, along with NLP.

And in the past few years we’ve started to approach systems that can handle situations like: “There is a room with a chair and a computer”. It gives me goosebumps to think about all the possibilities.
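To show what the input/output of such a system looks like, here's a deliberately naive sketch that parses that exact sentence shape into a container object and its contents. Real systems use full NLP parsing plus learned spatial priors to decide where each object goes; this regex-level toy only illustrates the structure being extracted.

```python
import re

def parse_scene(sentence):
    """Parse 'There is a <container> with <item> and <item>...' sentences."""
    m = re.match(r"there is an? (\w+) with (.+)",
                 sentence.strip().lower().rstrip("."))
    if not m:
        return None
    container = m.group(1)
    # Split the item list on commas and 'and', then strip articles.
    items = re.split(r",\s*|\s+and\s+", m.group(2))
    contents = [re.sub(r"^an?\s+", "", item) for item in items]
    return {"container": container, "contents": contents}

print(parse_scene("There is a room with a chair and a computer"))
# → {'container': 'room', 'contents': ['chair', 'computer']}
```

The hard part — which the research tackles — is going from that structured description to plausible 3D geometry and placement.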

I highly recommend checking out Angel Chang’s body of work, since her research continues to push the boundaries of what’s possible in these areas.

Angel Xuan Chang | Angel Xuan Chang

One interesting problem in descriptive scene building is dealing with a phrase like “A window over there”. Over where? Understanding spatial intent is important, and Bayesian techniques can be used to filter out unlikely intents and aid decision-making.
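A hedged sketch of how Bayes' rule applies here: combine a prior over candidate placements (windows usually go in walls, not floors) with a likelihood of the user's pointing gesture given each placement, and pick the posterior maximum. The numbers below are invented for illustration.

```python
# Resolve "a window over there" by Bayesian inference over candidate
# placements: posterior ∝ prior × likelihood, normalized to sum to 1.

def posterior(prior, likelihood):
    unnorm = {k: prior[k] * likelihood[k] for k in prior}
    z = sum(unnorm.values())
    return {k: v / z for k, v in unnorm.items()}

prior = {"north_wall": 0.5, "east_wall": 0.3, "floor": 0.2}       # windows rarely go in floors
likelihood = {"north_wall": 0.2, "east_wall": 0.7, "floor": 0.1}  # gesture points roughly east
post = posterior(prior, likelihood)
print(max(post, key=post.get))
# → 'east_wall'
```

Even though the prior favored the north wall, the gesture evidence is strong enough to flip the decision — which is exactly the "filter out unlikely intents" behavior described above.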

Meet Matterport’s newest employee: Thomas Bayes

Automatic Interaction Systems

As I mentioned above, even if you can generate a scene or environment you’ve still got to wire it up with code or animations to make it come alive.

How do we do that at scale, in a way that is interesting and feels authentic?

This first paper takes recordings from depth sensors of people performing common interactions like “watching TV” and then allows for procedural generation of similar activity.

A few people doing the same activity will quickly open up a wide range of possibilities — as each node is attached to the graph, a character can explore a larger action space between nodes for activities to perform. That is, after observing us both, the character would be able to “watch TV” mostly the way that I “watch TV”, but also sort of the way that you “watch TV.”
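That "action space between nodes" can be pictured as interpolation between observed performances. Here's a toy sketch that linearly blends two invented joint-angle poses for "watch TV"; a real system would blend full motion clips along edges of a motion graph rather than single poses.

```python
# Explore the space between two observed styles of the same activity by
# linearly interpolating joint angles. t=0 is my style, t=1 is yours.

def blend_pose(pose_a, pose_b, t):
    return {joint: (1 - t) * pose_a[joint] + t * pose_b[joint]
            for joint in pose_a}

my_watch_tv = {"neck": 10.0, "elbow": 90.0}    # invented joint angles (degrees)
your_watch_tv = {"neck": 20.0, "elbow": 70.0}
print(blend_pose(my_watch_tv, your_watch_tv, 0.5))
# → {'neck': 15.0, 'elbow': 80.0}
```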

I’d love to take it a bit further and see what happens with a generative adversarial network — what would it look like if the network running the character just barely continued to classify the movement it creates as “watching TV” while also making the movement as much like “write on a whiteboard” as possible? This works for images in the styles of different painters, and often produces interesting results. Could it work with motion too? Will it dream up new gestures or spit out nonsense?

You may need to watch the video to fully grok the next paper, but what’s happening here is that nobody actually animates the pirate’s feet — the character is able to move over and transition between a variety of terrain without specialized programming to account for each situation. Props to these folks for having a video, a paper, and code 🙂


See ya next time!

Source: Deep Learning on Medium