Speech To Face

Original article can be found here (source): Deep Learning on Medium


First, audio processing: we use up to 6 seconds of audio extracted from YouTube. If the audio clip is shorter than 6 seconds, we repeat it until it is at least 6 seconds long. The audio waveform is resampled at 16 kHz and only a single channel is used. Spectrograms are computed by taking the STFT with a Hann window of 25 ms, a hop length of 10 ms, and 512 FFT frequency bands. Each complex spectrogram S then goes through power-law compression, yielding sgn(S)|S|^0.3 for the real and imaginary components independently, where sgn(·) denotes the signum function.
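As a rough sketch of this preprocessing (the function name and the use of SciPy are my own; the window size, hop length, FFT size, and compression exponent come from the description above):

```python
import numpy as np
from scipy import signal

def audio_to_compressed_spectrogram(waveform, sr=16000):
    # Repeat clips shorter than 6 s until they reach the target
    # length, then trim to exactly 6 s of samples.
    target_len = 6 * sr
    if len(waveform) < target_len:
        reps = int(np.ceil(target_len / len(waveform)))
        waveform = np.tile(waveform, reps)
    waveform = waveform[:target_len]

    # STFT with a 25 ms Hann window, 10 ms hop, 512 FFT bands
    nperseg = int(0.025 * sr)          # 400 samples
    hop = int(0.010 * sr)              # 160 samples
    _, _, S = signal.stft(waveform, fs=sr, window="hann",
                          nperseg=nperseg, noverlap=nperseg - hop,
                          nfft=512)

    # Power-law compression sgn(x)|x|^0.3, applied independently
    # to the real and imaginary parts
    compress = lambda x: np.sign(x) * np.abs(x) ** 0.3
    return np.stack([compress(S.real), compress(S.imag)])
```

The result is a 2-channel (real/imaginary) compressed spectrogram suitable as network input.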


We run the CNN-based face detector from Dlib, crop the face regions from the frames, and resize them to 224 × 224 pixels. The VGG-Face features are computed from the resized face images. The computed spectrogram and VGG-Face feature of each segment are collected and used for training.
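For illustration only (Dlib's detector returns face rectangles; the nearest-neighbour resize below is a stand-in for a proper image-resize routine, and the function name is hypothetical), the crop-and-resize step might look like:

```python
import numpy as np

def crop_and_resize(frame, box, size=224):
    # box = (top, left, bottom, right), e.g. from a Dlib detection.
    top, left, bottom, right = box
    face = frame[top:bottom, left:right]

    # Nearest-neighbour resize to size x size via index sampling
    h, w = face.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return face[rows][:, cols]
```

The resized 224 × 224 crop is then passed to VGG-Face to compute the target feature.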

Voice Encoder Model

Our voice encoder module is a convolutional neural network that turns the spectrogram of a short input speech into a pseudo face feature, which is subsequently fed into the face decoder to reconstruct the face image (Fig. 2). The architecture of the voice encoder is summarized in Table 1. Blocks of a convolution layer, ReLU, and batch normalization [23] alternate with max-pooling layers, which pool only along the temporal dimension of the spectrograms, leaving the frequency information intact. This is intended to preserve more of the vocal characteristics, since these are better contained in the frequency content, whereas linguistic information usually spans longer time durations. At the end of these blocks, we apply average pooling along the temporal dimension. This allows us to efficiently aggregate information over time and makes the model applicable to input speech of varying duration. The pooled features are then fed into two fully-connected layers to produce a 4096-D face feature.
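A minimal PyTorch sketch of this design (the channel counts and block depth are assumptions, not the exact Table 1 configuration; only the time-only pooling, temporal average pooling, and two-FC head follow the text):

```python
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    def __init__(self, n_freq=257, feat_dim=4096):
        super().__init__()
        # Conv/ReLU/BatchNorm blocks; max-pooling shrinks only the
        # time axis, so frequency information is carried through.
        self.blocks = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(16),
            nn.MaxPool2d(kernel_size=(1, 2)),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.BatchNorm2d(32),
            nn.MaxPool2d(kernel_size=(1, 2)),
        )
        # Two fully-connected layers produce the face feature.
        self.fc = nn.Sequential(
            nn.Linear(32 * n_freq, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, x):          # x: (batch, 2, n_freq, time)
        x = self.blocks(x)
        x = x.mean(dim=3)          # average-pool over time
        return self.fc(x.flatten(1))
```

Because the temporal axis is averaged out before the FC layers, inputs of different durations map to the same feature size.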

Loss in Encoder
A natural choice for the loss function would be the L1 distance between the features: ‖v_f − v_s‖₁. However, we found that training progresses slowly and unstably with this loss alone. To stabilize training, we introduce additional loss terms. Specifically, we additionally penalize the difference in the activation of the last layer of the face encoder, f_VGG : ℝ^4096 → ℝ^2622, i.e., fc8 of VGG-Face, and that of the first layer of the face decoder, f_dec : ℝ^4096 → ℝ^1000, which are pre-trained and kept fixed while training the voice encoder. We feed both our predictions and the ground-truth face features to these layers to calculate the losses.

def absv(self, a, b):
    # lambda1 * || a/||a|| - b/||b|| ||^2: distance between the
    # L2-normalized predicted and ground-truth face features
    # (the original element-wise a/|a| reduces to sign(a); unit
    # normalization matches the feature distance described above)
    a = a / a.norm(dim=-1, keepdim=True)
    b = b / b.norm(dim=-1, keepdim=True)
    return self.__lambda1__ * (a - b).pow(2).sum()

def PI(self, a, i):
    # temperature-softened softmax probability of class i
    n = torch.exp(a[i] / self.__T__)
    d = torch.sum(torch.exp(a / self.__T__))
    return n / d

def Ldistill(self, a, b):
    # knowledge-distillation loss: cross-entropy between the
    # softened distributions of the two activations
    res = 0.0
    for i in range(a.size(0)):
        res = res - self.PI(a, i) * torch.log(self.PI(b, i))
    return self.__lambda2__ * res

Face Decoder
It is based on Cole et al.'s method. We could have mapped from F to an output image directly using a deep network, but this would need to simultaneously model variation in the geometry and textures of faces. As with Lanitis et al. [7], we have found it substantially more effective to separately generate landmarks L and textures T and render the final result using warping. We generate L using a shallow multi-layer perceptron with ReLU non-linearities applied to F. To generate the texture images, we use a deep CNN. We first use a fully-connected layer to map from F to 14 × 14 × 256 localized features. Then, we use a set of stacked transposed convolutions [28], separated by ReLUs, with a kernel width of 5 and stride of 2, to upsample to 224 × 224 × 32 localized features. The number of channels after the i-th transposed convolution is max(256/2^i, 32). Finally, we apply a 1 × 1 convolution to yield 224 × 224 × 3 RGB values.
Because we are generating registered texture images, it is not unreasonable to use a fully-connected network rather than a deep CNN. This maps from F to 224 × 224 × 3 pixel values directly using a linear transformation. Despite the spatial tiling of the CNN, the two models have roughly the same number of parameters, and we contrast their outputs.

def forward(self, x):
    # Landmark branch: shallow MLP with ReLU non-linearities
    L1 = self.ReLU(self.fc3(x))
    L2 = self.ReLU(self.layerLandmark1(L1))
    L3 = self.ReLU(self.layerLandmark2(L2))
    outL = self.ReLU(self.layerLandmark3(L3))

    # Texture branch: FC layer to localized features, then stacked
    # transposed convolutions upsample to the full texture image
    T0 = self.ReLU(self.fc4(L1))
    T0 = T0.view(-1, 64, 14, 14)
    T1 = self.T1_(T0)
    T2 = self.T2_(T1)
    T3 = self.T3_(T2)
    T4 = self.T4_(T3)
    outT = self.ConvLast(T4)
    return outL, outT

Loss in Decoder

Textures are compared using mean absolute error, landmarks using mean squared error, and FaceNet embeddings using negative cosine similarity.
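A sketch of how these three comparisons could be combined into a single decoder loss (the function name and the weights w_* are assumptions, not values from the article):

```python
import torch
import torch.nn.functional as F

def decoder_loss(pred_tex, true_tex, pred_lmk, true_lmk,
                 pred_emb, true_emb, w_tex=1.0, w_lmk=1.0, w_emb=1.0):
    l_tex = F.l1_loss(pred_tex, true_tex)    # textures: mean absolute error
    l_lmk = F.mse_loss(pred_lmk, true_lmk)   # landmarks: mean squared error
    # FaceNet embeddings: negative cosine similarity
    l_emb = -F.cosine_similarity(pred_emb, true_emb, dim=-1).mean()
    return w_tex * l_tex + w_lmk * l_lmk + w_emb * l_emb
```

With perfect predictions the texture and landmark terms vanish and the embedding term reaches its minimum of −1.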

Differentiable Image Warping
Let I₀ be a 2-D image. Let L = {(x₁, y₁), …, (xₙ, yₙ)} be a set of 2-D landmark points and let D = {(dx₁, dy₁), …, (dxₙ, dyₙ)} be a set of displacement vectors for each control point. In the morphable model, I₀ is the texture image T and D = L − L̄ is the displacement of the landmarks from the mean geometry.

The interpolation is done independently for horizontal and vertical displacements. For each dimension, we have a scalar g_p defined at each 2-D control point p in L, and we seek to produce a dense 2-D grid of scalar values. Besides the facial landmark points, we include extra points at the boundary of the image, where we enforce zero displacement.
The warping is implemented in TensorFlow.

def image_warping(src_img, src_landmarks, dest_landmarks):
    # Warp the texture image so the source landmarks move to the
    # destination landmarks (TensorFlow 1.x sparse_image_warp)
    warped_img, dense_flows = sparse_image_warp(src_img,
                                                src_landmarks,
                                                dest_landmarks)
    with tf.Session() as sess:
        out_img = sess.run(warped_img)
    warp_img = np.uint8(out_img * 255)
    return torch.from_numpy(warp_img).float()

These are some of the key components of the Speech2Face model.

Github Project : https://github.com/ravising-h/Speech2Face/