Source: Deep Learning on Medium

While watching a movie, we easily figure out which characters are interacting with each other, even when they appear in different frames. Ever wondered how we are able to do that so effortlessly?

Our mind automatically interprets where a person is looking, whether we are watching movies or in real life. We can do this because:

- we are able to predict the gaze (look-at) vector of a person,
- we are pretty good at guessing the “things” that the person might be looking at,
- and we have information about the geometry of the surroundings.

In this article, I will be explaining the research paper

Following Gaze in Video [Recasens, Adria; Vondrick, Carl; Khosla, Aditya; Torralba, Antonio. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1435–1443, 2017].

In the above figure, the scenes are taken from the movie “Forrest Gump”. This is the scene where Forrest visits his wife Jenny, who is dying (possibly of AIDS). We can look at the frame in which Tom Hanks (Forrest Gump) is present and is looking at his wife Jenny. A couple of frames later, Jenny is looking back at Forrest in an entirely different frame. The proposed model correctly determines the target frame in which the gazed-at person/object (here, Jenny) is present, along with her location in that frame.

Now I will dig into the intuition behind such a model and the details of how it works.

**Understanding required**

This task requires both semantic and geometric understanding of the video.

*Semantic understanding* is required to identify frames that are from the same scene (for example, an indoor frame and an outdoor frame are unlikely to belong to the same scene).

*Geometric understanding* is required to localize exactly where the person is looking in a novel frame, using the head pose and the geometric relationship between the frames.

**VideoGaze Dataset**

In this research, the model is trained on the VideoGaze dataset, created by the authors themselves. VideoGaze contains 166,721 annotations from 140 movies. To build the dataset, the authors used videos from the MovieQA dataset. Each sample consists of 6 frames. The first frame contains the character whose gaze is annotated; the eye location and a head bounding box for the character are provided. The other 5 frames are annotated with the gaze location the character is looking at at that time, if present in the frame.

In the above figure, the frames with green borders are the ones that contain the gazed-at object, while the red ones do not.

**Network Architecture**

The network is divided into 3 broad pathways:

#### 1. Saliency Pathway

This pathway is responsible for detecting the salient regions of the target frames.

#### 2. Gaze Pathway

This pathway is responsible for predicting the parameters of the gaze cone, i.e., the cone of the field of view.

#### 3. Transformation Pathway

This pathway is responsible for predicting the transformation parameters that relate the coordinate systems imposed by *xs* and *xt*.

I will discuss these pathways in detail a little later. Before that, have a look at the network architecture proposed to predict the gaze location.

In the following figure,

- *xs* : source frame where the person is located,
- *xh* : image crop of the source frame containing only the person’s head,
- *ue* : coordinates of the eyes of the person within the frame *xs*.

Let *x* be the set of frames in which we want to predict where the person is looking (if any).

### Objective

We wish to:

- select a target frame *xt*, belonging to the set of frames *x*, in which the object of gaze appears, and
- predict the coordinates of the person’s gaze **ŷ** in *xt*.

### Method

The components of the network are listed below in computation order, so that the complex model architecture is easier to follow.

- Saliency Pathway
- Gaze (Cone) Pathway
- Transformation Pathway
- Cone-Plane Intersection
- Frame Selection
- Gaze Prediction

### Multi-frame Gaze Network

To solve the problem, we need to solve the following sub-problems:

1. estimate the head pose of the person,

2. find the geometric relationship between the frame where the person is and the frame where the gaze location might be,

3. find the potential locations in the target frame where the person might be looking (salient spots).

With this structure in mind, we design a convolutional network **F** to predict **ŷ** for a target frame *xt*:

**ŷ** = F(*xs*, *xh*, *ue*, *xt*) = S(*xt*) ⊙ G(*xs*, *xh*, *ue*, *xt*)

where S(·) and G(·) are decompositions of the original problem and ⊙ is the element-wise product operator.

#### Interpretation

- S(*xt*) is intended to learn salient objects in the target frame.
- G(*xs*, *xh*, *ue*, *xt*) is intended to estimate the mask of all locations where the person could be looking in the target frame *xt*.
- We use the element-wise product as an AND operation, so that the network predicts that people are looking at salient objects that are within their eyesight.
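As a toy illustration of this AND-style combination (the arrays below are made up, and the grid is shrunk from 13 × 13 to 3 × 3 for readability):

```python
import numpy as np

# Hypothetical saliency map S(xt): high where the frame is salient.
S = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.8, 0.1],
              [0.0, 0.1, 0.7]])

# Hypothetical gaze mask G(...): high where the gaze cone hits the frame.
G = np.array([[0.0, 0.1, 0.2],
              [0.1, 0.9, 0.8],
              [0.0, 0.3, 0.9]])

# The element-wise product acts as a soft AND: a cell scores high only
# if it is both salient and inside the person's field of view.
combined = S * G
pred = np.unravel_index(np.argmax(combined), combined.shape)
print(pred)  # → (1, 1): the cell that is both salient and gazed at
```

Note how the top-left cell is very salient (0.9) but outside the gaze mask (0.0), so the product suppresses it.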

The structure of G is motivated by the geometry of the scene. G can be represented as the intersection of the person’s gaze cone C with a plane representing the target frame *xt* transformed into the same coordinate frame as *xs*:

G(*xs*, *xh*, *ue*, *xt*) = C(*xh*, *ue*) ∩ τ(T)

where C produces the gaze cone and τ(T) is the target-frame plane under the transformation T.

### Transformation

To geometrically relate the two frames *xs* and *xt*, we expect our transformation pathway to learn an affine transformation.

Let Z be the set of coordinates inside the square with corners (±1, ±1, 0). Suppose the image *xs* is located in Z (*xs* is resized to have its corners at (±1, ±1, 0)).

We use T to transform the coordinates of *xt* into the coordinate system defined by *xs*. The transformation function can be represented as:

T(z) = R z + t,  for z ∈ Z

where R is the rotation matrix built from the three predicted rotation angles and t is the predicted 3D translation vector.
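To make this concrete, here is a small sketch of applying a rigid transformation T(z) = Rz + t, built from three Euler angles and a translation, to the corners of the square Z. The angle and translation values are hypothetical, not outputs of the actual pathway:

```python
import numpy as np

def rotation_matrix(rx, ry, rz):
    """Rotation from three Euler angles (radians), composed as Rz @ Ry @ Rx."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def transform_plane(points, angles, t):
    """Map plane coordinates into the source coordinate system: T(z) = R z + t."""
    R = rotation_matrix(*angles)
    return points @ R.T + t

# Corners of the square Z with corners (+-1, +-1, 0).
corners = np.array([[-1., -1., 0.], [1., -1., 0.],
                    [1., 1., 0.], [-1., 1., 0.]])

# Hypothetical parameters: quarter-turn about the z-axis, shift by 1 in z.
moved = transform_plane(corners, angles=(0.0, 0.0, np.pi / 2),
                        t=np.array([0., 0., 1.]))
print(np.round(moved, 3))
```

In the real network these six numbers come from the transformation pathway’s 7-dimensional output (the 7th being γ).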

### Cone-Plane Intersection

The intersection of the person’s gaze cone and the transformed frame plane τ(T) can be obtained by solving the following equation, which is a conic-section equation if you expand it:

β̃ᵀ Σ β̃ = 0,  with β̃ = (β1, β2, 1)ᵀ

where,

1. (β1, β2) are coordinates in the coordinate system defined by *xt*.

2. Σ is a matrix defining the cone-plane intersection.

Solving the above equation for all β gives us the cone-plane intersection; however, it is not discrete, which would not provide a gradient for learning. Therefore, we use an approximation to make the intersection soft:

G(β1, β2) = σ(β̃ᵀ Σ β̃)

where σ is the sigmoid activation function.

Finally, to compute the intersection, we evaluate Equation (4) for β1, β2 ∈ [−1, 1].
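As a sketch of evaluating the soft intersection on a grid over [−1, 1]² (the Σ below is hand-picked purely for illustration, with a sign convention of positive values inside the intersection):

```python
import numpy as np

def soft_intersection(Sigma, k=13):
    """Evaluate sigmoid(beta^T Sigma beta) on a k x k grid over [-1, 1]^2,
    with beta = (b1, b2, 1)."""
    coords = np.linspace(-1, 1, k)
    mask = np.empty((k, k))
    for i, b2 in enumerate(coords):
        for j, b1 in enumerate(coords):
            beta = np.array([b1, b2, 1.0])
            val = beta @ Sigma @ beta
            mask[i, j] = 1.0 / (1.0 + np.exp(-val))  # sigmoid keeps it differentiable
    return mask

# Illustrative Sigma: the quadratic form is positive inside a disc at the origin.
Sigma = np.diag([-8.0, -8.0, 2.0])
mask = soft_intersection(Sigma)
print(mask.shape)        # (13, 13)
print(mask[6, 6] > 0.5)  # True: the grid centre lies inside the soft intersection
```

The sigmoid is exactly what restores a usable gradient: a hard 0/1 intersection test would be flat almost everywhere.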

### Explanation of Pathways

We estimate the parameters of the saliency map S, the cone C, and the

transformation T using CNNs.

#### Saliency Pathway

- *Input*: Target frame *xt*
- *Architecture*: 6-layer CNN, where the 5 initial layers are from AlexNet pretrained on ImageNet and the last layer uses a 1 × 1 kernel to merge 256 channels into a single k × k map.
- *Output*: Saliency map S(*xt*) (in the paper k = 13, so basically a 169-dimensional output).

#### Gaze(Cone) Pathway

- *Input*: Head image *xh* and eye location *ue*.
- *Architecture*: 5-layer CNN (from AlexNet) followed by 3 fully-connected layers of dimensions 500, 200 and 4 respectively. We set the origin of the cone at the head of the person, *ue*.
- *Output*: Cone parameters v (3D direction vector of the cone’s axis, which can also be considered the head-pose vector) and α (the cone’s aperture angle).
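To build intuition for what v and α define, here is a small sketch (all values hypothetical) that tests whether a 3D point falls inside the gaze cone whose apex is at the person’s eyes:

```python
import numpy as np

def in_cone(point, apex, v, alpha):
    """True if `point` lies inside the cone with apex `apex`, axis
    direction `v` and half-aperture angle `alpha` (radians)."""
    d = point - apex
    d = d / np.linalg.norm(d)
    v = v / np.linalg.norm(v)
    # Inside the cone iff the angle between d and the axis is at most alpha.
    return np.dot(d, v) >= np.cos(alpha)

apex = np.array([0.0, 0.0, 0.0])  # eye location ue
v = np.array([0.0, 0.0, 1.0])     # head-pose / gaze direction
alpha = np.deg2rad(30)            # cone half-angle

print(in_cone(np.array([0.1, 0.0, 1.0]), apex, v, alpha))  # True: near the axis
print(in_cone(np.array([2.0, 0.0, 1.0]), apex, v, alpha))  # False: far off axis
```

The network never does this hard test directly; it feeds v and α into the soft cone-plane intersection described above.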

#### Transformation Pathway

There are 2 parts to this network :

**T1 network :**

- *Input*: Source frame (*xs*) and target frame (*xt*)
- *Architecture*: T1 is applied separately to both source and target frames. It is the 5-layer CNN from AlexNet (the weights are shared between the two frames).
- *Output*: a 256-channel CNN feature map for each of the source and target frames.

**T2 network :**

- *Input*: Concatenated CNN features of the source and target frames (depth 512).
- *Architecture*: one CNN layer with a single 1 × 1 kernel, followed by 3 fully-connected layers of dimensions 200, 100 and 7 respectively.
- *Output*: 7-dimensional vector (the first 3 are translation parameters along each coordinate axis, the next 3 are rotation angles about each coordinate axis, and the 7th is γ(*xs*, *xt*), which is used to set G = 0 if no transformation is found).

#### Cone-Plane Intersection

This is basically a mathematical computation (not any kind of neural network).

- *Input*: Head-pose vector (v, from the cone pathway), eye location (*ue*), aperture angle of the cone (α, from the cone pathway), and the 3 translation parameters and 3 rotation angles from the transformation pathway.
- *Computation*: the equation xᵀ M x = 0 represents a cone, where M = vᵀv − αI [2]. The translation and rotation parameters from the transformation pathway and the matrix M are used to compute the Σ matrix, which is then used in the cone-plane intersection equation (i.e., a conic section) as specified in Equation (3).
- *Output*: 169-dimensional feature vector (can be thought of as an encoded cone-projection mask of size 13 × 13).
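Forming the cone matrix M itself is a one-liner; a quick sketch with a hypothetical head-pose vector and aperture parameter:

```python
import numpy as np

v = np.array([[0.0, 0.0, 1.0]])  # head-pose direction as a row vector (hypothetical)
alpha = 0.5                       # aperture parameter (hypothetical)

# M = v^T v - alpha * I, so that points x on the cone satisfy x^T M x = 0,
# i.e. (v . x)^2 == alpha * |x|^2.
M = v.T @ v - alpha * np.eye(3)

x_on = np.array([1.0, 0.0, 1.0])    # lies on this particular cone
print(abs(x_on @ M @ x_on) < 1e-9)  # True: the cone equation holds
```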

#### Frame Selection

How do we get the probability distribution of the gazed-at object being in the target frame *xt*?

1. We estimate the probability of the person looking inside a frame *xt*.

2. This probability is computed by a multilayer perceptron E(S(*xt*), G(*xs*, *ue*, *xt*)) with one hidden layer of 200 dimensions and an output layer of 1 dimension (i.e., the probability of the gazed-at object being in frame *xt*).

3. The input to this network is the concatenation of the 169-dimensional outputs of the saliency pathway and the gaze (cone) pathway.
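A minimal sketch of the frame-selection MLP’s forward pass (the weights here are random, and the ReLU/sigmoid choices are my assumptions; the article only specifies the layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def frame_score(saliency_vec, cone_vec, W1, b1, W2, b2):
    """MLP E(S(xt), G(...)): concat -> 200 hidden units -> 1 probability."""
    x = np.concatenate([saliency_vec, cone_vec])  # 169 + 169 = 338 dims
    h = np.maximum(0.0, W1 @ x + b1)              # hidden layer (ReLU assumed)
    return sigmoid(W2 @ h + b2)                   # P(person looks inside xt)

# Randomly initialised weights, for shape-checking only.
W1, b1 = rng.normal(size=(200, 338)) * 0.01, np.zeros(200)
W2, b2 = rng.normal(size=(1, 200)) * 0.01, np.zeros(1)

p = frame_score(rng.random(169), rng.random(169), W1, b1, W2, b2)
print(0.0 <= p[0] <= 1.0)  # True: a valid probability
```

Running this over every candidate frame in *x* and normalising the scores gives the probability distribution used below for frame selection.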

#### Gaze Prediction

This is basically a computation (not any kind of neural network).

1. *Input*: the 169-dimensional output vectors from the saliency pathway and the cone-plane intersection.

2. *Computation*:

   a. Take the element-wise product of both input vectors, resulting in a 169-dimensional vector (13 × 13 map). This vector is fed to a fully-connected layer (fc + softmax) whose output is the map upscaled to 400 dimensions (a 20 × 20 map instead of 13 × 13).

   b. Select the top target frame from the probability distribution obtained from frame selection.

   c. Resize the 20 × 20 output map corresponding to that target frame (e.g., cv2.resize(output_map, (200, 200)) [interpolation]). The location of the maximum value in this map, scaled to the target frame size, is taken as the gaze point and plotted in the target frame.

3. *Output*: gaze point location (x, y)
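The resize-and-argmax step can be sketched as follows. I use simple nearest-neighbour upscaling in plain NumPy as a stand-in for cv2.resize so the snippet stays self-contained, and the hot-spot location is made up:

```python
import numpy as np

def resize_nearest(m, out_h, out_w):
    """Nearest-neighbour upscaling (a stand-in for cv2.resize here)."""
    h, w = m.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return m[rows][:, cols]

# Hypothetical 20 x 20 output map with a single hot spot at row 12, col 5.
out_map = np.zeros((20, 20))
out_map[12, 5] = 1.0

big = resize_nearest(out_map, 200, 200)
y, x = np.unravel_index(np.argmax(big), big.shape)
print((x, y))  # → (50, 120): the gaze point in the 200 x 200 resized map
```

The final (x, y) is then scaled once more from this working resolution to the target frame’s actual pixel size before plotting.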

### Learning

- We constrain each pathway to learn a different aspect of the problem by providing each pathway only a subset of the inputs.

**Loss**: the sum of the losses over all grid sizes,

L(p, q) = Σ[w,h] E[w,h](p, q)

where p is the target map, q is our network’s predicted output map, and E[w,h](p, q) is a spatially smooth *shifted-grids cross-entropy loss* with grid cell size w × h.

Look at the above equation this way: p(x, y) is the actual target map, with a value of 1 in the gaze-location area, and the same goes for the predicted target map q(x, y). You can therefore view this loss as a special kind of classification cross-entropy loss, except that it is calculated using the shifted-grids concept.
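As rough intuition only (the exact shifted-grids loss is deferred to the next article), here is a plain cross-entropy between a one-hot target grid p and a predicted softmax grid q:

```python
import numpy as np

def grid_cross_entropy(p, q, eps=1e-12):
    """Cross-entropy between a one-hot target grid p and a predicted
    probability distribution q over the same grid cells."""
    return -np.sum(p * np.log(q + eps))

p = np.zeros((5, 5)); p[2, 3] = 1.0            # true gaze cell
logits = np.zeros((5, 5)); logits[2, 3] = 2.0  # network favours the right cell
q = np.exp(logits) / np.exp(logits).sum()      # softmax over all cells
loss = grid_cross_entropy(p, q)

# A prediction concentrated near the true cell beats a uniform guess.
print(loss < grid_cross_entropy(p, np.full((5, 5), 1 / 25)))  # True
```

The shifted-grids version applies this idea over several overlapping grid partitions so the loss is smooth with respect to small spatial shifts of the gaze location.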

### Results

Here in the above figure, we can see the outputs of the “Cone-Plane Intersection” component and the “Saliency Pathway”, as well as the final output of the “Gaze Prediction” component.

From the above figure, you can observe that the output of the cone projection is the person’s/character’s cone of field of view projected onto the target image after it has been transformed into the source image’s coordinate system. The output of the saliency pathway clearly highlights the salient, or important, locations in the target frame at which the person/character might be looking.

In the above figure, you can see the probability distribution produced by frame selection. On the right half is the final output: the gaze location (x, y) in the target frame.

### Future articles

I haven’t explained the loss function in detail here because the article is getting too long, so I will discuss this kind of loss (“**shifted-grids cross-entropy loss**”) in detail in my next article. I will also explain other topics in detail, such as “*transformation of one image’s coordinates into another image’s coordinate system in computer vision*”.

You must be wondering how such a complex model can be trained end-to-end while making each part of the network learn to perform a specific task. Therefore, I will be talking about “**Backpropagation in Complex Ensemble Models**” in my following articles.

I would highly appreciate it if you could ask questions to clear up your doubts on this topic, and offer suggestions too.