What are you looking at? I know…

Source: Deep Learning on Medium

While watching a movie, we easily figure out which characters are interacting with each other, even when they appear in different frames. Ever wondered how we are able to do that so effortlessly?

Our mind automatically works out where a person is looking, whether we are watching a movie or observing real life. We can do this because

  1. we are able to estimate the gaze (look-at) vector of a person,
  2. we are pretty good at guessing which “things” the person might be looking at,
  3. and we have a good sense of the geometry of the surroundings.

In this article, I will be explaining the research paper Following Gaze in Video. [Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Recasens, Adria and Vondrick, Carl and Khosla, Aditya and Torralba, Antonio, Pages 1435–1443, 2017].

Fig 1. An example from a movie where the model is trying to predict the gaze of a character which is in a later scene.

In the above figure, the scenes are taken from the movie “Forrest Gump”. This is the scene in which Forrest meets his wife Jenny, who is dying (possibly of AIDS). In the first frame, Tom Hanks (Forrest Gump) is looking at Jenny. A couple of frames later, in a completely different shot, Jenny is looking back at her husband Forrest. The proposed model correctly determines the target frame in which the gazed person or object (here, Jenny) is present, along with her location in that frame.
Next, I will dig into the intuition behind such a model and the actual details related to it.

Understanding required

This task requires both Semantic and Geometric understanding:

Semantic Understanding is required to identify frames that are from the same scene (for example, an indoor frame and an outdoor frame are unlikely to be from the same scene).

Geometric Understanding is required to localize exactly where the person is looking in a novel frame using the head pose and geometric relationship between the frames.

VideoGaze Dataset

The authors trained their model on the VideoGaze dataset, which they created themselves. VideoGaze contains 166,721 annotations from 140 movies, and it was built from videos in the MovieQA dataset. Each sample of the dataset consists of 6 frames. The first frame contains the character whose gaze is annotated, together with the eye location and a head bounding box for that character. The other 5 frames are annotated with the location the character is looking at, if that location is present in the frame.

In the above figure, the frames with green borders contain the gazed object, and the frames with red borders do not.
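To make the sample layout concrete, here is a minimal sketch of how one such sample could be held in memory. The class name `GazeSample`, its field names, and the file paths are assumptions chosen for illustration; they are not the dataset's actual file format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]              # (x, y) in normalized image coordinates
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class GazeSample:
    """One sample: a source frame plus five candidate target frames (hypothetical layout)."""
    source_frame: str                      # frame containing the annotated character
    eye_location: Point                    # annotated eye position in the source frame
    head_box: Box                          # annotated head bounding box
    target_frames: List[str]               # the five other frames
    gaze_locations: List[Optional[Point]]  # gaze point per target frame, None if absent

    def frames_with_gaze(self) -> List[int]:
        """Indices of target frames in which the gazed location is visible."""
        return [i for i, g in enumerate(self.gaze_locations) if g is not None]


sample = GazeSample(
    source_frame="clip/frame_000.jpg",
    eye_location=(0.42, 0.31),
    head_box=(0.35, 0.20, 0.50, 0.45),
    target_frames=[f"clip/frame_{i:03d}.jpg" for i in range(1, 6)],
    gaze_locations=[None, (0.60, 0.50), (0.58, 0.52), None, None],
)
print(sample.frames_with_gaze())  # [1, 2]
```

In terms of the figure, the indices returned by `frames_with_gaze()` would be the green-bordered frames and the remaining ones the red-bordered frames.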

Network Architecture

The network is divided into 3 broad pathways:

1. Saliency Pathway :

This pathway is responsible for detecting the salient regions of the target frame.

2. Gaze Pathway :

This pathway is responsible for predicting the parameters of the gaze cone i.e., cone of field of view.

3. Transformation Pathway :

This pathway is responsible for predicting the transformation parameters that relate the different coordinate systems imposed by the source frame xs and the target frame xt.

I will be discussing these pathways in detail a little later. Before that, have a look at the network architecture that is proposed to predict the gaze location.

In the following figure,

xs : Source frame where the person is located,
xh : Image crop of source frame containing only person’s head,
ue : Coordinates of the eyes of the person within the frame xs.
Let x be the set of frames in which we want to predict where the person is looking (if anywhere).


We wish to,

  1. select a target frame xt from the set of frames x in which the object of gaze appears, and
  2. predict the coordinates of the person’s gaze ŷ in xt.


The components of the network are listed below in computation order, so that the model architecture is easier to follow.

  1. Saliency Pathway
  2. Gaze (Cone) Pathway
  3. Transformation Pathway
  4. Cone-Plane Intersection
  5. Frame Selection
  6. Gaze Prediction

Multi-frame Gaze Network

To solve the problem, we need to solve the following sub-problems :
1. estimate the head pose of the person,
2. find the geometric relationship between the frame where the person is
and the frame where the gaze location might be,
3. find the potential locations in the target frame where the person
might be looking (salient spots)

With this structure in mind, we design a convolutional network F to predict ŷ for a target frame xt:

    ŷ = F(xs, xh, ue, xt) = S(xt) ⊙ G(xs, xh, ue, xt)        (1)

where S(·) and G(·) are decompositions of the original problem and ⊙ is the element-wise product operator.


  1. S(xt) is intended to learn salient objects in the target frame.
  2. G (xs , xh , ue , xt) is intended to estimate the mask of all locations
    where the person could be looking in the target frame xt.
  3. We use the element-wise product as an AND operation, so that the
    network predicts that people are looking at salient objects that are
    within their eyesight.
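The AND-like behaviour of the element-wise product can be seen on a toy example. This is only a sketch in plain Python: the 3 × 3 maps and the helper name `elementwise_product` are made up for illustration, and the maps in the paper are larger.

```python
def elementwise_product(saliency, gaze_mask):
    """Multiply two equally sized 2D maps cell by cell."""
    return [[s * g for s, g in zip(s_row, g_row)]
            for s_row, g_row in zip(saliency, gaze_mask)]


# Three salient objects: top-left, top-right and bottom-right.
saliency = [[0.9, 0.1, 0.8],
            [0.1, 0.1, 0.1],
            [0.1, 0.1, 0.7]]
# Only the top-left region lies inside the person's field of view.
gaze_mask = [[1.0, 1.0, 0.0],
             [1.0, 0.0, 0.0],
             [0.0, 0.0, 0.0]]

combined = elementwise_product(saliency, gaze_mask)
# Pick the cell with the highest combined score.
best = max((v, (r, c)) for r, row in enumerate(combined) for c, v in enumerate(row))
print(best[1])  # (0, 0)
```

The salient objects at the top-right and bottom-right are suppressed because the mask is zero there, so only a location that is both salient and visible survives.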

The structure of G is motivated by the geometry of the scene. G can be represented as the intersection of the person’s gaze cone C with a plane τ(T) representing the target frame xt transformed into the same coordinate frame as xs:

    G(xs, xh, ue, xt) = C(xh, ue) ∩ τ(T)        (2)


To geometrically relate the two frames xs and xt, we expect our transformation pathway to learn an affine transformation.
Let Z be the set of coordinates inside the square with corners (±1, ±1, 0). Suppose the image xs is located in Z (xs is resized to have its corners at (±1, ±1, 0)).

We use T to transform the coordinates of xt into the coordinate system defined by xs. The transformation function can be represented as:

    τ(T) = T z, for all z ∈ Z

Cone-Plane Intersection

The intersection of the person’s gaze cone and the transformed frame plane τ(T) can be obtained by solving the following equation, which is a conic section equation if you expand it:

    βᵀ Σ β = 0,  where β = (β1, β2, 1)        (3)

1. (β1, β2) are coordinates in the coordinate system defined by xt.
2. Σ is a matrix defining the cone-plane intersection.

Solving the above equation for all β gives us the cone-plane intersection. However, this gives a hard boundary (a point either satisfies the equation or it does not), which would not provide a gradient for learning.
Therefore, we use an approximation to make the intersection soft:

    G(β1, β2) = σ(βᵀ Σ β)        (4)

where σ is the sigmoid activation function.
Finally, to compute the intersection, we evaluate Equation (4) for β1, β2 ∈ [−1, 1].
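A minimal sketch of this soft intersection, in plain Python, evaluates the sigmoid of the quadratic form βᵀ Σ β on a grid over [−1, 1] × [−1, 1]. The Σ used here is an illustrative circle rather than one produced by the network, the 13 × 13 grid size and the helper name `soft_intersection` are assumptions, and the sign convention (positive inside the cone) is assumed as well.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_intersection(sigma_matrix, k=13):
    """Evaluate sigmoid(beta^T Sigma beta) on a k x k grid over [-1, 1]^2."""
    coords = [-1.0 + 2.0 * i / (k - 1) for i in range(k)]
    mask = []
    for b2 in coords:          # rows index beta_2
        row = []
        for b1 in coords:      # columns index beta_1
            beta = (b1, b2, 1.0)
            quad = sum(beta[i] * sigma_matrix[i][j] * beta[j]
                       for i in range(3) for j in range(3))
            row.append(sigmoid(quad))
        mask.append(row)
    return mask


# Illustrative Sigma: the conic beta_1^2 + beta_2^2 = 0.25 (a circle of radius 0.5),
# scaled by 20 so that the sigmoid transition is fairly sharp.
scale = 20.0
sigma_matrix = [[-scale, 0.0, 0.0],
                [0.0, -scale, 0.0],
                [0.0, 0.0, 0.25 * scale]]

mask = soft_intersection(sigma_matrix)
print(round(mask[6][6], 3), round(mask[0][0], 3))  # grid centre vs. grid corner
```

Cells inside the circle get values close to 1 and cells outside get values close to 0, with a smooth transition at the boundary, which is what lets gradients flow back to the parameters that define Σ.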

Explanation of Pathways

We estimate the parameters of the saliency map S, the cone C, and the
transformation T using CNNs.

Saliency Pathway

  1. Input : Target frame xt
  2. Architecture : 6-layer CNN. The first 5 layers are the convolutional
    layers of AlexNet pretrained on the ImageNet dataset, and the last layer
    uses a 1 × 1 kernel to merge the 256 channels into a single k × k map.
  3. Output : Saliency Map S(xt) (in the paper k = 13, so a
    169-dimensional output)
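The last layer of this pathway is easy to picture: a 1 × 1 convolution is a weighted sum across channels, computed independently at every spatial location. The sketch below uses random numbers as a stand-in for the AlexNet features and for the learned kernel; the function name `conv1x1` is hypothetical.

```python
import random

K, CHANNELS = 13, 256
random.seed(0)

# Stand-in for the 5th-layer feature map: CHANNELS maps of size K x K.
features = [[[random.random() for _ in range(K)] for _ in range(K)]
            for _ in range(CHANNELS)]
# A single 1 x 1 kernel holds one weight per input channel, plus a bias.
weights = [random.uniform(-0.1, 0.1) for _ in range(CHANNELS)]
bias = 0.0


def conv1x1(features, weights, bias):
    """Collapse C x K x K features into a single K x K map."""
    k = len(features[0])
    return [[bias + sum(w * fmap[r][c] for w, fmap in zip(weights, features))
             for c in range(k)]
            for r in range(k)]


saliency_map = conv1x1(features, weights, bias)
flat = [v for row in saliency_map for v in row]
print(len(saliency_map), len(saliency_map[0]), len(flat))  # 13 13 169
```

Flattening the 13 × 13 map gives the 169-dimensional vector that the later components consume.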

Gaze(Cone) Pathway

  1. Input : Head image xh and Eye location ue.
  2. Architecture : a 5-layer CNN (from AlexNet) followed by 3 fc
    layers with dimensions 500, 200 and 4 respectively. We set the origin of
    the cone at the head of the person, ue.
  3. Output : Cone parameters v (3D direction vector of the cone’s axis, which can also be considered the head pose vector) and α (the cone’s aperture angle).
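As a rough sketch of how the 4 outputs of the last fc layer could be turned into cone parameters: the first three values are read as the axis v and the fourth as α. The ordering, the normalization of v to unit length, and the sigmoid squashing of α are all assumptions made for this illustration, and `cone_parameters` is a hypothetical helper.

```python
import math


def cone_parameters(fc_output):
    """Split a 4-value output into a unit axis v and an aperture parameter alpha."""
    vx, vy, vz, raw_alpha = fc_output
    norm = math.sqrt(vx * vx + vy * vy + vz * vz)
    v = (vx / norm, vy / norm, vz / norm)       # unit direction of the cone axis
    alpha = 1.0 / (1.0 + math.exp(-raw_alpha))  # squashed into (0, 1); an assumption
    return v, alpha


v, alpha = cone_parameters([0.0, 3.0, 4.0, 0.0])
print(v, alpha)  # (0.0, 0.6, 0.8) 0.5
```

Keeping v at unit length makes α easier to interpret later, when the cone matrix M is built from v and α.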

Transformation Pathway

There are 2 parts to this network :

T1 network :

  1. Input : Source Frame (xs) and Target Frame (xt)
  2. Architecture : T1 is applied separately to both source and target
    frames. It includes 5-layer CNN network from AlexNet (the weights are
  3. Output : A CNN feature map of depth 256 for each of the source and target frames.

T2 network :

  1. Input : Concatenated CNN feature of both source and target frames
    (depth of 512).
  2. Architecture : one CNN layer with one 1 × 1 filter,
    followed by 3 fully-connected layers of dimensions 200, 100 and 7 respectively.
  3. Output : 7-dimensional vector (the first 3 are translation parameters along each coordinate axis, the next 3 are rotation angles about each coordinate axis,
    and the 7th is γ(xs, xt), which is used to set G = 0 if no transformation is found).
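To see what the first six numbers do, the sketch below builds a rotation and a translation from them and applies the result to the corners of the square Z. The rotation order (about x, then y, then z) and the helper names are assumptions; the article does not say which convention the authors use. The 7th value γ is unpacked but not used here, since it only gates G.

```python
import math


def rotation_matrix(rx, ry, rz):
    """Rotation about x, then y, then z (this order is an assumption)."""
    cx, sx = math.cos(rx), math.sin(rx)
    cy, sy = math.cos(ry), math.sin(ry)
    cz, sz = math.cos(rz), math.sin(rz)
    # R = Rz * Ry * Rx written out explicitly.
    return [
        [cz * cy, cz * sy * sx - sz * cx, cz * sy * cx + sz * sx],
        [sz * cy, sz * sy * sx + cz * cx, sz * sy * cx - cz * sx],
        [-sy, cy * sx, cy * cx],
    ]


def transform_point(params, point):
    """Map a point of the target-frame square into the source coordinate system."""
    tx, ty, tz, rx, ry, rz, _gamma = params
    R = rotation_matrix(rx, ry, rz)
    t = (tx, ty, tz)
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))


corners = [(-1, -1, 0), (1, -1, 0), (1, 1, 0), (-1, 1, 0)]

# Zero translation and zero rotation leave the square where it is.
identity = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
unchanged = [transform_point(identity, c) for c in corners]

# Quarter turn about the z axis plus a shift of 2 along z.
params = [0.0, 0.0, 2.0, 0.0, 0.0, math.pi / 2, 1.0]
moved = transform_point(params, (1, 1, 0))
print(tuple(round(c, 6) for c in moved))  # (-1.0, 1.0, 2.0)
```

With these parameters the target frame becomes a rotated copy of the square floating two units in front of the source frame, which is the plane the gaze cone is later intersected with.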

Cone-Plane Intersection

This is basically mathematical computation (not any kind of neural network).

  1. Input : Head pose vector (v computed by the Cone Pathway), eye
    location (ue), aperture angle of the cone (α from the Cone Pathway), and
    the 3 translation parameters and 3 rotation angles from the Transformation
    Pathway.
  2. Computation : The equation

    xᵀ M x = 0

represents a cone (with x measured from the cone’s origin at ue), where M = transpose(v)*v − αI [2]. The translation and rotation parameters from the Transformation Pathway and the matrix M are used to compute the Σ matrix, which is then used in the cone-plane intersection equation (i.e., a conic section) specified in Equation (3).

3. Output : 169-dimensional feature vector (can be thought of as an
encoded cone-projection mask of size 13 × 13).
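One way to obtain Σ from these inputs is sketched below. It is a derivation under stated assumptions, not necessarily the authors' exact formulation. A point of the transformed plane is β1·r1 + β2·r2 + t, where r1 and r2 are the first two columns of the rotation matrix R and t is the translation. Measured from the cone origin, that point equals A β with A = [r1, r2, t − ue] and β = (β1, β2, 1). Substituting into the cone equation with matrix M gives Σ = Aᵀ M A. For a unit vector v, the matrix M = v vᵀ − αI makes α play the role of the squared cosine of the cone’s half-aperture. The eye location is embedded in 3D with z = 0, which is also an assumption.

```python
def cone_matrix(v, alpha):
    """M = v v^T - alpha * I for a unit axis v."""
    return [[v[i] * v[j] - (alpha if i == j else 0.0) for j in range(3)]
            for i in range(3)]


def conic_matrix(v, alpha, R, t, eye):
    """Sigma = A^T M A, where A maps (beta_1, beta_2, 1) to a 3D point relative to the eye."""
    # Columns of A: first two columns of R, then the translation minus the cone origin.
    A = [[R[i][0], R[i][1], t[i] - eye[i]] for i in range(3)]
    M = cone_matrix(v, alpha)
    MA = [[sum(M[i][k] * A[k][j] for k in range(3)) for j in range(3)]
          for i in range(3)]
    return [[sum(A[k][i] * MA[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Person at the origin looking straight along +z with a 45-degree half-aperture,
# so alpha = cos^2(45 degrees) = 0.5. The target plane is one unit in front, not rotated.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
Sigma = conic_matrix(v=(0.0, 0.0, 1.0), alpha=0.5,
                     R=identity, t=(0.0, 0.0, 1.0), eye=(0.0, 0.0, 0.0))
for row in Sigma:
    print(row)
```

In this configuration Σ is diagonal with entries −0.5, −0.5 and 0.5, so βᵀ Σ β = 0 reduces to β1² + β2² = 1: a 45-degree cone cuts a plane at distance 1 in a circle of radius 1, as expected.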

Frame Selection

How do we get the probability that the gazed object is in the target frame xt?
1. We estimate the probability of the person looking inside a frame xt.
2. This probability is computed by a multilayer perceptron
E(S(xt), G(xs, xh, ue, xt)) with one hidden layer of 200 dimensions and
an output layer of 1 dimension (i.e., the probability that the gazed object is in
frame xt).
3. The input to this network is the concatenation of the 169-dimensional
saliency map S and the 169-dimensional gaze (cone) mask G.
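The shape of this frame selection network can be sketched as a tiny forward pass in plain Python. The weights are random stand-ins for trained parameters, the ReLU in the hidden layer and the sigmoid on the output are assumptions, and `frame_probability` is a hypothetical name.

```python
import math
import random

random.seed(0)
IN_DIM, HIDDEN = 2 * 169, 200


def linear(weights, bias, x):
    """Fully-connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


# Randomly initialised weights stand in for trained parameters.
W1 = [[random.gauss(0, 0.05) for _ in range(IN_DIM)] for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
W2 = [[random.gauss(0, 0.05) for _ in range(HIDDEN)]]
b2 = [0.0]


def frame_probability(saliency_vec, cone_vec):
    """E(S, G): probability that the gazed object lies inside this frame."""
    x = list(saliency_vec) + list(cone_vec)            # 338-dimensional input
    hidden = [max(0.0, h) for h in linear(W1, b1, x)]  # ReLU is an assumption
    logit = linear(W2, b2, hidden)[0]
    return 1.0 / (1.0 + math.exp(-logit))


saliency_vec = [random.random() for _ in range(169)]
cone_vec = [random.random() for _ in range(169)]
p = frame_probability(saliency_vec, cone_vec)
print(0.0 < p < 1.0)  # True
```

Running this once per candidate frame gives one score per frame, and those scores are what the frame selection step compares.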

Gaze Prediction

It is basically a computation (not any kind of neural network).

1. Input : The 169-dimensional output vectors of the Saliency Pathway
and the Cone-Plane Intersection.

2. Computation :

a. Take the element-wise product of the two input vectors, which gives a 169-dimensional vector (a 13 × 13 map). This vector is fed to a fully-connected layer (fc + Softmax) whose output is an upscaled map of 400 values (a 20 × 20
map from the 13 × 13 one).
b. Select the top target frame according to the probability distribution that we got
from Frame Selection.
c. Resize the output map (20 × 20) corresponding to that
target frame to 200 × 200 by interpolation, e.g. cv2.resize(output_map, (200, 200)).
The location of the maximum value in the resized map is taken to be
the gaze point, which is scaled to the target frame size and plotted on the
target frame.

3. Output : Gaze Point Location (x, y)
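Steps b and c can be sketched as follows. Instead of interpolating the map with cv2.resize, this simplified version takes the strongest cell of the 20 × 20 map directly and scales the centre of that cell to the frame size, which gives a coarser estimate of the same location. The function names, the toy probabilities, and the frame size are hypothetical.

```python
def select_frame(frame_probs):
    """Index of the target frame with the highest selection probability."""
    return max(range(len(frame_probs)), key=lambda i: frame_probs[i])


def gaze_point(output_map, frame_width, frame_height):
    """Pixel location of the strongest cell of a square output map."""
    n = len(output_map)
    _, row, col = max((v, r, c) for r, line in enumerate(output_map)
                      for c, v in enumerate(line))
    # Use the centre of the winning cell, scaled to the frame size.
    x = (col + 0.5) * frame_width / n
    y = (row + 0.5) * frame_height / n
    return x, y


frame_probs = [0.05, 0.70, 0.10, 0.10, 0.05]
maps = [[[0.0] * 20 for _ in range(20)] for _ in frame_probs]
maps[1][5][10] = 1.0  # peak at row 5, column 10 of the map for frame 1

best = select_frame(frame_probs)
print(best, gaze_point(maps[best], frame_width=800, frame_height=400))
# 1 (420.0, 110.0)
```

Interpolating to a finer grid before taking the maximum, as described in step c, smooths the map and gives a less blocky location than the cell centre used here.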


  1. We constrain each pathway to learn different aspects of the problem
    by providing each pathway only a subset of the inputs.
  2. Loss : Sum of the losses over all possible grid sizes:

    L(p, q) = Σ_{w,h} E[w,h](p, q)

where p is the target map, q is our network’s predicted output map, and
E[w,h](p, q) is a spatially smooth shifted-grids cross-entropy loss with grid cell size w × h.

One way to read the above equation: p(x, y) is the ground-truth target map, with value 1 in the gaze location area, and q(x, y) is the corresponding predicted map. You can therefore view this loss as a special kind of classification cross-entropy loss, except that it is calculated using the shifted-grids concept.


Visualisation results of each network

In the above figure, we can see the outputs of the “Cone-Plane Intersection” component and the “Saliency Pathway”, and the final output of the “Gaze Prediction” component.

From the above figure, you can observe that the cone projection output is the person’s field-of-view cone projected onto the target image after it has been transformed into the source image’s coordinate system. The saliency pathway output highlights the salient or important locations in the target frame that the person might be looking at.

Example : Final output from “Dead Poets Society”

In the above figure, you can see the probability distribution graph of frame selection. On the right half is the final predicted gaze location (x, y) in the target frame.

Future articles

I know I haven’t explained the loss function in detail here because the article is getting too long, so I will discuss this kind of loss (“shifted-grids cross-entropy loss”) in detail in my next article. I will also explain other topics in detail, such as “Transformation of one image coordinates to other image coordinate system in computer vision”.

You must be wondering how such a complex model can be trained end-to-end while each part of the network learns to perform a specific task. I will therefore talk about “Backpropagation in Complex Ensemble Models” in my following articles.

I would highly appreciate your questions, doubts and suggestions related to this topic.


  1. Following Gaze in Video. Proceedings of the IEEE Conference on
    Computer Vision and Pattern Recognition. Recasens, Adria and
    Vondrick, Carl and Khosla, Aditya and Torralba, Antonio, Pages
    1435–1443, 2017.
  2. http://people.csail.mit.edu/recasens/docs/videogaze.pdf
  3. https://github.com/recasens/Gaze-Following